iGen Language

The iGen Language Project investigates the language of young people aged 16 to 25 in order to learn more about their cultures, mindsets, and values.

 

Research Question

What does the language of young people born after 1994 tell us about their culture, view of the world, mindset, and values?

 

Known variously as Gen Z, Zoomers, iGen, or digital natives, this is the first generation to have never known a world without the internet. Based on the principle that language is the key to culture, we want to explore the mindset of today’s 16- to 25-year-olds through their language.

 

To this end, we created the iGen Corpus, a large digital repository of their language, taken from social media and interviews, which we are analyzing to find out more about their attitudes and values.

 

Some preliminary results have been published in the book Gen Z, Explained: The Art of Living in a Digital Age (University of Chicago Press).

 

iGen Corpus

The iGen Corpus is a 70-million-word digital repository of spoken and written language produced by people aged sixteen to twenty-five and drawn from a variety of natural contexts.

 

These contexts include transcripts of focus groups and interviews conducted by the research team at Stanford University, Foothill College, and Lancaster University; data from social media platforms representing different types of engagement online (e.g. social [Twitter], discussion [Reddit], gaming [Twitch], imageboards [4chan], and video [YouTube]); and memes, emoji, and copypastas from Facebook and Instagram. 

 

We applied machine learning algorithms where necessary to extract the language of people within our target age group (sixteen to twenty-five years old).

 

The iGen Corpus comprises the following:

Twitter: 16,001,261 tokens (1,069,598 distinct word types)

Reddit: 36,884,918 tokens (415,257 distinct word types)

Twitch: 10,075,603 tokens (527,798 distinct word types)

4chan: 6,110,549 tokens (292,922 distinct word types)

YouTube: 24,382 tokens (3,022 distinct word types)

Interviews: 891,882 tokens (18,013 distinct word types)

Memes: 11,701 tokens (3,126 distinct word types)
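
As a quick check on the figures above, the short Python sketch below adds up the per-source token counts and reports each source's share of the corpus. The counts are copied from the list; the function names and the whitespace tokenizer are illustrative assumptions, not the project's own tooling. Type counts are not added together, because the same word type can occur in more than one source.

```python
# (tokens, distinct word types) per source, copied from the list above.
COUNTS = {
    "Twitter": (16_001_261, 1_069_598),
    "Reddit": (36_884_918, 415_257),
    "Twitch": (10_075_603, 527_798),
    "4chan": (6_110_549, 292_922),
    "YouTube": (24_382, 3_022),
    "Interviews": (891_882, 18_013),
    "Memes": (11_701, 3_126),
}


def total_tokens(counts):
    """Sum the token counts across all sources."""
    return sum(tokens for tokens, _ in counts.values())


def token_share(counts):
    """Return each source's fraction of the total token count."""
    total = total_tokens(counts)
    return {name: tokens / total for name, (tokens, _) in counts.items()}


def count_tokens_and_types(text):
    """Count tokens and distinct types in a string.

    Assumption: a naive lowercase whitespace tokenizer, used only to
    illustrate the token/type distinction. The corpus itself was
    presumably tokenized differently.
    """
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))


if __name__ == "__main__":
    print(f"Total tokens: {total_tokens(COUNTS):,}")
    for name, share in token_share(COUNTS).items():
        print(f"{name}: {share:.1%}")
```

The total comes to 70,000,296 tokens, which matches the 70-million-word figure given for the corpus; Reddit is the largest single source, at a little over half of all tokens.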

 

Upon completion of our research, we will release the iGen Corpus for public access to allow others to use it for their own research or to replicate the study in languages other than English, thereby creating a large global network of Gen Z corpora.

 

Background

This project grew out of a Stanford class on words taught by Sarah Ogilvie and is part of a larger interdisciplinary study, Understanding the iGeneration, by Roberta Katz (Stanford), Sarah Ogilvie (Oxford), Jane Shaw (Oxford), and Linda Woodhead (Lancaster), hosted by Stanford's Center for Advanced Study in the Behavioral Sciences and funded by the Knight Foundation.

 

Corpus Research Team

Sarah Ogilvie (Oxford) Principal Investigator

Robert Fromont (Canterbury, NZ) Corpus Consultant

Martin Wynne (Oxford) Corpus Consultant

Natalie Cho Tsang (Oxford) Student Research Assistant

Max Farr (Stanford) Student Research Assistant

Jacob Kupperman (Stanford) Student Research Assistant

Angela Lee (Stanford) Student Research Assistant

Amelia Leland (Stanford) Student Research Assistant

Anna-Marie Springer (Stanford) Student Research Assistant

 

Many people contributed to the creation and archiving of the iGen Corpus, and we gratefully acknowledge the advice and support of the following colleagues:

Arto Anttila (Stanford)

Alex Chekholko (Stanford)

Jon Edwards (Oxford)

Will Hamilton (McGill)

Danny Hernandez (Twitch)

Dan Jurafsky (Stanford)

David Jurgens (Michigan)

Margaret Levi (Stanford)

Chris Manning (Stanford)

Ruth Marinshaw (Stanford)

Byron Reeves (Stanford)

Simon Todd (UC Santa Barbara)

Emily Winter (Lancaster)

Daniel Yee (Oxford)