The iGen Language Project investigates the language of young people aged 16 to 25 in order to learn more about their cultures, mindsets, and values.
What does the language of young people born after 1994 tell us about their culture, view of the world, mindset, and values?
Known variously as Gen Z, Zoomers, iGen, or digital natives, this is the first generation to have never known a world without the internet. Based on the principle that language is the key to culture, we want to explore the mindset of today’s 16-to-25-year-olds through their language.
To this end, we created the iGen Corpus, a large digital repository of their language, taken from social media and interviews, which we are analyzing to find out more about their attitudes and values.
Some preliminary results will be published in the book Gen Z, Explained: The Art of Living in a Digital Age, due from University of Chicago Press on 5 November 2021.
The iGen Corpus is a 70-million-word digital repository of spoken and written language produced by people aged sixteen to twenty-five and taken from a variety of natural contexts.
These contexts include transcripts of focus groups and interviews conducted by the research team at Stanford University, Foothill College, and Lancaster University; data from social media platforms representing different types of engagement online (e.g. social [Twitter], discussion [Reddit], gaming [Twitch], imageboards [4chan], and video [YouTube]); and memes, emoji, and copypastas from Facebook and Instagram.
We applied machine learning algorithms where necessary to extract the language of people within our target age group (sixteen to twenty-five years old).
The iGen Corpus consists of the following sources:
Twitter: 16,001,261 tokens (1,069,598 distinct word types)
Reddit: 36,884,918 tokens (415,257 distinct word types)
Twitch: 10,075,603 tokens (527,798 distinct word types)
4chan: 6,110,549 tokens (292,922 distinct word types)
YouTube: 24,382 tokens (3,022 distinct word types)
Interviews: 891,882 tokens (18,013 distinct word types)
Memes: 11,701 tokens (3,126 distinct word types)
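As a quick check on the figures above, the following minimal Python sketch sums the token counts and reports each source's share of the corpus. It also computes a type-token ratio (distinct word types divided by tokens) for each source. That ratio is a standard descriptive measure in corpus linguistics, but it is included here only as an illustration and is not a statistic reported by the project. The names CORPUS, total_tokens, type_token_ratio, and summarize are hypothetical and chosen for this example only.

```python
# Token and distinct-type counts copied from the list above.
# Each entry is (tokens, distinct word types).
CORPUS = {
    "Twitter": (16_001_261, 1_069_598),
    "Reddit": (36_884_918, 415_257),
    "Twitch": (10_075_603, 527_798),
    "4chan": (6_110_549, 292_922),
    "YouTube": (24_382, 3_022),
    "Interviews": (891_882, 18_013),
    "Memes": (11_701, 3_126),
}


def total_tokens(corpus):
    """Sum token counts across all sources."""
    return sum(tokens for tokens, _types in corpus.values())


def type_token_ratio(tokens, types):
    """Distinct word types per token (illustrative measure only)."""
    return types / tokens


def summarize(corpus):
    """Return (source, share of all tokens, type-token ratio) per source."""
    total = total_tokens(corpus)
    return [
        (source, tokens / total, type_token_ratio(tokens, types))
        for source, (tokens, types) in corpus.items()
    ]


if __name__ == "__main__":
    print(f"Total tokens: {total_tokens(CORPUS):,}")
    for source, share, ttr in summarize(CORPUS):
        print(f"{source:<10} {share:7.2%} of tokens, TTR {ttr:.4f}")
```

The token counts sum to 70,000,296, which matches the 70-million-word figure given for the corpus. The type counts are deliberately not summed, because the same word can occur in several sources. Type-token ratios also fall as a sample grows, so they should not be compared directly across sources of such different sizes.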
Upon completion of our research, we will release the iGen Corpus for public access to allow others to use it for their own research or to replicate the study in languages other than English, thereby creating a large global network of corpora.
Corpus Research Team
Sarah Ogilvie (Oxford) Principal Investigator
Robert Fromont (Canterbury, NZ) Corpus Consultant
Max Farr (Stanford) Student Research Assistant
Jacob Kupperman (Stanford) Student Research Assistant
Angela Lee (Stanford) Student Research Assistant
Amelia Leland (Stanford) Student Research Assistant
Anna-Marie Springer (Stanford) Student Research Assistant
Many people contributed to the creation and archiving of the iGen Corpus, and we gratefully acknowledge the advice and support of the following colleagues:
Arto Anttila (Stanford)
Alex Chekholko (Stanford)
Jon Edwards (Oxford)
Will Hamilton (McGill)
Danny Hernandez (Twitch)
Dan Jurafsky (Stanford)
David Jurgens (Michigan)
Margaret Levi (Stanford)
Chris Manning (Stanford)
Ruth Marinshaw (Stanford)
Byron Reeves (Stanford)
Simon Todd (UC Santa Barbara)
Emily Winter (Lancaster)
Daniel Yee (Oxford)