by Kevin Breen
This article is the second in a biweekly series that will examine AI and its implications for publishers and authors.
Large language models (LLMs) are the manifestation of genAI most relevant to publishers and authors. After all, we are in the business of sharing the written word with readers, and LLMs exist to produce text with minimal to no human involvement. It makes sense, then, that we might care deeply about how large language models get built. Unfortunately, conversations about LLM construction can make general readers feel like they need an advanced computer science degree to understand the topic.
By analyzing how these models are built, we can better parse ethical concerns, environmental impacts, and the knock-on effects felt by authors, booksellers, and other professionals in our industry. But first, like any conscientious consumer, we need a firm understanding of this emerging (and sometimes purposefully opaque) technology.
Data is the raw ingredient
“Garbage in, garbage out” is a programmer’s reminder that an output’s quality is only as good as the input material it was built from. The same mantra applies to building LLMs: higher-quality text data, like professionally edited long-form narratives, is the best input for any large language model. The challenge for companies like OpenAI, Meta, and other genAI leaders is that their models require huge amounts of data. Often, low-quality input data is mixed with grade-A inputs to meet the quantitative needs of their evolving tools.
Consistently adding input data to a model has two benefits: it fine-tunes outputs and keeps the model’s knowledge current. Vast quantities of data help improve the reliability of an AI model’s responses. The more information a model has at its disposal, the less often it replies, “Sorry, I don’t know.” For example, OpenAI’s ChatGPT release notes routinely announce refinements to its keystone product, ranging from enhanced “problem-solving capabilities for STEM” to improvements that curb “overly agreeable responses.” Improvements like these are at least partially attributable to the model’s continued ingestion of data, which it uses to fine-tune its interpretations of users’ prompts.
Meanwhile, regularly adding more training data also ensures responses are contemporary and relevant to modern-day users. Imagine if LLMs were only trained on publicly available data from the dawn of the Internet until 2015. With a ten-year gap in data ingestion, the model’s outputs would not account for technological advances, societal and geopolitical changes, or revisions made to crucial fields like medicine, cybersecurity, and finance. What we consider to be “correct” gets constantly revised, which means large language models need to keep up.
In April 2024, The New York Times reported that, in 2021, OpenAI needed far more training data for ChatGPT to continue improving, so the company transcribed more than one million hours of YouTube videos. This was not considered a high-quality data source, but it was certainly an abundant one. Even so, The New York Times says research institutes like Epoch believe LLMs could ingest “all the high-quality data on the internet as soon as 2026.” Based on forecasts like these, Meta entertained ideas like acquiring Simon & Schuster to gain access to its titles, or even paying $10 per e-book for professionally published, copyright-protected input text.
For LLMs, the data-ingestion challenge is twofold. Models need more data to effectively communicate in nuanced, human-seeming ways. At the same time, the goalposts of human knowledge are always moving: scientific consensus changes, languages adapt, cultures shift. In other words, even if an LLM caught up to humankind’s ability to communicate, it would constantly require new, contemporary input data to keep pace with our world’s changes.
In the future, this shortage will force genAI companies to rely more and more on LLM-generated content as input, a process known as ingesting synthetic data. Output text from an LLM will be fed back into the model as training data, and the model will continue to learn from itself. Using AI-generated data to fuel further training and refinement leads to concerns about hallucinations, aberrant replies, and even consequences we can’t yet anticipate.
The recipe for any LLM: tokens, parameters, and responses
Now that we’ve looked at input data, let’s discuss what happens to dataset text once inside the model. How does dataset text become new or different output text?
Tokenization is the first step in breaking down input datasets. Within the context of genAI, tokens refer to groupings of characters, word fragments, or whole words. Once input text is tokenized, the model can make connections between tokens: how often they appear, in what sequences, whether they’re used together or not, and the situations in which they arise. Basically, tokens allow a model to move away from language and toward numerical relationships, while still arriving at similar results (generated text). It’s a bit like an acrostic puzzle: when filled in, numbered blank spaces help puzzlers complete a second answer. You input text, get numbers (tokens) back, then output new text thanks to the tokens. It may sound like the most complicated, data-intensive way to avoid learning a language, but it’s what works for computers.
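If you’re curious what that looks like in practice, here is a deliberately tiny sketch in Python. The vocabulary and the matching rule are invented purely for illustration; real tokenizers (byte-pair encoders, for instance) learn vocabularies of tens of thousands of tokens from data. The principle, though, is the same: text goes in, a list of numbers comes out.

```python
# Toy illustration of tokenization. The vocabulary below is made up;
# real LLM tokenizers learn far larger vocabularies from their training data.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, " ": 5, ".": 6}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest known piece of text at each position."""
    ids = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            raise ValueError(f"No token found for {text[i]!r}")
        ids.append(vocab[match])
        i += len(match)
    return ids

print(tokenize("the cat sat on the mat."))
# [0, 5, 1, 5, 2, 5, 3, 5, 0, 5, 4, 6]
```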
Once equipped with tokens, the model’s parameters are applied. Parameters are the numerical settings, adjusted during training, that dictate how the model behaves: processing data, weighing it, and using it to produce outputs. In publishing, we use all kinds of parameters, like industry-wide or in-house style guides, to shape the books we put forth. The difference is that LLMs rely on billions of parameters. We know GPT-3 relies on 175 billion parameters, and some speculate GPT-4 runs on more than a trillion. A trillion parameters would make for one very fat edition of the Chicago Manual of Style!
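For readers who want to see what “parameters” means in the most literal sense, here is another toy sketch. Everything in it is invented for illustration: a real model layer is vastly larger, and its numbers are set by training rather than at random, but each of those numbers is one parameter.

```python
# A toy "layer": two small grids of weights. Every entry is one parameter.
# Sizes here are made up and tiny on purpose; real LLMs use thousands of
# dimensions and stack many such layers.
import random

embedding_size = 4
hidden_size = 8

weights_in = [[random.random() for _ in range(hidden_size)]
              for _ in range(embedding_size)]
weights_out = [[random.random() for _ in range(embedding_size)]
               for _ in range(hidden_size)]

total = sum(len(row) for row in weights_in) + sum(len(row) for row in weights_out)
print(f"This toy layer has {total} parameters.")  # 64
# GPT-3, by comparison, has roughly 175,000,000,000.
```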
By applying parameters to tokens, large language models can generate unique outputs, effectively responding to questions posed by users. Of course, “effective” is a relative term; currently, LLMs perform best when answering specific, constrained questions without being overly verbose. As models ingest more and more data, both human-authored and synthetic, tech companies anticipate their models’ responses will grow more sophisticated and useful.
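To tie the two ideas together, here is one last illustrative sketch of how a model turns tokens and parameters into a response: it repeatedly predicts a plausible next token and appends it to what came before. The probabilities below are invented stand-ins; in a real LLM they would be computed by the parameter-laden network described above.

```python
# Schematic next-token generation. The "probabilities" are hard-coded
# stand-ins for what a real model's parameters would compute.
import random

vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probabilities(context: list[str]) -> list[float]:
    if context and context[-1] == "the":
        # Pretend the model has learned that "the" is usually followed by a noun.
        return [0.0, 0.5, 0.0, 0.0, 0.5, 0.0]
    return [0.3, 0.1, 0.2, 0.2, 0.1, 0.1]

def generate(prompt: list[str], length: int) -> list[str]:
    tokens = list(prompt)
    for _ in range(length):
        probs = next_token_probabilities(tokens)
        tokens.append(random.choices(vocab, weights=probs, k=1)[0])
    return tokens

print(" ".join(generate(["the"], 5)))
```

Run it a few times and you will get slightly different strings of tokens, which is one reason the same prompt can yield different replies from a tool like ChatGPT.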
What’s next
And that, broadly speaking, is how the genAI sausage gets made! Huge quantities of data are scraped from the Internet and compiled into datasets, which are broken down into tokens, interpreted by parameter-constrained language models, and reassembled in new ways to produce novel outputs of text.
When it comes to taking a genAI stance, being a decisive publisher, editor, or author starts with being a well-informed consumer. Now that we better understand what’s gone into LLMs and what happens within these models, our industry can begin to tackle weightier questions. In this series, we’ll dive deeper into topics like the ethics of genAI, how others in our industry are using these tools, and how to identify their usage when working with authors, freelancers, and designers.
Kevin Breen lives in Olympia, Washington, where he works as an editor. He is the founder of Madrona Books, a small press committed to place-based narratives from the Pacific Northwest and beyond.