Data Nationalism and the AI Hunger Games: When Foundation Models Run Out of Fuel

“Data is the new oil? No, data is the new soil.” – David McCandless

When the Well Runs Dry: A Deep Dive into Foundation Model Training Data, Global Disparities, and the Coming Data Drought


For decades, the engines of artificial intelligence have been fueled by data — digital detritus, meticulously labeled corpora, curated knowledge bases, and the chaotic rawness of the internet. But as the frontier of AI advances into foundation models, the scale and quality of that data have become both a differentiator and a bottleneck. Quietly, a crisis has been forming: the world is running out of data to train the next generation of frontier models.

This isn’t just a technical problem — it’s a geopolitical, ethical, and economic one. The origins of this challenge trace back to the data sources these models rely on, how those sources are distributed across global boundaries, and how uneven access is shaping what these models “know” and what they miss.

A Brief History of Training Data

When GPT-2 debuted in 2019, it was trained on around 40GB of text. By GPT-3, that figure had ballooned to roughly 570GB. Now, OpenAI’s GPT-4 and Anthropic’s Claude models have likely been trained on corpora measured in the tens of terabytes — if not more. That data includes Common Crawl, Wikipedia, books, academic papers, forums like Reddit and Stack Overflow, code repositories like GitHub, and likely privately licensed datasets from publishers.

In the early days, the Western internet — particularly English-language content — dominated these corpora. This skew introduced a particular linguistic, cultural, and epistemological lens through which models learned to predict and generate language. Whether discussing history, ethics, or humor, models became reflections of their overwhelmingly Western training data.

Western Data: Rich, Open, and Exhaustible

The U.S. has long enjoyed relatively open access to large-scale public data. From Common Crawl’s open web snapshots to the arXiv preprint server, to the corpus of U.S. legal documents, and even leaked proprietary datasets (e.g., Books3), the West built its AI leadership on a foundation of data abundance and a lax approach to copyright.

But this model has limits. As lawsuits from The New York Times and authors like Sarah Silverman multiply, legal access to quality proprietary data is narrowing. Regulatory pressure from the EU’s AI Act, data localization rules in India, and API gatekeeping from Reddit and X/Twitter are slamming the doors shut. The internet may be vast, but high-quality, diverse, and copyright-safe training data is finite. And we’re hitting the ceiling.

The Other Internet: What the West Can’t Access

While U.S.-based labs dominate the conversation around large language models (LLMs), they are not the only ones building them. China, for instance, has access to an entirely different internet — one that is firewalled from much of the global web and filled with its own vast social media, e-commerce, governmental, and technical ecosystems. This data, while potentially restricted from Western access, is feeding domestic models like Baidu’s Ernie, Alibaba’s Tongyi Qianwen, and Tsinghua’s ChatGLM series.

Similarly, datasets in Arabic, African languages, and the languages of the Indian subcontinent — encompassing vast oral and written traditions — remain underrepresented in Western-trained models. This introduces major limitations. A foundation model trained primarily on Western-centric data might offer excellent grammar correction in English or legal drafting in U.S. jurisdictions but falter when asked to summarize Urdu poetry, translate African proverbs, or interpret the nuance of Mandarin idioms.

Thought Leaders Sound the Alarm

Researchers like Timnit Gebru and Emily Bender have long warned of the risks of linguistic and cultural monocultures in AI. Their seminal “Stochastic Parrots” paper highlighted how overreliance on Western data risks producing models that are syntactically impressive but semantically shallow or biased.

Meanwhile, experts like Jack Clark (Anthropic co-founder) and Abeba Birhane (Mozilla) have underscored how model performance is deeply tied to whose data is being used, and under what assumptions. Whose voice counts as “truth”? Who gets representation in the training set? These aren’t peripheral issues — they’re central to what the models become.

What Good Looks Like

Some efforts have tried to correct for this imbalance. BigScience’s BLOOM project produced a multilingual, open-science foundation model trained on data covering 46 languages. The team emphasized transparency, community involvement, and equitable access. Similarly, Hugging Face’s dataset hub and its Datasets library give researchers the tools to curate training sets with broader linguistic and geographic scope.

RLAIF (Reinforcement Learning from AI Feedback), synthetic data generation, and fine-tuning on user-specific content offer potential avenues for customization and diversity — but these approaches still depend on a strong foundational base.
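To make the synthetic-data-plus-fine-tuning pattern concrete, here is a minimal sketch assuming the Hugging Face transformers and datasets libraries. The gpt2 and distilgpt2 checkpoints are stand-ins for whatever teacher and student models you would actually use, and the seed prompts are placeholders.

```python
# Sketch: generate synthetic text with a "teacher" model, then fine-tune a
# "student" on it. gpt2 / distilgpt2 and the prompts are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, pipeline)
from datasets import Dataset

# 1. Produce synthetic examples from a handful of seed prompts.
generator = pipeline("text-generation", model="gpt2")
seed_prompts = ["Summarize this proverb:", "Explain this idiom in plain English:"]
synthetic = []
for prompt in seed_prompts:
    for out in generator(prompt, max_new_tokens=64, num_return_sequences=2,
                         do_sample=True):
        synthetic.append(out["generated_text"])

# 2. Wrap the synthetic text as a causal-LM fine-tuning dataset.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length",
                    max_length=128)
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]  # labels = inputs
    return enc

train_ds = Dataset.from_dict({"text": synthetic}).map(
    tokenize, batched=True, remove_columns=["text"])

# 3. Fine-tune the student on model-generated data only.
student = AutoModelForCausalLM.from_pretrained("distilgpt2")
trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="student-on-synthetic",
                           num_train_epochs=1,
                           per_device_train_batch_size=2,
                           report_to=[]),
    train_dataset=train_ds,
)
trainer.train()
```

The fragility is easy to see in this sketch: every “new” example the student sees was produced by the teacher, so the teacher’s blind spots are inherited wholesale.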

What Happens When the Data Runs Out?

As the supply of quality, unencumbered training data shrinks, several things happen:

  1. Diminishing Returns: New models trained on similar or even duplicate data may plateau in performance. You can only re-train on Wikipedia so many times, and duplicate documents already have to be filtered out of serious training pipelines (see the deduplication sketch after this list).
  2. Rise of Synthetic Data: Model-generated content (aka synthetic data) will increasingly supplement real-world corpora. But this poses a risk of compounding errors — models learning from models can become echo chambers.
  3. Closed Loops & Homogenization: With fewer novel data sources, foundation models may converge toward similar outputs, reducing diversity and innovation.
  4. Data Protectionism: Nations may treat data as a sovereign resource. China already restricts data exports. India and Brazil are not far behind. This will deepen the AI divide.
  5. Licensing Wars: Companies like OpenAI and Google will increasingly compete not just on algorithms but on who owns what data. Think: BloombergGPT, trained on Bloomberg’s proprietary financial data and never released to the public.
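The deduplication mentioned in the first point can start as simply as the sketch below, which uses only the Python standard library. The thresholds and example documents are illustrative; production pipelines use scalable techniques such as MinHash or suffix arrays rather than this pairwise comparison.

```python
# Sketch: drop exact and near-duplicate documents before training, so a
# shrinking pool of "new" text is not counted twice. Thresholds are illustrative.
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def shingles(text: str, n: int = 5) -> set:
    """Character n-grams used for a cheap near-duplicate check."""
    t = normalize(text)
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

def dedupe(docs, near_threshold: float = 0.9):
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h in seen_hashes:                          # exact duplicate
            continue
        if any(jaccard(shingles(doc), s) >= near_threshold
               for s in kept_shingles):               # near duplicate
            continue
        seen_hashes.add(h)
        kept.append(doc)
        kept_shingles.append(shingles(doc))
    return kept

corpus = [
    "Wikipedia says the Nile is the longest river.",
    "Wikipedia says the Nile is   the longest river.",  # whitespace variant
    "A genuinely new document about Urdu poetry.",
]
print(dedupe(corpus))  # keeps only two documents
```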

The Coming Data Industrial Complex

Expect data licensing to become a massive industry. Universities, governments, media companies, and even individuals may sell “high-quality” data to foundation model builders. Efforts like the Data Provenance Initiative and emerging startups will focus on tracking, verifying, and compensating data usage.

There may also be a push toward federated learning — training models on-device (like on your phone or laptop) without sending data to centralized servers. This could help preserve privacy and increase diversity, but it also fragments the training process.
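Here is a minimal sketch of the federated averaging idea (FedAvg), using NumPy and a linear least-squares model as a stand-in for a real network; the client data is invented purely for illustration.

```python
# Sketch of federated averaging (FedAvg): each device trains on its own data
# and only weights are averaged centrally. A linear model stands in for a
# real network; the client data here is made up for illustration.
import numpy as np

def local_step(weights, X, y, lr=0.1, epochs=5):
    """One client's local gradient descent on a least-squares objective."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(weights, clients):
    """Average locally trained weights, weighted by each client's data size."""
    sizes = np.array([len(y) for _, y in clients])
    local_weights = [local_step(weights, X, y) for X, y in clients]
    return np.average(local_weights, axis=0, weights=sizes)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):  # three devices, each holding private data that never leaves it
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
print(w)  # approaches [2.0, -1.0] without pooling any raw data
```

Only the weight vectors cross the network; the raw rows stay on each client, which is exactly the property that makes the approach attractive under data-localization rules, and exactly what fragments centralized data collection.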

The Takeaway for Engineering and Product Leaders

If you’re building on top of foundation models — either fine-tuning them or integrating them into workflows — you must understand their blind spots:

  • Don’t assume model knowledge is global. Check its training corpus.
  • When building for non-Western users or domains, validate rigorously (a per-language evaluation sketch follows this list).
  • Push for transparency from vendors about what the model was trained on.
  • Consider investing in custom data collection pipelines to close gaps.
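That validation can start small. Below is a minimal per-language evaluation harness; call_model() is a hypothetical placeholder for whatever vendor SDK or fine-tuned model you actually call, and the test cases are illustrative.

```python
# Sketch: score a model per language before shipping, rather than assuming
# global coverage. call_model() is a hypothetical placeholder; the test
# cases are illustrative and should come from your own domain.
from collections import defaultdict

def call_model(prompt: str) -> str:
    """Hypothetical stand-in; replace with your vendor SDK or local model."""
    raise NotImplementedError

test_cases = [
    # (language, prompt, expected substring in the answer)
    ("en", "What is the capital of France?", "Paris"),
    ("ur", "پاکستان کا دارالحکومت کیا ہے؟", "اسلام آباد"),
    ("zh", "中国的首都是哪里？", "北京"),
]

def evaluate(cases):
    scores = defaultdict(lambda: [0, 0])  # language -> [correct, total]
    for lang, prompt, expected in cases:
        try:
            ok = expected in call_model(prompt)
        except NotImplementedError:
            ok = False  # no model wired up yet
        scores[lang][0] += int(ok)
        scores[lang][1] += 1
    return {lang: correct / total for lang, (correct, total) in scores.items()}

if __name__ == "__main__":
    print(evaluate(test_cases))  # flag any language that falls below your bar
```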

Wrapping up…

Foundation models are only as good as the foundations they’re built on. As the world wakes up to the value — and scarcity — of high-quality training data, access will become a competitive moat, a diplomatic issue, and an ethical battleground. The data that taught today’s AI models how to think won’t be enough to teach tomorrow’s.

The well is running dry. The question is: who will own the next spring?