Sweatshop data is over

Tamay Besiroglu, Matthew Barnett, Ege Erdil
July 10, 2025

High-quality data is the fuel that drives AI progress, but our approach to AI data needs rethinking.

In the past, it was usually sufficient to hire third-party contractors to create datasets for basic text, visual, and audio tasks. This typically involved monotonous, narrowly-scoped labeling and generation tasks performed en masse by low-skill workers, often paid just a few dollars per hour. This “sweatshop data” enabled friendly chatbots, art generators, speech-to-text software, and so on. Back then, sweatshop data was enough because early AIs were simple, and teaching them the basics quickly turned them into useful assistants.

These days, the situation is different. Existing models have mastered the basics but now struggle with sophisticated long-horizon tasks such as managing large-scale software projects, autonomously debugging intricate systems, and solving novel problems. Teaching AIs these new capabilities will require the dedicated efforts of high-skill specialists working full-time, not low-skill contractors working at scale, or even high-skill contractors working sporadically without sustained context.

For example, to train an AI to fully assume the role of an infrastructure engineer, we need RL environments that comprehensively test what’s required to build and maintain robust systems. This involves far more than configuring setups that simply function in controlled conditions. AIs must learn to build and maintain systems that are highly available, fault-tolerant, and easily scalable, preventing single points of failure or configuration drift. They must uphold good security practices, ensuring infrastructure remains resilient against threats, and anticipate potential performance bottlenecks in complex distributed environments.

Current AI coding tools, which were rewarded mainly for producing code that satisfies simple test cases, routinely fall short of these standards, creating headaches and frustration for anyone who tries to use them to build or maintain complex software.

To move beyond current AI capabilities, we think three things will need to change:

Software, not datasets. Today’s AI systems are best instructed by repeatedly having them interact with digital environments, attempt tasks, and learn from outcomes. Unlike static datasets, interactive environments offer ongoing, interesting challenges that remain valuable as models improve, much like how games continue to engage players across a wide range of skill levels. Therefore, our focus should shift toward developing sophisticated software environments explicitly designed to enable this iterative, autonomous learning process, rather than merely curating datasets.
Full-time contributors, not contractors. Instead of monotonous, narrowly-scoped tasks performed by contractors, future progress will depend on full-time specialists who can design environments that teach AIs how to perform jobs end-to-end, including strategic thinking, and long-horizon problem-solving capabilities. Building tasks within these environments will require careful attention applied continuously over months, reflecting the protracted nature of the work being tested.
Deep expertise. To move beyond current AI capabilities, subject-matter experts will play an essential role. Their tacit knowledge, skills, and experience are now the bottleneck to further AI progress. We need a way to put what they know into the AI systems that we are building. This is no easy task. It will require reframing how we think about the data generation process: from a low-status activity outsourced to workers in poor countries, to an elaborate process requiring the world’s finest talent and clever engineering.

Good RL environments are the bottleneck

Historically, the importance of data has been underrated in the field of AI. Decades ago, many assumed the key to AGI would come from devising the right “theory of intelligence”, which we could then implement by hand; the role of training data was sidelined.

Despite being trained on more compute than GPT-3, AlphaGo Zero could only play Go, while GPT-3 could write essays, code, translate languages, and assist with countless other tasks. The main difference was training data. AlphaGo Zero learned from Go games, whereas GPT-3 learned from natural language. This meant that while Google was playing games, OpenAI was able to seize the opportunity of a lifetime. What you train on matters.

We may soon witness a similar lesson if AI labs continue to scale up their models without similarly scaling up the quality of their training environments. Many have observed that pretraining is already saturating. GPT-4.5, while impressive in its own right, didn’t feel like a major generational leap in the way GPT-4 did over GPT-3.5.

The recent reinforcement learning with verifiable rewards (RLVR) paradigm seeks to revive progress by getting AIs to learn how to perform formally checkable reasoning inside contained environments. What we’ve seen so far is necessary for progress, but it is far from sufficient. Current methods will get us to the point where AIs can prove theorems and solve hard puzzles, but it won’t be enough to get models to deal with the open-ended nature of reality, where the quality of our actions cannot be so easily “verified” as either correct or incorrect.

To make progress, there’s no way around designing better rewards, and ultimately better RL environments. A simple scoring script can’t determine whether an AI would make an effective lawyer: that requires evaluating its ability to construct cogent arguments, properly contextualize information, and ultimately prevail in court. Until AIs can learn through real-world trial and error like humans do, we must create custom environments that can faithfully simulate reality and accurately reward AIs for skillfully navigating the simulation.

Want to help build software that advances AI? We’re hiring software engineers.