Introducing GBA Eval

Stephen Yang, Ege Erdil, Tamay Besiroglu
May 2, 2026

See the benchmark, leaderboard, and emulator outputs at gbaeval.com.

We’ve argued for some time that current evals, benchmarks, and RL environments for LLM capabilities have significant quality problems, and that these problems pose a major barrier to future improvements in model capabilities. We’ve written about this previously in Cheap RL tasks will waste compute and The upcoming GPT-3 moment for RL. At Mechanize, our response has been to adopt a quality-first, quantity-second approach to evals and environments, particularly for software engineering, our current area of focus.1

For many reasons, we believe that automating software engineering will prove to be one of the most important, and economically valuable, applications of LLMs in the coming years. However, public metrics fail to accurately capture the strength of LLMs at coding, sometimes underselling frontier capability and sometimes overselling it.

Public benchmarks misrepresent frontier capability

First, public benchmarks often understate how capable frontier models actually are. Many users observe a major jump in model capability across recent generations, yet scores on common coding benchmarks fail to track those gains: APEX-SWE, for example, has shown flat or even regressing scores on some subsets despite broad user-reported improvements. User impressions are not hard evidence on their own, but several well-known failure modes in benchmark construction point the same way. For example, many test cases have poorly specified problem statements or environments, such that even an expert human software engineer would fail in the model’s place, which effectively caps achievable scores.

One good example is SWE-Bench Verified, where (until very recently) scores clustered around 80%. One reason is exactly this kind of flawed test case: OpenAI’s analysis of a subset of problems found that nearly 60% of them had flawed tests that would reject correct solutions. When we look at test cases in public evals, we often find that they don’t seem especially complex or difficult, yet models supposedly score very poorly on them. When we invest serious effort in fair grading for tasks of comparable complexity, we find that frontier models score quite well. This suggests something is wrong with how these benchmarks are built.

Another failure mode of test cases is the opposite, where mistakes in benchmark construction let models score higher than they would against a fair task: for example, weak test suites that accept incorrect solutions, intended solutions leaking into the prompt or repository, or problems that overlap with a model’s training data.

Beyond overestimating or underestimating capabilities, evals often also measure the wrong abilities. For example, eval prompts are often poorly specified, such that the correct course of action is ambiguous. In that case, the model has to infer the intent behind the task to figure out which approach will receive the highest reward, which is an important ability, but distinct from the ability to complete the task in the first place. Another example is when models lack sufficient affordances or permissions to accomplish the task via the happy path or intended solution, yet some especially capable models can “hack” their way around these obstacles to high reward.

A fair criticism of this position on benchmark quality is that, while the issues are easy to point out as an observer, building high-quality environments or evals is simply too expensive or difficult to be viable. We’re releasing GBA Eval as an example of what it looks like to invest time in careful grading for a single long-horizon SWE task.

What is GBA Eval?

We task models with writing, from scratch, a Game Boy Advance (GBA) emulator in Rust that compiles to WebAssembly. We grade accuracy using a combination of existing open-source test suites and test cases of gameplay patterns in real ROMs, evaluated using a custom harness.
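
To make the shape of the task concrete, here is a minimal sketch, under our own assumptions, of the kind of frame-stepped interface a candidate emulator might expose so that a harness can drive it deterministically. The trait and names below are illustrative rather than a required contract; the set_keys / run_frame / framebuffer shape mirrors the lockstep interface we expose on the reference side (see footnote 3).

```rust
// Illustrative sketch only: one plausible frame-stepped surface for a
// candidate emulator, not the actual GBA Eval contract. The method names
// echo the lockstep interface mentioned in footnote 3.

/// GBA button state packed as a bitmask (bit layout here is illustrative).
pub type KeyState = u16;

/// The GBA LCD is 240x160 pixels; we assume one packed RGBA value per pixel.
pub const FRAME_PIXELS: usize = 240 * 160;

pub trait FrameSteppedEmulator {
    /// Load a ROM image and reset the machine to a known power-on state.
    fn load_rom(&mut self, rom: &[u8]);

    /// Latch the controller state that will be held during the next frame.
    fn set_keys(&mut self, keys: KeyState);

    /// Advance emulation by exactly one video frame.
    fn run_frame(&mut self);

    /// Borrow the most recently completed frame for pixel-exact comparison.
    fn framebuffer(&self) -> &[u32; FRAME_PIXELS];
}
```

Because a full replay reduces to “set keys, run one frame, read the framebuffer” in a loop, an emulator that exposes something like this is straightforward to drive from a grading harness or a WebAssembly host.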

This kind of grading is tractable because the GBA console itself has no entropy source. There is no RTC, wall clock, or analog input on the console.2 This means that randomness in these ROMs derives entirely from the exact timing of inputs. This is a well-known fact about the GBA, and this “determinism” is sometimes exploited by speedrunners to manipulate RNG, for example, in tool-assisted speedruns.

Scope

GBA Eval is not intended as a complete benchmark or a source of truth for general AI coding capability. Rankings of models on GBA Eval are not necessarily representative of their general “software engineering capability.” If we sought to make a comprehensive public SWE benchmark, it would comprise many test cases across various capabilities needed for the entire software development life cycle, from writing good end-to-end tests to debugging code or working with large, existing codebases. One section would likely include tasking models with building complex software from scratch, and GBA Eval could function as one such test case. We are releasing GBA Eval to show a concrete example of what we believe a high-quality test case for a software engineering benchmark can look like.

How grading gameplay works

The grading for this task consists of several components, but the most interesting is replay, where we check whether actual gameplay runs correctly on the candidate emulator across various games. We leverage this “determinism” to pre-record input sequences for each game (e.g., clearing the first few levels). Then the candidate and reference emulators consume identical inputs, and their outputs are compared on every frame. As the reference, we use a lightly modified fork of Mesen2, an open-source cross-platform emulator widely regarded as one of the most accurate GBA emulators available.3 We also compare the audio output produced by the same input sequences. You can find more details about this in our posts on iterating on the grading strategy and how gameplay replay scoring works.
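
As a rough illustration of the replay check, here is a minimal lockstep sketch under our own assumptions. `InputEvent`, `replay_score`, and the per-frame match count are hypothetical names and scoring chosen for this example, not the actual harness, which drives the reference emulator through the C ABI described in footnote 3 and may weight mismatches differently.

```rust
/// One pre-recorded input change: hold `keys` starting at frame `frame`.
/// (Hypothetical structure for illustration; the real recording format differs.)
struct InputEvent {
    frame: u32,
    keys: u16,
}

/// Drive the candidate and reference emulators in lockstep over the same
/// recorded inputs and return the fraction of frames whose framebuffers
/// match exactly. Each `step_*` closure advances one emulator by a single
/// frame with the given key state and returns that frame's pixels.
fn replay_score<C, R>(
    script: &[InputEvent],
    total_frames: u32,
    mut step_candidate: C,
    mut step_reference: R,
) -> f64
where
    C: FnMut(u16) -> Vec<u32>,
    R: FnMut(u16) -> Vec<u32>,
{
    let mut keys = 0u16;
    let mut pending = script.iter().peekable();
    let mut matching = 0u64;

    for frame in 0..total_frames {
        // Because the console has no other entropy source, replaying the
        // same key state on the same frame reproduces the run exactly.
        while pending.peek().map_or(false, |e| e.frame == frame) {
            keys = pending.next().unwrap().keys;
        }
        if step_candidate(keys) == step_reference(keys) {
            matching += 1;
        }
    }
    matching as f64 / total_frames.max(1) as f64
}
```

The same loop extends naturally to audio, since the recorded inputs also pin down the audio stream frame by frame.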

Come work with us

GBA Eval is analogous to the work we do at Mechanize with top AI labs when we make environments for evaluating and training LLMs. Although the details of our commercial work are confidential, we hope GBA Eval provides some insight into what our day-to-day work at Mechanize looks like. If designing and building environments like this sounds like something you’d want to do, we’re hiring software engineers.


  1. Benchmarks and RL environments are extremely similar. For the purposes of this post, we treat the two as essentially interchangeable; strictly speaking, RL environment tasks can be thought of as a subset of benchmark tasks. ↩︎

  2. A few real GBA games do use cartridge-side hardware for additional entropy or compute (most famously the gyro and rumble in WarioWare: Twisted!, and the solar sensor in Boktai). None of those games are in our test corpus, so we don’t need candidates to emulate that hardware. ↩︎

  3. Our changes are limited to a thin C ABI exposing the lockstep emulator interface (set_keys / run_frame / framebuffer) and a deterministic input-replay harness; the emulation core itself is unmodified from upstream Mesen2. ↩︎
