What working at Mechanize is like

As an engineer at Mechanize, most of the work you’ll be doing will be directly connected to our core business of producing high-quality and realistic software engineering tasks for use in reinforcement learning or evaluations of model capabilities. You can think of a “task” as being the equivalent of a take-home assessment for a coding agent: it has a prompt telling the model what to do, enough time and space for the model to implement a complex solution, and a grader or rubric to assign a numerical score to how well the model performed. These tasks are bundled into environments, which are used by frontier AI labs either to train their models directly or to measure their capabilities.
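To make the "take-home assessment" analogy concrete, here is a minimal sketch of the shape a task might take. This is purely illustrative, not Mechanize's actual schema; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    prompt: str              # what the model is asked to do
    time_limit_minutes: int  # how long the agent gets to work
    max_score: float = 1.0

    def grade(self, submission: str) -> float:
        """Return a score in [0, max_score] for a submitted solution."""
        raise NotImplementedError  # each task supplies its own grader

@dataclass
class Environment:
    """A bundle of tasks, used for RL training or capability evals."""
    tasks: list[Task] = field(default_factory=list)
```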

Currently, each task is owned by a single engineer, who is responsible for all phases of the task lifecycle: coming up with the idea for the task, devising and implementing a method for grading it without human intervention, and doing quality assurance to ensure we understand what makes the task hard for current frontier models. We’ve found this division of labor to be most effective for the current complexity of tasks we produce, though we expect that as model capabilities continue to improve, the work will shift: tasks will grow complex enough to require teams of engineers, and we’ll increasingly automate parts of the task creation process itself using AI — something that’s already happening partially today.

Here, we’d like to give anyone who is considering working with us a better idea about what exactly goes into producing these tasks, as the work is fairly unusual compared to ordinary software engineering.

What a typical week looks like

If you’re used to traditional software engineering, the day-to-day at Mechanize will feel quite different. You won’t spend much time writing code directly — at this point, humans writing code is too slow. Instead, most of your time will be spent prompting coding agents to do various things for you: implementing features, writing tests, analyzing transcripts of other agents’ attempts at your tasks, and so on. The skill that matters most is your ability to direct these agents effectively and evaluate their output, not your typing speed.

Of the phases of the task lifecycle, quality assurance tends to take the most time. Getting a task idea and an initial grader implementation is usually the easy part; iterating on the grader until it’s robust, fair, and deterministic is where the bulk of the effort goes.

A typical task takes roughly a week from ideation to submission. You’d generally be working on one task at a time, though you might have a second task in an earlier phase while waiting for long-running agent transcripts to complete on your primary one.

Who thrives here

We have a decent way of measuring productivity through task output, so people who are confident they can produce a lot of high-quality work quickly tend to do well here. We only expect standard work hours.

Beyond raw output, having an intuition for what models can and can’t do matters a lot. This is something you can develop on the job, but people who have already spent significant time working with coding agents will have a head start.

Perhaps less obviously, reading comprehension and theory of mind are surprisingly important skills for this work. To design a good task, you need to be able to put yourself in the LLM’s shoes: to understand how it will interpret a prompt, what parts of a codebase it will notice or overlook, and where it will take shortcuts that a careful human engineer wouldn’t.

Team and environment

Mechanize is a team of around 20 people including engineers and other staff. We work mostly in-person, though we can consider remote arrangements for people who have no other way of joining or who are only working part-time.

Task production is mostly solo work — you own your task from start to finish. That said, you’ll get feedback on every task you submit during a review process, and we check in regularly: once a day for new hires, and once or twice a week for more established team members.

You have a lot of autonomy in choosing what to work on and how to approach it. The main constraint is that the tasks you produce need to capture a real lack of an important capability in current frontier models — beyond that, how you get there is largely up to you.

Our shared infrastructure is primarily written in Python, though the tasks themselves involve working across a wide range of different repositories and codebases. When you’re first getting started, you’ll have more supervision — frequent check-ins as you go through our documentation and work on your first task, with the goal of getting you producing real work as quickly as possible.

Creating new tasks

Ideation

The first step in creating a new task is coming up with a good idea. In general, almost anything that you struggle to get coding agents to do could make a good task, if implemented correctly. In practice, though, it’s often more efficient to identify specific capability gaps in existing models and construct the task to target them, rather than trying a task at random and hoping the model happens to struggle with something about it.

The ideal task has multiple different sources of difficulty, not just one. For example, it might require extensive communication with a simulated stakeholder to debug a subtle visual rendering issue; it might be an entirely open-ended task of the form “do QA testing of this feature we recently merged and identify any issues”; or it might require writing code that closely adheres to repository conventions and best practices such as DRY while solving a complex system design challenge.

We’ve found that the most common failure mode for new hires is underestimating what today’s coding agents can already do when given many context windows to fix an issue in a tight iteration loop, where they observe the results of their actions quickly. If you don’t have extensive experience with these agents, chances are your first few task ideas will be too easy. Take what you were initially planning to ask the agent to do, make it ten times harder, and you’ll probably have a good task idea on your hands.

Grading

Once you have a good task idea, the next step is to figure out how you’ll grade it automatically. Implementing a grader generally involves writing a suite of procedural tests (unit, integration, end-to-end) combined with a rubric that tells an LLM grader agent what to check in the submitted solution and how. Both methods have tradeoffs — procedural tests are reliable but force prompts to be overly prescriptive, while LLM grading is flexible but nondeterministic — so we mix them as appropriate.
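The mix of the two methods can be sketched as follows. This is a hypothetical illustration, not Mechanize’s real grading code: `call_llm_grader` is a stand-in for whatever model API a rubric check would use, and the 0.7/0.3 weights are arbitrary.

```python
import subprocess

RUBRIC = (
    "Score 0-1: does the change follow repository conventions "
    "(naming, DRY, error handling) beyond merely passing the tests?"
)

def run_procedural_tests(repo_dir: str) -> float:
    """Run the task's test suite; 1.0 if it passes, 0.0 otherwise."""
    result = subprocess.run(["pytest", "--tb=no", "-q"], cwd=repo_dir)
    return 1.0 if result.returncode == 0 else 0.0

def call_llm_grader(rubric: str, diff: str) -> float:
    """Hypothetical: send the rubric and diff to a model, parse a score."""
    raise NotImplementedError

def combine(procedural_score: float, rubric_score: float) -> float:
    """Procedural tests act as a hard gate; the rubric refines the score."""
    if procedural_score == 0.0:
        return 0.0  # code that fails the tests scores zero regardless
    return 0.7 * procedural_score + 0.3 * rubric_score
```

Gating on the deterministic tests before consulting the nondeterministic rubric is one way to get the flexibility of LLM grading without letting its noise dominate the score.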

Quality assurance

It’s very rare that the first pass at writing a grader is sufficiently high quality. Graders can be nondeterministic, unfair, incomplete in what they check, or simply wrong. Ironing out these issues takes many sequential iterations, which can be slower than you might expect due to how long-running frontier agent transcripts can get. However, being rigorous about this step is what distinguishes our tasks from public benchmarks scraped from open-source repositories.
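One of the simplest QA checks implied above can be sketched directly (a hypothetical helper, not Mechanize’s actual tooling): grade the same frozen submission several times and flag the grader if the scores disagree.

```python
def check_determinism(grade_fn, submission, runs: int = 5,
                      tolerance: float = 0.0) -> bool:
    """Return True if repeated grading stays within `tolerance`.

    grade_fn: callable mapping a submission to a numeric score.
    """
    scores = [grade_fn(submission) for _ in range(runs)]
    spread = max(scores) - min(scores)
    return spread <= tolerance
```

In practice each grading run may involve a long agent transcript, which is why even a small number of repeats makes this iteration loop slow.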

Understanding the failure modes of frontier models

We expect each task to come with a detailed breakdown of why frontier models fail at it, based on many sampled transcripts. The purpose of this is to ensure that failures reflect a real lack of capability in the model, not a lack of capability on the part of the task designer.

To give a sense of what this looks like in practice, here are some failure modes we like to see:

And some failure modes we don’t like to see:

Infrastructure work

While most of our work goes directly into creating tasks, our engineers also work on infrastructure that’s shared across tasks and environments. For example, two recurring infrastructure problems for us are optimizing our container build times and automating parts of our quality assurance process to free up human reviewer time. If you’re less drawn to the scoped, independent work of producing tasks and more interested in collaborative work, you’d probably be a better fit for this part of the team.

However, we still think it’s important for engineers working on infrastructure to have some experience producing tasks: without it, it’s difficult to understand which problems are most pressing and need to be prioritized, and which solutions would be most convenient for other people on the team.
