Princeton's CEO-Bench: a script beat almost every AI model

On a server at Princeton, thirteen of the most capable AI models in the world were each handed the same job: run a fictional software company for five hundred simulated days, starting with a million dollars in the bank and not a single customer. The benchmark is called CEO-Bench, and the only score that counts is how much cash is left at the end. Three models finished with more money than they started with. In fourth place, ahead of every model but those three, sat a few hundred lines of hand-written code that never calls an AI at all.

The test measures the one thing benchmarks usually skip

The company in the simulation is called NovaMind, a subscription software business that opens on day one with zero customers and one million dollars. From there the agent runs everything: it sets prices, picks which customer segments to chase, buys advertising, adds server capacity, and watches the runway. There is exactly one hard rule. If the cash balance ever drops below zero, even once, the company is bankrupt and the run stops on the spot.
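The hard rule is easiest to see as a loop. What follows is a minimal sketch under stated assumptions, not the benchmark's actual code: the starting cash and day count come from the description above, while `run_company` and `daily_net_cash` are hypothetical names standing in for everything the agent decides. How the real benchmark scores a bankrupt run is not specified here, so the sketch simply returns the balance at the moment the run stops.

```python
from typing import Callable, Tuple

STARTING_CASH = 1_000_000  # from the article: one million dollars on day one
TOTAL_DAYS = 500           # from the article: five hundred simulated days


def run_company(daily_net_cash: Callable[[int, float], float]) -> Tuple[float, int]:
    """Run a toy company and return (final cash, last day simulated).

    `daily_net_cash(day, cash)` is a hypothetical stand-in for all of the
    agent's decisions; it returns that day's revenue minus spending.
    """
    cash = float(STARTING_CASH)
    for day in range(1, TOTAL_DAYS + 1):
        cash += daily_net_cash(day, cash)
        if cash < 0:
            # The one hard rule: a negative balance, even once, ends the run.
            return cash, day
    return cash, TOTAL_DAYS


# Burning 5,000 dollars a day with no revenue hits exactly zero on day 200
# and goes negative, and therefore bankrupt, on day 201.
final_cash, last_day = run_company(lambda day, cash: -5_000.0)
```

The point of the sketch is that nothing in the loop flags a bad day. The burn looks identical on day 1 and day 200, and the rule only fires once it is too late to change course.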

Most AI benchmarks reward a single good answer. Solve the math problem, write the function, summarize the document, done. CEO-Bench rewards something the others rarely touch: holding a coherent plan across hundreds of small decisions, where no single move looks fatal and the bill only arrives months later when the money runs out. The missing skill has a name. Coverage of the study calls it steering intelligence: the ability to keep a course alive over a long stretch instead of nailing one clever move and losing the thread after it.

Only three models stayed in the black

On the live CEO-Bench leaderboard, three models cleared the bar. Claude Fable 5 posted a best run of 47.15 million dollars, Claude Opus 4.8 reached 27.8 million, and GPT-5.5 came in at 21.3 million. Every other model finished below its starting cash. Models like Gemini 3 Flash, DeepSeek V4 Pro, and Grok 4.20 ended their runs at essentially nothing, the company bankrupt long before day 500.

It is worth being precise about that number three, because the primary sources disagree by one. The frozen arXiv paper, submitted in mid-June, lists only two models above the starting line in its results table. Claude Fable 5, the third, is Anthropic's newest model, released June 9 and added to the live leaderboard after the paper locked. So the generous, up-to-the-minute count is three out of thirteen. The stricter, paper-bound count is two. Either way, most of the frontier failed.

The thing in fourth place has no intelligence in it

The script that placed fourth is a rule-based heuristic, and it never makes a single model call. It sets fixed prices and tiers, advertises to a handful of customer segments, nudges capacity toward recent demand, and otherwise does nothing clever for five hundred days straight. By the end it had reached 15.76 million dollars, beating ten of the thirteen frontier models.
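To make "rule-based" concrete, here is a toy policy in the same spirit. It is a sketch, not the benchmark's script: the class name, the price, the segment names, the ad budget, the seven-day window, and the 25 percent nudge are all invented for illustration. Only the shape follows the description above: fixed prices, a fixed set of segments, and capacity nudged toward recent demand.

```python
from collections import deque


class FixedRulePolicy:
    """Toy rule-based policy: the same rules every day, no model calls."""

    def __init__(self, price=49.0, segments=("startups", "agencies"),
                 ad_spend=200.0, window=7, step=0.25):
        # Every default here is a hypothetical value chosen for illustration.
        self.price = price
        self.segments = segments
        self.ad_spend = ad_spend
        self.step = step
        self.recent_demand = deque(maxlen=window)  # rolling demand history

    def decide(self, observed_demand: float, capacity: float) -> dict:
        """Return today's action given today's demand and current capacity."""
        self.recent_demand.append(observed_demand)
        average = sum(self.recent_demand) / len(self.recent_demand)
        # Nudge: close a fixed fraction of the gap to recent average demand.
        new_capacity = capacity + self.step * (average - capacity)
        return {
            "price": self.price,  # never changes
            "ads": {segment: self.ad_spend for segment in self.segments},
            "capacity": new_capacity,
        }


policy = FixedRulePolicy()
first = policy.decide(observed_demand=120.0, capacity=100.0)
second = policy.decide(observed_demand=80.0, capacity=first["capacity"])
```

With these made-up numbers, capacity moves from 100 to 105 after the first day and eases back to 103.75 after the second. The policy never overreacts to a single day, and the price and ad plan never move at all.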

The script is not smart, and nobody is claiming it is. It never reasons, never reconsiders, never has an insight. What it has is the one quality the models lacked: it never changes its mind. A fixed rule applied without deviation for five hundred days quietly outran a row of brilliant models that kept second-guessing themselves into the ground. Consistency beat intelligence, and it was not close.

Even the winners were nowhere near the ceiling

The hardest number in the study to sit with is not a bankruptcy. It is the ceiling. The researchers estimate that a skilled human operator running NovaMind could have reached around 2.2 billion dollars. On that scale, the best AI run on the board topped out at 47 million. The champion model, the single strongest performer out of thirteen, finished roughly 47 times short of what the test says was actually on the table.
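The size of that gap is plain arithmetic on figures already quoted in this piece. The short sketch below only divides them; the variable names are ours, and the 2.2 billion figure is the researchers' estimate, not a measured result.

```python
# All four figures are taken from the article text.
human_ceiling = 2_200_000_000   # estimated skilled-human result, "around 2.2 billion"
best_model_run = 47_150_000     # Claude Fable 5, best run on the live leaderboard
heuristic_run = 15_760_000      # the hand-written script in fourth place
starting_cash = 1_000_000       # every run starts here

# How many times larger the human estimate is than the best model run.
gap_factor = human_ceiling / best_model_run

# The best model run as a percentage of the estimated ceiling.
share_of_ceiling_pct = 100 * best_model_run / human_ceiling

# How far the winning model finished ahead of the script.
model_vs_script = best_model_run / heuristic_run

print(f"ceiling / best model: {gap_factor:.1f}x")
print(f"best model share of ceiling: {share_of_ceiling_pct:.1f}%")
print(f"best model / script: {model_vs_script:.1f}x")
```

The ceiling works out to about 46.7 times the best model run, which puts the champion at roughly 2.1 percent of the estimate. The same division puts the champion only about three times ahead of the script.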

So the gap on display here is not really model against model. It is model against competence. A good operator could have grown this company into a multibillion-dollar business. Even the winner of this contest left it at about two percent of that potential.

Why it matters

We have built an entire industry around measuring how smart these models are on the next single task, and we keep calling the scores superhuman. CEO-Bench points a camera at the other axis, the one almost nobody benchmarks, and the picture is unflattering. Reading a database, writing a valid command, drafting a plan: the models can do all of it. What they cannot yet do is hold that plan steady while five hundred decisions accumulate, forecast the cash far enough ahead to survive, and adapt when a competitor moves.

A company is not a clever answer. It is a thousand ordinary decisions in a row, each one cheap, with the cost of getting them slightly wrong hidden for months before it surfaces as an empty account. A model can win nearly every individual decision and still lose the company, because the failure lives in the drift between the decisions, not in any one of them. The hand-coded script did not drift, and that was enough to beat almost the entire frontier.

The technology is improving fast along the axis we measure and standing nearly still along the one we do not. So here is the uncomfortable question CEO-Bench leaves on the table. If the best model we have left a company at 47 million dollars that a competent human could have grown past two billion, what exactly are we measuring when we call it intelligent?

Originally published as an Instagram carousel on @recul.ai.

