OpenAI Operator: The $200 Browser Agent With a 38.1% Report Card

OpenAI sold Operator as the moment ChatGPT stopped talking and started doing. The launch copy said it could use its own browser, look at webpages, click, type, scroll, fill forms, order groceries, and create memes. It was one of OpenAI's first agents: give it a task and it would execute it. The small print was already screaming. Operator was a research preview, limited to U.S. Pro users, and OpenAI later folded it into ChatGPT agent and sunset the standalone site. 1

That is the entire agent hype cycle in one tab. The pitch is "independent digital worker." The product is a remote browser that sometimes needs you to stand over its shoulder like a nervous parent watching a toddler carry soup.

The roast writes itself because OpenAI published the punchline. The model behind Operator, Computer-Using Agent (CUA), scored 38.1% on OSWorld, a benchmark for full computer-use tasks. Humans scored 72.4%. On WebArena, it hit 58.1%. On WebVoyager, a simpler browser benchmark, it hit 87%. 2 Translation: it looks sharp in the browser demo, then eats pavement when the computer stops behaving like a demo.

The hype pitch: ChatGPT gets hands

OpenAI's story was clean. Operator could use the same graphical interfaces humans use: buttons, menus, text boxes, screenshots, mouse clicks, keyboard input. No custom API required. It would break a task into steps, self-correct when stuck, ask for help when needed, and hand control back for logins, payments, or CAPTCHAs. 1

MIT Technology Review saw the early demo: booking an OpenTable reservation, hunting StubHub tickets, and turning a handwritten grocery list into an Instacart order. The article also reported the business model reality: Operator launched to ChatGPT Pro users, OpenAI's $200-a-month tier. 3 So yes, for the price of a gym membership plus a phone bill, you too could watch a bot slowly click dropdowns in the cloud.

OpenAI framed this as the bridge from passive AI to active AI. The company said it was collaborating with DoorDash, Instacart, OpenTable, Priceline, StubHub, Thumbtack, Uber, and others. It also said Operator could help with public-sector workflows, including city-service enrollment in Stockton. 1 Fine. The idea is real. Browser automation without brittle scripts would be useful.

But useful is not the same thing as ready. "Can click buttons" is not a job description. It is the first five minutes of an office onboarding.

Figure: CUA control loop. OpenAI's CUA loop turns screenshots into clicks, keystrokes, and more screenshots. That is clever engineering, but it is still screen-parsing, not magic labor. 2
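The screenshot-in, action-out cycle described above can be sketched in a few lines. This is a toy illustration, not OpenAI's implementation: `FakeBrowser`, `Action`, `run_loop`, and `scripted_policy` are all hypothetical names, the "screenshot" is a string rather than pixels, and the policy is a hard-coded script standing in for the model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "click", "type", "ask_user", or "done" (assumed vocabulary)
    detail: str = ""


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a remote browser: it records actions instead of performing them."""
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real agent would receive pixels; here the "screenshot" is a step counter.
        return f"screen-after-{len(self.log)}-actions"

    def apply(self, action: Action) -> None:
        self.log.append(f"{action.kind}:{action.detail}")


def run_loop(browser: FakeBrowser, policy: Callable[[str], Action], max_steps: int = 10) -> str:
    """Look at the screen, pick one action, apply it, and repeat until the policy stops."""
    for _ in range(max_steps):
        shot = browser.screenshot()
        action = policy(shot)
        if action.kind in ("done", "ask_user"):
            # "ask_user" models handing control back for logins, payments, or CAPTCHAs.
            return action.kind
        browser.apply(action)
    return "step_limit"


def scripted_policy(shot: str) -> Action:
    """Hard-coded stand-in for the model: maps a known screen to the next action."""
    script = {
        "screen-after-0-actions": Action("click", "search box"),
        "screen-after-1-actions": Action("type", "bear habitats"),
    }
    return script.get(shot, Action("done"))


browser = FakeBrowser()
result = run_loop(browser, scripted_policy)
print(result, browser.log)
```

The loop itself is trivial. Everything hard lives inside the policy, which has to infer from a picture what the interface will do next, and that is where the benchmark numbers below come from.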

The report card OpenAI buried in plain sight

Here is the part the launch video does not want sitting in a big font:

Benchmark	What it tests	CUA result	Human / comparison result
OSWorld	Full operating-system tasks	38.1% 2	Humans: 72.4% 2
WebArena	Browser tasks in realistic web environments	58.1% 2	Humans: 78.2% 2
WebVoyager	Live-web browser tasks	87.0% 2	Previous SOTA: 87.0% 2

Those are not outsider hit jobs. Those numbers are from OpenAI's own CUA research post. 2 The company also admitted that WebVoyager contains relatively simple tasks, while CUA still needed improvement on more complex benchmarks like WebArena. 2
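For a sense of scale, the gap to human performance can be worked out from the table's own numbers. The short sketch below only does that arithmetic. The function names are illustrative, and WebVoyager is left out because the table lists a previous-SOTA figure for it rather than a human score.

```python
# Scores copied from the table above (percent of tasks completed).
SCORES = {
    "OSWorld": {"cua": 38.1, "human": 72.4},
    "WebArena": {"cua": 58.1, "human": 78.2},
}


def gap_points(benchmark: str) -> float:
    """Percentage-point gap between the human score and the CUA score."""
    s = SCORES[benchmark]
    return round(s["human"] - s["cua"], 1)


def share_of_human(benchmark: str) -> float:
    """CUA score expressed as a percentage of the human score."""
    s = SCORES[benchmark]
    return round(100 * s["cua"] / s["human"], 1)


for name in SCORES:
    print(f"{name}: {gap_points(name)} points behind humans, "
          f"{share_of_human(name)}% of the human score")
# OSWorld: 34.3 points behind humans, 52.6% of the human score
# WebArena: 20.1 points behind humans, 74.3% of the human score
```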

OpenAI's own Operator trial table makes the problem more concrete. Search Britannica for bear habitats? 10 out of 10. Add groceries to Todoist? 10 out of 10. Find a Seattle townhouse on Redfin with constraints? 3 out of 10. Use an unfamiliar HTML editor for precise text formatting? 4 out of 10. Use Tagvenue with one version of a prompt? 8 out of 10. Remove a few helpful UI hints? 3 out of 10. 2

That is not a digital employee. That is a demo that collapses when the website has opinions.

The independent tests were uglier

A Hugging Face community post by Zengyi Qin said an MIT team tested Operator on five computer-use tasks and "did not cherrypick." Operator failed all five. The failures were not exotic sci-fi edge cases. It entered the wrong brightness/contrast values, could not use online design tools, could not find a Lamar advanced trig question, could not find a specific problem in 3000 Solved Problems in Calculus, and could not use an online tool to design and analyze an RC low-pass filter. 4

The author's take was brutal and specific: Operator was good at visual grounding, but it did not fully understand interactive logic and appeared below a skilled user's level of computer use. 4 That distinction matters. Seeing the button is not understanding the workflow. A pigeon can see a touchscreen. That does not make it your operations team.

Figure: Operator failure screenshot. A Hugging Face failure-mode writeup showed Operator getting stuck on tool-use tasks, the exact category an agent is supposed to make boring. 4

Antoine's hands-on review landed in the same place from the user side. He waited 15 minutes for Operator to find classical concerts in Berlin, then tried again and waited 33 minutes for URLs. The second run returned three concerts that were not what he wanted. For product research, he called the results "abysmal" after 10 minutes, saying the output was not detailed, contained incorrect information, and some parts did not make sense. 5

His diagnosis was not mysterious. Operator struggled with forms and navigation, used a lowest-effort search approach, failed to visit each product website, and produced very low-quality output. 5 This is the problem with browser agents: the web is not a clean benchmark. It is cookie banners, weird filters, JS modals, hostile forms, stale search results, and login walls wearing a trench coat.

The user complaint: slow, expensive, and making stuff up

The best real-world complaint came from a ChatGPT Pro user who tested Operator on a business task: gather 50 popular financial influencers from YouTube, find LinkedIn info and emails, summarize each channel, and format the results in a table. The user said they had a $200/month ChatGPT Pro subscription and wanted to test Operator immediately after launch. 6

The first five minutes looked cool. Then Operator tried Google Sheets and Excel, hit sign-in walls, did not ask for help, found another spreadsheet tool, and began hallucinating. After 20 minutes, the user told it to give up. The final spreadsheet had 18 influencers, not 50, and the LinkedIn profiles and emails were "entirely made up." 6


That last part is the agent-shaped landmine. A chatbot hallucinating a fake email is annoying. A browser agent hallucinating contact data while operating a workflow is operational debt with a loading spinner. The user summed it up as "too slow, expensive, and error-prone," adding they could have done the task manually in 15 minutes with fewer mistakes. 6

A separate r/OpenAI thread raised the same anxiety around travel booking, with one commenter asking why anyone would leave flight choices that cost real money to a black box. 7 That is not Luddism. That is a normal person noticing that "agent" sounds cute until the agent can buy the wrong thing.

The catch: half-autonomy is the worst autonomy

Operator's safety design proves the core product tension. OpenAI trained it to ask for confirmation before significant actions, hand over control for sensitive information, refuse high-risk tasks, and require supervision on sensitive sites like email or financial services. 1 That is responsible. It is also an admission that the agent is not autonomous in the places autonomy would be most valuable.
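The four safeguards in that paragraph amount to a gate that runs before each step. The sketch below is one hypothetical way to write such a gate, not OpenAI's actual policy: the category sets, their string labels, and the order of the checks are all assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    PROCEED = "proceed"
    CONFIRM = "ask the user to confirm"
    HANDOFF = "hand control to the user"
    SUPERVISE = "continue only while the user watches"
    REFUSE = "refuse the task"


# Hypothetical category sets; the real classifications are not public in this form.
HIGH_RISK_TASKS = {"bank_transfer"}
SENSITIVE_INPUTS = {"login", "payment", "captcha"}
SENSITIVE_SITES = {"email", "financial_services"}
SIGNIFICANT_ACTIONS = {"place_order", "send_message"}


def gate(task: str, site: str, action: str, needs_input: Optional[str] = None) -> Decision:
    """Decide how much human involvement a single step requires (assumed check order)."""
    if task in HIGH_RISK_TASKS:
        return Decision.REFUSE
    if needs_input in SENSITIVE_INPUTS:
        return Decision.HANDOFF
    if action in SIGNIFICANT_ACTIONS:
        return Decision.CONFIRM
    if site in SENSITIVE_SITES:
        return Decision.SUPERVISE
    return Decision.PROCEED


print(gate("buy_groceries", "shopping", "place_order").value)  # ask the user to confirm
print(gate("look_up_recipe", "reference", "scroll").value)     # proceed
```

In this sketch the only branch that runs unattended is the last one, which is the low-stakes case where the time saved is smallest.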

This leaves the product in an awkward middle. Low-risk tasks are often too trivial to justify waiting around. High-risk tasks require supervision, which destroys the labor-saving pitch. Complex tasks expose brittle reasoning. Unfamiliar UIs expose brittle navigation. Login walls expose the remote-browser problem: the agent works in its own cloud browser, not in the sessions you are already signed in to. Bad outputs still require human review.

So what are you buying? Not a worker. Not even an intern. More like an intern using remote desktop over hotel Wi-Fi, except the intern occasionally invents LinkedIn profiles and needs applause for finding the search bar.

There are real use cases. Repetitive browser chores with clear steps, low consequences, and easy verification can work. OpenAI's own examples show stronger results on repeated simple interactions, and that category is worth automating. 2 If your task is "move these visible items through this visible web form," fine. Give the robot a spoon.

But do not hand it your travel plans, compliance workflow, CRM cleanup, purchasing process, lead list, customer email, or anything where a plausible-looking wrong answer costs money. The current failure mode is not "it refuses." The dangerous mode is "it continues confidently and leaves you to audit the wreckage."

Verdict: remote-control cosplay, not a digital employee

Operator was important because it showed where AI products are going: from chat windows toward tools that act. That direction is real. The shipping product, though, was not the promised office worker. It was a browser puppeteer with a 38.1% full-computer benchmark score, a $200/month velvet rope at launch, and enough documented failure modes to make "hands off keyboard" sound less like freedom and more like negligence.

The roast verdict: Operator is not useless. It is mis-sold. For low-risk, repetitive browser errands, it can save clicks. For anything messy, expensive, authenticated, or irreversible, it is a very expensive way to rediscover why humans invented QA.

OpenAI wanted to sell a computer-using agent. What it shipped was a computer-using liability with good stage presence.