Blog link: https://tinyurl.com/4xp5ms5s

By Zengyi Qin from MIT. 01/23/2025.

Author's twitter: https://x.com/qinzytech Author's homepage: https://www.qinzy.tech

Background: Our MIT team has developed an internal benchmark for computer-use agents. We tested OpenAI Operator and present 5 cases here. We did not cherry-pick: Operator simply failed all 5 tasks. See below for details.

Key takeaways:

  1. Operator performs very well at visual grounding.
  2. Operator does not fully understand interactive logic. Its computer-use ability is almost surely below that of a college student.
  3. The OpenAI Operator team seems to have devoted a lot of effort to post-training but little to pre-training, because Operator lacks even basic web-use knowledge, which sufficient pre-training would have covered easily.

BTW - Our MIT team is collaborating with data vendors to collect pre-training data for computer use at the hundred-billion-token scale. If you are interested in what we are doing, feel free to contact us.

Task 1

Get an image from Google. Open the image, then apply a 20% decrease in brightness and a 15% increase in contrast.

Failure reason: Operator entered the wrong number.

Operator screen recording (the video may fail to play on mobile; use a computer instead):

https://operator.chatgpt.com/v/6792f1f5e18c8190879571cd580ce717
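
For reference, the numbers the task calls for are straightforward: a 20% decrease in brightness is a factor of 0.80, and a 15% increase in contrast is a factor of 1.15. Here is a minimal Pillow sketch of the intended result (the filenames are hypothetical; the benchmark task itself is performed in an online editor, not in code):

```python
from PIL import Image, ImageEnhance

img = Image.open("photo.jpg")  # hypothetical input file

# 20% decrease in brightness -> enhancement factor 1.0 - 0.20 = 0.80
img = ImageEnhance.Brightness(img).enhance(0.80)

# 15% increase in contrast -> enhancement factor 1.0 + 0.15 = 1.15
img = ImageEnhance.Contrast(img).enhance(1.15)

img.save("photo_adjusted.jpg")
```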

Task 2

Create a new solid color layer with #0000FF, then apply the Outer Glow effect with a 10px size.

Failure reason: Operator does not know how to use the online tools.

Operator screen recording:

https://operator.chatgpt.com/v/6792f1ffc6248190b0e2d5e257f1369c
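
For context, "solid color layer" and "Outer Glow" are standard photo-editor concepts (e.g., in Photoshop or Photopea). Below is a rough Pillow approximation of the intended end state, assuming Outer Glow is rendered as a blurred silhouette composited behind the layer; the canvas size, layer position, and glow color are hypothetical:

```python
from PIL import Image, ImageFilter

canvas = Image.new("RGBA", (400, 400), (255, 255, 255, 255))  # hypothetical canvas
layer = Image.new("RGBA", (200, 200), "#0000FF")              # solid color layer

# Approximate a 10px Outer Glow: paste a tinted copy of the layer's
# silhouette, blur it, and composite it behind the layer.
silhouette = Image.new("RGBA", canvas.size, (0, 0, 0, 0))
silhouette.paste((255, 255, 190, 255), (100, 100), layer.split()[3])
glow = silhouette.filter(ImageFilter.GaussianBlur(10))

result = Image.alpha_composite(canvas, glow)
result.paste(layer, (100, 100), layer)
result.save("outer_glow.png")
```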

Task 3