OpenAI’s o3 AI Model Scores Lower Than Expected in Independent Evaluation
OpenAI is under scrutiny after new independent benchmark results revealed its o3 AI model performs below previously suggested levels, renewing debate over transparency in the artificial intelligence industry.
When OpenAI introduced the o3 model in December 2024, Chief Research Officer Mark Chen claimed during a livestream that o3 achieved over 25% accuracy on FrontierMath, a challenging benchmark for advanced math problem-solving. In contrast, competing models reportedly scored under 2%.
But new data from Epoch AI, the organization behind FrontierMath, suggests the publicly released o3 model scored only around 10% in independent testing, substantially lower than the figure OpenAI initially shared.
Why the Discrepancy?
The difference appears to stem from variations in test conditions. Epoch tested the version of o3 released for public use, while OpenAI's higher reported score came from a more powerful internal version run with substantially more compute.
Epoch also clarified that its test used the updated FrontierMath-2025-02-28-private problem set, not the earlier version OpenAI may have referenced. Differences in compute environment, model tuning, and question subsets likely account for the performance gap.
Adding context, the ARC Prize Foundation, which also tested a pre-release build of o3, noted that the publicly released o3 is a lighter-weight version, tuned for real-world applications such as chatbot use and faster responses rather than optimized for high-end benchmark performance.
OpenAI’s Wenda Zhou, speaking during a recent livestream, echoed that sentiment, noting that the public model was refined to be more cost-efficient and responsive, a trade-off that can lower benchmark scores.
What This Means for OpenAI and the AI Industry
OpenAI did not falsify its results; the company's original announcement also included a lower-bound score consistent with Epoch's findings. Still, the situation highlights growing concerns about benchmark inflation in the AI space.
More powerful versions such as o3-pro are expected soon, and OpenAI's o3-mini-high and o4-mini already outperform o3 on the same benchmark. Even so, this incident serves as a reminder that AI benchmark claims should be approached with caution, especially when they come from companies with products to promote.
OpenAI isn’t alone in facing scrutiny. Meta recently admitted to similar discrepancies with its AI model benchmarks, and xAI, Elon Musk’s AI venture, has also been called out for presenting potentially misleading performance data.
Bottom Line
The AI race is heating up, and as companies compete to showcase breakthroughs, transparency and consistency in testing methods are more important than ever. OpenAI’s o3 remains a powerful model, but the benchmark gap underscores the need for independent validation and clearer communication between developers and the public.