OpenAI’s o3 model falls short of benchmark expectations, scores lower than initial claim

OpenAI’s recently released o3 AI model is facing scrutiny after independent testing revealed it scored significantly lower than previously reported on a leading mathematics benchmark.

The discrepancy has sparked debate in the AI community, raising questions about performance transparency and model comparisons.

Back in December 2024, OpenAI showcased its new o3 model during a live announcement, touting its advanced reasoning abilities. During the presentation, the company claimed that o3 achieved an unprecedented 25% score on the FrontierMath benchmark - a rigorous test designed by more than 70 mathematicians to evaluate problem-solving capabilities in AI. The benchmark is considered resistant to overfitting, as it uses entirely unpublished math problems.

However, with the public release of the o3 and o4-mini models last week, Epoch AI - the organisation behind the FrontierMath test - ran its own evaluation and reported a much lower score: just 10%. While this score still makes o3 the top-performing model on the benchmark, it falls well short of the previously claimed 25%.

Importantly, the gap does not necessarily mean OpenAI misled the public.

Experts believe the version of o3 OpenAI used internally in December likely ran with significantly more compute than the public release. To make the model efficient enough for everyday use, it may have been tuned for speed and cost, sacrificing some of its raw performance.

Supporting this, the ARC Prize organisation - which oversees the ARC-AGI benchmark for general intelligence - also commented on the discrepancy. It confirmed that the publicly available o3 model is not the same as the one tested late last year. According to ARC, the released version runs at lower compute tiers than the version it evaluated, and it was not trained on ARC-AGI data at any stage.

Both ARC Prize and Epoch AI have announced plans to re-evaluate the newly released o3 and o4-mini models and update their benchmark results accordingly.

