OpenAI's latest AI models, o3 and o4-mini, face an unexpected challenge: internal tests show they "hallucinate," or generate inaccurate or fabricated information, more frequently than their predecessors.
This troubling development comes as the company races to outperform rivals like Google, Meta, xAI, Anthropic, and DeepSeek in the intensifying global AI arms race.
The new models, launched on April 16, were designed with enhanced reasoning capabilities that allow them to pause and analyze queries more deeply before responding. Despite this advancement, a report by TechCrunch claims the models demonstrate higher rates of hallucination than even earlier non-reasoning versions such as GPT-4o.
According to OpenAI’s own technical documentation, the issue remains poorly understood. The company admitted that more research is required to determine why reasoning models are increasingly producing inaccurate content. A former OpenAI employee suggested that the specific type of reinforcement learning used in training the o-series models could be worsening problems that were previously kept in check by conventional post-training methods.
Although such hallucinations may sometimes contribute to creative or novel outputs, experts caution that for enterprise-level applications, accuracy is critical. The unpredictability could hamper the models' appeal to businesses seeking dependable AI solutions.
Despite these issues, OpenAI maintains that its new models offer competitive performance. The o3 model reportedly achieved a 69.1% score on SWE-bench (a benchmark used to test coding abilities), while o4-mini trailed closely at 68.1%.
In a separate development, a recent collaborative study by OpenAI and the MIT Media Lab has raised questions about the psychological impact of ChatGPT.
The research found that users who frequently relied on and emotionally bonded with the chatbot were more likely to report feelings of loneliness. While the study acknowledges that loneliness is influenced by various factors, it suggests that the growing emotional attachment to AI may warrant closer scrutiny in mental health discussions.