The evidence is in: ChatGPT's newest reasoning models hallucinate at double the rate of their predecessors. OpenAI's own benchmarks show o3 fabricating answers to 33% of factual queries, double o1's rate, while o4-mini hits 48%. On general-knowledge questions, o4-mini hallucinates 79% of the time. That's nearly four responses in five.

OpenAI's response? "More research is needed." Yes, indeed.

For PhD-level work—defined strictly as the capacity to synthesise contested scholarship accurately and sustain rigorous multi-step reasoning under genuine uncertainty—the picture across all major engines is even worse.

A 2024 PMC study found GPT-4's citation recall for systematic reviews at 13.7%, with a 28.6% hallucination rate on references. A 2025 Springer meta-analysis of 124 studies reported ChatGPT hallucination rates reaching 91% in interpretive synthesis tasks. A November 2025 JMIR study found GPT-4o fabricates 20% of academic citations outright and introduces errors in 45% of real ones. ExpertLongBench—the only benchmark explicitly targeting expert long-form generation—recorded a top F1 of 26.8% across all models tested. The best available AI cannot produce expert-level long-form academic output more than a quarter of the time, by the most generous measure.
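
For readers who don't work in evaluation metrics: F1 is the harmonic mean of precision (how much of what the model wrote is correct and relevant) and recall (loosely, how much of what the expert rubric required the model actually covered), so it only rises when both are high. A minimal sketch of the arithmetic, using hypothetical numbers rather than figures from the benchmark itself:

```python
# Illustrative only: how an F1 score combines precision and recall.
# The inputs below are hypothetical, not values reported by ExpertLongBench.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model whose output is 40% on-target (precision) but covers only 20%
# of what the rubric demands (recall) lands near the cited ceiling:
print(round(f1(0.40, 0.20), 3))  # 0.267
```

Because the harmonic mean can never exceed the larger of its two inputs by much, a 26.8% ceiling means that, even for the best model, at least one of precision or coverage sits at or below that figure.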

Humanity's Last Exam, 2,500 questions set at PhD level, provides the starkest comparative data: human experts score approximately 90%. The best-performing engine as of March 2026 is Gemini 3 Deep Think at 48.4%. ChatGPT's flagship reasoning tool (o3-based Deep Research) scored 26.6%. DeepSeek R1 at launch: 9.4%. The gap between these systems and actual expert-level performance is not a rounding error. It is a chasm.

The deterioration in ChatGPT is not incidental. Stanford and Berkeley documented unpredictable behavioural drift as early as 2023: updates that improve one capability silently degrade another, with zero transparency from OpenAI. The GPT-4o sycophancy scandal of 2025, where OpenAI was forced to publicly roll back an update after the model began validating user delusions rather than correcting them, confirmed what practitioners already knew: the optimisation target is user satisfaction, not epistemic accuracy. When those two objectives diverge, which they do constantly in PhD-level work, the model chooses flattery.

The verdict: no current AI engine meets the PhD standard for deep research combined with deep reasoning. Gemini leads on closed-form reasoning benchmarks. Claude leads on citation discipline and refusal of uncertain queries. ChatGPT has the largest ecosystem and the worst hallucination trajectory, and it is still worsening. DeepSeek punches above its cost on reasoning tasks but lacks the scholarly synthesis infrastructure. Copilot is GPT-4 with a Microsoft interface, inheriting all the same failure modes.

(Based on preliminary research only)

Most of the time, on the hardest questions, AI is confidently wrong.

#ArtificialIntelligence #Leadership #CriticalThinking #Epistemology #Knowledge #Work

Note: The single most important evidential gap is that there is no peer-reviewed longitudinal study tracking the same models on the same expert long-form tasks across multiple version updates over 24 months. Every finding above is cross-sectional or version-specific. The deterioration hypothesis is strongly supported by the o1→o3→o4 hallucination escalation data and by documented behavioural drift, but a definitive monotonic decline curve for PhD-level synthesis specifically does not yet exist in the published literature. That absence is itself the indictment: after three years and hundreds of millions of users, nobody, including OpenAI, is systematically measuring whether these systems are getting better or worse at the work that matters most.
