In 2025, OpenAI introduced new language models designated o3 and o4-mini, which according to official documentation achieve above-average results in performance tests focused on logical reasoning, programming, and scientific tasks. However, internal testing revealed a concerning trend: these newer models generate substantially more false or fabricated information than their predecessors (OpenAI, 2025).
The increased rate of so-called hallucinations is a problem that can have significant consequences for the credibility and deployment of AI systems in areas where accuracy is crucial, such as healthcare, law, or security analytics.
Hallucination Rates in Numbers
OpenAI's internal measurements on the PersonQA benchmark showed the following comparison between different model generations:
| Model   | Hallucination Rate (%) |
|---------|------------------------|
| o1      | 16                     |
| o3-mini | 14.8                   |
| o3      | 33                     |
| o4-mini | 48                     |
Interestingly, the o3-mini model had a lower hallucination rate than o1, which may suggest that smaller parameter capacity sometimes paradoxically encourages greater caution when generating claims.
Another contrasting fact: the o4-mini model also achieved a 68.1% success rate on the SWE-bench Verified benchmark, significantly higher than, for example, Claude 3.7 Sonnet (62.3%), yet o4-mini is the most prone to hallucinations.
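To put the spread in perspective, the reported figures can be compared directly. The snippet below uses only the numbers from the table above and expresses each model's hallucination rate relative to o1:

```python
# Hallucination rates (%) on PersonQA, as reported by OpenAI (2025).
rates = {"o1": 16.0, "o3-mini": 14.8, "o3": 33.0, "o4-mini": 48.0}

baseline = rates["o1"]
for model, rate in rates.items():
    ratio = rate / baseline
    print(f"{model:8s} {rate:5.1f}%  ({ratio:.1f}x o1)")
# o4-mini comes out at 3.0x the o1 rate.
```

A threefold increase between two successive generations is the gap the rest of this article tries to explain.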
Why Do Models "Make Things Up"?
1. Statistical Nature of Generative AI
Models like o3 are not databases of facts but systems for predicting the next word. If the model never "saw" a given fact during training, it creates its own estimate.
This principle enables, for example, creative writing, but it is also the cause of hallucinations, especially in specialized queries.
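The prediction-not-lookup principle can be illustrated with a deliberately tiny toy model. This is a sketch of the general idea only, not OpenAI's architecture: a bigram model trained on a few words will happily "continue" any prompt by chaining statistically likely successors, with no notion of whether the result is true.

```python
import random
from collections import defaultdict

# Toy bigram "language model": record which word follows which.
# Purely illustrative; real models use learned neural probabilities.
corpus = "the model predicts the next word the model never stores facts".split()

follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def continue_text(word, length=5, seed=0):
    """Sample a continuation word by word; the model invents a
    plausible-looking path rather than retrieving a stored fact."""
    rng = random.Random(seed)
    out = [word]
    for _ in range(length):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(rng.choice(candidates))
    return " ".join(out)

print(continue_text("the"))
```

The same mechanism that makes the continuation fluent also makes it unverified, which is exactly the failure mode behind hallucinations in specialized queries.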
2. Absence of Metacognition
According to research in Nature, models cannot reflect on their own uncertainty:
"The model lacks a mechanism that would allow it to label its own statement as speculation" (Li et al., 2024, Nature AI).
3. Excessive Performance Optimization
Benchmarks like GPQA or MATH currently dominate as training targets, and they don't always reflect real-world use. Models are therefore tuned more for performance than for reliability.
Interesting Fact: Hallucinations in "Citations" and References
One of the most noticeable forms of hallucination is fabricating links to documentation or scientific articles. Models often generate credible-looking DOIs that don't actually exist.
This phenomenon is so common that it has been described as Citation Hallucination Bias (Choubey et al., 2023, arXiv).
For example, when tested by the startup Workera, the o3 model generated a link to a GitHub repository that didn't exist and referenced a method that was never implemented.
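A cheap first line of defense against fabricated references is a syntax check. The sketch below is an illustration assumed for this article, not a tool mentioned in it: it flags strings that don't even match the standard DOI pattern. Note that a well-formed DOI can still be fabricated, so passing this check says nothing about whether the reference actually exists; resolving it against doi.org would be needed for that.

```python
import re

# A DOI starts with "10.", a numeric registrant prefix, "/",
# and a non-empty suffix. This validates syntax only.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(candidate: str) -> bool:
    """Return True if the string is at least shaped like a DOI."""
    return bool(DOI_PATTERN.match(candidate))

print(looks_like_doi("10.1038/s41586-024-00001-1"))  # True (well-formed)
print(looks_like_doi("doi:fake-reference-123"))      # False (malformed)
```

In practice such a filter only catches the sloppiest fabrications; the more dangerous hallucinated citations are the syntactically perfect ones.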