In 2025, OpenAI introduced new language models designated o3 and o4-mini, which according to official documentation achieve above-average results in performance tests focused on logical reasoning, programming, and scientific tasks. However, internal testing revealed a concerning trend: these newer models generate substantially more false or fabricated information than their predecessors (OpenAI, 2025).
The increased rate of so-called hallucinations is a problem that can have significant consequences for the credibility and deployment of AI systems in areas where accuracy is crucial, such as healthcare, law, or security analytics.
Hallucination Rates in Numbers
OpenAI's internal measurements on the PersonQA benchmark showed the following comparison between different model generations:
| Model   | Hallucination Rate (%) |
|---------|------------------------|
| o1      | 16                     |
| o3-mini | 14.8                   |
| o3      | 33                     |
| o4-mini | 48                     |
Interestingly, the o3-mini model had a lower hallucination rate than o1, which may suggest that smaller parameter capacity sometimes paradoxically encourages greater caution when generating claims.
Another contrasting fact: the o4-mini model also achieved a 68.1% success rate on the SWE-bench Verified benchmark, significantly higher than, for example, Claude 3.7 Sonnet (62.3%), yet o4-mini is the most prone to hallucinations.
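To put the spread in perspective, the reported figures can be compared directly. The snippet below uses only the numbers from the table above and expresses each model's hallucination rate relative to o1:

```python
# Hallucination rates (%) on PersonQA, as reported by OpenAI (2025).
rates = {"o1": 16.0, "o3-mini": 14.8, "o3": 33.0, "o4-mini": 48.0}

baseline = rates["o1"]
for model, rate in rates.items():
    ratio = rate / baseline
    print(f"{model:8s} {rate:5.1f}%  ({ratio:.1f}x o1)")
# o4-mini comes out at 3.0x the o1 rate.
```

A threefold increase between two successive generations is the gap the rest of this article tries to explain.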
Why Do Models "Make Things Up"?
1. Statistical Nature of Generative AI
Models like o3 are not databases of facts but systems for predicting the next word. If the model never "saw" a given fact during training, it creates its own estimate.
This principle enables, for example, creative writing, but it is also the cause of hallucinations, especially in specialized queries.
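The prediction-not-lookup principle can be illustrated with a deliberately tiny toy model. This is a sketch of the general idea only, not OpenAI's architecture: a bigram model trained on a few words will happily "continue" any prompt by chaining statistically likely successors, with no notion of whether the result is true.

```python
import random
from collections import defaultdict

# Toy bigram "language model": record which word follows which.
# Purely illustrative; real models use learned neural probabilities.
corpus = "the model predicts the next word the model never stores facts".split()

follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def continue_text(word, length=5, seed=0):
    """Sample a continuation word by word; the model invents a
    plausible-looking path rather than retrieving a stored fact."""
    rng = random.Random(seed)
    out = [word]
    for _ in range(length):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(rng.choice(candidates))
    return " ".join(out)

print(continue_text("the"))
```

The same mechanism that makes the continuation fluent also makes it unverified, which is exactly the failure mode behind hallucinations in specialized queries.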
2. Absence of Metacognition
According to research in Nature, models cannot reflect on their own uncertainty:
"The model lacks a mechanism that would allow it to label its own statement as speculation" (Li et al., 2024, Nature AI).
3. Excessive Performance Optimization
Benchmarks like GPQA or MATH currently dominate as training targets, and they don't always reflect real-world use. Models are therefore tuned more for performance than for reliability.
Interesting Fact: Hallucinations in "Citations" and References
One of the most noticeable forms of hallucination is fabricating links to documentation or scientific articles. Models often generate credible-looking DOIs that don't actually exist.
This phenomenon is so common that it has been described as Citation Hallucination Bias (Choubey et al., 2023, arXiv).
For example, when tested by the startup Workera, the o3 model generated a link to a GitHub repository that didn't exist and referenced a method that was never implemented.
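A cheap first line of defense against fabricated references is a syntax check. The sketch below is an illustration assumed for this article, not a tool mentioned in it: it flags strings that don't even match the standard DOI pattern. Note that a well-formed DOI can still be fabricated, so passing this check says nothing about whether the reference actually exists; resolving it against doi.org would be needed for that.

```python
import re

# A DOI starts with "10.", a numeric registrant prefix, "/",
# and a non-empty suffix. This validates syntax only.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(candidate: str) -> bool:
    """Return True if the string is at least shaped like a DOI."""
    return bool(DOI_PATTERN.match(candidate))

print(looks_like_doi("10.1038/s41586-024-00001-1"))  # True (well-formed)
print(looks_like_doi("doi:fake-reference-123"))      # False (malformed)
```

In practice such a filter only catches the sloppiest fabrications; the more dangerous hallucinated citations are the syntactically perfect ones.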