In recent years, progress in artificial intelligence has intensified the debate over whether language models are capable of truly "thinking", and where sophisticated word-pattern prediction ends and genuine reasoning begins.
While most AI evaluation focuses on performance in translation, text comprehension, or code generation, a newer line of testing targets the analytical and cognitive abilities of models - that is, what we commonly call intelligence.
The TrackingAI.org project brings a fresh perspective to this debate: it tests language models on tasks commonly used in human IQ tests, such as Raven's Progressive Matrices or the Mensa Norway test.
Testing AI Models Using IQ Tests
IQ tests - built on the ability to recognize patterns, reason deductively, and understand structure - have so far been the domain of human intelligence. However, as the results from the TrackingAI platform show, it is possible to apply these tests to artificial intelligence as well, with interesting implications.
TrackingAI uses two main types of testing (a minimal scoring sketch follows the list):
- Offline IQ tests - tasks created independently, so they do not appear in model training data.
- The standardized Mensa Norway test, commonly used for evaluating human IQ.
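How such a protocol could be scored is straightforward to sketch. The Python snippet below is illustrative only: the ask_model() wrapper, the sample item, and the letter-extraction rule are all assumptions for the sake of the example, not TrackingAI's published code.

    import re

    def ask_model(prompt: str) -> str:
        """Placeholder for a real chat-API call (hypothetical)."""
        raise NotImplementedError("wire this up to your model provider")

    # Illustrative item in the spirit of an "offline" test question;
    # this is NOT one of TrackingAI's actual tasks.
    ITEMS = [
        {"question": "Which number continues the sequence 2, 4, 8, 16, ...?",
         "options": {"A": "18", "B": "24", "C": "32", "D": "64"},
         "answer": "C"},
    ]

    def administer(items) -> int:
        """Present each item, parse the first answer letter, tally correct replies."""
        raw_score = 0
        for item in items:
            choices = "\n".join(f"{key}) {text}" for key, text in item["options"].items())
            prompt = (f"{item['question']}\n{choices}\n"
                      "Reply with the letter of the correct option only.")
            reply = ask_model(prompt)
            match = re.search(r"\b([A-D])\b", reply.upper())
            if match and match.group(1) == item["answer"]:
                raw_score += 1
        return raw_score

A real harness would add retries, randomized option order, and per-item logging, but the core loop is just: present the item, parse a single letter, compare it to the answer key.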
Results show that some models (e.g., Gemini 2.5 Pro, Claude 3, or GPT-4.5) score above the IQ 110 threshold, which in human terms corresponds to above-average intelligence.
In contrast, others, including earlier versions of Llama and some vision models, score in the 60-80 point range, well below the human average.
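For context, these IQ figures are standard scores, not percentages of correct answers: modern IQ scales are typically normed to a mean of 100 and a standard deviation of 15 in the human population. Assuming those norming parameters (the exact conversion used for these tests is not given here), a raw score maps to an IQ roughly as follows:

    def raw_to_iq(raw_score: float, norm_mean: float, norm_sd: float) -> float:
        """Convert a raw test score to an IQ-style standard score.

        norm_mean and norm_sd describe raw scores in the human norming
        sample; the values used below are assumed for illustration.
        """
        z = (raw_score - norm_mean) / norm_sd
        return 100 + 15 * z

    # Example: if humans average 23/35 correct with an SD of 5, a model
    # scoring 26/35 lands at 100 + 15 * (26 - 23) / 5 = 109.
    print(raw_to_iq(26, norm_mean=23.0, norm_sd=5.0))  # -> 109.0

On this scale, a model "above 110" outperformed the average human test-taker, while the 60-80 range sits more than one standard deviation below the human mean.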