AI Excellence: GPT-5 and Gemini 2.5 Pro Dominate the IOAA

GPT-5 and Gemini 2.5 Pro have redefined AI capabilities by achieving gold medal performance at the IOAA, showcasing advanced reasoning and problem-solving skills in complex astrophysical contexts.

Article written by

Jan Lisowski

GPT-5 and Gemini 2.5 Pro have recently set a groundbreaking benchmark by achieving gold medal performance at the International Olympiad on Astronomy and Astrophysics (IOAA), outperforming elite human competitors across multiple exam years, 2022 through 2025[1][3].

This achievement is revealing from a model architecture and performance perspective. Both GPT-5 and Gemini 2.5 Pro excelled particularly in the theoretical exam components of the IOAA, where problems demand advanced scientific reasoning, geometric and spatial imagination, and the integration of complex astrophysical concepts with mathematical rigor[1].

Counterintuitively, both models handled difficult questions better than easier ones, inverting the typical performance curve of AI systems. Analysis suggests this anomaly arises partly because the exam features only a small number of questions in each difficulty category, so a handful of errors produces large swings in per-category scores. More importantly, GPT-5's errors clustered in questions requiring sophisticated geometric reasoning and spatial visualization, indicating that symbolic spatial inference remains hard to ground in predominantly language- and pattern-based architectures[1].
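To see why small question counts inflate score swings, consider the minimal simulation below. The numbers are invented for illustration and are not actual IOAA statistics: with only a few questions per difficulty tier, a single miss moves the tier score by tens of percentage points.

```python
import random

# Illustrative simulation (hypothetical numbers, not IOAA data):
# how much the observed per-tier score fluctuates when each
# difficulty tier contains only a few questions.
random.seed(0)

def score_std_dev(n_questions: int, p_correct: float, trials: int = 10_000) -> float:
    """Standard deviation of the observed tier score, in percentage points."""
    scores = [
        100 * sum(random.random() < p_correct for _ in range(n_questions)) / n_questions
        for _ in range(trials)
    ]
    mean = sum(scores) / trials
    var = sum((s - mean) ** 2 for s in scores) / trials
    return var ** 0.5

for n in (3, 5, 30):
    print(f"{n:2d} questions per tier -> score std dev ~ {score_std_dev(n, 0.8):.1f} pp")
```

With a true accuracy of 80%, a three-question tier swings by roughly 23 percentage points from run to run, while a thirty-question tier swings by about 7, so per-difficulty rankings on a short exam are inherently noisy.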

Further technical nuance arises from the exam's data-analysis section, where the error distribution was more diffuse. Common failure modes included image and chart interpretation and long multi-step calculations, highlighting a domain where numerical stability and the handling of visual inputs may yet lag behind the models' theoretical reasoning strengths[1].
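The calculation failure mode has a familiar analogue in ordinary floating-point arithmetic, where naively accumulating many terms silently loses precision. A minimal sketch, purely illustrative and not drawn from the exam analysis:

```python
import math

# Illustrative only: naive term-by-term summation accumulates rounding
# error, while math.fsum performs an exactly-rounded summation.
values = [0.1] * 10  # the true sum is exactly 1.0

naive = sum(values)        # 0.9999999999999999 after repeated rounding
exact = math.fsum(values)  # 1.0

print(naive == 1.0)  # False
print(exact == 1.0)  # True
```

Long chains of arithmetic invite exactly this kind of drift, which is one reason tool-assisted calculation is attractive for exam-style quantitative work.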

This advance underscores a key emerging paradigm in AI research: hybridizing transformer architectures with domain-specific reasoning modules to tackle intricate, multidisciplinary STEM problems. While pure transformer models like GPT-5 showcase tremendous raw reasoning and language understanding, their spatial and quantitative reasoning capabilities remain areas of active optimization, especially evident when evaluated against elite human benchmarks.
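One plausible shape for such a hybrid is sketched below: a language model proposes a structured subtask, and symbolic or numeric work is routed to a dedicated solver rather than carried out in free text. The llm_propose stub and route_subproblem dispatcher are hypothetical and say nothing about GPT-5's or Gemini 2.5 Pro's actual internals.

```python
import sympy as sp

# Hypothetical hybrid-pipeline sketch: a language model decomposes a
# problem into structured subtasks, and quantitative subtasks are
# dispatched to a symbolic engine (sympy) instead of free-text reasoning.

def llm_propose(question: str) -> dict:
    """Stub standing in for a real model call; not an actual API."""
    # Pretend the model reduced the question to finding a root.
    return {"kind": "solve", "expr": "sin(x) - x/2", "var": "x"}

def route_subproblem(task: dict):
    """Send structured subtasks to the appropriate solver."""
    if task["kind"] == "solve":
        x = sp.symbols(task["var"])
        expr = sp.sympify(task["expr"])
        # nsolve needs an initial guess; 2.0 is arbitrary for this demo.
        return sp.nsolve(expr, x, 2.0)
    raise ValueError(f"unknown task kind: {task['kind']}")

print(route_subproblem(llm_propose("Where does sin(x) equal x/2?")))
# ~1.8955, the nonzero solution of sin(x) = x/2
```

The design point is the division of labor: the language model supplies decomposition and interpretation, while exact computation is delegated to a component built for it.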

Ultimately, these results elevate the benchmark for AI model evaluation beyond natural language tasks—demanding proficiency in mathematically grounded scientific problem-solving and multi-modal data interpretation. The ability of GPT-5 and Gemini 2.5 Pro to navigate this complex space signals a new frontier in AI: models not only generating text but also reliably engaging with structured scientific knowledge and real-world problem complexity at a highly competitive level[3].
