Google Gemini Ultra 2.0 Benchmarks: Outperforms GPT-5 on Most Major AI Tests

Google DeepMind published benchmark results for Gemini Ultra 2.0 on Thursday that immediately ignited the most closely watched competitive comparison in artificial intelligence: the race between Google’s flagship model and OpenAI’s GPT-5 across the standardised evaluation benchmarks that researchers and industry observers use to assess AI capability. The results show Gemini Ultra 2.0 outperforming GPT-5 on 11 of the 17 benchmarks in the comparison set, with its clearest margins on multimodal reasoning, mathematical problem solving, graduate-level reasoning and long-context comprehension – the ability to understand and reason about very long documents or conversations without losing track of information earlier in the context. GPT-5 maintained leads on creative writing evaluations, conversational quality assessments, certain reasoning tasks involving commonsense inference and, narrowly, the HumanEval coding benchmark. The overall picture is of two models that are closer in capability than they have ever been, with differences that depend heavily on the specific use case.

The benchmark release landed in the middle of the annual summer AI announcement cycle, at a moment when the race between Google, OpenAI, Anthropic and Meta for AI model leadership has never been more commercially consequential. Google’s AI business – which encompasses the Gemini API sold to enterprise customers, the AI-powered search features deployed across Google’s consumer products, and the infrastructure that powers both – is now a substantial contributor to Alphabet’s revenue and the subject of intense investor and analyst scrutiny as the primary indicator of Google’s ability to compete in a rapidly evolving technological landscape. Strong benchmark results carry commercial weight in a market where enterprise customers are making multi-year, multi-million-dollar infrastructure commitments based partly on assessments of which AI provider offers the most capable models.

Key Benchmark Results

MMMU (Multimodal Understanding): Gemini Ultra 2.0 scored 91.3% vs GPT-5’s 87.6%, a significant margin on one of the most widely cited multimodal benchmarks.
MATH (Mathematical Problem Solving): Gemini Ultra 2.0 scored 94.7% vs GPT-5’s 92.1%. Both represent improvements of more than 10 percentage points over their predecessors from 18 months ago.
HumanEval (Coding): GPT-5 scored 93.1% vs Gemini Ultra 2.0’s 91.8%, a narrow GPT-5 lead on one of the most commercially important benchmarks.
GPQA Diamond (Graduate-Level Reasoning): Gemini Ultra 2.0 scored 82.4% vs GPT-5’s 79.8%.
Long Context (1M token): Gemini Ultra 2.0’s 1 million token context window produced the best long-context comprehension scores of any model tested, a category where Google has consistently led.
Creative Writing (Human Preference): GPT-5 maintained a lead in human preference evaluations for creative writing quality, consistent with OpenAI’s historically stronger performance on open-ended generation tasks.
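
The head-to-head margins follow directly from the scores listed above. As a minimal sketch, the Python snippet below recomputes them for the four benchmarks where both figures were reported; the dictionary and function names are illustrative assumptions and are not taken from Google's published evaluation code.

```python
# Scores in percent, copied from the list above as (Gemini Ultra 2.0, GPT-5).
# Only benchmarks with both numbers reported are included.
# The names `scores` and `margins` are illustrative, not from any published code.
scores = {
    "MMMU": (91.3, 87.6),
    "MATH": (94.7, 92.1),
    "HumanEval": (91.8, 93.1),
    "GPQA Diamond": (82.4, 79.8),
}


def margins(results):
    """Return Gemini-minus-GPT-5 margins in percentage points, rounded to 1 dp."""
    return {name: round(gemini - gpt, 1) for name, (gemini, gpt) in results.items()}


m = margins(scores)
widest = max(m, key=lambda name: m[name])  # benchmark with the largest Gemini lead

# Print the benchmarks from largest Gemini lead to largest GPT-5 lead.
for name, delta in sorted(m.items(), key=lambda item: -item[1]):
    leader = "Gemini Ultra 2.0" if delta > 0 else "GPT-5"
    print(f"{name}: {leader} leads by {abs(delta):.1f} points")
```

Run as written, it reports MMMU as the widest gap at 3.7 points and HumanEval as the only listed benchmark where GPT-5 leads, by 1.3 points.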

What Benchmarks Can and Cannot Tell Us

The benchmark release inevitably generates debate about how much weight to place on standardised test performance versus real-world utility, and that debate is particularly relevant in the current competitive environment. AI benchmarks are designed and administered by different organisations with different methodological approaches, and model developers have been known to optimise training data and fine-tuning procedures specifically for benchmark performance in ways that may not generalise to the diverse real-world tasks that actual users care about. The possibility of benchmark overfitting – training models to perform well on specific evaluations without commensurate improvements in general capability – is a known concern in the field.

Google DeepMind’s release addressed this concern directly, publishing the methodology used for evaluation alongside the results and making the evaluation code publicly available for independent reproduction. The company has also published results on a separate set of ‘held-out’ benchmarks that were not disclosed to any model development team in advance of testing – a methodological choice designed to reduce the likelihood that the results reflect benchmark-specific optimisation rather than genuine capability. Independent AI research organisations have begun replicating some of the evaluations and initial results broadly support Google’s published numbers, though full independent verification of the complete benchmark set is still ongoing.

What This Means for Enterprise AI Customers

For the enterprise customers who are increasingly the primary commercial audience for frontier AI models, the competition between Gemini Ultra 2.0 and GPT-5 offers a genuine choice between two models that are capable of handling most high-stakes enterprise tasks at a level of quality that was not achievable with any AI system 18 months ago. The practical differentiation points for enterprise customers are as much about ecosystem, pricing, integration and support as they are about raw benchmark performance – and in those dimensions, Google and OpenAI offer meaningfully different propositions.

Google’s advantage lies in its deep integration with Google Workspace products, the availability of Gemini models directly within the tools that most enterprise employees already use daily, and the extraordinary scale of Google’s infrastructure – particularly relevant for customers who need to run inference at very large volumes. OpenAI’s advantage lies in the maturity of its enterprise API ecosystem, the depth of the developer community that has built tooling and applications around its models, and the trust that comes from being the organisation most associated with the public conversation about AI capability. Neither advantage is decisive for all customers, which is why the market for frontier AI models is increasingly competitive in ways that benefit enterprise buyers through pricing pressure, capability improvements and feature development.

The Multimodal Capability Gap

The area where Gemini Ultra 2.0’s performance advantages over GPT-5 are most consistent and most practically significant is multimodal capability – the ability to process, understand and reason about information that combines text with images, video, audio and other data types. Google’s investment in multimodal AI research stretches back to before the large language model era, and the architecture decisions made during Gemini’s development specifically to support native multimodal processing have produced a model that handles visual and audio inputs with a naturalness and depth that text-focused models with add-on multimodal capabilities cannot match. The MMMU benchmark result – Gemini Ultra 2.0’s widest reported margin over GPT-5 – tests exactly this kind of integrated visual and textual reasoning across a diverse range of academic and professional subject areas. The 3.7 percentage point margin (91.3% against 87.6%) is a meaningful rather than marginal difference in benchmark terms.

The practical implications of this multimodal advantage are growing as the use cases for AI move beyond text-based applications into domains where the ability to understand images, charts, diagrams, video and audio is essential rather than supplementary. Medical imaging analysis, product defect detection in manufacturing, architectural and engineering document review, video content moderation and customer service applications that involve both visual and conversational elements all depend on multimodal capability in ways that make the performance difference between Gemini and GPT-5 in this area commercially relevant for the enterprise customers making large AI infrastructure decisions. Google has been deliberate about emphasising these multimodal use cases in its enterprise marketing precisely because they represent the domain where its technical advantage is most defensible.

The Open Source Dimension

The competition between Google’s Gemini models and OpenAI’s GPT-5 is complicated by the presence of a third set of competitors: the open source AI model ecosystem, led by Meta’s Llama model family, Mistral AI’s models and the growing number of community-trained models built on publicly available architectures and weights. The largest open source models have reached capability levels that are competitive with commercial models for many applications, and their cost economics – essentially free to use for organisations with the infrastructure to run them – create pressure on the pricing models of both Google and OpenAI that neither company has fully resolved.

Google’s response to the open source challenge has been characteristically strategic: the company maintains a family of smaller, openly available Gemma models alongside the proprietary Gemini Ultra series, allowing it to participate in the open source ecosystem and build developer goodwill while protecting the most capable and commercially significant model capabilities behind its API. OpenAI has been more reluctant to engage with open source, a position that has cost it some developer mindshare but that the company argues is justified by safety and commercial considerations. The open source landscape will continue to evolve in ways that challenge both companies to define their value propositions relative to capable, free alternatives – a competitive pressure that was largely absent from the AI market three years ago but that has become one of its defining dynamics.


Trust Post Desk

A journalist and editor at TrustPost.org covering world and national news, technology updates and human-interest stories. They check every fact, interview sources in person or online, and aim to deliver clear, accurate reporting. Their work ranges from breaking news to in-depth features and daily newsletters. Outside the newsroom, they follow emerging trends and engage with readers on social media.
