
New AI models and the benchmark paradox

  • Writer: FirstPrinciples
  • Aug 7
  • 4 min read

Z.ai's new open-source model and Harmonic's math-native Aristotle highlight contrasting strategies for AI reasoning, while a new wave of increasingly specialized benchmarks invites us to rethink how we measure progress.


Last week offered a snapshot of AI's evolving ambitions and contradictions. On one side of the Pacific, China's Z.ai unveiled its flagship open-source language model, GLM-4.5, alongside a cost-efficient variant, GLM-4.5-Air. On the other, Silicon Valley startup Harmonic debuted Aristotle, an AI model designed to deliver formally verified, hallucination-free mathematical reasoning.


Also in focus at FirstPrinciples last week was the continued proliferation of AI benchmarks. Each benchmark offers insights into model capabilities, but together, they reveal a fractured approach to measuring progress.


GLM-4.5: Open-source generalism with agentic intent

GLM-4.5 is more than just another GPT-style model. Built on a vertical mixture-of-experts (MoE) architecture, it selectively activates only 32 billion of its 355 billion parameters during inference, emphasizing depth over breadth. The design enables two operating modes: a lightweight “non-thinking” mode for fast, simple tasks and a heavyweight “thinking” mode for multi-step reasoning and agentic planning. Supporting native function calling, document and web browsing, image generation, and a 128k-token context window, GLM-4.5 is a generalist agent out of the box. At less than $0.30 per million output tokens, it also sets a new price-performance bar among open models, outcompeting recent releases from DeepSeek and xAI's Grok mini.
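To make the sparse-activation idea concrete, below is a minimal toy sketch of top-k mixture-of-experts routing in PyTorch. It is not GLM-4.5's implementation; the layer sizes, expert count, and routing details are assumptions chosen only to illustrate how a gating network activates a small fraction of a model's parameters for each token.

```python
# Toy sparse mixture-of-experts layer (illustrative only, not GLM-4.5's code).
# A router scores all experts, but only the top-k run per token, so most
# parameters stay inactive during inference.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # routing network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                # x: (tokens, d_model)
        scores = self.gate(x)                            # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out


tokens = torch.randn(4, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([4, 64])
```

The same principle, scaled up, is how a 355-billion-parameter model can serve requests while computing with only about 32 billion parameters at a time.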


Benchmark results reflect this positioning. GLM-4.5 ranks second or third globally across most categories, trailing only proprietary giants like xAI's Grok-4 and OpenAI's GPT-4o. It performs well on reasoning, math, agentic tasks, and programming challenges. Yet the most provocative aspect may be its openness. Released on Hugging Face and ModelScope, the model is auditable and deployable without API lock-in, a clear challenge to the current foundation-model oligopoly.


Harmonic's Aristotle: Formal correctness in a generative age

Meanwhile, Harmonic's Aristotle steps in as a bold counterpoint. Instead of generalist capabilities, it aspires to formal mathematical rigor. Aristotle claims not only to solve problems but to verify its reasoning using formal proof systems like Lean 4. Its core pitch is as much philosophical as technical: if generative models continue to hallucinate, shouldn't we pursue correctness in specific domains rather than generalist fluency?
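For readers unfamiliar with Lean 4, the snippet below is a generic illustration of what machine-checked reasoning looks like; it is not output from Aristotle. The point is that Lean's kernel verifies every step, so an incorrect proof simply fails to compile rather than slipping through as a plausible-sounding hallucination.

```lean
-- Illustrative Lean 4 examples (not Aristotle output).
-- Each statement is verified by the kernel; a wrong proof will not compile.

-- Reusing a library lemma, which the kernel re-checks when the file is built.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic fact, verified by computation rather than trusted prose.
example : 12 * 12 = 144 := rfl
```

Aristotle's claim is that it can produce proofs in this verifiable form, which is what separates “formally verified” from merely “convincing.”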


Harmonic logo (Credit: Harmonic)

Aristotle also recently achieved gold medal-level performance at the 2025 International Mathematical Olympiad (IMO). This result lends weight to Harmonic's claim that formalism can deliver competitive, even superhuman, performance in structured domains. Still, its scope is deliberately constrained. Harmonic isn’t claiming to replace broad LLMs; it’s betting that in high-stakes technical fields, correctness, not coverage, will define the next frontier.


The benchmark paradox: Evaluating the evaluators

Every week seems to bring a new benchmark, each evaluating some slice of a model's capabilities: tool use, multi-step reasoning, browsing, math proofs, planning, citation integrity. The granularity offers precision, but also proliferation. At this pace, one wonders whether we risk evaluating ourselves into incoherence.


From domain-specific efforts like PhysUniBench and the development of a living physics LLM benchmark to more generally applicable frameworks like GAIA, the signal is increasingly hard to parse. Each of these benchmarks offers its own lens on model performance, but how are we supposed to sift through the results and compare them meaningfully? With this proliferation, one has to ask: at what point will we need benchmarks for our benchmarks?


Subfield distribution of PhysUniBench: Classical Mechanics 34%, Electromagnetism 29%, Optics 10%, among others (Credit: Wang et al. 2025)

Of course, there's utility in this benchmark explosion. It exposes hidden failure modes and helps us understand edge cases. But as benchmarks begin to shape model architecture, training curricula, and even product direction, we risk overfitting to the benchmarks themselves rather than to real-world performance. Already, some research teams are calling for meta-benchmarking efforts to evaluate the evaluators, rank the relevance of tasks, and model interdependencies among test domains.
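As a rough sketch of what one such meta-benchmarking step could look like, the snippet below measures how strongly different benchmarks agree in how they rank a set of models. The benchmark names and scores are invented for illustration; strongly correlated benchmarks add little independent signal, while weakly correlated ones are measuring something distinct.

```python
# Hypothetical meta-benchmarking sketch: how redundant are our benchmarks?
# Model scores below are fabricated for illustration, not real results.
import numpy as np
from scipy.stats import spearmanr

benchmarks = ["reasoning", "math", "agentic", "coding"]
scores = np.array([          # rows = models, columns = benchmarks
    [72.1, 64.3, 55.0, 61.2],
    [68.4, 70.9, 49.8, 66.5],
    [81.0, 75.2, 63.3, 74.1],
    [59.7, 52.4, 41.9, 50.3],
    [77.5, 69.8, 58.6, 70.0],
])

rho, _ = spearmanr(scores)   # pairwise rank correlations between benchmark columns
for i, a in enumerate(benchmarks):
    for j, b in enumerate(benchmarks):
        if i < j:
            print(f"{a:>9} vs {b:<9} rank correlation: {rho[i, j]:+.2f}")
```

If two benchmarks rank every model in nearly the same order, publishing both tells us little more than publishing one.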


To move forward, we may need to reimagine our current evaluative toolkit. Concern about the quality and comparability of AI benchmarks has begun to bubble up in the scientific community. In high-stakes environments, benchmarks serve not only to measure model capabilities, but to influence downstream deployment, research agendas, and policy recommendations. However, not all benchmarks are created equal. A NeurIPS paper introduced an assessment framework built around 40 best practices and applied it to 25 widely used AI benchmarks. The findings were striking.


Large quality differences exist across benchmarks, with many failing to report statistical significance or ensure replicability of their results. To remedy this, the authors proposed a checklist for minimum quality assurance and launched a living repository of benchmark assessments to support more transparent comparisons.
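On the statistical-significance point, one concrete practice along the lines the checklist encourages is reporting an uncertainty interval rather than a single headline accuracy. The sketch below bootstraps a 95% confidence interval over per-item results; the pass/fail data is fabricated purely for illustration.

```python
# Minimal sketch: report a benchmark score with a bootstrap confidence interval
# instead of a single point estimate. Per-item results are fabricated.
import numpy as np

rng = np.random.default_rng(0)
item_correct = rng.random(500) < 0.62   # fabricated pass/fail outcome per test item

boot_means = [
    rng.choice(item_correct, size=item_correct.size, replace=True).mean()
    for _ in range(2000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"accuracy = {item_correct.mean():.3f}  (95% bootstrap CI: {lo:.3f}-{hi:.3f})")
```

Two models whose intervals overlap substantially may not be meaningfully different, which is exactly the nuance a single leaderboard number hides.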



Last week’s conversations made one thing clear. The frontier is splintering into many paths: open and closed, fluent and rigorous, task-driven and theory-aware. If we want AI to do more than perform, our evaluations must evolve from benchmarks into blueprints, guiding not just what we test, but what we value. And if we are to build AI systems that reason, act, and explain with integrity, we will need evaluations that evolve just as thoughtfully.


This article was created with the assistance of artificial intelligence and thoroughly edited by FirstPrinciples staff and scientific advisors.

 
 