AI faces a tough physics exam: New benchmark reveals the challenge
- FirstPrinciples
- Jul 9, 2025
- 7 min read
Large language models have advanced dramatically in recent years, yet when physicists gave them an undergraduate-level test, even the best models answered only about one in three questions correctly. The new PhysUniBench benchmark exposes how far AI still has to go in mastering fundamental science.
Large language models (LLMs) have made impressive progress across domains and can now draft legal briefs, debug code, and ace high-school exams. However, a new study shows that they remain far from proficient in fundamental physics. An international research team (including contributors from Australia, China, and the United States) has introduced PhysUniBench, a benchmark of undergraduate-level physics problems designed to test AI reasoning in fundamental science.
The results offer a clear picture: even state-of-the-art AI systems were capable of answering only about one-third of the questions correctly, underscoring the difficulty of achieving true proficiency in physics and other fundamental sciences. At the same time, the study also highlights the inherent complexity of evaluating large language models. From potential training data overlap to variation in how models approach reasoning tasks, performance metrics alone may not fully capture a model’s true scientific understanding.
A new benchmark for scientific reasoning
PhysUniBench was developed from university curricula and contains over 3,300 physics questions across eight major subfields, from classical mechanics to quantum theory. Each question pairs text with a visual diagram or element, reflecting how physics problems often require interpreting figures or graphs.

The authors explain that, “unlike prior benchmarks that focus on text-only math or physics tasks, PhysUniBench emphasizes multimodal scientific reasoning” by forcing models to integrate text and visuals in “concept-rich, symbol-heavy, and context-dependent” problems.
The questions range from multiple-choice conceptual queries to open-ended problems that demand free-form solutions. To ensure the benchmark is truly challenging, the team used an iterative model-in-the-loop process: draft questions were repeatedly posed to a model, and any problems it solved easily were filtered out. Each remaining question was assigned a difficulty level from 1 to 5, creating a graduated benchmark grounded in real-world curricula.
It is important to note that while this filtering method ensures the benchmark includes genuinely difficult problems, it also introduces a selection bias. By design, the final dataset excludes easier questions that current models can already solve. As a result, overall model accuracy scores on PhysUniBench may appear disproportionately low, reflecting a deliberate emphasis on failure modes rather than average performance.
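The paper's exact pipeline is not reproduced here, but the model-in-the-loop idea can be sketched in a few lines of Python. The `model_answers` callable, the number of trials, and the pass threshold below are hypothetical stand-ins for whatever reference model and criteria the team actually used, and the mapping from solve rate to a 1-5 difficulty level is likewise only illustrative.

```python
import random

def filter_and_grade(questions, model_answers, n_trials=4, easy_threshold=0.75):
    """Model-in-the-loop filtering: drop questions a reference model solves
    reliably, and grade the rest by how often it fails.

    `model_answers(question)` is a hypothetical callable that queries a strong
    reference model once and returns True if its answer is judged correct.
    """
    benchmark = []
    for q in questions:
        # Estimate how "easy" the question is by asking the model several times.
        solve_rate = sum(model_answers(q) for _ in range(n_trials)) / n_trials

        # Questions the model already solves reliably are filtered out.
        if solve_rate >= easy_threshold:
            continue

        # Map the failure rate onto a 1-5 difficulty level (5 = hardest).
        difficulty = 1 + round(4 * (1 - solve_rate))
        benchmark.append({"question": q, "difficulty": difficulty})
    return benchmark

# Toy usage with a stand-in "model" that answers correctly 30% of the time.
toy_questions = [f"Problem {i}" for i in range(10)]
toy_model = lambda q: random.random() < 0.3
for item in filter_and_grade(toy_questions, toy_model):
    print(item["question"], "-> difficulty", item["difficulty"])
```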
Underwhelming performance by top AI models
Even the most sophisticated models today fell short of basic proficiency, with overall accuracies ranging from ~17% to ~37%. While some models, such as Claude 3.5-Sonnet and GPT-4o-mini, led slightly on overall scores, no system demonstrated consistent strength in both multiple-choice and open-ended questions or across all physics subfields. In particular, performance on open-ended questions remained weak across the board, highlighting ongoing challenges in symbolic reasoning, multi-step problem-solving, and conceptual understanding.

Certain topics proved especially challenging. For instance, no model managed to solve the open-ended quantum mechanics questions except GPT-4o, which achieved just 6.2% accuracy in that category. In subjects like thermodynamics and relativity, performance on open-ended questions was similarly poor, with all models scoring in the low single digits. By comparison, a well-prepared undergraduate student would be able to answer a large majority of these questions correctly.

As a point of contrast, OpenAI once celebrated GPT-4’s 80% score on the AP Physics 2 exam, a high-school test. PhysUniBench shows undergraduate physics problems are in an entirely different league for today’s LLMs.
These results make it clear that current AI systems “struggle with advanced physics reasoning, especially on multi-step problems and those requiring precise diagram interpretation”. Unlike trivia or straightforward question answering, physics problems often require chaining together multiple reasoning steps (e.g. understanding the problem, recalling the right principles, setting up equations, performing calculations, checking units or conditions). Any break in this logical chain leads the model astray. And while language models excel at generating fluent answers, fluency doesn’t equal accuracy when it comes to physics. It’s easy for an AI to produce an answer that sounds plausible but is wrong.
Yet there's also a deeper layer of ambiguity. The 3,300 questions in PhysUniBench are sourced from publicly available textbooks, exams, exercises, and competition problems that may overlap with the training data of large language models. Recent studies (e.g. on LLaMA 3.1) have shown that frontier models can memorize substantial portions of their training data, reproducing long passages verbatim. This raises the possibility that some model responses on PhysUniBench don’t reflect genuine reasoning, but rather partial memorization of questions or solutions. It also makes it hard to compare models, since their results can depend as much on the data they’ve seen as on their architecture or reasoning ability.
To mitigate this concern, some researchers are advocating for benchmarks built from curated, unpublished, or adversarially generated problems that are unlikely to appear in any pretraining corpus. Such efforts are still emerging, but they point toward a more reliable future for measuring scientific reasoning in AI.
Why physics stumps AI
Why is undergraduate physics such a tall order for AI? The findings underscore a familiar issue: current AI systems, particularly LLMs, excel at pattern recognition but lack robust reasoning mechanisms.
The challenges highlighted by PhysUniBench reflect some of the fundamental gaps in AI reasoning:
Multi-step logical reasoning
Physics questions typically require several steps of deduction or calculation, not just a single inference. A model must combine concepts (e.g. Newton’s laws or Maxwell’s equations) with a problem’s specific requirements and then work through a solution step by step. LLMs often lose coherence across such sequences.
Mathematical rigor
Physics relies on algebra, calculus, and symbolic manipulation, which demands a level of mathematical precision that AI models struggle with. Researchers note that solving equations requires not only understanding natural language but also manipulating mathematical notation correctly and recalling the right formulas and constants. Current models tend to make algebraic mistakes or arithmetic slips that a trained human would catch, or would never make in the first place.
Interestingly, there have been cases where LLMs have produced correct answers but arrived at them through reasoning paths that differ significantly from conventional mathematical logic, highlighting a gap between outcome accuracy and true methodological understanding.
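To make the kind of precision at stake concrete, here is a small worked example (not drawn from the benchmark itself) in which SymPy carries out, exactly, the sort of derivation a model must perform implicitly: solving for a projectile's time of flight and reducing the range to its textbook form.

```python
# A worked symbolic example: the projectile-range derivation an LLM has to
# carry out "in its head", done here with SymPy so every step is exact.
import sympy as sp

v0, g, theta, t = sp.symbols("v0 g theta t", positive=True)

# Vertical position y(t) = v0*sin(theta)*t - g*t**2/2; find when it returns to 0.
y = v0 * sp.sin(theta) * t - sp.Rational(1, 2) * g * t**2
roots = sp.solve(sp.Eq(y, 0), t)
t_flight = [r for r in roots if r != 0][0]   # discard the trivial root t = 0

# Horizontal range R = v0*cos(theta) * t_flight, simplified to closed form.
R = sp.simplify(v0 * sp.cos(theta) * t_flight)
print(R)  # equivalent to the textbook result v0**2*sin(2*theta)/g

# An exact check that no algebraic slip crept in along the way.
assert sp.simplify(R - v0**2 * sp.sin(2 * theta) / g) == 0
```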
Visual and contextual understanding
Every PhysUniBench question includes a diagram or figure, so the AI must interpret visual information in context. For example, a problem might show an electric circuit or a block-on-incline diagram. Integrating text and image understanding is still an open challenge. Today’s multimodal models can describe images, but parsing a diagram to extract the relevant physical parameters (like angles, distances, or connections) is far more complex.
Deep conceptual understanding
Beyond math, true proficiency requires grasping the underlying concepts and laws of physics. A human student leverages intuition about how systems behave (e.g. knowing a dropped object falls downward, or how conservation laws constrain outcomes). AI models lack genuine physical intuition, only recognizing patterns from training data. If a problem’s solution hinges on understanding a subtle concept (say, entropy or quantum superposition), an LLM might easily misuse it or fall back on clichés.
Self-correction and verification
Human problem-solvers often double-check whether an answer makes sense. Current AI models do not have reliable mechanisms to sanity-check their answers. They might confidently output a response that is physically impossible with no awareness of the mistake. In the benchmark, models frequently gave answers that looked plausible but were fundamentally incorrect, a well-known issue of AI “hallucination” in scientific contexts.
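By way of illustration, and not as part of PhysUniBench's own methodology, the sketch below automates the most basic of those sanity checks: tracking SI base-unit exponents to confirm that a candidate formula for a pendulum's period at least comes out in seconds.

```python
# A minimal dimensional-analysis check: quantities are tracked as exponents of
# SI base units, so an answer with the wrong dimensions is flagged immediately.
from fractions import Fraction

def combine(a, b, sign=1):
    """Multiply (sign=+1) or divide (sign=-1) two dimension dictionaries."""
    out = dict(a)
    for unit, exp in b.items():
        out[unit] = out.get(unit, 0) + sign * exp
        if out[unit] == 0:
            del out[unit]
    return out

def power(a, p):
    """Raise a dimension dictionary to a (possibly fractional) power."""
    return {unit: exp * Fraction(p) for unit, exp in a.items()}

length = {"m": 1}               # pendulum length L, in metres
gravity = {"m": 1, "s": -2}     # g, in metres per second squared
expected = {"s": 1}             # a period must come out in seconds

# Correct answer T = 2*pi*sqrt(L/g) and a plausible-sounding wrong one.
candidate = power(combine(length, gravity, sign=-1), Fraction(1, 2))
wrong = power(combine(gravity, length, sign=-1), Fraction(1, 2))

print("T = 2*pi*sqrt(L/g):", "dimensions OK" if candidate == expected else "dimension error")
print("T = 2*pi*sqrt(g/L):", "dimensions OK" if wrong == expected else "dimension error")
```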
All these factors illustrate why achieving true AI proficiency in fundamental science is such a complex goal. It’s not enough to ingest lots of textbooks; AI must be able to reason through novel problems, much like a scientist or student would. As one analysis of a precursor model noted, even when an AI gets to a correct numerical answer, it might have arrived there by flawed logic, raising concerns about whether it truly understands the problem.
Next steps: From benchmarking to better models
The PhysUniBench study paints a sobering but clear picture of the current state: today’s AI would flunk a university-level physics exam. Yet the tone of the researchers (and the broader field) remains cautiously optimistic. By identifying these failures in a rigorous way, benchmarks like PhysUniBench can guide researchers in building better models. “By providing a broad and rigorous assessment tool, PhysUniBench aims to drive progress in AI for Science”, the authors write, encouraging the development of models with stronger physics reasoning and problem-solving skills.

In practical terms, future models may pair large-scale language pre-training with symbolic solvers or physics-specific fine-tuning. These strategies are already yielding progress in systems like Google’s Minerva, which has demonstrated improved quantitative reasoning by training on step-by-step solutions. The hope is that with targeted benchmarks and new techniques, future AI systems will steadily increase their scientific aptitude.
At the same time, PhysUniBench also highlights how difficult it is to measure that aptitude accurately. Given the likelihood that some benchmark questions overlap with models’ training data, current performance levels may partially reflect memorization rather than true reasoning. As a result, there’s growing interest in building cleaner, more controlled evaluation datasets composed of novel, unpublished, or adversarial problems to better isolate generalization and scientific understanding. Such efforts will be essential if benchmarks are to remain not just diagnostic, but genuinely predictive of real-world scientific capability.
Scientific understanding remains a hard problem
The goal isn’t just to have AI pass a physics test for bragging rights. AI that truly understands science could become a powerful partner for researchers and educators. However, to get there, the AI community must grapple with the very challenges highlighted here: combining language with logic, perception with principles, and computation with conceptual insight.
The PhysUniBench results are a reality check against hype, but also a roadmap. They remind us that real intelligence in science requires more than crunching data; it requires reasoning through the laws of nature. It means AI must move beyond pattern matching and toward a deeper alignment with the structure of scientific reasoning itself, which is not just a challenge, but an invitation to meet a new standard. PhysUniBench shows that AI has begun the coursework, but graduation is still a long way off.
This article was created with the assistance of artificial intelligence and thoroughly edited by FirstPrinciples staff and scientific advisors.