
Lab Notes: Teaching models to reason a little better

  • Writer: FirstPrinciples
  • 12 minutes ago
  • 6 min read

A recap of recent work at FirstPrinciples, from fine-tuning and RLVR to ensembles and 122B-scale models. We’re beginning to clarify what improves scientific reasoning, and these are our notes.


One of the questions we keep returning to as we build the AI Physicist is how models move from producing answers to conducting real scientific reasoning.

New base models appear every few months, often arriving with impressive raw capability. In practice, however, much of the progress in reasoning comes from how these models are adapted and shaped through fine-tuning. This process, using targeted data and training signals, is now a central part of how modern models improve.


Much of our earlier work at FirstPrinciples focused on smaller models. From an infrastructure standpoint, they are easier to iterate on and serve as useful testbeds for early ideas around datasets and reasoning traces. However, we ran into clear limitations. Fine-tuning smaller models often made it difficult to update knowledge effectively, and lower precision constrained learning, which in practice led to reduced accuracy, higher rates of hallucination, and more brittle models that struggled to generalize. In many cases, our fine-tuning experiments would break behaviours that previously held.



At the same time, some capabilities only begin to emerge at larger scales. To properly evaluate our approaches, we needed to move to models where those behaviours could actually appear. To make that possible, we recently integrated DeepSpeed and implemented parameter sharding across multiple GPUs. Practically speaking, this just means we can now fine-tune models that previously would not fit within the limits of our available memory.
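As a rough illustration, parameter sharding of this kind is configured through DeepSpeed's ZeRO optimizer settings. The values below are a minimal sketch of the relevant knobs, not our actual training configuration:

```python
# Minimal DeepSpeed ZeRO Stage 3 configuration sketch (illustrative values).
# Stage 3 shards optimizer states, gradients, AND model parameters across
# data-parallel GPUs, which is what allows a model that exceeds single-GPU
# memory to be fine-tuned at all.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},               # mixed precision to reduce memory
    "zero_optimization": {
        "stage": 3,                          # shard params + grads + optimizer states
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload for headroom
        "overlap_comm": True,                # overlap all-gathers with compute
    },
}
```

This dictionary would be passed to `deepspeed.initialize` alongside the model; the batch sizes and offload choices are placeholders that depend heavily on the hardware available.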


The goal here isn’t to push the boundaries of distributed training. Rather, it is to give the team the flexibility to try more ideas across a broader range of models. Once the pipeline was in place, we started running experiments.


Early experiments and results

Our initial tests focused on supervised fine-tuning using synthetic datasets generated through our internal pipeline. This allowed us to iterate quickly on different training setups and reasoning formats. We have only just begun to explore this path, as this pipeline gives us great flexibility in generating varied (and more targeted) datasets.


More recently, we began experimenting with the MegaScience dataset, which in our case led to stronger results on UG Physics, a benchmark of undergraduate-level physics reasoning. So far, the best performance has come from a 30B parameter model after fine-tuning.


One observation was how modest the gap turned out to be between the 30B model and our previous 8B model. The larger model did perform better overall, but not by as much as we initially expected. More importantly, though, the same training approach appears to produce consistent improvements across model sizes, suggesting that the underlying method is robust rather than tightly coupled to scale. As we extend these experiments to larger models, we expect to get a clearer signal on how these gains evolve.


Tradeoffs and emerging behaviour

What we’ve seen in our own experiments, especially when fine-tuning on more focused datasets, is that tradeoffs begin to emerge. Improvements in one area often come at the expense of others, and we don’t yet have full visibility into what is being lost or degraded in the process. We’re beginning to build better evaluation frameworks to surface these effects, but for now, the results need to be interpreted with that uncertainty in mind.


We have also been exploring reinforcement learning with verifiable rewards (RLVR) as an alternative to standard fine-tuning approaches. In these experiments, models are not given explicit reasoning traces during training. Instead, they are trained on question and ground-truth answer pairs, and rewarded when they arrive at the correct result.
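In its simplest form, the verifiable reward is just an automatic check on the final answer. A minimal sketch, where the string normalization is illustrative (real checkers for physics answers need to handle units, symbolic equivalence, and numeric tolerance):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward from an automatic checker: 1.0 if the model's final
    answer matches the ground truth, else 0.0. No reasoning trace is
    required during training -- only the final result is verified."""
    def normalize(s: str) -> str:
        # Toy normalization: ignore case and whitespace differences.
        return s.strip().lower().replace(" ", "")
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0
```

Because the reward only inspects the final answer, the model is free to discover its own intermediate reasoning, which is where both the efficiency gains and the reward-gaming risks described below come from.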


In these experiments, we observed that models often discovered more efficient ways to reach the correct answer, using fewer tokens while still constructing useful intermediate steps when needed. At the same time, this efficiency comes with an important caveat: models can also develop surprisingly ingenious ways of gaming the reward signal rather than genuinely solving the task in the intended way.


Teaching models to check their work

We’re equally interested in whether fine-tuning can shape how models reason, not just whether they arrive at the correct answer. To explore this, we built a pipeline for generating synthetic datasets in which larger models produce structured reasoning traces that smaller models can learn from.



One early experiment focused on modifying those traces to include explicit self-verification steps. Instead of simply producing an answer, the training data encouraged the model to pause, review its reasoning, and check whether the result actually made sense. A Qwen-8B model trained on these traces began producing noticeably longer and more self-critical reasoning chains. In some cases, the model would actively question intermediate steps before continuing.
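As a rough sketch of the idea, a trace-modification step might look like the following, where the prompt wording and field names are hypothetical rather than our actual data format:

```python
def add_self_check(question: str, trace: str, answer: str) -> str:
    """Wrap an existing reasoning trace with an explicit self-verification
    step before the final answer, so the fine-tuned model learns to pause
    and review its own reasoning. Field labels here are illustrative."""
    return (
        f"Question: {question}\n"
        f"Reasoning: {trace}\n"
        "Check: Before answering, re-read the reasoning above. "
        "Do the intermediate steps actually support the result? "
        "If anything is inconsistent, revise it before committing.\n"
        f"Answer: {answer}"
    )
```

Applied across a synthetic dataset, a transformation like this is what encouraged the Qwen-8B model to produce longer, more self-critical chains.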


The benchmark results that followed were mixed. Some tasks improved slightly, while others showed a small drop in performance, but what stood out more was the shift in behaviour. Even where capability appeared unchanged, the fine-tuned models produced responses that were more structured and easier to follow. In a limited evaluation, an LLM acting as a judge consistently preferred these outputs, and we suspect human evaluators would as well, though this still needs to be tested.


This points to a distinction between what a model knows and how it expresses that knowledge. Improving reasoning style (e.g. clarity, structure) may not always translate directly into benchmark gains, but it could make these models more useful as components within larger systems, where one model generates or validates ideas and another communicates them.


Scaling results and system-level performance

We’ve also begun to test how systems of models perform together. One recent direction has been the development of an ensemble system, composed of eight open-source models alongside two of our fine-tuned variants (ranging from 8B to 30B). Questions are dynamically routed to the model best suited to the task.
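A minimal sketch of score-based routing of this kind, where the scoring function and model registry are illustrative placeholders rather than our production router:

```python
from typing import Callable, Dict

def route(question: str,
          models: Dict[str, Callable[[str], str]],
          score: Callable[[str, str], float]) -> str:
    """Pick the model with the highest predicted suitability score for
    this question, then return that model's answer. `score(name, question)`
    stands in for whatever learned or heuristic router is used."""
    best_name = max(models, key=lambda name: score(name, question))
    return models[best_name](question)

# Toy usage: a two-model registry with a trivial scoring rule.
registry = {
    "physics-ft": lambda q: "physics-answer",
    "general":    lambda q: "general-answer",
}
prefer_physics = lambda name, q: 1.0 if name == "physics-ft" else 0.0
```

The interesting engineering lives in the scoring function, which must predict suitability before any model has answered; the dispatch itself is simple.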


[Diagram: Input Question → Dynamic Router → 8x Open-Source Models / 2x Fine-Tuned Variants (8B-30B) → Output]

In this configuration, nicknamed the “Avengers”, the ensemble achieved strong results across several benchmarks. On MATH500, it reached 98%, matching or outperforming frontier models including GPT-5, Opus 4.1 (Thinking), and Grok 4. On GSM-Symbolic, performance improved from 95.67% to 97.8% after context engineering, surpassing GPT-4o and o1. More notably, under the same setup, the ensemble also proposed the conjecture that was recently described as a new result in theoretical physics by OpenAI.


Across these experiments, one consistent pattern has been that our fine-tuned models are frequently selected within the ensemble for physics tasks, particularly on UG Physics-style questions. This reinforces the role of domain-specific post-training, even within broader multi-model systems.


These results point toward a practical path forward. Systems composed of smaller, optimized models can achieve competitive performance with frontier models while also making it easier to identify where targeted post-training is most effective.


In parallel, we have begun scaling individual models into the 100B+ range. Our first Qwen-122B runs are now complete, with early evaluations showing strong performance on internal benchmarks. This includes our own quantum information benchmark, where the fine-tuned model achieved 93.2% compared to a deterministic baseline of 65%. The BS Benchmark, designed to measure resistance to nonsensical or leading claims, also showed promising results: our fine-tuned model outperformed several frontier systems, including GPT 5.4, Kimi K2.5, Grok 4, and Gemini 3.1, suggesting improved robustness in scientific contexts. Evaluation on UG Physics is ongoing, but results from smaller models suggest similar gains may carry over.


Alongside capability improvements, we continue to explore behavioural fine-tuning, particularly around steerability and dataset construction. This is increasingly important for domains where high-quality data is limited or where specific reasoning styles are required.


What comes next

With the infrastructure in place, the next step is straightforward: run more experiments. Our current goal is to automate large batches of training runs (around 100 experiments per quarter), each exploring different mixtures of training data, reasoning traces, and behavioural signals.


A central focus is the distinction between exposure and evaluation. Many base models have been trained on textbooks and scientific materials, so the challenge is less about introducing new knowledge and more about how that knowledge is elicited and assessed.


In our case, we’ve been working with textbook-style problems in areas like quantum information and quantum computing, transforming them into novel, parametric variants that models are unlikely to have encountered before. Alongside this, we are developing our own benchmarks to measure how models handle these problems, with a focus on behaviours that scientific reasoning depends on: checking assumptions, verifying intermediate steps, and revisiting earlier conclusions when something doesn’t quite add up.
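As a toy illustration of the parametric-variant idea (the template, physics problem, and helper name here are made up, not items from our benchmarks):

```python
import random

def parametric_variant(template: str, rng: random.Random) -> tuple:
    """Instantiate a templated textbook problem with fresh numbers,
    returning (question, ground-truth answer). Because the parameters
    are sampled, each instance is unlikely to appear verbatim in any
    pretraining corpus, yet the ground truth stays machine-checkable."""
    m = rng.randint(2, 9)        # mass in kg
    a = rng.randint(2, 9)        # acceleration in m/s^2
    question = template.format(m=m, a=a)
    answer = f"{m * a} N"        # Newton's second law: F = m * a
    return question, answer

# Toy usage with a made-up template:
q, ans = parametric_variant(
    "A {m} kg object accelerates at {a} m/s^2. What net force acts on it?",
    random.Random(0),
)
```

The same pattern extends to quantum information problems, where the sampled parameters might be gate angles or state amplitudes rather than masses and accelerations.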


It is still early work. But the training pipeline is now in place, and the experiments are starting to scale.

 
 