
Google’s new AI training method helps small models tackle complex reasoning tasks



Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn highly demanding multi-step reasoning tasks. Supervised reinforcement learning (SRL) reframes problem solving as a sequence of logical “actions” and provides rich learning signals during training.

This approach allows smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on mathematical reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.

SRL is a versatile training framework that can elevate smaller, more cost-effective models to advanced reasoning skills.

The limits of current LLM reasoning training

Recent advances in training large language models (LLMs) to think have been driven largely by reinforcement learning with verifiable rewards (RLVR), a method in which a model is rewarded based on the correctness of its final answer. Through repeated attempts to solve problems and receiving feedback on the end result, the model gradually learns effective problem-solving strategies.

However, the success of this outcome-based approach depends on the model’s ability to find a correct solution within a limited number of attempts, or “rollouts.” Since each rollout is computationally expensive, models cannot keep trying indefinitely. The method hits its limits when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.
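This all-or-nothing reward can be illustrated with a small sketch (the function names and logic are ours, for illustration, not the paper’s): when no rollout within the budget produces the correct final answer, the average reward, and with it the learning signal, collapses to zero.

```python
# Illustrative sketch of RLVR's outcome-based reward: a rollout earns
# credit only if its final answer is correct. Names are hypothetical.

def rlvr_reward(final_answer: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def average_reward(rollouts: list[str], reference: str) -> float:
    """If no rollout in the budget succeeds, the average reward -- and
    hence the gradient signal -- is zero, however close the attempts were."""
    rewards = [rlvr_reward(r, reference) for r in rollouts]
    return sum(rewards) / len(rewards)
```

For a hard problem where every sampled rollout ends in a wrong answer, `average_reward` returns 0.0 and the model gets no credit for partially correct reasoning.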

This creates a critical learning bottleneck. In many multi-step reasoning problems, a model may solve several steps correctly but be derailed by a single error, leading to a wrong final answer. Under RLVR, all of that effort earns no reward, and the model learns nothing from its partially correct work. It is an all-or-nothing approach that provides no fine-grained feedback and only sparse rewards.

An alternative method is supervised fine-tuning (SFT), in which the model learns from examples that contain the entire reasoning process presented by experts. While SFT can teach reasoning skills, it often leads to overfitting (the model simply learns to mimic the trajectories in the training data, rather than learning to generalize to problems beyond the examples it has seen). This problem is compounded by the fact that high-quality, human-generated training data is both scarce and expensive to produce.

As the paper notes, these limitations leave "a critical gap for training small open source models to effectively learn difficult problems."

How supervised reinforcement learning works

SRL introduces a framework that reformulates problem solving as "sequential decision making," striking a balance between pure outcome-based RL and pure imitation learning. Rather than optimizing only for the final answer or forcing the model to mimic an expert’s entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to act like an expert while developing its own internal reasoning style.

Within SRL, expert demonstrations are broken down into a series of concrete intermediate actions, each of which represents a meaningful step. In a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.
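One hypothetical way such a decomposition could look in code: each training example pairs the problem statement plus the expert actions taken so far with the next expert action to predict. The function and field names below are our assumptions for illustration, not the paper’s implementation.

```python
# Sketch: split one expert trajectory into per-step training examples.
# Each example asks the model to predict the next action given the
# problem and the actions already taken.

def make_step_examples(problem: str, expert_actions: list[str]) -> list[dict]:
    examples = []
    for i, action in enumerate(expert_actions):
        examples.append({
            # Context: problem statement plus all prior expert actions.
            "context": problem + "\n" + "\n".join(expert_actions[:i]),
            # Target: the single next action the model should produce.
            "target_action": action,
        })
    return examples
```

A two-step algebra demonstration like solving `2x + 3 = 7` would thus yield two examples: one predicting `2x = 4` from the bare problem, and one predicting `x = 2` given the first manipulation.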

According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL sits in the middle: it captures the structured flexibility of problem solving in the real world, where there are multiple valid strategies but also clear ideas about what 'good thinking' looks like at every step," Hsu told VentureBeat. "This makes SRL suitable for areas such as data science automation or perhaps supply chain optimization – tasks that reward sound intermediate reasoning rather than mere final answers."

During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in special tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model’s predicted action and the expert’s action. This stepwise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even when its overall solution is imperfect. This solves the sparse-reward problem faced by RLVR.
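A minimal sketch of this dense, per-step reward, using Python’s `difflib` string similarity as a stand-in for whatever similarity metric the paper actually employs:

```python
import difflib

# Illustrative dense reward: score each predicted action by its textual
# similarity to the expert's action at the same step. difflib is a
# stand-in metric, not necessarily the one used in the paper.

def step_reward(predicted: str, expert: str) -> float:
    """Similarity in [0, 1] between a predicted and an expert action."""
    return difflib.SequenceMatcher(None, predicted, expert).ratio()

def trajectory_rewards(predicted: list[str], expert: list[str]) -> list[float]:
    """Per-step feedback: partially correct work still earns credit,
    unlike an all-or-nothing outcome reward."""
    return [step_reward(p, e) for p, e in zip(predicted, expert)]
```

A trajectory whose first step matches the expert exactly but whose second step goes wrong still receives full credit for the first step and partial credit for the second, which is precisely the signal RLVR discards.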

SRL in action

The researchers’ experiments show that SRL significantly outperforms strong baselines on both demanding mathematical reasoning and agentic software engineering benchmarks. They also found that SRL encourages more flexible and sophisticated reasoning patterns in models, such as nested planning and self-verification, which improve solution quality without merely lengthening outputs.

For business leaders, performance improvements are only valuable if they do not come with uncontrolled costs. Hsu clarifies that SRL-trained models are more efficient in their reasoning. "The benefits come from better quality and structure of the reasoning, not from verbosity," he said. "In terms of efficiency, the models trained by SRL are at about the same level as the base model when it comes to token usage. Although SRL is not designed to reduce inference costs, it achieves stronger reasoning performance without increasing them."

For the math experiments, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and with RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level mathematical benchmarks. The SRL-trained model achieved a substantial 3.0% average performance improvement over the other methods.

The team then extended SRL to agentic software engineering, an area critical to enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was compared against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a task completion rate of 14.8%, a 74% relative improvement over the SFT-based model. This demonstrates SRL’s ability to train more competent AI agents for complex, real-world programming tasks.

A new standard for high-stakes AI?

The paper’s strongest results came from combining the methods: first, SRL was used to teach foundational reasoning, and then RLVR was used to refine that skill. When the researchers trained with SRL first and applied RLVR afterward in their experiments, they observed an average gain of 3.7%, demonstrating a powerful curriculum-learning strategy.
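The two-stage curriculum can be sketched as a simple pipeline. The training functions below are placeholders of our own, only the ordering (SRL first, RLVR second) reflects the paper.

```python
# Sketch of the SRL-then-RLVR curriculum. The training functions are
# hypothetical stubs that just record which stage ran, to show the order.

def train_srl(model: dict) -> dict:
    """Stage 1: dense, step-level supervision on expert actions."""
    return dict(model, stages=model.get("stages", []) + ["srl"])

def train_rlvr(model: dict) -> dict:
    """Stage 2: outcome-based refinement with verifiable rewards."""
    return dict(model, stages=model.get("stages", []) + ["rlvr"])

def curriculum(model: dict) -> dict:
    """SRL-first curriculum: teach step-by-step reasoning, then refine it."""
    return train_rlvr(train_srl(model))
```

The key design choice is the ordering: SRL gives the model enough step-level competence that the later, sparser RLVR phase has successful rollouts to learn from.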

This raises the question of whether this could be a new blueprint for building specialized AI.

"We view SRL as a strong foundation," Hsu said. "In a sense, SRL provides a curriculum that teaches models step by step how to think and act before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL phase, but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications."

Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. He is nonetheless optimistic about the path forward. "While high-quality expert trajectories remain important," he concluded, "we believe the next big leap will come from automating their generation and filtering – leveraging strong teacher models or even self-improving student models to produce new data."
