
How test-time scaling in small language models unlocks hidden reasoning skills (and enables them to surpass LLMs)




Very small language models (SLMs) can outperform leading large language models (LLMs) in reasoning tasks, according to a new study from Shanghai AI Laboratory. The authors show that, with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B LLM on complicated math benchmarks.

The ability to deploy SLMs on complex reasoning tasks can be very useful as companies look for new ways to use these models in different environments and applications.

Test-time scaling, explained

Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS," which means they are trained to "think" slowly by generating a long string of chain-of-thought (CoT) tokens.

An alternative approach is "external TTS," in which (as the name implies) model performance is improved with outside help. External TTS is suitable for repurposing existing models for reasoning tasks without further fine-tuning. An external TTS setup usually consists of a "policy model," the main LLM that generates the answer, and a process reward model (PRM) that evaluates the policy model's answers. These two components are coupled together through a sampling or search method.

The simplest setup is "best-of-N," in which the policy model generates multiple answers and the PRM selects one or more of the best ones to compose the final answer. More advanced external TTS methods use search. In "beam search," the model breaks the answer down into multiple steps.

For each step, it samples multiple answers and runs them through the PRM. It then chooses one or more suitable candidates and generates the next step of the answer. And in "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
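As a rough illustration, the best-of-N and beam search loops can be sketched as below. The `policy_extend` and `prm_score` functions are toy placeholders standing in for real model calls (their names and the scoring heuristic are assumptions for this sketch, not the paper's implementation); only the control flow mirrors the methods described above.

```python
def policy_extend(partial: str, seed: int) -> str:
    """Toy policy model: append one more reasoning 'step' to a partial answer.
    A real setup would sample a continuation from the policy LLM here."""
    return f"{partial}|step{seed}"

def prm_score(answer: str) -> float:
    """Toy PRM: score a (partial) answer. A real PRM is a learned model;
    this placeholder just sums character codes to stay deterministic."""
    return sum(ord(c) for c in answer)

def best_of_n(prompt: str, n: int = 4) -> str:
    """Best-of-N: sample N full answers from the policy, keep the PRM's top pick."""
    candidates = [policy_extend(prompt, seed) for seed in range(n)]
    return max(candidates, key=prm_score)

def beam_search(prompt: str, beam_width: int = 2, branch: int = 3, steps: int = 3) -> str:
    """Beam search: at each step, expand every beam `branch` ways, score the
    partial answers with the PRM, and keep only the top `beam_width` of them."""
    beams = [prompt]
    for _ in range(steps):
        expanded = [policy_extend(b, s) for b in beams for s in range(branch)]
        expanded.sort(key=prm_score, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]
```

The key difference is where the PRM intervenes: best-of-N scores only complete answers, while beam search scores partial answers at every step, which is why it helps weaker policies more.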

Different test-time scaling methods (source: arXiv)

What is the right scaling strategy?

Choosing the right TTS strategy depends on several factors. The study's authors carried out a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.

Their findings show that efficiency largely depends on the policy and PRM models. For example, for small policy models, search-based methods outperform best-of-N. For large policy models, however, best-of-N is more effective because the models have better reasoning capabilities and don't need a reward model to verify every step of their reasoning.

Their results also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better on easy problems, while beam search works better on harder ones. For policy models between 7B and 32B parameters, diverse tree search performs well on easy and medium problems, and beam search works best for hard problems. For large policy models (72B parameters and more), best-of-N is the optimal method across all difficulty levels.
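These findings condense into a small lookup rule. The helper below is a sketch of the guidance reported above, not code from the paper; the exact parameter-count boundaries and the difficulty labels are assumptions for illustration.

```python
def compute_optimal_strategy(policy_params_b: float, difficulty: str) -> str:
    """Pick an external TTS method from the policy size (in billions of
    parameters) and problem difficulty ('easy', 'medium' or 'hard'),
    following the study's reported findings."""
    if policy_params_b >= 72:
        # Strong policies reason well on their own; stepwise verification adds little.
        return "best-of-N"
    if policy_params_b < 7:
        # Small policies: cheap sampling suffices for easy problems,
        # stepwise search pays off on harder ones.
        return "best-of-N" if difficulty == "easy" else "beam search"
    # Mid-sized (7B-32B) policies: diverse tree search for easy/medium, beam search for hard.
    return "beam search" if difficulty == "hard" else "diverse verifier tree search"
```

For example, `compute_optimal_strategy(3, "hard")` selects beam search, while the same problem with a 72B policy would use plain best-of-N.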

Why small models can beat large models

SLMs outperform large models on MATH and AIME-24 (source: arXiv)

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM and the problem difficulty to make the best use of the compute budget when solving reasoning problems.

For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complicated math benchmarks. This shows that an SLM can outperform a model that is 135 times larger when using the compute-optimal TTS strategy.

In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B-parameter distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.

When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models while using 100-1000X fewer FLOPS.

The researchers' results show that compute-optimal TTS significantly improves the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually decreases.

"This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model," the researchers write. "Specifically, for models with weak reasoning abilities, scaling test-time compute leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited."

The study validates that SLMs can perform better than much larger models when using compute-optimal scaling methods. While this study focuses on math benchmarks, the researchers plan to expand their work to other reasoning tasks such as coding and chemistry.

