Researchers at Together AI and Agentica have released DeepCoder-14B, a new coding model that delivers impressive performance comparable to leading proprietary models such as OpenAI's o3-mini.
The model builds on DeepSeek-R1 and offers more flexibility for integrating high-performance code generation and reasoning capabilities into real-world applications. Importantly, the teams have fully open-sourced the model, its training data, code, logs and system optimizations, which can help researchers improve on their work and accelerate progress.
The research team's experiments show that DeepCoder-14B performs strongly across several challenging coding benchmarks, including LiveCodeBench (LCB), Codeforces and HumanEval+.
"Our model demonstrates strong performance across all coding benchmarks ... comparable to the performance of o3-mini (low) and o1," the researchers write in a blog post describing the model.
Interestingly, despite being trained primarily on coding tasks, the model also shows improved mathematical reasoning, scoring 73.8% on the AIME 2024 benchmark, a 4.1% improvement over its base model (DeepSeek-R1-Distill-Qwen-14B). This suggests that the reasoning skills developed through RL on code generalize effectively to other domains.
Most strikingly, the model achieves this level of performance with only 14 billion parameters, which makes DeepCoder significantly smaller and potentially more efficient to run than many frontier models.
While developing the model, the researchers tackled some of the key challenges in training coding models with reinforcement learning (RL).
The first challenge was curating the training data. Reinforcement learning requires reliable reward signals indicating whether the model's output is correct. As the researchers point out, "unlike math, where abundant high-quality data is readily available on the Internet, the coding domain suffers from a relative scarcity of such data."
To address this problem, the DeepCoder team implemented a strict pipeline that gathers examples from different datasets and filters them for validity, complexity and duplication. This process yielded 24,000 high-quality problems, providing a solid foundation for effective RL training. As an illustration, a curation pass of this kind might look roughly like the sketch below; the field names and thresholds are assumptions made for the example, not the team's actual criteria.
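```python
import hashlib

def curate_problems(raw_problems, min_tests=5):
    """Illustrative curation pass: keep only problems that are valid,
    non-trivial and not duplicates. Field names are hypothetical."""
    seen_hashes = set()
    curated = []
    for problem in raw_problems:
        # Validity: require executable unit tests and a reference solution.
        if not problem.get("tests") or not problem.get("reference_solution"):
            continue
        # Complexity: drop trivial problems with too few test cases
        # (a stand-in for whatever difficulty criterion is actually used).
        if len(problem["tests"]) < min_tests:
            continue
        # Deduplication: hash the normalized problem statement.
        digest = hashlib.sha256(
            problem["statement"].strip().lower().encode()
        ).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        curated.append(problem)
    return curated
```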
The team has also developed a simple reward function that only provides a positive signal if the generated code exists all the unit tests for the problem within a certain period. In combination with the high -quality training examples, this result -oriented reward system prevents the model from learning learning tricks such as the pressure of answers for public tests for public tests or optimization for simple edge cases without solving the core problem.
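```python
import subprocess
import tempfile

def code_reward(generated_code: str, unit_tests: str, timeout_s: float = 10.0) -> float:
    """Binary, outcome-based reward: 1.0 only if the candidate solution
    passes every unit test within the time limit, otherwise 0.0.
    In practice this would run inside a proper sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Concatenate the candidate solution with the test code.
        f.write(generated_code + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        # Exceeding the time limit counts as a failure, no partial credit.
        return 0.0
```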
The model's core training algorithm is based on Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that proved very successful in DeepSeek-R1. However, the team made several modifications to the algorithm to make it more stable and to keep the model improving as training extends over a longer period. At the heart of GRPO is the idea of scoring each sampled response relative to the other responses generated for the same prompt, instead of training a separate value network. The sketch below shows only that group-relative advantage step, assuming a simple mean/standard-deviation normalization; it omits the clipped policy-gradient loss and the stability tweaks the team describes.
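```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each response's reward against
    the other responses sampled for the same prompt, so no critic is needed."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Example: 8 responses sampled for one coding problem; three passed all
# unit tests (reward 1.0), the rest failed (reward 0.0).
group_rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
print(grpo_advantages(group_rewards))  # passing samples get positive advantages
```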
Finally, the team extended the model's context window iteratively, first training it on shorter reasoning sequences and gradually increasing the length. They also developed a filtering method to avoid penalizing the model when it exceeded the context limits while solving a hard prompt.
The researchers explain the core idea: "To preserve long-context reasoning while enabling efficient training, we incorporated overlong filtering ... This technique masks out truncated sequences during training so that models aren't penalized for generating thoughtful but lengthy outputs that exceed the current context limit." In practice, this amounts to zeroing out the training weight of responses that were cut off at the current context limit, roughly as in the sketch below; the data layout and field names are illustrative assumptions, not the team's code.
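```python
def overlong_loss_mask(sequences, max_len):
    """Overlong filtering, sketched: sequences that hit the current context
    limit without finishing get a zero weight, so the model is not punished
    for long but otherwise sound reasoning."""
    masks = []
    for seq in sequences:
        truncated = len(seq["tokens"]) >= max_len and not seq["finished"]
        masks.append(0.0 if truncated else 1.0)
    return masks
```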
The training was gradually scaled from a 16K to a 32K context window, and the resulting model could also solve problems that required up to 64K tokens.
Training large models with RL, especially on tasks that require long generated sequences such as coding or complex reasoning, is slow and compute-intensive. A major bottleneck is the sampling step, in which the model generates potentially thousands of tokens per example in the batch. Variations in response length mean that some responses finish much later than others, leaving GPUs idle and slowing down the entire training loop.
To accelerate this, the team developed verl-pipeline, an optimized extension of the open-source verl library for reinforcement learning from human feedback (RLHF). The key innovation, which they call "one-off pipelining", rearranges response sampling and model updates to reduce bottlenecks and accelerator idle time.
Their experiments showed that one-off pipelining provides up to a 2x speedup for coding RL tasks compared to baseline implementations. This optimization was crucial for training DeepCoder within a reasonable timeframe (2.5 weeks on 32 H100s) and is now open-sourced as part of verl-pipeline for the community. Conceptually, the idea is to overlap generation of the next batch with the training step on the current one; the toy sketch below illustrates that overlap with a background thread and is purely an illustration of the concept, not the verl-pipeline implementation, which runs across GPU workers and also pipelines reward calculation.
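```python
from concurrent.futures import ThreadPoolExecutor

def rl_training_loop(sample_batch, train_step, num_iterations):
    """Toy sketch of overlapping sampling and training: while the trainer
    updates the policy on batch t, the sampler is already generating
    responses for batch t+1, reducing accelerator idle time."""
    with ThreadPoolExecutor(max_workers=1) as sampler:
        future = sampler.submit(sample_batch)          # start sampling batch 0
        for step in range(num_iterations):
            batch = future.result()                    # wait for the current batch
            if step + 1 < num_iterations:
                future = sampler.submit(sample_batch)  # prefetch the next batch
            train_step(batch)                          # train while sampling runs
```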
The researchers have made all artifacts for training and running DeepCoder-14B available on GitHub and Hugging Face under a permissive license.
"By fully sharing our dataset, code and training recipe, we empower the community to reproduce our work and make RL training accessible to everyone," the researchers write.
DeepCoder-14B illustrates a broader, accelerating trend in the AI landscape: the rise of highly capable yet efficient and openly accessible models.
For the enterprise world, this shift means more options and greater accessibility of advanced models. State-of-the-art performance is no longer the sole domain of hyperscalers or those willing to pay premium API fees. Models such as DeepCoder can empower organizations of all sizes to use sophisticated code generation and reasoning, adapt solutions to their specific requirements and deploy them securely within their own environments.
This trend can lower the barrier to entry for AI adoption and foster a more competitive, innovative ecosystem in which progress is driven by open-source collaboration.