Researchers at Together AI and Agentica have released DeepCoder-14B, a new coding model that delivers impressive performance comparable to leading proprietary models such as OpenAI's o3-mini.
The model, built on top of DeepSeek-R1, offers greater flexibility for integrating high-performance code generation and reasoning capabilities into real-world applications. Importantly, the teams have fully open-sourced the model, its training data, code, logs, and system optimizations, which can help researchers improve their work and accelerate progress.
Competitive coding capabilities in a smaller package
The research team's experiments show that DeepCoder-14B performs strongly across several demanding coding benchmarks, including LiveCodeBench (LCB), Codeforces, and HumanEval+.
“Our model demonstrates strong performance across all coding benchmarks… comparable to the performance of o3-mini (low) and o1,” the researchers write in a blog post describing the model.
Interestingly, although the model was trained primarily on coding tasks, it shows improved mathematical reasoning, scoring 73.8% on the AIME 2024 benchmark, a 4.1% improvement over its base model (DeepSeek-R1-Distill-Qwen-14B). This suggests that reasoning skills developed through RL on code can generalize effectively to other domains.

Most striking is that the model achieves this level of performance with only 14 billion parameters. This makes DeepCoder significantly smaller and potentially more efficient to run than many frontier models.
Innovations driving DeepCoder's performance
While developing the model, the researchers solved some of the key challenges in training coding models with reinforcement learning (RL).
The first challenge was curating training data. Reinforcement learning requires reliable reward signals indicating that the model's output is correct. As the researchers note, “Unlike math, where abundant high-quality, verifiable data is readily available, the coding domain suffers from a relative scarcity of such data.”
To address this problem, the DeepCoder team implemented a strict pipeline that gathers examples from different datasets and filters them for validity, complexity, and duplication. This process yielded 24,000 high-quality problems, providing a solid foundation for RL training.
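The shape of such a curation pass can be sketched in a few lines. This is a hypothetical illustration, not the team's actual code; the field names (`statement`, `tests`) and the minimum-test threshold are assumptions:

```python
def curate(problems, min_tests=2):
    """Filter raw coding problems for validity and duplication.

    Hypothetical sketch: keep only problems that ship runnable unit
    tests (validity) and whose statement has not been seen before
    (deduplication). Complexity filtering would slot in here too.
    """
    seen = set()
    kept = []
    for p in problems:
        # Validity: the problem must come with enough unit tests to verify.
        if len(p.get("tests", [])) < min_tests:
            continue
        # Duplication: drop problems whose normalized statement repeats.
        key = p["statement"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept

raw = [
    {"statement": "Sum two ints", "tests": ["t1", "t2"]},
    {"statement": "sum two ints", "tests": ["t1", "t2"]},  # duplicate
    {"statement": "No tests here", "tests": []},           # unverifiable
]
print(len(curate(raw)))  # -> 1
```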
The team also designed a straightforward reward function that provides a positive signal only if the generated code passes all unit tests sampled for a problem within a time limit. Combined with the high-quality training examples, this outcome-focused reward system prevents the model from learning tricks such as printing memorized answers to public tests or optimizing for simple edge cases without solving the underlying problem.
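A minimal sketch of this all-or-nothing reward, assuming assert-style unit tests run in a subprocess (an illustration of the scheme described above, not the authors' implementation):

```python
import os
import subprocess
import sys
import tempfile

def sparse_reward(code: str, tests: list[str], timeout_s: float = 5.0) -> float:
    """Return 1.0 only if `code` passes every sampled unit test within
    the time limit, otherwise 0.0. No partial credit, which removes the
    incentive to hard-code public test outputs."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Run the candidate solution followed by all sampled assertions.
        f.write(code + "\n" + "\n".join(tests) + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # exceeding the time limit also yields zero reward
    finally:
        os.unlink(path)
```

Because the reward is binary over *all* sampled tests, a solution that games one public test but fails a held-out one still earns nothing.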
The model's core training algorithm is based on Group Relative Policy Optimization (GRPO), the reinforcement learning algorithm that proved very successful in DeepSeek-R1. However, the team made several modifications to the algorithm to make it more stable and to allow the model to keep improving as training extends over longer periods.
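The core idea of GRPO is to replace a learned value function with a per-prompt baseline: sample a group of responses to the same prompt and normalize each response's reward against the group. A minimal sketch of that advantage computation (not the full GRPO loss or the team's modifications):

```python
import statistics

def grpo_advantages(group_rewards):
    """Compute group-relative advantages: A_i = (r_i - mean) / std.

    Each sampled response is scored relative to the other responses
    for the same prompt, so no separate value network is needed.
    """
    mu = statistics.fmean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard: all-equal group
    return [(r - mu) / sigma for r in group_rewards]

# Four sampled solutions for one prompt; two passed all tests, two failed.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [1.0, -1.0, 1.0, -1.0]
```

With the sparse pass/fail reward, passing solutions get positive advantage and failing ones negative, which is exactly the signal the policy update needs.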

Finally, the team extended the model's context window iteratively, first training it on shorter reasoning sequences and gradually increasing the length. They also developed a filtering method to avoid penalizing the model when it produced reasoning chains that exceeded the context limit while solving a hard problem.

The researchers explain the core idea: “To preserve long-context reasoning while enabling efficient training, we incorporated overlong filtering…”
Training was scaled gradually from a 16K to a 32K context window, and the resulting model could also solve problems requiring up to 64K tokens.
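Overlong filtering can be expressed as a simple loss mask: rollouts that hit the context limit are excluded from the update rather than punished. A sketch under the assumption of a per-sequence mask (the real implementation operates inside the RL loss):

```python
def overlong_loss_mask(seq_lengths, max_len):
    """Mask out truncated rollouts instead of penalizing them.

    A sequence that reaches the context limit is dropped from the loss
    (False), so the model is never punished merely for reasoning 'too
    long' on a hard problem. Shorter sequences are kept (True).
    """
    return [length < max_len for length in seq_lengths]

# At a 32K training window, the truncated rollout is excluded, not penalized.
print(overlong_loss_mask([512, 32768, 20000], max_len=32768))
# -> [True, False, True]
```

This is why the model can later generalize to 64K-token problems: it was never taught that long reasoning chains are bad, only that incorrect ones are.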
Optimizing long-context RL training
Training large models with RL, especially on tasks requiring long generated sequences such as coding or complex reasoning, is computationally intensive and slow. The main bottleneck is the “sampling” step, where the model generates potentially thousands of tokens per example in the batch. Variation in response length means some responses finish much later than others, leaving GPUs idle and slowing down the entire training loop.
To accelerate this, the team developed verl-pipeline, an optimized extension of the open-source verl library for reinforcement learning from human feedback (RLHF). The key innovation, which they call “one-off pipelining,” overlaps response sampling with model updates to reduce bottlenecks and accelerator idle time.
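The scheduling idea can be illustrated with a producer-consumer sketch: while the trainer updates on batch *k*, the sampler is already generating batch *k+1*. This toy version (threads and a size-one queue standing in for the real GPU rollout and update stages; `sample` and `train` are placeholder callables) is an assumption-laden illustration of the concept, not the verl-pipeline code:

```python
import queue
import threading

def pipelined_rl(num_batches, sample, train):
    """Overlap sampling and training, one batch ahead ('one-off').

    The producer thread generates rollouts for the next batch while the
    main thread runs the model update on the current batch, so neither
    stage sits fully idle waiting for the other.
    """
    q = queue.Queue(maxsize=1)  # at most one batch in flight ahead

    def producer():
        for i in range(num_batches):
            q.put(sample(i))  # generate rollouts for batch i
        q.put(None)           # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()

    updates = []
    while (batch := q.get()) is not None:
        updates.append(train(batch))  # overlaps the next sample(i) call
    return updates

print(pipelined_rl(3, sample=lambda i: i, train=lambda b: b * 2))
# -> [0, 2, 4]
```

The trade-off is that sampling runs one step behind the latest weights, a mild off-policy staleness the training algorithm has to tolerate in exchange for the throughput gain.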

Their experiments showed that one-off pipelining delivered up to a 2x speedup for coding RL tasks compared to baseline implementations. This optimization was crucial to training DeepCoder in a reasonable timeframe (2.5 weeks on 32 H100s) and is now open-sourced as part of verl-pipeline for the community to use and build on.
Impact
The researchers have made all artifacts for training and running DeepCoder-14B available on GitHub and Hugging Face under a permissive license.
“By fully sharing our dataset, code, and training recipe, we empower the community to reproduce our work and make RL training accessible to all,” the researchers write.
DeepCoder-14B vividly illustrates a broader, accelerating trend in the AI landscape: the rise of highly capable yet efficient and openly accessible models.
For the enterprise world, this shift means more options and greater accessibility of advanced models. Cutting-edge performance is no longer the exclusive domain of hyperscalers or those willing to pay premium API fees. Models like DeepCoder can empower organizations of all sizes to use sophisticated code generation and reasoning, adapt solutions to their specific needs, and deploy them securely within their own environments.
This trend can lower the barrier to AI adoption and foster a more competitive and innovative ecosystem where progress is driven by open-source collaboration.