I don’t want to get a reputation for reactivity or hyperbolic statements... but what just happened in the AI world (also the real world) changed the development trajectory of humanity.
I promised myself I wouldn’t go big.
Can’t help it. The game legitimately changed on January 22, 2025 when DeepSeek dropped their paper, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
Not just because of their performance – which is wild.
Not even because of their cheap training costs vs the power of their model. That is mind-blowing, though. Talk about leverage.
No, the most industry-wobbling thing here is how DeepSeek achieved their feat.
We can organically GROW intelligence of the general reasoning variety. They clearly demonstrate it.
DeepSeek-R1-Zero was trained using RL directly on a base model without initial SFT, marking a significant departure from conventional methods. This approach allowed the model to explore chain-of-thought (CoT) reasoning and develop capabilities such as self-verification and reflection.
This suggests that reasoning abilities can be incentivized through RL, rather than relying on supervised data. Please read that again if it didn’t blow your mind.
We will get to the implications later.
The models exhibited a self-evolution process where they naturally increased their thinking time and developed complex reasoning behaviors through RL…. An “aha moment” was observed in an intermediate version of DeepSeek-R1-Zero, where the model learned to re-evaluate its initial approach, showcasing the potential of RL to produce unexpected and sophisticated problem-solving strategies.
This highlights the potential of RL to allow models to discover problem-solving strategies autonomously. To learn like a child and become a much, much smarter adult model.
DeepSeek-R1 incorporates cold-start data and multi-stage training to further enhance reasoning performance.
The model was fine-tuned with high-quality, long CoT examples before undergoing RL, which improved readability and performance. This indicates that a carefully designed training pipeline involving both SFT and RL can lead to models that are not only powerful but also more user-friendly.
DeepSeek’s research demonstrates that the reasoning patterns developed by larger models can be successfully distilled into smaller models. This was achieved by fine-tuning open-source models with data generated by DeepSeek-R1. This suggests that the reasoning abilities of larger models can be transferred to smaller, more efficient models, which is important for real world applications.
DeepSeek’s research strongly suggests that LLMs can indeed organically develop elements of general reasoning capabilities through reinforcement learning.
The self-evolution process and the emergence of sophisticated behaviors during training underscore the potential for models to learn and refine their problem-solving strategies autonomously.
Even if you have an America-first orientation and are worried about China gaining ground in a key area, the results from DeepSeek’s research are significant, showing a promising direction for future advancements in artificial intelligence.
They may have shown the path to AGI in their approach.
Some say DeepSeek simply digested OAI’s models and have replicated their intelligence, and the benefit of all that human-in-the-loop reinforcement, at a fraction of the cost.
You could argue OAI scraped the internet to make ChatGPT and now DeepSeek have scraped ChatGPT.
All is fair in love and AI, right?
According to the paper, the training process for DeepSeek-R1 involves several stages, each designed to enhance different aspects of the model’s reasoning capabilities. It builds upon the foundation of DeepSeek-R1-Zero, which itself was trained using a novel ground-up approach, as the name suggests, and then adds further refinements to create the final DeepSeek-R1 model.
This initial stage focused on pure reinforcement learning (RL) without any supervised fine-tuning (SFT).
The base model was DeepSeek-V3-Base. Group Relative Policy Optimization (GRPO) was the RL algorithm used to train the model. For each prompt, GRPO samples a group of outputs and estimates the baseline from that group’s scores instead of using a separate critic model, which saves the cost of training a second network. GRPO was introduced in DeepSeek’s earlier DeepSeekMath work.
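To make the "baseline from group scores" idea concrete, here is a minimal sketch of just the advantage step: each sampled answer is scored, then compared against the mean and spread of its own group. The function name, the `eps` guard, and the use of a population standard deviation are my assumptions for illustration, and this is not the full GRPO objective (no clipping, no KL term).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    The group mean plays the role a learned critic would otherwise play.
    `eps` is an assumed guard against a zero spread (all rewards equal).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with a positive advantage and wrong ones with a negative advantage, so the policy update pushes toward whatever the better half of the group did. If every answer in the group gets the same reward, all advantages are zero and that prompt contributes no signal.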
A rule-based reward system was employed, focusing on accuracy and format. The model was rewarded for providing correct answers in a specific format, including putting its thinking process within <think> and </think> tags.
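A rule-based reward like this needs no neural reward model, just checks that can be written as plain code. The sketch below is my own illustration, assuming the thinking tags are `<think>` and `</think>` as in the paper, that whatever follows the closing tag is the final answer, and that the two rewards are simply added with equal weight. The function names and the exact-match comparison are hypothetical, not DeepSeek's implementation.

```python
import re

# Reasoning inside <think>...</think>, then a non-empty final answer.
THINK_THEN_ANSWER = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion follows the think-then-answer layout."""
    return 1.0 if THINK_THEN_ANSWER.match(completion.strip()) else 0.0


def accuracy_reward(completion, reference):
    """1.0 if the text after </think> exactly matches the reference answer."""
    match = THINK_THEN_ANSWER.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def total_reward(completion, reference):
    # Equal weighting is an assumption made for this sketch.
    return accuracy_reward(completion, reference) + format_reward(completion)


score = total_reward("<think>2 + 2 is 4</think> 4", "4")
```

Exact string matching only works for tasks with a single canonical answer, such as a number. For code, a rule-based check would more plausibly run test cases instead.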
The model was trained using a simple template that required it to produce a reasoning process followed by the final answer, without content-specific biases.
The result of this pure RL process was a model that demonstrated an improved ability to solve complex reasoning tasks with behaviors such as self-verification, reflection, and generating long chains of thought (CoT).
However, DeepSeek-R1-Zero exhibited issues such as poor readability and language mixing. It wasn’t easy! To address the issues of DeepSeek-R1-Zero and further enhance reasoning performance, a multi-stage training pipeline was developed for DeepSeek-R1.
The keyword here is “cold start”.
Thousands of long Chain-of-Thought (CoT) examples were collected to fine-tune the base model, DeepSeek-V3-Base, before applying RL. This contrasts with DeepSeek-R1-Zero, which started directly from the base model. These examples were designed to be human-readable and included a summary of the reasoning results.
After the cold start, the model underwent large-scale RL, similar to the process used for DeepSeek-R1-Zero. The focus was on enhancing reasoning capabilities in areas like coding, mathematics, science, and logical reasoning. A language consistency reward was also introduced to mitigate language mixing, although this slightly degraded the model’s performance.
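The paper describes the language consistency reward as the proportion of target-language words in the chain of thought. Here is a toy sketch of that idea, assuming English is the target language and using "is the word pure ASCII?" as a very crude stand-in for real language detection. Both the function name and that heuristic are my assumptions.

```python
def language_consistency_reward(cot_text):
    """Fraction of whitespace-separated words that look like the target language.

    Assumption: the target language is English, approximated by ASCII-only words.
    A real system would need a proper language identifier.
    """
    words = cot_text.split()
    if not words:
        return 0.0
    in_target = sum(1 for word in words if word.isascii())
    return in_target / len(words)


clean = language_consistency_reward("first add the numbers")
mixed = language_consistency_reward("first 然后 add 数字")
```

A chain of thought that drifts between languages scores lower than one that stays in a single language, which is exactly the pressure the reward is meant to apply.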
Once the reasoning-oriented RL converged, the checkpoint was used to create new SFT data. This included reasoning data generated through rejection sampling and additional data from other domains like writing, factual QA, and self-cognition. Data that had language mixing, long paragraphs or code blocks was filtered out. This improved results and decreased thinking time.
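Rejection sampling here means: generate several candidate answers per prompt, then keep only the ones that pass your checks. The sketch below is a simplified illustration. `generate` and `is_correct` are hypothetical callables you would supply, the `max_paragraph_chars` threshold is invented, and the ASCII test is only a crude proxy for detecting language mixing.

```python
CODE_FENCE = "`" * 3  # marker used to spot code blocks in a completion


def rejection_sample(prompt, reference, generate, is_correct,
                     n_samples=4, max_paragraph_chars=1000):
    """Sample completions and keep only the correct, readable ones."""
    kept = []
    for _ in range(n_samples):
        completion = generate(prompt)
        if not is_correct(completion, reference):
            continue  # wrong answer: reject
        if not completion.isascii():
            continue  # crude proxy for mixed languages: reject
        if CODE_FENCE in completion:
            continue  # contains a code block: reject
        if any(len(p) > max_paragraph_chars for p in completion.split("\n\n")):
            continue  # overly long paragraph: reject
        kept.append({"prompt": prompt, "completion": completion})
    return kept


# Deterministic stand-in for a model, for demonstration only.
fake_outputs = iter([
    "The answer is 4",
    "The answer is 5",
    "答案 is 4",
    "The answer is 4",
])
samples = rejection_sample(
    "What is 2 + 2?",
    "4",
    generate=lambda prompt: next(fake_outputs),
    is_correct=lambda completion, ref: completion.strip().endswith(ref),
)
```

The surviving prompt and completion pairs are what would feed the next round of SFT. The model is effectively being taught on the best of its own work.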
The model underwent a second RL stage, aimed at improving its helpfulness and harmlessness, as well as refining reasoning capabilities. This involved using diverse prompts and a combination of rule-based rewards for reasoning tasks and reward models for general data.
To transfer the reasoning capabilities of DeepSeek-R1 to smaller, more efficient models, the curated data from DeepSeek-R1 was used to fine-tune several open-source models like Qwen and Llama.
This process involved only SFT, without any additional RL, to demonstrate the effectiveness of distillation.
The resulting distilled models showed impressive results, with the smaller models outperforming other open-source models on reasoning benchmarks.
To sum it up neatly: DeepSeek-R1-Zero shows that a pure RL approach on a base model can establish reasoning capabilities. DeepSeek-R1 then goes back to the base model and introduces a cold start with high-quality data, followed by further refinement through both RL and SFT, and finally distillation to transfer these reasoning capabilities to smaller models. This combination of techniques resulted in a model that performs comparably to OpenAI-o1-1217 (cutting edge as of this writing) on various reasoning tasks.
The fact that DeepSeek-R1-Zero starts with pure reinforcement learning (RL) and develops reasoning abilities without initial supervised fine-tuning (SFT) is groundbreaking. This suggests a more natural and potentially more powerful way for LLMs to learn, similar to how humans acquire knowledge through exploration and interaction with their environment.
By combining RL with SFT and distillation, DeepSeek-R1 achieves comparable performance to cutting-edge models like OpenAI’s, potentially at a fraction of the training cost. This could democratize access to advanced AI, making it more affordable and accessible for researchers, developers, and smaller organizations.
RL-based training can lead to more transparent reasoning processes. The model learns to “think” step-by-step, making its decisions easier to understand and trust compared to traditional black-box LLMs.
Also, the team at DeepSeek have a lot more “gas in the tank” to try to reach AGI velocity with this architecture.
Organically grown intelligence might enable LLMs to adapt to new situations and tasks more effectively. They could learn from their mistakes and continuously improve their performance without relying solely on pre-defined datasets.
Here’s why this is a big deal in simpler terms:
Imagine teaching a child to ride a bike. You could give them a detailed manual (SFT), but they’ll likely learn better by trying it themselves (RL), falling, getting up, and gradually improving. DeepSeek-R1’s training process is similar – it allows the LLM to “learn by doing” and develop its own reasoning abilities, leading to a more robust and adaptable intelligence.
This approach could revolutionize the field of AI, leading to more capable, efficient, and trustworthy LLMs that can be used for a wider range of applications.
This is the way to AGI.
I started Life in the Singularity in May 2023 to track all the accelerating changes in AI/ML, robotics, quantum computing and the rest of the technologies accelerating humanity forward into the future. I’m an investor in over a dozen technology companies and I needed a canvas to unfold and examine all the acceleration and breakthroughs across science and technology.
Our brilliant audience includes engineers and executives, incredible technologists, tons of investors, Fortune-500 board members and thousands of people who want to use technology to maximize the utility in their lives.