Chapter 10: Exploring Deepseek: Building a RAG system with stronger reasoning abilities

In the last tutorial, we discussed how to improve the reading comprehension and information extraction capabilities of the model through fine-tuning, thereby enhancing the overall effect of the RAG (Retrieval Augmentation Generation) system. Recently, DeepSeek has become popular around the world. With its powerful thinking chain reasoning capabilities, it has excellent performance in mathematical reasoning. Therefore, in this tutorial, we will take advantage of the mathematical reasoning advantages of DeepSeek-R1 and use knowledge distillation technology to migrate its mathematical reasoning capabilities to the small model of the previous period. This small model will not only maintain the enhanced reading comprehension ability of the previous period, but also have mathematical reasoning capabilities that are close to DeepSeek-R1, thereby enhancing the performance of the RAG system in complex reasoning tasks.

DeepSeek Introduction

DeepSeek is a large language model developed by the Chinese artificial intelligence company DeepSeek. It is based on the Transformer architecture and uses a hybrid expert model (MoE) and a multi-head latent attention mechanism (MLA) to perform well in reasoning efficiency and performance. Its core product DeepSeek-V3 has 671 billion parameters, and the number of parameters activated each time is 37 billion, which greatly reduces computing costs.

Technical features
- Mixed Expert Model (MoE): The model is divided into multiple expert modules, and only a few experts are activated for each task, reducing the amount of activation parameters and improving efficiency.
- Multi-head latent attention mechanism (MLA): Reduce Key-Value cache through low-rank compression technology, significantly improving inference efficiency.
- Training method: Using large-scale reinforcement learning combined with high-quality synthetic data, there is no need to rely on a large amount of annotated data.
Application scenarios
- Natural language processing: text generation, translation, sentiment analysis, etc.
- Intelligent dialogue: intelligent customer service, chat robot.
- Code generation: supports hundreds of programming languages, generates, interprets and repairs code.

The following is a demonstration of the effectiveness of DeepSeek-R1 in solving mathematical problems:

DeepSeek’s impact on the industry

The emergence of DeepSeek has had a profound impact on the field of AI. With efficient reasoning capabilities, low-cost deployment solutions, and powerful generation capabilities, it breaks the limitations of traditional large models and allows more enterprises and individual users to enjoy top-notch AI technology. DeepSeek not only performs well on general NLP tasks, but also brings new possibilities to applications such as RAG (Retrieval Augmented Generation). From commercial applications to academic research, DeepSeek is reshaping the AI ecosystem and providing transformative solutions to a wider range of industries. Compared with other large models, it has the following significant features:

Efficient Inference: While ensuring high-quality generation, DeepSeek significantly improves the inference speed, making it more responsive in various tasks.
Localized deployment: Supports offline deployment, reduces dependence on cloud computing resources, and improves data privacy and security.
Optimized training architecture: Adopt advanced model architecture and training strategies to improve generation quality and stability.
Powerful reasoning capabilities: Excellent performance in text, code, mathematical reasoning and other fields, broadening application scenarios.

These advantages make DeepSeek an ideal choice for large-scale AI applications, especially in tasks such as RAG that require efficient information retrieval and generation.

DeepSeek-R1 Basic Principle

Introduction to thinking chain

Before introducing DeepSeek-R1, it is necessary to explain what a thinking chain is. The chain of thought was first proposed by Google in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" in 2022. By guiding large models to gradually decompose complex problems into sub-problems and solve them sequentially, the reasoning ability of the model can be significantly improved. These intermediate steps of reasoning are called Chain of Thought (CoT).

CoT structure

Different from the traditional Prompt mapping from input directly to output output>, CoT completes the mapping from input to thinking chain to output, that is, reasoning chain—>output>. If you break down the Prompt using CoT, you can observe the workflow of CoT in more detail.

As shown in the figure above, a complete Prompt containing CoT often consists of three parts:

Instruction: clarify the task objectives and output format specifications (such as JSON/Markdown structural requirements);
Logic (Rationale): includes multi-hop reasoning paths, domain knowledge invocation and intermediate verification steps;
Exemplars: Provides a few-shot problem-solving paradigm. Each example contains a complete question-reasoning-answer triplet.

CoT Classification

According to the difference in sample guidance strength, CoT can be divided into two types of typical implementations:

Zero-Shot-CoT: Activate the model's implicit reasoning ability by adding meta-prompts such as "Let's think step by step" without providing specific examples;
Few-Shot-CoT: Provides 3-5 reference cases containing complete reasoning chains, guiding the model to generate a structured reasoning process through analogy learning.

CoT Advantages

In the construction of industrial-grade AI systems, CoT has demonstrated multiple application values:

Complex problem solving: Dynamically allocate computing resources to key reasoning nodes through problem decomposition;
Model Interpretability: The inference chain output allows developers to locate error nodes (such as misuse of mathematical formulas, knowledge retrieval deviations). Compared with the black box model, CoT's inference chain output makes debugging more based;
Cross-model generalization: Compatible with mainstream models such as Transformer series and MoE architecture. Capability migration can be achieved by just adding prompt words;

In complex reasoning tasks, Chain of Thoughts (CoT) can significantly improve the model's reasoning capabilities. However, the acquisition of CoT data has been highly dependent on manual annotation, which is difficult to generate and apply on a large scale. DeepSeek-R1 optimizes this process through reinforcement learning, enabling the model to independently generate high-quality CoT data, and exposes this method to make end-to-end training more efficient and reproducible. In addition, DeepSeek also uses the data generated by the large model to train small models, so that the small models also have powerful reasoning capabilities, thereby reducing the cost of reasoning and making high-quality reasoning capabilities easier to implement.

Training process

The model and process can be simply divided into four stages:

① Phase 1: Reinforcement learning verification, perform pure reinforcement learning on DeepSeek-V3, and obtain DeepSeek-R1-Zero

② Phase 2: Data synthesis model, use DeepSeek-R1-Zero to generate data to train DeepSeek-V3, obtain DeepSeek-V3-checkpoint, and use DeepSeek-V3-checkpoint to generate a 600k inference data set

③ Phase 3: DeepSeek-R1 training, mixed inference data and non-inference data set (800k), fine-tuned all parameters of the model DeepSeek-V3, and obtained DeepSeek-R1

④ Phase 4: Distillation experiment, using the same mixed data set used to train R1, perform full parameter fine-tuning on each open source model (Qwen, Llama), and obtain DeepSeek-R1-Distill-(Qwen/Llama)-(*B)

It should be noted that experiments have proven that only "smart enough" base models can perform subsequent reinforcement learning training. Therefore, for the selection of the base model here, deepseek uses their own MoE model with strong enough mathematical capabilities, and deepseek-V3 is used as the basis for training.

DeepSeek-R1 deployment and use

The basic principles of DeepSeek-R1 are introduced above, so how can we use it? Especially how to customize our own DeepSeek-R1 based on LazyLLM's rich tools and powerful application orchestration capabilities? Then give full play to the capabilities of DeepSeek-R1.

Online model usage

The use of the DeepSeek-R1 online model is the same as the other models introduced in Lecture 3. Just specify the model in OnlineChatModule as DeepSeek-R1. After configuring API-KEY in the command line terminal (the relevant API-KEY of SenseTime is configured here, SenseTime has supported the use of DeepSeek large model), you can implement a chatbot through the following lines of code:

import lazyllm
chat = lazyllm.OnlineChatModule('DeepSeek-R1')
if __name__ == '__main__':
    lazyllm.WebModule(chat, port=range(23466, 23470)).start().wait()

After the above code is run, a link will be provided. Copy it to the browser to open it and start chatting:

Call DeepSeek’s DeepSeek-R1 model:

chat = lazyllm.OnlineChatModule('deepseek-reasoner', source='deepseek')

Call Alibaba’s DeepSeek-R1 model:

chat = lazyllm.OnlineChatModule('deepseek-r1', source='qwen')

Local deployment

If you have sufficient computing resources, you can also deploy DeepSeek R1 locally. Similar to the online model, you only need to use the TrainableModule module in LazyLLM and pass in the model name "DeepSeek-R1" to complete the deployment of the model. LazyLLM will automatically call the corresponding tools to complete the deployment. The specific code is as follows:

import lazyllm
from lazyllm import deploy, launchers
chat = lazyllm.TrainableModule('DeepSeek-R1').\
            deploy_method((deploy.Vllm,{
                'tensor-parallel-size': 8,
                'pipeline-parallel-size': 2,
                'max-num-batched-tokens': 131072,
                'launcher': launchers.sco(nnode=2, ngpus=8, sync=False)
            }))
if __name__ == '__main__':
    lazyllm.WebModule(chat, port=range(23466, 23470)).start().wait()

It is worth noting that because the model is too large and involves cross-node deployment (a node can be simply understood as a machine for 8 A100 cards, cross-node means at least two machines, with a total of 16 cards), the parameters need to be configured. In the above code:

tensor-parallel-size: 8, perform tensor parallelism on the model (simply can be understood as dividing the model into 8 small models along the direction of data flow);
pipeline-parallel-size: 2, pipeline parallelize the model (simply can be understood as dividing the model into 2 segments perpendicular to the direction of data flow, that is, 2 small models. Counting tensor parallelism, the total is 2*8=16 small models, each small model is placed on a card);
launcher: used to specify the platform and computing power to launch the task. Here launchers.sco specifies SenseCore, SenseTime’s public cloud platform, and the corresponding computing power parameters:
- nnode: 2, specifying that 2 nodes are required;
- ngpus: 8, specify 8 computing cards for each node;
- sync: False, which means that there is no need to wait after deploying the command, and you can continue to perform other tasks. For example, you need to deploy the interactive interface of WebModule later, so you must use False here; otherwise it will be stuck waiting for the deployment task to end normally. In fact, the deployed task will always exist if it is not actively terminated under normal circumstances, so it will be stuck;
max-num-batched-tokens: The maximum number of tokens in a batch; the larger the value, the better the performance, but the more graphics memory resources of the computing card will be occupied.

LazyLLM also supports various Infrastructure as a Service (IaaS) platforms:

SenseCore
Slurm
K8S
bare metal

R1 encounters Waterloo

In the previous second lecture, we have introduced how to build a simple online large model dialogue system based on LazyLLM. Next, we will directly compare the performance of two different online models.

import lazyllm

#Specify test data
context = 'Based on the games developed on "Kart Racing" and "Bubble Hall", developed and published by South Korea's Nexon. Mainland China is operated by Shanda Games. This is the first time in 6 years that Nexon has granted Shanda Network its game operation rights. Taiwan is operated by Game Orange. Players use water guns, small guns, hammers or water bombs to blister enemies (players or NPCs), which is a bubble, and breaking the bubbles is a kick. If the bubble is not exploded within the time, it will be released from the bubble or rescued by teammates (this is a rescue). Each bubble will reduce the number of lives. When the number of lives is exhausted, it will be counted as a kick. The reborn person is invincible for a certain period of time. The person with more kicks and explosions wins. The rules vary depending on the mode. With 2V2 and 4V4 random matching, players can climb the rankings according to the number of wins (in order: rough stone, bronze medal, silver medal, gold medal, platinum, diamond, master), and can choose classic, hot-blooded, sniper and other game modes. If you lose during the game, you will not be able to match for 4 minutes (+4 minutes for each time you lose). It is open from time to time during the summer or winter vacation. 8 people are randomly matched in the classic mode and a scoring method is used. The more points you get during the event, you will get rewards based on your ranking at the end. '
question = 'If there is a match in the game, how many minutes will it take to match? '

# Define chat roles and behaviors
pt = lazyllm.ChatPrompter(f"Please use the original text of the following paragraph to answer the question\n\n### Known paragraph: {context}\n\n### Question: {question}\n")

#Initialize OnlineChatModule
# Specify using SenseNova as the chat source and DeepSeek-R1 model
# Pass in API_KEY for authentication
# Use the defined ChatPrompter (p_cat) as a prompt for the model
llm = lazyllm.OnlineChatModule(source="sensenova", model="DeepSeek-R1").prompt(pt)

# Call model
res = llm('Please tell me the result')

#Print the output results of the model
print(res)

In order to compare the performance of the two models (DeepSeek-R1 and Qwen2-7B) more clearly, we selected a sample from the test set of the CMRC2018 data set used in the previous issue. In order to eliminate the impact of RAG recall document fragments on the results, we directly use the original text of the reading comprehension as context and input it into the large model together with the question to observe its generation effect.

{
    "context": "Based on the games developed on \"Kart Runner\" and \"Bubble Hall\", it is developed and published by South Korea's Nexon. Mainland China is operated by Shanda Games. This is the first time Nexon has granted Shanda Network its game operating rights again after 6 years. Taiwan is operated by Game Orange. Players use water guns, small guns, hammers or water bombs to soak enemies (players or NPCs), which is a bubble seal, and the bubble is broken into a kick. If the bubble is not there at the time If kicked out within a certain period of time, the number of lives will be reduced, and the number of lives will be exhausted. The reborn person will be invincible for a certain period of time, and the player with the most points will win. The rules vary depending on the mode. In 2V2 and 4V4 random matching, players can climb up the ranking list (in order: rough stone, bronze medal, silver medal, gold medal, platinum, diamond, and master). , you can choose classic, hot-blooded, sniper and other modes to play. If you are in the game, you will not be able to match within 4 minutes (each time you are in the game + 4 minutes). The opening time is from time to time during the summer or winter vacation. The 8-player classic mode is randomly matched and the points are scored. The more points you get during the event, the rewards will be obtained at the end. ",
    "question": "If the match is lost during the game, how many minutes will it take to match?",
    "answers": "4 minutes"
}

Among them, the design of Prompt is as follows:

f"Please use the original text of the following passage to answer the question\n\n### Known passage: {context}\n\n### Question: {question}\n"

The following are the running results of the DeepSeek-R1 model and Qwen2-7B model. It can be seen that Qwen2-7B's answer is more concise and closer to the standard answer ("4 minutes"), and is better in tasks that require accurate extraction of key information; while DeepSeek-R1 provides quotes + explanations, which are more detailed but redundant and more suitable for tasks that require contextual support.

	Qwen2-7B	DeepSeek-R1
Answer	`If you are disconnected during the game, you are not allowed to match within 4 minutes' \|`According to the original text provided: "If you are disconnected during the game, you are not allowed to match within 4 minutes (each time you disconnect + 4 minutes)." Therefore, if you are disconnected during the game, you are not allowed to match within 4 minutes. `

Let us run the evaluation with DeepSeek-R1 and compare it with the results of the previous issue:

Model	Exact match rate	Semantic similarity	Original text inclusion
Internlm2-Chat-7B	2.10%(21)	74.51%(746.6)	5.19%(52)
DeepSeek-V3	5.29%(53)	74.85%(750.0)	15.17%(152)
DeepSeek-R1	2.30%（23）	69.62%（697.56）	7.78%（78）
After Internlm2-Chat-7B training	39.72%(398)	86.19%(863.6)	94.91%(951)

From the table above, we can see that DeepSeek-R1 has the lowest score in semantic similarity, and its exact matching and original text inclusion are only slightly better than the small model Internlm2-Chat-7B. This shows that DeepSeek-R1 does not perform well in reading comprehension information extraction tasks.

As the old saying goes, we must learn from each other's strengths to offset our weaknesses, so what are the strengths of DeepSeek-R1? One of the obvious advantages of this model is its mathematical reasoning capabilities. As shown in the evaluation report, it has clear advantages on both AIME 2024 and MATH-500:

So, is there any way to transfer the mathematical capabilities that DeepSeek-R1 is good at to the small model in our previous tutorial? A common method is - distillation!

Detailed explanation of Deepseek R1 distillation

In the previous section, we found that DeepSeek-R1 did not perform well in reading comprehension content extraction tasks, so we wanted to learn from each other's strengths and migrate the powerful mathematical reasoning capabilities of DeepSeek-R1 to a small model. The method to achieve migration is knowledge distillation! So what is the principle of knowledge distillation?

Distillation principle

The so-called "knowledge distillation" simply means to let the small model learn the "advantages" of the large model through "some way". The "some advantage" here for R1 is R1's reasoning ability, and the "some way" is simply Supervised Fine-Tuning (Supervised Fine-Tuning, or SFT). The data for supervised fine-tuning is the chain of thinking generated during R1 inference. The basic idea is as follows:

Select a certain field data set {question:"xxxxx?", answer:"yyyyy"}, take the question: "xxxxx?" and use the DeepSeek R1 model to perform inference, filter and clean the inference results, and obtain the R1 thought chain data R1_CoT_Ans:"wwwwww".
Combine the original question "xxxxx?" in the data set and the thought chain data of R1 to form a new data set: {question:"xxxxx?", R1_CoT_Ans:"wwwwww"}, which is used as a data set for fine-tuning the small model, that is, a distillation data set.
Use the new data set {question:"xxxxx?", R1_CoT_Ans:"wwwwww"} to conduct supervised fine-tuning of the small model, and evaluate the fine-tuned small model to compare the effects of the model before and after fine-tuning.

With the basic principles clear, we can design our distillation scheme.

Distillation scheme

We use DeepSeek-R1 to distill the small model. The process is roughly shown in the figure below: First we select GSM8K Data set, use the questions in its training set to continuously feed DeepSeek-R1 to obtain question and answer pairs for training. After obtaining the distilled training set, in order to see the effect of the thinking chain in it, we also separately removed the thinking chain in the data set as another training set, and fine-tuned two small models based on these two training sets. Finally, we evaluated the two small models after fine-tuning, DeepSeek-R1 and the small model before fine-tuning to see the improvement effect.

1. Data preparation

Here we choose GSM8K (Grade School Math 8K). This data set was built by the OpenAI team. It is a text data set used for solving mathematical problems. It contains more than 8,000 questions at the primary school mathematics level (training set: 7473 questions, test set: 1319 questions). The questions mainly involve basic arithmetic operations such as addition, subtraction, multiplication and division, as well as some simple mathematical word problems. Each question is accompanied by an answer in natural language, which not only provides the final result but also explains in detail the steps and process of solving the problem. The following is a piece of data from the data set:

{
    "question": "James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?",
    "answer": "He sprints 3*3=<<3*3=9>>9 times\nSo he runs 9*60=<<9*60=540>>540 meters\n#### 540"
}

It can be seen that the answers in this data set all start with \n#### ans. This feature facilitates us to have a clear numerical evaluation of the model's reasoning effect.

2. Distilled data

Next, we reconstruct the training data set in the GSM8K data set to obtain a training set for distilling the small model. The purpose of data set reconstruction is: we hope to obtain an answer with a chain of thought that ultimately provides the correct format and the correct answer. The main process is:

Extract the training set of the GSM8K data set: 7473 questions, remove the answers, and retain only the questions.
Feed 7473 questions to DeepSeek-R1 and save their corresponding output as new answers, that is, reconstruct the question-answer pair.

Here are some key points to note:

DeepSeek-R1 needs to explicitly put the answer into '\\boxed{true_answer}', so you need to set a Prompt for it: <｜begin of sentence｜>You are a math problem solver. Provide the final answer in a boxed format using \\boxed{{answer}}.\n<｜User｜>{question}<｜Assistant｜>, where question It is the place where the questions are placed;
In order to improve the speed of inference, it is necessary to increase the concurrency as much as possible (it can be set to the maximum according to the supplier's concurrency limit);
The answers generated by DeepSeek-R1 need to meet certain standards before they can be saved. The specific requirements are as follows:
1. The reasoning answer must contain the standard answer and meet the format: '\\boxed{{{true_answer}}}'
2. The reasoning answer needs to contain a chain of thinking, and the screening mark is: </think>
Strategy for saving answers: If the above two saving conditions are not met, the question will be recorded. After completing a round of reasoning, all the recorded questions will be repeated again. This process will be repeated at least 15 times.

Finally, the constructed question-and-answer pair that meets the requirements is as follows:

{
    "instruction": "Mel is three years younger than Katherine.  When Katherine is two dozen years old, how old will Mel be in years?",
    "output": "<think>\nOkay, let's see. The problem says Mel is three years younger than Katherine. So, if I figure out Katherine's age first, then I can subtract three to get Mel's age. \n\nThe question is asking, when Katherine is two dozen years old, how old will Mel be? Hmm, a dozen is 12, so two dozen would be 24. Wait, is that right? Yeah, 12 times 2 is definitely 24. So Katherine will be 24 years old at that time.\n\nSince Mel is three years younger, I need to subtract 3 from Katherine's age. Let me do the math here: 24 minus 3 equals 21. So that should be Mel's age when Katherine is 24. Let me double-check. If Mel is always three years younger, then no matter when, the difference stays the same. So when Katherine is 24, subtracting 3 gives 21. Yeah, that makes sense. I think that's the answer.\n</think>\n\nWhen Katherine is two dozen (24) years old, Mel, being three years younger, will be:\n\n\\boxed{21}",
    "input": ""
}

where:

instruction is the question question in GSM8K;
output is the answer output by DeepSeek-R1, with a thought chain and the format and answers are correct;
input is empty, this field is mainly needed for fine-tuning training;

3. Fine-tune the model

Technology Selection: As in the previous tutorial, LoRA (Low-Rank Adaptation) is used to achieve lightweight fine-tuning and significantly reduce computing overhead;
Parameter configuration:

Parameters	Model	Epochs	Batch-size	GPUs	Gradient Accumulation step	learning_rate	lora_rank
Description	InternLM2-7B-Chat	2	16	8 (A800)	1	1.00E-04	8

In the process of fine-tuning the model here, the data set distilled based on DeepSeek-R1 is mainly used. At the same time, in order to verify whether the thinking chain is really effective and for comparison, we can process the data again and remove the thinking chain part. In other words, in the fine-tuning part, two models need to be fine-tuned: one based on the distilled data set, and the other based on the distilled data set with the thinking chain removed.

4. Model evaluation

After obtaining the fine-tuned model, the model needs to be evaluated. Here we use the following standards:

Evaluation Criteria:
- Numerical correctness: the generated answer exactly matches the standard answer;
- Format Compliance: Answers need to be wrapped in \\boxed{...}.
Comparative Experimental Design:
- Pre-distillation model: original small model without optimization;
- Distilled model 1: A fine-tuned model based on complete distillation data (with thinking chain);
- Distilled model 2: A model based on fine-tuning of distilled data without thinking chain;
- Baseline model: DeepSeek-R1 as a performance upper limit reference.

Deepseek-R1 distillation practice

Here we implement the above process based on LazyLLM. Based on LazyLLM, fine-tuning, deployment, reasoning and evaluation can be easily achieved with one click.

1. Get data

First we need to obtain the data set: GSM8K. See link for the code.

import os
import json
from modelscope.msdatasets import MsDataset
def build_data_path(file_name):
    data_root = os.path.join(os.getcwd(), 'dataset')
    if not os.path.exists(data_root):
        os.makedirs(data_root)
    save_path = os.path.join(data_root, file_name)
    return save_path
def get_dataset():
    train_path = build_data_path('train_set.json')
    eval_path = build_data_path('eval_set.json')
    ds = MsDataset.load('modelscope/gsm8k', subset_name='main')
    ds = ds.rename_column('question', 'instruction').rename_column('answer', 'output')
    with open(train_path, 'w') as file:
        json.dump(ds['train'].to_list(), file, ensure_ascii=False, indent=4)
    with open(eval_path, 'w') as file:
        json.dump(ds['test'].to_list(), file, ensure_ascii=False, indent=4)
    return train_path, eval_path

In the above code:

Create a data directory: call build_data_path to generate a storage directory, define the training set path train_path and the evaluation set path eval_path.
Dataset transformation:
Load data: Obtain the GSM8K data set through the ModelScope API;
Field mapping: convert the original fields question → instruction, answer → output to adapt to subsequent fine-tuning needs.
Save data: Save the processed training set and evaluation set into train_path and eval_path respectively.

After the above steps, we have completed the preprocessing of the data set GSM8K.

2. Distilled data

Next, based on the preprocessed training set, we process it and feed it to DeepSeek-R1 to implement data distillation to construct a new training set.

Code GitHub link

import json
from lazyllm import warp

def load_data(data_path):
    with open(data_path, 'r') as file:
        dataset = json.load(file)
    return dataset

def save_res(data, file_path):
    with open(file_path, 'w') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)

def distill_dataset(data_path, model=None, demo=False):
    inputs = load_data(data_path)[:1] if demo else load_data(data_path)
    with warp(_concurrent=1) as wp:
        wp.func = model
    res_list = []
    try_n = 0
    while inputs:
        print(">>>" * 12, f"{try_n+1} times left: ", len(inputs))
        querys = [item['instruction'] for item in inputs]
        results = wp(querys)
        valid_data, inputs = filter(inputs, results)
        res_list.extend(valid_data)
        try_n += 1
        if try_n == 15:
            break
    res_list = res_list * 120 if demo else res_list
    distilled_train_set_path = build_data_path('distilled_train_data.json')
    save_res(res_list, distilled_train_set_path)
    save_res(inputs, build_data_path('left_data.json'))
    return distilled_train_set_path

In the above code, we define a distill_dataset function to implement the distillation of data:

Load data: distill_dataset calls load_data to load the preprocessed training set, and supports quick debugging through the demo parameter (only a single piece of data is loaded).
Concurrent inference: The warp workflow based on LazyLLM calls the DeepSeek-R1 model concurrently (the amount of concurrency is controlled through _concurrent).
Iterative filtering:
- Extract the question (instruction) and trigger the reasoning process;
- Use the filter function to filter answers that meet the criteria (including \\boxed{{true_answer}} and </think> tags) and store them in res_list;
- Failed data is looped into as new input and retried up to 15 times.
Result saving: The final output is qualified data distilled_train_data.json and failure record left_data.json.
Key parameters:
- _concurrent: Controls the number of concurrent inference threads;
- demo: debug mode switch (loading a single piece of data);
- Retry limit: 15 times (filtering low quality samples).

The code below is filter Implementation details (screen the inference results of DeepSeek-R1 and only retain results with thinking chains and correct answers and formats), GitHub link：

def filter(inputs, results):
    valid = []
    retry = []
    for i, item in enumerate(inputs):
        true_v = item['output'].split('\n#### ')[-1].strip()
        if f'\\boxed{{{true_v}}}' in results[i] and '</think>' in results[i]:
            valid.append({'instruction': item['instruction'], 'output': results[i], 'input': ''})
        else:
            retry.append(item)
    return valid, retry

3. Fine-tune the model

Based on the training set obtained in the previous step, we can implement fine-tuning, inference and evaluation based on LazyLLM. As in the previous tutorial, LoRA (Low-Rank Adaptation) is used to implement lightweight fine-tuning. The code is as follows ( GitHub link ):

import lazyllm
from lazyllm import finetune, deploy, launchers

# Get data
train_set_path, eval_set_path = get_dataset()
eval_set = load_data(eval_set_path)

# Distillation data
teacher_model = lazyllm.OnlineChatModule('DeepSeek-R1')
sft_data_path = distill_dataset(train_set_path, teacher_model)

# Fine-tune the model
infer_data = [item['instruction'] for item in eval_set]
student_model = lazyllm.TrainableModule('internlm2-chat-7b')\
    .mode('finetune')\
    .trainset(sft_data_path)\
    .finetune_method((finetune.llamafactory, {
        'learning_rate': 1e-4,
        'cutoff_len': 5120,
        'max_samples': 20000,
        'val_size': 0.01,
        'per_device_train_batch_size': 2,
        'num_train_epochs': 2.0,
        'launcher': launchers.sco(nnode=1, nproc=8, ngpus=8)
    }))\
    .prompt(dict(system='You are a helpful assistant.', drop_builtin_system=True))\
    .deploy_method(deploy.Vllm)
student_model._prompt._soa = '<|im_start|>assistant\n\n<think>'
student_model.evalset(infer_data)
student_model.update()

## Evaluation model
score = calculate_score(eval_set, student_model.eval_result)
print("All Done. Score is: ", score)

In the above code, obtaining data and distilling data have been introduced in the previous content. Here we look directly at the fine-tuning model:

Data preparation: Extract instruction from the evaluation set as the inference data set infer_data and bind it as .evalset(infer_data).
We used the TrainableModule module in LazyLLM to configure fine-tuning:
- Pass in the model name parameter: `'internlm2-chat-7b'', which indicates the small model to be fine-tuned. If it is not available locally, it will be automatically downloaded;
- Set the mode to fine-tuning .mode('finetune');
- At the same time, the training set is set to the data distilled in the previous step: .trainset(sft_data_path);
- Set the basic parameters of fine-tuning through .finetune_method:
  - The fine-tuning engine used is llamafactory;
  - Relevant key parameters include: learning rate learning_rate, maximum data cutoff length cutoff_len, maximum number of samples in the training set max_samples, percentage of the verification set in the training set val_size, training batch size per_device_train_batch_size on each device, total number of training rounds num_train_epochs, platform to start training and parameter launcher (the platform here uses SenseTime's public cloud SCO, and 8 cards are set to fine-tune this model);
- Configure system prompts through .prompt;
- The inference engine used during deployment is configured as vllm through deploy_method
- In addition, we also separately configured the soa special tag in Prompt and added the <think> tag at the end;
After configuring the parameters, we use .update() to achieve: fine-tuning, deployment, and inference with one click.
After the above process is completed, the inference results will be placed in student_model.eval_result;

4. Model evaluation

After obtaining the inference results of the evaluation set, we can implement a caculate_score to evaluate its results:

import re

def extract_boxed_content(text):
    pattern = r'boxed{((?:[^{}]*|{.*?})*)}'
    contents = re.findall(pattern, text)
    return contents

def caculate_score(eval_set, infer_set):
    assert len(eval_set) == len(infer_set)
    score = 0
    for index, eval_item in enumerate(eval_set):
        output = infer_set[index]
        if 'boxed{' in output:
            res = extract_boxed_content(output)
            res = list(set(res))
            res = res[0] if len(res) == 1 else res
            if type(res) is list:
                continue
            true_v = eval_item['output'].split('\n#### ')[-1].strip()
            if true_v == res.strip():
                score += 1
    return f'{score}/{len(eval_set)}, {round(score/len(eval_set),4)*100}%'

The above code mainly captures the correct answers contained in boxed from the inference results, and compares them with the standard answers. If they are equal, one point can be accumulated. Finally, the total score and score proportion are returned as strings. So far, this is the implementation of the DeepSeek-R1 distillation small model based on LazyLLM. For the complete code script, please see: LazyLLM/examples/distill_deepseek_r1.py

Distillation Performance of DeepSeek-R1

We summarize the evaluation results as follows:

Model	InternLM2-7B-Chat【Original】	InternLM2-7B-Chat【After distillation - without thinking chain】	InternLM2-7B-Chat【After distillation】	DeepSeek-R1【Teacher Model】
Number of correct answers	331	839	951	1201
Accuracy rate (1319 questions)	25.09%	63.61%	72.10%	91.05%

Based on the above table we can see:

Basic distillation gain: Distillation without thinking chain makes the accuracy jump from 25.09% to 63.61%, absolute improvement reaches 38.5 percentage points, proving that basic distillation is effective;
Added value of thinking chain: After the introduction of the CoT mechanism, the accuracy rate increased by another 8.5 percentage points, verifying the strengthening effect of thinking chain on knowledge transfer;
Teacher-student gap: There is a performance difference of 18.95 percentage points between the student model (72.1%) and the teacher model (91.05%), revealing the key impact of model capacity on reasoning ability;
Scale efficiency ratio: The 7B distillation model reaches the 79.2% accuracy level of the 671B teacher model, **achieving 4/5 performance with nearly 1/100 of the number of parameters! **

RAG liberal arts assistant system construction

The previous distillation is mainly based on mathematical reasoning data alone. Our goal is to strengthen the ability of the generative model in RAG, so here we not only need to strengthen the mathematical reasoning ability, but also need to retain the reading comprehension ability of the model as much as possible. Therefore, here we will reuse the mixed data set (this issue’s GSM8K data set with thinking chain and the previous issue’s CMRC2018 training set) based on InternLM2-7B-Chat to perform the above fine-tuning process. The following are the evaluation results after mixed training:

Task (data set)	Reading comprehension information extraction ability (CMRC2018)	Same as before	Same as before	Mathematical reasoning ability (GSM8K)
Model	Exact match rate	Semantic similarity	Original text inclusion	Accuracy
Internlm2-Chat-7B	2.10%(21)	74.51%(746.6)	5.19%(52)	25.09%（331）
DeepSeek-R1	2.30%（23）	69.62%（697.56）	7.78%（78）	91.05%（1201）
After Internlm2-Chat-7B training (only CMRC2018 data)	39.72%(398)	86.19%(863.6)	94.91%(951)	-(-)
After Internlm2-Chat-7B training (GSM8K distilled data only)	-(-)	-(-)	-(-)	72.10% (951)
After Internlm2-Chat-7B training (mixed data)	39.22%(393)	86.22%(863.9)	93.71%(939)	73.24%(966)

Evaluation process:

Let's build our arts and science assistant based on the RAG with data flow from the previous tutorial. The code is as follows (GitHub link):

import lazyllm
from lazyllm import bind
from lazyllm.tools import IntentClassifier

template = "Please use the original text of the following paragraph to answer the question\n\n### Known paragraph: {context}\n\n### Question: {question}\n"
base_model = 'path/to/internlm2-chat-7b-chinese-math2'
base_llm = lazyllm.TrainableModule(base_model)

# Document loading
documents = lazyllm.Document(dataset_path="path/to/cmrc2018/data_kb")

with lazyllm.pipeline() as ppl:
# Retrieve component definition
    ppl.retriever = lazyllm.Retriever(doc=documents, group_name="CoarseChunk", similarity="bm25_chinese", topk=3)
    ppl.formatter = (lambda nodes, query: template.format(context="".join([node.get_content() for node in nodes]), question=query)) | bind(query=ppl.input)
# Generate component definition
    ppl.llm = base_llm

with IntentClassifier(lazyllm.OnlineChatModule()) as ic:
    ic.case['Math', base_llm]
    ic.case['Default', ppl]

lazyllm.WebModule(ic, port=23496).start().wait()

In the above code:

We introduce an intent classifier IntentClassifier to identify the user's intention. A large online model is used here. The intent classifier can judge the intention entered by the user and dispatch it to different branches. Here we design two branches:
- 'Math' branch: used to perform mathematical calculations and directly call our small model of distilled DeepSeek-R1;
- 'Default' branch: used to perform RAG tasks, which calls the data flow-based RAG we designed in the previous tutorial;

The large model used in the two branches here is the same - a small model with reading comprehension ability and mathematical reasoning ability (based on distillation) that we have fine-tuned in this tutorial. Let's start it and see the effect:

The questions in the video above come from the evaluation sets of GSM8K and CMRC2018 (note that the training set and the evaluation set are completely isolated!). We can see that if we randomly extract a few questions into our RAG system, the answers obtained are consistent with the standard answers in the evaluation set (except for the first question, where the online model misidentifies the user's intention, and the other results are very good).

References

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning