Chapter 9: Fine-Tuning in Practice — Help Large Models and Embedding Models Better Understand Your Domain

In the previous tutorial, we improved the retrieval accuracy of the RAG system by optimizing the retrieval strategy, recall strategy, and query rewriting strategy based on large models. However, the final response results still need to be fused and processed by large models. The strength of the model directly affects the final results. This is like a good dish requires not only high-quality ingredients (an optimized retrieval module provides high-matching text), but also a good chef (a capable LLM that fuses information to generate answers). In this tutorial, we will further improve the accuracy of the RAG system, go into the generation module of the RAG system, discuss the role of large models in it, and introduce how to significantly improve the generation capabilities of large models through fine-tuning of LazyLLM, and even use small models to surpass general large models (i.e., train a good chef), thereby improving the reply quality of the RAG system.

The role of large models in the RAG system

Review of RAG process

In the previous tutorial, we have been able to quickly build the RAG system. Let us quickly review the RAG operation process:

The retrieval module will query the knowledge base based on the user's Query to recall Top-K related documents through vector similarity;
The generation module splices the search results with the user Query, and inputs LLM to generate the final response;

From this we can see that the LLM large model mainly plays the role of content generation in the RAG system. Specifically we can summarize it as follows:

Semantic Understanding: Analyze the true intention of the query;
Knowledge Fusion: Coordinate the fusion of original knowledge (knowledge inherent in the LLM large model) and new knowledge (retrieval content);
Logical reasoning: Reason a reasonable result based on the intent and context of the query;

So there are two main factors that affect the accuracy of the RAG system:

Effectiveness of recall + Ability to generate model content

In the previous tutorials, we have introduced in detail how to improve the recall effect (quick review: Chapter 7: Search upgrade practice: build a “smarter” document understanding system by yourself!, Chapter 8: More than just cosine! The matching strategy determines the quality of your recall). In this issue, we will focus on enhancing the model capabilities of the generation module to improve the quality of the content generated by the model.

Impact of model capabilities

In a typical RAG architecture, the baseline capability of the large language model (LLM) directly affects the reliability of the final output of the system, and its performance bottlenecks are mainly reflected in the following three dimensions:

1. Domain knowledge adaptability defects

General-purpose large models (such as DeepSeek-R1, GPT-4, Claude-3) show strong capabilities in open domain knowledge understanding, but their performance drops significantly when facing vertical fields:

Difficulties in interpreting professional terms: For example, IC can refer to “Intensive Care” in the medical field, and “Integrated Circuit” in the electronics field.
Lack of long tail knowledge:
- In the medical field, general large models may be mainly exposed to data of common diseases during training, and have less exposure to data of rare diseases. As a result, when the model encounters cases of rare diseases, it may not be able to accurately identify or diagnose, leading to misdiagnosis or missed diagnosis;
- In natural language processing, general models are usually trained based on Mandarin or mainstream languages, and there is less training data for local dialects. As a result, the model may have misunderstandings or incomprehensibility when processing local dialects, affecting the communication effect;
- In cultural knowledge question and answer or recommendation systems, the general model has a good grasp of mainstream cultural knowledge, but lacks understanding of niche cultural knowledge. As a result, when users ask questions about niche culture, the model may not be able to give accurate or relevant answers.
Limitations in domain reasoning ability:
- In the legal field, general large models are used to assist in case analysis. When dealing with complex legal cases, it may not be possible to understand the deep legal logic and the relationship between provisions. As a result, the analysis suggestions provided by the model may not be accurate or comprehensive enough, affecting legal decision-making;
- In the field of education, general large models are used to assist in solving mathematical problems. When dealing with advanced mathematical problems, the model may not be able to perform in-depth reasoning and calculations. The model cannot solve complex mathematical problems, or the answers given may be wrong;
- In the field of scientific research, general large models are used to assist in the design of experiments. The model may lack an in-depth understanding of a specific scientific field and cannot take into account all experimental variables and potential effects. As a result, the designed experiments may be flawed and fail to achieve the expected research goals.

2. Weak structured output control

When the system requires fixed format output (such as JSON data tables, standardized reports), the model is susceptible to two key issues:

Format Drift Phenomenon:
- In financial data analysis, the model needs to output a standardized JSON format data table. However, because the model does not grasp the format details accurately enough, the nesting level of some fields in the output JSON data is incorrect. For example, data that should belong to subfields is placed directly under the parent field, causing data parsing to fail;
- In the field of e-commerce, the model is responsible for generating JSON data of product information. However, due to format drift, unnecessary spaces, line breaks, or missing quotation marks appear in the data output by the model, causing errors in the front-end system's parsing and affecting product display;
Hallucination Interference:
- In the medical report generation scenario, the model needs to output a report containing the patient's basic information, diagnosis results, treatment plan and other fields in a fixed format. However, the model exerts itself when generating the report, fabricating non-key field contents outside the format, such as adding non-existent examination items or treatment recommendations, resulting in distortion of the report content, which may interfere with the doctor's diagnosis;
- In the generation of legal documents, the model needs to output documents according to a standard template. However, due to the interference of hallucinations, the model added fictitious facts or legal provisions irrelevant to the case in the document, resulting in inaccurate document content and affecting the legal validity;

3. Performance is limited by the deployment environment

The knowledge base required by RAG is often private to users, and users prefer to deploy it locally, which means deploying local large models at the same time, and local deployment of large models requires computing power support. It is difficult for ordinary users to have strong computing power. At this time, with computing power resources, they often can only choose some smaller LLM models, such as 7B size models, and the basic capabilities of these smaller LLMs cannot be compared with LLMs of more than 600B, and the capabilities of the models are also weak. Therefore, privacy protection requirements force enterprises to adopt localized deployment solutions, but this creates computing power constraints:

Model scale limit: A single A100 server (80G video memory) supports the deployment of a maximum 70B parameter model, and the cloud can call 600B DeepSeek-R1 at the same cost;
Generational differences in capabilities: For example, the accuracy of the LLaMA-7B model is significantly lower than the accuracy of LLaMA-65B.

Fine-tuning to improve large model capabilities

The problems mentioned above can be solved to a certain extent through targeted fine-tuning strategies, which can significantly improve the domain adaptability and output standardization of the generation module in the RAG system.

🧐 So what is fine-tuning?

Fine-tuning is a lightweight technology based on pre-trained large models and secondary training with data in specific fields to adapt the model to professional scenarios.

Fine-tune core methods

Supervisory fine-tuning (SFT): Inject domain QA data (such as legal cases/medical reports)
Domain adaptation: LoRA low-rank adaptation technology, freezing original parameters + training adaptation layer

Core Advantages

Efficiency: After fine-tuning the 7B model, the performance on specific tasks is comparable to the untuned 70B model
Low cost: Parameter efficient fine-tuning (PEFT) only needs to update 0.1%~20% of the parameters
Controllability: Strengthen structured output and content output constraints (such as JSON format calibration)

Typical application scenarios

(1)Enhanced domain knowledge adaptability

a. Vertical domain knowledge injection: Construct a fine-tuned data set based on domain-specific data (such as medical cases, legal documents, scientific research papers), and use instruction fine-tuning (Instruction Tuning) to enable the model to learn the knowledge expression pattern in the domain. For example, in a medical scenario, use data to train a model to understand the correlation between medical terms.

b. Long-tail knowledge compensation: For rare terms or low-frequency scenes (such as dialect vocabulary, niche cultural concepts), construct question and answer pairs containing such knowledge, and enhance the model's ability to capture long-tail features through fine-tuning. For example, in dialect processing tasks, dialect-standard language control samples are injected into the training data to strengthen the model's generalization ability across language variants.

(2)Structural output control enhancement

a. Explicit learning of format tags: Embed structured tags (such as {{KEY_START}}diagnosis results{{KEY_END}}) in the fine-tuning data to force the model to learn the separated expression of format and content.

b. Anti-interference training: Inject 20%-30% noise samples into the training set (such as randomly deleting quotation marks, disrupting the JSON hierarchy), requiring the model to maintain content accuracy while repairing the format. In this way, the model's robustness to format drift is improved.

c. Lightweight deployment adaptation environment

Small model capability enhancement: Through progressive knowledge transfer, the capabilities learned by the large model in domain tasks are distilled into the small model. For example, use a large model to generate pseudo-labels for domain-related data (you can also directly use annotated data in relevant professional fields), and then fine-tune the small model based on this data, so that the 7B model can approach the performance of the 65B model on specific tasks.

Here we give an example of a specific scenario:

In the medical field, text information in medical reports (for example, "The patient exhibits symptoms of hypertension and the systolic blood pressure is 150mmHg") needs to be converted into JSON format, which contains key information such as symptoms and blood pressure readings.

With fine-tuning, the following benefits can be achieved for this type of task:

1. Accurate information extraction

Fine-tuning can help the model more accurately extract structured information from natural language text, such as symptom type, numerical value, unit, etc.

2. Contextual understanding

Information in natural language often depends on context, and fine-tuning allows the model to better understand these contexts and thus convert information more accurately.

3. Specific format requirements

JSON structured text often has strict formatting requirements, and fine-tuning can help the model generate output that meets these requirements.

4. Handling of domain-specific terms

Fine-tuning can help the model better understand and process professional terms.

Converting natural language to JSON structured text is a typical scenario where fine-tuning can significantly improve the performance and output quality of LLM. Through fine-tuning, the model can learn how to identify and extract key information from natural language and present this information in a predefined structured format.

Let's put the fine-tuned model into RAG, like this:

Principles and methods of LoRA fine-tuning

We mentioned above that fine-tuning can solve various problems, so what specific fine-tuning methods are used?

In response to the inherent challenges of traditional full-parameter fine-tuning in terms of resource consumption and catastrophic forgetting (over-adjustment of pre-training weights will destroy the knowledge expression of the model in general fields), LorRA (Low-Rank Adaptation) fine-tuning provides an effective optimization approach and has become a common method in the field of model fine-tuning.

Overview of LoRA fine-tuning

Let's take LoRA fine-tuning as an example to explain the application of fine-tuning in LazyLLM.

LoRA fine-tuning is an advanced fine-tuning technology designed for Transformer models. Its core idea is to simulate and implement the fine-tuning process by introducing low-rank matrices $A$ and $B$ on the basis of keeping the pre-trained model parameters $W \in \mathbb{R}^{d \times k}$ fixed.

The uniqueness of this method is that it does not directly adjust the parameters $W \in \mathbb{R}^{d \times k}$ in the original model (such as $W_0$, $W_1$, $W_2$, $W_3$), but introduces additional trainable parameters for each layer - the orange matrix $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, only fine-tuning these low-rank matrices.

LoRA fine-tuning steps

Load the pre-trained model: First, select a pre-trained Transformer model as the basis.
Introduce low-rank matrices: In each layer of the model, add trainable low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$.
Fine-tune low-rank matrices: Input task-specific data into the model, and use the backpropagation algorithm to update parameters for low-rank matrices $A$ and $B$.
Evaluation and optimization: Evaluate the performance of the fine-tuned model on the validation set, and further optimize the model based on actual needs.

LoRA mathematical analysis

Going a step further, let's look at the mathematical analysis of LoRA. The LoRA (Low-Rank Adaptation) method is based on the low-rank matrix approximation theory and achieves efficient fine-tuning of parameters by freezing the parameters of the pre-trained model and injecting trainable low-rank matrices. At the mathematical level, for the pre-training weight matrix $W ∈ R^{d×k}$, LoRA decomposes its update amount into the product of two low-rank matrices:

\[ΔW = B ⋅ A\]

Among them, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Such decomposition results in updates with lower rank, thus reducing the number of parameters that need to be trained.

During forward propagation, the output of the model becomes:

\[ h = Wx + \dfrac{\alpha}{r}BAx \]

in:

$W$ is the original weight matrix of the pre-trained model
$x$ is the input
$B$ and $A$ are low-rank matrices (rank $r$)
$\alpha$ is a scaling factor used to flexibly control the adjustment strength of the low-rank matrix to the original weights.
$r$ is rank

1. The influence of rank $r$

Parameter efficiency and model capacity
- $r$ determines the approximation ability of the low-rank matrix $BA$.
  - A larger $r$ allows the matrix to capture more complex updates and improves the model's adaptability to tasks, but will increase the training parameters (the number of parameters is $r \times (\text{dim}_A + \text{dim}_B)$), which may lead to overfitting.
  - A smaller $r$ reduces the number of parameters and improves training speed, but may limit the model's expression ability.
Overfitting and generalization
- Small $r$: more suitable for small data sets, reducing the risk of overfitting, but may underfit complex tasks.
- Large $r$: Suitable for large data sets or complex tasks, but more data is needed to avoid overfitting.
Experience Points: Typically $r = 8$ or $16$ performs well on most tasks.

2. Effect of scaling factor $\alpha$

Update amplitude control
- $\alpha$ adjusts the weight of the low-rank update term $BAx$.
  - Larger $\alpha$ enhances the impact of incremental updates, which may speed up convergence, but may also destroy pre-training knowledge;
  - Smaller $\alpha$ makes updates gentler, retaining more of the original model's capabilities, but requires longer training time.
Synergy with Rank
- The $\frac{\alpha}{r}$ design in the formula balances the effects of different $r$ values. For example, when $r$ increases, dividing by $r$ can prevent the update range from being too large and maintain training stability.
Experience value: Usually $\alpha$ is set to 2 times $r$ (such as $\alpha = 16$)

3. The influence of $\frac{\alpha}{r}$

Update intensity and convergence speed
- $\frac{\alpha}{r}$ directly controls the overall contribution of the low-rank term:
  - High ratio (such as large $\alpha$ or small $r$): strong update intensity, suitable for quickly adapting to new tasks, but may cause gradient instability.
  - Low ratio (such as small $\alpha$ or large $r$): the update is gentle, suitable for retaining pre-training knowledge, and requires longer time for fine-tuning.
Relationship with learning rate
- $\frac{\alpha}{r}$ acts like an adaptive learning rate for low-rank terms.

4. Impact of freezing pre-training weights

Mitigation of catastrophic amnesia
- The original weight $W$ is frozen, and only $B$ and $A$ are trained to avoid destroying pre-training knowledge and significantly reduce the risk of overfitting.
Computational efficiency
- Reduce training parameters to the size of low-rank matrices (such as from hundreds of millions of parameters to millions), saving video memory and computing resources.

Parameter tuning suggestions

Task complexity and data volume
- Simple tasks/small data: Choose small $r$ (e.g. $r=4$) and medium $\alpha$.
- Complex tasks/big data: Try larger $r$ (e.g. $r=32$) and increase $\alpha$.
Resource limitations
- When there is insufficient video memory, priority is given to reducing $r$ instead of reducing model size.

Parameter summary

$r$: balances expression ability and parameter efficiency, determining the flexibility of model fine-tuning.
$\alpha$: adjust the intensity of new knowledge injection, and jointly control training stability with $r$.
$\frac{\alpha}{r}$: directly affects the convergence speed and final effect, and needs to be adjusted according to task requirements.

Fine-tune large models based on LazyLLM

So how should we implement fine-tuning? Here we implement a LoRA fine-tuning based on LazyLLM.

We choose a small model and use the training data in the CMRC2018 data set to fine-tune this small model so that it can have better Chinese reading comprehension information extraction capabilities. The models used in RAG before will answer questions based on the knowledge base information we retrieved. The answers are relatively divergent and the degree of freedom is relatively high. Here we hope that the models in RAG have the following characteristics:

Extract necessary information to answer based on user questions and recalled text paragraphs, without extending the information (corresponding to the above mentioned: weak structured output control and poor adaptability of domain knowledge);
The model should be small, about 7B, for easy deployment, and a model that is not too large should not be used (corresponding to the contradiction between lightweight deployment requirements and reduced model accuracy mentioned above).

Our overall fine-tuning steps are as follows:

Question: The answers to the original RAG medium and large models are divergent and need to be streamlined + accurately follow the original text output;
Goal: Strengthen Chinese reading comprehension information extraction capabilities on the 7B small model *Model selection: InternLM2-Chat-7B.
Data preparation: Use CMRC2018 training data, focusing on Chinese question and answer and information extraction.
Fine-tuning method:
- LoRA strategy: freeze the original weights and only train low-rank matrices (BA) to reduce video memory usage.
- Control output: Strengthen the ability to follow the original content and streamline the output.

Data preparation

Dataset Introduction

We choose the CMRC2018 dataset, which consists of nearly 15,000 real-world questions annotated by human experts on Wikipedia paragraphs. In addition, the corresponding task of this data set is: "Span-Extraction Reading Comprehension". Based on a given document and a question, the model needs to extract the answer to the question from the document, where the answer is a continuous fragment of the article.

Dataset size

The CMRC2018 data set mainly consists of three parts: test, validation and train. Here we use the train part as our training set, the test part as our evaluation set, and we do not use the validation part.

Dataset	Number of essays	Number of questions	Purpose
test	256	1002	evaluation set
train	2403	10142	training set
validation	848	3219	not used

Note:

The knowledge base in the previous tutorial was built based on the passages in the test data set in CMRC2018;

Here we use the train data set in CMRC2018 for training;

During the evaluation process, in order to avoid interference from the retrieval module in the RAG system, we use the control variable method, assuming that 100% recall is no problem, that is: we directly use the passage fragments and questions in the test data set in CMRC2018 to splice them together as the input of the large model LLM. The variables are two 7B models before and after fine-tuning to compare and see whether the fine-tuned model has improved its ability to extract Chinese reading comprehension information;

Data structure

The structures of test and train in the data set are consistent. We extract one of the data and see as follows:

[
    {
        "id": "TRIAL_154_QUERY_0",
"context": "Eugene Kangaroo (\"Macropus Eugenii\") is a small member of the Kangaroo family, and is usually the subject of research on kangaroos and marsupials. Eugene's kangaroos are distributed in the southern islands and west coast of Australia. Because they breed in large numbers on Kangaroo Island every season, destroying the living environment on Echidna Island, they are considered a pest. The Eugene's kangaroo was first discovered in Western Australia by survivors of a shipwreck in 1628. It is the earliest recorded discovery of kangaroos by Europeans, and may be the earliest discovery. There are three subspecies of Australian mammals: the Eugene kangaroo is very small, weighing only about 8 kilograms, and is suitable for breeding. There is a substance in the milk of the Eugene kangaroo, which may be a miracle drug and an improvement on penicillin. AGG01 is a protein that has been proven to be 100 times more effective than penicillin and can kill 99% of bacteria and fungi, such as Salmonella, Proteus vulgaris and Staphylococcus aureus.
"question": "Where are Eugene's kangaroos distributed?",
        "answers": {
            "text": [
"Eugene's kangaroo is distributed in the southern islands and west coast of Australia"
            ],
            "answer_start": [
                52
            ]
        }
    },
...

In the above data:

context: is a text segment;
question: is a question directed at the text segment;
answers: Answers to corresponding questions are given, including:
text: The specific content of the answer, derived from the text segment;
answer_start: The starting position of the answer in the text paragraph;

Training set processing

We extract the article field context and the question field question from the original train data set, and splice them into the instruction field for fine-tuning. The splicing template is: "Please use the original text of the following paragraph to answer the question\n\n### Known paragraph: {context}\n\n### Question: {question}\n".
Use the answers and its text fields in the original data as the fine-tuned output field
Since fine-tuning also requires an input field, which we do not need for this task, it is set to empty.

Code Github link🔗

# Template for constructing QA prompts
template = "Please use the original text of the following paragraph to answer the question\n\n### Known paragraph: {context}\n\n### Question: {question}\n"

def build_train_data(data):
    """Format training data using predefined template"""
    extracted_data = []
    for item in data:
        extracted_item = {
            "instruction": template.format(context=item["context"], question=item["question"]),
            "input": "",
            "output": item["answers"]["text"][0]
        }
        extracted_data.append(extracted_item)
    return extracted_data

We take a piece of processed data as follows:

[
    {
"instruction": "Please use the original text of the following paragraph to answer the question\n\n### Known text: Huangdu (scientific name:) is a perennial twining vine with spherical or conical tubers in the leaf axils, with spherical or oval bulbils of varying sizes and yellowish-brown outer skin. It has checkered small transverse veins, with 7-9 obvious veins at the base; the base of the petiole is twisted and slightly wider, as long as the leaves or slightly shorter. It blooms in summer and autumn, and is dioecious. The fruiting period is from September to October in Oceania, Korea, and Africa. Asia, India, Japan, Taiwan, Myanmar and China's Jiangsu, Guangdong, Guangxi, Anhui, Jiangxi, Sichuan, Gansu, Yunnan, Hunan, Tibet, Henan, Fujian, Zhejiang, Guizhou, Hubei, Shaanxi and other places, growing at an altitude of 300 meters to 2,000 meters In rice-rich areas, it mostly grows along river valleys, valley ditches or on the edges of mixed woods. It has not yet been artificially introduced and cultivated in the Americas. It is an exotic species to the Americas and has the opportunity to multiply in farmland and climb tall trees to compete for sunlight. Potato. Huangyao (original Materia Medica), Shanzigu (plant names and facts), Lingyuzi Dioscorea (Russian, Latin, and Chinese seed plant names), Lingyuzi (Fora of Guangzhou, Hainan Flora), Huangyaozi (name of medicinal materials in Jiangsu, Anhui, Zhejiang, Yunnan and other provinces), Shanzigu (Chuxiong, Yunnan)\n\n### Question: What is the color of the skin of Huangdu? \n",
        "input": "",
"output": "Tawny skin"
    },
...

Evaluation set processing

The fields in our evaluation set are basically retained. Only the answers field is processed and the content is extracted:

Code Github link🔗

def build_eval_data(data):
    """Extract necessary fields for evaluation dataset"""
    extracted_data = []
    for item in data:
        extracted_item = {
            "context": item["context"],
            "question": item["question"],
            "answers": item["answers"]["text"][0]
        }
        extracted_data.append(extracted_item)
    return extracted_data

We take a piece of processed data as follows:

[
    {
"context": "Furong Cave is located on the banks of the Furong River in Jiangkou Town, Wulong County, Chongqing, 20 kilometers away from Wulong County. Furong Cave was discovered in 1993 and opened to tourists in 1994. It was listed as a national 4A tourist attraction in China in 2002, and in June 2007, it became a component of the Southern China Karst-Wulong Karst. It is the first karst cave in China to be listed as a World Natural Heritage. It is 2,846 meters long and famous for its numerous shafts and complete types of cave deposits. There is a rare group of karst shafts in the world within an area of about 20 square kilometers. There are at least 50 shafts over 100 meters scattered, among which Qikeng Cave has a depth of 920 meters, the longest in Asia. There are more than 70 kinds of sediments in Furong Cave, including almost all the karst cave sedimentary types named by scientists, among which the pool sediment is the essence. There is a "gypsum" at the east end of Furong Cave. "Huazhi Cave", the antler-shaped curled stone branches in the cave are 57 centimeters long, the longest in the world. The gypsum Huazhi Cave is currently permanently sealed. A Jiangkou Power Station Dam has been built on the Furong River adjacent to the entrance of Furong Cave. The impact of the reservoir's impoundment on groundwater circulation and the evolution of the karst landscape is currently difficult to estimate.",
"question": "Where is Furong Cave located?",
"answers": "The banks of Furong River in Jiangkou Town, Wulong County, Chongqing"
    },
 ...

Fine-tuning the model

After the data processing is complete, we can start fine-tuning.

It is worth noting: LazyLLM supports fine-tuning, deployment, and inference in one package!

The fine-tuning related configuration code is mainly as follows:

Code GitHub link🔗

import lazyllm
from lazyllm import finetune, deploy, launchers

model = lazyllm.TrainableModule(model_path)\
    .mode('finetune')\
    .trainset(train_data_path)\
    .finetune_method((finetune.llamafactory, {
        'learning_rate': 1e-4,
        'cutoff_len': 5120,
        'max_samples': 20000,
        'val_size': 0.01,
        'per_device_train_batch_size': 2,
        'num_train_epochs': 2.0,
        'launcher': launchers.sco(ngpus=8)
    }))\
    .prompt(dict(system='You are a helpful assistant.', drop_builtin_system=True))\
    .deploy_method(deploy.Vllm)
model.evalset(eval_data)
model.update()

In the above code, LazyLLM's TrainableModule is used to implement: fine-tuning->deployment->inference:

Model configuration:
- model_path specifies the model we want to fine-tune. Here we use Internlm2-Chat-7B and directly specify its path;
Fine-tuned configuration:
- .mode sets the startup fine-tuning mode finetune;
- .trainset sets the data set path for training. What is used here is the training set we processed earlier;
- .finetune_method sets which fine-tuning framework to use and its parameters. A tuple is passed in here (only two elements can be set):
  - The first element specifies the fine-tuning framework used is Llama-Factory: finetune.llamafactory
  - The second element is a dictionary containing the parameter configuration of the fine-tuning framework;
Inference configuration:
- .prompt sets the Prompt used during inference. Note that in order to be consistent with the system field in the fine-tuned Prompt, drop_builtin_system is turned on to replace the original system-prompt with `You are a helpful assistant.`
- .deploy_method sets the inference framework for deployment, and the vLLM inference framework is specified here;
Evaluation configuration:
- Here we use .evalset to configure the evaluation set we processed before;
Start task:
- .update triggers the start of the task: the model is fine-tuned first. After the fine-tuning is completed, the model will be deployed. After deployment, it will automatically use the evaluation set to go through inference to obtain the results;

Some key parameters in fine-tuning are as follows:

Parameters	Function	Recommended settings	Tuning suggestions
learning_rate	Control parameter update range	1e-4~5e-5	Take the smaller value for large models
cutoff_len	Maximum context length	5120	Adjusted according to GPU memory
max_samples	Maximum training sample size	20000	Note that too small will result in too little training data,
per_device_train_batch_size	Single card batch size	2	Decrease when there is insufficient video memory
num_train_epochs	Training rounds	2.0	Set according to task loss reduction

In addition, we can also configure some LoRA-related parameters (LazyLLM has set a set of experience parameters by default, so it is not reflected in the above code. Here we show it as follows. You can try various parameters to make elixirs:

Parameters	Function	Recommended settings	Description
lora_alpha	LoRA scaling factor	16	-
lora_dropout	LoRA dropout rate	0.0	-
lora_rank	LoRA rank	8	-
lora_target	LoRA target module of the total model	all	All linear modules in the model.

Effect evaluation

After fine-tuning in the previous step and obtaining the inference results of the evaluation set, we need to compare the results with the correct answers of the evaluation set to confirm the effect of our fine-tuning.

Evaluation purpose

Verify the fine-tuning effect: Compare the model output with the standard answer, and quantify the model optimization results (whether it surpasses the general large model)
Task Adaptability Test: Common indicators (faithfulness, Answer Relevance) are not applicable to the "Chapter Fragment Extraction" task and require customized evaluation
Optimization Direction Guidance: Locate model shortcomings through indicator differences (such as complete consistency, semantic accuracy, original text dependence)

Evaluation indicator design

Here, based on the characteristics of our tasks, three-dimensional evaluation indicators are designed:

Exact Match
Semantic similarity (Cosine Score)
Origin Score

Let us now introduce the design details of these three indicators in detail:

1. Exact matching rate

We define exact matching as follows:

\[ EM = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(y_i = \hat{y}_i) \]

where:

$N$: total number of test samples;
$y_i$: the standard answer of the $i$th sample;
$\hat{y}_i$: model prediction results;
$\mathbb{I}$: indicator function (takes 1 when there is a complete match, otherwise takes 0);

The characteristic of this indicator is that the prediction result and the standard answer need to be completely consistent

Without considering the calculation of the average, the code to implement only one of them is simple, as follows:

exact_score = 1 if output == true_v else 0

Code function:

This code is used to determine whether the model's prediction result (output) for a single sample is completely consistent with the standard answer (true_v) of the sample, and give an exact matching score (exact_score) accordingly.

Code explanation:

output: The prediction result of the model for a certain sample.
true_v: The standard answer for this sample.
exact_score: Exact match score, value is 1 or 0. If the prediction result is completely consistent with the standard answer, the score is 1; otherwise, the score is 0.

2. Semantic similarity

We define semantic similarity as follows:

\[ CS = \dfrac{1}{N} \sum_{i=1}^N \max\left(0, \min\left(1, \dfrac{emb(y_i) \cdot emb(\hat{y}_i)}{\|emb(y_i)\| \cdot \|emb(\hat{y}_i)\|}\right)\right) \]

where:

$N$: total number of test samples;
$y_i$: the standard answer of the $i$th sample;
$\hat{y}_i$: model prediction results;
$emb()$: is a vector encoding based on the BGE model (bge-large-zh-v1.5), which can encode natural language into a vector, that is:

$$ emb(text) = BGE_Encoder(text) $$

It is worth noting that this evaluation index truncates the original value of [-1,0). As long as the semantics of negative correlation are 0 points, the score can only be positive correlation.

The corresponding code is implemented as follows. There is no averaging here, just one of the items. It is also assumed that the text has been vectorized by the BGE model:

import numpy as np

def cosine(x, y):
    """Calculate cosine similarity between two vectors"""
    product = np.dot(x, y)
    norm = np.linalg.norm(x) * np.linalg.norm(y)
    raw_cosine = product / norm if norm != 0 else 0.0
    return max(0.0, min(raw_cosine, 1.0))

Code function:

This code implements a function that calculates the cosine similarity between two vectors, and is particularly useful for evaluating the semantic similarity between model predictions and standard answers. This is the core calculation part in the above exact matching definition (CS) and is used for similarity calculation of a single sample.

Code explanation:

$x$ and $y$: represent two vectors respectively. In practical applications, these two vectors are usually the result of vectorizing text by the BGE model, namely $emb(y_i)$ and $emb(\hat{y}_i)$.
$\text{np.dot}(x, y)$: Calculate the dot product of two vectors.
$\text{np.linalg.norm}(x)$ and $\text{np.linalg.norm}(y)$: Calculate the L2 norm (i.e. Euclidean norm) of vectors $x$ and $y$ respectively.
$\frac{\text{product}}{\text{norm}}$: Calculate the original cosine similarity value.
$\mathrm{raw\_cosine}$: Stores the original cosine similarity value, but returns 0.0 directly if the denominator is 0 (that is, at least one of the two vectors is a zero vector).
$\max(0.0, \min(\mathrm{raw\_cosine}, 1.0))$: Truncate the original cosine similarity value to ensure that the result is within the range of $[0, 1]$, and the semantics of negative correlation are scored 0 points.

3. Inclusion of the original text

We define the original text inclusion degree as follows:

\[ OS = \dfrac{1}{N} \sum_{i=1}^N \mathbb{I}\left(\forall w \in \hat{y},\ w \in Context \right) \]

where:

$N$: total number of test samples;
$\hat{y}_i$: model prediction results;
$w$: represents each word in the prediction result;
$Context$: article content;
$\mathbb{I}$: indicator function (takes 1 when all words appear in the original text, otherwise takes 0 if there are more words than the original text).

We implement the corresponding code as follows. This does not include averaging, only one of them is involved:

def check_words_from_content(infer, content):
    """Check if all words in inference output exist in original context"""
    return 1 if all(w in content for w in infer.split()) else 0

Code function:

This code implements a function check_words_from_content, which is used to check whether all words in the model prediction results appear in the original content. This is the core calculation part in the above definition of original inclusion (OS), which is used to determine the inclusion of a single sample.

Code explanation:

infer: The prediction result of the model for a certain sample, that is, $\hat{y}$.
content: The original content of the sample, that is, $Context$.
infer.split(): Split the prediction results into a list of single words.
all(w in content for w in infer.split()): Use Python's all function and a generator expression to check whether each word or vocabulary in infer is present in content. Returns True if all words or terms exist; otherwise returns False.
return 1 if ... else 0: Returns 1 or 0 depending on the check result, as the implementation of the indicator function $\mathbb{I}$. Returns 1 if all words or terms are present in the original text; otherwise returns 0.

4. Comparative analysis of indicators

We will conduct a comparative analysis of the above three evaluation indicators:

Indicator dimensions	Value range (single item)	Ideal value (single item)	Numerical characteristics (single item	Advantages	Limitations	Task evaluation dimension description
Exact match rate	{0, 1}	1.0	Binary judgment	The result is clear and unambiguous	Zero tolerance for differences in expression	How much reasoning can be 100% faithful to the answer
Semantic similarity	[0, 1]	1.0	Continuous value	Capture semantic similarity	Depend on encoding model quality	Due to the selection of fragment range or the change of expression, similarity is used to evaluate different expressions of the same content;
Original text inclusion degree	{0, 1}	1.0	Binary judgment	Ensure answers are faithful to the original text	Ignore reasonable synonymous substitutions	Task requirements, answers must be in the original text, this indicator can reflect whether the answers are all derived from the original text;

Comprehensive evaluation script

Let us bring all the above evaluation indicators together to achieve the following complete evaluation:

Code GitHub link🔗

def caculate_score(eval_set, infer_set):
    """Calculate three evaluation metrics: exact match, cosine similarity, and word containment"""
    assert len(eval_set) == len(infer_set)

    # Initialize embedding model
    m = lazyllm.TrainableModule('bge-large-zh-v1.5')
    m.start()

    accu_exact_score = 0
    accu_cosin_score = 0
    accu_origi_score = 0
    res = []
    for index, eval_item in enumerate(eval_set):
        output = infer_set[index].strip()
        true_v = eval_item['answers']
        # Exact match scoring:
        exact_score = 1 if output == true_v else 0
        accu_exact_score += exact_score
        # Cosine similarity scoring:
        outputs = json.loads(m([output, true_v]))
        cosine_score = cosine(outputs[0], outputs[1])
        accu_cosin_score += cosine_score
        # Word containment scoring:
        origin_score = check_words_from_content(output, eval_item['context'])
        accu_origi_score += origin_score
        res.append({'context':eval_item['context'],
                    'true': true_v,
                    'infer':output,
                    'exact_score': exact_score,
                    'cosine_score': cosine_score,
                    'origin_score': origin_score})
    save_res(res, 'eval/infer_true_cp.json')
    total_score = len(eval_set)
    return (f'Exact Score : {accu_exact_score}/{total_score}, {round(accu_exact_score/total_score,4)*100}%\n'
            f'Cosine Score: {accu_cosin_score}/{total_score}, {round(accu_cosin_score/total_score,4)*100}%\n'
            f'Origin Score: {accu_origi_score}/{total_score}, {round(accu_origi_score/total_score,4)*100}%\n')

In the above code:

First pass in the test set and the results of inference, and ensure that the two sets are the same size;
Then in order to implement text vectorization here, we use LazyLLM's TrainableModule to load a bge-large-zh-v1.5 model, and use .start to deploy it;
Then we go through all the data, calculate each score under the three indicators, and accumulate them;
Finally we save all results, calculate the final evaluation result and return it as a string.

Comparison of evaluation results

Here we compare the model Internlm2-Chat-7B before fine-tuning and the model after fine-tuning. We also compare the online model DeepSeek-V3.

In the results in the table below, the scores are in brackets, the total score is 1002 points, and the percentage is the score as a percentage of the total score.

Model	Exact match rate	Semantic similarity	Original text inclusion
Internlm2-Chat-7B	2.10%(21)	74.51%(746.6)	5.19%(52)
DeepSeek-V3	5.29%(53)	74.85%(750.0)	15.17%(152)
After Internlm2-Chat-7B training	39.72%(398)	86.19%(863.6)	94.91%(951)

From the above evaluation results, we can see that the model after fine-tuning has significantly better indicators than before fine-tuning, even for large online models.

Exact Match Rate Leap
1. After fine-tuning, the improvement was 37.62 percentage points (2.10% → 39.72%), nearly 8 times higher than the online model
2. Explain that the model learns to follow a specific answer format
Semantic relevance optimization
1. The similarity increased by 11.68 percentage points (74.51% → 86.19%)
2. Compared with online models (large models with more than 600 B): +11.34 percentage points advantage
Qualitative change in original text dependence
1. The inclusion rate jumped from 5.19% to 94.91%, an increase of 18.3 times
2. Show that the model has been mastered:
  - ✅ Key information positioning capabilities
  - ✅ Original text extraction strategy
  - ✅ Knowledge boundary control (avoiding hallucinations)

Based on the experimental data we can conclude:

In the RAG system, under the premise that the recall module is accurate and the recall is correct, the degree of adaptation of the model in the generation module to the task greatly affects the final effect. Fine-tuning is an effective means to improve the model's ability to adapt to downstream tasks, and even the ability improved through fine-tuning can surpass the general large model.

General evaluation comparison

Here we evaluate the model based on the evaluation indicators of the two general generation modules introduced in Retrieve with Higher Accuracy, and compare them with the evaluation indicators we designed.

From the data in the table above, we can see that the general evaluation index shows that after fine-tuning, the model's answer correlation has dropped significantly. Under this evaluation index, it means that fine-tuning has failed, but is this really the case? Let’s extract the data from the evaluation process and take a look:

After fine-tuning InternLM2-Chat-7B, answer correlation

{
"context": "Based on the games developed on \"Kart Runner\" and \"Bubble Hall\", it is developed and published by South Korea's Nexon. Mainland China is operated by Shanda Games. This is the first time Nexon has granted Shanda Network its game operating rights again after 6 years. Taiwan is operated by Game Orange. Players use water guns, small guns, hammers or water bombs to soak enemies (players or NPCs), which is a bubble seal, and the bubble is broken into a kick. If the bubble is not there at the time If kicked out within a certain period of time, the number of lives will be reduced, and the number of lives will be exhausted. The reborn person will be invincible for a certain period of time, and the player with the most points will win. The rules vary depending on the mode. In 2V2 and 4V4 random matching, players can climb up the ranking list (in order: rough stone, bronze medal, silver medal, gold medal, platinum, diamond, and master). , you can choose classic, hot-blooded, sniper and other modes to play. If you are in the game, you will not be able to match within 4 minutes (each time you are in the game + 4 minutes). The opening time is from time to time during the summer or winter vacation. The 8-player classic mode is randomly matched and the points are scored. The more points you get during the event, the rewards will be obtained at the end. ",
    "exact_score": 1,
    "cosine_score": 0.9999,
    "origin_score": 1,
"question": "What does it count as if the lives are exhausted?",
"true_answer": "Kick blast",
"answer": "Kick blast",
    "infer_questions": [
        {
"question": "\nWhat does the word kick mean?",
            "score": 0.3781
        },
        {
"question": "\nWhat does kicking mean? In what context is this word used?",
            "score": 0.3825
        },
        {
"question": "\nWhat does kicking mean? In what context is this word usually used?",
            "score": 0.3829
        }
    ],
    "final_score": 0.3812
},

DeepSeek-V3, Answer Relevance

{
"context": "Based on the games developed on \"Kart Runner\" and \"Bubble Hall\", it is developed and published by South Korea's Nexon. Mainland China is operated by Shanda Games. This is the first time Nexon has granted Shanda Network its game operating rights again after 6 years. Taiwan is operated by Game Orange. Players use water guns, small guns, hammers or water bombs to soak enemies (players or NPCs), which is a bubble seal, and the bubble is broken into a kick. If the bubble is not there at the time If kicked out within a certain period of time, the number of lives will be reduced, and the number of lives will be exhausted. The reborn person will be invincible for a certain period of time, and the player with the most points will win. The rules vary depending on the mode. In 2V2 and 4V4 random matching, players can climb up the ranking list (in order: rough stone, bronze medal, silver medal, gold medal, platinum, diamond, and master). , you can choose classic, hot-blooded, sniper and other modes to play. If you are in the game, you will not be able to match within 4 minutes (each time you are in the game + 4 minutes). The opening time is from time to time during the summer or winter vacation. The 8-player classic mode is randomly matched and the points are scored. The more points you get during the event, the rewards will be obtained at the end. ",
    "exact_score": 0,
    "cosine_score": 0.6646,
    "origin_score": 1,
"question": "What does it count as if the lives are exhausted?",
"true_answer": "Kick blast",
"answer": "When the number of lives is exhausted, it will be counted as a kick.",
    "infer_questions": [
        {
            "question": "\nIn game terminology, what does \"when the number of lives is exhausted is counted as a kick\" mean?",
            "score": 0.7191
        },
        {
"question": "\nUnder what circumstances will it be considered a kick? What will happen when the life count is exhausted?",
            "score": 0.6896
        },
        {
"question": "\nIn the game rules, what happens after the number of lives is exhausted?",
            "score": 0.7918
        }
    ],
    "final_score": 0.7335
},

In the above comparison, true_answer is the annotated answer, answer is the answer of model inference, question is the question, and answer correlation needs to generate possible questions (question in infer_questions) based on the answer answer of model inference. Here are three, and then let the possible questions and the real questions be vectorized to find the cosine similarity. Under the short answer, it is difficult for the model to infer questions related to the real question question.

The answer of InternLM2-Chat-7B after fine-tuning is: "Kicked"
The answer of DeepSeek-V3 is: "When the number of lives is exhausted, it will be counted as a kick."

It can be seen that because DeepSeek-V3 provides more information, the questions inferred by the evaluation model are more accurate, so the score is higher. But actually this is not what we expected, we wanted it to be short and precise, just like the standard answer "kick it to the punch"! Correspondingly, the semantic similarity cosine_score and exact_score we use here reflect the expectations very well.

Faithfulness makes little difference here:

After fine-tuning InternLM2-Chat-7B, faithfulness:

{
    "context": "China (Nanjing) Chess Super Competition (Pearl Spring Super Tournament), formerly known as the China (Nanjing) Chess Grandmaster Invitational Tournament, the first competition was held at the Mingfa Pearl Spring Hotel in Pukou District, Nanjing from December 11 to 22, 2008. This competition was hosted by the Nanjing Municipal People's Government and the Chess and Card Sports Management Center of the State Sports General Administration, and was hosted by the Pukou District People's Government, Organized by the Nanjing Municipal Sports Bureau and co-organized by Kangyuan Pharmaceutical Co., Ltd., Yangtze Evening News, and Mondale International Entrepreneurship University, it is designated as a level 21 event by FIDE and is the highest level chess competition held in Asia so far. The competition will consist of ten rounds, with the first five rounds from December 11th to 15th, and a break on the 16th and 17th. The last 5 rounds were played on the 21st. Each side had 90 minutes and 30 seconds added to each move. The total prize money was 250,000 euros, of which the winner was 80,000 euros, and the second to sixth place were 55,000 euros, 30,000 euros, 25,000 euros, and 20,000 euros. As a result, Topalov won the championship, Aronyan won the runner-up, and Bu Xiangzhi won the third place. Name. It was admitted as a Grand Slam event on February 1, 2009 and was renamed the China (Nanjing) Chess Super Competition. The second competition was held from September 27 to October 9, 2009. The \"Kangyuan Pharmaceutical Cup\" 2010 China (Nanjing) Chess Super Competition was held from October 19 to 30, 2010.",
    "exact_score": 1,
    "cosine_score": 0.9999,
    "origin_score": 1,
    "question": "Where will the first competition be held?",
    "true_answer": "Mingfa Pearl Spring Hotel, Pukou District, Nanjing",
    "answer": "Mingfa Pearl Spring Hotel, Pukou District, Nanjing",
    "statements": "\nThe first competition was held in Pukou District, Nanjing City.|||The specific location of the competition is Mingfa Pearl Spring Hotel.",
    "scores": [
        {
            "statement": "The first competition was held in Pukou District, Nanjing.",
            "score": 1
        },
        {
            "statement": "The specific venue of the competition is Mingfa Pearl Spring Hotel.",
            "score": 1
        }
    ],
    "final_score": 1.0
},

DeepSeek-V3, faithfulness:

{
    "context": "China (Nanjing) Chess Super Competition (Pearl Spring Super Tournament), formerly known as the China (Nanjing) Chess Grandmaster Invitational Tournament, the first competition was held at the Mingfa Pearl Spring Hotel in Pukou District, Nanjing from December 11 to 22, 2008. This competition was hosted by the Nanjing Municipal People's Government and the Chess and Card Sports Management Center of the State Sports General Administration, and was hosted by the Pukou District People's Government, Organized by the Nanjing Municipal Sports Bureau and co-organized by Kangyuan Pharmaceutical Co., Ltd., Yangtze Evening News, and Mondale International Entrepreneurship University, it is designated as a level 21 event by FIDE and is the highest level chess competition held in Asia so far. The competition will consist of ten rounds, with the first five rounds from December 11th to 15th, and a break on the 16th and 17th. The last 5 rounds were played on the 21st. Each side had 90 minutes and 30 seconds added to each move. The total prize money was 250,000 euros, of which the winner was 80,000 euros, and the second to sixth place were 55,000 euros, 30,000 euros, 25,000 euros, and 20,000 euros. As a result, Topalov won the championship, Aronyan won the runner-up, and Bu Xiangzhi won the third place. Name. It was admitted as a Grand Slam event on February 1, 2009 and was renamed the China (Nanjing) Chess Super Competition. The second competition was held from September 27 to October 9, 2009. The \"Kangyuan Pharmaceutical Cup\" 2010 China (Nanjing) Chess Super Competition was held from October 19 to 30, 2010.",
    "exact_score": 0,
    "cosine_score": 0.5574,
    "origin_score": 1,
    "question": "Where will the first competition be held?",
    "true_answer": "Mingfa Pearl Spring Hotel, Pukou District, Nanjing",
    "answer": "The first competition was held at the Mingfa Pearl Spring Hotel in Pukou District, Nanjing from December 11 to 22, 2008.",
    "statements": "\nThe first competition was held from December 11 to 22, 2008.|||The competition location is Pukou District, Nanjing City.|||The specific venue is Mingfa Pearl Spring Hotel.",
    "scores": [
        {
            "statement": "The first competition was held from December 11 to 22, 2008.",
            "score": 1
        },
        {
            "statement": "The competition location is Pukou District, Nanjing City.",
            "score": 1
        },
        {
            "statement": "The specific venue is Mingfa Pearl Spring Hotel.",
            "score": 1
        }
    ],
    "final_score": 1.0
},

In the above comparison, we can see that although DeepSeek-V3 provides more information, it is the same as the fine-tuned model. It basically uses the content in the original text to answer the questions as required. Therefore, although it has more statements (3), the final score is the same as the fine-tuned model. This explains why the final faithfulness scores are indistinguishable. It is worth noting that cosine_score semantic similarity and exact_score exact matching are used here to distinguish the difference between the two.

From the comparison results of the two indicators above, we can see that if there are specific task requirements for the generation module in the RAG system, the commonly used evaluation indicators cannot be measured well. At this time, the evaluation indicators need to be designed according to the final effect requirements of the task!

Use fine-tuned large models in RAG

LazyLLM supports one-stop fine-tuning, deployment, and inference, but what should you do if you have fine-tuned a large model and want to use it directly? It's very simple: the base_model remains unchanged, and you can use target_path to specify the fine-tuned model path, as shown below:

base_model = 'internlm2-chat-7b'
sft_model = '/path/to/sft/internlm2-chat-7b'

llm = lazyllm.TrainableModule(base_model, sft_model)

Taking the basic RAG as an example:

import os
import lazyllm

prompt = ('You will act as an AI question-answering assistant and complete a dialogue task.'
          'In this task, you need to provide your answers based on the given context and questions.')

base_model = 'internlm2-chat-7b'
sft_model = '/path/to/sft/internlm2-chat-7b'
llm = lazyllm.TrainableModule(base_model, sft_model)

documents = lazyllm.Document(dataset_path=os.path.join(os.getcwd(), "KB"), embed=embed = lazyllm.TrainableModule('bge-large-zh-v1.5'), manager=False)
documents.create_node_group(name='split_sent', transform=lambda s: s.split('\n'))
with lazyllm.pipeline() as ppl:
    ppl.retriever = lazyllm.Retriever(
        doc=documents, group_name="split_sent", similarity="cosine", topk=1, output_format='content', join='')
    ppl.formatter = (lambda nodes, query: dict(context_str=nodes, query=query)) | lazyllm.bind(query=ppl.input)
    ppl.llm = llm.prompt(lazyllm.ChatPrompter(instruction=prompt, extra_keys=['context_str']))

ppl.start()

Experimental process

The experimental processes of the three models are shown in the following videos:

Deployment, reasoning and evaluation of Internlm2-Chat-7B:

Fine-tuning, deployment, reasoning and evaluation of Internlm2-Chat-7B:

Reasoning and evaluation of DeepSeek-V3:

Fine-tune the Embedding model based on LazyLLM

In the RAG (Retrieval-Augmented Generation) system, the Embedding model plays a key role:

Semantic encoding: Convert text data into high-dimensional vector representation, retaining semantic information
Similarity calculation: Efficient correlation retrieval through vector cosine similarity
Knowledge Base Index: Pre-coded document library to create a vector index for fast retrieval
Query Understanding: Convert user query into vector and match the most relevant knowledge fragments

In this tutorial, we will use BAAI's bge-large-zh-v1.5 as the basic model to improve vertical field effects through financial field data fine-tuning (SFT).

Data preparation

In embedding learning, our goal is to:

Semantically similar samples (positive pairs) are closer in vector space.
Semantically irrelevant or opposite samples (negative pairs) are farther away.

The role of negative samples is to provide a comparison reference to let the model know "which ones should not be close." For example, when our query is "What is ChatGPT?", the doc of our positive sample is "ChatGPT is a language model developed by OpenAI, based on the Transformer architecture...", and the doc of our negative sample can be set to "Midjourney is an AI image generation model...".

Here we use the financial question and answer data set: virattt/financial-qa-10K for demonstration:

Data processing flow:

Load the original data set
Generate negative samples (10 negative examples for each sample)
Create training set/evaluation set split (9:1 ratio)
Build knowledge base files

The main code implementation is as follows:

def build_dataset_corpus(instruction: str, neg_num: int = 10, test_size: float = 0.1, seed: int = 1314) -> tuple:
    """Process dataset and create training/evaluation files.

    Args:
        instruction (str): Instruction template for prompts
        neg_num (int): Number of negative samples per instance
        test_size (float): Proportion of data for test split
        seed (int): Random seed for reproducibility

    Returns:
        tuple: Paths to training data, evaluation data, and knowledge base directory
    """
    # Load and preprocess dataset
    ds = load_dataset("virattt/financial-qa-10K", split="train")
    ds = ds.select_columns(column_names=["question", "context"])
    ds = ds.rename_columns({"question": "query", "context": "pos"})

    # Generate negative samples
    np.random.seed(seed)
    new_col = []
    for i in range(len(ds)):
        ids = np.random.randint(0, len(ds), size=neg_num)
        while i in ids:  # Ensure no self-match in negatives
            ids = np.random.randint(0, len(ds), size=neg_num)
        neg = [ds[int(i)]["pos"] for i in ids]
        new_col.append(neg)

    # Create dataset splits
    ds = ds.add_column("neg", new_col)

    def str_to_lst(data):
        data["pos"] = [data["pos"]]
        return data
    ds = ds.map(str_to_lst)  # Convert pos to list format
    ds = ds.add_column("prompt", [instruction] * len(ds))
    split = ds.train_test_split(test_size=test_size, shuffle=True, seed=seed)

    # Save training data
    train_data_path = build_data_path('dataset', 'train.json')
    split["train"].to_json(train_data_path)

    # Process and save evaluation data
    test = split["test"].select_columns(["query", "pos"]).rename_column("pos", "corpus")
    eval_data_path = build_data_path('dataset', 'eval.json')
    test.to_json(eval_data_path)

    # Create knowledge base
    kb_data_path = build_data_path('KB', 'knowledge_base.txt')
    corpus = "\n".join([''.join(item) for item in test['corpus']])
    with open(kb_data_path, 'w', encoding='utf-8') as f:
        f.write(corpus)

    return train_data_path, eval_data_path, os.path.dirname(kb_data_path)

After processing, a piece of data from the training set is as follows (json file):

{"query":"What was the total stockholder's equity (deficit) for Peloton Interactive, Inc. as of June 30, 2021?","pos":["As of June 30, 2021, Peloton Interactive, Inc.'s consolidated statements reflected a total stockholder's equity (deficit) of $1,754.1 million."],"neg":["In June 2023, the company entered into an ASR agreement to repurchase $500 million of its common stock with a completion date no later than August 2023, and in 2024, the company expects to repurchase $2.0 billion of its common stock.",...,"\u2022Overhead costs as a percentage of net sales increased 40 basis points due to wage inflation and other cost increases, partially offset by the positive scale impacts of the net sales increase and productivity savings."],"prompt":"Represent this sentence for searching relevant passages: "}

The following fields need to be included:

query: (str) User question
pos: (List[str]) Correct answer paragraph
neg: (List[str]) Randomly sampled negative samples
prompt: (str) command template

A piece of data from the evaluation set is as follows (json file):

{"query":"How have certain vendors been impacted in the supply chain financing market?","corpus":["Certain vendors have been impacted by volatility in the supply chain financing market."]}

The following fields need to be included:

query: user questions
corpus: the correct text fragment corresponding to the question

The knowledge base part is as follows (txt file):

Certain vendors have been impacted by volatility in the supply chain financing market.
Recruitment As the demand for global technical talent continues to be competitive, we have grown our technical workforce and have been successful in attracting top talent to NVIDIA. We have attracted strong talent globally with our differentiated hiring strategies for university, professional, executive and diverse recruits. The COVID-19 pandemic created expanded hiring opportunities in new geographies and provided increased  flexibility for employees to work from locations of their choice. Our workforce is about 80% technical and about 50% hold advanced degrees.
In 2023, Moody’s Investors Service upgraded AbbVie’s senior unsecured long-term credit rating to A3 with a stable outlook from Baa1 with a positive outlook.

Fine-tuning process

Distributed fine-tuning through the LazyLLM framework:

embed = lazyllm.TrainableModule(embed_path)\
    .mode('finetune').trainset(train_data_path)\
    .finetune_method((
        lazyllm.finetune.flagembedding,
        {
            'launcher': lazyllm.launchers.remote(nnode=1, nproc=1, ngpus=4),
            'per_device_train_batch_size': 16,
            'num_train_epochs': 2,
        }
    ))

docs = Document(kb_path, embed=embed, manager=False)
docs.create_node_group(name='split_sent', transform=lambda s: s.split('\n'))
retriever = lazyllm.Retriever(doc=docs, group_name="split_sent", similarity="cosine", topk=1)
retriever.update()

This code is consistent with the previous configuration using LazyLLM's TrainableModule to fine-tune LLM:

embed_path: used to specify the fine-tuned model;
train_data_path: Data set path used for training;
lazyllm.finetune.flagembedding: specifies the fine-tuning framework;

Key parameters:

ngpus=4: Use 4 GPUs for parallel training
per_device_batch_size=16: batch size per GPU
num_train_epochs=2: train for 2 epochs

It is worth noting that in the code here, we not only gave embed the fine-tuning configuration parameters, but also put it into the Document later. The Document registered a strategy to split the knowledge base document according to newlines. Finally, we also configured Retriever to act on the document and its corresponding segmentation method, and used cosine similarity as a measurement tool, while allowing only the most relevant text segment (topk=1) to be returned. Because LazyLLM supports one-click fine-tuning, deployment and inference, after executing update(), LazyLLM will first fine-tune the embed model, and then deploy the fine-tuned model to provide vectorization for Document and Retriever.

Effect evaluation

Here we use the contextual recall rate and Context Relevance introduced in the previous tutorial to evaluate our fine-tuned model. As a comparison, here we use

bge-large-zh-v1.5 is used as the base model, and the changes in the two indicators before and after fine-tuning are compared.

The call of the evaluation indicators is as follows:

from lazyllm.tools.eval import NonLLMContextRecall, ContextRelevance

def evaluate_results(data: list) -> tuple:
    """Evaluate retrieval results using multiple metrics.

    Args:
        data (list): List of retrieval results to evaluate

    Returns:
        tuple: Evaluation scores (context recall, context relevance)
    """
    recall_eval = NonLLMContextRecall(binary=False)
    relevance_eval = ContextRelevance()
    return recall_eval(data), relevance_eval(data)

The logic of fine-tuning, deployment and inference is mainly as follows:

# Prepare dataset
train_data_path, eval_data_path, kb_path = build_dataset_corpus(
    instruction=args.instruction,
    neg_num=args.neg_num,
    test_size=args.test_size,
    seed=args.seed
)
# Deploy retrieval service
retriever = deploy_serve(
    kb_path=kb_path,
    embed_path=args.embed_path,
    train_data_path=train_data_path,
    train_flag=args.train_flag,
    per_device_batch_size=args.per_device_batch_size,
    num_epochs=args.num_epochs,
    ngpus=args.ngpus
)
# Run SFT or Evaluation
results = []
query_corpus = load_json(eval_data_path)
for item in tqdm(query_corpus, desc="Processing queries"):
    query = item['query']
    inputs = f"{args.instruction}{query}" if args.use_instruction or args.train_flag else query
    retrieved = retriever(inputs)
    results.append({
        'question': query,
        'context_retrieved': [text.get_text() for text in retrieved],
        'context_reference': item['corpus']
    })

# Save and report results
save_json(results, args.output_path)
recall_score, relevance_score = evaluate_results(results)
print(f"Evaluation Complete!\nContext Recall: {recall_score}\nContext Relevance: {relevance_score}")

Based on the above logic, we obtain the following results:

	Before fine-tuningbge-large-zh-v1.5	After fine-tuningbge-large-zh-v1.5
Context Recall	78.28	88.57
Context Relevance	75.71	86.57

It can be seen that after fine-tuning, both indicators have been significantly improved. It shows that fine-tuning is effective! The overall evaluation process is: load the evaluation set → use the retrieval service (service deployed after fine-tuning) → perform batch inference → calculate dual indicators.

【bge-large-zh-v1.5 before fine-tuning】

【After fine-tuning bge-large-zh-v1.5】

Use the fine-tuned Embedding model in RAG

Similar to using a fine-tuned LLM, here we can also use a fine-tuned Embedding model, as shown below:

import os
import lazyllm

prompt = ('You will act as an AI question-answering assistant and complete a dialogue task.'
          'In this task, you need to provide your answers based on the given context and questions.')

embed = lazyllm.TrainableModule('bge-large-zh-v1.5', 'path/to/sft/bge')

documents = lazyllm.Document(dataset_path=os.path.join(os.getcwd(), "KB"), embed=embed, manager=False)
documents.create_node_group(name='split_sent', transform=lambda s: s.split('\n'))
with lazyllm.pipeline() as ppl:
    ppl.retriever = lazyllm.Retriever(
        doc=documents, group_name="split_sent", similarity="cosine", topk=1, output_format='content', join='')
    ppl.formatter = (lambda nodes, query: dict(context_str=nodes, query=query)) | lazyllm.bind(query=ppl.input)
    ppl.llm = lazyllm.OnlineChatModule(source="sensenova")\
        .prompt(lazyllm.ChatPrompter(instruction=prompt, extra_keys=['context_str']))

ppl.start()

Fine-tuning method	Computing resource consumption	Parameter update range	Advantages	Disadvantages
Full Parameter Fine-tuning	High	All parameters	Fully tap the potential of the model, strong adaptability	High resource consumption, easy to overfit
Freeze Fine-tuning (Freeze)	Low	Only some layer parameters	Save computing resources, fast training	Performance may not be as good as full parameter fine-tuning
LoRA fine-tuning	Medium	Low-rank matrix	Save resources and maintain the advantages of pre-training	There are certain restrictions on the model structure

Chapter 9: Fine-Tuning in Practice — Help Large Models and Embedding Models Better Understand Your Domain

The role of large models in the RAG system

Review of RAG process

Impact of model capabilities

1. Domain knowledge adaptability defects

2. Weak structured output control

3. Performance is limited by the deployment environment

Fine-tuning to improve large model capabilities

Fine-tune core methods

Core Advantages

Typical application scenarios

Principles and methods of LoRA fine-tuning

Overview of LoRA fine-tuning

LoRA fine-tuning steps

LoRA mathematical analysis

Parameter tuning suggestions

Parameter summary

Fine-tune large models based on LazyLLM

Data preparation

Dataset Introduction

Dataset size

Data structure

Training set processing

Evaluation set processing

Fine-tuning the model

Effect evaluation

Evaluation purpose

Evaluation indicator design

1. Exact matching rate

2. Semantic similarity

3. Inclusion of the original text

4. Comparative analysis of indicators

Comprehensive evaluation script

Comparison of evaluation results

General evaluation comparison

Use fine-tuned large models in RAG

Experimental process

Fine-tune the Embedding model based on LazyLLM

Data preparation

Fine-tuning process

Effect evaluation

Use the fine-tuned Embedding model in RAG

Further reading

More fine-tuning methods

1. Fine-tuning of all parameters

2. Freeze fine-tuning

Fine-tune data format

1. Alpaca command fine-tuning data format

2. OpenAI command fine-tuning data format