Chapter 13: RAG + Multimodal — A Q&A System That Handles Images and Tables Alike
In the previous courses, we explored the basic principles of RAG (Retrieval-Augmented Generation) and its application in plain text processing. RAG improves the capabilities of text-based question answering systems by retrieving relevant information from external knowledge bases and combining context to generate more accurate and informative answers. However, information in the real world is not limited to text. For example, multi-modal data such as pictures and tables in PDF documents also carry a large amount of valuable knowledge. In some cases, this graphic content is more intuitive and effective than plain text. However, RAG mainly relies on text retrieval and generation, and has weak processing capabilities for images in PDF files. It cannot directly parse and use image information for retrieval or generation, which may lead to the omission of key information and affect the quality of the final answer. Therefore, when RAG processes PDF documents containing important image information, it needs to be supplemented with OCR (optical character recognition) or computer vision technology to improve content parsing capabilities. This course will introduce how to process pictures and tables in PDF in the RAG system, and extract key information for question and answer, thereby enhancing the system's understanding and application of multi-modal data.
Multimodal model
Why introduce multimodality?
In practical applications, we often need to extract useful information from various forms of documents such as contracts, reports, product specifications, etc. These documents not only contain rich text content, but may also include non-text information such as pictures, charts, tables, etc. The following is an information display of the original view of a PDF page.

Now, we use the normal model and the multi-modal model to read the contents of the PDF respectively. When reading a PDF containing graphic information, the differences between the resulting "text formats available for retrieval/generation" are as follows:

We can see that the original RAG mainly relies on the retrieval and generation of text, and cannot fully understand and utilize the important content carried in images and charts, which will lead to the loss of relevant content. This is especially obvious when processing complex documents such as PDF. Therefore, in order for ordinary RAG to maintain efficient and intelligent performance in more real-world scenarios, we urgently need to introduce multi-modal capabilities, combined with OCR, image understanding and other technologies, so that the model has the ability to "read images and recognize meaning", so as to truly realize intelligent question answering and generation of multi-modal information.
Introduction
In the real world, information is never presented in a single modality. Human daily perception relies on collaborative input from multiple modalities, including vision, hearing, language, touch, etc. These perceptions together constitute our overall understanding of the world. When research tasks or data involve multiple modal information, we call it a "Multimodal Problem". Effectively dealing with such problems is a key step to promote artificial intelligence systems to move towards human-like cognition and achieve "human-like intelligence".

Because of this, Multimodal Large Language Models (MLLM) came into being. It is developed based on the continuous evolution of large language models (LLM) and large visual models (LVM). LLM continues to make breakthroughs in language understanding, reasoning capabilities, instruction following, contextual learning, and thinking chains, promoting the widespread application of natural language processing technology. However, it still faces limitations when processing non-verbal modal information such as images and audio. At the same time, LVM has achieved remarkable results in visual tasks such as image recognition, target detection, and image segmentation. Some models can already accept language instructions and perform specific visual tasks, but their reasoning and cross-modal capabilities still need to be improved. This has prompted researchers to try to integrate the language capabilities of LLM and the perceptual capabilities of LVM to realize the understanding and generation of information in multiple modalities such as graphics and text. MLLM is the product of this fusion. Through technical paths such as joint modeling and cross-modal alignment, MLLM can simultaneously process multi-modal inputs such as text and images, showing stronger understanding and generative power in tasks such as image and text question and answer, multi-modal retrieval, and assisted creation, laying the foundation for artificial intelligence to move towards a broader cognitive world.
Basic principles
After understanding the concept and development background of multi-modal large models, we will discuss its basic principles in depth. The core reason why multi-modal large models are powerful lies in their ability to understand, align and fuse data from different modalities, such as text, images, speech and even videos, to achieve unified information modeling and reasoning. We can understand its basic principles from the following three levels:
1. Perception layer: modal perception and feature extraction
Data in different modalities have different structural properties, such as:
- The image is a two-dimensional grid of pixels;
- Speech is a continuous sound wave signal;
- Text is a discrete sequence of symbols;
- Video is a combination of sequences of frames.
At the perceptual layer, each modality requires a dedicated encoder to extract its features. for example:
- Images usually use Convolutional Neural Network (CNN) or Visual Transformer (ViT) to extract spatial features;
- Audio can be processed using convolutional networks or spectral Transformers;
- The text is encoded by pre-trained language models (such as BERT, GPT).
The key to this step is to convert data of different modalities into a vector representation (embedding) of a unified structure for subsequent processing.
2. Alignment layer: representation alignment and semantic mapping
Since the embeddings of different modalities essentially come from different spaces, their semantic distribution is inconsistent and cannot be directly fused. At this time, a "bridge" module - Connector is needed to realize the mapping and alignment of the representation space. Common alignment strategies include:
- MLP mapping: through a simple multi-layer perceptron network, the vectors of non-text modalities are mapped to the text embedding space to achieve spatial transformation;
- Cross-Attention: Use image or audio vectors as Key/Value to interact with the text space through the Transformer decoder module;
- Internal fusion mechanism: Implement fusion between modalities at the internal structure level of the model, such as directly mixing input image patches and text tokens.
The core goal of this layer is to establish semantic correlations between modalities so that the model can "understand" the relationship between objects in the image and text descriptions, or align speech with text.
3. Comprehension and generation layer: unified semantic modeling and cross-modal reasoning
When the representations of all modalities are mapped to a unified space, they can be input into a large language model (LLM) for in-depth understanding and generation. at this time:
- LLM plays the role of "inference center" and uses its powerful context understanding capabilities to integrate multi-modal information;
- Drive the entire multi-modal system to learn cross-modal reasoning capabilities through language modeling goals (such as word prediction, question and answer, etc.);
- The output is not limited to text, but can also be in the form of graphic dialogue, image description, audio Q&A, etc.
We can see that the multi-modal large model does not simply "splice" images and text together, but achieves semantic fusion and unified reasoning between modalities through the process of encoding → alignment → understanding. This design enables MLLM to have human-like cognitive capabilities: it can look at pictures and speak, understand audio, cross-modal question and answer, and even perform multi-modal creation.
Model architecture
From the perspective of analogy to the human perception system, the architecture of the multi-modal large model can be understood as: the encoder is like human eyes and ears, responsible for receiving external information such as vision and hearing, and converting it into "perception signals"; the LLM (large language model) is similar to the human brain, understanding, reasoning and generating these signals.

The overall structure of a large multimodal model usually includes three core modules: Modality Encoder, Connector and Language Model (LLM).
- Modality Encoder
- Function: Convert data in different modalities (such as images, audio, video, text) into vector representations (embedding).
- Features: Each modality uses a specialized model for encoding, such as CNN/ViT to process images, a speech model to process audio, and a language model to process text.
- Connector
- Function: Map non-text modal vectors to a space that is compatible or alignable with text embedding to solve the problem of semantic space inconsistency.
- Common connection methods: MLP mapping: spatial transformation through multi-layer perceptron; Cross-Attention mapping: using a cross-modal attention mechanism to establish interaction between modalities; internal fusion of the model: such as mixing input of image patches and text tokens to achieve underlying structure-level fusion.
- Language Model (LLM)
- Function: receive the unified vector representation and perform understanding and generation tasks;
- Features: As the core of reasoning, LLM supports tasks such as graphic and text dialogue, question and answer, description, etc., and is a key part of the entire multi-modal system.
Multimodal input (such as images, audio, video, etc.) and text data are different in form, so they need to be converted into vector representations (embedding) through their respective encoders first. But these vectors are not necessarily initially in the same space as the text embedding. This is where the connector module is needed, its role is to map multi-modal embeddings to a space that is compatible or alignable with text embeddings. The three common connection methods include the previously mentioned MLP mapping, Cross-Attention mapping and internal model fusion. No matter which connection method we use, the final core is the pre-trained LLM, which understands and generates the unified representation and is the most critical inference unit in the entire multi-modal processing process.
Mainstream open source model

Before the rise of multimodal large language models (MLLM), a lot of work has been devoted to multimodal learning, which is mainly divided into two types of technical routes:
- Discriminative Paradigm: Represented by CLIP, it focuses on learning image-text matching and alignment, and is mainly used for tasks such as image classification and retrieval;
- Generative Paradigm: Represented by OFA, VL-T5, Flamingo, etc., it emphasizes cross-modal generation, such as image description, visual question and answer, etc.
MLLM can be seen as the continuation of the development of generative multi-modal models. With the support of more powerful language models and new training paradigms, the boundaries of capabilities have been greatly expanded. The current mainstream open source MLLM is usually based on large-scale pre-trained language models (such as LLaMA, Vicuna), combined with image encoders (such as CLIP-ViT, BLIP) and connector modules to effectively map image information to a representation space that can be processed by the language model. Most models use multi-modal instruction fine-tuning strategies to enable them to have complex capabilities such as "looking at pictures and talking", "picture-text dialogue" and "cross-modal reasoning". Typical open source projects include LLaVA, MiniGPT-4, InstructBLIP and Fuyu, covering multiple directions such as lightweight deployment, dialogue experience optimization and reasoning capability improvement. These models generally adopt the modular architecture of "image encoder + connector + LLM", combining different training goals and task scenarios, and continue to promote breakthroughs in the generalization capabilities and practical application of multi-modal models.
Using multimodal large models in LazyLLM
LazyLLM already supports multi-modal large model access (such as graphic and text Q&A, graphic and text understanding, etc.), which can be quickly started and used in the following ways. Taking GLM's multi-modal model glm-4v-flash as an example, use LazyLLM to start the service:
import lazyllm
chat = lazyllm.OnlineChatModule(source="glm", model="glm-4v-flash")
lazyllm.WebModule(chat, port=23333, files_target=chat).start().wait()
In addition to
glm-4v-flash, LazyLLM also supports a variety of mainstream multi-modal large models, such as Qwen-VL series, OpenAI GPT-4V, etc., but users need to apply for and configure API Key by themselves.
After startup, users can upload pictures and enter questions through the WebUI page to achieve a mixed question and answer experience with graphics and text.

Multimodal RAG
RAG multi-modal architecture
Multi-modal RAG (Retrieval-Augmented Generation) expands the capabilities of traditional RAG, enabling it to process multiple modalities such as text, images, and audio to achieve richer information retrieval and generation. Its core lies in cross-modal alignment and fusion, relying on multimodal embedding to map different types of data into a unified representation space for efficient retrieval and generation. The development of models such as CLIP and BLIP has promoted this technology, allowing multi-modal RAG to more accurately integrate multi-modal information and enhance understanding and generation capabilities. Next, we will start from the overall architecture and lead readers to have an in-depth understanding of the basic principles and functions of multi-modal RAG.
Multimodal RAG VS Basic RAG
In the previous RAG architecture diagram, we mainly retrieved and generated text data. However, to support multi-modal input and output, certain adjustments and extensions to the original RAG architecture are required.
First, the modules that need to be modified include:
- Retrieval module: Originally only supported text retrieval, but now it needs to be expanded to support multi-modal retrieval, such as indexing and matching of images, audio and other information.
- Generation module: The original RAG was only for text generation, and now needs to be expanded to support multi-modal output, such as the ability to generate text combined with images and audio.
Secondly, the modules that need to be added include the following four parts:
| Module | Function |
|---|---|
| Multimodal Encoder | Used to encode data of different modalities (text, images, audio, etc.) for unified representation and use in retrieval and generation. |
| Multi-modal Fusioner | Used to fuse information from different modalities so that they can work together to improve the accuracy and richness of generated content. |
| Cross-modal searcher | Supports input of multiple data formats and can find relevant information in multi-modal knowledge bases. |
| Multi-modal decoder | Responsible for decoding the generated results into multiple forms, such as text, pictures, speech, etc., to adapt to different output requirements. |
Core components of multimodal RAG
In order to support multi-modal input and output, the adjusted and extended multi-modal RAG flow chart is as follows.

Let’s briefly talk about the functions of each component and related input and output examples.
Multi-modal coding and fusion
In the multi-modal RAG system, the multi-modal coding module is a key component that uniformly maps different types of inputs (such as text, images, audio, etc.) to vector space. Its main function is to encode user queries and knowledge base content for similarity retrieval and context construction. Compared with traditional RAG that only processes text, multi-modal RAG needs to uniformly encode multi-source information such as images and texts to achieve understanding and matching of cross-modal information.

- Commonly used models for text encoding:
- BAAI/bge series (such as bge-m3: supports multi-language and long text processing, generates high-quality semantic embeddings)
- Cohere Embed (provides efficient semantic encoding and context understanding, suitable for text retrieval and generation tasks)
- GTE-large (focuses on high-precision semantic matching and retrieval, suitable for large-scale text data processing)
- Commonly used models for multi-modal coding:
- Image encoding: NFNet-F6, ViT, CLIP ViT
- Audio encoding: Whisper, CLAP
- Video Coding: CMVC
In RAG systems, the fusion of multi-modal data encoding is a core issue. In order to achieve effective interaction and retrieval of cross-modal information, we need to map data from different modalities into a unified vector space. In this way, whether it is text or non-text information, similarity measurement, retrieval and generation tasks can be performed in the same semantic space. Here are three commonly used methods.
1. Unified Modal
Convert all modes into the same mode (usually text) to achieve the purpose of unifying the vector space. A common practice is that for non-text modalities (such as images, audio, tables, etc.), first convert them into descriptive text with semantic information through a specialized model, and then uniformly use a text encoder for vectorization processing. For example, images can be converted into natural language descriptions using image description generation models, audio can be transcribed into text using speech recognition models, and structured data (such as tables) can use table understanding or summarization models to generate text descriptions. Finally, these text contents are uniformly input into the text encoding model to obtain a vector representation located in the same semantic space.
2. Cross-modal encoding
Through joint embedding technology, multi-modal data can be directly encoded into the same unified vector space without the need to convert non-text modalities into intermediate forms. Currently, a variety of models have implemented multi-modal hybrid coding, which can simultaneously process multiple modal inputs such as text, images, and audio. Common multi-modal encoders include image-based CLIP, audio-based CLAP, etc.. This encoding method more effectively retains the information of the original modality, which can improve the performance of downstream tasks, while simplifying the data process and providing a consistent vector interface for systems such as RAG, which has more engineering advantages.
3. Separate search
The third method uses multiple specialized models to process different modalities of data (such as images, audio, text), and then fuses the results in the retrieval stage. This approach is highly flexible and can use the optimal model for each data type, thereby improving the accuracy and relevance of retrieval. Despite increasing system complexity and implementation difficulty, such multi-modal RAG architectures for multi-model collaboration are becoming increasingly attractive in the context of the continuous development of strong modal models.
Multimodal retrieval module
1.Basic search
Handles single-modality queries such as text-to-text and image-to-image matching.
| input | output |
|---|---|
| Text query: "A black and white kitten" | Related text paragraphs, such as "A black and white kitten is usually an American shorthair cat" |
Image query: ![]() |
Related images, as follows: ![]() |
When the input is a text query, such as "a black and white kitten", the relevant text paragraph will be output, such as "a black and white kitten is usually an American shorthair cat"; and when the input is an image query, such as a photo of a black and white kitten, the relevant image will be output, such as a picture of a black and white cat. Retrieval often relies on embedding similarity calculations such as cosine similarity:
where q is the embedding vector of the query document and d is the embedding vector of the candidate document. If the calculated similarity is above a certain threshold (e.g. 0.9), the document is considered relevant.
2. Cross-modal retrieval
Supports cross-modal queries, such as text queries matching images, or image queries matching text.
| input | output |
|---|---|
| Text query: "A black and white kitten" | Related images, such as a picture of a black and white cat, as follows: ![]() |
When the input is text, such as "What does the kitten look like?", the most relevant cat picture will be output; when the input is a cat photo, the relevant description text will be output, such as "This is a black and white American shorthair cat"
Multimodal generation module
The original RAG only supported text generation, but now it has been extended to support multi-modal output, such as generating answers with pictures or explanations combined with speech.
| input | output |
|---|---|
| Text query: "What does a black and white cat look like?" | Graphic answer: text + a picture of a black and white cat, as follows: ![]() |
When the input is a text query, such as "What does a black and white cat look like?", the relevant search results (text, pictures) will be passed to the large model, and finally a graphic answer will be output, such as related text + a picture of a black and white cat.
Text generation relies on the autoregressive Transformer model to calculate probability:
Among them, x is the input search content, yt is the currently generated word, and y<t is the previously generated word. Assume that the model may generate P("A")=0.8; P("black")=0.9; P("and")=0.85; P("white")=0.95 step by step, and the final output is "A black and white cat".
Three-stage example
Next, taking the "travel question and answer" scenario as an example, we will introduce the result output of the three stages of multi-modal RAG.
1. Coding Phase
enter:
- User query (text + picture): "What style is the building in this photo?" + [Byzantine style church photo]
- Knowledge base content:
- Text 1: "Characteristics of Byzantine Architecture" document (including keywords such as dome, mosaic, etc.)
- Image 2: Photo of Gothic church
- Image 3: Photo of St. Mark's Basilica in Venice
- ...
【query picture】

【Picture 2】

【Picture 3】

Encoding result shows:
#Multimodal coding (such as CLIP hybrid encoder implementation)
query_text_embed = [0.23, -0.57, ..., 0.89] # The embedding of query is 512 dimensions.
doc1_text_embed = [0.20, -0.52, ..., 0.91] # Embedding of text 1, similar to query_text_embed
query_img_embed = [0.67, 0.12, ..., -0.33] # Embedding of the image in query
img2_embed = [0.02, 0.45, ..., 0.11] # The embedding of picture 2 is quite different from query_img_embed
img3_embed = [0.63, 0.09, ..., -0.30] # Embedding of picture 3, similar to query_img_embed
#Multimodal fusion representation (taking weighted average as an example)
query_fused = [0.45, -0.22, ..., 0.28] # Text 0.4 + Image 0.6 weight
2. Retrieval phase
Vector database operations:
- Calculate the cosine similarity between
query_fusedand all knowledge base vectors respectively - Return Top3 results:
| Ranking | Content type | Similarity | Snippet representation |
|---|---|---|---|
| 1 | Image 3 | 0.93 | St. Mark's Basilica (Byzantine style photo) |
| 2 | Text 1 | 0.82 | "Byzantine architecture with domes and golden mosaics..." |
| 3 | Picture 2 | 0.31 | Gothic church spire photo |
Retrieval result JSON representation:
{
"retrieved_results": [
{
"image_path": "St. Mark's Basilica.jpg",
"score": 0.93,
"modality": "multimodal"
},
{
"content": "Features of Byzantine architecture include...",
"score": 0.82,
"modality": "text"
}
]
}
3. Generation phase
Multimodal LLM input:
[USER_QUERY]
Text: "What style is the building in this photo?"
Image: <Photo of Byzantine Church>
[CONTEXT]
1. [Picture] <Photo of St. Mark’s Basilica>
2. [Text] Characteristics of Byzantine architecture: "Circular dome, golden mosaic decoration..."
[OUTPUT]
Generate results:
"The building in your photo has typical Byzantine style features:
- Dome structure (such as the central dome of St. Mark’s Basilica)
- Mosaic decoration (you can see the golden reflective part in the photo)
...
Multimodal RAG based on PDF documents
Take a PDF document as an example to demonstrate the overall implementation process of multi-modal RAG. In order to fully embed the PDF document, other elements such as document paragraphs, titles, and icons are first identified and extracted through the layout recognition model. The figure below shows the image and text RAG system process based on OCR document parsing. Next, we will briefly introduce each step, including the three key steps of document parsing, multi-modal embedding and query and generation.

1. Document analysis
PDF format overview and parsing method
1.PDF format introduction
PDF (Portable Document Format) is a file format that is independent of applications, hardware, and operating systems and is mainly used for sharing and viewing documents. The main features of PDF documents include:
- Has a fixed layout and format that remains consistent across devices and operating systems.
- Can contain text, images, links, buttons, form fields, audio, video and business logic.
- Suitable for document storage, exchange and printing, widely used in office and academic fields.
2.Main types of PDF
Depending on the source, PDFs can be divided into the following two categories.
①Machine-generated PDF:
- Created with professional software such as Adobe Acrobat, Microsoft Word, Typora, etc.
- Contains selectable, searchable, editable text, images and hyperlinks inside.
- Suitable for direct parsing and structured processing.
② PDF generated by scanning
- Converted from a scanning device or photographed image, usually stored in bitmap format.
- Its content is not directly searchable and editable and contains only images.
- Requires OCR (Optical Character Recognition) technology for text extraction and parsing.
3.PDF parsing method
For different types of PDF, the parsing methods are different:
①Machine-generated PDF
- Can be parsed directly using Python libraries, such as pdfminer, pdfplumber, etc.
- These tools can extract structured information such as text, images, tables, etc.
② PDF generated by scanning
- OCR technology is required to convert images into text. Common methods include deep learning models (such as Tesseract OCR, LayoutLM) or cloud OCR services.
- Deep learning methods usually have better parsing results in complex layouts and multi-language scenarios.
By choosing an appropriate parsing method, valuable information can be efficiently extracted from PDF documents to provide support for document understanding and processing.
PDF document parsing and structural processing
The goal of PDF parsing is to extract text from non-editable documents and convert it into a standard encoding format for storage while retaining the original structural information. Due to the complexity of typesetting, direct extraction of text may result in the loss of structural information, such as titles, paragraphs, tables, etc., which is crucial for semantic understanding and logical reasoning. Therefore, PDF parsing is usually regarded as a generalized OCR task, including two steps: text recognition (OCR) and layout analysis.
- OCR Extract Text: OCR (Optical Character Recognition) technology is used to convert text in paper documents or scanned images into editable, searchable electronic text. For example, open source tools such as PaddleOCR and EasyOCR can efficiently identify uneditable text in PDFs and store them in plain text (TXT) format.
- Layout Analysis: The goal of layout analysis is to restore the original text organization structure of the PDF, identify the relationship between each text block, and organize the text according to the original format of the document. For example, the LayoutLM model can jointly model the text, layout, and image information in the document, accurately identifying the position of the text on the page and its layout structure. In addition, PDFs containing complex content such as tables and formulas can be parsed in conjunction with a specialized table analysis model to retain the complete information of the document.
In practical applications, MinerU is a powerful PDF parsing tool that encapsulates functions such as text extraction, layout restoration, table and formula analysis, and provides a convenient Python interface. By using MinerU, the PDF parsing process can be significantly simplified, the efficiency and accuracy of multi-modal information processing can be improved, and richer support can be provided for subsequent document understanding, retrieval, and generation tasks.
2. Multi-modal embedding
After completing the document parsing to extract the documents and charts in the document, we need to embed the document. When getting the parsed document (that is, the extracted text and image collection), the simplest idea is to vector embedding the text and image respectively to obtain the vector representation. Although this method is simple, it has a significant shortcoming: it cannot capture the intrinsic correlation between different modalities (such as text and images). Specifically, text and images are mapped into two independent vector spaces respectively, so that there is no direct semantic connection between them. If you want to implement RAG, you need to retrieve text and images separately, and perform some form of reordering after retrieval to find the most relevant content. Although this method is relatively simple to implement, from the perspective of system performance and performance, during the query phase, when the system needs to extract the most relevant information from multiple modalities and reorder it, the more modalities involved, the more complex the retrieval and sorting tasks will be.
In order to reduce the complexity of the online stage, we can directly map the information from different modalities into a shared vector space during the offline embedding stage of the document. In this way, during the online stages of retrieval and generation, the system only needs to operate within a unified vector space, thus avoiding separate retrieval and subsequent reordering processes between multiple modalities, greatly improving the efficiency and response speed of the system. In this way, not only can the complexity of online reasoning be effectively reduced, but the relationship between text and images can also be made closer, thereby improving the performance of multi-modal tasks. The figure below provides two methods of unifying the vector space. One is to directly map the data of multiple modalities to the same vector space (left side), and the other is to unify the modalities first and then perform vector embedding (right side).

The left side is the mapping method using multi-modal models, and the right side is the vector embedding method after unifying the data modalities.
(1) Multi-modality is directly embedded in a unified space
When we only deal with the embedding problem of text and images (we can think of tables as images, or parse tables into formatted text in the parsing stage), they can be mapped to the same vector space through specific models (such as CLIP, VisualBERT, etc.). This method can make the semantic association between different modalities closer and eliminate the need to process each modality separately during retrieval. Through the shared vector space, text and images can be compared and sorted in the same semantic space, thereby improving the efficiency and accuracy of multi-modal retrieval and generation.
(2) Modal unification before embedding
This method usually first converts non-text modalities such as images into text form through some transformation model (such as a text description generation model). For example, images can generate text descriptions related to them through visual models, and tables can be converted into structured text information through table analysis models. After completing the modal unification, these text data can be further embedded through text embedding models (such as BERT, T5, GPT, etc.). The advantage of this method is that existing mature text embedding technology can be used to process information in different modalities. However, the disadvantage is that the original information of modalities such as images and videos may lose some important details in the process of being converted into text, and the generated text may not fully retain the rich semantics of the original modality.
Both methods have their own advantages and disadvantages, and which method to choose depends on the specific application requirements and model design goals. If the goal is to retain the original information of each modality to the maximum extent and improve the correlation between multi-modal data, the first method may be more suitable. On the other hand, if the project involves a large number of non-textual modalities and can be represented by the transformation of text, then the second method can also be an effective solution. No matter which method is used, it only involves the offline process of RAG. You only need to change the process related to document storage in the text RAG system to get a basic multi-modal RAG. Generally speaking, if a detailed and accurate description of the chart is generated, or a better-performing embedding model is used, it will better complement the document content and help provide more detailed answers through the chart content when responding to user questions.
Use multi-modal models to parse images: automatically extract information and generate QA pairs
In addition to the two image parsing methods mentioned above, when parsing images, we can also use multi-modal models such as: InternVL-Chat-V1-5 to automatically extract key information in the image and generate corresponding question and answer (QA) pairs. The core idea of this method is: first, the multi-modal model combines visual and text understanding to analyze the image content, including objects, text, scenes, structures, etc.; then, based on the extracted information, relevant questions and answers are generated to help users quickly understand the core content of the image. For example, when parsing a schematic diagram or experimental result diagram in a paper, the model can identify the title of the picture, data trends, key conclusions and other information, and automatically generate questions and corresponding answers such as "What are the main conclusions of this experiment?", thereby improving the understanding of the paper content and the efficiency of information acquisition.
3. Generate responses with pictures and text
If when processing a chart in a document, the system only outputs a description related to the chart but cannot display the chart itself, this may make the user feel that the information presentation is not intuitive and vivid enough. And if you can display both charts and related descriptions in the generated answers, the effect will undoubtedly be more attractive and valuable. Currently, outputting illustrated content is still a challenging task for large language models, but most models are already able to generate accurate Markdown format documents. Markdown is a lightweight markup language that not only supports text formatting, but also allows embedding of images and other multimedia elements. Therefore, in the Markdown document generated by the system, we can display charts or pictures through image links. You can add the following three operations to the document processing process of the RAG system to obtain output with pictures and texts:
- Formatted Save Chart URL: When parsing the document, the system needs to identify and extract the chart and image content. For each image, you can generate descriptive text from the diagram description in the document or an image understanding model and convert it into a description node for the image. At the same time, you need to save these image links (such as URLs stored on an image hosting service) and image descriptions as markdown syntax
; - Node classification: In the retrieval stage, after the retrieval component returns similar nodes, it needs to be classified according to the type of the node. If the recalled nodes contain image nodes, relevant prompt word processing needs to be done for the subsequent generation stage;
- Large model prompt word modification: You may need to provide prompt words for the large model that RAG generates answers. For example: if the answer needs to display an image, please provide links and explanations related to the image.

In this way, you can have the RAG system use images and diagrams to enhance the expressiveness and understandability of your answers as they are generated. This is a simple output method that uses the original image as part of the answer. In fact, you can also schedule multiple models to achieve the chart generation function. You can pay attention to the relevant parts in the next Agent tutorial for related learning.
Multimodal content vectorization effect optimization skills
In multimodal RAG systems, vectorization is the basis for achieving cross-modal semantic alignment and similarity retrieval. However, in practical applications, non-text modal information such as images and tables often suffer from problems such as "semantic loss" or "semantic drift" during the encoding process, resulting in inconsistent vector expression and original content understanding, affecting the final retrieval effect. In order to improve the accuracy and usability of multi-modal vectorization, optimization can be started from the following aspects:
1.Text completion: Combining text information such as image titles and annotations
Images or diagrams are usually equipped with a title, a caption or surrounding explanatory text in the knowledge base. These texts often highly condense the semantic information of the image and are the best source of semantic supplements to the image content.
Optimization Strategy:
- Concatenate the generated picture descriptions and picture annotations and then perform unified encoding;
- Splice the image itself with its title, caption and other text as a joint input, and use a joint coding model (such as CLIP) to generate vectors;
This approach can significantly improve the expressiveness and semantic accuracy of image representation in cross-modal retrieval tasks.
Examples and effects:
- Scenario: An illustration of a scientific research paper in the knowledge base
- Image Content: A diagram showing the "Transformer Architecture".
2.Structured generation: Extract QA pairs from multi-modal data in advance
Multimodal information such as images and tables often carries complex relationships or key factual information, and relying solely on embedded coding can easily miss semantic details. Extracting implicit information through structured question and answer generation is another effective means to improve the effectiveness of retrieval.
Optimization Strategy:
- Use OCR, image understanding model or table extraction model to parse images/tables;
- Combined with LLM to automatically generate "Question-Answer Pairs (QA Pairs)" or "Summary Text" from the parsing results;
- Encode these structured QA pairs jointly with the original image as knowledge enhancement material, or join these QA pairs to a vector database.
This strategy is especially suitable for scenarios that are sensitive to details and structured information, such as chart data, statistical reports, or medical images.
Example:
Scenario background: A screenshot of a monthly KPI report in the company's internal operating system. The chart is in PDF or image format and cannot be queried directly. Chart content: The bar chart shows sales and customer growth from January to March 2024. The data is as follows:
| Month | Sales (10,000 yuan) | Number of new customers (people) |
|---|---|---|
| January | 132 | 58 |
| February | 147 | 76 |
| March | 125 | 52 |

Effect:
- Before optimization: Direct coding will only get rough labels such as "chart" and "value".
- After optimization: For example, KPI charts can significantly enhance the ability to query and understand image data by automatically generating Q&A ("Which month has the highest sales?").
3.Context enhancement: encode the context together
The information of many images requires context to be understood, and encoding images alone can easily lose Context Relevance. Through the context enhancement strategy, the semantic linkage between the image and the text and paragraph where it appears can be achieved.
Optimization Strategy:
- Before inputting the image coding model, the relevant contextual texts are spliced or fused together to build the "Image + Text Context Window";
- Use a multi-modal hybrid coding model or jointly code them after unifying the modalities;
This strategy is suitable for task scenarios where images and text are highly coupled (such as news, textbooks, web encyclopedias, etc.), and can help the model better grasp the true intention of image semantics in the context.
Examples and effects:
Scenario: The electronic medical record system of a hospital contains chest X-ray images and report text. Image Content: Chest X-ray

-
Fine-tuning the multi-modal model: improving the adaptability of the model in specific multi-modal fields
-
Example:
In the field of medical images, simple diagnosis is achieved based on multi-modal models. No additional contextual information is even needed.

**Picture and text question and answer: "Is the area of the brain infarcted?". **
❌ Normal general model: No diagnosis can be made.
✅ Fine-tuned model: Yes. **There is an infarct in the area of the brain. **
- Data introduction

VQA-RAD is a question-answer pair dataset on radiological imaging.
- Dataset purpose
- Training and testing of medical imaging VQA (visual question answering) systems
- Supports open-ended questions (such as "Where is the lesion?") and binary questions (such as "Is the tumor present?")
- Data source
- Based on MedPix (open medical imaging database)
- Manually annotated by clinicians to ensure professionalism
- Core strengths
- The first VQA data set focusing on radiological images
- Clear structure, covering common clinical problem types
| Training set | Testing set | |
|---|---|---|
| Questions | 1,793 | 451 |
| images | 313 | 203 |
- Data processing
-
Data acquisition
Python from datasets import load_dataset dataset = load_dataset("flaviagiammarino/vqa-rad")
-
Before processing
JSON { "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=566x555>, "question": 'are regions of the brain infarcted?', "answer": 'yes' }
-
After processing (OpenAI format):
JSON [ { "messages": [ { "content": "<image>are regions of the brain infarcted?", "role": "user" }, { "content": "yes", "role": "assistant" } ], "images": [ path/to/train_image_0.jpg" ] },
-
Fine-tune the model
Python import lazyllm model_path = 'path/to/Qwen2.5-VL-3B-Instruct' data_path = 'path/to/vqa_rad_processed/train.json' # Transformers and llamafactory in the environment need to be upgraded to the latest development branch m = lazyllm.TrainableModule(model_path) .mode('finetune') .trainset(data_path) .finetune_method( (lazyllm.finetune.llamafactory,{ 'learning_rate': 1e-4, 'cutoff_len': 5120, 'max_samples': 20000, 'val_size': 0.01, 'num_train_epochs': 2.0, 'per_device_train_batch_size': 16, })) m.update()
-
Model configuration:
model_pathspecifies the model we want to fine-tune. Here we use Qwen2.5-VL-3B-Instruct and directly specify its path;- Fine-tuned configuration:
.modesets the startup fine-tuning modefinetune;.trainsetsets the data set path for training;.finetune_methodsets which fine-tuning framework and parameters are used. Thellamafactoryframework is used here (a library that supports efficient fine-tuning technologies such as LoRA and QLoRA).learning_rate:1e-4learning rate, indicating the magnitude of parameter update in each step of the model. Higher values train faster but may be unstable.cutoff_len: 5120The maximum length of the input sequence. Text exceeding this length will be truncated. Suitable for long dialogue or long description tasks.max_samples: 20000Maximum number of samples used for training. If you don't want to train the entire dataset, you can limit it to a certain number.val_size: 0.01Validation set proportion. Here it is 1%, which means 99% of the data is used for training and 1% is used to evaluate the performance of the model during training.num_train_epochs: 2.0Number of training epochs. Each epoch means that the model has viewed the entire training set once.per_device_train_batch_size: 16Training batch size on each device (usually GPU). Choose an appropriate batch size based on the size of the video memory.- Start the task:
.updatetriggers the start of the task: the model is fine-tuned first. After the fine-tuning is completed, the model will be deployed. After deployment, it will automatically use the evaluation set to go through inference to obtain the results;- Fine-tune the loss curve

- Model evaluation:
In the test set with a total of 451 questions, the exact matching rate and semantic similarity of the model Qwen2.5-VL-3B-Instruct before and after fine-tuning are as follows:
| Qwen2.5-VL-3B-Instruct | Before fine-tuning | After fine-tuning |
|---|---|---|
| Exact match rate | 0.00% | 55.43% |
| Semantic similarity | 31.85% | 80.64% |
- Example: For the following images, fine-tune the output before and after

Before fine-tuning:
{
"query": "is the liver visible in the image?",
"true": "no",
"infer": "yes, the liver is visible in the image. it appears as a large, dark gray structure located in the upper left quadrant of the abdomen.",
"exact_score": 0,
"cosine_score": 0.3227266048281184
}
After fine-tuning:
{
"query": "is the liver visible in the image?",
"true": "no",
"infer": "no",
"exact_score": 1,
"cosine_score": 1.0
}
Extension: Multimodal RAG in ColPali
The previous article introduced the method of processing images and text separately after parsing the document. Can we directly process the document as a whole without parsing and extracting it? There are currently relevant papers confirming the possibility of this idea. We have learned that multi-modal large models (only two modalities, image and text, are discussed) can extract information from images and text and process it. If we regard a page of documents as an image, we can use multi-modal large models for embedding, thus eliminating the process of document parsing. However, related methods that use visual multi-modal large models to embed and retrieve them include DSE, ColPali, M3DocRAG, etc. The following uses ColPali as an example to explain the differences with methods based on document parsing (source ColPali paper).

ColPali (Contextualized Late Interaction over PaliGemma) uses PaliGemma (a powerful multi-modal pre-trained model) as a multi-modal encoder to embed documents. Specifically, a certain page of document is regarded as n image blocks, each block corresponds to a vector, so that a page of images can be represented by these n vectors. When calculating similarity, ColPali is different from the usual way of representing a sentence or a paragraph through a vector. Instead, it uses an embedded list and uses the "MaxSim" operation to calculate the similarity between two texts. MaxSim calculates the maximum similarity between each word in the query and all blocks corresponding to a page in the document, and accumulates the maximum similarity between each word and block to obtain the total similarity score between the page and the query. Calculate the total similarity score of the query for all pages, then sort the pages in descending order according to the total similarity score, and select the page with the highest total similarity as the page that best matches the query. The specific formula is as follows, where S is the final similarity score between query q and document d, Eq represents the embedding of query q, and Edi represents the embedding of image patches:
ColPali has excellent performance and shows great potential in multi-modal retrieval and generation, but it still faces some challenges in practical applications. For example, when processing a large number of PDF documents, the time complexity of similarity calculation shows a significant increase, which may lead to performance bottlenecks in actual projects. In addition, even though it is innovative in multi-modal information processing, there is still room for further improvement in its performance in text retrieval tasks. At present, ColPali is more suitable for retrieving diagram-rich documents. ColPali provides a new paradigm for multimodal RAG systems. In the context of the current rapid development of large multimodal models, more multimodal solutions with more powerful functions and higher efficiency will inevitably emerge, bringing more accurate and efficient multimodal information processing methods, promoting the development of multimodal RAG systems, and demonstrating stronger advantages and practicality.
References
MLLM:https://arxiv.org/pdf/2306.13549
layoutlmv3 :https://arxiv.org/abs/2204.08387
VisRAG:https://arxiv.org/pdf/2410.10594
ColPali:https://arxiv.org/pdf/2407.01449
Three-stage example: https://www.eyelevel.ai/post/multimodal-rag-explained



