Build a Question Answering application with chat history
A key challenge in Q&A systems is handling conversational context. This tutorial shows how to add memory to a RAG application so it can handle follow-up questions that depend on chat history.
We'll use the LazyLLM framework to build a conversational RAG system that can handle multi-turn conversations effectively.
In this section, you will learn the following key points about LazyLLM:
- How to load and manage documents with embeddings using [Document]
- How to use [Retriever] to retrieve relevant documents based on similarity
- How to use [Reranker] to re-rank retrieved documents for improved relevance
- How to combine [ReactAgent] to create an agent capable of multiple retrieval iterations
Design Approach
Our multi-turn conversational RAG system is built on the core idea of context-aware query rewriting, which transforms history-dependent user questions into standalone, retrievable queries.
The process begins by leveraging the most recent dialogue turns to prompt an LLM to rewrite the current user question into a self-contained form. This rewritten query is then used to perform hybrid retrieval, combining dense vector search (via embeddings) and sparse lexical matching (BM25) in parallel to recall relevant document chunks. A cross-encoder-based re-ranker subsequently refines the retrieved results to surface the most pertinent passages.
Finally, the rewritten query, top-ranked context, and trimmed dialogue history are jointly fed into an LLM to generate a coherent and grounded response.
The entire pipeline is implemented using LazyLLM, with explicit management of dialogue history (length-constrained to prevent context overload). Tool functions—such as query rewriting, hybrid retrieval, and re-ranking—are encapsulated via fc_register to enable seamless integration into a ReAct Agent for dynamic tool orchestration. A web interface serves as the user-facing entry point, delivering an end-to-end multi-turn conversational QA experience.

Setup
Dependencies
import os
import tempfile
from typing import List, Dict, Any
import lazyllm
from lazyllm import pipeline, parallel, bind, fc_register
from lazyllm import Document, Retriever, Reranker, SentenceSplitter, OnlineEmbeddingModule
Chains
A key challenge for conversational RAG is that the user messages and the LLM's responses aren't the only forms of context. User messages can reference previous portions of the conversation, making them challenging to understand in isolation.
For example, the second question below depends on context from the first:
Human: What is the standard method for task decomposition?
AI: The standard method for task decomposition is Chain of Thought (CoT) prompting...
Human: What are the common extensions of this method?
The second question "What are the common extensions of this method?" is ambiguous without the context of the first question. Our system needs to understand that "this method" refers to "Chain of Thought prompting".
Stateful management of chat history
To handle this, we need to contextualize the question based on the chat history. Let's build our conversational RAG system:
class ConversationalRAGPipeline:
"""Multi-turn conversational RAG pipeline based on LazyLLM components"""
def __init__(self, docs_path: str = None):
self.chat_history = []
self.docs_path = docs_path or self._create_sample_docs()
self.pipeline = None
self.init_pipeline()
def _create_sample_docs(self) -> str:
"""Create sample documents directory"""
temp_dir = tempfile.mkdtemp(prefix="rag_docs_")
# Create sample documents
sample_docs = {
"task_decomposition.txt": """
Task decomposition is a technique for breaking down complex tasks into smaller, more manageable subtasks.
Chain of Thought (CoT) prompting has become a standard technique for enhancing model performance on complex tasks.
The model is instructed to "think step by step" to utilize more test-time computation to decompose hard tasks into smaller and simpler steps.
CoT transforms big tasks into multiple manageable tasks and sheds light into an interpretation of the model's thinking process.
""",
"tree_of_thoughts.txt": """
Tree of Thoughts (ToT) extends CoT by exploring multiple reasoning possibilities at each step.
It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure.
The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier or majority vote.
""",
"decomposition_methods.txt": """
Common task decomposition methods include:
1) Using simple prompting with LLM like "Steps for XYZ" or "What are the subgoals for achieving XYZ?"
2) Using task-specific instructions like "Write a story outline" for writing a novel
3) Incorporating human inputs to guide the decomposition process
""",
"conversational_rag.txt": """
Conversational RAG systems need to handle chat history and context.
Key challenges include reformulating questions based on conversational context,
maintaining relevant chat history, and generating contextually relevant responses.
The system needs to understand pronoun references and contextual dependencies.
""",
"agent_rag.txt": """
Agent-based RAG systems can iterate over multiple retrieval steps.
Unlike chains that generate at most one query per input, agents can execute multiple retrieval steps to serve a query,
or iterate on a single search to gather more comprehensive information.
Agents leverage reasoning capabilities to make decisions during execution.
"""
}
for filename, content in sample_docs.items():
with open(os.path.join(temp_dir, filename), 'w', encoding='utf-8') as f:
f.write(content.strip())
print(f"Created sample documents directory: {temp_dir}")
return temp_dir
First, we set up our documents and embeddings:
def init_pipeline(self):
"""Initialize RAG pipeline"""
# Create documents and embeddings
documents = Document(
dataset_path=self.docs_path,
embed=OnlineEmbeddingModule(
source="qwen",
type="embed",
embed_model_name="text-embedding-v1"
),
manager=False
)
documents.create_node_group(
name="sentences",
transform=SentenceSplitter,
chunk_size=512,
chunk_overlap=100
)
Next, we will create a contextualization prompt for rephrasing questions based on chat history. Simultaneously, we will create a prompt for RAG to ensure it correctly understands instructions and generates responses that meet the requirements:
contextualize_prompt = """You are an assistant that helps rewrite questions. Your task is to:
1. Analyze the user's chat history
2. Understand the latest question
3. If the question depends on historical context, rewrite it as a standalone understandable form
4. If the question is already standalone, return the original question
5. Only return the rewritten question, do not add any explanations
Example:
History: "User: Tell me about John\nAssistant: John is an engineer..."
Question: "What is his job?"
Rewrite as: "What is John's job?"
History: {chat_history}
Question: {question}
Rewritten question:"""
# Define RAG answer generation prompt
rag_prompt = """You are a helpful AI assistant. Please answer the user's question based on the provided context information.
If there is no relevant information in the context, please state that you don't know and don't make up answers.
Chat history:
{chat_history}
Context information:
{context_str}
User question: {query}
Please provide a helpful and accurate answer:"""
self.contextualizer = lazyllm.OnlineChatModule(
source="deepseek",
timeout=30
).prompt(lazyllm.ChatPrompter(contextualize_prompt))
Then we construct the main RAG pipeline with parallel retrieval and reranking, which includes dual-path parallel retrieval for questions (based on cosine similarity and BM25) along with result reranking:
with pipeline() as ppl:
# Parallel retrieval
with parallel().sum as ppl.prl:
ppl.prl.retriever1 = Retriever(
documents,
group_name="sentences",
similarity="cosine",
topk=3
)
ppl.prl.retriever2 = Retriever(
documents,
"CoarseChunk",
"bm25_chinese",
0.003,
topk=2
)
# Reranking
ppl.reranker = Reranker(
"ModuleReranker",
model=OnlineEmbeddingModule(type="rerank", source="qwen"),
topk=3,
output_format='content',
join=True
) | bind(query=ppl.input)
# LLM generates answer
ppl.llm = lazyllm.OnlineChatModule(
source="deepseek",
stream=False,
timeout=60
).prompt(lazyllm.ChatPrompter(rag_prompt, extra_keys=["context_str", "chat_history"]))
self.pipeline = ppl
def contextualize_question(self, question: str) -> str:
"""Reformulate question based on chat history"""
if not self.chat_history:
return question
# Build historical conversation text
history_text = ""
for msg in self.chat_history[-6:]: # Last 3 turns of conversation
if msg["role"] == "user":
history_text += f"User: {msg['content']}\n"
elif msg["role"] == "assistant":
history_text += f"Assistant: {msg['content']}\n"
try:
# Pass parameters in dictionary format
response = self.contextualizer({
"messages": [
{"role": "system",
"content": "You are an assistant that helps rewrite questions. Please rewrite the user's question based on chat history to make it standalone and understandable."},
{"role": "user",
"content": f"Based on the following chat history, rewrite the question:\n\nChat history:\n{history_text}\n\nQuestion: {question}"}
]
})
return response.strip()
except Exception as e:
print(f"Question rewriting failed: {e}")
return question
Meanwhile, convert the restructured question into formatted dialogue text for subsequent processing or display:
def _format_chat_history(self, messages: List[Dict], max_turns: int = 3) -> str:
"""Format chat history"""
history = []
# Take recent conversation turns
for msg in messages[-max_turns * 2:]: # Each turn includes question and answer, so multiply by 2
if msg["role"] == "user":
history.append(f"User: {msg['content']}")
elif msg["role"] == "assistant":
history.append(f"Assistant: {msg['content']}")
return "\n".join(history)
def chat(self, question: str) -> str:
"""Process single turn conversation"""
# 1. Reformulate question
contextualized_question = self.contextualize_question(question)
# 2. Build historical conversation text
history_text = self._format_chat_history(self.chat_history, max_turns=4)
# 3. Generate answer through RAG pipeline
response = self.pipeline(
contextualized_question,
chat_history=history_text
)
# 4. Update chat history
self.chat_history.append({"role": "user", "content": question})
self.chat_history.append({"role": "assistant", "content": response})
return response
Let's test our conversational RAG:
rag = ConversationalRAGPipeline()
# First question
response1 = rag.chat("What is the standard method for task decomposition?")
print(f"Response 1: {response1}")
# Follow-up question that depends on context
response2 = rag.chat("What are the common extensions of this method?")
print(f"Response 2: {response2}")
Note that the question generated by the model in the second question incorporates the conversational context, transforming "this method" into "Chain of Thought prompting extensions".
Agents
Agents leverage the reasoning capabilities of LLMs to make decisions during execution. Using agents allows you to offload additional discretion over the retrieval process. Although their behavior is less predictable than the above "chain", they are able to execute multiple retrieval steps in service of a query, or iterate on a single search.
Below we assemble a minimal RAG agent using LazyLLM's ReactAgent:
@fc_register("tool")
def conversational_rag_chat(question: str) -> str:
"""
Multi-turn conversational RAG tool
Args:
question (str): User question
Returns:
str: RAG system's answer
"""
global rag_system
if rag_system is None:
rag_system = ConversationalRAGPipeline()
return rag_system.chat(question)
def create_rag_agent():
"""Create RAG agent"""
agent = lazyllm.ReactAgent(
llm=lazyllm.OnlineChatModule(source="deepseek", timeout=60),
tools=["conversational_rag_chat"],
max_retries=3,
return_trace=False,
stream=False
)
return agent
The key difference from our earlier implementation is that instead of a final generation step that ends the run, here the tool invocation can loop back to gather more information. The agent can then either answer the question using the retrieved context, or generate another tool call to obtain more information.
Let's test this out with a question that would typically require an iterative sequence of retrieval steps:
agent = create_rag_agent()
response = agent(
"What is the standard method for Task Decomposition? "
"Once you get the answer, look up common extensions of that method."
)
print(response)
Note that the agent: 1. Generates a query to search for a standard method for task decomposition 2. Receiving the answer, generates a second query to search for common extensions of it 3. Having received all necessary context, answers the question
Running the Application
To start the web interface:
Then visit http://localhost:8849 to interact with the conversational RAG system through a web interface.
Full code
Click to expand full code
import os
import tempfile
from typing import List, Dict, Any
import lazyllm
from lazyllm import pipeline, parallel, bind, fc_register
from lazyllm import Document, Retriever, Reranker, SentenceSplitter, OnlineEmbeddingModule
class ConversationalRAGPipeline:
"""Multi-turn conversational RAG pipeline based on LazyLLM components"""
def __init__(self, docs_path: str = None):
self.chat_history = []
self.docs_path = docs_path or self._create_sample_docs()
self.pipeline = None
self.init_pipeline()
def _create_sample_docs(self) -> str:
"""Create sample documents directory"""
temp_dir = tempfile.mkdtemp(prefix="rag_docs_")
# Create sample documents
sample_docs = {
"task_decomposition.txt": """
Task decomposition is a technique for breaking down complex tasks into smaller, more manageable subtasks.
Chain of Thought (CoT) prompting has become a standard technique for enhancing model performance on complex tasks.
The model is instructed to "think step by step" to utilize more test-time computation to decompose hard tasks into smaller and simpler steps.
CoT transforms big tasks into multiple manageable tasks and sheds light into an interpretation of the model's thinking process.
""",
"tree_of_thoughts.txt": """
Tree of Thoughts (ToT) extends CoT by exploring multiple reasoning possibilities at each step.
It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure.
The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier or majority vote.
""",
"decomposition_methods.txt": """
Common task decomposition methods include:
1) Using simple prompting with LLM like "Steps for XYZ" or "What are the subgoals for achieving XYZ?"
2) Using task-specific instructions like "Write a story outline" for writing a novel
3) Incorporating human inputs to guide the decomposition process
""",
"conversational_rag.txt": """
Conversational RAG systems need to handle chat history and context.
Key challenges include reformulating questions based on conversational context,
maintaining relevant chat history, and generating contextually relevant responses.
The system needs to understand pronoun references and contextual dependencies.
""",
"agent_rag.txt": """
Agent-based RAG systems can iterate over multiple retrieval steps.
Unlike chains that generate at most one query per input, agents can execute multiple retrieval steps to serve a query,
or iterate on a single search to gather more comprehensive information.
Agents leverage reasoning capabilities to make decisions during execution.
"""
}
for filename, content in sample_docs.items():
with open(os.path.join(temp_dir, filename), 'w', encoding='utf-8') as f:
f.write(content.strip())
print(f"Created sample documents directory: {temp_dir}")
return temp_dir
def init_pipeline(self):
"""Initialize RAG pipeline"""
# Create documents and embeddings
documents = Document(
dataset_path=self.docs_path,
embed=OnlineEmbeddingModule(
source="qwen",
type="embed",
embed_model_name="text-embedding-v1"
),
manager=False
)
documents.create_node_group(
name="sentences",
transform=SentenceSplitter,
chunk_size=512,
chunk_overlap=100
)
# Define question rewriting prompt
contextualize_prompt = """You are an assistant that helps rewrite questions. Your task is to:
1. Analyze the user's chat history
2. Understand the latest question
3. If the question depends on historical context, rewrite it as a standalone understandable form
4. If the question is already standalone, return the original question
5. Only return the rewritten question, do not add any explanations
Example:
History: "User: Tell me about John\nAssistant: John is an engineer..."
Question: "What is his job?"
Rewrite as: "What is John's job?"
History: {chat_history}
Question: {question}
Rewritten question:"""
# Define RAG answer generation prompt
rag_prompt = """You are a helpful AI assistant. Please answer the user's question based on the provided context information.
If there is no relevant information in the context, please state that you don't know and don't make up answers.
Chat history:
{chat_history}
Context information:
{context_str}
User question: {query}
Please provide a helpful and accurate answer:"""
# Create question rewriting LLM
self.contextualizer = lazyllm.OnlineChatModule(
source="deepseek",
timeout=30
).prompt(lazyllm.ChatPrompter(contextualize_prompt))
# Create main RAG pipeline
with pipeline() as ppl:
# Parallel retrieval
with parallel().sum as ppl.prl:
ppl.prl.retriever1 = Retriever(
documents,
group_name="sentences",
similarity="cosine",
topk=3
)
ppl.prl.retriever2 = Retriever(
documents,
"CoarseChunk",
"bm25_chinese",
0.003,
topk=2
)
# Reranking
ppl.reranker = Reranker(
"ModuleReranker",
model=OnlineEmbeddingModule(type="rerank", source="qwen"),
topk=3,
output_format='content',
join=True
) | bind(query=ppl.input)
# Format context
ppl.formatter = (lambda nodes, query: dict(
context_str=nodes,
query=query,
chat_history="" # Will be passed during call
)) | bind(query=ppl.input)
# LLM generates answer
ppl.llm = lazyllm.OnlineChatModule(
source="deepseek",
stream=False,
timeout=60
).prompt(lazyllm.ChatPrompter(rag_prompt, extra_keys=["context_str", "chat_history"]))
self.pipeline = ppl
print("RAG pipeline initialized")
def contextualize_question(self, question: str) -> str:
"""Reformulate question based on chat history"""
if not self.chat_history:
return question
# Build historical conversation text
history_text = ""
for msg in self.chat_history[-6:]: # Last 3 turns of conversation
if msg["role"] == "user":
history_text += f"User: {msg['content']}\n"
elif msg["role"] == "assistant":
history_text += f"Assistant: {msg['content']}\n"
try:
# Pass parameters in dictionary format
response = self.contextualizer({
"messages": [
{"role": "system",
"content": "You are an assistant that helps rewrite questions. Please rewrite the user's question based on chat history to make it standalone and understandable."},
{"role": "user",
"content": f"Based on the following chat history, rewrite the question:\n\nChat history:\n{history_text}\n\nQuestion: {question}"}
]
})
return response.strip()
except Exception as e:
print(f"Question rewriting failed: {e}")
return question
def _format_chat_history(self, messages: List[Dict], max_turns: int = 3) -> str:
"""Format chat history"""
history = []
# Take recent conversation turns
for msg in messages[-max_turns * 2:]: # Each turn includes question and answer, so multiply by 2
if msg["role"] == "user":
history.append(f"User: {msg['content']}")
elif msg["role"] == "assistant":
history.append(f"Assistant: {msg['content']}")
return "\n".join(history)
def chat(self, question: str) -> str:
"""Process single turn conversation"""
# 1. Reformulate question
contextualized_question = self.contextualize_question(question)
print(f"Contextualized question: {contextualized_question}")
# 2. Build historical conversation text
history_text = self._format_chat_history(self.chat_history, max_turns=4)
# 3. Generate answer through RAG pipeline
try:
response = self.pipeline(
contextualized_question,
chat_history=history_text
)
# 4. Update chat history
self.chat_history.append({"role": "user", "content": question})
self.chat_history.append({"role": "assistant", "content": response})
# 5. Keep history within reasonable range
if len(self.chat_history) > 20:
self.chat_history = self.chat_history[-20:]
return response
except Exception as e:
error_msg = f"Error generating answer: {e}"
print(error_msg)
return error_msg
# Global RAG system instance
rag_system = None
@fc_register("tool")
def conversational_rag_chat(question: str) -> str:
"""
Multi-turn conversational RAG tool
Args:
question (str): User question
Returns:
str: RAG system's answer
"""
global rag_system
if rag_system is None:
rag_system = ConversationalRAGPipeline()
return rag_system.chat(question)
def create_rag_agent():
"""Create RAG agent"""
# Create ReactAgent with integrated RAG tools
agent = lazyllm.ReactAgent(
llm=lazyllm.OnlineChatModule(model='Qwen3-32B', timeout=60),
tools=["conversational_rag_chat"],
max_retries=3,
return_trace=False,
stream=False
)
return agent
def start_web_interface():
"""Start web interface"""
print("Starting multi-turn conversational RAG system web interface...")
try:
# Create RAG agent
agent = create_rag_agent()
# Start web interface
web_module = lazyllm.WebModule(
agent,
port=8849,
title="Multi-turn Conversational RAG System (LazyLLM)"
)
print(f"Web interface started: http://localhost:8849")
print("Press Ctrl+C to stop service")
web_module.start().wait()
except KeyboardInterrupt:
print("\nStopping service...")
except Exception as e:
print(f"Failed to start: {e}")
if __name__ == "__main__":
# Start web interface
start_web_interface()
Example Output
<think>Alright, the user is asking about the standard methods for task decomposition, and after getting that answer, they want to find common extensions of those methods. First, I need to identify what the standard task decomposition methods are. Common ones include the Work Breakdown Structure (WBS), goal trees, and task lists. Among these, WBS is probably the most widely used standard approach in project management, as it organizes work by breaking a project down hierarchically into smaller, more manageable tasks.
Next, the user might want to know the steps or how to apply these methods in practice. However, based on the question, the user first wants to know what the standard method is, so I should start by answering that. Then, once the user confirms, I can go on to discuss the common extensions of that method. For example, extensions of WBS might include life-cycle-based decomposition, deliverable-oriented breakdowns, or agile user story decomposition.
Now, I should check whether there’s a need to use any tools to retrieve information. Since my knowledge cutoff is October 2023, I likely already have sufficient information about standard task decomposition methods and their extensions. However, to ensure accuracy, I could use internal knowledge bases or databases to verify the most current extensions. But based on existing knowledge, I can provide the standard method first and then suggest using a tool if the user wants more detailed or updated information.
Given the user’s instruction that each response must start with a Thought and may involve tool calls, in this case, since I already know the standard methods, I can provide the answer first and then ask whether further details are needed. Or, if the user’s question demands more in-depth extensions, I could use a tool to search for related materials. Considering that the question has two parts, it’s best to answer the first part (standard methods) first, then proceed to the second (extensions).</think>
Thought: The user is asking about standard methods for task decomposition and how they can be extended. I should first clarify the standard methods, then provide directions for possible extensions.
Answer: Standard methods of task decomposition typically include the following steps:
1. **Define objectives** – Clearly identify the final deliverables or goals;
2. **Hierarchical breakdown** – Divide the work logically into smaller sub-tasks (e.g., Work Breakdown Structure, WBS);
3. **Resource allocation** – Assign personnel, time, and other resources to each sub-task;
4. **Prioritization** – Determine the order in which tasks should be executed;
5. **Monitoring and feedback** – Track progress and adjust plans as necessary.
If you’d like to explore common extensions of these methods (e.g., Agile decomposition, goal tree analysis, etc.), please specify the domain or application scenario, and I can provide detailed information.
Next steps
We've covered the steps to build a basic conversational Q&A application:
- We used chains to build a predictable application that contextualizes questions based on chat history
- We used agents to build an application that can iterate on a sequence of queries
To explore different types of retrievers and retrieval strategies, visit LazyLLM's documentation on retrieval components.
For more advanced agent architectures, check out LazyLLM's agent documentation and examples.