Data Processing
Data Processing Operators
Base Operators
lazyllm.tools.data.LazyLLMDataBase
Base class for data processing operators registered via data_register. Provides concurrency, result persistence/resume, progress tracking, and error collection.
Key methods:
- forward(self, input_data, **kwargs): implement single-item processing.
- forward_batch_input(self, inputs, **kwargs): implement batch processing and return results.
- __call__(self, inputs): unified entry point; decides execution mode based on implemented methods and handles concurrency, resume and saving.
- set_output(self, path): set export path; when set, __call__ writes results to a file and returns the file path.
Constructor args:
- _concurrency_mode (str): concurrency mode, one of 'process'|'thread'|'single'.
- _save_data (bool): whether to persist intermediate results for resume.
- _max_workers (int|None): maximum workers for concurrency, None means default.
- _ignore_errors (bool): whether to ignore exceptions in tasks.
- **kwargs (dict): additional operator arguments.
Config keys (via lazyllm.config):
- data_process_path (str): root folder to store pipeline outputs.
- data_process_resume (bool): enable resume from previous progress.
Examples:
from lazyllm.tools.data import LazyLLMDataBase
# simple usage: subclass and implement forward
class EchoOp(LazyLLMDataBase):
    def forward(self, data):
        return {'text': data.get('text', '')}
op = EchoOp(_save_data=True)
res = op([{'text': 'hello'}]) # returns list or exported path depending on set_output
Source code in lazyllm/tools/data/base_data.py
forward(input_data, **kwargs)
Method to implement in subclasses for single-item processing. Supported return types:
- dict: processed single result.
- list: expand one input into multiple outputs.
- None: keep the original input unchanged.
Exceptions or error returns are recorded to the error file and are typically excluded from the valid results.
Parameters:
- input_data (dict): a single input data dict.
- **kwargs (dict, default: {}): additional user-provided arguments.
Examples:
from lazyllm.tools.data import LazyLLMDataBase
class MyOp(LazyLLMDataBase):
    def forward(self, data):
        # return dict or list or None
        return {'text': data.get('text', '').upper()}
op = MyOp()
print(op([{'text': 'a'}]))
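The three return types and the error handling described above can be sketched without the library. This is a standalone illustration, not LazyLLM's implementation: `apply_forward` and `demo_forward` are hypothetical names, and errors are collected in an in-memory list here rather than written to an error file.

```python
from typing import Any, Callable, Dict, List, Tuple, Union

Item = Dict[str, Any]
ForwardResult = Union[Item, List[Item], None]


def apply_forward(
    forward: Callable[[Item], ForwardResult],
    inputs: List[Item],
) -> Tuple[List[Item], List[Dict[str, Any]]]:
    """Collect results following the dict / list / None conventions."""
    results: List[Item] = []
    errors: List[Dict[str, Any]] = []
    for item in inputs:
        try:
            out = forward(item)
        except Exception as exc:  # failed items are recorded, not re-raised
            errors.append({'input': item, 'error': repr(exc)})
            continue
        if out is None:  # None: keep the original input unchanged
            results.append(item)
        elif isinstance(out, list):  # list: one input expands to many outputs
            results.extend(out)
        else:  # dict: a single processed result
            results.append(out)
    return results, errors


def demo_forward(data: Item) -> ForwardResult:
    """Hypothetical forward that exercises every branch."""
    text = data.get('text', '')
    if text == 'fail':
        raise ValueError('bad input')
    if text == 'keep':
        return None
    if text == 'split':
        return [{'text': 'split-1'}, {'text': 'split-2'}]
    return {'text': text.upper()}


results, errors = apply_forward(
    demo_forward,
    [{'text': 'a'}, {'text': 'keep'}, {'text': 'split'}, {'text': 'fail'}],
)
print(results)
print(len(errors))
```

The failed item appears only in `errors`, which mirrors the statement that errors are kept out of the valid results.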
Source code in lazyllm/tools/data/base_data.py
forward_batch_input(inputs, **kwargs)
Optional batch-processing method for subclasses. Receives the whole input list and returns a final list of results. Useful for custom batching or single-call external services.
Parameters:
- inputs (list[dict]): list of input data dicts.
- **kwargs (dict, default: {}): additional user-provided arguments.
Examples:
from lazyllm.tools.data import LazyLLMDataBase
class BatchOp(LazyLLMDataBase):
    def forward_batch_input(self, inputs):
        # implement batch processing and return a list
        return [{'text': i.get('text', '').lower()} for i in inputs]
op = BatchOp()
print(op([{'text': 'A'}, {'text': 'B'}]))
Source code in lazyllm/tools/data/base_data.py
set_output(output_path)
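Set the output path for exporting final results to a JSONL file. Once a path is set, calling the operator writes the results to that file and returns the file path.
Parameters:
- output_path (str): directory path or concrete .jsonl file path.
Behavior:
- If a directory path is provided, a .jsonl file is created inside it (the example below suggests the file is named after the operator, e.g. RichContent.jsonl).
- If a concrete .jsonl file path is provided, results are written to that file.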
Examples:
from lazyllm.tools.data import Demo2
# export to a directory (a .jsonl file is created inside it)
op = Demo2.rich_content(input_key='text').set_output('./out_dir')
path = op([{'text': 'sample'}])
print(path) # ./out_dir/RichContent.jsonl or similar
# export to a specific file
op = Demo2.rich_content(input_key='text').set_output('./out_dir/results.jsonl')
path = op([{'text': 'sample'}])
print(path) # ./out_dir/results.jsonl
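The directory-versus-file distinction can be illustrated with a small path helper. This is a sketch under assumptions: `resolve_output_path` and its `operator_name` argument are hypothetical, and the real file-naming rule used by the library for directories is not confirmed by this page.

```python
import os


def resolve_output_path(output_path: str, operator_name: str) -> str:
    """Return the .jsonl file path implied by output_path.

    operator_name is a hypothetical stand-in for whatever name the
    library uses when only a directory is given.
    """
    if output_path.endswith('.jsonl'):
        return output_path  # concrete file path: use it as-is
    # directory: place a <operator_name>.jsonl file inside it
    return os.path.join(output_path, f'{operator_name}.jsonl')


print(resolve_output_path('./out_dir/results.jsonl', 'RichContent'))
print(resolve_output_path('./out_dir', 'RichContent'))
```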
Source code in lazyllm/tools/data/base_data.py
Demo Operators
lazyllm.tools.data.operators.demo_ops
AddSuffix
Bases: Demo2
Class-based operator that appends a suffix to a specified field. Supports concurrency configuration via constructor args.
Parameters:
- suffix (str): suffix string to append.
- input_key (str, default: 'content'): key name of the text field.
- _max_workers (int | None): optional max concurrency.
- _concurrency_mode (str, default: 'process'): optional concurrency mode.
- _save_data (bool): optional; whether to persist results.
Examples:
from lazyllm.tools.data import Demo2
op = Demo2.AddSuffix(suffix='!!!', input_key='text', _max_workers=2)
data = [{'text': 'wow'}]
res = op(data)
print(res)
# [{'text': 'wow!!!'}]
Source code in lazyllm/tools/data/operators/demo_ops.py
build_pre_suffix(data, input_key='content', prefix='', suffix='')
Add a prefix and suffix to the specified field of each item in the input list. Registered as a batch operator.
Parameters:
- data (list[dict]): list of dicts.
- input_key (str, default: 'content'): key name of the text field.
- prefix (str, default: ''): string to add before the field.
- suffix (str, default: ''): string to add after the field.
Examples:
from lazyllm.tools.data import Demo1
op = Demo1.build_pre_suffix(input_key='text', prefix='Hello, ', suffix='!')
data = [{'text': 'world'}]
res = op(data)
print(res)
# [{'text': 'Hello, world!'}]
Source code in lazyllm/tools/data/operators/demo_ops.py
error_prone_op(data, input_key='content')
A test operator that raises an exception for specific input (content == 'fail') and otherwise returns a processed dict. Used to validate error collection and skipping behavior.
Parameters:
- data (dict): single data dict.
- input_key (str, default: 'content'): key name of the text field.
Examples:
from lazyllm.tools.data import Demo2
op = Demo2.error_prone_op(input_key='text', _save_data=True, _concurrency_mode='single')
data = [{'text': 'ok'}, {'text': 'fail'}, {'text': 'ok2'}]
res = op(data)
print(res)
# [{'text': 'Processed: ok'}, {'text': 'Processed: ok2'}]
# valid results skip the failed item; error details written to error file
Source code in lazyllm/tools/data/operators/demo_ops.py
process_uppercase(data, input_key='content')
Convert the input text field to uppercase. Intended as a single-item processing function.
Parameters:
- data (dict): a dict representing a single data item.
- input_key (str, default: 'content'): key name of the text field.
Examples:
from lazyllm.tools.data import Demo1
op = Demo1.process_uppercase(input_key='text')
data = [{'text': 'hello'}]
res = op(data)
print(res)
# [{'text': 'HELLO'}]
Source code in lazyllm/tools/data/operators/demo_ops.py
rich_content(data, input_key='content')
Split a single input into multiple outputs (original + derived parts). Implemented as a forward that returns a list.
Parameters:
- data (dict): single data dict.
- input_key (str, default: 'content'): key name of the text field.
Examples:
from lazyllm.tools.data import Demo2
op = Demo2.rich_content(input_key='text')
data = [{'text': 'This is a test.'}]
res = op(data)
print(res)
# [
# {'text': 'This is a test.'},
# {'text': 'This is a test. - part 1'},
# {'text': 'This is a test. - part 2'}
# ]
Source code in lazyllm/tools/data/operators/demo_ops.py
Preference Operators
lazyllm.tools.data.operators.preference_ops
IntentExtractor
Bases: PreferenceOps
Preference operator: intent extractor.
Extracts the core intent from a specified field of the input data dict and writes it to an output field, so that downstream steps can generate multiple candidate responses and construct preference pairs.
Notes:
- Internally uses a model plus a JSON formatter and expects the model output to be a JSON dict. If the output cannot be parsed as a dict, the result is None.
- Default concurrency mode is thread.
Parameters:
- model: a LazyLLM model object (required); it is shared via share().
- input_key (str, default: 'content'): input text field name.
- output_key (str, default: 'intent'): output intent field name.
- **kwargs: extra args passed to the base operator (e.g. _max_workers, _save_data).
Examples:
from lazyllm.tools.data.operators.preference_ops import IntentExtractor
# model must be provided by your project environment, e.g. a model object obtained from lazyllm.xxx(...)
op = IntentExtractor(model=model, input_key='content', output_key='intent')
print(op({'content': 'I want to stay at a hotel in Beijing.'}))
# [{
# 'content': 'I want to stay at a hotel in Beijing.',
# 'intent': {
# 'intent': 'book_hotel',
# 'entities': [{'entity': 'location', 'value': 'Beijing'}]
# }
# }]
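The "JSON dict or None" rule from the notes above can be sketched as a small parser. This is an illustration only: `parse_intent` is a hypothetical helper built on the standard `json` module, not the JSON formatter the operator uses internally.

```python
import json
from typing import Any, Dict, Optional


def parse_intent(raw: Any) -> Optional[Dict[str, Any]]:
    """Return raw as a dict when possible, otherwise None."""
    if isinstance(raw, dict):
        return raw  # already structured
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return None  # not valid JSON at all
        # valid JSON that is not an object (list, number, ...) is rejected
        return parsed if isinstance(parsed, dict) else None
    return None


print(parse_intent('{"intent": "book_hotel"}'))
print(parse_intent('[1, 2]'))
print(parse_intent('not json'))
```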
Source code in lazyllm/tools/data/operators/preference_ops.py
PreferencePairConstructor
Bases: PreferenceOps
Preference operator: preference pair constructor (chosen / rejected).
Given a list of candidate responses and their score list, constructs a (chosen, rejected) pair and outputs a preference sample:
- instruction: instruction text (by default read from the intent field)
- chosen: better response
- rejected: worse response
Two strategies are supported:
- max_min: choose the highest score as chosen and the lowest as rejected (requires highest > lowest).
- threshold: find a pair with score difference >= threshold, from high to low.
Note: if inputs are empty/mismatched, or no valid pair can be constructed, it returns an empty list [] (useful to filter invalid samples in pipelines).
Parameters:
- strategy (str, default: 'max_min'): 'max_min' or 'threshold'.
- threshold (float, default: 0.5): minimum score gap when strategy == 'threshold'.
- instruction_key (str, default: 'intent'): instruction field name.
- response_key (str, default: 'responses'): candidate response list field name.
- score_key (str, default: 'evaluation'): score list field name.
- output_chosen_key (str, default: 'chosen'): chosen field name.
- output_rejected_key (str, default: 'rejected'): rejected field name.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.preference_ops import PreferencePairConstructor
op = PreferencePairConstructor(strategy='max_min', instruction_key='intent',
response_key='responses', score_key='evaluation')
data = {
'intent': 'book a hotel',
'responses': ['good response', 'bad response'],
'evaluation': [10, 6],
}
print(op(data))
# [{
# 'instruction': 'book a hotel',
# 'chosen': 'good response',
# 'rejected': 'bad response'
# }]
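The two strategies can be sketched in plain Python. This is a minimal illustration, not the operator's code: `build_pair` is a hypothetical function, and the "from high to low" search order for the threshold strategy is an assumption (highest-scored candidate first, paired with the next lower candidate that clears the gap).

```python
from typing import Any, Dict, List, Sequence


def build_pair(
    instruction: str,
    responses: Sequence[str],
    scores: Sequence[float],
    strategy: str = 'max_min',
    threshold: float = 0.5,
) -> List[Dict[str, Any]]:
    """Return [sample] with chosen/rejected, or [] when no pair exists."""
    if not responses or len(responses) != len(scores):
        return []  # empty or mismatched inputs
    # candidate indices ranked from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pair = None
    if strategy == 'max_min':
        hi, lo = order[0], order[-1]
        if scores[hi] > scores[lo]:  # requires highest > lowest
            pair = (hi, lo)
    elif strategy == 'threshold':
        for a in range(len(order)):
            for b in range(a + 1, len(order)):
                if scores[order[a]] - scores[order[b]] >= threshold:
                    pair = (order[a], order[b])
                    break
            if pair is not None:
                break
    else:
        raise ValueError(f'unknown strategy: {strategy!r}')
    if pair is None:
        return []
    chosen, rejected = pair
    return [{
        'instruction': instruction,
        'chosen': responses[chosen],
        'rejected': responses[rejected],
    }]


print(build_pair('book a hotel', ['good response', 'bad response'], [10, 6]))
print(build_pair('book a hotel', ['a', 'b'], [5, 5]))
```

Returning an empty list for invalid input lets a pipeline drop the sample naturally, since a list return expands to zero outputs.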
Source code in lazyllm/tools/data/operators/preference_ops.py
PreferenceResponseGenerator
Bases: PreferenceOps
Preference operator: multi-response generator.
Given the intent (or any instruction text), generates n candidate responses and writes them as a list to the output field.
Parameters:
- model: a LazyLLM model object (required); it is shared via share().
- n (int, default: 3): number of candidate responses to generate.
- temperature (float, default: 1.0): sampling temperature.
- system_prompt (str | None, default: None): optional system prompt; if provided, applies .prompt(system_prompt) to the model.
- input_key (str, default: 'intent'): input field name.
- output_key (str, default: 'responses'): output field name.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.preference_ops import PreferenceResponseGenerator
op = PreferenceResponseGenerator(model=model, n=3, temperature=0.8, input_key='intent', output_key='responses')
print(op({'intent': 'book a hotel'}))
# [{
# 'intent': 'book a hotel',
# 'responses': [
# "<think>Okay, the user wants to book a hotel. ...",
# "<think>Okay, the user wants to book a hotel. ...",
# ...
# ]
# }]
Source code in lazyllm/tools/data/operators/preference_ops.py
ResponseEvaluator
Bases: PreferenceOps
Preference operator: response evaluator.
Evaluates multiple candidate responses for the same instruction and outputs a score list, which can be used to build chosen/rejected pairs.
Scoring dimensions (total 10):
- Helpfulness: 4
- Truthfulness: 3
- Fluency: 3
Notes:
- Internally uses a model plus a JSON formatter; each evaluation is expected to return a dict with total_score.
- If total_score cannot be extracted, a warning is logged and the score defaults to 0 for that response.
Parameters:
- model: a LazyLLM model object (required); it is shared via share().
- input_key (str, default: 'content'): instruction/raw content field name.
- response_key (str, default: 'responses'): candidate response list field name.
- output_key (str, default: 'evaluation'): output score list field name.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.preference_ops import ResponseEvaluator
op = ResponseEvaluator(model=model, input_key='intent', response_key='responses', output_key='evaluation')
data = {
'intent': {'intent': 'book a hotel'},
'responses': [
'I can help you book a hotel in Beijing.',
'Here are some hotels for you.'
],
}
print(op(data))
# [{
# 'intent': {'intent': 'book a hotel'},
# 'responses': [
# 'I can help you book a hotel in Beijing.',
# 'Here are some hotels for you.'
# ],
# 'evaluation': [10, 8]
# }]
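The fallback described in the notes (missing total_score yields a warning and a score of 0) can be sketched as follows. `extract_total_score` and `score_responses` are hypothetical helpers, and the warning goes to the standard `logging` module here rather than to LazyLLM's own logger.

```python
import logging
from typing import Any, List

logger = logging.getLogger(__name__)


def extract_total_score(evaluation: Any) -> float:
    """Return evaluation['total_score'] when it is numeric, else 0."""
    if isinstance(evaluation, dict):
        score = evaluation.get('total_score')
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(score, (int, float)) and not isinstance(score, bool):
            return score
    logger.warning('total_score missing in %r; defaulting to 0', evaluation)
    return 0


def score_responses(evaluations: List[Any]) -> List[float]:
    """Map one parsed evaluation per response to a score list."""
    return [extract_total_score(e) for e in evaluations]


print(score_responses([{'total_score': 10}, {'total_score': 8}, None]))
```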
Source code in lazyllm/tools/data/operators/preference_ops.py
Tool-Use Operators
lazyllm.tools.data.operators.tool_use_ops
ChainedLogicAssembler
Bases: ToolUseOps
Tool-use data operator: sequential task generator.
Given a list of atomic tasks, generates successor relationships and composed tasks to form linear or dependency-aware task chains.
Typical JSON structure:
- items: list of dicts:
- task: current atomic task
- next_task: its successor task
- composed_task: description combining task and next_task
Parameters:
- model: a LazyLLM model object (required).
- input_key (str, default: 'atomic_tasks'): input atomic task field name.
- output_key (str, default: 'sequential_tasks'): output sequential task list field name.
- system_prompt (str | None, default: None): optional system prompt.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.tool_use_ops import ChainedLogicAssembler
atomic_tasks = [
{'task': '获取出发地与目的地'},
{'task': '确认出行日期'},
{'task': '筛选符合条件的车次'},
]
op = ChainedLogicAssembler(model=model, input_key='atomic_tasks', output_key='sequential_tasks')
print(op({'atomic_tasks': atomic_tasks}))
# {
# 'atomic_tasks': [...],
# 'sequential_tasks': [
# {'task': '获取出发地与目的地', 'next_task': '确认出行日期', 'composed_task': '先获取站点再确认日期'},
# {'task': '确认出行日期', 'next_task': '筛选符合条件的车次', 'composed_task': '在已知日期基础上筛选车次'},
# ...
# ]
# }
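As a downstream illustration of the task / next_task structure, the pairs can be followed to recover a single linear chain. This is a sketch that assumes each task has at most one successor; `linear_chain` is a hypothetical helper and not part of the operator.

```python
from typing import Dict, List


def linear_chain(items: List[Dict[str, str]]) -> List[str]:
    """Follow task -> next_task links; return [] if they do not form one chain."""
    next_of = {i['task']: i['next_task'] for i in items}
    successors = set(next_of.values())
    # the chain starts at the only task that is nobody's successor
    starts = [t for t in next_of if t not in successors]
    if len(starts) != 1:
        return []
    chain = [starts[0]]
    seen = {starts[0]}
    while chain[-1] in next_of:
        nxt = next_of[chain[-1]]
        if nxt in seen:  # guard against cycles
            return []
        chain.append(nxt)
        seen.add(nxt)
    return chain


items = [
    {'task': 'collect stations', 'next_task': 'confirm date'},
    {'task': 'confirm date', 'next_task': 'filter trains'},
]
print(linear_chain(items))
```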
Source code in lazyllm/tools/data/operators/tool_use_ops.py
ContextualBeacon
Bases: ToolUseOps
Tool-use data operator: scenario extractor.
Extracts high-level scenario information from a conversation text and writes a structured JSON object into the output field.
Typical JSON structure:
- scene: one-sentence scenario description
- domain: domain/topic
- user_profile: user role/profile (optional)
- assistant_goal: goal the assistant should achieve
- constraints: list of constraints
- key_entities: list of key entities
Parameters:
- model: a LazyLLM model object (required), shared and wrapped with a JSON formatter.
- input_key (str, default: 'content'): input conversation field name.
- output_key (str, default: 'scenario'): output scenario field name.
- system_prompt (str | None, default: None): optional custom system prompt; defaults to a built-in Chinese prompt.
- **kwargs: extra args passed to the base operator (e.g. _max_workers, _save_data).
Examples:
from lazyllm.tools.data.operators.tool_use_ops import ContextualBeacon
op = ContextualBeacon(model=model, input_key='content', output_key='scenario')
item = {
'content': 'User: 我想订一张从北京到上海的高铁票,下午出发最好。\nAssistant: 好的,请问具体日期?'
}
print(op(item))
# Output Example:
# {
# 'content': 'User: 我想订一张从北京到上海的高铁票,下午出发最好。\nAssistant: 好的,请问具体日期?',
# 'scenario': {
# 'scene': '用户咨询高铁购票服务',
# 'domain': '出行/购票',
# 'user_profile': '普通出行乘客',
# 'assistant_goal': '帮助用户完成车次与时间筛选并完成购票',
# 'constraints': ['出发地为北京', '目的地为上海', '尽量下午出发'],
# 'key_entities': ['北京', '上海', '高铁', '下午']
# }
# }
Source code in lazyllm/tools/data/operators/tool_use_ops.py
DecompositionKernel
Bases: ToolUseOps
Tool-use data operator: atomic task generator.
Given a scenario, generates a list of fine-grained, single-goal atomic tasks, which can be used for later orchestration and tool design.
Typical JSON structure:
- tasks: list of atomic task dicts:
- task: task description
- input: task input (optional)
- output: task output (optional)
- constraints: list of constraints
Parameters:
- model: a LazyLLM model object (required).
- input_key (str, default: 'scenario'): input scenario field name.
- output_key (str, default: 'atomic_tasks'): output atomic task list field name.
- n (int, default: 5): maximum number of tasks.
- system_prompt (str | None, default: None): optional system prompt.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.tool_use_ops import DecompositionKernel
scenario = {
'scene': '用户咨询高铁购票服务',
'assistant_goal': '帮助用户完成车次筛选并购票',
}
op = DecompositionKernel(model=model, input_key='scenario', output_key='atomic_tasks', n=4)
print(op({'scenario': scenario}))
# {
# 'scenario': {...},
# 'atomic_tasks': [
# {'task': '获取用户出发地和目的地', 'input': '', 'output': '出发地与目的地', 'constraints': [...]},
# {'task': '确认出行日期与大致时间', ...},
# ...
# ]
# }
Source code in lazyllm/tools/data/operators/tool_use_ops.py
DialogueSimulator
Bases: ToolUseOps
Tool-use data operator: multi-turn conversation generator (with tools).
Given a composed task and a list of available functions, generates a multi-turn conversation JSON involving User, Assistant and Tool roles, suitable for tool-calling training data.
Typical JSON structure:
- messages: list of dicts:
- role: 'user' | 'assistant' | 'tool'
- content: text content
- name: tool name (optional, when role == 'tool')
Parameters:
- model: a LazyLLM model object (required).
- input_composition_key (str, default: 'composition_task'): input composition task field name.
- input_functions_key (str, default: 'functions'): input function list field name.
- output_key (str, default: 'conversation'): output conversation field name.
- n_turns (int, default: 6): desired number of turns (as a hint to the model).
- system_prompt (str | None, default: None): optional system prompt.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.tool_use_ops import DialogueSimulator
composition_task = '根据用户需求查询并推荐合适的高铁车次'
functions = [
{
'name': 'query_train_tickets',
'description': '查询高铁车次',
'args': [...],
'returns': {...},
}
]
op = DialogueSimulator(model=model,
input_composition_key='composition_task',
input_functions_key='functions',
output_key='conversation',
n_turns=6)
print(op({'composition_task': composition_task, 'functions': functions}))
# {
# 'composition_task': '根据用户需求查询并推荐合适的高铁车次',
# 'functions': [...],
# 'conversation': {
# 'messages': [
# {'role': 'user', 'content': '我想订一张明天下午从北京到上海的高铁票'},
# {'role': 'assistant', 'content': '好的,我先为您确认出发时间与车次。'},
# {'role': 'tool', 'name': 'query_train_tickets', 'content': '{...工具返回...}'},
# ...
# ]
# }
# }
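Because the conversation is model-generated, it can help to check its shape before using it as training data. The following is a hedged sketch: `validate_conversation` is a hypothetical checker written against the "Typical JSON structure" listed above, not a function provided by the library.

```python
from typing import Any, List

ALLOWED_ROLES = {'user', 'assistant', 'tool'}


def validate_conversation(conversation: Any) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    messages = conversation.get('messages') if isinstance(conversation, dict) else None
    if not isinstance(messages, list):
        return ['expected a dict with a "messages" list']
    problems: List[str] = []
    for idx, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f'message {idx}: not a dict')
            continue
        role = msg.get('role')
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f'message {idx}: unexpected role {role!r}')
        if not isinstance(msg.get('content'), str):
            problems.append(f'message {idx}: content must be a string')
        # 'name' is optional, so its absence is not reported
    return problems


good = {'messages': [
    {'role': 'user', 'content': 'hi'},
    {'role': 'tool', 'name': 'query_train_tickets', 'content': '{}'},
]}
print(validate_conversation(good))
```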
Source code in lazyllm/tools/data/operators/tool_use_ops.py
ProtocolSpecifier
Bases: ToolUseOps
Tool-use data operator: function specification generator.
Given a composed task and its subtasks, generates a list of function specifications suitable for tool calling.
Typical JSON structure:
- functions: list of dicts:
- name: function name
- description: what the function does
- args: list of argument specs with name/type/description
- returns: return type and description
Parameters:
- model: a LazyLLM model object (required).
- input_composition_key (str, default: 'composition_task'): input composition task field name.
- input_atomic_key (str, default: 'atomic_tasks'): input atomic task field name.
- output_key (str, default: 'functions'): output function spec list field name.
- system_prompt (str | None, default: None): optional system prompt.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.tool_use_ops import ProtocolSpecifier
composition_task = '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表'
atomic_tasks = [
{'task': '获取出发地与目的地'},
{'task': '确认出行日期'},
{'task': '调用车次查询接口并过滤结果'},
]
op = ProtocolSpecifier(model=model,
input_composition_key='composition_task',
input_atomic_key='atomic_tasks',
output_key='functions')
print(op({'composition_task': composition_task, 'atomic_tasks': atomic_tasks}))
# {
# 'composition_task': '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表',
# 'atomic_tasks': [...],
# 'functions': [
# {
# 'name': 'query_train_tickets',
# 'description': '根据出发地、目的地与日期查询高铁车次',
# 'args': [{'name': 'from_city', 'type': 'string', ...}, ...],
# 'returns': {'type': 'TrainList', 'description': '符合条件的车次列表'}
# },
# ...
# ]
# }
Source code in lazyllm/tools/data/operators/tool_use_ops.py
ScenarioDiverger
Bases: ToolUseOps
Tool-use data operator: scenario expander.
Given a base scenario, generates multiple alternative scenarios that are semantically related but differ in details, to enrich data diversity.
Typical JSON structure:
- scenarios: list of scenario dicts, each with fields like scene/domain/assistant_goal/constraints/key_entities.
Parameters:
- model: a LazyLLM model object (required).
- input_key (str, default: 'scenario'): input scenario field name (dict or str).
- output_key (str, default: 'expanded_scenarios'): output expanded scenario list field name.
- n (int, default: 3): maximum number of scenarios to generate.
- system_prompt (str | None, default: None): optional system prompt.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.tool_use_ops import ScenarioDiverger
base = {
'scene': '用户咨询高铁购票服务',
'domain': '出行/购票',
'assistant_goal': '帮助用户完成车次筛选并购票',
}
op = ScenarioDiverger(model=model, input_key='scenario', output_key='expanded_scenarios', n=3)
print(op({'scenario': base}))
# {
# 'scenario': {...},
# 'expanded_scenarios': [
# {'scene': '用户预订跨城商务出差火车票', ...},
# {'scene': '用户为家人购买回乡火车票', ...},
# ...
# ]
# }
Source code in lazyllm/tools/data/operators/tool_use_ops.py
TopologyArchitect
Bases: ToolUseOps
Tool-use data operator: parallel/sequential/hybrid task combination generator.
Given atomic tasks, generates three kinds of task compositions:
- parallel_tasks: tasks that can be executed in parallel
- sequential_tasks: tasks with explicit ordering dependencies
- hybrid_tasks: compositions mixing parallel and sequential relations
Parameters:
- model: a LazyLLM model object (required).
- input_key (str, default: 'atomic_tasks'): input atomic task field name.
- output_key (str, default: 'para_seq_tasks'): output composition field name.
- system_prompt (str | None, default: None): optional system prompt.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.tool_use_ops import TopologyArchitect
atomic_tasks = [
{'task': '收集出行需求'},
{'task': '查询可选车次'},
{'task': '对比价格与时间'},
{'task': '完成下单支付'},
]
op = TopologyArchitect(model=model, input_key='atomic_tasks', output_key='para_seq_tasks')
print(op({'atomic_tasks': atomic_tasks}))
# {
# 'atomic_tasks': [...],
# 'para_seq_tasks': {
# 'parallel_tasks': ['同时查询不同日期/车次方案', ...],
# 'sequential_tasks': ['先确认日期再选车次', ...],
# 'hybrid_tasks': ['并行对比多个方案后统一决策并下单', ...]
# }
# }
Source code in lazyllm/tools/data/operators/tool_use_ops.py
ViabilitySieve
Bases: ToolUseOps
Tool-use data operator: composition task feasibility filter.
Evaluates a list of composed tasks for feasibility and completeness, and filters out invalid ones.
Expected intermediate JSON from the model:
- items: list of dicts with composed_task, is_valid, reason, etc.
On output, only keeps composed_task values where is_valid is true. If the model output does not match the schema, it falls back to returning items or the raw parsed result.
Parameters:
- model: a LazyLLM model object (required).
- input_composition_key (str, default: 'composition_tasks'): input composition task field name.
- input_atomic_key (str, default: 'atomic_tasks'): input atomic task field name (optional).
- output_key (str, default: 'filtered_composition_tasks'): output filtered composition task field name.
- system_prompt (str | None, default: None): optional system prompt.
- **kwargs: extra args passed to the base operator.
Examples:
from lazyllm.tools.data.operators.tool_use_ops import ViabilitySieve
composition_tasks = ['先获取出发地和目的地再筛选车次', '直接随机推荐一个车次']
atomic_tasks = [
{'task': '获取出发地与目的地'}, {'task': '确认出行日期'}, {'task': '筛选符合条件的车次'}
]
op = ViabilitySieve(model=model,
input_composition_key='composition_tasks',
input_atomic_key='atomic_tasks',
output_key='filtered_composition_tasks')
print(op({'composition_tasks': composition_tasks, 'atomic_tasks': atomic_tasks}))
# {
# 'composition_tasks': [...],
# 'atomic_tasks': [...],
# 'filtered_composition_tasks': ['先获取出发地和目的地再筛选车次', ...]
# }
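The filtering and fallback rule described above can be sketched as a pure function over the parsed model output. `keep_valid_tasks` is a hypothetical name, and treating only a literal boolean True as valid is an assumption of this sketch.

```python
from typing import Any


def keep_valid_tasks(parsed: Any) -> Any:
    """Keep composed_task values flagged valid; fall back when the schema differs."""
    items = parsed.get('items') if isinstance(parsed, dict) else None
    if isinstance(items, list):
        if all(isinstance(i, dict) and 'composed_task' in i for i in items):
            # expected schema: keep only entries whose is_valid is True
            return [i['composed_task'] for i in items if i.get('is_valid') is True]
        return items  # items present but not in the expected shape
    return parsed  # no items list: return the raw parsed result


parsed = {'items': [
    {'composed_task': 'A', 'is_valid': True, 'reason': ''},
    {'composed_task': 'B', 'is_valid': False, 'reason': 'skips required input'},
]}
print(keep_valid_tasks(parsed))
```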
Source code in lazyllm/tools/data/operators/tool_use_ops.py
Text2SQL Operators
lazyllm.tools.data.operators.text2sql_ops
SQLConsensusUnifier
Bases: Text2SQLOps
Text2SQL data operator: SQLConsensusUnifier.
Given multiple CoT traces (cot_responses), parses SQL from each, executes them, and selects the best CoT/SQL pair based on execution consistency and success.
Behavior:
- Parses SQL from each CoT using the same logic as SQLForge.
- Calls database_manager.batch_execute_queries to get execution results and signatures.
- Uses a voting strategy (_vote_select) to pick the best candidate, then:
- sets output_cot_key (default 'cot_reasoning') to the winning CoT,
- overwrites data['SQL'] with the winning SQL.
Parameters:
- database_manager: query execution provider (required); must implement batch_execute_queries(list[(db_id, sql)]).
- **kwargs: extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import SQLConsensusUnifier
op = SQLConsensusUnifier(database_manager=database_manager)
item = {
'db_id': 'db_1',
'cot_responses': [
"...CoT + ```sql SELECT count(*) FROM orders WHERE status = 'paid'```",
'...CoT + ```sql SELECT count(*) FROM orders```',
]
}
res = op(item)
print(res['cot_reasoning'][:200])
print(res['SQL'])
# "...首先识别需要统计已支付订单数量,其次在 orders 表中过滤 status = 'paid' ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
# "SELECT count(*) FROM orders WHERE status = 'paid';"
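The parse-then-vote flow can be sketched with a majority vote over execution signatures. This is an assumption-heavy illustration: the real `_vote_select` logic and the format of the signatures returned by `batch_execute_queries` are not shown on this page, so `parse_sql`, `vote_select`, and the string signatures below are hypothetical. The fence marker is built from single backticks only to keep the example easy to embed.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

FENCE = '`' * 3  # three backticks, the marker around SQL in a CoT trace
SQL_BLOCK = re.compile(FENCE + r'sql\s*(.*?)' + FENCE, re.DOTALL | re.IGNORECASE)


def parse_sql(cot: str) -> Optional[str]:
    """Return the last sql-fenced block in a CoT trace, if any."""
    blocks = SQL_BLOCK.findall(cot)
    return blocks[-1].strip() if blocks else None


def vote_select(
    candidates: List[Tuple[str, Optional[str], Optional[str]]],
) -> Optional[Tuple[str, str]]:
    """Pick the (cot, sql) whose execution signature is most common.

    Each candidate is (cot, sql, signature); signature None marks a failed
    execution, which is excluded from the vote.
    """
    ok = [(cot, sql, sig) for cot, sql, sig in candidates if sql and sig is not None]
    if not ok:
        return None
    winner_sig, _ = Counter(sig for _, _, sig in ok).most_common(1)[0]
    for cot, sql, sig in ok:
        if sig == winner_sig:
            return cot, sql  # first candidate that produced the winning result
    return None


cot_a = 'reason A ' + FENCE + "sql SELECT count(*) FROM orders WHERE status = 'paid'" + FENCE
cot_b = 'reason B ' + FENCE + 'sql SELECT count(*) FROM orders' + FENCE
print(parse_sql(cot_a))
```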
Source code in lazyllm/tools/data/operators/text2sql_ops.py
SQLContextAssembler
Bases: Text2SQLOps
Text2SQL data operator: SQLContextAssembler.
Builds prompts for downstream Text2SQL models from database schema, natural language question, and evidence.
Behavior:
- Prefers database_manager.get_db_details(db_id); falls back to get_create_statements_and_insert_statements if not available.
- Supports a custom prompt_template; otherwise uses a simple English template.
Parameters:
- database_manager: schema provider (required); must implement get_create_statements_and_insert_statements(db_id) and may optionally implement get_db_details(db_id).
- prompt_template: optional custom prompt builder.
- **kwargs: extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import SQLContextAssembler
op = SQLContextAssembler(database_manager=database_manager)
item = {
'db_id': 'db_1',
'question': '有多少已支付的订单?',
'evidence': '订单表中 status 字段标记订单状态。'
}
res = op(item)
print(res['prompt'])
# Database Schema:
# CREATE TABLE orders (id INT, status TEXT, ...);
# ...
#
# Question: 有多少已支付的订单?
# Evidence: 订单表中 status 字段标记订单状态。
# Generate a SQL query for postgres.
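A prompt with the layout shown in the example output can be assembled as below. This is a sketch only: `build_prompt` and its `dialect` parameter are hypothetical, and the operator's actual default template may differ in wording.

```python
def build_prompt(schema: str, question: str, evidence: str = '',
                 dialect: str = 'postgres') -> str:
    """Assemble schema, question and optional evidence into one prompt."""
    parts = ['Database Schema:', schema.strip(), '', f'Question: {question}']
    if evidence:  # evidence is optional and omitted when empty
        parts.append(f'Evidence: {evidence}')
    parts.append(f'Generate a SQL query for {dialect}.')
    return '\n'.join(parts)


prompt = build_prompt(
    schema='CREATE TABLE orders (id INT, status TEXT);',
    question='How many paid orders are there?',
    evidence='The status column marks the order state.',
)
print(prompt)
```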
Source code in lazyllm/tools/data/operators/text2sql_ops.py
SQLEffortRanker
Bases: Text2SQLOps
Text2SQL data operator: SQLEffortRanker.
Classifies SQL execution difficulty by repeatedly generating SQL from a prompt, comparing each prediction to the gold SQL on the database, and counting how many generations match.
Workflow:
- Uses the input prompt to generate num_generations SQL candidates, parsing SQL text from each.
- Builds comparison tuples (db_id, predicted_sql, gold_sql) and calls database_manager.batch_compare_queries.
- Maps the number of correct generations (cnt_true) to a difficulty label using difficulty_thresholds and difficulty_labels.
Parameters:
-
model–a LazyLLM model object (required).
-
database_manager–provider implementing batch_compare_queries (required).
-
num_generations(int, default:10) –number of SQL generations per item, default 10; may be auto-increased to a multiple of 5.
-
difficulty_thresholds(list[int] | None, default:None) –thresholds list, default [2, 5, 9].
-
difficulty_labels(list[str] | None, default:None) –label list, default ['extra', 'hard', 'medium', 'easy'].
-
system_prompt(str | None, default:None) –optional system prompt.
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import SQLEffortRanker
op = SQLEffortRanker(model=model, database_manager=database_manager, num_generations=15)
item = {
    'db_id': 'db_1',
    'prompt': 'Database Schema: ... Question: How many paid orders are there?',
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"
}
res = op(item)
print(res)
# {
#     'db_id': 'db_1',
#     'prompt': 'Database Schema: ... Question: How many paid orders are there?',
#     'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#     'sql_execution_difficulty': 'medium'
# }
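The last workflow step, mapping cnt_true to a label, might look like the sketch below. The function name difficulty_label is hypothetical, and the boundary rule (a count equal to a threshold falls into the lower bucket) is an assumption; the operator may treat boundaries differently.

```python
from bisect import bisect_left


def difficulty_label(cnt_true, thresholds=(2, 5, 9),
                     labels=('extra', 'hard', 'medium', 'easy')):
    """Return the label of the first bucket whose threshold is >= cnt_true."""
    if len(labels) != len(thresholds) + 1:
        raise ValueError('need exactly one more label than thresholds')
    # bisect_left counts the thresholds strictly below cnt_true.
    return labels[bisect_left(thresholds, cnt_true)]


for count in (0, 3, 9, 10):
    print(count, difficulty_label(count))
```

Fewer correct generations map to harder labels, which matches the default ordering ['extra', 'hard', 'medium', 'easy'].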
Source code in lazyllm/tools/data/operators/text2sql_ops.py
SQLForge
Bases: Text2SQLOps
Text2SQL data operator: SQLForge.
Generates executable SQL queries for one or multiple databases based on their schema and optional sample data, and labels each query with a rough complexity type.
Behavior:
- Generates output_num SQLs per database.
- Uses a default English system prompt (or a custom prompt_template) to control complexity labels (easy/medium/hard, etc.).
- Parses SQL text from model responses, preferring ```sql ... ``` code blocks.
Parameters:
-
model–a LazyLLM model object (required), shared via share().
-
database_manager–database manager (required) implementing: - list_databases() - get_create_statements_and_insert_statements(db_name)
-
output_num(int, default:300) –number of SQLs to generate per database, default 300.
-
prompt_template–optional custom prompt builder with build_prompt(...).
-
system_prompt(str | None, default:None) –optional system prompt, defaults to a built-in English prompt.
-
**kwargs–extra args forwarded to the Text2SQLOps/LazyLLMDataBase base class.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import SQLForge
# Assume database_manager already wraps your SQLite / Postgres (or other) databases
op = SQLForge(model=model, database_manager=database_manager, output_num=10)
# If the input data does not specify db_id, SQL is generated for every database
res = op({})
print(res[0])
# {
# 'db_id': 'database_1',
# 'SQL': 'SELECT ...',
# 'sql_complexity_type': 'easy'
# }
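Extracting SQL from a model reply while preferring a fenced sql block could be sketched like this. extract_sql is a hypothetical helper, and falling back to the whole reply when no fenced block is present is an assumption rather than documented behavior.

```python
import re

FENCE = '`' * 3  # three backticks, built indirectly to keep this listing readable
_SQL_BLOCK = re.compile(
    re.escape(FENCE) + r'sql\s*(.*?)' + re.escape(FENCE),
    re.IGNORECASE | re.DOTALL,
)


def extract_sql(response):
    """Return the body of the first fenced sql block, else the stripped reply."""
    match = _SQL_BLOCK.search(response)
    if match:
        return match.group(1).strip()
    # Assumption: with no fenced block, treat the whole reply as SQL.
    return response.strip()


reply = f'Here is the query:\n{FENCE}sql\nSELECT count(*) FROM orders;\n{FENCE}\nDone.'
print(extract_sql(reply))
```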
Source code in lazyllm/tools/data/operators/text2sql_ops.py
SQLIntentSynthesizer
Bases: Text2SQLOps
Text2SQL data operator: SQLIntentSynthesizer.
Given a SQL query and database schema (with optional column descriptions), generates a natural language question aligned with the SQL semantics, plus optional external knowledge text.
Key features:
- Generates multiple candidate questions (input_query_num) and selects one using embeddings-based diversity.
- Uses special markers in model output: [QUESTION-START]/[QUESTION-END] and [EXTERNAL-KNOWLEDGE-START]/[...-END].
Parameters:
-
model–text generation model (required).
-
embedding_model–optional embedding model, supporting: - generate_embedding_from_input(texts) or callable(texts).
-
database_manager–schema provider (required) implementing: - get_create_statements_and_insert_statements(db_id)
-
input_query_num(int, default:5) –number of question candidates per SQL, default 5.
-
prompt_template–optional custom prompt builder.
-
system_prompt(str | None, default:None) –optional system prompt, default simple English helper.
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import SQLIntentSynthesizer
op = SQLIntentSynthesizer(model=model,
embedding_model=embedding_model,
database_manager=database_manager,
input_query_num=5)
item = {'db_id': 'db_1', 'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"}
res = op(item)
print(res)
# {
#     'db_id': 'db_1',
#     'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#     'question_type': 'default',
#     'question': 'How many paid orders are there?',
#     'evidence': '...optional external knowledge...'
# }
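Pulling the text between a pair of markers out of a model reply can be sketched as below. extract_between is a hypothetical helper; only the [QUESTION-START]/[QUESTION-END] pair named above is exercised, and returning None when a marker is missing is an assumption.

```python
def extract_between(text, start, end):
    """Return the stripped text between the first start/end pair, or None."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = text.find(end, begin)
    if finish == -1:
        return None
    return text[begin:finish].strip()


reply = '[QUESTION-START] How many paid orders are there? [QUESTION-END]'
question = extract_between(reply, '[QUESTION-START]', '[QUESTION-END]')
print(question)
```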
Source code in lazyllm/tools/data/operators/text2sql_ops.py
SQLReasoningTracer
Bases: Text2SQLOps
Text2SQL data operator: SQLReasoningTracer.
For each (question, SQL, schema, evidence) item, generates multiple chain-of-thought (CoT) reasoning traces from question to SQL.
Parameters:
-
model–a LazyLLM model object (required).
-
database_manager–schema provider (required) implementing: - get_create_statements_and_insert_statements(db_id)
-
prompt_template–optional custom prompt builder.
-
output_num(int, default:3) –number of CoT trajectories per item, default 3 (>=1).
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import SQLReasoningTracer
op = SQLReasoningTracer(model=model, database_manager=database_manager, output_num=3)
item = {
    'db_id': 'db_1',
    'question': 'How many paid orders are there?',
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    'evidence': ''
}
res = op(item)
print(len(res['cot_responses']))
print(res['cot_responses'][0][:200])  # print the first 200 characters of the first CoT
# 3
# "Database Schema: ... Question: How many paid orders are there? ... Reasoning step 1: ... Reasoning step 2: ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
Source code in lazyllm/tools/data/operators/text2sql_ops.py
SQLRuntimeSieve
Bases: Text2SQLOps
Text2SQL data operator: SQLRuntimeSieve.
Filters SQL queries by:
- Keeping only queries that look like SELECT/WITH queries.
- Calling database_manager to run EXPLAIN (or similar) and keeping only those that execute successfully.
Parameters:
-
database_manager–database manager (required) implementing: - database_exists(db_id) - batch_explain_queries(list[(db_id, sql)])
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import SQLRuntimeSieve
op = SQLRuntimeSieve(database_manager=database_manager)
item = {'db_id': 'db_1', 'SQL': 'SELECT * FROM users;'}
res = op(item)
print(res)  # returns the original dict if the SQL can be EXPLAINed successfully on db_1; otherwise returns None
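The two filtering steps can be illustrated with the standard-library sqlite3 module standing in for database_manager. This is only a sketch: looks_like_select, explains_ok and sieve are hypothetical names, and the real operator delegates the EXPLAIN step to batch_explain_queries rather than to SQLite directly.

```python
import re
import sqlite3

_READ_ONLY = re.compile(r'^\s*(select|with)\b', re.IGNORECASE)


def looks_like_select(sql):
    """True when the statement starts with SELECT or WITH."""
    return bool(_READ_ONLY.match(sql))


def explains_ok(conn, sql):
    """True when the database can plan the statement without an error."""
    try:
        conn.execute('EXPLAIN ' + sql)
        return True
    except sqlite3.Error:
        return False


def sieve(conn, queries):
    return [q for q in queries if looks_like_select(q) and explains_ok(conn, q)]


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (id INTEGER, name TEXT)')
candidates = ['SELECT * FROM users;', 'SELECT * FROM missing_table;', 'DROP TABLE users;']
kept = sieve(conn, candidates)
print(kept)
conn.close()
```

The keyword check runs first, so non-SELECT statements never reach the database.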
Source code in lazyllm/tools/data/operators/text2sql_ops.py
SQLSyntaxProfiler
Bases: Text2SQLOps
Text2SQL data operator: SQLSyntaxProfiler.
Classifies SQL difficulty based on structural components using EvalHardness/EvalHardnessLite, assigning labels such as easy/medium/hard/extra.
Parameters:
-
difficulty_thresholds(list[int] | None, default:None) –thresholds list, default [2, 4, 6].
-
difficulty_labels(list[str] | None, default:None) –label list, default ['easy', 'medium', 'hard', 'extra'].
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import SQLSyntaxProfiler
op = SQLSyntaxProfiler()
item = {'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"}
res = op(item)
print(res)
# {
#     'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
# 'sql_component_difficulty': 'easy'
# }
Source code in lazyllm/tools/data/operators/text2sql_ops.py
TSQLSemanticAuditor
Bases: Text2SQLOps
Text2SQL data operator: TSQLSemanticAuditor.
Given a natural language question + optional evidence + SQL + database schema, determines whether the SQL correctly answers the question and filters samples accordingly.
Behavior:
- Fetches DDL for the given db_id via database_manager.
- Asks the model to answer Yes/No; only keeps data when the response contains 'yes' (case-insensitive).
Parameters:
-
model–a LazyLLM model object (required).
-
database_manager–schema provider (required) implementing: - get_create_statements_and_insert_statements(db_id)
-
prompt_template–optional custom prompt builder.
-
system_prompt(str | None, default:None) –optional system prompt, defaults to English Yes/No instructions.
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.text2sql_ops import TSQLSemanticAuditor
op = TSQLSemanticAuditor(model=model, database_manager=database_manager)
item = {
    'db_id': 'db_1',
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    'question': 'How many paid orders are there?',
    'evidence': ''
}
res = op(item)
print(res)
# {
#     'db_id': 'db_1',
#     'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#     'question': 'How many paid orders are there?',
#     'evidence': ''
# }
# If the model judges that the SQL does not match the question, None is returned
Source code in lazyllm/tools/data/operators/text2sql_ops.py
PT Operators
lazyllm.tools.data.operators.pt_op
ContextQualFilter
Bases: PT
Use VLM or LLM to evaluate whether context is suitable for generating QA pairs; keep only samples with score=1 (suitable).
Parameters:
-
llm–vision- or text-language model instance
-
context_key(str, default:'context') –key for context, default 'context'
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
prompt(str, default:None) –optional custom prompt
Examples:
import lazyllm
from lazyllm.tools.data import pt
vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt.ContextQualFilter(vlm)
res = op([{'context': 'Good context for QA.', 'image_path': '/path/to/image.jpg'}])
# only samples with score=1 are kept
Source code in lazyllm/tools/data/operators/pt_op.py
GraphRetriever
Bases: PT_MM
Parse Markdown-style image links (![alt](path)) from the context field, extract the paths of files that exist, and write them to img_key.
Does not modify source context; if context.strip() is empty, img_key is [] and the sample is kept.
Parameters:
-
context_key(str, default:'context') –key for text context, default 'context'
-
img_key(str, default:'image_path') –key for image path output, default 'image_path'
-
images_folder(str, default:None) –optional root folder for resolving relative paths
Examples:
from lazyllm.tools.data import pt_mm
op = pt_mm.GraphRetriever(context_key='context', img_key='img', _save_data=False)
data = {'context': 'Some content ![alt](/path/to/image.jpg)'}
res = op([data])
# res[0]['img'] contains resolved absolute path
# empty context: res[0]['img'] == [], record kept, source context unchanged
empty_res = op([{'context': ' '}])
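The link parsing described above might be approximated as follows. extract_image_paths and the regular expression are hypothetical, and the operator may recognise more link variants (titles, angle-bracketed paths) than this sketch does.

```python
import os
import re
import tempfile

# Matches ![alt](path) and captures the path; link titles are not handled.
_IMAGE_LINK = re.compile(r'!\[[^\]]*\]\(([^)\s]+)\)')


def extract_image_paths(context, images_folder=None):
    """Return absolute paths of existing files referenced by image links."""
    if not context.strip():
        return []
    paths = []
    for raw in _IMAGE_LINK.findall(context):
        candidate = raw
        if images_folder and not os.path.isabs(raw):
            candidate = os.path.join(images_folder, raw)
        if os.path.isfile(candidate):
            paths.append(os.path.abspath(candidate))
    return paths


with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, 'a.png'), 'wb').close()
    text = 'Intro ![fig](a.png) and ![gone](missing.png)'
    found = extract_image_paths(text, images_folder=folder)
    found_names = [os.path.basename(p) for p in found]
    empty = extract_image_paths('   ', images_folder=folder)
print(found_names, empty)
```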
Source code in lazyllm/tools/data/operators/pt_op.py
ImageDedup
Bases: PT_MM
Deduplicate images by file hash; keep first occurrence, skip duplicates.
Parameters:
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
hash_method(str, default:'md5') –hash algorithm, default 'md5'
Examples:
from lazyllm.tools.data import pt_mm
op = pt_mm.ImageDedup()
batch = [{'image_path': 'a.jpg', 'id': 1}, {'image_path': 'a.jpg', 'id': 2}, {'image_path': 'b.jpg', 'id': 3}]
res = op(batch)
# len(res) == 2, duplicate removed
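Hash-based deduplication that keeps the first occurrence can be sketched with hashlib. file_digest and dedup_by_hash are hypothetical names; reading in 1 MiB chunks is a choice made here, not something stated by the operator.

```python
import hashlib
import os
import tempfile


def file_digest(path, method='md5'):
    """Hash a file in chunks so large images are not loaded at once."""
    digest = hashlib.new(method)
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_by_hash(items, image_key='image_path', method='md5'):
    """Keep the first item for each distinct file hash."""
    seen, kept = set(), []
    for item in items:
        digest = file_digest(item[image_key], method)
        if digest not in seen:
            seen.add(digest)
            kept.append(item)
    return kept


with tempfile.TemporaryDirectory() as folder:
    paths = {}
    for name, payload in [('a.jpg', b'same'), ('a_copy.jpg', b'same'), ('b.jpg', b'other')]:
        paths[name] = os.path.join(folder, name)
        with open(paths[name], 'wb') as fh:
            fh.write(payload)
    batch = [{'image_path': paths[n], 'id': i} for i, n in enumerate(paths, start=1)]
    kept_ids = [item['id'] for item in dedup_by_hash(batch)]
print(kept_ids)  # [1, 3]
```

Because the comparison is on content, a byte-identical copy under a different file name is also treated as a duplicate.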
Source code in lazyllm/tools/data/operators/pt_op.py
Phi4QAGenerator
Bases: PT
Use LLM to convert context (with optional images) into pretraining-format Phi-4 style multi-turn Q&A pairs.
Parameters:
-
llm–vision- or text-language model instance
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
context_key(str, default:'context') –key for context, default 'context'
-
num_qa(int, default:5) –number of QA pairs to generate, default 5
-
prompt(str, default:None) –optional custom prompt
Examples:
import lazyllm
from lazyllm.tools.data import pt
vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt.Phi4QAGenerator(vlm, num_qa=2)
res = op([{'context': 'Some context.', 'image_path': '/path/to/image.jpg'}])
# res[0]['qa_pairs'] contains pretraining-format Q&A
Source code in lazyllm/tools/data/operators/pt_op.py
TextRelevanceFilter
Bases: PT_MM
Use VLM to judge image-text relevance; filter samples below the threshold.
Parameters:
-
vlm–vision-language model instance
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
text_key(str, default:'text') –key for text, default 'text'
-
threshold(float, default:0.6) –relevance threshold [0,1], default 0.6
-
prompt(str, default:None) –optional custom prompt
Examples:
import lazyllm
from lazyllm.tools.data import pt_mm
vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.TextRelevanceFilter(vlm, threshold=0.5)
res = op([{'image_path': '/path/to/image.jpg', 'text': 'a red square'}])
# samples with relevance >= threshold are kept
Source code in lazyllm/tools/data/operators/pt_op.py
VQAGenerator
Bases: PT_MM
Use VLM to generate Visual Question Answering (VQA) pairs from context and images.
Parameters:
-
vlm–vision-language model instance
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
context_key(str, default:'context') –key for context, default 'context'
-
num_qa(int, default:5) –number of QA pairs to generate, default 5
-
prompt(str, default:None) –optional custom prompt
Examples:
import lazyllm
from lazyllm.tools.data import pt_mm
vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.VQAGenerator(vlm, num_qa=3)
res = op([{'image_path': '/path/to/image.jpg', 'context': 'A simple image.'}])
# res[0]['qa_pairs'] contains [{'query': '...', 'answer': '...'}, ...]
Source code in lazyllm/tools/data/operators/pt_op.py
VQAScorer
Bases: PT_MM
Use VLM to score VQA pair quality (query, answer, image_path), evaluating how good the visual QA is.
Parameters:
-
vlm–vision-language model instance
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
query_key(str, default:'query') –key for question, default 'query'
-
answer_key(str, default:'answer') –key for answer, default 'answer'
-
prompt(str, default:None) –optional custom prompt
Examples:
import lazyllm
from lazyllm.tools.data import pt_mm
vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.VQAScorer(vlm)
res = op([{
'image_path': '/path/to/image.jpg',
'query': 'What color is it?',
'answer': 'Red',
}])
# res[0]['quality_score'] contains score, relevance, correctness, reason
Source code in lazyllm/tools/data/operators/pt_op.py
integrity_check(data, image_key='image_path', input_key=None)
Check image file integrity; filter out corrupted or empty files, keep paths of valid images.
Parameters:
-
data(dict) –single data dict
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
input_key(str, default:None) –optional, overrides image_key
Examples:
from lazyllm.tools.data import pt_mm
op = pt_mm.integrity_check()
res = op([{'image_path': '/path/to/image.jpg'}, {'image_path': '/nonexistent.png'}])
# only valid images retained
Source code in lazyllm/tools/data/operators/pt_op.py
resolution_filter(data, image_key='image_path', min_width=256, min_height=256, max_width=4096, max_height=4096, input_key=None)
Filter images by min/max width and height, keeping only those within the specified resolution range.
Parameters:
-
data(dict) –single data dict
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
min_width(int, default:256) –minimum width, default 256
-
min_height(int, default:256) –minimum height, default 256
-
max_width(int, default:4096) –maximum width, default 4096
-
max_height(int, default:4096) –maximum height, default 4096
-
input_key(str, default:None) –optional, overrides image_key
Examples:
from lazyllm.tools.data import pt_mm
op = pt_mm.resolution_filter(min_width=256, min_height=256, max_width=4096, max_height=4096)
res = op([{'image_path': '/path/to/image.jpg'}])
Source code in lazyllm/tools/data/operators/pt_op.py
resolution_resize(data, image_key='image_path', max_side=1024, input_key=None, inplace=True)
Resize image so the longest side does not exceed max_side. Can overwrite in place or save to a new file.
Parameters:
-
data(dict) –single data dict
-
image_key(str, default:'image_path') –key for image path(s), default 'image_path'
-
max_side(int, default:1024) –max length of longest side, default 1024
-
inplace(bool, default:True) –overwrite original file if True; if False, save with _resized suffix
-
input_key(str, default:None) –optional, overrides image_key
Examples:
from lazyllm.tools.data import pt_mm
op = pt_mm.resolution_resize(max_side=400, inplace=False)
res = op([{'image_path': '/path/to/image.jpg'}])
# resized file saved as image_resized.jpg in same directory
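The size arithmetic behind max_side and the _resized naming can be sketched without an imaging library. fit_within and resized_path are hypothetical helpers, and rounding to the nearest pixel is an assumption.

```python
import os


def fit_within(width, height, max_side=1024):
    """Scale (width, height) so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Assumption: round to the nearest pixel and never go below 1.
    return max(1, round(width * scale)), max(1, round(height * scale))


def resized_path(path):
    """Insert the _resized suffix before the file extension."""
    root, ext = os.path.splitext(path)
    return root + '_resized' + ext


print(fit_within(800, 600, max_side=400))   # (400, 300)
print(resized_path('/path/to/image.jpg'))   # /path/to/image_resized.jpg
```

Images already within the limit are returned unchanged, so the aspect ratio is only touched when downscaling is needed.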
Source code in lazyllm/tools/data/operators/pt_op.py
Refine Operators
lazyllm.tools.data.operators.refine_op
remove_emoji(data, input_key='content')
Remove emoji characters from the specified text field.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
Examples:
from lazyllm.tools.data import refine
func = refine.remove_emoji(input_key='content')
inputs = [{'content': 'Hello 😊 World 🌍!'}]
res = func(inputs)
print(res)
# [{'content': 'Hello World !'}]
Source code in lazyllm/tools/data/operators/refine_op.py
remove_extra_spaces(data, input_key='content')
Normalize whitespace by collapsing multiple spaces, newlines and tabs into single spaces.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
Examples:
from lazyllm.tools.data import refine
func = refine.remove_extra_spaces(input_key='content')
inputs = [{'content': 'hello world\n\n foo\tbar'}]
res = func(inputs)
print(res)
# [{'content': 'hello world foo bar'}]
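The normalization can be expressed as a single regular-expression substitution. collapse_whitespace is a hypothetical name, and whether the operator also trims leading and trailing whitespace is not stated, so this sketch leaves the ends untouched.

```python
import re


def collapse_whitespace(text):
    """Replace each run of spaces, tabs and newlines with one space."""
    return re.sub(r'\s+', ' ', text)


print(collapse_whitespace('hello world\n\n foo\tbar'))  # hello world foo bar
```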
Source code in lazyllm/tools/data/operators/refine_op.py
remove_html_entity(data, input_key='content')
Remove HTML entities (e.g. &nbsp;, &lt;, &amp;) from the specified text field.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
Examples:
from lazyllm.tools.data import refine
func = refine.remove_html_entity(input_key='content')
inputs = [{'content': 'Hello&nbsp;World &amp; &lt;tag&gt;'}]
res = func(inputs)
print(res)
# [{'content': 'HelloWorld tag'}]
Source code in lazyllm/tools/data/operators/refine_op.py
remove_html_url(data, input_key='content')
Remove HTTP/HTTPS URLs and HTML tags from the specified text field.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
Examples:
from lazyllm.tools.data import refine
func = refine.remove_html_url(input_key='content')
inputs = [{'content': 'Check https://example.com and <b>bold</b>'}]
res = func(inputs)
print(res)
# [{'content': 'Check and bold'}]
Source code in lazyllm/tools/data/operators/refine_op.py
Filter Operators
lazyllm.tools.data.operators.filter_op
CapitalWordFilter
Bases: Filter
Filter out text whose ratio of all-caps words is too high.
Parameters:
-
input_key(str, default:'content') –key of the text field, default 'content'
-
max_ratio(float, default:0.5) –max ratio of all-caps words, default 0.5
-
use_tokenizer(bool, default:False) –use tokenizer, default False
-
_concurrency_mode(str, default:'thread') –optional concurrency mode
Examples:
from lazyllm.tools.data import filter
func = filter.CapitalWordFilter(input_key='content', max_ratio=0.5)
inputs = [{'content': 'Normal text with Some Capitals'}, {'content': 'MOSTLY UPPERCASE'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text with Some Capitals'}]
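The ratio test can be sketched with plain whitespace tokenization (the use_tokenizer=False case). all_caps_ratio and keep_text are hypothetical names, and treating a ratio equal to max_ratio as acceptable is an assumption.

```python
def all_caps_ratio(text):
    """Share of whitespace-separated words written entirely in capitals."""
    words = text.split()
    if not words:
        return 0.0
    return sum(1 for word in words if word.isupper()) / len(words)


def keep_text(text, max_ratio=0.5):
    # Assumption: a ratio exactly equal to max_ratio is still kept.
    return all_caps_ratio(text) <= max_ratio


samples = ['Normal text with Some Capitals', 'MOSTLY UPPERCASE']
kept = [s for s in samples if keep_text(s)]
print(kept)
```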
Source code in lazyllm/tools/data/operators/filter_op.py
MinHashDeduplicator
Bases: Filter
Remove near-duplicate texts using MinHash LSH. For batch input, keeps first occurrence of each unique text.
Parameters:
-
input_key(str, default:'content') –key of the text field, default 'content'
-
threshold(float, default:0.85) –similarity threshold, default 0.85
-
num_perm(int, default:128) –number of MinHash permutations, default 128
-
use_n_gram(bool, default:True) –use n-gram, default True
-
ngram(int, default:5) –n-gram size, default 5
Examples:
from lazyllm.tools.data import filter
func = filter.MinHashDeduplicator(input_key='content', threshold=0.85)
inputs = [{'uid': '0', 'content': '这是第一段不同的内容。'}, {'uid': '1', 'content': '这是第一段不同的内容。'}]
res = func(inputs)
print(res)
# [{'uid': '0', 'content': '这是第一段不同的内容。'}]
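A pure-Python approximation of the idea is shown below. It compares MinHash signatures pairwise instead of using an LSH index, so it is a simplification of what the operator describes. All function names are hypothetical, and seeded BLAKE2b hashes stand in for the permutations.

```python
import hashlib


def char_ngrams(text, n=5):
    """Character n-grams; texts no longer than n yield themselves as one shingle."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(shingles, num_perm=128):
    """Minimum 64-bit hash per seed; each seed stands in for one permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, 'little')
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode('utf-8'), digest_size=8, salt=salt).digest(),
                'little')
            for s in shingles))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(items, input_key='content', threshold=0.85, num_perm=128, ngram=5):
    """Keep an item unless it is near-identical to one already kept."""
    kept, kept_sigs = [], []
    for item in items:
        sig = minhash_signature(char_ngrams(item[input_key], ngram), num_perm)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(item)
        kept_sigs.append(sig)
    return kept


docs = [
    {'uid': '0', 'content': 'The quick brown fox jumps over the lazy dog.'},
    {'uid': '1', 'content': 'The quick brown fox jumps over the lazy dog.'},
    {'uid': '2', 'content': 'Completely unrelated sentence about databases.'},
]
kept_uids = [d['uid'] for d in dedup(docs)]
print(kept_uids)
```

The pairwise loop is quadratic in the number of kept items; an LSH index avoids that by only comparing items that share a signature band.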
Source code in lazyllm/tools/data/operators/filter_op.py
StopWordFilter
Bases: Filter
Filter out text whose stopword ratio is too high (e.g. low-value content made up mostly of stopwords).
Parameters:
-
input_key(str, default:'content') –key of the text field, default 'content'
-
max_ratio(float, default:0.5) –max stopword ratio, filter if exceeded, default 0.5
-
use_tokenizer(bool, default:True) –use tokenizer, default True
-
language(str, default:'zh') –language, 'zh' or 'en', default 'zh'
-
_concurrency_mode(str, default:'thread') –optional concurrency mode
Examples:
from lazyllm.tools.data import filter
func = filter.StopWordFilter(input_key='content', max_ratio=0.5, language='zh')
inputs = [{'content': '这是一段包含实际内容的正常文本。'}, {'content': '的了吗呢吧啊'}]
res = func(inputs)
print(res)
# [{'content': '这是一段包含实际内容的正常文本。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
SymbolRatioFilter
Bases: Filter
Filter out text whose ratio of the specified symbols (e.g. #, ..., …) to words is too high.
Parameters:
-
input_key(str, default:'content') –key of the text field, default 'content'
-
max_ratio(float, default:0.3) –max ratio of symbols to words, default 0.3
-
symbols(list | None, default:None) –symbols to count, default ['#', '...', '…']
-
_concurrency_mode(str, default:'process') –optional concurrency mode
Examples:
from lazyllm.tools.data import filter
func = filter.SymbolRatioFilter(input_key='content', max_ratio=0.3)
inputs = [{'content': 'Normal text without symbols'}, {'content': '### ... … ###'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text without symbols'}]
Source code in lazyllm/tools/data/operators/filter_op.py
TargetLanguageFilter
Bases: Filter
Filter text by language using FastText. Keeps only texts in the specified language(s).
Parameters:
-
input_key(str, default:'content') –key of the text field, default 'content'
-
target_language(str | list, default:'zho_Hans') –target language code(s), e.g. 'zho_Hans', 'eng_Latn'
-
threshold(float, default:0.6) –confidence threshold, default 0.6
-
model_path(str | None, default:None) –path to FastText model
-
_concurrency_mode(str, default:'thread') –optional concurrency mode
Examples:
from lazyllm.tools.data import filter
func = filter.TargetLanguageFilter(input_key='content', target_language='zho_Hans', threshold=0.3)
inputs = [{'content': '这是一段中文文本。'}, {'content': 'This is English.'}]
res = func(inputs)
print(res)
# [{'content': '这是一段中文文本。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
WordBlocklistFilter
Bases: Filter
Filter out text containing more than threshold occurrences of blocked words, matched with an Aho-Corasick automaton.
Parameters:
-
input_key(str, default:'content') –key of the text field, default 'content'
-
blocklist(list | None, default:None) –list of blocked words
-
blocklist_path(str | None, default:None) –path to blocklist file
-
language(str, default:'zh') –language, 'zh' or 'en', default 'zh'
-
threshold(int, default:1) –max allowed occurrences of blocked words, default 1
-
_concurrency_mode(str, default:'thread') –optional concurrency mode
Examples:
from lazyllm.tools.data import filter
func = filter.WordBlocklistFilter(input_key='content', blocklist=['敏感', '违禁'], threshold=0)
inputs = [{'content': '这是正常的文本内容。'}, {'content': '这里包含敏感词。'}]
res = func(inputs)
print(res)
# [{'content': '这是正常的文本内容。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
bullet_point_filter(data, input_key='content', max_ratio=0.9)
Filter out text with too many bullet-point lines (e.g. tables of contents, pure lists).
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
max_ratio(float, default:0.9) –max ratio of bullet lines, default 0.9
Examples:
from lazyllm.tools.data import filter
func = filter.bullet_point_filter(input_key='content', max_ratio=0.5)
inputs = [{'content': 'Normal paragraph text'}, {'content': '- Item 1\n- Item 2\n- Item 3'}]
res = func(inputs)
print(res)
# [{'content': 'Normal paragraph text'}]
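The bullet-line ratio can be sketched as below. The set of bullet markers and the helper names are assumptions; the operator may recognise a different set of markers.

```python
BULLET_MARKERS = ('-', '*', '•')  # assumed markers, not taken from the library


def bullet_line_ratio(text):
    """Share of non-empty lines that start with a bullet marker."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.startswith(BULLET_MARKERS) for line in lines) / len(lines)


def keep_text(text, max_ratio=0.9):
    return bullet_line_ratio(text) <= max_ratio


samples = ['Normal paragraph text', '- Item 1\n- Item 2\n- Item 3']
kept = [s for s in samples if keep_text(s, max_ratio=0.5)]
print(kept)
```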
Source code in lazyllm/tools/data/operators/filter_op.py
char_count_filter(data, input_key='content', min_chars=100, max_chars=100000)
Filter by character count (excluding whitespace). Keeps text in [min_chars, max_chars].
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
min_chars(int, default:100) –min chars, default 100
-
max_chars(int, default:100000) –max chars, default 100000
Examples:
from lazyllm.tools.data import filter
func = filter.char_count_filter(input_key='content', min_chars=10, max_chars=100)
inputs = [{'content': '短'}, {'content': '这是一段中等长度的文本内容。'}]
res = func(inputs)
print(res)
# [{'content': '这是一段中等长度的文本内容。'}]
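Counting characters while excluding whitespace, then checking the inclusive range, can be sketched as follows. non_whitespace_length and within_char_range are hypothetical names.

```python
def non_whitespace_length(text):
    """Number of characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def within_char_range(text, min_chars=100, max_chars=100000):
    """True when the count lies in the inclusive range [min_chars, max_chars]."""
    return min_chars <= non_whitespace_length(text) <= max_chars


samples = ['短', '这是一段中等长度的文本内容。']
kept = [s for s in samples if within_char_range(s, min_chars=10, max_chars=100)]
print(kept)
```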
Source code in lazyllm/tools/data/operators/filter_op.py
colon_end_filter(data, input_key='content')
Filter out text that ends with a colon.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
Examples:
from lazyllm.tools.data import filter
func = filter.colon_end_filter(input_key='content')
inputs = [{'content': '这是正常结尾。'}, {'content': '这是冒号结尾:'}]
res = func(inputs)
print(res)
# [{'content': '这是正常结尾。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
curly_bracket_filter(data, input_key='content', max_ratio=0.08)
Filter out text whose ratio of curly brackets {} is too high.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
max_ratio(float, default:0.08) –max ratio of curly brackets, default 0.08
Examples:
from lazyllm.tools.data import filter
func = filter.curly_bracket_filter(input_key='content', max_ratio=0.08)
inputs = [{'content': 'Normal text'}, {'content': '{{{{{' * 10}]
res = func(inputs)
print(res)
# [{'content': 'Normal text'}]
Source code in lazyllm/tools/data/operators/filter_op.py
ellipsis_end_filter(data, input_key='content', max_ratio=0.3)
Filter out text with too many lines ending in an ellipsis (..., …, ……).
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
max_ratio(float, default:0.3) –max ratio of lines ending with ellipsis, default 0.3
Examples:
from lazyllm.tools.data import filter
func = filter.ellipsis_end_filter(input_key='content', max_ratio=0.3)
inputs = [{'content': '第一行。\n第二行。\n第三行。'}, {'content': '第一行...\n第二行...'}]
res = func(inputs)
print(res)
# [{'content': '第一行。\n第二行。\n第三行。'}]
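The per-line ellipsis check can be sketched as below. The helper names are hypothetical, and ignoring blank lines when computing the ratio is an assumption.

```python
ELLIPSES = ('...', '…')  # '……' also ends with '…', so it is covered


def ellipsis_line_ratio(text):
    """Share of non-empty lines that end with an ellipsis."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.endswith(ELLIPSES) for line in lines) / len(lines)


def keep_text(text, max_ratio=0.3):
    return ellipsis_line_ratio(text) <= max_ratio


samples = ['第一行。\n第二行。\n第三行。', '第一行...\n第二行...']
kept = [s for s in samples if keep_text(s)]
print(kept)
```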
Source code in lazyllm/tools/data/operators/filter_op.py
idcard_filter(data, input_key='content', threshold=3)
Filter out text containing too many ID-card or identity-document related terms.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
threshold(int, default:3) –max matches of related terms, filter if exceeded, default 3
Examples:
from lazyllm.tools.data import filter
func = filter.idcard_filter(input_key='content', threshold=1)
inputs = [{'content': '这是正常文本'}, {'content': '请提供身份证号码和ID number'}]
res = func(inputs)
print(res)
# [{'content': '这是正常文本'}]
Source code in lazyllm/tools/data/operators/filter_op.py
javascript_filter(data, input_key='content', min_non_script_lines=3)
Filter text containing many JavaScript patterns (code, script fragments). Short text (<=3 lines) is passed through to avoid false positives on normal short sentences.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
min_non_script_lines(int, default:3) –min non-script lines, default 3
Examples:
from lazyllm.tools.data import filter
func = filter.javascript_filter(input_key='content', min_non_script_lines=2)
inputs = [{'content': 'Short normal text'}, {'content': 'function() { return 1; }\nconst x = 1;\nvar y = 2;\nlet z = 3;'}]
res = func(inputs)
print(res)
# [{'content': 'Short normal text'}]
Source code in lazyllm/tools/data/operators/filter_op.py
lorem_ipsum_filter(data, input_key='content', max_ratio=3e-08)
Filter Lorem ipsum, placeholder text, etc.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
max_ratio(float, default:3e-08) –max ratio of placeholder patterns, default 3e-8
Examples:
from lazyllm.tools.data import filter
func = filter.lorem_ipsum_filter(input_key='content')
inputs = [{'content': 'This is real content'}, {'content': 'Lorem ipsum dolor sit amet'}]
res = func(inputs)
print(res)
# [{'content': 'This is real content'}]
Source code in lazyllm/tools/data/operators/filter_op.py
no_punc_filter(data, input_key='content', max_length_between_punct=112, language='zh')
Filter text with too long segments between punctuation marks.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
max_length_between_punct(int, default:112) –max length between punctuation, default 112
-
language(str, default:'zh') –language, 'zh' or 'en', default 'zh'
Examples:
from lazyllm.tools.data import filter
func = filter.no_punc_filter(input_key='content', max_length_between_punct=20, language='zh')
inputs = [{'content': '这是。正常。文本。'}, {'content': '这是一段没有标点符号的超长文本' * 10}]
res = func(inputs)
print(res)
# [{'content': '这是。正常。文本。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
null_content_filter(data, input_key='content')
Filter null or whitespace-only content.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
Examples:
from lazyllm.tools.data import filter
func = filter.null_content_filter(input_key='content')
inputs = [{'content': 'Valid content'}, {'content': ''}, {'content': ' '}]
res = func(inputs)
print(res)
# [{'content': 'Valid content'}]
Source code in lazyllm/tools/data/operators/filter_op.py
sentence_count_filter(data, input_key='content', min_sentences=3, max_sentences=1000, language='zh')
Filter by sentence count. Keeps text with sentences in [min_sentences, max_sentences].
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
min_sentences(int, default:3) –min sentence count, default 3
-
max_sentences(int, default:1000) –max sentence count, default 1000
-
language(str, default:'zh') –language, 'zh' or 'en', default 'zh'
Examples:
from lazyllm.tools.data import filter
func = filter.sentence_count_filter(input_key='content', min_sentences=2, max_sentences=10, language='zh')
inputs = [{'content': '单句。'}, {'content': '第一句。第二句。'}]
res = func(inputs)
print(res)
# [{'content': '第一句。第二句。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
special_char_filter(data, input_key='content')
Filter text containing special invisible characters (zero-width, replacement char, etc.).
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
Examples:
from lazyllm.tools.data import filter
func = filter.special_char_filter(input_key='content')
inputs = [{'content': 'Normal text 正常文本'}, {'content': 'Text with zero\u200bwidth'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text 正常文本'}]
Source code in lazyllm/tools/data/operators/filter_op.py
unique_word_filter(data, input_key='content', min_ratio=0.1, use_tokenizer=True, language='zh')
Filter text with too low unique word ratio (excessive repetition).
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
min_ratio(float, default:0.1) –min unique word ratio, default 0.1
-
use_tokenizer(bool, default:True) –use tokenizer, default True
-
language(str, default:'zh') –language, 'zh' or 'en', default 'zh'
Examples:
from lazyllm.tools.data import filter
func = filter.unique_word_filter(input_key='content', min_ratio=0.4, language='zh')
inputs = [{'content': '这是一段包含多个不同词汇的文本。'}, {'content': '重复重复重复'}]
res = func(inputs)
print(res)
# [{'content': '这是一段包含多个不同词汇的文本。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
watermark_filter(data, input_key='content', watermarks=None)
Filter text containing copyright/watermark related terms.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
watermarks(list | None, default:None) –custom watermark terms, default uses built-in list
Examples:
from lazyllm.tools.data import filter
func = filter.watermark_filter(input_key='content')
inputs = [{'content': 'Normal content'}, {'content': 'This document contains Copyright notice'}]
res = func(inputs)
print(res)
# [{'content': 'Normal content'}]
Source code in lazyllm/tools/data/operators/filter_op.py
word_count_filter(data, input_key='content', min_words=10, max_words=10000, language='zh')
Filter by word/char count: Chinese by char count, English by word count. Keeps text in [min_words, max_words).
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
min_words(int, default:10) –min count, default 10
-
max_words(int, default:10000) –max count, default 10000
-
language(str, default:'zh') –language, 'zh' or 'en', default 'zh'
Examples:
from lazyllm.tools.data import filter
func = filter.word_count_filter(input_key='content', min_words=5, max_words=20, language='zh')
inputs = [{'content': '短文本'}, {'content': '这是一段适中长度的中文文本内容。'}]
res = func(inputs)
print(res)
# [{'content': '这是一段适中长度的中文文本内容。'}]
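The half-open interval [min_words, max_words) and the per-language counting can be illustrated as follows. This is a sketch under assumptions: `count_units` and `keep_by_count` are hypothetical names, and counting every non-whitespace character (punctuation included) for Chinese is a guess at the counting rule.

```python
def count_units(text: str, language: str = 'zh') -> int:
    """Count characters for Chinese text and whitespace-separated words otherwise."""
    if language == 'zh':
        # Assumption: every non-whitespace character counts, punctuation included.
        return len(''.join(text.split()))
    return len(text.split())


def keep_by_count(item: dict, input_key: str = 'content', min_words: int = 10,
                  max_words: int = 10000, language: str = 'zh') -> bool:
    n = count_units(item.get(input_key, ''), language)
    # Lower bound inclusive, upper bound exclusive.
    return min_words <= n < max_words


inputs = [{'content': '短文本'}, {'content': '这是一段适中长度的中文文本内容。'}]
kept = [i for i in inputs if keep_by_count(i, min_words=5, max_words=20)]
print(kept)
```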
Source code in lazyllm/tools/data/operators/filter_op.py
word_length_filter(data, input_key='content', min_length=3, max_length=20)
Filter by average word length. Keeps text with mean word length in [min_length, max_length).
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'content') –key of the text field, default 'content'
-
min_length(float, default:3) –min avg word length, default 3
-
max_length(float, default:20) –max avg word length, default 20
Examples:
from lazyllm.tools.data import filter
func = filter.word_length_filter(input_key='content', min_length=3, max_length=10)
inputs = [{'content': 'I am ok'}, {'content': 'This is a normal sentence'}]
res = func(inputs)
print(res)
# [{'content': 'This is a normal sentence'}]
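A worked version of the average-word-length check for the example above, assuming words are split on whitespace (the helper name `mean_word_length` is hypothetical): 'I am ok' averages (1 + 2 + 2) / 3 ≈ 1.67 and falls below min_length=3, while 'This is a normal sentence' averages 21 / 5 = 4.2 and is kept.

```python
def mean_word_length(text: str) -> float:
    """Average length of whitespace-separated words; 0.0 for empty text."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def keep_by_word_length(item: dict, input_key: str = 'content',
                        min_length: float = 3, max_length: float = 20) -> bool:
    # Lower bound inclusive, upper bound exclusive.
    return min_length <= mean_word_length(item.get(input_key, '')) < max_length


inputs = [{'content': 'I am ok'}, {'content': 'This is a normal sentence'}]
kept = [i for i in inputs if keep_by_word_length(i, min_length=3, max_length=10)]
print(kept)
```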
Source code in lazyllm/tools/data/operators/filter_op.py
Token Chunker
lazyllm.tools.data.operators.token_chunker
TokenChunker
Bases: Chunker
Split long text into chunks by token count. Splits by paragraph first, then by sentence. Ensures each chunk does not exceed max_tokens; chunks below min_tokens may be discarded.
Parameters:
-
input_key(str, default:'content') –key of the text field, default 'content'
-
model_path(str | None, default:None) –path to tokenizer model, default Qwen2.5-0.5B-Instruct
-
max_tokens(int, default:1024) –max tokens per chunk, default 1024
-
min_tokens(int, default:200) –min tokens per chunk, smaller chunks may be discarded, default 200
-
_concurrency_mode(str, default:'process') –optional concurrency mode
-
_max_workers(int | None) –optional max concurrency
Examples:
from lazyllm.tools.data import chunker
func = chunker.TokenChunker(input_key='content', max_tokens=50, min_tokens=10)
inputs = [{'content': '人工智能是计算机科学的一个分支。' * 20, 'meta_data': {'source': 'doc_1'}}]
res = func(inputs)
print(res)
# [{'uid': '...', 'content': '...', 'meta_data': {'source': 'doc_1', 'index': 0, 'total': N, 'length': ...}}, ...]
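The paragraph-then-sentence strategy can be sketched without a tokenizer. Everything here is an assumption made for illustration: `count_tokens` treats one character as one token instead of loading a tokenizer model, sentences are split after common Chinese and English end punctuation, and a single sentence longer than max_tokens is not split further (the real operator is described as never exceeding max_tokens).

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: one character counts as one token.
    return len(text)


def split_sentences(paragraph: str) -> list:
    # Split after sentence-ending punctuation, keeping the punctuation.
    parts = re.split(r'(?<=[。！？.!?])', paragraph)
    return [part for part in parts if part.strip()]


def chunk_text(text: str, max_tokens: int = 1024, min_tokens: int = 200) -> list:
    chunks = []
    for paragraph in text.split('\n'):
        current = ''
        for sentence in split_sentences(paragraph):
            # Close the current chunk before it would exceed the budget.
            if current and count_tokens(current + sentence) > max_tokens:
                chunks.append(current)
                current = ''
            current += sentence
        if current:
            chunks.append(current)
    # Chunks below min_tokens are discarded.
    return [chunk for chunk in chunks if count_tokens(chunk) >= min_tokens]


text = '人工智能是计算机科学的一个分支。' * 20
chunks = chunk_text(text, max_tokens=50, min_tokens=10)
print(len(chunks), [count_tokens(c) for c in chunks])
```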
Source code in lazyllm/tools/data/operators/token_chunker.py
Code Generation Operators
lazyllm.tools.data.operators.codegen_ops
CodeInstructionGenerator
Bases: CodeGenOps
Code-gen pipeline operator: CodeInstructionGenerator.
Extracts the user instruction from raw messages and rewrites it into a standardized English instruction plus a Python function skeleton code block.
Typical output structure (default input_key='messages', output_key='generated_instruction'):
- messages: original multi-turn messages (unchanged)
- generated_instruction (str): standardized English instruction + Python code block
Parameters:
-
model–a LazyLLM model object (required), shared via share().
-
prompt_template(str | None, default:None) –optional custom system prompt (overrides default).
-
input_key(str, default:'messages') –input conversation field name, default 'messages'.
-
output_key(str, default:'generated_instruction') –output standardized instruction field name, default 'generated_instruction'.
-
**kwargs–extra args forwarded to the base operator (e.g. _max_workers, _save_data).
Examples:
from lazyllm.tools.data.operators.codegen_ops import CodeInstructionGenerator
op = CodeInstructionGenerator(model=model,
input_key='messages',
output_key='generated_instruction')
item = {
'messages': [
{'role': 'user', 'content': '写一个 Python 函数,打印 hello'}
]
}
res = op(item)
print(res)
# Output Example:
# {
# 'messages': [...],
# 'generated_instruction': "Write a Python function that prints 'hello'.\n"
# "```python\n"
# "def solution():\n"
# " print('hello')\n"
# "```"
# }
Source code in lazyllm/tools/data/operators/codegen_ops.py
LogicIntegrityAuditor
Bases: CodeGenOps
Code-gen pipeline operator: LogicIntegrityAuditor.
Evaluates a single (generated_instruction, generated_code) sample, producing a quality score (0–10) and textual feedback, parsed from a JSON-formatted model response.
Typical output structure (default input_instruction_key='instruction', input_code_key='new_code'):
- instruction: standardized instruction
- new_code: generated code
- quality_score: numeric quality score (int/float depending on JsonFormatter parsing)
- feedback: textual review feedback
Parameters:
-
model–a LazyLLM model object (required), wrapped with JsonFormatter.
-
prompt_template(str | None, default:None) –optional custom system prompt.
-
input_instruction_key(str, default:'instruction') –input instruction field name, default 'instruction'.
-
input_code_key(str, default:'new_code') –input code field name, default 'new_code'.
-
output_score_key(str, default:'quality_score') –output score field name, default 'quality_score'.
-
output_feedback_key(str, default:'feedback') –output feedback field name, default 'feedback'.
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.codegen_ops import LogicIntegrityAuditor
op = LogicIntegrityAuditor(model=model)
item = {
'instruction': "Write a Python function that prints 'hello'.",
'new_code': "def solution():\n    print('hello')"
}
res = op(item)
print(res)
# {
# 'instruction': "Write a Python function that prints 'hello'.",
# 'new_code': "def solution():\n    print('hello')",
# 'quality_score': 8,
# 'feedback': 'Good code. The logic is clear and follows PEP8.'
# }
Source code in lazyllm/tools/data/operators/codegen_ops.py
ScriptSynthesizer
Bases: CodeGenOps
Code-gen pipeline operator: ScriptSynthesizer.
Given a natural language code instruction (often from the previous generated_instruction or a cleaned instruction field), generates the corresponding Python source code, stripping Markdown code fences when present.
Typical output structure (default input_key='instruction', output_key='new_code'):
- instruction: natural language code instruction
- new_code (str): generated Python code string
Parameters:
-
model–a LazyLLM model object (required).
-
prompt_template(str | None, default:None) –optional custom system prompt.
-
input_key(str, default:'instruction') –input instruction field name, default 'instruction'.
-
output_key(str, default:'new_code') –output code field name, default 'new_code'.
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.codegen_ops import ScriptSynthesizer
op = ScriptSynthesizer(model=model,
input_key='instruction',
output_key='new_code')
item = {
'instruction': 'Write a Python function that prints "hello".'
}
res = op(item)
print(res)
# {
# 'instruction': 'Write a Python function that prints "hello".',
# 'new_code': "def solution():\n    print('hello')"
# }
Source code in lazyllm/tools/data/operators/codegen_ops.py
ThresholdSieve
Bases: CodeGenOps
Code-gen pipeline operator: ThresholdSieve.
Filters samples based on code quality scores produced by LogicIntegrityAuditor:
- If quality_score/feedback are missing, it first calls the internal scorer.
- If the score is within [min_score, max_score], the sample is kept and labeled.
- Otherwise, it returns an empty list [], effectively dropping the sample from the pipeline.
Typical output structure (default output_key='quality_score_filter_label'):
- instruction: ...
- new_code: ...
- quality_score: 8
- feedback: 'Good code. ...'
- quality_score_filter_label: 1 (1 for passed, 0 otherwise; non-passed samples are dropped)
Parameters:
-
model–a LazyLLM model object (required), used by the internal scorer.
-
min_score(int, default:7) –minimum score (inclusive) to pass the filter, default 7.
-
max_score(int, default:10) –maximum score (inclusive) to pass the filter, default 10.
-
input_instruction_key(str, default:'instruction') –input instruction field, default 'instruction'.
-
input_code_key(str, default:'new_code') –input code field, default 'new_code'.
-
output_score_key(str, default:'quality_score') –score field name, default 'quality_score'.
-
output_feedback_key(str, default:'feedback') –feedback field name, default 'feedback'.
-
output_key(str, default:'quality_score_filter_label') –filter label field name, default 'quality_score_filter_label'.
-
**kwargs–extra args forwarded to the base operator.
Examples:
from lazyllm.tools.data.operators.codegen_ops import ThresholdSieve
op = ThresholdSieve(model=model, min_score=7, max_score=10)
item = {
'instruction': "Write a Python function that prints 'hello'.",
'new_code': "def solution():\n    print('hello')"
}
res = op(item)
print(res)
# {
# 'instruction': '...',
# 'new_code': '...',
# 'quality_score': 8,
# 'feedback': 'Good code. The logic is clear and follows PEP8.',
# 'quality_score_filter_label': 1
# }
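The keep-or-drop rule can be summarized in a few lines of plain Python. This is a sketch of the described behavior, not the operator's source: `threshold_sieve` and `fake_scorer` are hypothetical names, and the scorer is a stub standing in for the LLM-backed LogicIntegrityAuditor.

```python
def threshold_sieve(item: dict, scorer, min_score: int = 7, max_score: int = 10,
                    score_key: str = 'quality_score', feedback_key: str = 'feedback',
                    output_key: str = 'quality_score_filter_label'):
    # Score the sample first if it has not been audited yet.
    if score_key not in item or feedback_key not in item:
        item = {**item, **scorer(item)}
    # Both bounds are inclusive.
    if min_score <= item[score_key] <= max_score:
        return {**item, output_key: 1}
    # An empty list drops the sample from the pipeline.
    return []


def fake_scorer(item: dict) -> dict:
    # Stub in place of a model call.
    return {'quality_score': 8, 'feedback': 'Looks fine.'}


sample = {'instruction': "Write a Python function that prints 'hello'.",
          'new_code': "def solution():\n    print('hello')"}
kept = threshold_sieve(sample, fake_scorer)
dropped = threshold_sieve({**sample, 'quality_score': 3, 'feedback': 'Buggy.'}, fake_scorer)
print(kept)
print(dropped)
```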
Source code in lazyllm/tools/data/operators/codegen_ops.py
Text to QA pairs Operators
lazyllm.tools.data.operators.text2qa_ops
ChunkToQA
Bases: Text2qa
Use an LLM to generate one QA pair (question + answer) per text chunk. Output format is constrained via JsonFormatter; user_prompt can be customized or left as default.
Parameters:
-
input_key(str, default:'chunk') –key of the input chunk, default 'chunk'
-
query_key(str, default:'query') –key to write the generated question, default 'query'
-
answer_key(str, default:'answer') –key to write the generated answer, default 'answer'
-
model–optional TrainableModule or compatible; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt prefix; None uses default
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import Text2qa
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = Text2qa.ChunkToQA(input_key='chunk', query_key='query', answer_key='answer', model=llm)
data = [{'chunk': '今天是晴天!'}]
res = op(data)
print(res)
# [{'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!'}]
Source code in lazyllm/tools/data/operators/text2qa_ops.py
QAScorer
Bases: Text2qa
Use an LLM to score QA pairs: whether the answer is strictly grounded in the source chunk. Outputs 1 (grounded) or 0 (otherwise). Output format constrained via JsonFormatter.
Parameters:
-
input_key(str, default:'chunk') –key of the source chunk, default 'chunk'
-
output_key(str, default:'score') –key to write the score, default 'score'
-
query_key(str, default:'query') –key of the question, default 'query'
-
answer_key(str, default:'answer') –key of the answer, default 'answer'
-
model–optional TrainableModule or compatible; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt; None uses default rules
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import Text2qa
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = Text2qa.QAScorer(input_key='chunk', output_key='score', query_key='query', answer_key='answer', model=llm)
data = [
{'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!'},
{'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3'}
]
res = op(data)
print(res)
# [{'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!', 'score': 1}, {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3', 'score': 0}]
Source code in lazyllm/tools/data/operators/text2qa_ops.py
TextToChunks
Bases: Text2qa
Split input text into chunks by lines, with size controlled by token or character count. One input item may expand into multiple output items. Supports optional tokenizer or character-based length.
Parameters:
-
input_key(str, default:'content') –key of the input text field, default 'content'
-
output_key(str, default:'chunk') –key to write each chunk into, default 'chunk'
-
chunk_size(int, default:10) –max length per chunk (tokens or chars), default 10
-
tokenize(bool, default:True) –whether to count by tokens; if True and tokenizer not provided, uses default Qwen tokenizer
-
tokenizer–optional tokenizer for counting; if None and tokenize=True, loads default
-
**kwargs–other base-class args (e.g. _concurrency_mode, _max_workers)
Examples:
from lazyllm.tools.data import Text2qa
op = Text2qa.TextToChunks(input_key='content', output_key='chunk', chunk_size=10, tokenize=False)
data = [{'content': 'line1\nline2\nline3\nline4'}]
res = op(data)
print(res)
# [{'content': 'line1\nline2\nline3\nline4', 'chunk': 'line1\nline2'}, {'content': 'line1\nline2\nline3\nline4', 'chunk': 'line3\nline4'}]
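One way to reproduce the example output is a greedy line packer, sketched here for the character-count case (tokenize=False). The function names are hypothetical, and it is an assumption that newline separators do not count toward chunk_size; that choice is what makes two five-character lines fit into a chunk of size 10.

```python
def lines_to_chunks(text: str, chunk_size: int = 10) -> list:
    """Greedily pack whole lines into chunks of at most `chunk_size` characters."""
    chunks, current, size = [], [], 0
    for line in text.split('\n'):
        # Start a new chunk when adding this line would exceed the limit.
        if current and size + len(line) > chunk_size:
            chunks.append('\n'.join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append('\n'.join(current))
    return chunks


def expand_item(item: dict, input_key: str = 'content', output_key: str = 'chunk',
                chunk_size: int = 10) -> list:
    # One input item expands into one output item per chunk.
    return [{**item, output_key: chunk}
            for chunk in lines_to_chunks(item[input_key], chunk_size)]


rows = expand_item({'content': 'line1\nline2\nline3\nline4'})
print(rows)
```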
Source code in lazyllm/tools/data/operators/text2qa_ops.py
empty_or_noise_filter(data, input_key='chunk')
Filter out empty or noise-only items. If the specified field is empty or contains no word/CJK characters, the item is dropped (returns empty list); otherwise the item is kept. Registered as a single-item forward.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'chunk') –key to check, default 'chunk'
Examples:
from lazyllm.tools.data import Text2qa
op = Text2qa.empty_or_noise_filter(input_key='chunk')
data = [{'chunk': 'hello'}, {'chunk': ''}, {'chunk': '\n'}]
res = op(data)
print(res)
# [{'chunk': 'hello'}]
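The "no word/CJK characters" test can be approximated with a single regular expression. This is a hedged sketch: the pattern and the name `keep_if_meaningful` are assumptions, and the library's exact character classes may differ.

```python
import re

# At least one word character or CJK ideograph must be present.
_HAS_CONTENT = re.compile(r'[\w\u4e00-\u9fff]')


def keep_if_meaningful(item: dict, input_key: str = 'chunk'):
    text = item.get(input_key) or ''
    # Returning [] mirrors the "drop this item" convention described above.
    return item if _HAS_CONTENT.search(text) else []


data = [{'chunk': 'hello'}, {'chunk': ''}, {'chunk': '\n'}, {'chunk': '!!! ---'}]
kept = [item for item in data if keep_if_meaningful(item)]
print(kept)
```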
Source code in lazyllm/tools/data/operators/text2qa_ops.py
invalid_unicode_cleaner(data, input_key='chunk')
Remove invalid Unicode code points (e.g. FDD0–FDEF, FFFE/FFFF and certain Supplementary Special Purpose ranges) from the specified text field in place. Registered as a single-item forward.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'chunk') –key of the text field to clean, default 'chunk'
Examples:
from lazyllm.tools.data import Text2qa
op = Text2qa.invalid_unicode_cleaner(input_key='chunk')
data = {'chunk': 'valid text\ufdd0 tail'}
res = op(data) # strip invalid characters
print(res)
# [{'chunk': 'valid text tail'}]
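A minimal sketch of the cleaning step, covering only the two ranges named explicitly above (U+FDD0 to U+FDEF and U+FFFE/U+FFFF). The supplementary-plane ranges the operator also handles are not reproduced because the text does not list them; `strip_noncharacters` is a hypothetical name.

```python
import re

# Unicode noncharacters in the Basic Multilingual Plane.
_NONCHARACTERS = re.compile(r'[\uFDD0-\uFDEF\uFFFE\uFFFF]')


def strip_noncharacters(text: str) -> str:
    """Remove BMP noncharacters and leave everything else untouched."""
    return _NONCHARACTERS.sub('', text)


cleaned = strip_noncharacters('valid text\ufdd0 tail\uffff')
print(repr(cleaned))
```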
Source code in lazyllm/tools/data/operators/text2qa_ops.py
CoT QA Operators
lazyllm.tools.data.operators.cot_ops
CoTGenerator
Bases: GenCot
Use an LLM to generate chain-of-thought reasoning for a question, with the final answer wrapped in \boxed{ANSWER}. Writes the result to the specified output key.
Parameters:
-
input_key(str, default:'query') –key of the input question, default 'query'
-
output_key(str, default:'cot_answer') –key to write the CoT answer, default 'cot_answer'
-
model–optional TrainableModule or compatible; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt prefix; None uses default
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import genCot
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = genCot.CoTGenerator(input_key='query', output_key='cot_answer', model=llm)
data = {'query': 'What is 2+2?'}
res = op(data) # each item gets 'cot_answer' with CoT and \boxed{4}
print(res)
# {'query': 'What is 2+2?', 'cot_answer': '首先,我们需要理解加法的基本概念,即两个或多个数值的总和。在这个问题中,我们需要计算 2 和另一个 2 的和。\n第一步,我们识别出第一个数值是 2。\n第二步,我们识别出第二个数值也是 2。\n第三步,我们将这两个数值相加:2 + 2。\n第四步,我们进行计算:2 + 2 = 4。\n因此,最终答案是 4,使用规定的格式包裹答案。\n最终答案:\\boxed{4}'}
Source code in lazyllm/tools/data/operators/cot_ops.py
SelfConsistencyCoTGenerator
Bases: GenCot
Sample multiple CoT answers for the same question, extract the \boxed{} answers, take a majority vote, and output one CoT that matches the majority answer.
Parameters:
-
input_key(str, default:'query') –key of the input question, default 'query'
-
output_key(str, default:'cot_answer') –key to write the CoT answer, default 'cot_answer'
-
num_samples(int, default:5) –number of samples, default 5
-
model–optional; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import genCot
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = genCot.SelfConsistencyCoTGenerator(
input_key='query',
output_key='cot_answer',
num_samples=3,
model=llm
)
data = {'query': 'What is 3*4?'}
res = op(data)
print(res)
# {'query': 'What is 3*4?', 'candidates': ['12', '12', '12'], 'cot_answer': '首先,我们需要理解问题的核心,即计算3乘以4的结果。\n1. 确定操作:这是一个乘法问题,我们需要将两个数相乘。\n2. 识别数字:问题中给出的两个数字是3和4。\n3. 执行乘法:将3乘以4,计算过程如下:\n- 3 * 4 = 12\n因此,3乘以4的结果是12。\n最终答案为:\\boxed{12}'}
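The extract-and-vote step can be sketched independently of any model. The sampled CoT strings below are made up for illustration, the helper names are hypothetical, and the regular expression only handles \boxed{...} contents without nested braces.

```python
import re
from collections import Counter

# Matches \boxed{...} where the content has no nested braces.
BOXED = re.compile(r'\\boxed\{([^{}]*)\}')


def extract_boxed(cot: str):
    """Return the last boxed answer in a CoT string, or None if absent."""
    matches = BOXED.findall(cot)
    return matches[-1].strip() if matches else None


def majority_cot(samples: list):
    """Pick the first sample whose boxed answer equals the majority answer."""
    answers = [extract_boxed(sample) for sample in samples]
    valid = [answer for answer in answers if answer is not None]
    if not valid:
        return None
    winner, _ = Counter(valid).most_common(1)[0]
    for sample, answer in zip(samples, answers):
        if answer == winner:
            return sample


samples = [
    '3 * 4 = 12, so the answer is \\boxed{12}',
    '3 + 4 + 4 + 2 = 13, so the answer is \\boxed{13}',
    'Four threes make twelve: \\boxed{12}',
]
chosen = majority_cot(samples)
print(chosen)
```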
Source code in lazyllm/tools/data/operators/cot_ops.py
answer_verify(data, answer_key='reference', infer_key='llm_extracted', output_key='is_equal')
Compare reference answer and model-extracted answer for mathematical equality. Uses math_verify to parse and verify; result written to the specified key. Registered as single-item forward.
Parameters:
-
data(dict) –single data dict
-
answer_key(str, default:'reference') –key of reference answer, default 'reference'
-
infer_key(str, default:'llm_extracted') –key of LLM-extracted answer, default 'llm_extracted'
-
output_key(str, default:'is_equal') –key to write equality result, default 'is_equal'
Examples:
from lazyllm.tools.data import genCot
data = {'reference': '1/2', 'llm_extracted': '0.5'}
op = genCot.answer_verify(answer_key='reference', infer_key='llm_extracted', output_key='is_equal')
print(op(data)) # Add key/value: 'is_equal': True
# {'reference': '1/2', 'llm_extracted': '0.5', 'is_equal': True}
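To make "mathematical equality" concrete, the sketch below compares two answer strings as exact rationals using Python's standard `fractions` module. It is a deliberately narrow stand-in, not the operator's method: the operator relies on math_verify, and `numerically_equal` is a hypothetical helper that understands only plain integers, decimals, and a/b fractions.

```python
from fractions import Fraction


def numerically_equal(reference: str, inferred: str) -> bool:
    """Compare two answers as exact rationals, falling back to string equality."""
    try:
        return Fraction(reference) == Fraction(inferred)
    except (ValueError, ZeroDivisionError):
        # Not a plain number (or a zero denominator): compare the raw text.
        return reference.strip() == inferred.strip()


data = {'reference': '1/2', 'llm_extracted': '0.5'}
data['is_equal'] = numerically_equal(data['reference'], data['llm_extracted'])
print(data)
```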
Source code in lazyllm/tools/data/operators/cot_ops.py
Enhanced QA Operators
lazyllm.tools.data.operators.enQa_ops
DiversityScorer
Bases: EnQA
Score diversity of a list of questions; output list matches input order, each item has rewritten_query and diversity_score (0 similar / 1 diverse).
Parameters:
-
input_key(str, default:'rewrite_querys') –key of the question list, default 'rewrite_querys'
-
output_key(str, default:'diversity_querys') –key to write the scored list, default 'diversity_querys'
-
model–optional; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import EnQA
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = EnQA.DiversityScorer(input_key='rewrite_querys', output_key='diversity_querys', model=llm)
data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!']}
res = op(data)
print(data)
# {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
Source code in lazyllm/tools/data/operators/enQa_ops.py
QueryRewriter
Bases: EnQA
Use an LLM to rewrite the original question into multiple semantically equivalent formulations. Writes a list to the specified output key.
Parameters:
-
input_key(str, default:'query') –key of the input question, default 'query'
-
output_key(str, default:'rewrite_querys') –key to write the list of rewrites, default 'rewrite_querys'
-
rewrite_num(int, default:3) –number of rewrites to generate, default 3
-
model–optional; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import EnQA
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = EnQA.QueryRewriter(input_key='query', output_key='rewrite_querys', rewrite_num=2, model=llm)
data = {'query': 'What is machine learning?'}
res = op(data) # data gets 'rewrite_querys': [str, str, ...]
print(res)
# [{'query': 'What is machine learning?', 'rewrite_querys': ['Could you explain what machine learning is?', 'What does the term machine learning refer to?']}]
Source code in lazyllm/tools/data/operators/enQa_ops.py
diversity_filter(data, input_key, min_score)
Filter by diversity score: if the value at input_key is below min_score, the item is dropped (returns []); otherwise it is kept (returns None, which leaves the original data unchanged). Registered as a single-item forward.
Parameters:
-
data(dict) –single data dict
-
input_key(str) –key holding the score
-
min_score–minimum score threshold
Examples:
from lazyllm.tools.data import EnQA
data = {'query': 'a and b', 'rewritten_query': 'b', 'diversity_score': 0}
op = EnQA.diversity_filter(input_key='diversity_score', min_score=1)
print(op(data)) # score 0 is below min_score=1, so the item is dropped
# []
Source code in lazyllm/tools/data/operators/enQa_ops.py
post_processor(data, input_key)
Expand the specified key (list of dicts) into multiple rows: each dict merged with original data as one row, list key removed. Returns list of rows or None if no data. Registered as single-item forward.
Parameters:
-
data(dict) –single data dict
-
input_key(str) –key of the list of dicts to expand
Examples:
from lazyllm.tools.data import EnQA
data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
op = EnQA.post_processor(input_key='diversity_querys')
print(op(data))
# [{'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]
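The row expansion amounts to merging each dict in the list with the remaining fields of the original item. A minimal sketch, assuming later keys from the list entries overwrite same-named keys of the original item (`expand_rows` is a hypothetical name):

```python
def expand_rows(item: dict, input_key: str):
    """Turn one item holding a list of dicts into one row per dict."""
    rows = item.get(input_key)
    if not rows:
        return None
    # Everything except the list key is copied into each row.
    base = {key: value for key, value in item.items() if key != input_key}
    return [{**base, **row} for row in rows]


item = {
    'query': 'nice weather',
    'diversity_querys': [
        {'rewritten_query': 'It is a nice day!', 'diversity_score': 1},
        {'rewritten_query': 'The weather is good today', 'diversity_score': 0},
    ],
}
rows = expand_rows(item, 'diversity_querys')
print(rows)
```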
Source code in lazyllm/tools/data/operators/enQa_ops.py
Math QA Operators
lazyllm.tools.data.operators.math_ops
DifficultyEvaluator
Bases: MathQA
Use an LLM to evaluate math question difficulty; output Easy | Medium | Hard. Skips if difficulty already present.
Parameters:
-
input_key(str, default:'question') –key of the question, default 'question'
-
output_key(str, default:'difficulty') –key to write difficulty, default 'difficulty'
-
model–optional; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = MathQA.DifficultyEvaluator(input_key='question', output_key='difficulty', model=llm)
data = {'question': '1+1=?'}
res = op(data) # each item gets 'difficulty': 'Easy'|'Medium'|'Hard'
print(res)
# [{'question': '1+1=?', 'difficulty': 'Easy'}]
Source code in lazyllm/tools/data/operators/math_ops.py
DuplicateAnswerDetector
Bases: MathQA
Detect repetition in answers: periodic repetition, sentence-level repetition, or a long substring repeated within question+answer. Sets the output key to True if any of these is detected. Makes no model call.
Parameters:
-
question_key(str, default:'question') –key of the question, default 'question'
-
answer_key(str, default:'answer') –key of the answer, default 'answer'
-
output_key(str, default:'duplicate') –key to write duplicate flag, default 'duplicate'
-
min_repeat_len(int, default:15) –min substring length for long repeat, default 15
-
repeat_threshold(int, default:2) –occurrence threshold for substring, default 2
-
periodic_min_repeat(int, default:3) –min period repeats for periodic, default 3
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import MathQA
op = MathQA.DuplicateAnswerDetector(question_key='question', answer_key='answer', output_key='duplicate')
data = {'question': 'Q', 'answer': 'A' * 50}
res = op(data) # data['duplicate'] True
print(res)
# [{'question': 'Q', 'answer': 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 'duplicate': True}]
Source code in lazyllm/tools/data/operators/math_ops.py
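The two substring-based checks described above can be sketched in plain Python. This is an illustration only, not the code in math_ops.py: the function names, the `min_unit_len` parameter, and the regex-based approach are assumptions, and the sentence-level check is omitted.

```python
import re

def has_periodic_repeat(text, min_repeats=3, min_unit_len=2):
    """True if a unit of at least min_unit_len chars occurs min_repeats times in a row."""
    # (.{n,}?) captures a candidate unit; \1{m,} requires m further back-to-back copies.
    pattern = r'(.{%d,}?)\1{%d,}' % (min_unit_len, min_repeats - 1)
    return re.search(pattern, text, flags=re.DOTALL) is not None

def has_long_repeat(text, min_len=15, threshold=2):
    """True if any substring of length min_len occurs at least threshold times."""
    counts = {}
    for i in range(len(text) - min_len + 1):
        chunk = text[i:i + min_len]
        counts[chunk] = counts.get(chunk, 0) + 1
        if counts[chunk] >= threshold:
            return True
    return False

def looks_duplicated(question, answer):
    # Hypothetical combination: periodic check on the answer,
    # long-substring check on question + answer.
    return has_periodic_repeat(answer) or has_long_repeat(question + answer)

print(looks_duplicated('Q', 'A' * 50))
```

The defaults mirror min_repeat_len=15, repeat_threshold=2, and periodic_min_repeat=3 from the parameter list; how the real operator counts overlapping matches is not stated in this page.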
MathAnswerGenerator
Bases: MathQA
Use an LLM to generate the reasoning and answer for a math question, with the final result in \boxed{ANSWER}. Skips items that already have an answer unless the regenerate flag is set.
Parameters:
-
input_key(str, default:'question') –key of the question, default 'question'
-
output_key(str, default:'answer') –key to write the answer, default 'answer'
-
regenerate_key(str, default:'regenerate') –key for force-regenerate flag, default 'regenerate'
-
model–optional; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data.operators.math_ops import MathAnswerGenerator
from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = MathQA.MathAnswerGenerator(input_key='question', output_key='answer', model=llm)
data = [{'question': 'Solve 10 * 10'}]
res = op(data)
print(res)
# [{'question': 'Solve 10 * 10', 'answer': '首先,我们需要计算 \(10 \times 10\)。这是一个简单的乘法运算,其中两个乘数都是10。
# 步骤1:写下乘数10和另一个乘数10。
# 步骤2:将两个10相乘。
# 计算过程如下:
# \[ 10 \times 10 = 100 \]
# 因此,最终结果是 \(\boxed{100}\)。', 'regenerate': False}]
Source code in lazyllm/tools/data/operators/math_ops.py
QualityEvaluator
Bases: MathQA
Use an LLM to score question-answer quality: 0 = regenerate, 1 = acceptable. Skips if output_key already present.
Parameters:
-
question_key(str, default:'question') –key of the question, default 'question'
-
answer_key(str, default:'answer') –key of the answer, default 'answer'
-
output_key(str, default:'score') –key to write score, default 'score'
-
model–optional; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = MathQA.QualityEvaluator(question_key='question', answer_key='answer', output_key='score', model=llm)
data = {'question': '今天天气如何', 'answer': '大家好~'}
res = op(data) # low-quality pairs are scored 0
print(res)
# [{'question': '今天天气如何', 'answer': '大家好~', 'score': 0}]
Source code in lazyllm/tools/data/operators/math_ops.py
QuestionFusionGenerator
Bases: MathQA
Use an LLM to fuse multiple questions into one and generate reasoning with the answer in \boxed{}. Requires at least 2 questions under list_key.
Parameters:
-
input_key(str, default:'question') –key for fused question, default 'question'
-
output_key(str, default:'answer') –key to write answer, default 'answer'
-
list_key(str, default:'question_list') –key of the question list, default 'question_list'
-
model–optional; None uses default Qwen model
-
user_prompt(str | None, default:None) –optional user prompt
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule
llm = OnlineChatModule()
op = MathQA.QuestionFusionGenerator(input_key='new_question', list_key='question_list', output_key='new_answer', model=llm)
data = {'question_list': [
{'question': '1加1等于几?', 'answer': '1+1 = 2'},
{'question': '2的平方等于几?', 'answer': '2*2 = 4'}]}
res = op(data)
print(res)
# [{'question_list': [{'question': '1加1等于几?', 'answer': '1+1 = 2'}, {'question': '2的平方等于几?', 'answer': '2*2 = 4'}],
# 'new_question': '如果1加1的结果与2的平方相比较,哪个更大?',
# 'new_answer': '首先,我们解决第一个问题:1加1等于几?计算得到 1+1 = 2。然后,解决第二个问题:2的平方等于几?计算得到 2*2 = 4。最后,我们比较这两个结果,2和4。显然,4大于2。所以,2的平方更大。'}]
Source code in lazyllm/tools/data/operators/math_ops.py
ReasoningAnswerTokenLengthFilter
Bases: MathQA
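Filter by answer length, counted in tokens or characters. If the answer exceeds max_answer_token_length, the field is cleared and the modified data is returned; if it is within the limit, None is returned to keep the item unchanged; if the answer is empty, [] is returned. Counts with a tokenizer when tokenize is True, otherwise by characters.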
Parameters:
-
input_key(str, default:'answer') –key of the answer, default 'answer'
-
max_answer_token_length(int, default:300) –max allowed length, default 300
-
tokenize(bool, default:True) –whether to count by tokens; uses default Qwen tokenizer if True and tokenizer not provided
-
tokenizer–optional
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import MathQA
op = MathQA.ReasoningAnswerTokenLengthFilter(input_key='answer', max_answer_token_length=100, tokenize=False)
data = [{'answer': 'short'}]
print(op(data)) # less than the max_length, keep the original input
# [{'answer': 'short'}]
Source code in lazyllm/tools/data/operators/math_ops.py
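The three return conventions described above can be illustrated with a small standalone function. This is a sketch under assumptions, not the operator itself: `length_filter` and its `count` argument are hypothetical names, and "clearing" the field is modeled here as setting it to an empty string.

```python
def length_filter(data, input_key='answer', max_len=300, count=len):
    """Mimic the documented return conventions of a length filter.

    count is any callable mapping text to a length; len counts characters.
    """
    answer = data.get(input_key)
    if not answer:
        return []                        # empty answer: drop the item
    if count(answer) > max_len:
        return {**data, input_key: ''}   # too long: clear the field (assumed to mean '')
    return None                          # within the limit: keep the item unchanged

print(length_filter({'answer': 'short'}, max_len=100))
```

Passing `count=lambda s: len(s.split())` gives a rough whitespace token count in place of a real tokenizer.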
DifficultyEvaluatorBatch(data, input_key='difficulty')
Batch: count the occurrences of each value of the specified key (e.g. difficulty) across the input list; returns a single-element list [{value: count, ...}]. Registered as forward_batch_input.
Parameters:
-
data(list[dict]) –list of input dicts
-
input_key(str, default:'difficulty') –key to aggregate, default 'difficulty'
Examples:
from lazyllm.tools.data import MathQA
op = MathQA.DifficultyEvaluatorBatch(input_key='difficulty')
data = [{'difficulty': 'Easy'}, {'difficulty': 'Hard'}, {'difficulty': 'Easy'}]
print(op(data))
# [{'Easy': 2, 'Hard': 1}]
Source code in lazyllm/tools/data/operators/math_ops.py
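The aggregation itself is equivalent to a value count. A minimal standalone sketch, assuming items that lack the key are simply skipped (`count_values` is a hypothetical name, not a library function):

```python
from collections import Counter

def count_values(data, input_key='difficulty'):
    """Count how often each value of input_key occurs across a list of dicts."""
    counts = Counter(item[input_key] for item in data if input_key in item)
    return [dict(counts)]  # single-element list, matching the documented shape

rows = [{'difficulty': 'Easy'}, {'difficulty': 'Hard'}, {'difficulty': 'Easy'}]
print(count_values(rows))
```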
math_answer_extractor(data, input_key='answer', output_key='math_answer')
Extract the math answer inside \boxed{} from text and write it to the specified output key. Registered as single-item forward.
Parameters:
-
data(dict) –single data dict
-
input_key(str, default:'answer') –key of the text containing the answer, default 'answer'
-
output_key(str, default:'math_answer') –key to write the extracted value, default 'math_answer'
Examples:
from lazyllm.tools.data import MathQA
data = {'answer': r'So the answer is \boxed{42}.'}
op = MathQA.math_answer_extractor(input_key='answer', output_key='math_answer')
print(op(data)) # data['math_answer'] == '42'
# [{'answer': 'So the answer is \\boxed{42}.', 'math_answer': '42'}]
Source code in lazyllm/tools/data/operators/math_ops.py
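Extracting the content of \boxed{} needs brace matching, because answers such as \boxed{\frac{1}{2}} contain nested braces. The following is a standalone sketch of that idea, not the library's implementation; `extract_boxed` is a hypothetical helper, and taking the last \boxed occurrence is an assumption.

```python
def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    marker = r'\boxed{'
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1                      # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:         # matching close of \boxed{ found
                return ''.join(out)
        out.append(ch)
    return None                    # unbalanced braces

print(extract_boxed(r'So the answer is \boxed{42}.'))
```

Raw strings (r'...') are used so that Python does not interpret \b as a backspace character.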
Pdf QA Operators
lazyllm.tools.data.operators.pdf_ops
Pdf2Md
Bases: Pdf2Qa
Convert PDF to a list of Markdown documents. Uses MineruPDFReader (reader_url required). Supports cache.
Parameters:
-
input_key(str, default:'pdf_path') –key of the PDF path, default 'pdf_path'
-
output_key(str, default:'docs') –key to write the document list, default 'docs'
-
reader_url–required, Mineru reader service URL
-
backend(str, default:'vlm-vllm-async-engine') –backend type, default 'vlm-vllm-async-engine'
-
upload_mode(bool, default:True) –whether to use upload mode, default True
-
use_cache(bool, default:False) –whether to use cache, default False
-
**kwargs–other base-class args
Examples:
from lazyllm.tools.data import Pdf2Qa
from lazyllm.tools.data.operators.pdf_ops import Pdf2Md
op = Pdf2Qa.Pdf2Md(input_key='pdf_path', output_key='docs', reader_url='http://...')
data = [{'pdf_path': '/path/to/file.pdf'}]
res = op(data) # each item gets 'docs' (list of doc content)
Source code in lazyllm/tools/data/operators/pdf_ops.py
Agentic rag
lazyllm.tools.data.operators.agentic_rag.agenticrag_atomic_task_generator
AgenticRAGCleanQA
Bases: agenticrag
Cleans and refines a generated QA pair by calling the LLM to produce a refined_answer.
Parameters:
-
llm–language model service instance
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGCleanQA(llm=my_llm)
result = op({'question': 'What is...', 'answer': 'Raw answer'})
print(result['refined_answer'])
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
AgenticRAGExpandConclusions
Bases: agenticrag
Parses the JSON conclusion list in raw_conclusion and expands it into multiple candidate task records.
Only items containing 'conclusion' and 'R' are kept. Each valid item produces a new data row with candidate_tasks_str.
Parameters:
-
max_per_task(int, default:10) –maximum number of candidate tasks per sample
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGExpandConclusions(max_per_task=5)
rows = op({
'raw_conclusion': '[{"conclusion":"A","R":"rel"}]',
'identifier': 'doc1'
})
print(rows)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
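The parse-filter-expand step can be sketched without the library. This is an illustration under assumptions: `expand_conclusions` is a hypothetical function, each output row is assumed to copy the input fields, and invalid JSON is assumed to yield no rows.

```python
import json

def expand_conclusions(data, max_per_task=10):
    """Expand a JSON conclusion list into one row per valid candidate task."""
    try:
        items = json.loads(data.get('raw_conclusion', ''))
    except (json.JSONDecodeError, TypeError):
        return []                              # unparsable input: no candidate tasks
    if not isinstance(items, list):
        return []
    # Keep only items that carry both a conclusion and a relationship.
    valid = [it for it in items
             if isinstance(it, dict) and 'conclusion' in it and 'R' in it]
    rows = []
    for item in valid[:max_per_task]:
        row = dict(data)                       # assumption: input fields are carried over
        row['candidate_tasks_str'] = json.dumps(item, ensure_ascii=False)
        rows.append(row)
    return rows

sample = {'raw_conclusion': '[{"conclusion":"A","R":"rel"},{"conclusion":"B"}]',
          'identifier': 'doc1'}
print(expand_conclusions(sample, max_per_task=5))
```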
AgenticRAGGenerateQuestion
Bases: agenticrag
Generates a question-answer pair from task identifier (ID), relationship (R), and answer (A).
Parameters:
-
llm–language model service instance
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGenerateQuestion(llm=my_llm)
result = op({
'candidate_tasks_str': '{"conclusion":"Paris","R":"capital_of"}',
'identifier': 'France'
})
print(result['question'], result['answer'])
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
AgenticRAGGetConclusion
Bases: agenticrag
An operator that extracts conclusions and generates relationships using an LLM.
It builds prompts from the input text and stores the raw model output in data['raw_conclusion'] for downstream parsing and task expansion. If generation fails, an empty string is assigned.
Parameters:
-
llm–language model service instance
-
input_key(str, default:'prompts') –name of the input text field, default 'prompts'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGetConclusion(llm=my_llm)
result = op({'prompts': 'Some document content'})
print(result['raw_conclusion'])
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
AgenticRAGGetIdentifier
Bases: agenticrag
An operator that extracts a content identifier from the input text using an LLM.
Parameters:
-
llm–language model service instance
-
input_key(str, default:'prompts') –name of the input text field, default 'prompts'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGetIdentifier(llm=my_llm, input_key='prompts')
result = op({'prompts': 'What is the third movie in the Avatar series?'})
print('identifier:', result['identifier'])
# {'identifier': 'Avatar series'}
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
AgenticRAGGoldenDocAnswer
Bases: agenticrag
Generates answers from a golden document and verifies via recall scoring.
It produces an answer using golden_doc and question, then scores it against refined_answer. Samples with insufficient score are filtered out.
Parameters:
-
llm–language model service instance
-
input_key(str, default:'prompts') –golden document field name
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGoldenDocAnswer(llm=my_llm)
result = op({
'prompts': 'Golden document text',
'question': 'Q?',
'refined_answer': 'Expected A'
})
print(result)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
AgenticRAGGroupAndLimit
Bases: agenticrag
Groups data by a specified key and limits the number of QA pairs per group.
It groups batch input by input_key and retains up to max_question items per group to control sample distribution.
Parameters:
-
input_key(str, default:'prompts') –grouping field name
-
max_question(int, default:10) –maximum QA pairs per group
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGroupAndLimit(input_key='prompts', max_question=2)
result = op([
{'prompts': 'doc1', 'question': 'Q1'},
{'prompts': 'doc1', 'question': 'Q2'},
{'prompts': 'doc1', 'question': 'Q3'}
])
print(result) # only 2 kept for doc1
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
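Grouping with a per-group cap can be expressed in a few lines. The sketch below is standalone and makes one assumption the page does not state: within each group, the first max_question items in input order are the ones kept. `group_and_limit` is a hypothetical name.

```python
from collections import defaultdict

def group_and_limit(items, input_key='prompts', max_question=10):
    """Keep at most max_question items for each distinct value of input_key."""
    kept_per_group = defaultdict(int)
    result = []
    for item in items:
        group = item.get(input_key)
        if kept_per_group[group] < max_question:
            kept_per_group[group] += 1
            result.append(item)
    return result

batch = [
    {'prompts': 'doc1', 'question': 'Q1'},
    {'prompts': 'doc1', 'question': 'Q2'},
    {'prompts': 'doc1', 'question': 'Q3'},
    {'prompts': 'doc2', 'question': 'Q4'},
]
print(group_and_limit(batch, max_question=2))
```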
AgenticRAGLLMVerify
Bases: agenticrag
Verifies QA quality via LLM answering and recall scoring.
The model first answers the question to produce llm_answer, then scores refined_answer against llm_answer. If score >= 1, the sample is filtered out; otherwise retained.
Parameters:
-
llm–language model service instance
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGLLMVerify(llm=my_llm)
result = op({'question': 'Q?', 'refined_answer': 'A'})
print(result)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
AgenticRAGOptionalAnswers
Bases: agenticrag
Generates multiple optional answers for a refined answer.
It calls the LLM to produce semantically equivalent or similar variants, stored in optional_answer.
Parameters:
-
llm–language model service instance
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGOptionalAnswers(llm=my_llm)
result = op({'refined_answer': 'Paris'})
print(result['optional_answer'])
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
lazyllm.tools.data.operators.agentic_rag.agenticrag_depth_qa_generator
DepthQAGBackwardTask
Bases: agenticrag
Generates a backward task from the existing identifier, producing a new identifier and relation.
This operator infers backwards from the given identifier to generate a new identifier and corresponding relation for building depth QA tasks.
Parameters:
-
llm–language model service instance
-
identifier_key(str, default:'identifier') –original identifier field name, default 'identifier'
-
new_identifier_key(str, default:'new_identifier') –new identifier field name, default 'new_identifier'
-
relation_key(str, default:'relation') –relation field name, default 'relation'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.DepthQAGBackwardTask(llm=my_llm)
result = op({'identifier': 'machine learning'})
print(result)
# {'identifier': 'machine learning', 'new_identifier': '...', 'relation': '...'}
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
DepthQAGCheckSuperset
Bases: agenticrag
Checks whether the newly generated query is a superset of the original identifier.
Verifies if the combination of new_identifier and relation constitutes a valid superset query of the original identifier. If validation passes, the data is retained; otherwise, an empty list is returned to filter out the sample.
Parameters:
-
llm–language model service instance
-
new_identifier_key(str, default:'new_identifier') –new identifier field name, default 'new_identifier'
-
relation_key(str, default:'relation') –relation field name, default 'relation'
-
identifier_key(str, default:'identifier') –original identifier field name, default 'identifier'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.DepthQAGCheckSuperset(llm=my_llm)
result = op({
'identifier': 'Paris',
'new_identifier': 'France',
'relation': 'capital_of'
})
print(result) # returns data if valid, empty list if invalid
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
DepthQAGGenerateQuestion
Bases: agenticrag
Generates a depth question based on the new identifier, relation, and original identifier.
Uses an LLM to generate a question for depth QA tasks based on new_identifier, relation, and identifier, storing the result in the specified question_key field.
Parameters:
-
llm–language model service instance
-
new_identifier_key(str, default:'new_identifier') –new identifier field name, default 'new_identifier'
-
relation_key(str, default:'relation') –relation field name, default 'relation'
-
identifier_key(str, default:'identifier') –original identifier field name, default 'identifier'
-
question_key(str, default:'depth_question') –field name to store generated question, default 'depth_question'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.DepthQAGGenerateQuestion(llm=my_llm)
result = op({
'identifier': 'Paris',
'new_identifier': 'France',
'relation': 'capital_of'
})
print(result['depth_question'])
# 'What is the capital of France?'
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
DepthQAGGetIdentifier
Bases: agenticrag
An operator that extracts a content identifier from the input text using an LLM.
If the identifier field already exists in the data, processing is skipped.
Parameters:
-
llm–language model service instance
-
input_key(str, default:'question') –name of the input text field, default 'question'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.DepthQAGGetIdentifier(llm=my_llm, input_key='question')
result = op({'question': 'What is the capital of France?'})
print('identifier:', result['identifier'])
# {'identifier': 'capital of France'}
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
DepthQAGVerifyQuestion
Bases: agenticrag
Verifies the quality of generated questions and filters out overly easy ones.
First has the LLM answer the question to produce llm_answer, then calculates a recall score against refined_answer. If score >= 1 (indicating the question is too easy), the sample is filtered out; otherwise the data is retained.
Parameters:
-
llm–language model service instance
-
question_key(str, default:'depth_question') –question field name, default 'depth_question'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.DepthQAGVerifyQuestion(llm=my_llm)
result = op({
'depth_question': 'What is the capital of France?',
'refined_answer': 'Paris'
})
# Returns data if question is challenging, empty list if too easy
print(result)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
lazyllm.tools.data.operators.agentic_rag.agenticrag_qaf1_sample_evaluator
qaf1_calculate_score(data, result_key='F1Score')
A function that calculates the F1 score for QA pairs.
Calculates the F1 score (combining precision and recall) based on normalized prediction and ground truth answers. Supports multiple ground truth answers, taking the highest F1 score as the final result. Cleans up temporary fields after calculation.
Parameters:
-
data(dict) –single data dictionary
-
output_key(str) –output field name for F1 score, default 'F1Score'
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.qaf1_calculate_score(output_key='F1Score')
result = op({
'_normalized_prediction': 'paris is capital',
'_normalized_ground_truths': ['capital is paris', 'paris capital france']
})
print(result['F1Score']) # F1 score value between 0.0 and 1.0
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_qaf1_sample_evaluator.py
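A token-level F1 with a maximum over several ground truths is commonly computed as below. This is a generic sketch of the metric on already-normalized, whitespace-separated text, not the code in agenticrag_qaf1_sample_evaluator.py; `token_f1` and `best_f1` are hypothetical names.

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, ground_truths):
    """Highest F1 of the prediction against any ground truth."""
    return max((token_f1(prediction, gt) for gt in ground_truths), default=0.0)

print(best_f1('paris is capital', ['capital is paris', 'paris capital france']))
```

Token order does not affect the score, so 'paris is capital' and 'capital is paris' match perfectly.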
qaf1_normalize_texts(data, predicted_key='refined_answer', reference_key='golden_doc_answer')
A function that normalizes prediction and ground truth answer texts.
Performs standardization on prediction and ground truth answers, including: converting to lowercase, removing punctuation, removing articles (a/an/the), and normalizing whitespace. Normalized results are stored in temporary fields for subsequent F1 score calculation.
Parameters:
-
data(dict) –single data dictionary
-
prediction_key(str) –prediction answer field name, default 'refined_answer'
-
ground_truth_key(str) –ground truth answer field name, default 'golden_doc_answer'
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.qaf1_normalize_texts(prediction_key='refined_answer', ground_truth_key='golden_doc_answer')
result = op({
'refined_answer': 'Paris is the capital.',
'golden_doc_answer': 'The capital is Paris!'
})
print(result['_normalized_prediction']) # 'paris is capital'
print(result['_normalized_ground_truths']) # ['capital is paris']
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_qaf1_sample_evaluator.py
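The four normalization steps listed above (lowercase, strip punctuation, drop articles, collapse whitespace) can be sketched as a single function. This is a standalone illustration; `normalize_answer` is a hypothetical name and only ASCII punctuation is handled.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, remove ASCII punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)   # drop English articles
    return ' '.join(text.split())                 # collapse runs of whitespace

print(normalize_answer('Paris is the capital.'))
print(normalize_answer('The capital is Paris!'))
```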
lazyllm.tools.data.operators.agentic_rag.agenticrag_width_qa_generator
WidthQAGCheckDecomposition
Bases: agenticrag
An operator that verifies whether the merged question effectively decomposes the original questions.
This operator checks if the complex question generated by LLM correctly decomposes and includes the original questions. If validation passes, the data is retained; otherwise an empty list is returned to filter out the sample.
Parameters:
-
llm–language model service instance
-
output_question_key(str, default:'generated_width_task') –field name for the generated question, default 'generated_width_task'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.WidthQAGCheckDecomposition(llm=my_llm)
result = op({
'question': 'What are the capitals of France and UK?',
'original_question': ['What is Paris?', 'What is London?'],
'index': 0
})
print(result) # Returns data if valid, empty list if invalid
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py
WidthQAGFilterByScore
Bases: agenticrag
An operator that filters width questions based on recall score.
This operator compares golden_answer with llm_answer to calculate a recall score. If score >= 1, the sample is filtered out (indicating the question is too easy or LLM answered too well); otherwise the data is retained and temporary fields are cleaned.
Parameters:
-
llm–language model service instance
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.WidthQAGFilterByScore(llm=my_llm)
result = op({
'original_answer': ['Paris', 'London'],
'llm_answer': 'Paris is the capital of France and London is the capital of UK',
'state': 1
})
# Returns data if score < 1, empty list if score >= 1
print(result)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py
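The filtering rule (drop the sample when the recall score reaches 1) can be illustrated with a stand-in scorer. In the sketch below the recall is a simple case-insensitive substring check, which is an assumption made for illustration; the operator takes an llm argument, so its real scoring may well be model-based. The names `recall_score` and `filter_by_score` are hypothetical, and the cleanup of temporary fields is omitted.

```python
def recall_score(golden_answers, llm_answer):
    """Fraction of golden answers that appear verbatim (case-insensitive) in llm_answer."""
    if not golden_answers:
        return 0.0
    text = llm_answer.lower()
    hits = sum(1 for answer in golden_answers if answer.lower() in text)
    return hits / len(golden_answers)

def filter_by_score(data, threshold=1.0):
    """Return [] when the question was answered fully (too easy), else the data."""
    score = recall_score(data.get('original_answer', []), data.get('llm_answer', ''))
    return [] if score >= threshold else data

easy = {'original_answer': ['Paris', 'London'],
        'llm_answer': 'Paris is the capital of France and London is the capital of UK'}
hard = {'original_answer': ['Paris', 'London'], 'llm_answer': 'Paris'}
print(filter_by_score(easy), filter_by_score(hard))
```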
WidthQAGMergePairs
Bases: agenticrag
An operator that merges adjacent QA pairs to generate width questions.
This operator receives a batch of QA data and uses an LLM to merge adjacent pairs into more complex width questions. Requires at least 2 items to perform merging.
Parameters:
-
llm–language model service instance
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.WidthQAGMergePairs(llm=my_llm)
result = op([
{'question': 'What is Paris?', 'golden_answer': 'Capital of France'},
{'question': 'What is London?', 'golden_answer': 'Capital of UK'}
])
print(result[0]['question']) # Merged complex question
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py
WidthQAGVerifyQuestion
Bases: agenticrag
An operator that verifies if the generated question can be properly answered.
This operator uses an LLM to attempt answering the generated question and stores the answer in the llm_answer field for subsequent scoring.
Parameters:
-
llm–language model service instance
-
output_question_key(str, default:'generated_width_task') –question field name, default 'generated_width_task'
-
**kwargs(dict, default:{}) –additional user-provided arguments.
Examples:
from lazyllm.tools.data import agenticrag
op = agenticrag.WidthQAGVerifyQuestion(llm=my_llm)
result = op({
'generated_width_task': 'What are the capitals of France and UK?',
'index': 0
})
print(result['llm_answer']) # LLM's answer to the question
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py
Embedding synthesis
lazyllm.tools.data.operators.embedding_synthesis
EmbeddingFormatFlagEmbedding
Bases: embedding
An operator that formats data into FlagEmbedding training format.
This operator formats the input query, pos (positive samples), and neg (negative samples) into the training data format required by the FlagEmbedding framework. Supports adding an instruction field for supervised Embedding training.
Parameters:
-
instruction(str, default:None) –Instruction text for supervised training scenarios. Defaults to None.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–A dictionary containing query, pos, neg, and optional prompt fields.
Examples:
from lazyllm.tools.data import embedding
op = embedding.EmbeddingFormatFlagEmbedding(instruction='Represent this sentence for searching relevant passages:')
result = op({'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe']})
# Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe'], 'prompt': 'Represent this sentence for searching relevant passages:'}
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py
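The record layout described above can be reproduced with a plain function and serialized one JSON object per line, which is the usual shape of a JSONL training file. This is a sketch only: `to_flag_embedding` is a hypothetical helper, and writing JSONL is an assumption about how the records are consumed downstream.

```python
import json

def to_flag_embedding(data, instruction=None):
    """Build one record with query, pos, neg, and an optional prompt field."""
    record = {'query': data['query'],
              'pos': list(data['pos']),
              'neg': list(data['neg'])}
    if instruction is not None:
        record['prompt'] = instruction   # only present when an instruction is given
    return record

sample_record = {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe']}
line = json.dumps(to_flag_embedding(sample_record, instruction='Represent this sentence:'))
print(line)
```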
EmbeddingFormatSentenceTransformers
Bases: embedding
An operator that formats data into SentenceTransformers triplet training format.
This operator converts the input query, pos (positive samples), and neg (negative samples) into the anchor-positive-negative triplet format required by the SentenceTransformers framework. Suitable for training with losses like MultipleNegativesRankingLoss.
Parameters:
-
**kwargs(dict, default:{}) –Optional arguments passed to the parent class.
Returns:
-
List[dict]–A list of dictionaries containing anchor, positive, and negative fields, with one triplet generated for each positive-negative pair.
Examples:
from lazyllm.tools.data import embedding
op = embedding.EmbeddingFormatSentenceTransformers()
result = op({'query': 'machine learning', 'pos': ['ML basics'], 'neg': ['cooking tips']})
# Returns: [{'anchor': 'machine learning', 'positive': 'ML basics', 'negative': 'cooking tips'}]
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py
EmbeddingFormatTriplet
Bases: embedding
An operator that formats data into generic triplet format.
This operator converts the input query, pos (positive samples), and neg (negative samples) into a standard triplet format with field names query, positive, and negative. Compatible with various Embedding training frameworks.
Parameters:
-
**kwargs(dict, default:{}) –Optional arguments passed to the parent class.
Returns:
-
List[dict]–A list of dictionaries containing query, positive, and negative fields, with one triplet generated for each positive-negative pair.
Examples:
from lazyllm.tools.data import embedding
op = embedding.EmbeddingFormatTriplet()
result = op({'query': 'deep learning', 'pos': ['neural networks', 'AI'], 'neg': ['history', 'geography']})
# Returns list of triplets combining each positive with each negative
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py
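Generating one triplet per positive-negative pair is a Cartesian product. The standalone sketch below shows that expansion; `to_triplets` and its `anchor_field` parameter are hypothetical, and the output order (positives outermost) is an assumption.

```python
from itertools import product

def to_triplets(data, anchor_field='query'):
    """Expand query/pos/neg into one triplet per (positive, negative) pair."""
    return [{anchor_field: data['query'], 'positive': pos, 'negative': neg}
            for pos, neg in product(data['pos'], data['neg'])]

item = {'query': 'deep learning',
        'pos': ['neural networks', 'AI'],
        'neg': ['history', 'geography']}
triplets = to_triplets(item)
print(len(triplets), triplets[0])
```

Two positives and two negatives give four triplets. Calling `to_triplets(item, anchor_field='anchor')` produces the anchor/positive/negative naming used by the SentenceTransformers variant.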
EmbeddingGenerateQueries
Bases: embedding
An operator that generates queries using LLM.
This operator calls a language model service to generate queries based on the built prompts. Returns the query response in JSON format.
Parameters:
-
llm–LLM service instance for generating queries.
-
num_queries(int, default:3) –Number of queries to generate, defaults to 3.
-
lang(str, default:'zh') –Language, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
-
query_types(List[str], default:None) –List of query types, defaults to ['factual', 'semantic', 'inferential'].
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Input data with '_query_response' field added containing the generated query response.
Examples:
from lazyllm.tools.data import embedding
# Assuming llm is an LLM service instance
generator = embedding.EmbeddingGenerateQueries(llm=llm, lang='zh')
data = {'_query_prompt': 'Generate queries for: machine learning tutorial'}
result = generator(data)
# Returns data with '_query_response' field containing JSON queries
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_query_generator.py
EmbeddingInitBM25
Bases: embedding
An operator that initializes BM25 index.
This operator builds BM25 index based on corpus for subsequent keyword retrieval and hard negative mining. Supports Chinese and English tokenization, using jieba for Chinese and Stemmer for English stemming.
Parameters:
-
language(str, default:'zh') –Language type, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
–
List[dict]: Input data with BM25 index and related configuration added to each item.
Examples:
from lazyllm.tools.data import embedding
# First build corpus, then initialize BM25
corpus_op = embedding.build_embedding_corpus(input_pos_key='pos')
bm25_op = embedding.EmbeddingInitBM25(language='zh')
# Returns data with '_bm25' index and tokenizer configuration
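To make the role of the BM25 index concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the operator's implementation: `bm25_scores` is a hypothetical function, whitespace splitting stands in for jieba or stemming, and the `k1` and `b` values are common defaults chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Hypothetical Okapi BM25 scorer over a pre-tokenized corpus."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = ['machine learning tutorial', 'history of rome', 'deep learning basics']
tokens = [doc.split() for doc in corpus]  # whitespace tokenization as a stand-in
scores = bm25_scores('learning tutorial'.split(), tokens)
```

Documents that score highly for a query but are not positives are the usual candidates for keyword-based hard negatives.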
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py
EmbeddingInitSemantic
Bases: embedding
An operator that initializes semantic embeddings.
This operator uses Embedding service to compute vector representations for all documents in the corpus and saves them to files. Used for subsequent semantic similarity calculation and hard negative mining.
Parameters:
-
embedding_serving(Callable, default:None) –Embedding service callable for computing text vectors.
-
embeddings_dir(str, default:None) –Directory to save embedding files, defaults to corpus directory.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
–
List[dict]: Input data with semantic embedding file paths and corpus information added.
Examples:
from lazyllm.tools.data import embedding
# Assuming my_embedding_fn is an embedding service
semantic_op = embedding.EmbeddingInitSemantic(embedding_serving=my_embedding_fn)
# Returns data with '_semantic_embeddings_path' pointing to saved embeddings
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py
EmbeddingMineSemanticNegatives
Bases: embedding
An operator that mines hard negative samples using semantic similarity.
This operator uses semantic vector similarity to find the documents most similar to the query that are not among the positive samples. It is suited to mining hard negatives that are semantically similar but actually irrelevant, and usually performs better than the BM25 method.
Parameters:
-
num_negatives(int, default:7) –Number of negative samples to mine, defaults to 7.
-
embedding_serving(Callable, default:None) –Embedding service callable for computing query vectors.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Input data with negative samples mined based on semantic similarity added.
Examples:
from lazyllm.tools.data import embedding
# Assuming embeddings are initialized
semantic_miner = embedding.EmbeddingMineSemanticNegatives(num_negatives=5, embedding_serving=my_embedding_fn)
data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
result = semantic_miner(data)
# Returns data with 'neg' field containing semantically similar negative samples
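The selection logic described above (rank by similarity, skip positives, keep the top results) can be sketched as follows. This is a minimal illustration under stated assumptions: `cosine` and `mine_semantic_negatives` are hypothetical helpers, and the two-dimensional vectors are toy values standing in for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_semantic_negatives(query_vec, corpus, corpus_vecs, positives, num_negatives):
    """Hypothetical helper: most similar non-positive documents first."""
    candidates = [
        (cosine(query_vec, vec), doc)
        for doc, vec in zip(corpus, corpus_vecs)
        if doc not in positives
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in candidates[:num_negatives]]

corpus = ['ML tutorial', 'statistics primer', 'cooking recipes', 'data mining guide']
corpus_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.9, 0.1]]  # toy embeddings
negs = mine_semantic_negatives(
    [1.0, 0.0], corpus, corpus_vecs, positives={'ML tutorial'}, num_negatives=2
)
```

The positive document is excluded even though it is the closest match, which is what makes the remaining top-ranked documents "hard" negatives.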
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py
EmbeddingParseQueries
Bases: embedding
An operator that parses generated queries.
This operator parses the query response generated by LLM and expands each query into an independent data record.
Parameters:
-
input_key(str, default:'passage') –Input field name, defaults to 'passage'.
-
output_query_key(str, default:'query') –Output query field name, defaults to 'query'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
–
List[dict]: List of parsed queries, each query as an independent data record.
Examples:
from lazyllm.tools.data import embedding
parser = embedding.EmbeddingParseQueries(input_key='passage', output_query_key='query')
data = {'_query_response': '[{"query": "what is ML?", "type": "factual"}]', 'passage': 'Machine learning is...'}
result = parser(data)
# Returns list of expanded query records with 'query' and 'pos' fields
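As a rough sketch of the expansion step, the snippet below parses a JSON query response and emits one record per query. `parse_queries` is a hypothetical function; storing the passage as a one-element `pos` list and returning an empty list on malformed JSON are assumptions, not confirmed behavior of the operator.

```python
import json

def parse_queries(item, input_key='passage', output_query_key='query'):
    """Hypothetical helper: expand a JSON query response into separate records."""
    try:
        queries = json.loads(item['_query_response'])
    except (KeyError, TypeError, json.JSONDecodeError):
        return []  # assumed behavior for a missing or malformed response
    if not isinstance(queries, list):
        return []
    records = []
    for entry in queries:
        text = entry.get('query') if isinstance(entry, dict) else entry
        if text:
            records.append({output_query_key: text, 'pos': [item[input_key]]})
    return records

item = {
    '_query_response': '[{"query": "what is ML?", "type": "factual"}]',
    'passage': 'Machine learning is...',
}
records = parse_queries(item)
```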
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_query_generator.py
EmbeddingTrainTestSplitter
Bases: embedding
An operator that splits dataset into training and test sets.
This operator randomly shuffles the input data and splits it into training and test sets according to the specified ratio. Supports saving split data to JSONL files and stratified sampling by a specified key.
Parameters:
-
test_size(float, default:0.1) –Proportion of test set, defaults to 0.1 (i.e., 10%).
-
seed(int, default:42) –Random seed for reproducible splitting, defaults to 42.
-
stratify_key(str, default:None) –Key name for stratified sampling, defaults to None.
-
train_output_file(str, default:None) –Output file path for training set, defaults to None.
-
test_output_file(str, default:None) –Output file path for test set, defaults to None.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
–
List[dict]: All samples from both training and test sets, with a 'split' field added
to indicate which set each sample belongs to.
Examples:
from lazyllm.tools.data import embedding
op = embedding.EmbeddingTrainTestSplitter(test_size=0.2, seed=123, train_output_file='train.jsonl', test_output_file='test.jsonl')
data = [{'query': 'q1', 'pos': 'p1'}, {'query': 'q2', 'pos': 'p2'}, {'query': 'q3', 'pos': 'p3'}]
result = op(data)
# Returns all samples with 'split' field ('train' or 'test')
# Saves train data to train.jsonl and test data to test.jsonl
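The seeded shuffle-and-split behavior can be illustrated with the standard library alone. This is a sketch, not the operator's code: `split_with_labels` is a hypothetical name, the rounding rule for the test-set size is an assumption, and stratification and file output are omitted.

```python
import random

def split_with_labels(data, test_size=0.1, seed=42):
    """Hypothetical helper: shuffle with a fixed seed and tag each item with its split."""
    rng = random.Random(seed)  # local generator so global random state is untouched
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))  # rounding rule is an assumption
    return [
        dict(item, split='test' if index < n_test else 'train')
        for index, item in enumerate(shuffled)
    ]

data = [{'query': f'q{i}'} for i in range(10)]
result = split_with_labels(data, test_size=0.2, seed=123)
```

Using the same seed on the same input reproduces the same assignment, which is the point of exposing `seed` as a parameter.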
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py
Knowledge clean
lazyllm.tools.data.operators.knowledge_cleaning.file_or_url_to_markdown_converter_api
FileOrURLNormalizer
Bases: kbc
File or URL normalizer operator.
This operator automatically identifies the file format based on the input type (file or URL) and performs normalization. Supports PDF, HTML/XML, TXT/MD files, and web URLs. PDFs referenced by URL are downloaded locally first.
Parameters:
-
intermediate_dir(str, default:'intermediate') –Directory for intermediate files, defaults to 'intermediate'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Normalized data containing the following fields:
-
_type–File type ('pdf', 'html', 'text', 'invalid', 'unsupported')
-
_raw_path–Local file path (if available)
-
_url–URL address (if web page)
-
_output_path–Expected Markdown output path
-
_error–Error message (if any)
Examples:
from lazyllm.tools.data import kbc
normalizer = kbc.FileOrURLNormalizer(intermediate_dir='./temp')
# For file input
data = {'source': '/path/to/document.pdf'}
result = normalizer(data)
# Returns: {'source': '/path/to/document.pdf', '_type': 'pdf', '_raw_path': '/path/to/document.pdf', '_output_path': './temp/document.md'}
# For URL input
data = {'source': 'https://example.com/page.html'}
result = normalizer(data)
# Returns: {'source': 'https://example.com/page.html', '_type': 'html', '_url': 'https://example.com/page.html', '_output_path': './temp/url_xxx.md'}
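The type detection described above can be approximated with a small extension-based classifier. This is a simplified sketch: `classify_source` is a hypothetical function, the extension-to-type mapping is inferred from the formats listed above, and the 'invalid' case, downloading, and output path generation are left out.

```python
import os
from urllib.parse import urlparse

def classify_source(source):
    """Hypothetical helper: map a file path or URL to a coarse type label."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ('http', 'https')
    path = parsed.path if is_url else source
    ext = os.path.splitext(path)[1].lower()
    if ext == '.pdf':
        return 'pdf'  # local PDF or a PDF served over HTTP
    if is_url or ext in ('.html', '.htm', '.xml'):
        return 'html'
    if ext in ('.txt', '.md'):
        return 'text'
    return 'unsupported'

kinds = [
    classify_source(s)
    for s in (
        '/path/to/document.pdf',
        'https://example.com/page.html',
        'https://example.com/paper.pdf',
        'notes.md',
        'archive.zip',
    )
]
```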
Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py
HTMLToMarkdownConverter
Bases: kbc
HTML to Markdown converter operator.
This operator uses the trafilatura library to extract content from HTML or XML files and convert it to Markdown format. It supports local HTML files and web URLs, and automatically handles page metadata.
Parameters:
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Converted data containing the following fields:
-
_markdown_path–Path to the generated Markdown file
Examples:
from lazyllm.tools.data import kbc
converter = kbc.HTMLToMarkdownConverter()
# After normalization
data = {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md'}
result = converter(data)
# Returns: {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py
PDFToMarkdownConverterAPI
Bases: kbc
PDF to Markdown converter API operator.
This operator uses the MinerU service to convert PDF files (including scanned documents and images) to Markdown format. Supports calling MinerU via API for PDF parsing, with configurable backend engine and upload mode.
Parameters:
-
mineru_url(str, default:None) –MinerU service URL address.
-
mineru_backend(str, default:'vlm-vllm-async-engine') –MinerU backend engine type, defaults to 'vlm-vllm-async-engine'.
-
upload_mode(bool, default:True) –Whether to use upload mode, defaults to True.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Converted data containing the following fields:
-
_markdown_path–Path to the generated Markdown file
Examples:
from lazyllm.tools.data import kbc
converter = kbc.PDFToMarkdownConverterAPI(
mineru_url='your_mineru_url',
mineru_backend='vlm-vllm-async-engine',
upload_mode=True
)
# After normalization
data = {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md'}
result = converter(data)
# Returns: {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py
lazyllm.tools.data.operators.knowledge_cleaning.kbc_chunk_generator_batch
KBCChunkText
Bases: kbc
Text chunking operator.
This operator splits long text into chunks, supporting multiple chunking strategies:
- token: Token-based chunking
- sentence: Sentence boundary-based chunking
- semantic: Semantic similarity-based chunking
- recursive: Recursive chunking
Parameters:
-
chunk_size(int, default:512) –Maximum size of each chunk, defaults to 512.
-
chunk_overlap(int, default:50) –Overlap size between chunks, defaults to 50.
-
split_method(str, default:'token') –Chunking method, options: 'token', 'sentence', 'semantic', 'recursive', defaults to 'token'.
-
tokenizer_name(str, default:'bert-base-uncased') –Name of the tokenizer to use, defaults to 'bert-base-uncased'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing chunking results:
-
_chunks–List of chunked texts
-
_chunk_error–Chunking error message (if any)
Examples:
from lazyllm.tools.data import kbc
chunker = kbc.KBCChunkText(chunk_size=512, chunk_overlap=50, split_method='token')
data = {'_text_content': 'Long text content that needs to be chunked...'}
result = chunker(data)
# Returns: {'_text_content': 'Long text content...', '_chunks': ['chunk1', 'chunk2', ...]}
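To show how `chunk_size` and `chunk_overlap` interact in token-based chunking, here is a minimal sliding-window sketch. It is an illustration only: `chunk_tokens` is a hypothetical function, and whitespace-separated words stand in for the tokens a real tokenizer such as 'bert-base-uncased' would produce.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=50):
    """Hypothetical helper: fixed-size windows that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

tokens = 'a b c d e f g h i j'.split()  # stand-in for real tokenizer output
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so consecutive chunks share exactly `chunk_overlap` tokens.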
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py
KBCLoadText
Bases: kbc
Operator for loading text file content.
This operator loads text file content from the specified path, supporting multiple file formats:
- .txt, .md, .xml: Direct text content reading
- .json, .jsonl: Extract and merge content from specified text fields
Parameters:
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing loading results:
-
_text_content–Loaded text content
-
_load_error–Loading error message (if any)
Examples:
from lazyllm.tools.data import kbc
loader = kbc.KBCLoadText()
# Load text file
data = {'text_path': '/path/to/document.txt'}
result = loader(data)
# Returns: {'text_path': '/path/to/document.txt', '_text_content': 'file content...'}
# Load JSON file
data = {'text_path': '/path/to/data.json'}
result = loader(data)
# Returns: {'text_path': '/path/to/data.json', '_text_content': 'extracted text...'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py
KBCSaveChunks
Bases: kbc
Operator for saving text chunking results.
This operator saves chunked texts as JSON files, with each chunk as a JSON object. Supports specifying output directory, preserving the relative path structure of the original file.
Parameters:
-
output_dir(str, default:None) –Output directory path, defaults to None (save to 'extract' subdirectory of the original file's directory).
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing save results:
-
chunk_path–Path to the saved JSON file
Examples:
from lazyllm.tools.data import kbc
saver = kbc.KBCSaveChunks(output_dir='./output')
data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1', 'chunk2']}
result = saver(data)
# Returns: {'text_path': '/path/to/doc.txt', 'chunk_path': './output/path/to/doc_chunk.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py
lazyllm.tools.data.operators.knowledge_cleaning.kbc_chunk_generator
KBCExpandChunks
Bases: kbc
Operator that expands chunked text into independent records.
This operator expands data records containing multiple text chunks into multiple independent data records, with each record containing one chunk. Suitable for scenarios where chunked texts need to be processed as independent samples.
Parameters:
-
output_key(str, default:'raw_chunk') –Output key name for storing chunk text, defaults to 'raw_chunk'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
–
List[dict]: List of expanded independent data records, each containing one chunk.
Examples:
from lazyllm.tools.data import kbc
expander = kbc.KBCExpandChunks(output_key='raw_chunk')
data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1 content', 'chunk2 content', 'chunk3 content']}
result = expander(data)
# Returns: [
# {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk1 content'},
# {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk2 content'},
# {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk3 content'}
# ]
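The expansion shown above amounts to copying the record once per chunk. A minimal sketch, assuming the `_chunks` field is dropped from the output as in the example; `expand_chunks` is a hypothetical stand-in for the operator.

```python
def expand_chunks(item, output_key='raw_chunk'):
    """Hypothetical helper: one output record per chunk, without the '_chunks' list."""
    base = {key: value for key, value in item.items() if key != '_chunks'}
    return [dict(base, **{output_key: chunk}) for chunk in item.get('_chunks', [])]

item = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1 content', 'chunk2 content']}
expanded = expand_chunks(item)
```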
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator.py
lazyllm.tools.data.operators.knowledge_cleaning.kbc_multihop_qa_generator_batch
KBCExtractInfoPairs
Bases: kbc
Information pair extraction operator.
This operator extracts information pairs from preprocessed text for multi-hop QA generation. Uses different sentence delimiters based on language type (Chinese or English), extracting premise-intermediate-conclusion triples and related contexts.
Parameters:
-
lang(str, default:'en') –Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing information pairs:
-
_info_pairs–List of information pairs, each containing premise, intermediate, conclusion, and related_contexts
Examples:
from lazyllm.tools.data import kbc
extractor = kbc.KBCExtractInfoPairs(lang='en')
data = {'_processed_chunks': [{'text': 'First sentence. Second sentence. Third sentence.', 'original_data': {}}]}
result = extractor(data)
# Returns: {'_processed_chunks': [...], '_info_pairs': [{'premise': 'First sentence', 'intermediate': 'Second sentence', 'conclusion': 'Third sentence', 'related_contexts': [], 'original_data': {}}]}
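One plausible way to derive premise-intermediate-conclusion triples is a sliding window over consecutive sentences. The sketch below is an assumption about the approach, not the operator's confirmed logic: `extract_triples` is a hypothetical function, the delimiter sets are illustrative, and `related_contexts` is omitted.

```python
import re

def extract_triples(text, lang='en'):
    """Hypothetical helper: sliding window of three consecutive sentences."""
    # Assumed delimiter sets; the real operator may use different ones.
    delimiter = r'[。！？]' if lang == 'zh' else r'[.!?]'
    sentences = [part.strip() for part in re.split(delimiter, text) if part.strip()]
    return [
        {
            'premise': sentences[i],
            'intermediate': sentences[i + 1],
            'conclusion': sentences[i + 2],
        }
        for i in range(len(sentences) - 2)
    ]

pairs = extract_triples('First sentence. Second sentence. Third sentence.')
```

Texts with fewer than three sentences produce no triples under this scheme.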
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
KBCGenerateMultiHopQA
Bases: kbc
Multi-hop QA generation operator.
This operator uses LLM to generate multi-hop QA pairs based on extracted information pairs. Multi-hop QA requires multiple reasoning steps to answer, suitable for training complex QA models.
Parameters:
-
llm–LLM service instance for generating QA pairs.
-
lang(str, default:'en') –Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing generated QA results:
-
_qa_results–List of QA results, each containing response and info_pair
Examples:
from lazyllm.tools.data import kbc
# Assuming llm is an LLM service instance
generator = kbc.KBCGenerateMultiHopQA(llm=llm, lang='en')
data = {'_info_pairs': [{'premise': 'A', 'intermediate': 'B', 'conclusion': 'C', 'original_data': {}}]}
result = generator(data)
# Returns: {'_info_pairs': [...], '_qa_results': [{'response': {...}, 'info_pair': {...}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
KBCLoadChunkFile
Bases: kbc
Chunk file loading operator.
This operator loads JSON or JSONL format chunk files from the specified path. Supports chunk result files generated from the knowledge base cleaning process.
Parameters:
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing chunk data:
-
_chunks_data–List of chunk data
-
_chunk_path–Chunk file path
Examples:
from lazyllm.tools.data import kbc
loader = kbc.KBCLoadChunkFile()
data = {'chunk_path': '/path/to/chunks.json'}
result = loader(data)
# Returns: {'chunk_path': '/path/to/chunks.json', '_chunks_data': [...], '_chunk_path': '/path/to/chunks.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
KBCPreprocessText
Bases: kbc
Text preprocessing operator.
This operator preprocesses loaded chunk texts, filtering chunks based on length. Only retains chunks within the specified length range, avoiding processing text that is too short or too long.
Parameters:
-
min_length(int, default:100) –Minimum text length, defaults to 100.
-
max_length(int, default:200000) –Maximum text length, defaults to 200000.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing preprocessing results:
-
_processed_chunks–List of preprocessed chunks
Examples:
from lazyllm.tools.data import kbc
processor = kbc.KBCPreprocessText(min_length=50, max_length=10000)
data = {'_chunks_data': [{'cleaned_chunk': 'Short text.'}, {'cleaned_chunk': 'A much longer text that meets the length requirements and will be processed.'}]}
result = processor(data, text_field='cleaned_chunk')
# Returns: {'_chunks_data': [...], '_processed_chunks': [{'text': 'A much longer text...', 'original_data': {...}}]}
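The length filter can be sketched in a few lines. This is a hedged illustration: `filter_by_length` is a hypothetical function, and measuring length in characters with inclusive bounds is an assumption about how `min_length` and `max_length` are applied.

```python
def filter_by_length(chunks_data, text_field='cleaned_chunk', min_length=100, max_length=200000):
    """Hypothetical helper: keep chunks whose text length falls inside the range."""
    processed = []
    for chunk in chunks_data:
        text = chunk.get(text_field, '')
        if min_length <= len(text) <= max_length:  # inclusive bounds are assumed
            processed.append({'text': text, 'original_data': chunk})
    return processed

chunks_data = [
    {'cleaned_chunk': 'Short text.'},
    {'cleaned_chunk': 'A much longer text that meets the length requirements and will be processed.'},
]
processed = filter_by_length(chunks_data, min_length=50, max_length=10000)
```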
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
KBCSaveEnhanced
Bases: kbc
Enhanced data saving operator.
This operator merges generated QA pairs with original chunk data and saves them as enhanced chunk files. Supports specifying output directory, preserving the relative path structure of the original file.
Parameters:
-
output_dir(str, default:None) –Output directory path, defaults to None (save to the original file's directory).
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing save results:
-
enhanced_chunk_path–Path to the enhanced chunk file
Examples:
from lazyllm.tools.data import kbc
saver = kbc.KBCSaveEnhanced(output_dir='./enhanced_output')
data = {'_chunk_path': '/path/to/chunks.json', '_chunks_data': [{'id': 1, 'text': 'chunk1'}], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'Q1', 'answer': 'A1'}}]}
result = saver(data, output_key='enhanced_chunk_path')
# Returns: {'enhanced_chunk_path': './enhanced_output/path/to/chunks_enhanced.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
parse_qa_pairs(data)
QA pair parsing function.
This function parses LLM-generated QA responses, extracting valid QA pairs. Supports multiple response formats (dict, list, string) and merges parsing results with original data.
Parameters:
-
data(dict) –Data containing QA results.
Returns:
-
dict(dict) –Data containing parsed QA pairs:
-
_qa_pairs(List[dict]) –List of parsed QA pairs
Examples:
from lazyllm.tools.data.operators.knowledge_cleaning.kbc_multihop_qa_generator_batch import parse_qa_pairs
data = {'_qa_results': [{'response': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}, 'info_pair': {'original_data': {'id': 1}}}]}
result = parse_qa_pairs(data)
# Returns: {'_qa_results': [...], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner_batch
KBCGenerateCleanedText
Bases: kbc
Cleaned text generation operator.
This operator uses LLM to clean raw chunk text, removing noise and formatting content. Supports multiple languages, falls back to original text when LLM call fails.
Parameters:
-
llm–LLM service instance for cleaning text.
-
lang(str, default:'en') –Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing cleaning results:
-
_cleaned_results–List of cleaning results, each containing response, raw_chunk, and original_item
Examples:
from lazyllm.tools.data import kbc
# Assuming llm is an LLM service instance
cleaner = kbc.KBCGenerateCleanedText(llm=llm, lang='en')
data = {'_chunks_data': [{'raw_chunk': 'Noisy text with errors...'}]}
result = cleaner(data)
# Returns: {'_chunks_data': [...], '_cleaned_results': [{'response': 'Cleaned text', 'raw_chunk': '...', 'original_item': {...}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py
KBCLoadRAWChunkFile
Bases: kbc
Raw chunk file loading operator.
This operator loads JSON or JSONL files containing raw chunks (raw_chunk) from the specified path. Used in the knowledge base cleaning process to load raw chunk data that needs cleaning.
Parameters:
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing raw chunk data:
-
_chunks_data–List of raw chunk data
-
_chunk_path–Chunk file path
Examples:
from lazyllm.tools.data import kbc
loader = kbc.KBCLoadRAWChunkFile()
data = {'chunk_path': '/path/to/raw_chunks.json'}
result = loader(data)
# Returns: {'chunk_path': '/path/to/raw_chunks.json', '_chunks_data': [{'raw_chunk': '...'}], '_chunk_path': '/path/to/raw_chunks.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py
KBCSaveCleaned
Bases: kbc
Cleaned data saving operator.
This operator saves cleaned chunk data as JSON files, preserving the correspondence between raw and cleaned chunks. Supports specifying output directory, preserving the relative path structure of the original file.
Parameters:
-
output_dir(str, default:None) –Output directory path, defaults to None (save to the original file's directory).
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing save results:
-
cleaned_chunk_path–Path to the cleaned chunk file
Examples:
from lazyllm.tools.data import kbc
saver = kbc.KBCSaveCleaned(output_dir='./cleaned_output')
data = {'_chunk_path': '/path/to/raw_chunks.json', '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'cleaned'}]}
result = saver(data, output_key='cleaned_chunk_path')
# Returns: {'cleaned_chunk_path': './cleaned_output/path/to/raw_chunks_cleaned.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py
extract_cleaned_content(data)
Extract cleaned content function.
This function extracts cleaned text content from LLM cleaning results, handling different response formats.
Supports extracting content between <cleaned_start> and <cleaned_end> tags.
Parameters:
-
data(dict) –Data containing cleaning results.
Returns:
-
dict(dict) –Data containing extracted cleaned content:
-
_cleaned_chunks(List[dict]) –List of cleaned chunks, each containing raw_chunk, cleaned_chunk, and original_item
Examples:
from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner_batch import extract_cleaned_content
data = {'_cleaned_results': [{'response': '<cleaned_start>Clean text<cleaned_end>', 'raw_chunk': 'raw', 'original_item': {}}]}
result = extract_cleaned_content(data)
# Returns: {'_cleaned_results': [...], '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'Clean text', 'original_item': {}}]}
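The tag-based extraction shown in the example can be sketched with a regular expression. `extract_between_tags` is a hypothetical helper, and falling back to the raw chunk when no tags are found is an assumption about how untagged responses are handled.

```python
import re

# Non-greedy match so only the first tagged span is captured; DOTALL allows newlines.
_TAGGED = re.compile(r'<cleaned_start>(.*?)<cleaned_end>', re.DOTALL)

def extract_between_tags(response, fallback):
    """Hypothetical helper: return the tagged text, or the fallback if tags are absent."""
    match = _TAGGED.search(response) if isinstance(response, str) else None
    return match.group(1).strip() if match else fallback

cleaned = extract_between_tags('<cleaned_start>Clean text<cleaned_end>', fallback='raw')
untagged = extract_between_tags('no tags here', fallback='raw')
```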
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py
lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner
KBCGenerateCleanedTextSingle
Bases: kbc
Single text cleaning generation operator.
This operator uses LLM to clean single raw text, removing noise and formatting content. Suitable for real-time cleaning of individual data items, falls back to original text when LLM call fails.
Parameters:
-
llm–LLM service instance for cleaning text.
-
lang(str, default:'en') –Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing cleaning response:
-
_cleaned_response–LLM's cleaning response
Examples:
from lazyllm.tools.data import kbc
# Assuming llm is an LLM service instance
cleaner = kbc.KBCGenerateCleanedTextSingle(llm=llm, lang='en')
data = {'raw_chunk': 'Noisy text with errors...'}
result = cleaner(data, input_key='raw_chunk')
# Returns: {'raw_chunk': '...', '_cleaned_response': 'Cleaned text result'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner.py
extract_cleaned_content_single(data, output_key='cleaned_chunk')
Single cleaned content extraction function.
This function extracts cleaned text content from single LLM cleaning response, handling different response formats.
Supports extracting content between <cleaned_start> and <cleaned_end> tags.
Parameters:
-
data(dict) –Data containing cleaning response.
-
output_key(str, default:'cleaned_chunk') –Output key name, defaults to 'cleaned_chunk'.
Returns:
-
dict(dict) –Data containing extracted cleaned content with field specified by output_key added.
Examples:
from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner import extract_cleaned_content_single
data = {'_cleaned_response': '<cleaned_start>Clean text<cleaned_end>'}
result = extract_cleaned_content_single(data, output_key='cleaned_chunk')
# Returns: {'cleaned_chunk': 'Clean text'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner.py
lazyllm.tools.data.operators.knowledge_cleaning.qa_extract
KBCExtractQAPairs
Bases: kbc
QA pairs extraction operator.
This operator extracts QA pairs from loaded QA data and converts them to standard format. Supports customizing output field names for instruction, question, and answer.
Parameters:
-
qa_key(str, default:'QA_pairs') –QA data field name, defaults to 'QA_pairs'.
-
instruction(str, default:'Please answer the following question based on the provided information.') –Instruction text, defaults to 'Please answer the following question based on the provided information.'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
–
List[dict]: List of extracted QA pairs, each containing instruction, input, and output fields.
Examples:
from lazyllm.tools.data import kbc
extractor = kbc.KBCExtractQAPairs(
qa_key='QA_pairs',
instruction='Please answer based on the context.'
)
data = {'_qa_data': {'qa_pairs': [{'question': 'What is AI?', 'answer': 'Artificial Intelligence'}]}}
result = extractor(
data,
output_instruction_key='instruction',
output_question_key='input',
output_answer_key='output'
)
# Returns: [{'instruction': 'Please answer based on the context.', 'input': 'What is AI?', 'output': 'Artificial Intelligence'}]
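The conversion into instruction/input/output records can be sketched as a simple mapping. `to_instruction_records` is a hypothetical function; skipping pairs with a missing question or answer is an assumption rather than documented behavior.

```python
def to_instruction_records(qa_pairs, instruction,
                           instruction_key='instruction',
                           question_key='input',
                           answer_key='output'):
    """Hypothetical helper: map question/answer pairs to instruction-style records."""
    return [
        {instruction_key: instruction, question_key: qa['question'], answer_key: qa['answer']}
        for qa in qa_pairs
        if qa.get('question') and qa.get('answer')  # assumed: incomplete pairs are skipped
    ]

qa_pairs = [
    {'question': 'What is AI?', 'answer': 'Artificial Intelligence'},
    {'question': 'Incomplete pair'},
]
sft_records = to_instruction_records(qa_pairs, 'Please answer based on the context.')
```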
Source code in lazyllm/tools/data/operators/knowledge_cleaning/qa_extract.py
KBCLoadQAData
Bases: kbc
QA data loading operator.
This operator loads QA data from the input data or from chunk files. It first checks whether QA data already exists in the input data; if not, it tries to load from enhanced chunk files, cleaned chunk files, or regular chunk files.
Parameters:
-
qa_key(str, default:'QA_pairs') –QA data field name, defaults to 'QA_pairs'.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Data containing QA data:
-
_qa_data–Loaded QA data
-
_source_file–Data source file path (if loaded from file)
Examples:
from lazyllm.tools.data import kbc
loader = kbc.KBCLoadQAData(qa_key='QA_pairs')
# From existing data
data = {'QA_pairs': [{'question': 'Q1', 'answer': 'A1'}]}
result = loader(data)
# Returns: {'QA_pairs': [...], '_qa_data': [...]}
# From file
data = {'enhanced_chunk_path': '/path/to/enhanced.json'}
result = loader(data)
# Returns: {'enhanced_chunk_path': '...', '_qa_data': [...], '_source_file': '/path/to/enhanced.json'}
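The lookup order can be sketched without LazyLLM as follows. Only `enhanced_chunk_path` appears in the example above; the other two key names, the JSON file format, and the helper name `load_qa_data` are assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical fallback keys; only 'enhanced_chunk_path' is taken from the docs.
PATH_KEYS = ('enhanced_chunk_path', 'cleaned_chunk_path', 'chunk_path')


def load_qa_data(data, qa_key='QA_pairs'):
    """Prefer QA data already present in the input, else read the first existing file."""
    result = dict(data)
    if data.get(qa_key):
        result['_qa_data'] = data[qa_key]
        return result
    for key in PATH_KEYS:
        path = data.get(key)
        if path and os.path.exists(path):
            with open(path, 'r', encoding='utf-8') as f:
                result['_qa_data'] = json.load(f)
            result['_source_file'] = path
            return result
    return result  # nothing found: the input is returned unchanged


# Case 1: QA data is already in the input.
in_memory = load_qa_data({'QA_pairs': [{'question': 'Q1', 'answer': 'A1'}]})

# Case 2: QA data is read from a file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'enhanced.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump([{'question': 'Q2', 'answer': 'A2'}], f)
    from_file = load_qa_data({'enhanced_chunk_path': path})
    source_matches = from_file['_source_file'] == path
```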
Source code in lazyllm/tools/data/operators/knowledge_cleaning/qa_extract.py
Reranker synthesis
lazyllm.tools.data.operators.reranker_synthesis
RerankerAdjustNegatives
Bases: reranker
Reranker negative sample adjustment operator.
This operator adjusts the number of negative samples to match the target count: it truncates the list if there are too many and pads it by random sampling if there are too few. It uses a deterministic random seed based on the query content for reproducibility.
Parameters:
-
adjust_neg_count(int, default:7) –Target negative sample count, defaults to 7.
-
seed(int, default:42) –Random seed for random selection during padding, defaults to 42.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Adjusted data with updated _neg field.
Examples:
from lazyllm.tools.data import reranker
adjuster = reranker.RerankerAdjustNegatives(adjust_neg_count=5, seed=123)
# Too many negatives
data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8']}
result = adjuster(data)
# Returns: {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5']}
# Too few negatives
data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2']}
result = adjuster(data)
# Returns: {'_is_valid': True, '_query': 'ML', '_neg': [...]} with '_neg' padded to 5 items by sampling from the existing negatives, e.g. ['n1', 'n2', 'n1', 'n2', 'n1']
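A minimal sketch of the truncate-or-pad logic, independent of LazyLLM. How the operator actually combines `seed` with the query content is not documented here; mixing the seed with an MD5 digest of the query is an assumption, as is the helper name `adjust_negatives`.

```python
import hashlib
import random


def adjust_negatives(query, negatives, target=7, seed=42):
    """Truncate to `target` items, or pad by sampling from the existing negatives."""
    if len(negatives) >= target:
        return list(negatives[:target])
    if not negatives:
        return []
    # Assumption: derive a per-query seed so the same query always pads the same way.
    digest = hashlib.md5(query.encode('utf-8')).hexdigest()
    rng = random.Random(seed + int(digest[:8], 16))
    padding = [rng.choice(negatives) for _ in range(target - len(negatives))]
    return list(negatives) + padding


many = ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8']
truncated = adjust_negatives('ML', many, target=5, seed=123)
padded = adjust_negatives('ML', ['n1', 'n2'], target=5, seed=123)
padded_again = adjust_negatives('ML', ['n1', 'n2'], target=5, seed=123)
print(truncated, padded)
```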
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_from_embedding_converter.py
RerankerBuildFormat
Bases: reranker
Reranker format builder operator.
This operator converts validated data to standard reranker training format. Outputs a dictionary containing query, pos, and neg fields without prompts or instructions.
Parameters:
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
dict–Reranker format data containing query, pos, and neg fields. Returns empty dict if data is invalid.
Examples:
from lazyllm.tools.data import reranker
builder = reranker.RerankerBuildFormat()
data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = builder(data)
# Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking']}
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_from_embedding_converter.py
RerankerFormatCrossEncoder
Bases: reranker
CrossEncoder format conversion operator.
This operator converts validated data to CrossEncoder training format. Each query-document pair is an independent sample, with positive samples labeled 1 and negative samples labeled 0.
Parameters:
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
List[dict]–List of converted data, each containing query, document, and label fields.
Examples:
from lazyllm.tools.data import reranker
formatter = reranker.RerankerFormatCrossEncoder()
data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'document': 'ML tutorial', 'label': 1}, {'query': 'machine learning', 'document': 'cooking', 'label': 0}]
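The expansion into independent labeled pairs can be sketched in plain Python. The helper name `to_cross_encoder_samples` is hypothetical and the snippet does not call LazyLLM.

```python
def to_cross_encoder_samples(query, positives, negatives):
    """Emit one (query, document, label) sample per document."""
    samples = [{'query': query, 'document': doc, 'label': 1} for doc in positives]
    samples += [{'query': query, 'document': doc, 'label': 0} for doc in negatives]
    return samples


samples = to_cross_encoder_samples('machine learning', ['ML tutorial'], ['cooking'])
print(samples)
```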
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py
RerankerFormatFlagReranker
Bases: reranker
FlagReranker format conversion operator.
This operator converts validated data to FlagReranker training format. Ensures the number of negative samples meets training group size requirements, padding with duplicates if insufficient or truncating if excessive.
Parameters:
-
train_group_size(int, default:8) –Training group size (including 1 positive sample), defaults to 8.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
List[dict]–List of converted data, each containing query, pos, and neg fields.
Examples:
from lazyllm.tools.data import reranker
formatter = reranker.RerankerFormatFlagReranker(train_group_size=8)
data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking', 'history']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking', 'history', ...]}]
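One way to fit the negatives to the group size is sketched below. It assumes the group holds exactly one positive, so `train_group_size - 1` negatives are needed, and that padding repeats the existing negatives in order; the operator's actual padding order is not documented here, and the helper name is hypothetical.

```python
from itertools import cycle, islice


def fit_negatives_to_group(negatives, train_group_size=8):
    """Return exactly train_group_size - 1 negatives by truncating or repeating."""
    needed = train_group_size - 1
    if not negatives or needed <= 0:
        return []
    # cycle() repeats the list endlessly; islice() cuts it at the needed length,
    # which covers both the "too few" and the "too many" case.
    return list(islice(cycle(negatives), needed))


padded_group = fit_negatives_to_group(['cooking', 'history'], train_group_size=8)
truncated_group = fit_negatives_to_group(['a', 'b', 'c', 'd', 'e'], train_group_size=4)
print(padded_group, truncated_group)
```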
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py
RerankerFormatPairwise
Bases: reranker
Pairwise format conversion operator.
This operator converts validated data to Pairwise training format. Creates pairwise combinations of positive and negative samples for training ranking models to distinguish relevant from irrelevant documents.
Parameters:
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
List[dict]–List of converted data, each containing query, doc_pos, and doc_neg fields.
Examples:
from lazyllm.tools.data import reranker
formatter = reranker.RerankerFormatPairwise()
data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'doc_pos': 'ML tutorial', 'doc_neg': 'cooking'}]
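The pairwise combination can be sketched as a Cartesian product of positives and negatives. The helper name `to_pairwise_samples` is hypothetical, and pairing every positive with every negative is an assumption about what "pairwise combinations" means here.

```python
from itertools import product


def to_pairwise_samples(query, positives, negatives):
    """Pair every positive document with every negative document."""
    return [
        {'query': query, 'doc_pos': pos, 'doc_neg': neg}
        for pos, neg in product(positives, negatives)
    ]


pairs = to_pairwise_samples('machine learning', ['ML tutorial', 'ML course'], ['cooking', 'history'])
print(len(pairs))
```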
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py
RerankerGenerateQueries
Bases: reranker
Generates multiple retrieval queries from a given passage.
This operator builds prompts using RerankerQueryGeneratorPrompt and calls the LLM to produce queries with different difficulty levels. The result is parsed by JsonFormatter and stored as a JSON string in the '_query_response' field.
If the passage is empty or generation fails, an empty response is returned.
Parameters:
-
llm_serving–language model serving instance
-
lang(str, default:'zh') –language of generated queries, default 'zh'
-
num_queries(int, default:3) –number of queries to generate, default 3
-
difficulty_levels(List[str], default:None) –list of difficulty levels; when None, ['easy', 'medium', 'hard'] is used
-
**kwargs(dict, default:{}) –Additional optional parameters passed to parent class.
Examples:
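from lazyllm.tools.data.operators.reranker_synthesis.reranker_query_generator import RerankerGenerateQueries
# my_llm is assumed to be an existing language model serving instance
op = RerankerGenerateQueries(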
llm_serving=my_llm,
lang='en',
num_queries=5,
difficulty_levels=['easy', 'hard']
)
result = op({'passage': 'Large language models are widely used in NLP.'})
print(result['_query_response'])
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_query_generator.py
RerankerInitBM25
Bases: reranker
Initialize BM25 index operator.
This operator builds a BM25 index over the corpus for keyword-based negative sample mining. It supports Chinese and English tokenization: Chinese uses jieba, and English uses Stemmer for stemming.
Parameters:
-
language(str, default:'zh') –Language type, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
-
**kwargs(dict, default:{}) –Additional optional parameters passed to parent class.
Returns:
-
List[dict]–The input data list, with the BM25 index and tokenizer configuration added to each item.
Examples:
from lazyllm.tools.data import reranker
init_bm25 = reranker.RerankerInitBM25(language='zh')
# Build the corpus first
data_with_corpus = reranker.build_reranker_corpus(inputs)
# Then initialize BM25
result = init_bm25(data_with_corpus)
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
RerankerInitSemantic
Bases: reranker
Initialize semantic embeddings operator.
This operator uses an embedding service to compute vector representations for all documents in the corpus and saves them to files. The embeddings are used for subsequent semantic similarity calculation and negative sample mining.
Parameters:
-
embedding_serving(Callable, default:None) –Embedding service callable function.
-
embeddings_dir(str, default:None) –Embedding file save directory, defaults to corpus directory.
-
**kwargs(dict, default:{}) –Additional optional parameters passed to parent class.
Returns:
-
List[dict]–The input data list, with the embedding file path and corpus information added to each item.
Examples:
from lazyllm.tools.data import reranker
# Assume embedding_fn is the embedding service
init_semantic = reranker.RerankerInitSemantic(embedding_serving=embedding_fn)
# Build the corpus first
data_with_corpus = reranker.build_reranker_corpus(inputs)
# Then compute the semantic vectors
result = init_semantic(data_with_corpus)
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
RerankerMineBM25Negatives
Bases: reranker
BM25 negative sample mining operator.
This operator uses the BM25 index to retrieve the documents most relevant to the query that are not among the positive samples. It is suited to mining hard negatives that overlap lexically with the query but differ in meaning.
Parameters:
-
num_negatives(int, default:7) –Number of negative samples to mine, defaults to 7.
-
**kwargs(dict, default:{}) –Additional optional parameters passed to parent class.
Returns:
-
dict–Input data with mined negative samples list added.
Examples:
from lazyllm.tools.data import reranker
miner = reranker.RerankerMineBM25Negatives(num_negatives=5)
data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_bm25': bm25_index, '_bm25_corpus': corpus}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': ['bm25_neg1', 'bm25_neg2', ...]}
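To make the ranking step concrete, here is a self-contained sketch that scores a tiny corpus with a hand-written BM25 and keeps the top non-positive documents. It is not the operator's implementation: whitespace tokenization, the `k1` and `b` values, and the function names are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mine_bm25_negatives(query, positives, corpus, num_negatives=7):
    """Return the highest-scoring documents that are not positives."""
    tokenized = [doc.lower().split() for doc in corpus]
    scores = bm25_scores(query.lower().split(), tokenized)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    excluded = set(positives)
    return [corpus[i] for i in ranked if corpus[i] not in excluded][:num_negatives]


corpus = [
    'machine learning tutorial',
    'machine learning for cooking recipes',
    'history of rome',
    'deep learning basics',
]
bm25_negs = mine_bm25_negatives('machine learning', ['machine learning tutorial'], corpus, num_negatives=2)
print(bm25_negs)
```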
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
RerankerMineMixedNegatives
Bases: reranker
Mixed strategy negative sample mining operator.
This operator combines BM25 and semantic similarity to mine negative samples. It draws from both methods according to the specified ratio to obtain more diverse hard negatives.
Parameters:
-
embedding_serving(Callable, default:None) –Embedding service callable function.
-
num_negatives(int, default:7) –Number of negative samples to mine, defaults to 7.
-
bm25_ratio(float, default:0.5) –BM25 method ratio, remaining portion uses semantic method, defaults to 0.5.
-
**kwargs(dict, default:{}) –Additional optional parameters passed to parent class.
Returns:
-
dict–Input data with mixed strategy mined negative samples list added.
Examples:
from lazyllm.tools.data import reranker
# Assume embedding_fn is the embedding service
miner = reranker.RerankerMineMixedNegatives(
embedding_serving=embedding_fn,
num_negatives=6,
bm25_ratio=0.5 # 3 BM25 negatives + 3 semantic negatives
)
data = {
'query': 'machine learning',
'pos': ['ML tutorial'],
'_bm25': bm25_index,
'_bm25_corpus': corpus,
'_semantic_embeddings_path': emb_path,
'_semantic_corpus': corpus
}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': [...]} with 3 BM25 negatives and 3 semantic negatives
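The ratio split can be sketched as follows, given two already-ranked candidate lists. Rounding the BM25 share down with `int()` and filling the remainder from the semantic list while skipping duplicates are assumptions; the helper name `mix_negatives` is hypothetical.

```python
def mix_negatives(bm25_ranked, semantic_ranked, num_negatives=7, bm25_ratio=0.5):
    """Take a share of BM25 candidates, then fill up with semantic candidates."""
    num_bm25 = int(num_negatives * bm25_ratio)
    chosen = list(bm25_ranked[:num_bm25])
    for doc in semantic_ranked:
        if len(chosen) >= num_negatives:
            break
        if doc not in chosen:  # avoid picking the same document twice
            chosen.append(doc)
    return chosen


mixed = mix_negatives(
    ['b1', 'b2', 'b3', 'b4'],
    ['s1', 'b1', 's2', 's3', 's4'],
    num_negatives=6,
    bm25_ratio=0.5,
)
print(mixed)
```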
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
RerankerMineRandomNegatives
Bases: reranker
Random negative sample mining operator.
This operator randomly selects documents from the corpus that are not among the positive samples and uses them as negative samples. It is suited to baseline comparisons or scenarios that require random negatives.
Parameters:
-
num_negatives(int, default:7) –Number of negative samples to mine, defaults to 7.
-
seed(int, default:42) –Random seed for reproducible selection, defaults to 42.
-
**kwargs(dict, default:{}) –Additional optional parameters passed to parent class.
Returns:
-
dict–Input data with mined negative samples list added.
Examples:
from lazyllm.tools.data import reranker
miner = reranker.RerankerMineRandomNegatives(num_negatives=5, seed=123)
data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_corpus': corpus_path}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], '_corpus': '...', 'neg': ['random_neg1', 'random_neg2', ...]}
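A seeded random selection can be sketched like this. The operator reads the corpus from a path stored in `_corpus`; the sketch takes an in-memory list instead, which is a simplification, and the helper name is hypothetical.

```python
import random


def mine_random_negatives(positives, corpus, num_negatives=7, seed=42):
    """Sample negatives from the corpus, never returning a positive document."""
    excluded = set(positives)
    candidates = [doc for doc in corpus if doc not in excluded]
    rng = random.Random(seed)  # a local generator keeps the global RNG untouched
    return rng.sample(candidates, min(num_negatives, len(candidates)))


corpus = ['ML tutorial', 'cooking', 'history', 'gardening', 'chess', 'sailing']
random_negs = mine_random_negatives(['ML tutorial'], corpus, num_negatives=3, seed=123)
random_negs_again = mine_random_negatives(['ML tutorial'], corpus, num_negatives=3, seed=123)
print(random_negs)
```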
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
RerankerMineSemanticNegatives
Bases: reranker
Semantic similarity negative sample mining operator.
This operator uses semantic vector similarity to find the documents most similar to the query that are not among the positive samples. It is suited to mining hard negatives that are semantically similar but actually irrelevant, and usually performs better than the BM25 method.
Parameters:
-
num_negatives(int, default:7) –Number of negative samples to mine, defaults to 7.
-
embedding_serving(Callable, default:None) –Embedding service callable function for computing query vectors.
-
**kwargs(dict, default:{}) –Additional optional parameters passed to parent class.
Returns:
-
dict–Input data with negative samples mined based on semantic similarity added.
Examples:
from lazyllm.tools.data import reranker
# Assume embedding_fn is the embedding service
miner = reranker.RerankerMineSemanticNegatives(num_negatives=5, embedding_serving=embedding_fn)
data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': ['semantic_neg1', 'semantic_neg2', ...]}
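The similarity ranking can be illustrated with toy two-dimensional vectors. Cosine similarity is an assumption about the metric, the vectors are made up for the example, and the function names are hypothetical; in practice the vectors would come from the embedding service.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_semantic_negatives(query_vec, positives, corpus, corpus_vecs, num_negatives=7):
    """Return the non-positive documents whose vectors are closest to the query."""
    excluded = set(positives)
    scored = [
        (cosine(query_vec, vec), doc)
        for doc, vec in zip(corpus, corpus_vecs)
        if doc not in excluded
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:num_negatives]]


corpus = ['ML tutorial', 'deep learning intro', 'cooking', 'statistics basics']
corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]  # toy embeddings
semantic_negs = mine_semantic_negatives([1.0, 0.0], ['ML tutorial'], corpus, corpus_vecs, num_negatives=2)
print(semantic_negs)
```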
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
RerankerParseQueries
Bases: reranker
Parses LLM-generated query results and expands them into multiple training samples.
It reads the '_query_response' JSON content and extracts the query list (supporting both list and {'queries': [...]} structures). Each query generates a new data record containing:
- query: query text
- difficulty: difficulty level (default 'medium')
- pos: positive sample list (original passage)
Intermediate fields like '_query_response' are removed.
Parameters:
-
input_key(str, default:'passage') –source passage field name, default 'passage'
-
output_query_key(str, default:'query') –output query field name, default 'query'
-
**kwargs(dict, default:{}) –Additional optional parameters passed to parent class.
Examples:
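from lazyllm.tools.data.operators.reranker_synthesis.reranker_query_generator import RerankerParseQueries
op = RerankerParseQueries(input_key='passage', output_query_key='query')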
data = {
'passage': 'Large language models are widely used in NLP.',
'_query_response': '[{"query": "What are LLMs used for?", "difficulty": "easy"}]'
}
rows = op(data)
for row in rows:
    print(row['query'], row['difficulty'], row['pos'])
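The parsing and expansion steps can be sketched in plain Python. Skipping malformed entries and returning an empty list on invalid JSON are assumptions, and `parse_queries` is a hypothetical helper rather than the operator itself.

```python
import json


def parse_queries(data, input_key='passage', output_query_key='query'):
    """Expand the JSON in '_query_response' into one record per query."""
    try:
        parsed = json.loads(data.get('_query_response') or '[]')
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, dict):  # accept the {'queries': [...]} structure
        parsed = parsed.get('queries', [])
    if not isinstance(parsed, list):
        return []
    # Drop the intermediate field from the output records.
    base = {k: v for k, v in data.items() if k != '_query_response'}
    rows = []
    for item in parsed:
        if not isinstance(item, dict) or not item.get('query'):
            continue
        row = dict(base)
        row[output_query_key] = item['query']
        row['difficulty'] = item.get('difficulty', 'medium')
        row['pos'] = [data.get(input_key, '')]
        rows.append(row)
    return rows


rows_from_list = parse_queries({
    'passage': 'P',
    '_query_response': '[{"query": "Q1", "difficulty": "easy"}]',
})
rows_from_dict = parse_queries({
    'passage': 'P',
    '_query_response': '{"queries": [{"query": "Q2"}]}',
})
print(rows_from_list, rows_from_dict)
```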
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_query_generator.py
RerankerTrainTestSplitter
Bases: reranker
Reranker train/test splitter operator.
This operator randomly splits the dataset into training and test sets, with a configurable split ratio and random seed. It can save the training and test sets to the specified files, converting the test set format for evaluation compatibility.
Parameters:
-
test_size(float, default:0.1) –Test set proportion, defaults to 0.1 (i.e., 10%).
-
seed(int, default:42) –Random seed for reproducible splitting, defaults to 42.
-
train_output_file(str, default:None) –Training set output file path, defaults to None.
-
test_output_file(str, default:None) –Test set output file path, defaults to None.
-
**kwargs(dict, default:{}) –Additional optional arguments passed to the parent class.
Returns:
-
List[dict]–List of split data; each sample contains a split field marking its set ('train' or 'test').
Examples:
from lazyllm.tools.data import reranker
splitter = reranker.RerankerTrainTestSplitter(
test_size=0.2,
seed=123,
train_output_file='train.jsonl',
test_output_file='test.jsonl'
)
data = [
{'query': 'q1', 'pos': ['p1'], 'neg': ['n1']},
{'query': 'q2', 'pos': ['p2'], 'neg': ['n2']}
]
result = splitter(data)
# Returns each sample with an added 'split' field, e.g. {'query': 'q1', 'pos': ['p1'], 'neg': ['n1'], 'split': 'train'}; which samples land in the test set depends on test_size and seed
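A seeded split that tags each sample can be sketched as below. Rounding the test count down with `int()` is an assumption, file output and the test-set format conversion are left out, and the helper name `train_test_split` is hypothetical.

```python
import random


def train_test_split(samples, test_size=0.1, seed=42):
    """Tag each sample with split='train' or split='test', keeping input order."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(samples) * test_size)  # assumption: round down
    test_indices = set(indices[:n_test])
    return [
        dict(sample, split='test' if i in test_indices else 'train')
        for i, sample in enumerate(samples)
    ]


samples = [{'query': f'q{i}', 'pos': [f'p{i}'], 'neg': [f'n{i}']} for i in range(10)]
split = train_test_split(samples, test_size=0.2, seed=123)
split_again = train_test_split(samples, test_size=0.2, seed=123)
print(sum(1 for s in split if s['split'] == 'test'))
```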
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py
LLM JSON Operators
lazyllm.tools.data.operators.llm_base_ops
LLMDataJson
Base class for LLM-based JSON data processing operators. Provides foundational logic for structured output, including automatic JsonFormatter configuration, retry mechanisms, and a pre/verify/post-processing lifecycle.
Constructor args:
- model: a LazyLLM model instance.
- prompt: optional, ChatPrompter or string to guide the LLM.
- max_retries: maximum number of retries, default 3.
- **kwargs: additional concurrency or persistence arguments for the base class.
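The retry and pre/verify/post lifecycle can be pictured with the following standalone sketch. It is not the class's implementation: the function name, the prompt text, the dict check used as verification, and treating `max_retries` as the total number of attempts are all assumptions, and a fake model stands in for a LazyLLM model instance.

```python
import json


def run_with_retries(model, item, max_retries=3):
    """Call the model until it returns a JSON object, up to max_retries attempts."""
    prompt = f"Extract JSON from: {item['text']}"  # pre-processing step
    for _ in range(max_retries):
        raw = model(prompt)
        try:
            parsed = json.loads(raw)
        except (TypeError, ValueError):
            continue  # unparseable output: try again
        if isinstance(parsed, dict):  # verification step
            return dict(item, structured_data=parsed)  # post-processing step
    return dict(item, structured_data=None)


# Fake model that fails once and then returns valid JSON.
responses = iter(['not json', '{"name": "Ada"}'])


def flaky_model(prompt):
    return next(responses)


result = run_with_retries(flaky_model, {'text': 'Ada'})
failed = run_with_retries(lambda prompt: 'still not json', {'text': 'Ada'}, max_retries=2)
print(result)
```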
Source code in lazyllm/tools/data/operators/llm_base_ops.py
lazyllm.tools.data.operators.llm_json_ops
FieldExtractor
Bases: LLMDataJson, LLMJsonBase
Field extractor. Uses LLM to extract specific information from input text based on a provided list of fields.
Parameters:
-
model–a LazyLLM model instance.
-
prompt–optional custom extraction prompt.
-
input_keys–list of input keys, defaults to ['persona', 'text', 'fields'].
-
output_key–key name to store results in the data dict, default 'structured_data'.
Examples:
from lazyllm import OnlineChatModule
from lazyllm.tools.data.operators.llm_json_ops import FieldExtractor
model = OnlineChatModule(source='sensenova')
op = FieldExtractor(model=model)
inputs = [{
'text': '张三,28岁,目前在上海',
'fields': ['name', 'age', 'location']
}]
res = op(inputs)
print(res[0]['structured_data']) # {'name': '张三', 'age': '28', 'location': '上海'}
Source code in lazyllm/tools/data/operators/llm_json_ops.py
SchemaExtractor
Bases: LLMDataJson, LLMJsonBase
Schema extractor. Uses LLM to extract structured data from text according to a specified schema (dict or Pydantic model).
Parameters:
-
model–a LazyLLM model instance.
-
prompt–optional custom extraction prompt.
-
input_key–key name for input text, default 'text'.
-
output_key–key name to store results in the data dict, default 'structured_data'.
Examples:
from lazyllm import OnlineChatModule
from lazyllm.tools.data.operators.llm_json_ops import SchemaExtractor
model = OnlineChatModule(source='sensenova')
op = SchemaExtractor(model=model)
inputs = [{'text': 'Math score is 95', 'schema': {'subject': 'str', 'score': 'int'}}]
res = op(inputs)
print(res[0]['structured_data']) # {'subject': 'Math', 'score': 95}
Source code in lazyllm/tools/data/operators/llm_json_ops.py
Data Processing Pipeline
Demo Pipeline
lazyllm.tools.data.pipelines.demo_pipelines
build_demo_pipeline(input_key='text')
Build a demo data processing pipeline composed of several example operators.
Parameters:
-
input_key(str, default:'text') –the text field name to process, default 'text'
Returns:
A callable pipeline object that executes registered operators in sequence.
Examples:
from lazyllm.tools.data.pipelines.demo_pipelines import build_demo_pipeline
ppl = build_demo_pipeline(input_key='text')
data = [{'text': 'lazyLLM'}]
res = ppl(data)
print(res) # demonstrates how operators are combined and applied