Data Processing

Data Processing Operators

Base Operators

`lazyllm.tools.data.LazyLLMDataBase`

Base class for data processing operators registered via data_register. Provides concurrency, result persistence/resume, progress tracking, and error collection.
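The concurrency mentioned above can be pictured as a bounded "sliding window" of in-flight tasks: at most a fixed number of items are submitted at once, and each finished task frees a slot for the next input. The sketch below is a minimal illustration of that pattern using the standard library only. `run_bounded` is a hypothetical helper, not part of LazyLLM, and unlike the base class it returns results in input order rather than completion order.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_bounded(func, items, max_workers=4):
    """Run func over items, keeping at most max_workers tasks in flight."""
    results = {}
    idx_iter = iter(range(len(items)))
    futures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Prime the pool with up to max_workers tasks.
        for idx in idx_iter:
            futures[executor.submit(func, items[idx])] = idx
            if len(futures) >= max_workers:
                break
        while futures:
            done, _ = wait(futures.keys(), return_when=FIRST_COMPLETED)
            for fut in done:
                idx = futures.pop(fut)
                results[idx] = fut.result()
                # Refill the freed slot if any input remains.
                next_idx = next(idx_iter, None)
                if next_idx is not None:
                    futures[executor.submit(func, items[next_idx])] = next_idx
    return [results[i] for i in range(len(items))]


squares = run_bounded(lambda x: x * x, [1, 2, 3, 4, 5], max_workers=2)
print(squares)
```

Submitting lazily, rather than queuing every item up front, keeps memory use flat for large inputs and lets progress be reported as each item completes.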

Key methods:

forward(self, input_data, **kwargs): implement single-item processing.
forward_batch_input(self, inputs, **kwargs): implement batch processing and return results.
__call__(self, inputs): unified entry point; decides execution mode based on implemented methods and handles concurrency, resume and saving.
set_output(self, path): set export path; when set, __call__ writes results to a JSONL file and returns the file path.
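The way the entry point chooses between the two processing methods can be sketched as follows. This is a simplified stand-in, not the library class: `MiniOperator`, `Upper` and `Count` are hypothetical names, and persistence, resume and concurrency are left out so that only the dispatch rule (batch method first, then single-item method) remains.

```python
class MiniOperator:
    """Hypothetical stand-in for the base class, reduced to the dispatch step."""

    def forward(self, item):
        raise NotImplementedError

    def forward_batch_input(self, items):
        raise NotImplementedError

    def _overrides(self, name):
        # True when the subclass supplies its own implementation of `name`.
        return getattr(type(self), name) is not getattr(MiniOperator, name)

    def __call__(self, inputs):
        # A single item is wrapped into a list, as in the base class.
        if not isinstance(inputs, list):
            inputs = [inputs]
        if self._overrides('forward_batch_input'):
            return self.forward_batch_input(inputs)
        if self._overrides('forward'):
            return [self.forward(item) for item in inputs]
        raise RuntimeError('Must implement forward or forward_batch_input')


class Upper(MiniOperator):
    def forward(self, item):
        return {'text': item['text'].upper()}


class Count(MiniOperator):
    def forward_batch_input(self, items):
        return [{'count': len(items)}]


print(Upper()({'text': 'a'}))
print(Count()([{'text': 'a'}, {'text': 'b'}]))
```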

Constructor args:

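_concurrency_mode (str): concurrency mode, one of 'process'|'thread'|'single'. Defaults to 'process' unless the subclass defines its own default.
_save_data (bool): whether to persist intermediate results for resume. Defaults to True.
_max_workers (int|None): maximum number of concurrent workers. None selects a default: os.cpu_count() in 'process' mode, otherwise min(max(32, cpu_count * 5), 128).
_ignore_errors (bool): whether to log and skip exceptions raised while collecting task results instead of re-raising them. Defaults to True.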
**kwargs (dict): additional operator arguments.
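As an illustration of how the worker count is chosen when `_max_workers` is not given, the following sketch mirrors the rule in the constructor shown in the source listing. `default_max_workers` is a hypothetical free function written for this example only.

```python
import os


def default_max_workers(concurrency_mode, max_workers=None):
    """Pick a worker count following the rule in the constructor."""
    if max_workers:
        # An explicit, non-zero value always wins.
        return max_workers
    if concurrency_mode == 'process':
        # One worker per CPU for process pools.
        return os.cpu_count()
    # Thread (and other) modes: at least 32 workers, at most 128.
    return min(max(32, (os.cpu_count() or 1) * 5), 128)


print(default_max_workers('thread', 4))
print(default_max_workers('thread'))
```

Because the check is a truthiness test, passing `0` behaves like passing `None` and falls through to the default.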

Config keys (via lazyllm.config):

data_process_path (str): root folder to store pipeline outputs.
data_process_resume (bool): enable resume from previous progress.
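Resume works at the level of input indices: items already recorded as processed are skipped on the next run. A minimal sketch of that selection step is shown below. It assumes the processed indices are available as an in-memory set; `pending_indices` is a hypothetical helper, and how the state is stored on disk is not shown here.

```python
def pending_indices(total, processed, is_done=False):
    """Return the input indices that still need processing."""
    if is_done:
        # A finished run has nothing left to do.
        return []
    return [idx for idx in range(total) if idx not in processed]


print(pending_indices(5, {0, 2}))  # [1, 3, 4]
```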

Examples:

from lazyllm.tools.data import LazyLLMDataBase

# simple usage: subclass and implement forward
class EchoOp(LazyLLMDataBase):
    def forward(self, data):
        return {'text': data.get('text', '')}

op = EchoOp(_save_data=True)
res = op([{'text': 'hello'}])  # returns list or exported path depending on set_output

Source code in lazyllm/tools/data/base_data.py

class LazyLLMDataBase(metaclass=LazyLLMRegisterMetaClass):
    """Base class for data processing operators registered via data_register.
Provides concurrency, result persistence/resume, progress tracking, and error collection.

Key methods:

- forward(self, input, **kwargs): implement single-item processing.
- forward_batch_input(self, inputs, **kwargs): implement batch processing and return results.
- __call__(self, inputs): unified entry point; decides execution mode based on implemented methods and handles concurrency, resume and saving.
- set_output(self, path): set export path; when set, __call__ writes results to a file and returns the file path.

Constructor args:

- _concurrency_mode (str): concurrency mode, one of 'process'|'thread'|'single'.
- _save_data (bool): whether to persist intermediate results for resume.
- _max_workers (int|None): maximum workers for concurrency, None means default.
- _ignore_errors (bool): whether to ignore exceptions in tasks.
- **kwargs (dict): additional operator arguments.

Config keys (via lazyllm.config):

- data_process_path (str): root folder to store pipeline outputs.
- data_process_resume (bool): enable resume from previous progress.


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    # simple usage: subclass and implement forward
    class EchoOp(LazyLLMDataBase):
        def forward(self, data):
            return {'text': data.get('text', '')}

    op = EchoOp(_save_data=True)
    res = op([{'text': 'hello'}])  # returns list or exported path depending on set_output
    ```
    """
    def __init__(self, _concurrency_mode=None, _save_data=True, _max_workers=None,
                 _ignore_errors=True, **kwargs):
        self._concurrency_mode = _concurrency_mode or getattr(self, '_concurrency_mode', 'process')
        if _max_workers:
            self._max_workers = _max_workers
        elif self._concurrency_mode == 'process':
            self._max_workers = os.cpu_count()
        else:
            self._max_workers = min(max(32, (os.cpu_count() or 1) * 5), 128)
        self._ignore_errors = _ignore_errors
        self._store = DataStateStore(self.__class__.__name__, _save_data)
        self._lazyllm_kwargs = kwargs
        self._export_path = None

    def set_output(self, output_path):
        """Set output path for exporting final results to a JSONL file and return the file path.

Args:
    output_path (str): directory path or concrete .jsonl file path. If a directory is provided, a file named <ClassName>.jsonl will be created inside it.

Behavior:
- If a folder path is provided, a file named <ClassName>.jsonl will be created in that folder.
- If a .jsonl file path is provided, results will be written to that file (directories created as needed).
- Returns the absolute path of the exported file.


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    # export to a directory (will create DemoClass.jsonl)
    op = Demo2.rich_content(input_key='text').set_output('./out_dir')
    path = op([{'text': 'sample'}])
    print(path)  # ./out_dir/RichContent.jsonl or similar

    # export to a specific file
    op = Demo2.rich_content(input_key='text').set_output('./out_dir/results.jsonl')
    path = op([{'text': 'sample'}])
    print(path)  # ./out_dir/results.jsonl
    ```
    """
        self._export_path = output_path
        return self

    def _overwrote(self, f):
        return getattr(self.__class__, f) is not getattr(__class__, f) or \
            getattr(self.__class__, '__reg_overwrite__', None) == f

    def forward(self, input_data, **kwargs):
        """Method to implement in subclasses for single-item processing. Supported return types:

- dict: processed single result.
- list: expand one input into multiple outputs.
- None: keep the original input unchanged.
Exceptions or error returns are recorded to the error file and typically skipped from valid results.

Args:
    input (dict): a single input data dict.
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    class MyOp(LazyLLMDataBase):
        def forward(self, data):
            # return dict or list or None
            return {'text': data.get('text', '').upper()}

    op = MyOp()
    print(op([{'text': 'a'}]))
    ```
    """
        raise NotImplementedError()

    def forward_batch_input(self, inputs, **kwargs):
        """Optional batch-processing method for subclasses. Receives the whole input list and returns a final list of results. Useful for custom batching or single-call external services.

Args:
    inputs (list[dict]): list of input data dicts.
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    class BatchOp(LazyLLMDataBase):
        def forward_batch_input(self, inputs):
            # implement batch processing and return a list
            return [{'text': i.get('text', '').lower()} for i in inputs]

    op = BatchOp()
    print(op([{'text': 'A'}, {'text': 'B'}]))
    ```
    """
        raise NotImplementedError()

    def _run_one(self, data):
        try:
            kwargs = getattr(self, '_lazyllm_kwargs', {})
            return self.forward(data, **kwargs)
        except Exception as e:
            err_msg = str(e)
            if isinstance(data, dict):
                return {**data, 'infer_error': err_msg}
            return {'input': data, 'infer_error': err_msg}

    def _process_forward_common(self, data):
        self._store.load_progress()
        results = []
        pbar = tqdm(total=len(data), desc=f'Processing {self.__class__.__name__}', unit='item')

        if self._store.is_done:
            pbar.update(len(data))
            pending_indices = []
        else:
            if len(self._store.processed_indices) > 0:
                pbar.update(len(self._store.processed_indices))

            pending_indices = [idx for idx in range(len(data)) if idx not in self._store.processed_indices]

        if not pending_indices:
            pbar.close()
            return self._store.load_results()

        if self._concurrency_mode == 'single':
            for idx in pending_indices:
                res = self._run_one(data[idx])
                self._handle_result(res, data[idx], results, [idx])
                pbar.update(1)
        else:
            self._process_parallel(data, pending_indices, results, pbar)

        pbar.close()
        # Flush remaining
        if self._store.save_data:
            self._store.save_results([], force=True)  # Flush
            return self._store.load_results()
        return results

    def _process_parallel(self, data, pending_indices, results, pbar):

        executor_cls = ProcessPoolExecutor if self._concurrency_mode == 'process' else ThreadPoolExecutor
        idx_iter = iter(pending_indices)
        futures = {}

        with executor_cls(max_workers=self._max_workers) as executor:
            # 1. Submit initial batch
            for _ in range(self._max_workers):
                try:
                    idx = next(idx_iter)
                    fut = executor.submit(self._run_one, data[idx])
                    futures[fut] = idx
                except StopIteration:
                    break

            # 2. Loop
            while futures:
                done, _ = wait(futures.keys(), return_when=FIRST_COMPLETED)

                for fut in done:
                    idx = futures.pop(fut)
                    try:
                        res = fut.result()
                        self._handle_result(res, data[idx], results, [idx])
                    except Exception as e:
                        if not self._ignore_errors:
                            raise e
                        LOG.error(f'Task failed: {e}')

                    pbar.update(1)

                    # Submit next
                    try:
                        next_idx = next(idx_iter)
                        new_fut = executor.submit(self._run_one, data[next_idx])
                        futures[new_fut] = next_idx
                    except StopIteration:
                        pass

    def _handle_result(self, res, original_data, results, indices):
        if isinstance(res, dict) and 'infer_error' in res:
            if self._store.save_data:
                self._store.save_errors(res)
                self._store.save_results([], indices)
            return

        # Logic to interpret return value
        final_res = []
        if res is None:
            final_res.append(original_data)  # Keep original
        elif isinstance(res, list):
            if res:  # Not empty
                final_res.extend(res)
            # Empty list means delete (do nothing)
        elif isinstance(res, dict):
            final_res.append(res)
        else:
            # Treat unexpected return types as errors
            err_msg = f'Invalid return type {type(res)} from {self.__class__.__name__}, expect dict or list or None'
            LOG.error(err_msg)
            if isinstance(original_data, dict):
                error_res = original_data.copy()
                error_res['infer_error'] = err_msg
            else:
                error_res = {'input': original_data, 'infer_error': err_msg}

            if self._store.save_data:
                self._store.save_errors(error_res)
                self._store.save_results([], indices)
            return

        if self._store.save_data:
            self._store.save_results(final_res, indices)
        else:
            results.extend(final_res)

    def _export_file(self, result):
        if not self._export_path or result is None:
            return result

        path = self._export_path
        if not path.endswith('.jsonl'):
            os.makedirs(path, exist_ok=True)
            path = os.path.join(path, f'{self.__class__.__name__}.jsonl')
        else:
            dir_name = os.path.dirname(path)
            if dir_name:
                os.makedirs(dir_name, exist_ok=True)

        abs_path = os.path.abspath(path)
        with open(abs_path, 'w', encoding='utf-8') as f:
            for item in result:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
        return abs_path

    def __call__(self, inputs):
        if not isinstance(inputs, list):
            inputs = [inputs]

        kwargs = getattr(self, '_lazyllm_kwargs', {})
        res = []

        if self._overwrote('forward_batch_input'):
            self._store.load_progress()
            if self._store.save_data and self._store.resume and self._store.is_done:
                LOG.warning(f'skip {self.__class__.__name__} and load data from {self._store.save_path}')
                res = self._store.load_results()
            else:
                res = self.forward_batch_input(inputs, **kwargs)

                if self._store.save_data and res is not None:
                    self._store.save_results(res if isinstance(res, list) else [res], indices='Done', force=True)

        elif self._overwrote('forward'):
            res = self._process_forward_common(inputs)
        else:
            raise RuntimeError('Must implement forward or forward_batch_input')

        return self._export_file(res)

`forward(input_data, **kwargs)`

Method to implement in subclasses for single-item processing. Supported return types:

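dict: processed single result.
list: expand one input into multiple outputs; an empty list drops the item.
None: keep the original input unchanged.

Exceptions, results containing an 'infer_error' key, and return values of any other type are recorded to the error file (when _save_data is enabled) and excluded from the valid results.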
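The return-type rules above can be sketched as a small pure function. This is an illustrative reduction of the result handling in the source listing, with persistence removed; `interpret_result` is a hypothetical name, and it returns the valid items together with an optional error record instead of writing to a store.

```python
def interpret_result(res, original):
    """Return (valid_items, error_record) for one forward() return value."""
    if isinstance(res, dict) and 'infer_error' in res:
        # Error results are never part of the valid output.
        return [], res
    if res is None:
        # None keeps the original input unchanged.
        return [original], None
    if isinstance(res, list):
        # A list fans out into several outputs; an empty list drops the item.
        return list(res), None
    if isinstance(res, dict):
        return [res], None
    # Any other type is treated as an error.
    err = f'Invalid return type {type(res)}, expect dict or list or None'
    if isinstance(original, dict):
        return [], {**original, 'infer_error': err}
    return [], {'input': original, 'infer_error': err}


print(interpret_result(None, {'text': 'a'}))
print(interpret_result([], {'text': 'a'}))
```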

Parameters:

input_data (dict) –

a single input data dict.
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import LazyLLMDataBase

class MyOp(LazyLLMDataBase):
    def forward(self, data):
        # return dict or list or None
        return {'text': data.get('text', '').upper()}

op = MyOp()
print(op([{'text': 'a'}]))

Source code in lazyllm/tools/data/base_data.py

    def forward(self, input_data, **kwargs):
        """Method to implement in subclasses for single-item processing. Supported return types:

- dict: processed single result.
- list: expand one input into multiple outputs.
- None: keep the original input unchanged.
Exceptions or error returns are recorded to the error file and typically skipped from valid results.

Args:
    input (dict): a single input data dict.
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    class MyOp(LazyLLMDataBase):
        def forward(self, data):
            # return dict or list or None
            return {'text': data.get('text', '').upper()}

    op = MyOp()
    print(op([{'text': 'a'}]))
    ```
    """
        raise NotImplementedError()

`forward_batch_input(inputs, **kwargs)`

Optional batch-processing method for subclasses. Receives the whole input list and returns a final list of results. Useful for custom batching or single-call external services.

Parameters:

inputs (list[dict]) –

list of input data dicts.
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import LazyLLMDataBase

class BatchOp(LazyLLMDataBase):
    def forward_batch_input(self, inputs):
        # implement batch processing and return a list
        return [{'text': i.get('text', '').lower()} for i in inputs]

op = BatchOp()
print(op([{'text': 'A'}, {'text': 'B'}]))

Source code in lazyllm/tools/data/base_data.py

    def forward_batch_input(self, inputs, **kwargs):
        """Optional batch-processing method for subclasses. Receives the whole input list and returns a final list of results. Useful for custom batching or single-call external services.

Args:
    inputs (list[dict]): list of input data dicts.
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    class BatchOp(LazyLLMDataBase):
        def forward_batch_input(self, inputs):
            # implement batch processing and return a list
            return [{'text': i.get('text', '').lower()} for i in inputs]

    op = BatchOp()
    print(op([{'text': 'A'}, {'text': 'B'}]))
    ```
    """
        raise NotImplementedError()

`set_output(output_path)`

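Set the output path for exporting final results to a JSONL file. Returns the operator itself so that calls can be chained; the subsequent operator call then returns the path of the exported file.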

Parameters:

output_path (str) –

directory path or concrete .jsonl file path. If a directory is provided, a file named <ClassName>.jsonl will be created inside it.

Behavior:

If a folder path is provided, a file named <ClassName>.jsonl will be created in that folder.
If a .jsonl file path is provided, results will be written to that file (directories created as needed).
Calling the operator then returns the absolute path of the exported file.
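The path handling described above can be sketched with the standard library alone. `export_jsonl` below is a hypothetical free function that follows the export step in the source listing: any path not ending in `.jsonl` is treated as a directory, and one JSON object is written per line.

```python
import json
import os
import tempfile


def export_jsonl(items, output_path, class_name):
    """Write items as JSON Lines and return the absolute file path."""
    path = output_path
    if not path.endswith('.jsonl'):
        # Treated as a directory: the file is named after the operator class.
        os.makedirs(path, exist_ok=True)
        path = os.path.join(path, f'{class_name}.jsonl')
    else:
        dir_name = os.path.dirname(path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
    abs_path = os.path.abspath(path)
    with open(abs_path, 'w', encoding='utf-8') as f:
        for item in items:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    return abs_path


with tempfile.TemporaryDirectory() as tmp:
    p1 = export_jsonl([{'text': 'a'}], os.path.join(tmp, 'out_dir'), 'MyOp')
    p2 = export_jsonl([{'text': 'b'}], os.path.join(tmp, 'x', 'results.jsonl'), 'MyOp')
    print(os.path.basename(p1), os.path.basename(p2))
```

The file is opened in write mode, so exporting twice to the same path overwrites the earlier output.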

Examples:

from lazyllm.tools.data import Demo2

# export to a directory (creates <ClassName>.jsonl inside it)
op = Demo2.rich_content(input_key='text').set_output('./out_dir')
path = op([{'text': 'sample'}])
print(path)  # absolute path of ./out_dir/<ClassName>.jsonl

# export to a specific file
op = Demo2.rich_content(input_key='text').set_output('./out_dir/results.jsonl')
path = op([{'text': 'sample'}])
print(path)  # absolute path of ./out_dir/results.jsonl

Source code in lazyllm/tools/data/base_data.py

    def set_output(self, output_path):
        """Set output path for exporting final results to a JSONL file and return the file path.

Args:
    output_path (str): directory path or concrete .jsonl file path. If a directory is provided, a file named <ClassName>.jsonl will be created inside it.

Behavior:
- If a folder path is provided, a file named <ClassName>.jsonl will be created in that folder.
- If a .jsonl file path is provided, results will be written to that file (directories created as needed).
- Returns the absolute path of the exported file.


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    # export to a directory (will create DemoClass.jsonl)
    op = Demo2.rich_content(input_key='text').set_output('./out_dir')
    path = op([{'text': 'sample'}])
    print(path)  # ./out_dir/RichContent.jsonl or similar

    # export to a specific file
    op = Demo2.rich_content(input_key='text').set_output('./out_dir/results.jsonl')
    path = op([{'text': 'sample'}])
    print(path)  # ./out_dir/results.jsonl
    ```
    """
        self._export_path = output_path
        return self

Demo Operators

`lazyllm.tools.data.operators.demo_ops`

`AddSuffix`

Bases: Demo2

Class-based operator that appends a suffix to a specified field. Supports concurrency configuration via constructor args.

Parameters:

suffix (str) –

suffix string to append
input_key (str, default: 'content' ) –

key name of the text field
_max_workers (int | None) –

optional max concurrency
_concurrency_mode (str, default: 'process' ) –

optional concurrency mode
_save_data (bool) –

whether to persist results (optional)

Examples:

from lazyllm.tools.data import Demo2

op = Demo2.AddSuffix(suffix='!!!', input_key='text', _max_workers=2)
data = [{'text': 'wow'}]
res = op(data)
print(res)
# [{'text': 'wow!!!'}]

Source code in lazyllm/tools/data/operators/demo_ops.py

class AddSuffix(Demo2):
    """Class-based operator that appends a suffix to a specified field. Supports concurrency configuration via constructor args.

Args:
    suffix (str): suffix string to append
    input_key (str): key name of the text field
    _max_workers (int|None): optional max concurrency
    _concurrency_mode (str): optional concurrency mode
    _save_data (bool): optional whether to persist results


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    op = Demo2.AddSuffix(suffix='!!!', input_key='text', _max_workers=2)
    data = [{'text': 'wow'}]
    res = op(data)
    print(res)
    # [{'text': 'wow!!!'}]
    ```
    """
    def __init__(self, suffix, input_key='content', _concurrency_mode='process', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.suffix = suffix
        self.input_key = input_key

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        data[self.input_key] = f'{data.get(self.input_key, "")}{self.suffix}'
        return data

`build_pre_suffix(data, input_key='content', prefix='', suffix='')`

Add a prefix and suffix to the specified field of each item in the input list. Registered as a batch operator.

Parameters:

data (list[dict]) –

list of dicts
input_key (str, default: 'content' ) –

key name of the text field
prefix (str, default: '' ) –

string to add before the field
suffix (str, default: '' ) –

string to add after the field

Examples:

from lazyllm.tools.data import Demo1

op = Demo1.build_pre_suffix(input_key='text', prefix='Hello, ', suffix='!')
data = [{'text': 'world'}]
res = op(data)
print(res)
# [{'text': 'Hello, world!'}]

Source code in lazyllm/tools/data/operators/demo_ops.py

@data_register('data.demo1', rewrite_func='forward_batch_input')
def build_pre_suffix(data, input_key='content', prefix='', suffix=''):
    """Add a prefix and suffix to the specified field of each item in the input list. Registered as a batch operator.

Args:
    data (list[dict]): list of dicts
    input_key (str): key name of the text field
    prefix (str): string to add before the field
    suffix (str): string to add after the field


Examples:
    ```python
    from lazyllm.tools.data import Demo1

    op = Demo1.build_pre_suffix(input_key='text', prefix='Hello, ', suffix='!')
    data = [{'text': 'world'}]
    res = op(data)
    print(res)
    # [{'text': 'Hello, world!'}]
    ```
    """
    assert isinstance(data, list)
    for item in data:
        item[input_key] = f'{prefix}{item.get(input_key, "")}{suffix}'
    return data

`error_prone_op(data, input_key='content')`

A test operator that raises an exception when the field named by input_key equals 'fail' and otherwise returns a processed dict. Used to validate error collection and skipping behavior.
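The error collection this operator exercises can be sketched without the library: each item is processed inside a try/except, and a failure is turned into a record carrying an `infer_error` key instead of aborting the run. `run_one` and `error_prone` below are hypothetical stand-ins written for this illustration.

```python
def run_one(func, data, **kwargs):
    """Call func on one item and convert any exception into an error record."""
    try:
        return func(data, **kwargs)
    except Exception as e:
        if isinstance(data, dict):
            return {**data, 'infer_error': str(e)}
        return {'input': data, 'infer_error': str(e)}


def error_prone(data, input_key='content'):
    content = data.get(input_key, '')
    if content == 'fail':
        raise ValueError('Intentional error for testing.')
    return {**data, input_key: f'Processed: {content}'}


outputs = [run_one(error_prone, d, input_key='text')
           for d in [{'text': 'ok'}, {'text': 'fail'}]]
# Split valid results from error records by the marker key.
valid = [o for o in outputs if 'infer_error' not in o]
errors = [o for o in outputs if 'infer_error' in o]
print(valid)
print(errors)
```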

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key name of the text field

Examples:

from lazyllm.tools.data import Demo2

op = Demo2.error_prone_op(input_key='text', _save_data=True, _concurrency_mode='single')
data = [{'text': 'ok'}, {'text': 'fail'}, {'text': 'ok2'}]
res = op(data)
print(res)
# [{'text': 'Processed: ok'}, {'text': 'Processed: ok2'}]
# valid results skip the failed item; error details written to error file

Source code in lazyllm/tools/data/operators/demo_ops.py

@data_register('data.demo2', rewrite_func='forward')
def error_prone_op(data, input_key='content'):
    """A test operator that raises an exception for specific input (content == 'fail') and otherwise returns a processed dict.
Used to validate error collection and skipping behavior.

Args:
    data (dict): single data dict
    input_key (str): key name of the text field


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    op = Demo2.error_prone_op(input_key='text', _save_data=True, _concurrency_mode='single')
    data = [{'text': 'ok'}, {'text': 'fail'}, {'text': 'ok2'}]
    res = op(data)
    print(res)
    # [{'text': 'Processed: ok'}, {'text': 'Processed: ok2'}]
    # valid results skip the failed item; error details written to error file
    ```
    """
    assert isinstance(data, dict)
    content = data.get(input_key, '')
    if content == 'fail':
        raise ValueError('Intentional error for testing.')
    data[input_key] = f'Processed: {content}'
    return data

`process_uppercase(data, input_key='content')`

Convert the input text field to uppercase. Intended as a single-item processing function.

Parameters:

data (dict) –

a dict representing a single data item.
input_key (str, default: 'content' ) –

key name of the text field, default 'content'.

Examples:

from lazyllm.tools.data import Demo1

op = Demo1.process_uppercase(input_key='text')
data = [{'text': 'hello'}]
res = op(data)
print(res)
# [{'text': 'HELLO'}]

Source code in lazyllm/tools/data/operators/demo_ops.py

@data_register('data.demo1', rewrite_func='forward', _concurrency_mode='process')
def process_uppercase(data, input_key='content'):
    """Convert the input text field to uppercase. Intended as a single-item processing function.

Args:
    data (dict): a dict representing a single data item.
    input_key (str): key name of the text field, default 'content'.


Examples:
    ```python
    from lazyllm.tools.data import Demo1

    op = Demo1.process_uppercase(input_key='text')
    data = [{'text': 'hello'}]
    res = op(data)
    print(res)
    # [{'text': 'HELLO'}]
    ```
    """
    assert isinstance(data, dict)
    data[input_key] = data.get(input_key, '').upper()
    return data

`rich_content(data, input_key='content')`

Split a single input into multiple outputs (original + derived parts). Implemented as a forward that returns a list.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key name of the text field

Examples:

from lazyllm.tools.data import Demo2

op = Demo2.rich_content(input_key='text')
data = [{'text': 'This is a test.'}]
res = op(data)
print(res)
# [
#   {'text': 'This is a test.'},
#   {'text': 'This is a test. - part 1'},
#   {'text': 'This is a test. - part 2'}
# ]

Source code in lazyllm/tools/data/operators/demo_ops.py

@data_register('data.demo2', rewrite_func='forward', _concurrency_mode='process')
def rich_content(data, input_key='content'):
    """Split a single input into multiple outputs (original + derived parts). Implemented as a forward that returns a list.

Args:
    data (dict): single data dict
    input_key (str): key name of the text field


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    op = Demo2.rich_content(input_key='text')
    data = [{'text': 'This is a test.'}]
    res = op(data)
    print(res)
    # [
    #   {'text': 'This is a test.'},
    #   {'text': 'This is a test. - part 1'},
    #   {'text': 'This is a test. - part 2'}
    # ]
    ```
    """
    assert isinstance(data, dict)
    content = data.get(input_key, '')
    new_res = [data]
    for i in range(2):
        new_data = data.copy()
        new_data[input_key] = f'{content} - part {i+1}'
        new_res.append(new_data)
    return new_res

Preference Operators

`lazyllm.tools.data.operators.preference_ops`

`IntentExtractor`

Bases: PreferenceOps

Preference operator: intent extractor.

Extracts the core intent from a specified field of the input data dict and writes it to an output field, so that downstream steps can generate multiple candidate responses and construct preference pairs.

Notes:

Internally uses a model plus a JSON formatter and expects the model output to be a JSON dict. If the output cannot be parsed as a dict, the output field is set to None.
Default concurrency mode is thread.
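The "dict or None" rule in the notes above can be illustrated with the standard `json` module. This is only a stand-in: the operator itself delegates parsing to its JSON formatter, and `parse_intent` is a hypothetical helper that assumes the raw model output is a string.

```python
import json


def parse_intent(raw_output):
    """Return the parsed JSON object if it is a dict, otherwise None."""
    try:
        parsed = json.loads(raw_output)
    except (TypeError, ValueError):
        # Not valid JSON (or not a string at all).
        return None
    return parsed if isinstance(parsed, dict) else None


print(parse_intent('{"intent": "book_hotel"}'))
print(parse_intent('[1, 2, 3]'))
print(parse_intent('not json'))
```

Returning None rather than raising keeps a single malformed model reply from failing the whole item; downstream steps can filter on the output field being None.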

Parameters:

model –

a LazyLLM model object (required), will be shared via share().
input_key (str, default: 'content' ) –

input text field name, default 'content'.
output_key (str, default: 'intent' ) –

output intent field name, default 'intent'.
**kwargs –

extra args passed to the base operator (e.g. _max_workers, _save_data).

Examples:

from lazyllm.tools.data.operators.preference_ops import IntentExtractor

# model must be provided by your project environment, e.g. a model object obtained from lazyllm.xxx(...)
op = IntentExtractor(model=model, input_key='content', output_key='intent')
print(op({'content': 'I want to stay at a hotel in Beijing.'}))
# [{
#   'content': 'I want to stay at a hotel in Beijing.',
#   'intent': {
#     'intent': 'book_hotel',
#     'entities': [{'entity': 'location', 'value': 'Beijing'}]
#   }
# }]

Source code in lazyllm/tools/data/operators/preference_ops.py

class IntentExtractor(PreferenceOps):
    """Preference operator: intent extractor.

Extracts the core intent from a specified field of the input data dict and writes it to an output field,
so that downstream steps can generate multiple candidate responses and construct preference pairs.

Notes:

- Internally uses a model plus a JSON formatter; it expects the model output to be a JSON dict. If it cannot be parsed as dict, the output is None.
- Default concurrency mode is thread.

Args:
    model: a LazyLLM model object (required), will be shared via share().
    input_key (str): input text field name, default 'content'.
    output_key (str): output intent field name, default 'intent'.
    **kwargs: extra args passed to the base operator (e.g. _max_workers, _save_data).


Examples:
    ```python
    from lazyllm.tools.data.operators.preference_ops import IntentExtractor

    # model must be provided by your project environment, e.g. a model object obtained from lazyllm.xxx(...)
    op = IntentExtractor(model=model, input_key='content', output_key='intent')
    print(op({'content': 'I want to stay at a hotel in Beijing.'}))
    # [{
    #   'content': 'I want to stay at a hotel in Beijing.',
    #   'intent': {
    #     'intent': 'book_hotel',
    #     'entities': [{'entity': 'location', 'value': 'Beijing'}]
    #   }
    # }]
    ```
    """
    def __init__(self, model=None, input_key='content', output_key='intent', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
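        # System prompt, in English: "You are an intent extraction assistant. Extract the core intent from the user text and return it in JSON format."
        sys_prompt = '你是一个意图提取助手，请从用户文本中提取核心意图，并以 JSON 格式返回。'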
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.input_key in data:
            data[self.output_key] = self.extract(data[self.input_key])
        return data

    def extract(self, raw_text):
        # Instruction, in English: "Extract the core intent of the following user text:"
        instruction = f'提炼以下用户文本的核心意图: \n{raw_text}'
        res = self.model(instruction)
        return res if isinstance(res, dict) else None

`PreferencePairConstructor`

Bases: PreferenceOps

Preference operator: preference pair constructor (chosen / rejected).

Given a list of candidate responses and their score list, constructs a (chosen, rejected) pair and outputs a preference sample:

instruction: instruction text (by default read from the intent field)
chosen: better response
rejected: worse response

Two strategies are supported:

max_min: choose the highest score as chosen and the lowest as rejected (requires highest > lowest).
threshold: find a pair with score difference >= threshold, from high to low.

Note: if inputs are empty/mismatched, or no valid pair can be constructed, it returns an empty list [] (useful to filter invalid samples in pipelines).
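The two strategies can be sketched as a standalone function. This is an illustration under stated assumptions, not the operator's implementation: `build_pair` is a hypothetical name, the `threshold` branch encodes one plausible reading of "from high to low" (try the highest-scored response first and pair it with the first lower-ranked response whose gap reaches the threshold), and an unknown strategy is assumed to yield an empty list.

```python
def build_pair(instruction, responses, scores, strategy='max_min', threshold=0.5):
    """Return a one-element list with a preference sample, or [] if none fits."""
    if not responses or not scores or len(responses) != len(scores):
        return []
    # Rank candidates from highest to lowest score.
    ranked = sorted(zip(scores, responses), key=lambda p: p[0], reverse=True)
    if strategy == 'max_min':
        (hi_score, chosen), (lo_score, rejected) = ranked[0], ranked[-1]
        if hi_score <= lo_score:
            return []  # all scores equal: no meaningful preference
        return [{'instruction': instruction, 'chosen': chosen, 'rejected': rejected}]
    if strategy == 'threshold':
        # Assumed search order: best candidate first, then lower-ranked partners.
        for i, (hi_score, chosen) in enumerate(ranked):
            for lo_score, rejected in ranked[i + 1:]:
                if hi_score - lo_score >= threshold:
                    return [{'instruction': instruction,
                             'chosen': chosen, 'rejected': rejected}]
        return []
    return []  # assumption: unknown strategies produce no sample


print(build_pair('book a hotel', ['good response', 'bad response'], [10, 6]))
print(build_pair('q', ['a', 'b', 'c'], [1.0, 0.8, 0.2], strategy='threshold'))
```

Returning an empty list for invalid input matches the note above: in a pipeline, such items simply disappear from the output.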

Parameters:

strategy (str, default: 'max_min' ) –

'max_min' or 'threshold', default 'max_min'.
threshold (float, default: 0.5 ) –

minimum score gap when strategy == 'threshold', default 0.5.
instruction_key (str, default: 'intent' ) –

instruction field name, default 'intent'.
response_key (str, default: 'responses' ) –

candidate response list field name, default 'responses'.
score_key (str, default: 'evaluation' ) –

score list field name, default 'evaluation'.
output_chosen_key (str, default: 'chosen' ) –

chosen field name, default 'chosen'.
output_rejected_key (str, default: 'rejected' ) –

rejected field name, default 'rejected'.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.preference_ops import PreferencePairConstructor

op = PreferencePairConstructor(strategy='max_min', instruction_key='intent',
                              response_key='responses', score_key='evaluation')
data = {
    'intent': 'book a hotel',
    'responses': ['good response', 'bad response'],
    'evaluation': [10, 6],
}
print(op(data))
# [{
#   'instruction': 'book a hotel',
#   'chosen': 'good response',
#   'rejected': 'bad response'
# }]

Source code in lazyllm/tools/data/operators/preference_ops.py

class PreferencePairConstructor(PreferenceOps):
    """Preference operator: preference pair constructor (chosen / rejected).

Given a list of candidate responses and their score list, constructs a (chosen, rejected) pair and outputs a preference sample:

- instruction: instruction text (by default read from the intent field)
- chosen: better response
- rejected: worse response

Two strategies are supported:

- max_min: choose the highest score as chosen and the lowest as rejected (requires highest > lowest).
- threshold: find a pair with score difference >= threshold, from high to low.

Note: if the response or score field is missing or empty, the two lists differ in length, or no valid pair can be constructed, the operator returns an empty list [] (useful for filtering invalid samples in pipelines).

Args:
    strategy (str): 'max_min' or 'threshold', default 'max_min'.
    threshold (float): minimum score gap when strategy == 'threshold', default 0.5.
    instruction_key (str): instruction field name, default 'intent'.
    response_key (str): candidate response list field name, default 'responses'.
    score_key (str): score list field name, default 'evaluation'.
    output_chosen_key (str): chosen field name, default 'chosen'.
    output_rejected_key (str): rejected field name, default 'rejected'.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.preference_ops import PreferencePairConstructor

    op = PreferencePairConstructor(strategy='max_min', instruction_key='intent',
                                  response_key='responses', score_key='evaluation')
    data = {
        'intent': 'book a hotel',
        'responses': ['good response', 'bad response'],
        'evaluation': [10, 6],
    }
    print(op(data))
    # [{
    #   'instruction': 'book a hotel',
    #   'chosen': 'good response',
    #   'rejected': 'bad response'
    # }]
    ```
    """
    def __init__(self, strategy='max_min', threshold=0.5,
                 instruction_key='intent', response_key='responses', score_key='evaluation',
                 output_chosen_key='chosen', output_rejected_key='rejected', **kwargs):
        super().__init__(**kwargs)
        self.strategy = strategy
        self.threshold = threshold
        self.instruction_key = instruction_key
        self.response_key = response_key
        self.score_key = score_key
        self.output_chosen_key = output_chosen_key
        self.output_rejected_key = output_rejected_key

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.response_key in data and self.score_key in data:
            responses = data[self.response_key]
            scores = data[self.score_key]

            if not responses or not scores or len(responses) != len(scores):
                return []

            chosen, rejected = self.construct_pair(responses, scores)

            if chosen is not None and rejected is not None:
                return {
                    'instruction': data.get(self.instruction_key, ''),
                    self.output_chosen_key: chosen,
                    self.output_rejected_key: rejected
                }

        return []

    def construct_pair(self, responses, scores):
        if len(responses) < 2:
            return None, None

        pairs = list(zip(responses, scores))
        pairs.sort(key=lambda x: x[1], reverse=True)

        if self.strategy == 'max_min':
            chosen_pair = pairs[0]
            rejected_pair = pairs[-1]

            if chosen_pair[1] > rejected_pair[1]:
                return chosen_pair[0], rejected_pair[0]

        elif self.strategy == 'threshold':
            for i in range(len(pairs)):
                for j in range(i + 1, len(pairs)):
                    score_diff = pairs[i][1] - pairs[j][1]
                    if score_diff >= self.threshold:
                        return pairs[i][0], pairs[j][0]

        return None, None
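
The two strategies can behave quite differently on the same scores. The following is a minimal standalone sketch that restates the pairing logic from construct_pair outside the operator (the function signature and the sample data are assumptions made for illustration, not part of the library API):

```python
def construct_pair(responses, scores, strategy='max_min', threshold=0.5):
    """Standalone restatement of the pairing logic, for illustration only."""
    if len(responses) < 2 or len(responses) != len(scores):
        return None, None
    # Rank (response, score) pairs from highest to lowest score.
    ranked = sorted(zip(responses, scores), key=lambda p: p[1], reverse=True)
    if strategy == 'max_min':
        best, worst = ranked[0], ranked[-1]
        # A tie between best and worst yields no pair.
        if best[1] > worst[1]:
            return best[0], worst[0]
    elif strategy == 'threshold':
        # First pair, scanning from the top, whose gap reaches the threshold.
        for i in range(len(ranked)):
            for j in range(i + 1, len(ranked)):
                if ranked[i][1] - ranked[j][1] >= threshold:
                    return ranked[i][0], ranked[j][0]
    return None, None


responses = ['a', 'b', 'c']
scores = [9, 8.8, 5]
max_min_pair = construct_pair(responses, scores, 'max_min')
threshold_pair = construct_pair(responses, scores, 'threshold', threshold=0.1)
tie_pair = construct_pair(['x', 'y'], [7, 7], 'max_min')
print(max_min_pair)    # ('a', 'c')
print(threshold_pair)  # ('a', 'b')
print(tie_pair)        # (None, None)
```

With 'max_min' the pair always spans the full score range, while 'threshold' returns the first qualifying pair from the top of the ranking, which may be two closely scored responses.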

`PreferenceResponseGenerator`

Bases: PreferenceOps

Preference operator: multi-response generator.

Given the intent (or any instruction text), generates n candidate responses and writes them as a list to the output field.

Parameters:

model –

a LazyLLM model object (required); it is shared via share().
n (int, default: 3 ) –

number of candidate responses to generate, default 3.
temperature (float, default: 1.0 ) –

sampling temperature, default 1.0.
system_prompt (str | None, default: None ) –

optional system prompt; if provided, applies .prompt(system_prompt) to the model.
input_key (str, default: 'intent' ) –

input field name, default 'intent'.
output_key (str, default: 'responses' ) –

output field name, default 'responses'.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.preference_ops import PreferenceResponseGenerator

op = PreferenceResponseGenerator(model=model, n=3, temperature=0.8, input_key='intent', output_key='responses')
print(op({'intent': 'book a hotel'}))
# [{
#   'intent': 'book a hotel',
#   'responses': [
#     "<think>Okay, the user wants to book a hotel. ...",
#     "<think>Okay, the user wants to book a hotel. ...",
#     "<think>Okay, the user wants to book a hotel. ..."
#   ]
# }]

Source code in lazyllm/tools/data/operators/preference_ops.py

class PreferenceResponseGenerator(PreferenceOps):
    """Preference operator: multi-response generator.

Given the intent (or any instruction text), generates n candidate responses and writes them as a list to the output field.

Args:
    model: a LazyLLM model object (required), will be shared via share().
    n (int): number of candidate responses to generate, default 3.
    temperature (float): sampling temperature, default 1.0.
    system_prompt (str|None): optional system prompt; if provided, applies .prompt(system_prompt) to the model.
    input_key (str): input field name, default 'intent'.
    output_key (str): output field name, default 'responses'.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.preference_ops import PreferenceResponseGenerator

    op = PreferenceResponseGenerator(model=model, n=3, temperature=0.8, input_key='intent', output_key='responses')
    print(op({'intent': 'book a hotel'}))
    # [{
    #   'intent': 'book a hotel',
    #   'responses': [
    #     "<think>Okay, the user wants to book a hotel. ...",
    #     "<think>Okay, the user wants to book a hotel. ...",
    #     "<think>Okay, the user wants to book a hotel. ..."
    #   ]
    # }]
    ```
    """
    def __init__(self, model=None, n=3, temperature=1.0, system_prompt=None,
                 input_key='intent', output_key='responses', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.n = n
        self.temperature = temperature
        self.input_key = input_key
        self.output_key = output_key
        self.model = model.share()
        if system_prompt:
            self.model = self.model.prompt(system_prompt)

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.input_key in data:
            data[self.output_key] = self.generate(data[self.input_key])
        return data

    def generate(self, x):
        responses = []
        for _ in range(self.n):
            response = self.model(x, temperature=self.temperature)
            responses.append(response)
        return responses
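
To try the generation loop without a deployed model, a stub callable can stand in for the LazyLLM model. The sketch below is only an approximation of forward and generate: StubModel and generate_candidates are hypothetical names, and the stub simply echoes the prompt.

```python
import itertools


class StubModel:
    """Hypothetical stand-in for a LazyLLM model: a callable returning text."""

    def __init__(self):
        self._counter = itertools.count(1)

    def __call__(self, prompt, temperature=1.0):
        return f'candidate {next(self._counter)} for: {prompt}'


def generate_candidates(model, record, n=3, temperature=1.0,
                        input_key='intent', output_key='responses'):
    # Records without the input field pass through unchanged.
    if input_key in record:
        record[output_key] = [model(record[input_key], temperature=temperature)
                              for _ in range(n)]
    return record


out = generate_candidates(StubModel(), {'intent': 'book a hotel'}, n=3, temperature=0.8)
skipped = generate_candidates(StubModel(), {'query': 'no intent field'})
print(out['responses'])
print(skipped)
```

The model is called n times with the same input, so diversity among the candidates comes only from sampling; a temperature near 0 would likely produce near-identical responses.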

`ResponseEvaluator`

Bases: PreferenceOps

Preference operator: response evaluator.

Evaluates multiple candidate responses for the same instruction and outputs a score list, which can be used to build chosen/rejected pairs.

Scoring dimensions (total 10):

Helpfulness: 4
Truthfulness: 3
Fluency: 3

Notes:

Internally uses a model plus a JSON formatter; each evaluation is expected to return a dict with total_score.
If the model output is not a dict, a warning is logged and that response scores 0; if the dict lacks total_score, the score defaults to 0 without a warning.

Parameters:

model –

a LazyLLM model object (required); it is shared via share().
input_key (str, default: 'content' ) –

instruction/raw content field name, default 'content'.
response_key (str, default: 'responses' ) –

candidate response list field name, default 'responses'.
output_key (str, default: 'evaluation' ) –

output score list field name, default 'evaluation'.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.preference_ops import ResponseEvaluator

op = ResponseEvaluator(model=model, input_key='intent', response_key='responses', output_key='evaluation')
data = {
    'intent': {'intent': 'book a hotel'},
    'responses': [
        'I can help you book a hotel in Beijing.',
        'Here are some hotels for you.'
    ],
}
print(op(data))
# [{
#   'intent': {'intent': 'book a hotel'},
#   'responses': [
#     'I can help you book a hotel in Beijing.',
#     'Here are some hotels for you.'
#   ],
#   'evaluation': [10, 8]
# }]

Source code in lazyllm/tools/data/operators/preference_ops.py

class ResponseEvaluator(PreferenceOps):
    """Preference operator: response evaluator.

Evaluates multiple candidate responses for the same instruction and outputs a score list, which can be used to build chosen/rejected pairs.

Scoring dimensions (total 10):

- Helpfulness: 4
- Truthfulness: 3
- Fluency: 3

Notes:

- Internally uses a model plus a JSON formatter; each evaluation is expected to return a dict with total_score.
- If the model output is not a dict, a warning is logged and that response scores 0; if the dict lacks total_score, the score defaults to 0 without a warning.

Args:
    model: a LazyLLM model object (required), will be shared via share().
    input_key (str): instruction/raw content field name, default 'content'.
    response_key (str): candidate response list field name, default 'responses'.
    output_key (str): output score list field name, default 'evaluation'.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.preference_ops import ResponseEvaluator

    op = ResponseEvaluator(model=model, input_key='intent', response_key='responses', output_key='evaluation')
    data = {
        'intent': {'intent': 'book a hotel'},
        'responses': [
            'I can help you book a hotel in Beijing.',
            'Here are some hotels for you.'
        ],
    }
    print(op(data))
    # [{
    #   'intent': {'intent': 'book a hotel'},
    #   'responses': [
    #     'I can help you book a hotel in Beijing.',
    #     'Here are some hotels for you.'
    #   ],
    #   'evaluation': [10, 8]
    # }]
    ```
    """
    def __init__(self, model=None, input_key='content', response_key='responses', output_key='evaluation', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.response_key = response_key
        self.output_key = output_key
        sys_prompt = (
            '你是一个专业的回复评测判官。请针对用户提供的指令和回复，从以下三个维度进行打分，总分为 10 分：\n'
            '1. 有用性 (Helpfulness): 满分 4 分。回复是否解决了用户的问题。\n'
            '2. 真实性 (Truthfulness): 满分 3 分。回复内容是否准确、无误导。\n'
            '3. 流畅度 (Fluency): 满分 3 分。回复是否自然、逻辑清晰。\n'
            '请先给出详细的理由 (Rationale)，然后以 JSON 格式输出各项得分及总分。\n'
            '输出示例：\n'
            '{\n'
            '  "rationale": "回复简洁且准确...",\n'
            '  "scores": {"helpfulness": 4, "truthfulness": 3, "fluency": 3},\n'
            '  "total_score": 10\n'
            '}'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.input_key in data and self.response_key in data:
            data[self.output_key] = self.evaluate(data[self.input_key], data[self.response_key])
        return data

    def evaluate(self, instruction, responses):
        scores = []
        for resp in responses:
            prompt = (
                f'指令: {instruction}\n\n'
                f'回复: {resp}\n\n'
                '请对上述回复进行打分。'
            )
            res = self.model(prompt)
            if isinstance(res, dict):
                scores.append(res.get('total_score', 0))
            else:
                LOG.warning(f'Failed to extract total_score from response: {res}')
                scores.append(0)
        return scores
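
The fallback rules in evaluate are easy to misread, so here is a small standalone sketch of the score extraction alone. The function name extract_scores, the logger name, and the sample outputs are assumptions for illustration; the branching follows the source above.

```python
import logging

LOG = logging.getLogger('evaluator_sketch')


def extract_scores(model_outputs):
    """Map already-parsed evaluator outputs to a list of scores."""
    scores = []
    for res in model_outputs:
        if isinstance(res, dict):
            # A dict without 'total_score' silently scores 0.
            scores.append(res.get('total_score', 0))
        else:
            # Anything that is not a dict is logged and scored 0.
            LOG.warning('Failed to extract total_score from response: %s', res)
            scores.append(0)
    return scores


outputs = [
    {'rationale': 'concise and accurate',
     'scores': {'helpfulness': 4, 'truthfulness': 3, 'fluency': 3},
     'total_score': 10},
    {'rationale': 'total is missing'},
    'not JSON at all',
]
scores = extract_scores(outputs)
print(scores)  # [10, 0, 0]
```

Because a failed evaluation and a genuinely poor response both end up as 0, a downstream PreferencePairConstructor cannot tell them apart; checking the logs for warnings is one way to spot parsing failures.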

Tool-Use Operators

`lazyllm.tools.data.operators.tool_use_ops`

`ChainedLogicAssembler`

Bases: ToolUseOps

Tool-use data operator: sequential task generator.

Given a list of atomic tasks, generates successor relationships and composed tasks to form linear or dependency-aware task chains.

Typical JSON structure:

items: list of dicts:
  task: current atomic task
  next_task: its successor task
  composed_task: description combining task and next_task

Parameters:

model –

a LazyLLM model object (required).
input_key (str, default: 'atomic_tasks' ) –

input atomic task field name, default 'atomic_tasks'.
output_key (str, default: 'sequential_tasks' ) –

output sequential task list field name, default 'sequential_tasks'.
system_prompt (str | None, default: None ) –

optional system prompt.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ChainedLogicAssembler

atomic_tasks = [
    {'task': '获取出发地与目的地'},
    {'task': '确认出行日期'},
    {'task': '筛选符合条件的车次'},
]
op = ChainedLogicAssembler(model=model, input_key='atomic_tasks', output_key='sequential_tasks')
print(op({'atomic_tasks': atomic_tasks}))
# {
#   'atomic_tasks': [...],
#   'sequential_tasks': [
#     {'task': '获取出发地与目的地', 'next_task': '确认出行日期', 'composed_task': '先获取站点再确认日期'},
#     {'task': '确认出行日期', 'next_task': '筛选符合条件的车次', 'composed_task': '在已知日期基础上筛选车次'},
#     ...
#   ]
# }

Source code in lazyllm/tools/data/operators/tool_use_ops.py

class ChainedLogicAssembler(ToolUseOps):
    """Tool-use data operator: sequential task generator.

Given a list of atomic tasks, generates successor relationships and composed tasks to form linear or dependency-aware task chains.

Typical JSON structure:

- items: list of dicts:
  - task: current atomic task
  - next_task: its successor task
  - composed_task: description combining task and next_task

Args:
    model: a LazyLLM model object (required).
    input_key (str): input atomic task field name, default 'atomic_tasks'.
    output_key (str): output sequential task list field name, default 'sequential_tasks'.
    system_prompt (str|None): optional system prompt.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import ChainedLogicAssembler

    atomic_tasks = [
        {'task': '获取出发地与目的地'},
        {'task': '确认出行日期'},
        {'task': '筛选符合条件的车次'},
    ]
    op = ChainedLogicAssembler(model=model, input_key='atomic_tasks', output_key='sequential_tasks')
    print(op({'atomic_tasks': atomic_tasks}))
    # {
    #   'atomic_tasks': [...],
    #   'sequential_tasks': [
    #     {'task': '获取出发地与目的地', 'next_task': '确认出行日期', 'composed_task': '先获取站点再确认日期'},
    #     {'task': '确认出行日期', 'next_task': '筛选符合条件的车次', 'composed_task': '在已知日期基础上筛选车次'},
    #     ...
    #   ]
    # }
    ```
    """
    def __init__(
        self, model=None, input_key='atomic_tasks', output_key='sequential_tasks', system_prompt=None, **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个任务编排助手。你的任务是根据原子任务集合，生成：\n'
            '1) 每个任务的后继任务（next_task）\n'
            '2) 由两者组合形成的组合任务（composed_task）\n'
            '只输出 JSON，不要输出任何额外文本。\n'
            'JSON 结构：\n'
            '{\n'
            '  "items": [\n'
            '    {"task": "...", "next_task": "...", "composed_task": "..."}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        tasks = data.get(self.input_key, None)
        if not tasks:
            data[self.output_key] = []
            return data
        tasks_text = json.dumps(tasks, ensure_ascii=False) if not isinstance(tasks, str) else tasks
        instruction = f'原子任务列表：\n{tasks_text}\n\n请生成后继与组合任务并输出 JSON。'
        parsed = self.model(instruction)
        items = parsed.get('items') if isinstance(parsed, dict) else None
        data[self.output_key] = items if isinstance(items, list) else (parsed if parsed else [])
        return data
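
The last lines of forward decide what ends up in the output field when the model does not return the expected {"items": [...]} shape. A minimal sketch of that normalization, isolated as a plain function (normalize_items is a hypothetical name, and the inputs are made up):

```python
def normalize_items(parsed):
    """Mirror the fallback used when writing the output field."""
    items = parsed.get('items') if isinstance(parsed, dict) else None
    # Prefer the 'items' list; otherwise keep any truthy parsed value as-is.
    return items if isinstance(items, list) else (parsed if parsed else [])


well_formed = normalize_items(
    {'items': [{'task': 'a', 'next_task': 'b', 'composed_task': 'a then b'}]}
)
bare_list = normalize_items([{'task': 'a'}])
wrong_shape = normalize_items({'result': 'oops'})
empty = normalize_items(None)

print(well_formed)  # [{'task': 'a', 'next_task': 'b', 'composed_task': 'a then b'}]
print(bare_list)    # [{'task': 'a'}]
print(wrong_shape)  # {'result': 'oops'}
print(empty)        # []
```

Note the third case: a dict without an 'items' list is passed through unchanged, so the output field is not guaranteed to be a list. Consumers that iterate over it may want an isinstance check first.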

`ContextualBeacon`

Bases: ToolUseOps

Tool-use data operator: scenario extractor.

Extracts high-level scenario information from conversation text and writes a structured JSON object into the output field. If the input field is missing or empty, the output field is set to None.

Typical JSON structure:

scene: one-sentence scenario description
domain: domain/topic
user_profile: user role/profile (optional)
assistant_goal: goal the assistant should achieve
constraints: list of constraints
key_entities: list of key entities

Parameters:

model –

a LazyLLM model object (required), shared and wrapped with a JSON formatter.
input_key (str, default: 'content' ) –

input conversation field name, default 'content'.
output_key (str, default: 'scenario' ) –

output scenario field name, default 'scenario'.
system_prompt (str | None, default: None ) –

optional custom system prompt, defaults to a built-in Chinese prompt.
**kwargs –

extra args passed to the base operator (e.g. _max_workers, _save_data).

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ContextualBeacon

op = ContextualBeacon(model=model, input_key='content', output_key='scenario')
item = {
    'content': 'User: 我想订一张从北京到上海的高铁票，下午出发最好。\nAssistant: 好的，请问具体日期？'
}
print(op(item))

# Output Example:
# {
#   'content': 'User: 我想订一张从北京到上海的高铁票，下午出发最好。\nAssistant: 好的，请问具体日期？',
#   'scenario': {
#     'scene': '用户咨询高铁购票服务',
#     'domain': '出行/购票',
#     'user_profile': '普通出行乘客',
#     'assistant_goal': '帮助用户完成车次与时间筛选并完成购票',
#     'constraints': ['出发地为北京', '目的地为上海', '尽量下午出发'],
#     'key_entities': ['北京', '上海', '高铁', '下午']
#   }
# }

Source code in lazyllm/tools/data/operators/tool_use_ops.py

class ContextualBeacon(ToolUseOps):
    """Tool-use data operator: scenario extractor.

Extracts high-level scenario information from a conversation text and writes a structured JSON object into the output field.

Typical JSON structure:

- scene: one-sentence scenario description
- domain: domain/topic
- user_profile: user role/profile (optional)
- assistant_goal: goal the assistant should achieve
- constraints: list of constraints
- key_entities: list of key entities

Args:
    model: a LazyLLM model object (required), shared and wrapped with a JSON formatter.
    input_key (str): input conversation field name, default 'content'.
    output_key (str): output scenario field name, default 'scenario'.
    system_prompt (str|None): optional custom system prompt, defaults to a built-in Chinese prompt.
    **kwargs: extra args passed to the base operator (e.g. _max_workers, _save_data).


Examples:

    from lazyllm.tools.data.operators.tool_use_ops import ContextualBeacon

    op = ContextualBeacon(model=model, input_key='content', output_key='scenario')
    item = {
        'content': 'User: 我想订一张从北京到上海的高铁票，下午出发最好。\\nAssistant: 好的，请问具体日期？'
    }
    print(op(item))

    # Output Example:
    # {
    #   'content': 'User: 我想订一张从北京到上海的高铁票，下午出发最好。\\nAssistant: 好的，请问具体日期？',
    #   'scenario': {
    #     'scene': '用户咨询高铁购票服务',
    #     'domain': '出行/购票',
    #     'user_profile': '普通出行乘客',
    #     'assistant_goal': '帮助用户完成车次与时间筛选并完成购票',
    #     'constraints': ['出发地为北京', '目的地为上海', '尽量下午出发'],
    #     'key_entities': ['北京', '上海', '高铁', '下午']
    #   }
    # }
    """
    def __init__(self, model=None, input_key='content', output_key='scenario', system_prompt=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个对话场景分析助手。你的任务是从对话内容中提取可用于数据生成的场景信息。\n'
            '只输出 JSON，不要输出任何额外文本。\n'
            'JSON 结构：\n'
            '{\n'
            '  "scene": "一句话场景描述",\n'
            '  "domain": "领域/主题",\n'
            '  "user_profile": "用户角色/背景（可为空）",\n'
            '  "assistant_goal": "助手应完成的目标",\n'
            '  "constraints": ["约束1","约束2"],\n'
            '  "key_entities": ["关键实体1","关键实体2"]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        content = data.get(self.input_key, '')
        if not content:
            data[self.output_key] = None
            return data
        instruction = f'对话内容如下：\n{content}\n\n请提取场景信息并输出 JSON。'
        parsed = self.model(instruction)
        data[self.output_key] = parsed if parsed is not None else ''
        return data
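
Since the scenario comes from a model, it can help to check its shape before handing it to later operators. The sketch below is one possible check against the "Typical JSON structure" listed above; scenario_problems and the key groupings are assumptions of this example, not part of the library.

```python
REQUIRED_KEYS = ('scene', 'domain', 'assistant_goal', 'constraints', 'key_entities')
LIST_KEYS = ('constraints', 'key_entities')  # user_profile is optional and not checked


def scenario_problems(scenario):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(scenario, dict):
        # Covers None (empty input) and '' (model returned nothing).
        return [f'expected a dict, got {type(scenario).__name__}']
    problems = [f'missing key: {key}' for key in REQUIRED_KEYS if key not in scenario]
    problems += [f'{key} should be a list' for key in LIST_KEYS
                 if key in scenario and not isinstance(scenario[key], list)]
    return problems


good = {
    'scene': 'user asks about high-speed rail tickets',
    'domain': 'travel/ticketing',
    'assistant_goal': 'help the user pick a train and buy a ticket',
    'constraints': ['depart from Beijing', 'arrive in Shanghai'],
    'key_entities': ['Beijing', 'Shanghai'],
}
ok_result = scenario_problems(good)
none_result = scenario_problems(None)
partial_result = scenario_problems({'scene': 'x', 'constraints': 'not a list'})
print(ok_result)       # []
print(none_result)     # ['expected a dict, got NoneType']
print(partial_result)
```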

`DecompositionKernel`

Bases: ToolUseOps

Tool-use data operator: atomic task generator.

Given a scenario, generates a list of fine-grained, single-goal atomic tasks, which can be used for later orchestration and tool design.

Typical JSON structure:

tasks: list of atomic task dicts:
  task: task description
  input: task input (optional)
  output: task output (optional)
  constraints: list of constraints

Parameters:

model –

a LazyLLM model object (required).
input_key (str, default: 'scenario' ) –

input scenario field name, default 'scenario'.
output_key (str, default: 'atomic_tasks' ) –

output atomic task list field name, default 'atomic_tasks'.
n (int, default: 5 ) –

maximum number of tasks to request from the model (passed as a prompt hint, not enforced on the output), default 5.
system_prompt (str | None, default: None ) –

optional system prompt.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.tool_use_ops import DecompositionKernel

scenario = {
    'scene': '用户咨询高铁购票服务',
    'assistant_goal': '帮助用户完成车次筛选并购票',
}
op = DecompositionKernel(model=model, input_key='scenario', output_key='atomic_tasks', n=4)
print(op({'scenario': scenario}))
# {
#   'scenario': {...},
#   'atomic_tasks': [
#     {'task': '获取用户出发地和目的地', 'input': '', 'output': '出发地与目的地', 'constraints': [...]},
#     {'task': '确认出行日期与大致时间', ...},
#     ...
#   ]
# }

Source code in lazyllm/tools/data/operators/tool_use_ops.py

class DecompositionKernel(ToolUseOps):
    """Tool-use data operator: atomic task generator.

Given a scenario, generates a list of fine-grained, single-goal atomic tasks, which can be used for later orchestration and tool design.

Typical JSON structure:

- tasks: list of atomic task dicts:
  - task: task description
  - input: task input (optional)
  - output: task output (optional)
  - constraints: list of constraints

Args:
    model: a LazyLLM model object (required).
    input_key (str): input scenario field name, default 'scenario'.
    output_key (str): output atomic task list field name, default 'atomic_tasks'.
    n (int): maximum number of tasks, default 5.
    system_prompt (str|None): optional system prompt.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import DecompositionKernel

    scenario = {
        'scene': '用户咨询高铁购票服务',
        'assistant_goal': '帮助用户完成车次筛选并购票',
    }
    op = DecompositionKernel(model=model, input_key='scenario', output_key='atomic_tasks', n=4)
    print(op({'scenario': scenario}))
    # {
    #   'scenario': {...},
    #   'atomic_tasks': [
    #     {'task': '获取用户出发地和目的地', 'input': '', 'output': '出发地与目的地', 'constraints': [...]},
    #     {'task': '确认出行日期与大致时间', ...},
    #     ...
    #   ]
    # }
    ```
    """
    def __init__(
        self, model=None, input_key='scenario', output_key='atomic_tasks', n=5, system_prompt=None, **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        self.n = n
        sys_prompt = system_prompt or (
            '你是一个任务分解助手。你的任务是根据给定场景，生成一组可执行的原子任务（粒度小、单目标）。\n'
            '只输出 JSON，不要输出任何额外文本。\n'
            'JSON 结构：\n'
            '{\n'
            '  "tasks": [\n'
            '    {"task": "任务描述", "input": "输入（可为空）", "output": "输出（可为空）", "constraints": ["..."]}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        scenario = data.get(self.input_key, None)
        if scenario is None or scenario == '':
            data[self.output_key] = []
            return data
        scenario_text = json.dumps(scenario, ensure_ascii=False) if not isinstance(scenario, str) else scenario
        instruction = f'场景：\n{scenario_text}\n\n请生成不超过 {self.n} 个原子任务并输出 JSON。'
        parsed = self.model(instruction)
        tasks = parsed.get('tasks') if isinstance(parsed, dict) else None
        data[self.output_key] = tasks if isinstance(tasks, list) else (parsed if parsed else [])
        return data
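
In forward, n only appears in the instruction text sent to the model, so the returned list can be longer than requested. If a hard cap matters, a caller could truncate afterwards. A minimal sketch, assuming a hypothetical helper cap_tasks applied to the operator's output record:

```python
def cap_tasks(record, key='atomic_tasks', n=5):
    """Truncate record[key] to at most n items when it is a list."""
    tasks = record.get(key)
    if isinstance(tasks, list):
        record[key] = tasks[:n]
    return record


record = {'atomic_tasks': [{'task': f'step {i}'} for i in range(1, 7)]}
capped = cap_tasks(record, n=4)
untouched = cap_tasks({'atomic_tasks': {'error': 'unexpected shape'}}, n=4)
print(len(capped['atomic_tasks']))  # 4
print(untouched)                    # {'atomic_tasks': {'error': 'unexpected shape'}}
```

Non-list values are left alone here, since the operator may write the raw parsed output when the model does not return a 'tasks' list.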

`DialogueSimulator`

Bases: ToolUseOps

Tool-use data operator: multi-turn conversation generator (with tools).

Given a composed task and a list of available functions, generates a multi-turn conversation JSON involving User, Assistant and Tool roles, suitable for tool-calling training data.

Typical JSON structure:

messages: list of dicts:
  role: 'user' | 'assistant' | 'tool'
  content: text content
  name: tool name (optional, when role == 'tool')

Parameters:

model –

a LazyLLM model object (required).
input_composition_key (str, default: 'composition_task' ) –

field name of the input composed task, default 'composition_task'.
input_functions_key (str, default: 'functions' ) –

input function list field name, default 'functions'.
output_key (str, default: 'conversation' ) –

output conversation field name, default 'conversation'.
n_turns (int, default: 6 ) –

desired number of turns (as a hint to the model), default 6.
system_prompt (str | None, default: None ) –

optional system prompt.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.tool_use_ops import DialogueSimulator

composition_task = '根据用户需求查询并推荐合适的高铁车次'
functions = [
    {
        'name': 'query_train_tickets',
        'description': '查询高铁车次',
        'args': [...],
        'returns': {...},
    }
]
op = DialogueSimulator(model=model,
                       input_composition_key='composition_task',
                       input_functions_key='functions',
                       output_key='conversation',
                       n_turns=6)
print(op({'composition_task': composition_task, 'functions': functions}))
# {
#   'composition_task': '根据用户需求查询并推荐合适的高铁车次',
#   'functions': [...],
#   'conversation': {
#     'messages': [
#       {'role': 'user', 'content': '我想订一张明天下午从北京到上海的高铁票'},
#       {'role': 'assistant', 'content': '好的，我先为您确认出发时间与车次。'},
#       {'role': 'tool', 'name': 'query_train_tickets', 'content': '{...工具返回...}'},
#       ...
#     ]
#   }
# }

Source code in lazyllm/tools/data/operators/tool_use_ops.py

class DialogueSimulator(ToolUseOps):
    """Tool-use data operator: multi-turn conversation generator (with tools).

Given a composed task and a list of available functions, generates a multi-turn conversation JSON involving User, Assistant and Tool roles, suitable for tool-calling training data.

Typical JSON structure:

- messages: list of dicts:
  - role: 'user' | 'assistant' | 'tool'
  - content: text content
  - name: tool name (optional, when role == 'tool')

Args:
    model: a LazyLLM model object (required).
    input_composition_key (str): input composition task field name, default 'composition_task'.
    input_functions_key (str): input function list field name, default 'functions'.
    output_key (str): output conversation field name, default 'conversation'.
    n_turns (int): desired number of turns (as a hint to the model), default 6.
    system_prompt (str|None): optional system prompt.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import DialogueSimulator

    composition_task = '根据用户需求查询并推荐合适的高铁车次'
    functions = [
        {
            'name': 'query_train_tickets',
            'description': '查询高铁车次',
            'args': [...],
            'returns': {...},
        }
    ]
    op = DialogueSimulator(model=model,
                           input_composition_key='composition_task',
                           input_functions_key='functions',
                           output_key='conversation',
                           n_turns=6)
    print(op({'composition_task': composition_task, 'functions': functions}))
    # {
    #   'composition_task': '根据用户需求查询并推荐合适的高铁车次',
    #   'functions': [...],
    #   'conversation': {
    #     'messages': [
    #       {'role': 'user', 'content': '我想订一张明天下午从北京到上海的高铁票'},
    #       {'role': 'assistant', 'content': '好的，我先为您确认出发时间与车次。'},
    #       {'role': 'tool', 'name': 'query_train_tickets', 'content': '{...工具返回...}'},
    #       ...
    #     ]
    #   }
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        input_composition_key='composition_task',
        input_functions_key='functions',
        output_key='conversation',
        n_turns=6,
        system_prompt=None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.task_key = input_composition_key
        self.functions_key = input_functions_key
        self.output_key = output_key
        self.n_turns = n_turns
        sys_prompt = system_prompt or (
            '你是一个多轮对话数据生成助手。你需要根据组合任务与可用函数，模拟一段多轮对话。\n'
            '对话由 User/Assistant/Tool 三种角色组成：\n'
            '- User 提出需求与补充信息\n'
            '- Assistant 规划并在适当时机调用 Tool\n'
            '- Tool 返回函数执行结果\n'
            '只输出 JSON，不要输出任何额外文本。\n'
            'JSON 结构：\n'
            '{\n'
            '  "messages": [\n'
            '    {"role":"user","content":"..."},\n'
            '    {"role":"assistant","content":"..."},\n'
            '    {"role":"tool","name":"function_name","content":"..."}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        task = data.get(self.task_key, None)
        functions = data.get(self.functions_key, None)
        if task is None or task == '':
            data[self.output_key] = []
            return data
        task_text = json.dumps(task, ensure_ascii=False) if not isinstance(task, str) else task
        functions_text = (json.dumps(functions, ensure_ascii=False) if functions is not None
                          and not isinstance(functions, str) else (functions or ''))
        instruction = (
            f'组合任务：\n{task_text}\n\n'
            f'函数列表：\n{functions_text}\n\n'
            f'请生成约 {self.n_turns} 轮对话的 messages 并输出 JSON。'
        )
        parsed = self.model(instruction)
        data[self.output_key] = parsed if parsed is not None else []
        return data
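
Generated conversations are only useful as training data if every message follows the documented shape. The following sketch shows one way to flag malformed messages; invalid_message_indices and the sample conversation are assumptions of this example, and the rules are taken from the "Typical JSON structure" above.

```python
ALLOWED_ROLES = {'user', 'assistant', 'tool'}


def invalid_message_indices(conversation):
    """Return indices of messages that break the documented shape.

    Returns None when there is no 'messages' list at all, for example the
    empty list the operator writes when the task field is missing.
    """
    messages = conversation.get('messages') if isinstance(conversation, dict) else None
    if not isinstance(messages, list):
        return None
    bad = []
    for index, message in enumerate(messages):
        valid = (
            isinstance(message, dict)
            and message.get('role') in ALLOWED_ROLES
            and isinstance(message.get('content'), str)
            # 'name' is optional; when present it must be a string.
            and isinstance(message.get('name', ''), str)
        )
        if not valid:
            bad.append(index)
    return bad


conversation = {
    'messages': [
        {'role': 'user', 'content': 'I need a train ticket for tomorrow afternoon.'},
        {'role': 'assistant', 'content': 'Let me look up the available trains.'},
        {'role': 'tool', 'name': 'query_train_tickets', 'content': '{"trains": []}'},
        {'role': 'system', 'content': 'unexpected role'},
    ]
}
bad_indices = invalid_message_indices(conversation)
no_messages = invalid_message_indices([])
print(bad_indices)  # [3]
print(no_messages)  # None
```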

`ProtocolSpecifier`

Bases: ToolUseOps

Tool-use data operator: function specification generator.

Given a composed task and its subtasks, generates a list of function specifications suitable for tool calling.

Typical JSON structure:

functions: list of dicts:
  name: function name
  description: what the function does
  args: list of argument specs with name/type/description
  returns: return type and description

Parameters:

model –

a LazyLLM model object (required).
input_composition_key (str, default: 'composition_task' ) –

field name of the input composed task, default 'composition_task'.
input_atomic_key (str, default: 'atomic_tasks' ) –

input atomic task field name, default 'atomic_tasks'.
output_key (str, default: 'functions' ) –

output function spec list field name, default 'functions'.
system_prompt (str | None, default: None ) –

optional system prompt.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ProtocolSpecifier

composition_task = '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表'
atomic_tasks = [
    {'task': '获取出发地与目的地'},
    {'task': '确认出行日期'},
    {'task': '调用车次查询接口并过滤结果'},
]
op = ProtocolSpecifier(model=model,
                       input_composition_key='composition_task',
                       input_atomic_key='atomic_tasks',
                       output_key='functions')
print(op({'composition_task': composition_task, 'atomic_tasks': atomic_tasks}))
# {
#   'composition_task': '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表',
#   'atomic_tasks': [...],
#   'functions': [
#     {
#       'name': 'query_train_tickets',
#       'description': '根据出发地、目的地与日期查询高铁车次',
#       'args': [{'name': 'from_city', 'type': 'string', ...}, ...],
#       'returns': {'type': 'TrainList', 'description': '符合条件的车次列表'}
#     },
#     ...
#   ]
# }

Source code in lazyllm/tools/data/operators/tool_use_ops.py

class ProtocolSpecifier(ToolUseOps):
    """Tool-use data operator: function specification generator.

Given a composed task and its subtasks, generates a list of function specifications suitable for tool calling.

Typical JSON structure:

- functions: list of dicts:
  - name: function name
  - description: what the function does
  - args: list of argument specs with name/type/description
  - returns: return type and description

Args:
    model: a LazyLLM model object (required).
    input_composition_key (str): input composition task field name, default 'composition_task'.
    input_atomic_key (str): input atomic task field name, default 'atomic_tasks'.
    output_key (str): output function spec list field name, default 'functions'.
    system_prompt (str|None): optional system prompt.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import ProtocolSpecifier

    composition_task = '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表'
    atomic_tasks = [
        {'task': '获取出发地与目的地'},
        {'task': '确认出行日期'},
        {'task': '调用车次查询接口并过滤结果'},
    ]
    op = ProtocolSpecifier(model=model,
                           input_composition_key='composition_task',
                           input_atomic_key='atomic_tasks',
                           output_key='functions')
    print(op({'composition_task': composition_task, 'atomic_tasks': atomic_tasks}))
    # {
    #   'composition_task': '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表',
    #   'atomic_tasks': [...],
    #   'functions': [
    #     {
    #       'name': 'query_train_tickets',
    #       'description': '根据出发地、目的地与日期查询高铁车次',
    #       'args': [{'name': 'from_city', 'type': 'string', ...}, ...],
    #       'returns': {'type': 'TrainList', 'description': '符合条件的车次列表'}
    #     },
    #     ...
    #   ]
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        input_composition_key='composition_task',
        input_atomic_key='atomic_tasks',
        output_key='functions',
        system_prompt=None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.task_key = input_composition_key
        self.subtask_key = input_atomic_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个函数设计助手。给定组合任务及其子任务，请生成一组函数规格，便于后续工具调用。\n'
            '只输出 JSON，不要输出任何额外文本。\n'
            'JSON 结构：\n'
            '{\n'
            '  "functions": [\n'
            '    {"name": "function_name", "description": "...", '
            '"args": [{"name":"...","type":"...","description":"..."}], '
            '"returns": {"type":"...","description":"..."}}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        task = data.get(self.task_key, None)
        subtasks = data.get(self.subtask_key, None)
        if task is None or task == '':
            data[self.output_key] = []
            return data
        task_text = json.dumps(task, ensure_ascii=False) if not isinstance(task, str) else task
        subtasks_text = (json.dumps(subtasks, ensure_ascii=False) if subtasks is not None
                         and not isinstance(subtasks, str) else (subtasks or ''))
        instruction = (
            f'组合任务：\n{task_text}\n\n'
            f'子任务（可选）：\n{subtasks_text}\n\n'
            '请生成函数列表并输出 JSON。'
        )
        parsed = self.model(instruction)
        funcs = parsed.get('functions') if isinstance(parsed, dict) else None
        data[self.output_key] = funcs if isinstance(funcs, list) else (parsed if parsed else [])
        return data

`ScenarioDiverger`

Bases: ToolUseOps

Tool-use data operator: scenario expander.

Given a base scenario, generates multiple alternative scenarios that are semantically related but differ in details, to enrich data diversity.

Typical JSON structure:

scenarios: list of scenario dicts, each with fields like scene/domain/assistant_goal/constraints/key_entities.
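
The way the operator unwraps the model's JSON can be reproduced outside the class. The sketch below mirrors the unwrapping in the `forward` source of this operator; the function name `extract_scenarios` and the optional cap are assumptions added for illustration, since `n` is only passed to the model through the instruction text and the listed source does not truncate the result:

```python
def extract_scenarios(parsed, n=None):
    """Unwrap the 'scenarios' list from a parsed model response.

    Mirrors the fallback chain in ScenarioDiverger.forward; the cap on
    the list length is a caller-side addition, not operator behavior.
    """
    scenarios = parsed.get('scenarios') if isinstance(parsed, dict) else None
    result = scenarios if isinstance(scenarios, list) else (parsed if parsed else [])
    # Optional cap, applied only when the result is a list
    if n is not None and isinstance(result, list):
        result = result[:n]
    return result


print(extract_scenarios({'scenarios': [{'scene': 'a'}, {'scene': 'b'}]}, n=1))
print(extract_scenarios(None))
```

Note that a response without a `scenarios` list is passed through unchanged, so downstream code may need to handle a dict as well as a list.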

Parameters:

model –

a LazyLLM model object (required).
input_key (str, default: 'scenario' ) –

input scenario field name, default 'scenario' (dict or str).
output_key (str, default: 'expanded_scenarios' ) –

output expanded scenario list field name, default 'expanded_scenarios'.
n (int, default: 3 ) –

maximum number of scenarios to generate, default 3.
system_prompt (str | None, default: None ) –

optional system prompt.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ScenarioDiverger

base = {
    'scene': '用户咨询高铁购票服务',
    'domain': '出行/购票',
    'assistant_goal': '帮助用户完成车次筛选并购票',
}
op = ScenarioDiverger(model=model, input_key='scenario', output_key='expanded_scenarios', n=3)
print(op({'scenario': base}))
# {
#   'scenario': {...},
#   'expanded_scenarios': [
#     {'scene': '用户预订跨城商务出差火车票', ...},
#     {'scene': '用户为家人购买回乡火车票', ...},
#     ...
#   ]
# }

Source code in lazyllm/tools/data/operators/tool_use_ops.py

class ScenarioDiverger(ToolUseOps):
    """Tool-use data operator: scenario expander.

Given a base scenario, generates multiple alternative scenarios that are semantically related but differ in details, to enrich data diversity.

Typical JSON structure:

- scenarios: list of scenario dicts, each with fields like scene/domain/assistant_goal/constraints/key_entities.

Args:
    model: a LazyLLM model object (required).
    input_key (str): input scenario field name, default 'scenario' (dict or str).
    output_key (str): output expanded scenario list field name, default 'expanded_scenarios'.
    n (int): maximum number of scenarios to generate, default 3.
    system_prompt (str|None): optional system prompt.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import ScenarioDiverger

    base = {
        'scene': '用户咨询高铁购票服务',
        'domain': '出行/购票',
        'assistant_goal': '帮助用户完成车次筛选并购票',
    }
    op = ScenarioDiverger(model=model, input_key='scenario', output_key='expanded_scenarios', n=3)
    print(op({'scenario': base}))
    # {
    #   'scenario': {...},
    #   'expanded_scenarios': [
    #     {'scene': '用户预订跨城商务出差火车票', ...},
    #     {'scene': '用户为家人购买回乡火车票', ...},
    #     ...
    #   ]
    # }
    ```
    """
    def __init__(
        self, model=None, input_key='scenario', output_key='expanded_scenarios', n=3, system_prompt=None, **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        self.n = n
        sys_prompt = system_prompt or (
            '你是一个场景扩展助手。你的任务是基于给定的原始场景，生成多个可替代的新场景，语义相关但细节不同。\n'
            '只输出 JSON，不要输出任何额外文本。\n'
            'JSON 结构：\n'
            '{\n'
            '  "scenarios": [\n'
            '    {"scene": "...", "domain": "...", "assistant_goal": "...", "constraints": ["..."], '
            '"key_entities": ["..."]}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        base = data.get(self.input_key, None)
        if base is None or base == '':
            data[self.output_key] = []
            return data
        base_text = json.dumps(base, ensure_ascii=False) if not isinstance(base, str) else base
        instruction = f'原始场景：\n{base_text}\n\n请生成 {self.n} 个替代场景并输出 JSON。'
        parsed = self.model(instruction)
        scenarios = parsed.get('scenarios') if isinstance(parsed, dict) else None
        data[self.output_key] = scenarios if isinstance(scenarios, list) else (parsed if parsed else [])
        return data

`TopologyArchitect`

Bases: ToolUseOps

Tool-use data operator: parallel/sequential/hybrid task combination generator.

Given atomic tasks, generates three kinds of task compositions:

parallel_tasks: tasks that can be executed in parallel
sequential_tasks: tasks with explicit ordering dependencies
hybrid_tasks: compositions mixing parallel and sequential relations
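
Since the operator stores the parsed model output as-is whenever it is not None, a consumer may want to guarantee that all three keys are present. A minimal sketch, assuming a hypothetical helper `normalize_compositions` that is not part of LazyLLM:

```python
COMPOSITION_KEYS = ('parallel_tasks', 'sequential_tasks', 'hybrid_tasks')


def normalize_compositions(value):
    """Return a dict holding exactly the three composition lists.

    Hypothetical post-processing helper: missing or non-list entries
    are replaced by empty lists.
    """
    if not isinstance(value, dict):
        value = {}
    return {key: value[key] if isinstance(value.get(key), list) else []
            for key in COMPOSITION_KEYS}


print(normalize_compositions({'parallel_tasks': ['compare plans at once']}))
```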

Parameters:

model –

a LazyLLM model object (required).
input_key (str, default: 'atomic_tasks' ) –

input atomic task field name, default 'atomic_tasks'.
output_key (str, default: 'para_seq_tasks' ) –

output composition field name, default 'para_seq_tasks'.
system_prompt (str | None, default: None ) –

optional system prompt.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.tool_use_ops import TopologyArchitect

atomic_tasks = [
    {'task': '收集出行需求'},
    {'task': '查询可选车次'},
    {'task': '对比价格与时间'},
    {'task': '完成下单支付'},
]
op = TopologyArchitect(model=model, input_key='atomic_tasks', output_key='para_seq_tasks')
print(op({'atomic_tasks': atomic_tasks}))
# {
#   'atomic_tasks': [...],
#   'para_seq_tasks': {
#     'parallel_tasks': ['同时查询不同日期/车次方案', ...],
#     'sequential_tasks': ['先确认日期再选车次', ...],
#     'hybrid_tasks': ['并行对比多个方案后统一决策并下单', ...]
#   }
# }

Source code in lazyllm/tools/data/operators/tool_use_ops.py

class TopologyArchitect(ToolUseOps):
    """Tool-use data operator: parallel/sequential/hybrid task combination generator.

Given atomic tasks, generates three kinds of task compositions:

- parallel_tasks: tasks that can be executed in parallel
- sequential_tasks: tasks with explicit ordering dependencies
- hybrid_tasks: compositions mixing parallel and sequential relations

Args:
    model: a LazyLLM model object (required).
    input_key (str): input atomic task field name, default 'atomic_tasks'.
    output_key (str): output composition field name, default 'para_seq_tasks'.
    system_prompt (str|None): optional system prompt.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import TopologyArchitect

    atomic_tasks = [
        {'task': '收集出行需求'},
        {'task': '查询可选车次'},
        {'task': '对比价格与时间'},
        {'task': '完成下单支付'},
    ]
    op = TopologyArchitect(model=model, input_key='atomic_tasks', output_key='para_seq_tasks')
    print(op({'atomic_tasks': atomic_tasks}))
    # {
    #   'atomic_tasks': [...],
    #   'para_seq_tasks': {
    #     'parallel_tasks': ['同时查询不同日期/车次方案', ...],
    #     'sequential_tasks': ['先确认日期再选车次', ...],
    #     'hybrid_tasks': ['并行对比多个方案后统一决策并下单', ...]
    #   }
    # }
    ```
    """
    def __init__(
        self, model=None, input_key='atomic_tasks', output_key='para_seq_tasks', system_prompt=None, **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个任务组合生成助手。你的任务是基于原子任务生成三类任务：\n'
            '1) 并行任务（parallel_tasks）：可以同时进行的任务组合\n'
            '2) 后继任务（sequential_tasks）：有明确先后依赖的任务组合\n'
            '3) 组合任务（hybrid_tasks）：包含并行与先后依赖的混合组合\n'
            '只输出 JSON，不要输出任何额外文本。\n'
            'JSON 结构：\n'
            '{\n'
            '  "parallel_tasks": ["..."],\n'
            '  "sequential_tasks": ["..."],\n'
            '  "hybrid_tasks": ["..."]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        tasks = data.get(self.input_key, None)
        if not tasks:
            data[self.output_key] = {'parallel_tasks': [], 'sequential_tasks': [], 'hybrid_tasks': []}
            return data
        tasks_text = json.dumps(tasks, ensure_ascii=False) if not isinstance(tasks, str) else tasks
        instruction = f'原子任务列表：\n{tasks_text}\n\n请生成三类任务并输出 JSON。'
        parsed = self.model(instruction)
        default_val = {'parallel_tasks': [], 'sequential_tasks': [], 'hybrid_tasks': []}
        data[self.output_key] = parsed if parsed is not None else default_val
        return data

`ViabilitySieve`

Bases: ToolUseOps

Tool-use data operator: composition task feasibility filter.

Evaluates a list of composed tasks for feasibility and completeness, and filters out invalid ones.

Expected intermediate JSON from the model:

items: list of dicts with composed_task, is_valid, reason, etc.

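On output, only composed_task values whose is_valid is exactly true are kept. If no item passes this check (for example, when every task is judged invalid or the model output does not match the schema), the operator falls back to the raw items list, then to the raw parsed result, and finally to an empty list.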
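
The filtering and fallback order can be tried in isolation. The sketch below mirrors the post-processing in the `forward` source of this operator; the standalone function name `filter_valid` is an assumption used only for illustration:

```python
def filter_valid(parsed):
    """Keep composed_task values whose is_valid is exactly True.

    Falls back to the raw items list, then the raw parsed value,
    then an empty list, in that order.
    """
    items = parsed.get('items') if isinstance(parsed, dict) else None
    valid = []
    if isinstance(items, list):
        for it in items:
            if isinstance(it, dict) and it.get('is_valid') is True and it.get('composed_task'):
                valid.append(it['composed_task'])
    if valid:
        return valid
    return items if isinstance(items, list) else (parsed if parsed else [])


mixed = {'items': [
    {'composed_task': 'collect cities, then filter trains', 'is_valid': True, 'reason': 'complete'},
    {'composed_task': 'recommend a random train', 'is_valid': False, 'reason': 'ignores input'},
]}
print(filter_valid(mixed))
```

One consequence worth noting: when every item is rejected, the result is the list of item dicts rather than an empty list, so callers that expect plain strings may need to check the element type.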

Parameters:

model –

a LazyLLM model object (required).
input_composition_key (str, default: 'composition_tasks' ) –

input composition task field name, default 'composition_tasks'.
input_atomic_key (str, default: 'atomic_tasks' ) –

input atomic task field name (optional), default 'atomic_tasks'.
output_key (str, default: 'filtered_composition_tasks' ) –

output filtered composition task field name, default 'filtered_composition_tasks'.
system_prompt (str | None, default: None ) –

optional system prompt.
**kwargs –

extra args passed to the base operator.

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ViabilitySieve

composition_tasks = ['先获取出发地和目的地再筛选车次', '直接随机推荐一个车次']
atomic_tasks = [
    {'task': '获取出发地与目的地'}, {'task': '确认出行日期'}, {'task': '筛选符合条件的车次'}
]
op = ViabilitySieve(model=model,
                    input_composition_key='composition_tasks',
                    input_atomic_key='atomic_tasks',
                    output_key='filtered_composition_tasks')
print(op({'composition_tasks': composition_tasks, 'atomic_tasks': atomic_tasks}))
# {
#   'composition_tasks': [...],
#   'atomic_tasks': [...],
#   'filtered_composition_tasks': ['先获取出发地和目的地再筛选车次', ...]
# }

Source code in lazyllm/tools/data/operators/tool_use_ops.py

class ViabilitySieve(ToolUseOps):
    """Tool-use data operator: composition task feasibility filter.

Evaluates a list of composed tasks for feasibility and completeness, and filters out invalid ones.

Expected intermediate JSON from the model:

- items: list of dicts with composed_task, is_valid, reason, etc.

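On output, only composed_task values whose is_valid is exactly true are kept. If no item passes this check (for example, when every task is judged invalid or the model output does not match the schema), the operator falls back to the raw items list, then to the raw parsed result, and finally to an empty list.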

Args:
    model: a LazyLLM model object (required).
    input_composition_key (str): input composition task field name, default 'composition_tasks'.
    input_atomic_key (str): input atomic task field name (optional), default 'atomic_tasks'.
    output_key (str): output filtered composition task field name, default 'filtered_composition_tasks'.
    system_prompt (str|None): optional system prompt.
    **kwargs: extra args passed to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import ViabilitySieve

    composition_tasks = ['先获取出发地和目的地再筛选车次', '直接随机推荐一个车次']
    atomic_tasks = [
        {'task': '获取出发地与目的地'}, {'task': '确认出行日期'}, {'task': '筛选符合条件的车次'}
    ]
    op = ViabilitySieve(model=model,
                        input_composition_key='composition_tasks',
                        input_atomic_key='atomic_tasks',
                        output_key='filtered_composition_tasks')
    print(op({'composition_tasks': composition_tasks, 'atomic_tasks': atomic_tasks}))
    # {
    #   'composition_tasks': [...],
    #   'atomic_tasks': [...],
    #   'filtered_composition_tasks': ['先获取出发地和目的地再筛选车次', ...]
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        input_composition_key='composition_tasks',
        input_atomic_key='atomic_tasks',
        output_key='filtered_composition_tasks',
        system_prompt=None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.composition_key = input_composition_key
        self.subtask_key = input_atomic_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个任务可运行性评审助手。你需要判断组合任务是否具备可行性与完备性：\n'
            '- 可行性：子任务是否能支撑组合任务目标\n'
            '- 完备性：是否缺少关键步骤或前置条件\n'
            '只输出 JSON，不要输出任何额外文本。\n'
            'JSON 结构：\n'
            '{\n'
            '  "items": [\n'
            '    {"composed_task": "...", "is_valid": true, "reason": "..."}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        composition_tasks = data.get(self.composition_key, None)
        subtasks = data.get(self.subtask_key, None)
        if not composition_tasks:
            data[self.output_key] = []
            return data
        composition_text = (json.dumps(composition_tasks, ensure_ascii=False)
                            if not isinstance(composition_tasks, str) else composition_tasks)
        subtasks_text = (json.dumps(subtasks, ensure_ascii=False) if subtasks is not None
                         and not isinstance(subtasks, str) else (subtasks or ''))
        instruction = (
            f'组合任务：\n{composition_text}\n\n'
            f'子任务（可选）：\n{subtasks_text}\n\n'
            '请逐条判断并输出 JSON。'
        )
        parsed = self.model(instruction)
        items = parsed.get('items') if isinstance(parsed, dict) else None
        valid = []
        if isinstance(items, list):
            for it in items:
                if isinstance(it, dict) and it.get('is_valid') is True and it.get('composed_task'):
                    valid.append(it.get('composed_task'))
        data[self.output_key] = valid if valid else (
            items if isinstance(items, list) else (parsed if parsed else [])
        )
        return data

Text2SQL Operators

`lazyllm.tools.data.operators.text2sql_ops`

`SQLConsensusUnifier`

Bases: Text2SQLOps

Text2SQL data operator: SQLConsensusUnifier.

Given multiple CoT traces (cot_responses), parses SQL from each, executes them, and selects the best CoT/SQL pair based on execution consistency and success.

Behavior:

Parses SQL from each CoT using the same logic as SQLForge.
Calls database_manager.batch_execute_queries to get execution results and signatures.
Uses a voting strategy (_vote_select) to pick the best candidate, then:
sets output_cot_key (default 'cot_reasoning') to the winning CoT,
overwrites data['SQL'] with the winning SQL.
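
The body of `_vote_select` is not shown on this page, so the following is only a sketch of one plausible voting rule, not the library's implementation. It assumes that candidates whose queries executed successfully are grouped by result signature, that the largest group wins, and that the `'shortest_sql'` tie-breaker (the value set in the constructor) picks the shortest SQL inside that group:

```python
from collections import defaultdict


def vote_select(candidates, tie_breaker='shortest_sql'):
    """Pick one candidate by majority vote over execution signatures.

    Illustrative assumption only; the real _vote_select may differ.
    """
    valid = [c for c in candidates if c.get('is_valid')]
    if not valid:
        return None
    groups = defaultdict(list)
    for cand in valid:
        groups[cand['signature']].append(cand)
    # The largest group of identical results wins; on equal sizes,
    # max() keeps the group that was seen first.
    winners = max(groups.values(), key=len)
    if tie_breaker == 'shortest_sql':
        return min(winners, key=lambda c: len(c['sql']))
    return winners[0]


candidates = [
    {'cot': 'a', 'sql': "SELECT count(*) FROM orders WHERE status = 'paid'", 'signature': '5', 'is_valid': True},
    {'cot': 'b', 'sql': "SELECT count(id) FROM orders WHERE status = 'paid'", 'signature': '5', 'is_valid': True},
    {'cot': 'c', 'sql': 'SELECT count(*) FROM orders', 'signature': '9', 'is_valid': True},
]
best = vote_select(candidates)
print(best['cot'], best['sql'])
```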

Parameters:

database_manager –

query execution provider (required) implementing batch_execute_queries(list[(db_id, sql)]).
**kwargs –

extra args forwarded to the base operator.
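
For local experiments, a stand-in for `database_manager` can be built on the standard-library `sqlite3` module. This is a minimal sketch under stated assumptions: the class `InMemoryDatabaseManager` and the `QueryResult` fields `rows` and `error` are hypothetical; the only things taken from the source are the method name `batch_execute_queries`, its `(db_id, sql)` input tuples, and the `success` attribute read from each result. What the signature computation reads from a result is not shown here, so a real provider may need more fields.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class QueryResult:
    success: bool                              # attribute read by the operator
    rows: list = field(default_factory=list)   # hypothetical field
    error: str = ''                            # hypothetical field


class InMemoryDatabaseManager:
    """Hypothetical provider exposing batch_execute_queries."""

    def __init__(self, connections):
        self.connections = connections  # maps db_id -> sqlite3.Connection

    def batch_execute_queries(self, queries):
        results = []
        for db_id, sql in queries:
            try:
                rows = self.connections[db_id].execute(sql).fetchall()
                results.append(QueryResult(success=True, rows=rows))
            except Exception as exc:
                # Failed queries are reported, not raised
                results.append(QueryResult(success=False, error=str(exc)))
        return results


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (id INTEGER, status TEXT)')
conn.executemany('INSERT INTO orders VALUES (?, ?)', [(1, 'paid'), (2, 'open')])
manager = InMemoryDatabaseManager({'db_1': conn})
results = manager.batch_execute_queries([
    ('db_1', "SELECT count(*) FROM orders WHERE status = 'paid'"),
    ('db_1', 'SELECT * FROM missing_table'),
])
print([r.success for r in results])
```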

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLConsensusUnifier

op = SQLConsensusUnifier(database_manager=database_manager)
item = {
    'db_id': 'db_1',
    'cot_responses': [
        "...CoT + ```sql SELECT count(*) FROM orders WHERE status = 'paid'```",
        '...CoT + ```sql SELECT count(*) FROM orders```',
    ]
}
res = op(item)
print(res['cot_reasoning'][:200])
print(res['SQL'])
# "...首先识别需要统计已支付订单数量，其次在 orders 表中过滤 status = 'paid' ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
# "SELECT count(*) FROM orders WHERE status = 'paid';"

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class SQLConsensusUnifier(Text2SQLOps):
    """Text2SQL data operator: SQLConsensusUnifier.

Given multiple CoT traces (cot_responses), parses SQL from each, executes them, and selects the best CoT/SQL pair based on execution consistency and success.

Behavior:

- Parses SQL from each CoT using the same logic as SQLForge.
- Calls database_manager.batch_execute_queries to get execution results and signatures.
- Uses a voting strategy (_vote_select) to pick the best candidate, then:
  - sets output_cot_key (default 'cot_reasoning') to the winning CoT,
  - overwrites data['SQL'] with the winning SQL.

Args:
    database_manager: query execution provider (required) implementing:
        - batch_execute_queries(list[(db_id, sql)])
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLConsensusUnifier

    op = SQLConsensusUnifier(database_manager=database_manager)
    item = {
        'db_id': 'db_1',
        'cot_responses': [
            '...CoT + ```sql SELECT count(*) FROM orders WHERE status = \'paid\'```',
            '...CoT + ```sql SELECT count(*) FROM orders```',
        ]
    }
    res = op(item)
    print(res['cot_reasoning'][:200])
    print(res['SQL'])
    # "...首先识别需要统计已支付订单数量，其次在 orders 表中过滤 status = 'paid' ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
    # "SELECT count(*) FROM orders WHERE status = 'paid';"
    ```
    """
    def __init__(self, database_manager=None, **kwargs):
        super().__init__(**kwargs)
        self.database_manager = database_manager
        self.tie_breaker = 'shortest_sql'

    def forward(
        self,
        data,
        input_cot_responses_key='cot_responses',
        input_db_id_key='db_id',
        output_cot_key='cot_reasoning',
        output_sql_key='SQL',
        **kwargs,
    ):
        assert isinstance(data, dict)
        if self.database_manager is None:
            raise ValueError('database_manager is required')

        cot_responses = data.get(input_cot_responses_key, [])
        if not isinstance(cot_responses, list) or not cot_responses:
            data[output_cot_key] = ''
            return data

        db_id = data.get(input_db_id_key)
        if not db_id:
            data[output_cot_key] = ''
            return data

        candidates = []
        queries = []
        for resp in cot_responses:
            sql = _parse_sql_response(resp)
            if sql:
                queries.append((str(db_id).strip(), sql))
                candidates.append({'cot': resp, 'sql': sql})

        if not queries:
            data[output_cot_key] = ''
            return data

        try:
            query_results = self.database_manager.batch_execute_queries(queries)
            for cand, result in zip(candidates, query_results):
                cand['signature'] = _result_to_signature(result)
                cand['is_valid'] = result.success if hasattr(result, 'success') else False
        except Exception as e:
            LOG.error(f'Failed to execute queries for voting: {e}')

        best = _vote_select(candidates, self.tie_breaker)
        if best:
            data[output_cot_key] = best.get('cot', '')
            data[output_sql_key] = best.get('sql', '')
        else:
            data[output_cot_key] = ''

        return data

`SQLContextAssembler`

Bases: Text2SQLOps

Text2SQL data operator: SQLContextAssembler.

Builds prompts for downstream Text2SQL models from the database schema, the natural-language question (read from the intent field by default), and optional evidence.

Behavior:

Prefers database_manager.get_db_details(db_id); falls back to get_create_statements_and_insert_statements if not available.
Supports a custom prompt_template; otherwise uses a simple English template.
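
A custom template only needs a `build_prompt` method that accepts the four keyword arguments passed by the operator (`db_details`, `intent`, `evidence`, `db_engine`), as the `_build_prompt` source shows. The class below is a hypothetical example of such a template; its name and output layout are illustrative assumptions:

```python
class ConcisePromptTemplate:
    """Hypothetical template; only the build_prompt signature comes from the source."""

    def build_prompt(self, db_details, intent, evidence, db_engine):
        lines = [f'-- Engine: {db_engine}', db_details, f'-- Task: {intent}']
        # Evidence is optional, so the hint line is skipped when it is empty
        if evidence:
            lines.append(f'-- Hint: {evidence}')
        lines.append('Write one SQL query that answers the task.')
        return '\n'.join(lines)


template = ConcisePromptTemplate()
prompt = template.build_prompt(
    db_details='CREATE TABLE orders (id INT, status TEXT);',
    intent='How many orders are paid?',
    evidence='',
    db_engine='sqlite',
)
print(prompt)
```

An instance would then be passed as `SQLContextAssembler(database_manager=database_manager, prompt_template=ConcisePromptTemplate())`. An object without a `build_prompt` attribute is ignored and the default English template is used instead.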

Parameters:

database_manager –

schema provider (required), implementing get_db_details(db_id) (optional) and get_create_statements_and_insert_statements(db_id) as the fallback.
prompt_template –

optional custom prompt builder; used only if it exposes a build_prompt(db_details, intent, evidence, db_engine) method.
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLContextAssembler

op = SQLContextAssembler(database_manager=database_manager)
item = {
    'db_id': 'db_1',
    'intent': '有多少已支付的订单？',
    'evidence': '订单表中 status 字段标记订单状态。'
}
res = op(item)
print(res['prompt'])
# Database Schema:
# CREATE TABLE orders (id INT, status TEXT, ...);
# ...
#
# Intent: 有多少已支付的订单？
# Evidence: 订单表中 status 字段标记订单状态。
# Generate a SQL query for postgres.

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class SQLContextAssembler(Text2SQLOps):
    """Text2SQL data operator: SQLContextAssembler.

Builds prompts for downstream Text2SQL models from the database schema, the natural-language question (read from the intent field by default), and optional evidence.

Behavior:

- Prefers database_manager.get_db_details(db_id); falls back to get_create_statements_and_insert_statements if not available.
- Supports a custom prompt_template; otherwise uses a simple English template.

Args:
    database_manager: schema provider (required), implementing:
        - get_db_details(db_id) (optional)
        - get_create_statements_and_insert_statements(db_id)
    prompt_template: optional custom prompt builder; used only if it exposes a build_prompt(db_details, intent, evidence, db_engine) method.
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLContextAssembler

    op = SQLContextAssembler(database_manager=database_manager)
    item = {
        'db_id': 'db_1',
        'intent': '有多少已支付的订单？',
        'evidence': '订单表中 status 字段标记订单状态。'
    }
    res = op(item)
    print(res['prompt'])
    # Database Schema:
    # CREATE TABLE orders (id INT, status TEXT, ...);
    # ...
    #
    # Intent: 有多少已支付的订单？
    # Evidence: 订单表中 status 字段标记订单状态。
    # Generate a SQL query for postgres.
    ```
    """
    def __init__(self, database_manager=None, prompt_template=None, **kwargs):
        super().__init__(**kwargs)
        self.database_manager = database_manager
        self.prompt_template = prompt_template

    def get_create_statements_and_insert_statements(self, db_id):
        return self.database_manager.get_create_statements_and_insert_statements(db_id)

    def _build_prompt(self, db_details, intent, evidence, db_engine):
        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            return str(template.build_prompt(
                db_details=db_details,
                intent=intent,
                evidence=evidence,
                db_engine=db_engine
            ))

        return (
            f'Database Schema:\n{db_details}\n\n'
            f'Intent: {intent}\n'
            f'Evidence: {evidence}\n'
            f'Generate a SQL query for {db_engine}.'
        )

    def forward(self, data, input_intent_key='intent', input_db_id_key='db_id',
                input_evidence_key='evidence', output_prompt_key='prompt', **kwargs):
        assert isinstance(data, dict)
        if self.database_manager is None:
            raise ValueError('database_manager is required')

        db_engine = getattr(self.database_manager, 'db_type', 'unknown')
        db_id = data.get(input_db_id_key)
        intent = data.get(input_intent_key)
        evidence = data.get(input_evidence_key, '')

        if not db_id or not intent:
            LOG.warning(f'Missing db_id or intent for item: {data}')
            data[output_prompt_key] = ''
            return data

        db_id = str(db_id).replace('\n', '').replace('\r', '').strip()

        try:
            if hasattr(self.database_manager, 'get_db_details'):
                db_details = self.database_manager.get_db_details(db_id)
            else:
                create_statements, _ = self.database_manager.get_create_statements_and_insert_statements(db_id)
                db_details = '\n\n'.join([str(s) for s in create_statements])

            prompt = self._build_prompt(
                db_details=db_details,
                intent=intent,
                evidence=evidence,
                db_engine=db_engine
            )
            data[output_prompt_key] = prompt
        except Exception as e:
            LOG.error(f'Failed to generate context for db_id={db_id}: {e}')
            data[output_prompt_key] = ''

        return data

`SQLEffortRanker`

Bases: Text2SQLOps

Text2SQL data operator: SQLEffortRanker.

Classifies SQL execution difficulty by repeatedly generating SQL from a prompt, comparing each prediction to the gold SQL on the database, and counting how many generations match.

Workflow:

Uses the input prompt to generate num_generations SQL candidates, parsing SQL text from each.
Builds comparison tuples (db_id, predicted_sql, gold_sql) and calls database_manager.batch_compare_queries.
Maps the number of correct generations (cnt_true) to a difficulty label using difficulty_thresholds and difficulty_labels.
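
The mapping in the last step, and the adjustment of `num_generations`, can be checked without a model or database. The sketch below copies the logic of `classify_difficulty` and of the constructor check from the source into standalone functions; the name `effective_num_generations` is an assumption used only for illustration:

```python
def classify_difficulty(score, thresholds=(2, 5, 9),
                        labels=('extra', 'hard', 'medium', 'easy')):
    """Map the count of correct generations to a difficulty label."""
    if score == -1:
        return 'gold error'
    # The first threshold that is not exceeded decides the label
    for i, threshold in enumerate(thresholds):
        if score <= threshold:
            return labels[i]
    return labels[-1]


def effective_num_generations(num_generations, thresholds=(2, 5, 9)):
    """Raise num_generations when it does not exceed the last threshold."""
    if num_generations <= thresholds[-1]:
        return ((thresholds[-1] // 5) + 1) * 5
    return num_generations


for cnt_true in (0, 3, 9, 10):
    print(cnt_true, classify_difficulty(cnt_true))
print(effective_num_generations(8))
```

With the default configuration, 0 to 2 correct generations map to 'extra', 3 to 5 to 'hard', 6 to 9 to 'medium', and 10 or more to 'easy'; fewer correct generations therefore mean a harder item.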

Parameters:

model –

a LazyLLM model object (required).
database_manager –

provider implementing batch_compare_queries (required).
num_generations (int, default: 10 ) –

number of SQL generations per item, default 10; if it does not exceed the last threshold, it is raised to the next multiple of 5 above that threshold.
difficulty_thresholds (list[int] | None, default: None ) –

thresholds list, default [2, 5, 9].
difficulty_labels (list[str] | None, default: None ) –

label list, default ['extra', 'hard', 'medium', 'easy'].
system_prompt (str | None, default: None ) –

optional system prompt.
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLEffortRanker

op = SQLEffortRanker(model=model, database_manager=database_manager, num_generations=15)
item = {
    'db_id': 'db_1',
    'prompt': 'Database Schema: ... Question: 有多少已支付的订单？',
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"
}
res = op(item)
print(res)
# {
#   'db_id': 'db_1',
#   'prompt': 'Database Schema: ... Question: 有多少已支付的订单？',
#   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#   'sql_execution_difficulty': 'medium'
# }

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class SQLEffortRanker(Text2SQLOps):
    """Text2SQL data operator: SQLEffortRanker.

Classifies SQL execution difficulty by repeatedly generating SQL from a prompt, comparing each prediction to the gold SQL on the database, and counting how many generations match.

Workflow:

1. Uses the input prompt to generate num_generations SQL candidates, parsing SQL text from each.
2. Builds comparison tuples (db_id, predicted_sql, gold_sql) and calls database_manager.batch_compare_queries.
3. Maps the number of correct generations (cnt_true) to a difficulty label using difficulty_thresholds and difficulty_labels.

Args:
    model: a LazyLLM model object (required).
    database_manager: provider implementing batch_compare_queries (required).
    num_generations (int): number of SQL generations per item, default 10; if it does not exceed the last threshold, it is raised to the next multiple of 5 above that threshold.
    difficulty_thresholds (list[int]|None): thresholds list, default [2, 5, 9].
    difficulty_labels (list[str]|None): label list, default ['extra', 'hard', 'medium', 'easy'].
    system_prompt (str|None): optional system prompt.
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLEffortRanker

    op = SQLEffortRanker(model=model, database_manager=database_manager, num_generations=15)
    item = {
        'db_id': 'db_1',
        'prompt': 'Database Schema: ... Question: 有多少已支付的订单？',
        'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';'
    }
    res = op(item)
    print(res)
    # {
    #   'db_id': 'db_1',
    #   'prompt': 'Database Schema: ... Question: 有多少已支付的订单？',
    #   'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';',
    #   'sql_execution_difficulty': 'medium'
    # }
    ```
    """
    def __init__(self, model=None, database_manager=None, num_generations=10,
                 difficulty_thresholds=None, difficulty_labels=None,
                 system_prompt=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.database_manager = database_manager
        self.num_generations = int(num_generations)
        sys_prompt = system_prompt or (
            'You are a SQL generator. '
            'Return ONLY the SQL query inside a Markdown code block: ```sql ... ```.'
        )
        self.model = model.share().prompt(sys_prompt) if model else None
        if difficulty_thresholds is None:
            difficulty_thresholds = [2, 5, 9]
        if difficulty_labels is None:
            difficulty_labels = ['extra', 'hard', 'medium', 'easy']

        self.difficulty_config = {
            'thresholds': difficulty_thresholds,
            'labels': difficulty_labels,
        }
        self.timeout = 5.0

        if self.num_generations <= self.difficulty_config['thresholds'][-1]:
            nearest_multiple = ((self.difficulty_config['thresholds'][-1] // 5) + 1) * 5
            LOG.warning(f'num_generations must exceed the last threshold; setting it to {nearest_multiple}')
            self.num_generations = nearest_multiple
        if len(self.difficulty_config['thresholds']) != len(self.difficulty_config['labels']) - 1:
            raise ValueError('Thresholds and labels configuration mismatch')

    @staticmethod
    def parse_response(response):
        return _parse_sql_response(response)

    @staticmethod
    def _prepare_comparisons(predicted_sqls_list, ground_truth_list, db_ids, idxs):
        comparisons = []
        for predicted_sqls, ground_truth, db_id in zip(predicted_sqls_list, ground_truth_list, db_ids):
            for predicted_sql in predicted_sqls:
                comparisons.append((db_id, predicted_sql, ground_truth))
        return comparisons

    def classify_difficulty(self, score):
        if score == -1:
            return 'gold error'
        thresholds = self.difficulty_config['thresholds']
        labels = self.difficulty_config['labels']
        for i, threshold in enumerate(thresholds):
            if score <= threshold:
                return labels[i]
        return labels[-1]

    def report_statistics(self, inputs, output_difficulty_key):
        difficulties = [item.get(output_difficulty_key) for item in inputs]
        counts = pd.Series(difficulties).value_counts()
        LOG.info('SQL Difficulty Statistics')
        stats = [f'{d.title()}: {counts.get(d, 0)}' for d in ['easy', 'medium', 'hard', 'extra', 'gold error']]
        LOG.info(', '.join(stats))

    def _generate_and_parse_sqls(self, input_prompts):
        prompts = [q for q in input_prompts for _ in range(self.num_generations)]
        responses = []
        try:
            responses = self.model(prompts)
            if isinstance(responses, str):
                responses = [responses]
        except Exception as e:
            LOG.error(f'Generation failed: {e}')
            responses = [''] * len(prompts)

        all_parsed_sqls = []
        for response in responses:
            all_parsed_sqls.append(_parse_sql_response(response) if response else '')
        return all_parsed_sqls

    def forward(self, data, input_db_id_key='db_id', input_sql_key='SQL',
                input_prompt_key='prompt', output_difficulty_key='sql_execution_difficulty',
                **kwargs):
        assert isinstance(data, dict)
        if self.model is None or self.database_manager is None:
            raise ValueError('model and database_manager are required')

        prompt = data.get(input_prompt_key)
        ground_truth = data.get(input_sql_key)
        db_id = data.get(input_db_id_key)

        if not prompt or not ground_truth or not db_id:
            data[output_difficulty_key] = 'unknown'
            return data

        prompts = [prompt] * self.num_generations
        try:
            responses = self.model(prompts)
            if isinstance(responses, str):
                responses = [responses]
        except Exception as e:
            LOG.error(f'Generation failed: {e}')
            responses = [''] * self.num_generations

        parsed_sqls = [_parse_sql_response(r) if r else '' for r in responses]

        comparisons = [(db_id, sql, ground_truth) for sql in parsed_sqls]
        try:
            batch_results = self.database_manager.batch_compare_queries(comparisons)
            cnt_true = sum(1 for res in batch_results if res.res == 1)
            data[output_difficulty_key] = self.classify_difficulty(cnt_true)
        except Exception as e:
            LOG.error(f'Comparison failed: {e}')
            data[output_difficulty_key] = 'gold error'

        return data
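
The mapping from the number of matching generations to a difficulty label can be tried in isolation. The following is a minimal standalone sketch that mirrors the defaults and the logic shown in the listing above; the function names `adjust_num_generations` and `classify` are illustrative and are not part of the library.

```python
# Standalone sketch of the threshold-to-label mapping; not the library code itself.
DEFAULT_THRESHOLDS = [2, 5, 9]
DEFAULT_LABELS = ['extra', 'hard', 'medium', 'easy']


def adjust_num_generations(num_generations, thresholds=DEFAULT_THRESHOLDS):
    """Raise num_generations to the next multiple of 5 above the last threshold if needed."""
    last = thresholds[-1]
    if num_generations <= last:
        return ((last // 5) + 1) * 5
    return num_generations


def classify(cnt_true, thresholds=DEFAULT_THRESHOLDS, labels=DEFAULT_LABELS):
    """Map the number of generations that matched the gold SQL to a label."""
    if len(thresholds) != len(labels) - 1:
        raise ValueError('Thresholds and labels configuration mismatch')
    for threshold, label in zip(thresholds, labels):
        if cnt_true <= threshold:
            return label
    # More matches than the last threshold: the model finds the query easy.
    return labels[-1]


for cnt in (0, 3, 9, 10):
    print(cnt, classify(cnt))
```

With the defaults, few correct generations map to 'extra' and more than nine map to 'easy', which is why the last threshold must stay below num_generations for the 'easy' label to be reachable.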

`SQLForge`

Bases: Text2SQLOps

Text2SQL data operator: SQLForge.

Generates executable SQL queries for one or multiple databases based on their schema and optional sample data, and labels each query with a rough complexity type.

Behavior:

Generates output_num SQLs per database.
Uses a default English system prompt (or a custom prompt_template) to control complexity labels (easy/medium/hard, etc.).
Parses SQL text from model responses, preferring ```sql ... ``` code blocks.

Parameters:

model –

a LazyLLM model object (required), shared via share().
database_manager –

database manager (required) implementing list_databases() and get_create_statements_and_insert_statements(db_name).
output_num (int, default: 300 ) –

number of SQLs to generate per database, default 300.
prompt_template –

optional custom prompt builder with build_prompt(...).
system_prompt (str | None, default: None ) –

optional system prompt, defaults to a built-in English prompt.
**kwargs –

extra args forwarded to the Text2SQLOps/LazyLLMDataBase base class.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLForge

# assumes database_manager already wraps your SQLite / Postgres / other databases
op = SQLForge(model=model, database_manager=database_manager, output_num=10)

# if data does not specify db_id, output_num SQL queries are generated for every database
res = op({})
print(res[0])
# {
#   'db_id': 'database_1',
#   'SQL': 'SELECT ...',
#   'sql_complexity_type': 'easy'
# }
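
A custom prompt_template only needs a build_prompt method. The sketch below is a hypothetical template, written under the assumption (taken from the source listing further down) that build_prompt is called with the keyword arguments insert_statements, create_statements and db_engine, and that a returned (prompt, complexity) tuple supplies the complexity label. The class name, the complexity levels and the prompt wording are illustrative only.

```python
import random


class RotatingComplexityTemplate:
    """Hypothetical prompt builder; only the build_prompt signature comes from the source."""

    def __init__(self, levels=('easy', 'medium', 'hard'), seed=None):
        self.levels = list(levels)
        self._rng = random.Random(seed)

    def build_prompt(self, insert_statements, create_statements, db_engine):
        # Pick a target complexity for this prompt and report it alongside the prompt.
        complexity = self._rng.choice(self.levels)
        schema = '\n\n'.join(str(s) for s in (create_statements or []))
        samples = '\n'.join(str(s) for s in list(insert_statements or [])[:5])
        prompt = (
            f'Database engine: {db_engine}\n\n'
            f'Schema:\n{schema}\n\n'
            f'Sample rows:\n{samples}\n\n'
            f'Write one {complexity} SQL query that runs on this schema.'
        )
        return prompt, complexity


template = RotatingComplexityTemplate(seed=0)
prompt, complexity = template.build_prompt(
    insert_statements=["INSERT INTO orders VALUES (1, 'paid');"],
    create_statements=['CREATE TABLE orders (id INTEGER, status TEXT);'],
    db_engine='sqlite',
)
print(complexity)
print(prompt)
```

Such an object would be passed as `SQLForge(..., prompt_template=RotatingComplexityTemplate())`. According to the listing, a build_prompt that returns a plain string instead of a tuple results in the complexity label 'unknown'.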

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class SQLForge(Text2SQLOps):
    """Text2SQL data operator: SQLForge.

Generates executable SQL queries for one or multiple databases based on their schema and optional sample data, and labels each query with a rough complexity type.

Behavior:

- Generates output_num SQLs per database.
- Uses a default English system prompt (or a custom prompt_template) to control complexity labels (easy/medium/hard, etc.).
- Parses SQL text from model responses, preferring ```sql ... ``` code blocks.

Args:
    model: a LazyLLM model object (required), shared via share().
    database_manager: database manager (required) implementing:
        - list_databases()
        - get_create_statements_and_insert_statements(db_name)
    output_num (int): number of SQLs to generate per database, default 300.
    prompt_template: optional custom prompt builder with build_prompt(...).
    system_prompt (str|None): optional system prompt, defaults to a built-in English prompt.
    **kwargs: extra args forwarded to the Text2SQLOps/LazyLLMDataBase base class.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLForge

    # assumes database_manager already wraps your SQLite / Postgres / other databases
    op = SQLForge(model=model, database_manager=database_manager, output_num=10)

    # if data does not specify db_id, output_num SQL queries are generated for every database
    res = op({})
    print(res[0])
    # {
    #   'db_id': 'database_1',
    #   'SQL': 'SELECT ...',
    #   'sql_complexity_type': 'easy'
    # }
    ```
    """
    def __init__(self, model=None, database_manager=None, output_num=300,
                 prompt_template=None, system_prompt=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.database_manager = database_manager
        self.output_num = output_num
        self.prompt_template = prompt_template
        sys_prompt = system_prompt or (
            'You are a SQL generator for Text2SQL tasks.\n'
            'Return ONLY one SQL query inside a Markdown code block: ```sql ... ```.\n'
        )
        self.model = model.share().prompt(sys_prompt) if model else None

    def _build_prompt(self, create_statements, insert_statements, db_engine):
        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            built = template.build_prompt(
                insert_statements=insert_statements,
                create_statements=create_statements,
                db_engine=db_engine,
            )
            if isinstance(built, tuple) and len(built) >= 2:
                return str(built[0]), str(built[1])
            return str(built), 'unknown'
        return _default_build_prompt(create_statements, insert_statements, db_engine)

    def _validate_manager(self):
        if self.model is None:
            raise ValueError('model is required')
        if self.database_manager is None:
            raise ValueError('database_manager is required')
        if not hasattr(self.database_manager, 'list_databases'):
            raise ValueError('database_manager.list_databases is required')
        if not hasattr(self.database_manager, 'get_create_statements_and_insert_statements'):
            raise ValueError('database_manager.get_create_statements_and_insert_statements is required')

    def forward(self, data, output_sql_key='SQL', output_db_id_key='db_id',
                output_complexity_type_key='sql_complexity_type', **kwargs):
        assert isinstance(data, dict)
        self._validate_manager()

        db_engine = getattr(self.database_manager, 'db_type', 'unknown')
        db_id_in_data = data.get(output_db_id_key)
        if db_id_in_data:
            db_names = [db_id_in_data]
        else:
            db_names = self.database_manager.list_databases() or []

        if not db_names:
            LOG.warning('No databases found in database_manager.list_databases()')
            return []

        prompts, db_ids, complexity_types = self._collect_prompts(db_names, db_engine)

        responses = []
        for p in prompts:
            try:
                responses.append(self.model(p))
            except Exception as e:
                LOG.error(f'Failed to generate SQL: {e}')
                responses.append('')

        return [
            {
                output_db_id_key: db_id,
                output_sql_key: _parse_sql_response(resp),
                output_complexity_type_key: complexity,
            }
            for db_id, resp, complexity in zip(db_ids, responses, complexity_types)
        ]

    def _collect_prompts(self, db_names, db_engine):
        prompts = []
        db_ids = []
        complexity_types = []

        LOG.info(f'Generating {self.output_num} SQLs for each database')
        for db_name in tqdm(db_names, desc='Processing Databases'):
            create_statements, insert_statements = self.database_manager.get_create_statements_and_insert_statements(
                db_name
            )
            for _ in range(int(self.output_num)):
                prompt, complexity = self._build_prompt(create_statements, insert_statements, db_engine=db_engine)
                prompts.append(prompt)
                db_ids.append(db_name)
                complexity_types.append(complexity)
        return prompts, db_ids, complexity_types
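
The helper `_parse_sql_response` used above is not shown in this section. As a rough illustration of what "preferring sql code blocks" could look like, here is a minimal sketch; the function `extract_sql`, its fallback order and its regular expressions are assumptions and may differ from the library's actual helper.

```python
import re

FENCE = '`' * 3  # three backticks, built this way so the example is easy to embed

# A fenced block tagged as sql, and any fenced block with an optional language tag.
_SQL_BLOCK = re.compile(FENCE + r'sql\b\s*(.*?)' + FENCE, re.DOTALL | re.IGNORECASE)
_ANY_BLOCK = re.compile(FENCE + r'(?:\w+\n)?(.*?)' + FENCE, re.DOTALL)


def extract_sql(response):
    """Return SQL text from a model response, preferring a sql-tagged fenced block."""
    if not isinstance(response, str) or not response.strip():
        return ''
    for pattern in (_SQL_BLOCK, _ANY_BLOCK):
        match = pattern.search(response)
        if match:
            return match.group(1).strip()
    # No fenced block at all: fall back to the raw text.
    return response.strip()


reply = 'Here is the query:\n' + FENCE + 'sql\nSELECT count(*) FROM orders;\n' + FENCE
print(extract_sql(reply))
```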

`SQLIntentSynthesizer`

Bases: Text2SQLOps

Text2SQL data operator: SQLIntentSynthesizer.

Given a SQL query and database schema (with optional column descriptions), generates a natural language question aligned with the SQL semantics, plus optional external knowledge text.

Key features:

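Generates multiple candidate questions (input_query_num) and, when embeddings are available, selects the candidate with the smallest summed cosine distance to the others; without embeddings, one candidate is picked at random.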
Uses special markers in model output: [QUESTION-START]/[QUESTION-END] and [EXTERNAL-KNOWLEDGE-START]/[...-END].

Parameters:

model –

text generation model (required).
embedding_model –

optional embedding model; either an object with generate_embedding_from_input(texts) or a callable taking texts.
database_manager –

schema provider (required) implementing get_create_statements_and_insert_statements(db_id).
input_query_num (int, default: 5 ) –

number of question candidates per SQL, default 5.
prompt_template –

optional custom prompt builder.
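system_prompt (str | None, default: None ) –

optional system prompt; defaults to 'You are a helpful assistant.'.
input_intent_key (str, default: 'intent' ) –

key checked for an existing intent; items that already hold non-empty text under it are returned unchanged, and it is also the default output key.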
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLIntentSynthesizer

op = SQLIntentSynthesizer(model=model,
                          embedding_model=embedding_model,
                          database_manager=database_manager,
                          input_query_num=5)
item = {'db_id': 'db_1', 'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"}
res = op(item)
print(res)
# {
#   'db_id': 'db_1',
#   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#   'question_type': 'default',
#   'intent': 'How many paid orders are there?',
#   'evidence': '...optional external knowledge...'
# }
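
The marker format described under "Key features" can be checked against a sample response without a model. This sketch reuses the two regular expressions from parse_llm_response in the source listing below; the function name `parse_markers` and the sample text are illustrative.

```python
import re

QUESTION_RE = re.compile(r'\[QUESTION-START\](.*?)\[QUESTION-END\]', re.DOTALL)
KNOWLEDGE_RE = re.compile(
    r'\[EXTERNAL-KNOWLEDGE-START\](.*?)\[EXTERNAL-KNOWLEDGE-END\]', re.DOTALL
)


def parse_markers(response):
    """Return the question and external knowledge, or None if no question is found."""
    if not isinstance(response, str):
        return None
    question = QUESTION_RE.search(response)
    knowledge = KNOWLEDGE_RE.search(response)
    question_text = question.group(1).strip() if question else ''
    if not question_text:
        return None
    return {
        'question': question_text,
        'external_knowledge': knowledge.group(1).strip() if knowledge else '',
    }


sample = (
    '[QUESTION-START]\nHow many paid orders are there?\n[QUESTION-END]\n'
    '[EXTERNAL-KNOWLEDGE-START]\nstatus = paid marks a paid order\n[EXTERNAL-KNOWLEDGE-END]'
)
print(parse_markers(sample))
```

A response without a question block yields None, so such candidates are skipped before selection.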

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class SQLIntentSynthesizer(Text2SQLOps):
    """Text2SQL data operator: SQLIntentSynthesizer.

Given a SQL query and database schema (with optional column descriptions), generates a natural language question aligned with the SQL semantics, plus optional external knowledge text.

Key features:

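- Generates multiple candidate questions (input_query_num) and, when embeddings are available, selects the candidate with the smallest summed cosine distance to the others; without embeddings, one candidate is picked at random.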
- Uses special markers in model output: [QUESTION-START]/[QUESTION-END] and [EXTERNAL-KNOWLEDGE-START]/[...-END].

Args:
    model: text generation model (required).
    embedding_model: optional embedding model, supporting:
        - generate_embedding_from_input(texts) or callable(texts).
    database_manager: schema provider (required) implementing:
        - get_create_statements_and_insert_statements(db_id)
    input_query_num (int): number of question candidates per SQL, default 5.
    prompt_template: optional custom prompt builder.
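    system_prompt (str|None): optional system prompt; defaults to 'You are a helpful assistant.'.
    input_intent_key (str): key checked for an existing intent, default 'intent'; items that already hold non-empty text under it are returned unchanged, and it is also the default output key.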
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLIntentSynthesizer

    op = SQLIntentSynthesizer(model=model,
                              embedding_model=embedding_model,
                              database_manager=database_manager,
                              input_query_num=5)
    item = {'db_id': 'db_1', 'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';'}
    res = op(item)
    print(res)
    # {
    #   'db_id': 'db_1',
    #   'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';',
    #   'question_type': 'default',
    #   'intent': 'How many paid orders are there?',
    #   'evidence': '...optional external knowledge...'
    # }
    ```
    """
    def __init__(self, model=None, embedding_model=None, database_manager=None,
                 input_query_num=5, prompt_template=None, system_prompt=None,
                 input_intent_key='intent', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.embedding_model = embedding_model
        self.database_manager = database_manager
        self.question_candidates_num = int(input_query_num)
        self.prompt_template = prompt_template
        self.input_intent_key = input_intent_key
        sys_prompt = system_prompt or 'You are a helpful assistant.'
        self.model = model.share().prompt(sys_prompt) if model else None

    @staticmethod
    def _is_non_empty_text(x):
        return isinstance(x, str) and x.strip() != ''

    def extract_column_descriptions(self, create_statements):
        column_name2column_desc = {}
        pattern = r'"(\w+)"\s+\w+\s*/\*\s*(.*?)\s*\*/'
        if not create_statements:
            return column_name2column_desc
        for create_statement in create_statements:
            for column_name, description in re.findall(pattern, str(create_statement)):
                col = str(column_name).lower()
                if col not in column_name2column_desc:
                    column_name2column_desc[col] = str(description)
        return column_name2column_desc

    def parse_llm_response(self, response):
        if not isinstance(response, str):
            LOG.warning(f'Invalid response type: {type(response)}, expected str. Response: {response}')
            return None

        question_pattern = re.compile(r'\[QUESTION-START\](.*?)\[QUESTION-END\]', re.DOTALL)
        external_knowledge_pattern = re.compile(
            r'\[EXTERNAL-KNOWLEDGE-START\](.*?)\[EXTERNAL-KNOWLEDGE-END\]', re.DOTALL
        )

        question_match = question_pattern.search(response)
        external_knowledge_match = external_knowledge_pattern.search(response)

        question_content = question_match.group(1).strip() if question_match else ''
        external_knowledge_content = external_knowledge_match.group(1).strip() if external_knowledge_match else ''

        if question_content == '':
            return None
        return {'question': question_content, 'external_knowledge': external_knowledge_content}

    @staticmethod
    def _cosine_distance(a, b):
        if not a or not b:
            return 1.0
        n = min(len(a), len(b))
        dot = 0.0
        na = 0.0
        nb = 0.0
        for i in range(n):
            x = float(a[i])
            y = float(b[i])
            dot += x * y
            na += x * x
            nb += y * y
        denom = math.sqrt(na) * math.sqrt(nb)
        if denom == 0.0:
            return 1.0
        return 1.0 - (dot / denom)

    def _select_best_question(self, question_candidates, start_idx, embeddings):
        if not question_candidates:
            return None
        if len(question_candidates) == 1:
            return question_candidates[0]
        if embeddings is None or start_idx < 0:
            return random.sample(question_candidates, 1)[0]

        end_idx = start_idx + len(question_candidates)
        if end_idx > len(embeddings):
            return random.sample(question_candidates, 1)[0]

        candidate_embeddings = embeddings[start_idx:end_idx]
        distance_sums = []
        for i in range(len(candidate_embeddings)):
            s = 0.0
            for j in range(len(candidate_embeddings)):
                if i == j:
                    continue
                s += self._cosine_distance(candidate_embeddings[i], candidate_embeddings[j])
            distance_sums.append(s)
        min_index = min(range(len(distance_sums)), key=distance_sums.__getitem__)
        return question_candidates[min_index]

    def _build_prompt(self, sql, db_id, db_id2column_info, db_engine):
        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            built = template.build_prompt(sql, db_id, db_id2column_info, db_engine)
            if isinstance(built, tuple) and len(built) >= 2:
                return str(built[0]), str(built[1])
            return str(built), 'unknown'

        column_info = db_id2column_info.get(db_id, {})
        column_info_text = '\n'.join([f'- {k}: {v}' for k, v in list(column_info.items())[:200]])
        prompt = (
            f'You are a Text2SQL intent synthesizer.\n'
            f'Database engine: {db_engine}\n'
            f'db_id: {db_id}\n\n'
            f'Given a SQL query, generate a natural language intent that matches it.\n'
            f'If helpful, you may use the following column descriptions:\n{column_info_text}\n\n'
            f'Output format:\n'
            f'[QUESTION-START] ... [QUESTION-END]\n'
            f'[EXTERNAL-KNOWLEDGE-START] ... [EXTERNAL-KNOWLEDGE-END]\n\n'
            f'SQL:\n{sql}\n'
        )
        return prompt, 'default'

    def _generate_embeddings(self, texts):
        if not texts:
            return []
        emb = self.embedding_model
        if emb is None:
            return None
        try:
            if hasattr(emb, 'generate_embedding_from_input'):
                vectors = emb.generate_embedding_from_input(texts)
            elif callable(emb):
                vectors = emb(texts)
            else:
                return None
            if not isinstance(vectors, list):
                return None
            return vectors
        except Exception as e:
            LOG.warning(f'Embedding generation failed: {e}')
            return None

    def _validate_generator_manager(self):
        if self.model is None:
            raise ValueError('model is required')
        if self.database_manager is None:
            raise ValueError('database_manager is required')
        if not hasattr(self.database_manager, 'get_create_statements_and_insert_statements'):
            raise ValueError('database_manager.get_create_statements_and_insert_statements is required')

    def forward(self, data, input_sql_key='SQL', input_db_id_key='db_id',
                output_intent_key=None, output_evidence_key='evidence', **kwargs):
        assert isinstance(data, dict)
        self._validate_generator_manager()

        if output_intent_key is None:
            output_intent_key = self.input_intent_key

        if self._is_non_empty_text(data.get(self.input_intent_key)):
            return data

        db_engine = getattr(self.database_manager, 'db_type', 'unknown')
        sql = data.get(input_sql_key, '')
        db_id = data.get(input_db_id_key, '')

        try:
            create_statements, _ = self.database_manager.get_create_statements_and_insert_statements(db_id)
            column_info = self.extract_column_descriptions(create_statements)
        except Exception as e:
            LOG.warning(f'Failed to extract schema for db_id={db_id}: {e}')
            column_info = {}

        prompt, question_type = self._build_prompt(str(sql), str(db_id), {db_id: column_info}, db_engine)
        data['question_type'] = question_type

        responses = []
        for _ in range(self.question_candidates_num):
            try:
                responses.append(self.model(prompt))
            except Exception as e:
                LOG.error(f'Failed to generate question: {e}')
                responses.append('')

        candidates = []
        embedding_texts = []
        for resp in responses:
            parsed = self.parse_llm_response(resp)
            if parsed:
                candidates.append(parsed)
                text = f'{parsed.get("external_knowledge", "")} {parsed.get("question", "")}'.strip()
                embedding_texts.append(text)

        embeddings = self._generate_embeddings(embedding_texts) if embedding_texts else None
        best = self._select_best_question(candidates, 0, embeddings)

        if best is not None:
            data[output_intent_key] = best.get('question', '')
            data[output_evidence_key] = best.get('external_knowledge', '')

        return data
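
The selection step in _select_best_question picks the candidate whose embedding has the smallest summed cosine distance to all other candidates. The following standalone sketch reproduces that idea with toy two-dimensional vectors; the helper names and the example data are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def pick_most_central(candidates, embeddings):
    """Return the candidate with the smallest summed distance to the others."""
    sums = [
        sum(cosine_distance(e, other) for j, other in enumerate(embeddings) if j != i)
        for i, e in enumerate(embeddings)
    ]
    best = min(range(len(sums)), key=sums.__getitem__)
    return candidates[best]


candidates = ['How many paid orders?', 'Count the paid orders.', 'List all users.']
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors, not real embeddings
print(pick_most_central(candidates, embeddings))
```

Here the second candidate lies between the other two, so it has the smallest total distance and is selected. The outlier in the third position is never chosen by this rule.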

`SQLReasoningTracer`

Bases: Text2SQLOps

Text2SQL data operator: SQLReasoningTracer.

For each (question, SQL, schema, evidence) item, generates multiple chain-of-thought (CoT) reasoning traces from question to SQL.

Parameters:

model –

a LazyLLM model object (required).
database_manager –

schema provider (required) implementing get_create_statements_and_insert_statements(db_id).
prompt_template –

optional custom prompt builder.
output_num (int, default: 3 ) –

number of CoT trajectories per item, default 3 (>=1).
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLReasoningTracer

op = SQLReasoningTracer(model=model, database_manager=database_manager, output_num=3)
item = {
    'db_id': 'db_1',
    'intent': 'How many paid orders are there?',
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    'evidence': ''
}
res = op(item)
print(len(res['cot_responses']))
print(res['cot_responses'][0][:200])  # print the first 200 characters of the first CoT
# 3
# "Database Schema: ... Question: How many paid orders are there? ... Reasoning step 1: ... Reasoning step 2: ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class SQLReasoningTracer(Text2SQLOps):
    """Text2SQL data operator: SQLReasoningTracer.

For each (question, SQL, schema, evidence) item, generates multiple chain-of-thought (CoT) reasoning traces from question to SQL.

Args:
    model: a LazyLLM model object (required).
    database_manager: schema provider (required) implementing:
        - get_create_statements_and_insert_statements(db_id)
    prompt_template: optional custom prompt builder.
    output_num (int): number of CoT trajectories per item, default 3 (>=1).
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLReasoningTracer

    op = SQLReasoningTracer(model=model, database_manager=database_manager, output_num=3)
    item = {
        'db_id': 'db_1',
        'intent': 'How many paid orders are there?',
        'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';',
        'evidence': ''
    }
    res = op(item)
    print(len(res['cot_responses']))
    print(res['cot_responses'][0][:200])  # print the first 200 characters of the first CoT
    # 3
    # "Database Schema: ... Question: How many paid orders are there? ... Reasoning step 1: ... Reasoning step 2: ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
    ```
    """
    def __init__(self, model=None, database_manager=None, prompt_template=None,
                 output_num=3, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.database_manager = database_manager
        self.prompt_template = prompt_template
        self.output_num = int(output_num)
        if self.output_num < 1:
            raise ValueError('output_num must be >= 1')
        sys_prompt = 'You are a database expert. Please generate a step-by-step reasoning ' \
                     '(Chain of Thought) and the final SQL.'
        self.model = model.share().prompt(sys_prompt) if model else None

    def _build_prompt(self, item, schema_str):
        intent = item.get(self.input_intent_key)
        gold_sql = item.get(self.input_sql_key)
        evidence = item.get(self.input_evidence_key, '')

        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            return template.build_prompt(schema_str, intent, gold_sql, evidence)

        return (
            f'Database Schema:\n{schema_str}\n\n'
            f'Intent: {intent}\n'
            f'Evidence: {evidence}\n'
            f'Target SQL: {gold_sql}\n\n'
            f'Please provide a detailed step-by-step reasoning that leads to the correct SQL query.'
        )

    def forward(self, data, input_sql_key='SQL', input_intent_key='intent',
                input_db_id_key='db_id', input_evidence_key='evidence',
                output_cot_key='cot_responses', **kwargs):
        assert isinstance(data, dict)
        self._validate_manager()

        self.input_intent_key = input_intent_key
        self.input_sql_key = input_sql_key
        self.input_db_id_key = input_db_id_key
        self.input_evidence_key = input_evidence_key

        db_id = data.get(input_db_id_key)
        if not db_id:
            LOG.warning('Missing db_id for reasoning tracing')
            return data

        try:
            create_statements, _ = self.database_manager.get_create_statements_and_insert_statements(db_id)
            schema_str = '\n\n'.join([str(s) for s in create_statements])
            prompt = self._build_prompt(data, schema_str)

            responses = []
            for _ in range(self.output_num):
                try:
                    responses.append(self.model(prompt))
                except Exception as e:
                    LOG.error(f'Failed to generate reasoning trace: {e}')
                    responses.append('')
            data[output_cot_key] = responses
        except Exception as e:
            LOG.error(f'Error during reasoning tracing for db_id={db_id}: {e}')
            data[output_cot_key] = []

        return data

    def _validate_manager(self):
        if self.model is None:
            raise ValueError('model is required')
        if self.database_manager is None:
            raise ValueError('database_manager is required')
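
In the listing above, a custom prompt_template is called as build_prompt(schema_str, intent, gold_sql, evidence) and its return value is sent to the model as the prompt. The class below is a hypothetical template that follows this call shape; its name, the max_steps option and the wording are illustrative assumptions.

```python
class StepwiseTraceTemplate:
    """Hypothetical prompt builder for reasoning traces."""

    def __init__(self, max_steps=6):
        self.max_steps = max_steps

    def build_prompt(self, schema_str, intent, gold_sql, evidence):
        parts = ['Database Schema:', str(schema_str), '', f'Intent: {intent}']
        # Only mention evidence when the item actually carries some.
        if evidence:
            parts.append(f'Evidence: {evidence}')
        parts += [
            f'Target SQL: {gold_sql}',
            '',
            f'Explain in at most {self.max_steps} numbered steps how the intent '
            'leads to the target SQL, then repeat the SQL in a fenced sql code block.',
        ]
        return '\n'.join(parts)


template = StepwiseTraceTemplate(max_steps=4)
prompt = template.build_prompt(
    'CREATE TABLE orders (id INTEGER, status TEXT);',
    'How many paid orders are there?',
    "SELECT count(*) FROM orders WHERE status = 'paid';",
    '',
)
print(prompt)
```

Such an object would be passed as `SQLReasoningTracer(..., prompt_template=StepwiseTraceTemplate())`.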

`SQLRuntimeSieve`

Bases: Text2SQLOps

Text2SQL data operator: SQLRuntimeSieve.

Filters SQL queries by:

Keeping only queries that look like SELECT/WITH queries.
Calling database_manager to run EXPLAIN (or similar) and keeping only those that execute successfully.

Parameters:

database_manager –

database manager (required) implementing database_exists(db_id) and batch_explain_queries(list[(db_id, sql)]).
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLRuntimeSieve

op = SQLRuntimeSieve(database_manager=database_manager)
item = {'db_id': 'db_1', 'SQL': 'SELECT * FROM users;'}
res = op(item)
print(res)  # the original dict if the SQL can be explained on db_1; otherwise forward returns an empty list
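
The example assumes an existing database_manager. For experimentation, a stand-in with the two required methods can be built on the standard sqlite3 module. The sketch below is hypothetical: only the method names database_exists and batch_explain_queries and the `.success` attribute on each result come from the source; the class names, the constructor and the use of EXPLAIN QUERY PLAN are assumptions.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ExplainResult:
    success: bool
    error: str = ''


class InMemorySQLiteManager:
    """Hypothetical manager backed by in-memory SQLite databases."""

    db_type = 'sqlite'

    def __init__(self, schemas):
        # schemas maps db_id -> SQL script that creates the tables.
        self._conns = {}
        for db_id, script in schemas.items():
            # check_same_thread=False lets worker threads reuse the connection.
            conn = sqlite3.connect(':memory:', check_same_thread=False)
            conn.executescript(script)
            self._conns[db_id] = conn

    def database_exists(self, db_id):
        return db_id in self._conns

    def batch_explain_queries(self, queries):
        results = []
        for db_id, sql in queries:
            try:
                # Planning the query validates tables and columns without running it.
                self._conns[db_id].execute('EXPLAIN QUERY PLAN ' + sql).fetchall()
                results.append(ExplainResult(True))
            except (sqlite3.Error, sqlite3.Warning, KeyError) as exc:
                results.append(ExplainResult(False, str(exc)))
        return results


manager = InMemorySQLiteManager({'db_1': 'CREATE TABLE users (id INTEGER, name TEXT);'})
checks = manager.batch_explain_queries([
    ('db_1', 'SELECT * FROM users;'),
    ('db_1', 'SELECT * FROM missing_table;'),
])
print([c.success for c in checks])
```

In-memory connections live in a single process, so a stand-in like this is only suitable for single-process or threaded runs.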

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class SQLRuntimeSieve(Text2SQLOps):
    """Text2SQL data operator: SQLRuntimeSieve.

Filters SQL queries by:

1. Keeping only queries that look like SELECT/WITH queries.
2. Calling database_manager to run EXPLAIN (or similar) and keeping only those that execute successfully.

Args:
    database_manager: database manager (required) implementing:
        - database_exists(db_id)
        - batch_explain_queries(list[(db_id, sql)])
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLRuntimeSieve

    op = SQLRuntimeSieve(database_manager=database_manager)
    item = {'db_id': 'db_1', 'SQL': 'SELECT * FROM users;'}
    res = op(item)
    print(res)  # the original dict if the SQL can be explained on db_1; otherwise forward returns an empty list
    ```
    """
    def __init__(self, database_manager=None, **kwargs):
        super().__init__(**kwargs)
        self.database_manager = database_manager

    def filter_select_sql(self, sql):
        if not isinstance(sql, str):
            return False
        sql_wo_comments = re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL)
        sql_wo_comments = re.sub(r'--.*', '', sql_wo_comments)
        sql_wo_comments = sql_wo_comments.strip()

        if sql_wo_comments.lower().startswith('select') or \
           sql_wo_comments.lower().startswith('with'):
            return True
        return False

    def forward(self, data, input_sql_key='SQL', input_db_id_key='db_id', **kwargs):
        assert isinstance(data, dict)
        if self.database_manager is None:
            LOG.error('database_manager is required for SQLRuntimeSieve')
            return data

        sql = data.get(input_sql_key)
        db_id = data.get(input_db_id_key)

        if not self.filter_select_sql(sql):
            return []

        if not self.database_manager.database_exists(db_id):
            LOG.warning(f'Database {db_id} not found in registry, please check the database folder')
            return []

        try:
            execution_results = self.database_manager.batch_explain_queries([(db_id, sql)])
            if execution_results and execution_results[0].success:
                return data
        except Exception as e:
            LOG.error(f'Error during explain_query: {e}')

        return []
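
The first filtering stage, filter_select_sql, can be exercised on its own. This standalone sketch follows the same steps as the listing above (strip block comments, strip line comments, test the prefix); the function name `looks_like_select` is illustrative.

```python
import re


def looks_like_select(sql):
    """True if the statement starts with SELECT or WITH once comments are removed."""
    if not isinstance(sql, str):
        return False
    stripped = re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL)  # /* block comments */
    stripped = re.sub(r'--.*', '', stripped)                   # -- line comments
    return stripped.strip().lower().startswith(('select', 'with'))


samples = [
    '/* report */ SELECT * FROM users;',
    '-- helper CTE\nWITH t AS (SELECT 1) SELECT * FROM t;',
    'DELETE FROM users;',
    None,
]
for sql in samples:
    print(repr(sql), looks_like_select(sql))
```

This is a prefix check rather than a parser, so a statement such as `WITH t AS (...) DELETE ...` would still pass it. The subsequent explain step against the database is what decides whether the query is kept.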

`SQLSyntaxProfiler`

Bases: Text2SQLOps

Text2SQL data operator: SQLSyntaxProfiler.

Classifies SQL difficulty based on structural components using EvalHardness/EvalHardnessLite, assigning labels such as easy/medium/hard/extra.

Parameters:

difficulty_thresholds (list[int] | None, default: None ) –

thresholds list, default [2, 4, 6].
difficulty_labels (list[str] | None, default: None ) –

label list, default ['easy', 'medium', 'hard', 'extra'].
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLSyntaxProfiler

op = SQLSyntaxProfiler()
item = {'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"}
res = op(item)
print(res)
# {
#   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#   'sql_component_difficulty': 'easy'
# }

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class SQLSyntaxProfiler(Text2SQLOps):
    """Text2SQL data operator: SQLSyntaxProfiler.

Classifies SQL difficulty based on structural components using EvalHardness/EvalHardnessLite, assigning labels such as easy/medium/hard/extra.

Args:
    difficulty_thresholds (list[int]|None): thresholds list, default [2, 4, 6].
    difficulty_labels (list[str]|None): label list, default ['easy', 'medium', 'hard', 'extra'].
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLSyntaxProfiler

    op = SQLSyntaxProfiler()
    item = {'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';'}
    res = op(item)
    print(res)
    # {
    #   'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';',
    #   'sql_component_difficulty': 'easy'
    # }
    ```
    """
    def __init__(self, difficulty_thresholds=None, difficulty_labels=None, **kwargs):
        super().__init__(**kwargs)
        if difficulty_thresholds is None:
            difficulty_thresholds = [2, 4, 6]
        if difficulty_labels is None:
            difficulty_labels = ['easy', 'medium', 'hard', 'extra']

        self.difficulty_config = {
            'thresholds': difficulty_thresholds,
            'labels': difficulty_labels,
        }
        if len(self.difficulty_config['thresholds']) != len(self.difficulty_config['labels']) - 1:
            raise ValueError('Thresholds and labels configuration mismatch')

    def eval_component_hardness(self, sql, schema):
        evaluator = EvalHardness(Schema(schema), sql)
        return evaluator.run()

    def eval_hardness_lite(self, sql):
        evaluator = EvalHardnessLite(str(sql), self.difficulty_config)
        return evaluator.run()

    def forward(self, data, input_sql_key='SQL',
                output_difficulty_key='sql_component_difficulty', **kwargs):
        assert isinstance(data, dict)
        sql = data.get(input_sql_key)
        if not sql:
            data[output_difficulty_key] = 'unknown'
            return data
        hardness = self.eval_hardness_lite(str(sql))
        data[output_difficulty_key] = hardness

        return data
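
The constructor requires exactly one more label than thresholds. One way to read that constraint is that N ascending thresholds split a numeric score into N + 1 buckets. The sketch below illustrates this reading with `bisect`. The numeric score and the boundary rule (a score equal to a threshold falls into the higher bucket) are assumptions made for illustration, since the internals of EvalHardnessLite are not shown here; `label_for_score` is a hypothetical helper.

```python
from bisect import bisect_right


def label_for_score(score, thresholds=(2, 4, 6),
                    labels=('easy', 'medium', 'hard', 'extra')):
    # Same sanity check as SQLSyntaxProfiler.__init__.
    if len(thresholds) != len(labels) - 1:
        raise ValueError('Thresholds and labels configuration mismatch')
    # bisect_right counts the thresholds <= score; that count is the bucket index.
    # ASSUMPTION: the real evaluator may treat boundary values differently.
    return labels[bisect_right(thresholds, score)]


for s in (0, 2, 5, 9):
    print(s, label_for_score(s))
```

With three thresholds there are four possible bucket indices (0 to 3), which is why a four-element label list is needed.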

`TSQLSemanticAuditor`

Bases: Text2SQLOps

Text2SQL data operator: TSQLSemanticAuditor.

Given a natural language question + optional evidence + SQL + database schema, determines whether the SQL correctly answers the question and filters samples accordingly.

Behavior:

Fetches DDL for the given db_id via database_manager.
Asks the model to answer Yes/No; only keeps data when the response contains 'yes' (case-insensitive).

Parameters:

model –

a LazyLLM model object (required).
database_manager –

schema provider (required); must implement get_create_statements_and_insert_statements(db_id).
prompt_template –

optional custom prompt builder.
system_prompt (str | None, default: None ) –

optional system prompt, defaults to English Yes/No instructions.
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import TSQLSemanticAuditor

op = TSQLSemanticAuditor(model=model, database_manager=database_manager)
item = {
    'db_id': 'db_1',
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    'question': 'How many paid orders are there?',
    'evidence': ''
}
res = op(item)
print(res)
# {
#   'db_id': 'db_1',
#   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#   'question': 'How many paid orders are there?',
#   'evidence': ''
# }
# If the model judges that the SQL does not match the question, the sample is dropped (forward returns [])

Source code in lazyllm/tools/data/operators/text2sql_ops.py

class TSQLSemanticAuditor(Text2SQLOps):
    """Text2SQL data operator: TSQLSemanticAuditor.

Given a natural language question + optional evidence + SQL + database schema, determines whether the SQL correctly answers the question and filters samples accordingly.

Behavior:

- Fetches DDL for the given db_id via database_manager.
- Asks the model to answer Yes/No; only keeps data when the response contains 'yes' (case-insensitive).

Args:
    model: a LazyLLM model object (required).
    database_manager: schema provider (required) implementing:
        - get_create_statements_and_insert_statements(db_id)
    prompt_template: optional custom prompt builder.
    system_prompt (str|None): optional system prompt, defaults to English Yes/No instructions.
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import TSQLSemanticAuditor

    op = TSQLSemanticAuditor(model=model, database_manager=database_manager)
    item = {
        'db_id': 'db_1',
        'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';',
        'question': 'How many paid orders are there?',
        'evidence': ''
    }
    res = op(item)
    print(res)
    # {
    #   'db_id': 'db_1',
    #   'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';',
    #   'question': 'How many paid orders are there?',
    #   'evidence': ''
    # }
    # If the model judges that the SQL does not match the question, the sample is dropped (forward returns [])
    ```
    """
    def __init__(self, model=None, database_manager=None, prompt_template=None,
                 system_prompt=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.database_manager = database_manager
        self.prompt_template = prompt_template
        sys_prompt = system_prompt or (
            'You are an expert in SQL and database analysis.\n'
            'Your task is to determine if a given SQL query correctly answers a natural language '
            'question based on the provided database schema.\n'
            'Respond ONLY with "Yes" if the SQL is correct and "No" otherwise.'
        )
        self.model = model.share().prompt(sys_prompt) if model else None

    def _parse_consistency_response(self, response):
        if not isinstance(response, str):
            return False
        response = response.strip().lower()
        if 'yes' in response:
            return True
        return False

    def _build_prompt(self, question, sql, db_details):
        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            return str(template.build_prompt(question, sql, db_details))

        return (
            f'Database Schema:\n{db_details}\n\n'
            f'Question: {question}\n\n'
            f'SQL Query: {sql}\n\n'
            f'Does the SQL query correctly answer the question according to the schema? (Yes/No)'
        )

    def forward(self, data, input_sql_key='SQL', input_db_id_key='db_id',
                input_question_key='question', input_evidence_key='evidence', **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.database_manager is None:
            raise ValueError('database_manager is required')

        sql = data.get(input_sql_key)
        question = data.get(input_question_key)
        evidence = data.get(input_evidence_key, '')
        db_id = data.get(input_db_id_key)

        if not question or str(question).strip() == '':
            return []

        if evidence:
            question = f'{question}\n{evidence}'

        try:
            create_statements, _ = self.database_manager.get_create_statements_and_insert_statements(db_id)
            db_details = '\n\n'.join([str(s) for s in create_statements])
            prompt = self._build_prompt(str(question), str(sql), db_details)
            response = self.model(prompt)
            if self._parse_consistency_response(response):
                return data
        except Exception as e:
            LOG.warning(f'Failed to check correspondence: {e}')

        return []
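
`_build_prompt` delegates to `prompt_template.build_prompt(question, sql, db_details)` whenever the template object provides that method. A minimal sketch of such a template follows. `ConcisePromptTemplate`, its wording, and the `is_yes` helper are hypothetical; only the method name, the argument order, and the substring-based Yes check are taken from the listing above.

```python
class ConcisePromptTemplate:
    """Hypothetical template; only build_prompt's signature follows the source."""

    def build_prompt(self, question, sql, db_details):
        return (
            f'Schema:\n{db_details}\n\n'
            f'Question: {question}\n'
            f'SQL: {sql}\n'
            'Answer Yes or No.'
        )


def is_yes(response):
    # Mirrors _parse_consistency_response: case-insensitive substring match.
    return isinstance(response, str) and 'yes' in response.strip().lower()


template = ConcisePromptTemplate()
prompt = template.build_prompt(
    'How many paid orders are there?',
    'SELECT count(*) FROM orders',
    'CREATE TABLE orders (id INT, status TEXT)',
)
print(prompt)
print(is_yes('Yes.'), is_yes('No'), is_yes(None))
```

Such an object would be passed as `prompt_template=ConcisePromptTemplate()` when constructing the operator. Because the check is a substring match, a reply such as "No, although yes in part" also counts as a Yes, so a custom prompt should keep the model's answer to a single word.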

PT Operators

`lazyllm.tools.data.operators.pt_op`

`ContextQualFilter`

Bases: PT

Use VLM or LLM to evaluate whether context is suitable for generating QA pairs; keep only samples with score=1 (suitable).

Parameters:

llm –

vision- or text-language model instance
context_key (str, default: 'context' ) –

key for context, default 'context'
image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
prompt (str, default: None ) –

optional custom prompt

Examples:

import lazyllm
from lazyllm.tools.data import pt

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt.ContextQualFilter(vlm)
res = op([{'context': 'Good context for QA.', 'image_path': '/path/to/image.jpg'}])
# only samples with score=1 are kept

Source code in lazyllm/tools/data/operators/pt_op.py

class ContextQualFilter(PT):
    """Use VLM or LLM to evaluate whether context is suitable for generating QA pairs; keep only samples with score=1 (suitable).

Args:
    llm: vision- or text-language model instance
    context_key (str): key for context, default 'context'
    image_key (str): key for image path(s), default 'image_path'
    prompt (str): optional custom prompt


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt.ContextQualFilter(vlm)
    res = op([{'context': 'Good context for QA.', 'image_path': '/path/to/image.jpg'}])
    # only samples with score=1 are kept
    ```
    """
    DEFAULT_PROMPT = (
        'Evaluate whether the given context (text and/or images) is suitable for generating QA pairs. '
        'Output JSON only. Do not output any other irrelevant content.\n'
        '{\n'
        '  "score": 0,\n'
        '  "reason": ""\n'
        '}\n'
        'score: MUST be 0 or 1 only. 1=suitable, 0=not suitable. Good context has sufficient info for Q&A.'
    )

    def __init__(self, llm, context_key='context', image_key='image_path',
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if llm is None:
            raise ValueError('ContextQualFilter requires llm (vision- or text-language model).')
        self.context_key = context_key
        self.image_key = image_key
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._evaluator = llm.share().prompt(self.prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        context = data.get(self.context_key, '')
        if not context:
            return []
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        try:
            query = f'Context:\n{context}\n\nIs this context suitable for generating QA pairs?'
            inputs = encode_query_with_filepaths(query, paths) if paths else query
            out = self._evaluator(inputs)
            if not isinstance(out, dict):
                return []
            score = out.get('score', out.get('suitable', 0))
            try:
                score = int(float(score))
            except (TypeError, ValueError):
                score = 0
            if score != 1:
                return []
            return data
        except Exception as e:
            LOG.warning(f'Context qualification evaluation failed: {e}')
            return []
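
The keep/drop decision depends on how the model's `score` field is coerced to an integer. The standalone sketch below mirrors that handling so the edge cases are easy to see; `coerce_score` is a hypothetical helper name, and the sample outputs are made up for illustration.

```python
def coerce_score(out):
    """Mirror of the score handling in ContextQualFilter.forward (sketch)."""
    if not isinstance(out, dict):
        return 0
    # 'suitable' is accepted as a fallback key when 'score' is absent.
    score = out.get('score', out.get('suitable', 0))
    try:
        return int(float(score))
    except (TypeError, ValueError):
        return 0


samples = [{'score': 1}, {'score': '1.0'}, {'score': 0.9},
           {'suitable': 1}, {'score': 'high'}, 'not json']
for out in samples:
    print(out, '->', 'keep' if coerce_score(out) == 1 else 'drop')
```

Note that `int(float(...))` truncates toward zero, so a fractional score such as 0.9 is treated as 0 and the sample is dropped.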

`GraphRetriever`

Bases: PT_MM

Parses Markdown-style image links `![alt](path)` in the context field, keeps the paths that exist on disk, and writes them to img_key. The source context is not modified; if the context is empty or whitespace-only, img_key is set to [] and the sample is kept.

Parameters:

context_key (str, default: 'context' ) –

key for text context, default 'context'
img_key (str, default: 'image_path' ) –

key for image path output, default 'image_path'
images_folder (str, default: None ) –

optional folder in which images are looked up by file name (the directory part of each link is discarded)

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.GraphRetriever(context_key='context', img_key='img', _save_data=False)
data = {'context': 'Some content ![](/path/to/fig.png)'}
res = op([data])
# res[0]['img'] is a list of absolute paths for the links whose files exist

# empty context: res[0]['img'] == [], record kept, source context unchanged
empty_res = op([{'context': '   '}])

Source code in lazyllm/tools/data/operators/pt_op.py

class GraphRetriever(PT_MM):
    """Parse Markdown-style image links `![alt](path)` from context field, extract existing file paths and write to img_key.
Does not modify source context; if context.strip() is empty, img_key is [] and the sample is kept.

Args:
    context_key (str): key for text context, default 'context'
    img_key (str): key for image path output, default 'image_path'
    images_folder (str): optional root folder for resolving relative paths


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.GraphRetriever(context_key='context', img_key='img', _save_data=False)
    data = {'context': 'Some content ![](/path/to/fig.png)'}
    res = op([data])
    # res[0]['img'] contains resolved absolute path

    # empty context: res[0]['img'] == [], record kept, source context unchanged
    empty_res = op([{'context': '   '}])
    ```
    """
    def __init__(self, context_key='context', img_key='image_path', images_folder: Optional[str] = None,
                 _concurrency_mode='process', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.context_key = context_key
        self.img_key = img_key
        self.images_folder = images_folder

    def _parse_str_for_paths(self, s) -> list:
        matches = re.findall(r'!\[.*?\]\((.*?)\)', str(s))
        candidates = matches if matches else [str(s)] if s else []
        paths = []
        for p in candidates:
            if not p or not p.strip():
                continue
            raw = os.path.join(self.images_folder, os.path.basename(p)) if self.images_folder else p
            full = os.path.abspath(raw)
            if os.path.exists(full):
                paths.append(full)
        return paths

    def _extract_img_paths(self, img_data) -> list:
        valid_paths = []
        if isinstance(img_data, list):
            for item in img_data:
                if isinstance(item, list):
                    for sub in item:
                        valid_paths.extend(self._parse_str_for_paths(sub))
                else:
                    valid_paths.extend(self._parse_str_for_paths(item))
        else:
            valid_paths.extend(self._parse_str_for_paths(img_data))
        return list(dict.fromkeys(valid_paths))

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        context = data.get(self.context_key, '')
        if isinstance(context, list):
            context = '\n\n'.join(str(c) for c in context)
        context_stripped = context.strip() if context else ''
        if not context_stripped:
            data[self.img_key] = []
            return data
        valid_paths = self._extract_img_paths(context)
        data[self.img_key] = valid_paths
        return data
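
The link parsing can be reproduced without the operator. The following is a minimal sketch of the regex extraction, the `images_folder` lookup, and the order-preserving de-duplication, run against a temporary folder. `existing_image_paths` and `demo` are hypothetical names, and the sketch omits the fallback in which the source treats a string without any Markdown link as a path itself.

```python
import os
import re
import tempfile


def existing_image_paths(context, images_folder=None):
    """Sketch of the Markdown link parsing in GraphRetriever."""
    paths = []
    for p in re.findall(r'!\[.*?\]\((.*?)\)', context):
        if not p.strip():
            continue
        # With images_folder set, only the file name of the link is used.
        raw = os.path.join(images_folder, os.path.basename(p)) if images_folder else p
        full = os.path.abspath(raw)
        if os.path.exists(full):
            paths.append(full)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(paths))


def demo():
    with tempfile.TemporaryDirectory() as folder:
        open(os.path.join(folder, 'fig.png'), 'wb').close()
        text = 'Intro ![a](docs/fig.png), again ![b](fig.png), and ![c](missing.png)'
        found = existing_image_paths(text, images_folder=folder)
        return [os.path.basename(p) for p in found]


print(demo())
```

Both links to `fig.png` resolve to the same file inside the folder, so it is reported once, and the link to the missing file is skipped.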

`ImageDedup`

Bases: PT_MM

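Deduplicate images by file hash; keep first occurrence, skip duplicates. Only the first image path of each sample is hashed, and samples with no image path or an unreadable file are dropped as well.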

Parameters:

image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
hash_method (str, default: 'md5' ) –

hash algorithm, default 'md5'

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.ImageDedup()
batch = [{'image_path': 'a.jpg', 'id': 1}, {'image_path': 'a.jpg', 'id': 2}, {'image_path': 'b.jpg', 'id': 3}]
res = op(batch)
# len(res) == 2 when a.jpg and b.jpg exist and differ in content; the second a.jpg record is dropped as a duplicate

Source code in lazyllm/tools/data/operators/pt_op.py

class ImageDedup(PT_MM):
    """Deduplicate images by file hash; keep first occurrence, skip duplicates.

Args:
    image_key (str): key for image path(s), default 'image_path'
    hash_method (str): hash algorithm, default 'md5'


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.ImageDedup()
    batch = [{'image_path': 'a.jpg', 'id': 1}, {'image_path': 'a.jpg', 'id': 2}, {'image_path': 'b.jpg', 'id': 3}]
    res = op(batch)
    # len(res) == 2, duplicate removed
    ```
    """
    def __init__(self, image_key='image_path', hash_method='md5', **kwargs):
        super().__init__(**kwargs)
        self.image_key = image_key
        self.hash_method = hash_method

    def _calc_hash(self, image_path):
        try:
            if not os.path.exists(image_path):
                return None
            hash_obj = hashlib.new(self.hash_method)
            with open(image_path, 'rb') as f:
                for chunk in iter(lambda: f.read(4096), b''):
                    hash_obj.update(chunk)
            return hash_obj.hexdigest()
        except Exception as e:
            LOG.warning(f'Failed to calculate hash for {image_path}: {e}')
            return None

    def forward_batch_input(self, data, **kwargs):
        assert isinstance(data, list)
        seen_hashes: Set[str] = set()
        deduplicated_data = []
        for item in data:
            assert isinstance(item, dict)
            paths = _normalize_image_paths(item.get(self.image_key, ''))
            if not paths:
                continue
            image_hash = self._calc_hash(paths[0])
            if image_hash is None:
                continue
            if image_hash in seen_hashes:
                continue
            seen_hashes.add(image_hash)
            deduplicated_data.append(item)
        return deduplicated_data
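
The hashing and first-occurrence rule can be exercised with temporary files. This is a minimal standalone sketch of the same idea, not the operator itself: `file_hash`, `dedup_by_hash`, and `demo` are hypothetical names, and each record is assumed to hold a single path string rather than the list forms that `_normalize_image_paths` handles.

```python
import hashlib
import os
import tempfile


def file_hash(path, method='md5'):
    # Stream the file in 4 KiB chunks so large images are not loaded at once.
    digest = hashlib.new(method)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_by_hash(items, image_key='image_path', method='md5'):
    seen, kept = set(), []
    for item in items:
        path = item.get(image_key)
        # Records without an existing file are dropped, as in the listing above.
        if not path or not os.path.exists(path):
            continue
        h = file_hash(path, method)
        if h in seen:
            continue
        seen.add(h)
        kept.append(item)
    return kept


def demo():
    with tempfile.TemporaryDirectory() as folder:
        def make(name, payload):
            path = os.path.join(folder, name)
            with open(path, 'wb') as f:
                f.write(payload)
            return path

        batch = [
            {'image_path': make('a.jpg', b'same bytes'), 'id': 1},
            {'image_path': make('copy_of_a.jpg', b'same bytes'), 'id': 2},
            {'image_path': make('b.jpg', b'other bytes'), 'id': 3},
            {'image_path': os.path.join(folder, 'missing.jpg'), 'id': 4},
        ]
        return [item['id'] for item in dedup_by_hash(batch)]


print(demo())
```

Duplicates are detected by content, not by file name, so a byte-identical copy under a different name is removed while the record pointing to a missing file is skipped entirely.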

`Phi4QAGenerator`

Bases: PT

Use LLM to convert context (with optional images) into pretraining-format Phi-4 style multi-turn Q&A pairs.

Parameters:

llm –

vision- or text-language model instance
image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
context_key (str, default: 'context' ) –

key for context, default 'context'
num_qa (int, default: 5 ) –

number of QA pairs to generate, default 5
prompt (str, default: None ) –

optional custom prompt

Examples:

import lazyllm
from lazyllm.tools.data import pt

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt.Phi4QAGenerator(vlm, num_qa=2)
res = op([{'context': 'Some context.', 'image_path': '/path/to/image.jpg'}])
# res[0]['qa_pairs'] contains pretraining-format Q&A

Source code in lazyllm/tools/data/operators/pt_op.py

class Phi4QAGenerator(PT):
    """Use LLM to convert context (with optional images) into pretraining-format Phi-4 style multi-turn Q&A pairs.

Args:
    llm: vision- or text-language model instance
    image_key (str): key for image path(s), default 'image_path'
    context_key (str): key for context, default 'context'
    num_qa (int): number of QA pairs to generate, default 5
    prompt (str): optional custom prompt


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt.Phi4QAGenerator(vlm, num_qa=2)
    res = op([{'context': 'Some context.', 'image_path': '/path/to/image.jpg'}])
    # res[0]['qa_pairs'] contains pretraining-format Q&A
    ```
    """
    DEFAULT_PROMPT = (
        'Convert the given context (text and/or images) into pretraining-format multi-turn Q&A dialogue data. '
        'Output JSON only. Do not output any other irrelevant content.\n'
        '{\n'
        '  "qa_pairs": [\n'
        '    {"query": "", "answer": ""}\n'
        '  ]\n'
        '}\n'
        'Each item has query (question) and answer. Generate natural, instructional Q&A suitable for LM pretraining.'
    )

    def __init__(self, llm, image_key='image_path', context_key='context', num_qa=5,
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if llm is None:
            raise ValueError('Phi4QAGenerator requires llm (vision- or text-language model).')
        self.image_key = image_key
        self.context_key = context_key
        self.num_qa = num_qa
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._generator = llm.share().prompt(self.prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        context = data.get(self.context_key, '')
        if not context:
            return []
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        try:
            query = f'Context:\n{context}\n\nGenerate {self.num_qa} pretraining-format Q&A pairs (phi-4 style).'
            inputs = encode_query_with_filepaths(query, paths) if paths else query
            out = self._generator(inputs)
            if not isinstance(out, dict):
                data['qa_pairs'] = []
                return data
            raw = out.get('qa_pairs', [])
            if not isinstance(raw, list):
                data['qa_pairs'] = []
                return data
            qa_pairs = []
            for item in raw:
                if isinstance(item, dict) and 'query' in item and 'answer' in item:
                    qa_pairs.append({'query': str(item['query']), 'answer': str(item['answer'])})
                elif isinstance(item, dict) and 'question' in item:
                    qa_pairs.append({
                        'query': str(item.get('question', item.get('query', ''))),
                        'answer': str(item.get('answer', item.get('ans', ''))),
                    })
            data['qa_pairs'] = qa_pairs
            return data
        except Exception as e:
            LOG.warning(f'Phi4 Q&A generation failed: {e}')
            return []
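
The model output is normalized before it is stored under `qa_pairs`. The sketch below reproduces that clean-up as a standalone function so the accepted shapes are visible; `normalize_qa_pairs` is a hypothetical name and the sample output is invented for illustration.

```python
def normalize_qa_pairs(out):
    """Sketch of the qa_pairs clean-up performed in forward."""
    raw = out.get('qa_pairs', []) if isinstance(out, dict) else []
    if not isinstance(raw, list):
        return []
    pairs = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        if 'query' in item and 'answer' in item:
            pairs.append({'query': str(item['query']), 'answer': str(item['answer'])})
        elif 'question' in item:
            # Alternative key names: 'question' for the query, 'ans' for the answer.
            pairs.append({
                'query': str(item['question']),
                'answer': str(item.get('answer', item.get('ans', ''))),
            })
    return pairs


model_output = {'qa_pairs': [
    {'query': 'Q1', 'answer': 'A1'},
    {'question': 'Q2', 'ans': 'A2'},
    {'query': 'Q3'},   # no answer and no 'question' key: skipped
    'junk',            # not a dict: skipped
]}
print(normalize_qa_pairs(model_output))
```

Items that match neither shape are silently discarded, so the resulting list can be shorter than `num_qa`.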

`TextRelevanceFilter`

Bases: PT_MM

Use VLM to judge image-text relevance; filter samples below the threshold.

Parameters:

vlm –

vision-language model instance
image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
text_key (str, default: 'text' ) –

key for text, default 'text'
threshold (float, default: 0.6 ) –

relevance threshold [0,1], default 0.6
prompt (str, default: None ) –

optional custom prompt

Examples:

import lazyllm
from lazyllm.tools.data import pt_mm

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.TextRelevanceFilter(vlm, threshold=0.5)
res = op([{'image_path': '/path/to/image.jpg', 'text': 'a red square'}])
# samples whose mean relevance >= threshold are kept; res[0]['image_text_relevance'] holds that mean

Source code in lazyllm/tools/data/operators/pt_op.py

class TextRelevanceFilter(PT_MM):
    """Use VLM to judge image-text relevance; filter samples below the threshold.

Args:
    vlm: vision-language model instance
    image_key (str): key for image path(s), default 'image_path'
    text_key (str): key for text, default 'text'
    threshold (float): relevance threshold [0,1], default 0.6
    prompt (str): optional custom prompt


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt_mm

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt_mm.TextRelevanceFilter(vlm, threshold=0.5)
    res = op([{'image_path': '/path/to/image.jpg', 'text': 'a red square'}])
    # samples with relevance >= threshold are kept
    ```
    """
    DEFAULT_PROMPT = (
        'You are an image-text relevance judge.\n'
        'Given ONE image and ONE piece of text, you must output STRICT JSON and nothing else.\n'
        'JSON schema:\n'
        '{\n'
        '  "relevance": 0.0,  // float in [0, 1]\n'
        '  "reason": ""      // short string\n'
        '}\n'
        'Rules:\n'
        '- relevance=1 means fully relevant; relevance=0 means irrelevant.\n'
        '- Do not output markdown, code fences, or any extra words outside JSON.\n'
    )

    def __init__(self, vlm, image_key='image_path', text_key='text', threshold=0.6,
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if vlm is None:
            raise ValueError('TextRelevanceFilter requires vlm (vision-language model).')
        self.image_key = image_key
        self.text_key = text_key
        self.threshold = threshold
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._judge = vlm.share().prompt(self.prompt).formatter(JsonFormatter())

    def _calc_relevance(self, image_path, text):
        if not text or not image_path or not os.path.exists(image_path):
            return 0.0
        try:
            out = self._judge(encode_query_with_filepaths(text, [image_path]))
            v = out.get('relevance', 0.0) if isinstance(out, dict) else 0.0
            v = max(0.0, min(1.0, float(v))) if isinstance(v, (int, float)) else 0.0
            return v
        except Exception as e:
            LOG.warning(f'VLM relevance failed: {e}')
            return 0.0

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        text = data.get(self.text_key, '')
        if not paths or not text:
            return []
        try:
            scores = [self._calc_relevance(p, text) for p in paths]
            mean_relevance = sum(scores) / len(scores) if scores else 0.0
            if mean_relevance < self.threshold:
                return []
            valid_paths = [p for p, s in zip(paths, scores) if s >= self.threshold]
            if not valid_paths:
                return []
            data[self.image_key] = valid_paths
            data['image_text_relevance'] = mean_relevance
            return data
        except Exception as e:
            LOG.warning(f'Failed to calculate image-text relevance: {e}')
            return []
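
The filter applies the threshold twice: once to the mean score of all images in a sample, and once to each image. The following sketch isolates that decision, with precomputed scores standing in for the VLM calls; `filter_by_relevance` is a hypothetical helper and the scores are made up.

```python
def filter_by_relevance(paths, scores, threshold=0.6):
    """Return (kept_paths, mean) or None when the sample would be dropped."""
    if not paths:
        return None
    mean = sum(scores) / len(scores)
    # First gate: the sample as a whole must reach the threshold on average.
    if mean < threshold:
        return None
    # Second gate: only images that individually reach the threshold are kept.
    kept = [p for p, s in zip(paths, scores) if s >= threshold]
    if not kept:
        return None
    return kept, mean


print(filter_by_relevance(['a.jpg', 'b.jpg'], [1.0, 0.5]))
print(filter_by_relevance(['a.jpg', 'b.jpg'], [0.75, 0.25]))
```

In the first call the mean is 0.75, so the sample survives but only `a.jpg` is retained; in the second call the mean is 0.5 and the whole sample is dropped, even though `a.jpg` alone would have passed.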

`VQAGenerator`

Bases: PT_MM

Use VLM to generate Visual Question Answering (VQA) pairs from context and images.

Parameters:

vlm –

vision-language model instance
image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
context_key (str, default: 'context' ) –

key for context, default 'context'
num_qa (int, default: 5 ) –

number of QA pairs to generate, default 5
prompt (str, default: None ) –

optional custom prompt

Examples:

import lazyllm
from lazyllm.tools.data import pt_mm

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.VQAGenerator(vlm, num_qa=3)
res = op([{'image_path': '/path/to/image.jpg', 'context': 'A simple image.'}])
# res[0]['qa_pairs'] contains [{'query': '...', 'answer': '...'}, ...]

Source code in lazyllm/tools/data/operators/pt_op.py

class VQAGenerator(PT_MM):
    """Use VLM to generate Visual Question Answering (VQA) pairs from context and images.

Args:
    vlm: vision-language model instance
    image_key (str): key for image path(s), default 'image_path'
    context_key (str): key for context, default 'context'
    num_qa (int): number of QA pairs to generate, default 5
    prompt (str): optional custom prompt


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt_mm

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt_mm.VQAGenerator(vlm, num_qa=3)
    res = op([{'image_path': '/path/to/image.jpg', 'context': 'A simple image.'}])
    # res[0]['qa_pairs'] contains [{'query': '...', 'answer': '...'}, ...]
    ```
    """
    DEFAULT_PROMPT = (
        'Generate Visual Question Answering (VQA) pairs from the given context and image(s). '
        'Output JSON only. Do not output any other irrelevant content.\n'
        '{\n'
        '  "qa_pairs": [\n'
        '    {"query": "", "answer": ""}\n'
        '  ]\n'
        '}\n'
        'Each item in qa_pairs has query (question) and answer. '
        'All questions should be answerable from the context and image.'
    )

    def __init__(self, vlm, image_key='image_path', context_key='context', num_qa=5,
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if vlm is None:
            raise ValueError('VQAGenerator requires vlm (vision-language model).')
        self.image_key = image_key
        self.context_key = context_key
        self.num_qa = num_qa
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._generator = vlm.share().prompt(self.prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        context = data.get(self.context_key, '')
        if not paths or not context:
            return []
        try:
            query = f'Context: {context}\n\nGenerate {self.num_qa} QA pairs based on the context and image(s).'
            out = self._generator(encode_query_with_filepaths(query, paths))
            if not isinstance(out, dict):
                data['qa_pairs'] = []
                return data
            raw = out.get('qa_pairs', [])
            if not isinstance(raw, list):
                data['qa_pairs'] = []
                return data
            qa_pairs = []
            for item in raw:
                if isinstance(item, dict) and 'query' in item and 'answer' in item:
                    qa_pairs.append({'query': str(item['query']), 'answer': str(item['answer'])})
                elif isinstance(item, dict) and 'question' in item:
                    qa_pairs.append({
                        'query': str(item.get('question', item.get('query', ''))),
                        'answer': str(item.get('answer', item.get('ans', ''))),
                    })
            data['qa_pairs'] = qa_pairs
            return data
        except Exception as e:
            LOG.warning(f'VQA generation failed: {e}')
            return []

`VQAScorer`

Bases: PT_MM

Use VLM to score VQA pair quality (query, answer, image_path), evaluating how good the visual QA is.

Parameters:

vlm –

vision-language model instance
image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
query_key (str, default: 'query' ) –

key for question, default 'query'
answer_key (str, default: 'answer' ) –

key for answer, default 'answer'
prompt (str, default: None ) –

optional custom prompt

Examples:

import lazyllm
from lazyllm.tools.data import pt_mm

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.VQAScorer(vlm)
res = op([{
    'image_path': '/path/to/image.jpg',
    'query': 'What color is it?',
    'answer': 'Red',
}])
# res[0]['quality_score'] contains score, relevance, correctness, reason

Source code in lazyllm/tools/data/operators/pt_op.py

class VQAScorer(PT_MM):
    """Use VLM to score VQA pair quality (query, answer, image_path), evaluating how good the visual QA is.

Args:
    vlm: vision-language model instance
    image_key (str): key for image path(s), default 'image_path'
    query_key (str): key for question, default 'query'
    answer_key (str): key for answer, default 'answer'
    prompt (str): optional custom prompt


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt_mm

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt_mm.VQAScorer(vlm)
    res = op([{
        'image_path': '/path/to/image.jpg',
        'query': 'What color is it?',
        'answer': 'Red',
    }])
    # res[0]['quality_score'] contains score, relevance, correctness, reason
    ```
    """
    DEFAULT_PROMPT = (
        'Given an image and a VQA pair (query, answer), rate the quality of this VQA. '
        'Output JSON only. Do not output any other irrelevant content.\n'
        '{\n'
        '  "score": 0.0,\n'
        '  "relevance": 0.0,\n'
        '  "correctness": 0.0,\n'
        '  "reason": ""\n'
        '}\n'
        'score: overall VQA quality [0, 1]; relevance: answer relevance to query [0, 1]; '
        'correctness: answer correctness given the image [0, 1]. All floats.'
    )

    def __init__(self, vlm, image_key='image_path', query_key='query', answer_key='answer',
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if vlm is None:
            raise ValueError('VQAScorer requires vlm (vision-language model).')
        self.image_key = image_key
        self.query_key = query_key
        self.answer_key = answer_key
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._scorer = vlm.share().prompt(self.prompt).formatter(JsonFormatter())

    def _clamp_score(self, v):
        try:
            return max(0.0, min(1.0, float(v)))
        except (TypeError, ValueError):
            return 0.0

    def _calc_vqa_quality(self, query, answer, image_path):
        if not query or not answer:
            return 0.0, {}
        if not image_path or not os.path.exists(image_path):
            return 0.0, {}
        try:
            eval_query = (
                f'Query: {query}\nAnswer: {answer}\n\n'
                'Rate the quality of this VQA pair given the image. How relevant and correct is the answer?'
            )
            out = self._scorer(encode_query_with_filepaths(eval_query, [image_path]))
            if not isinstance(out, dict):
                return 0.0, {}
            score = self._clamp_score(out.get('score', out.get('overall', 0.0)))
            return score, out
        except Exception as e:
            LOG.warning(f'VLM VQA scoring failed: {e}')
            return 0.0, {}

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        query = data.get(self.query_key, '')
        answer = data.get(self.answer_key, '')
        if not paths or not query or not answer:
            return []
        try:
            image_path = paths[0]
            _, out = self._calc_vqa_quality(query, answer, image_path)
            data['quality_score'] = {
                'score': self._clamp_score(out.get('score', out.get('overall', 0.0))),
                'relevance': self._clamp_score(out.get('relevance', 0.0)),
                'correctness': self._clamp_score(out.get('correctness', 0.0)),
                'reason': str(out.get('reason', '')),
            }
            return data
        except Exception as e:
            LOG.warning(f'Failed to score VQA quality: {e}')
            return []
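
The score post-processing in forward can be tried without a model. The following is a minimal standalone sketch, not part of lazyllm: it takes a dict shaped like the JSON the default prompt asks for and builds the same quality_score structure. The helper name build_quality_score and the sample model_output are illustrative assumptions.

```python
def clamp_score(v):
    """Coerce v to a float in [0, 1]; fall back to 0.0 when it cannot be parsed."""
    try:
        return max(0.0, min(1.0, float(v)))
    except (TypeError, ValueError):
        return 0.0


def build_quality_score(out):
    """Hypothetical helper mirroring how forward assembles 'quality_score'."""
    return {
        'score': clamp_score(out.get('score', out.get('overall', 0.0))),
        'relevance': clamp_score(out.get('relevance', 0.0)),
        'correctness': clamp_score(out.get('correctness', 0.0)),
        'reason': str(out.get('reason', '')),
    }


# Assumed model output: an out-of-range score and a non-numeric relevance.
model_output = {'score': 1.3, 'relevance': 'n/a', 'correctness': '0.8', 'reason': 'ok'}
quality = build_quality_score(model_output)
print(quality)
```

Clamping means a malformed or out-of-range field degrades to a boundary value or 0.0 instead of raising, so one bad model response does not abort the batch.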

`integrity_check(data, image_key='image_path', input_key=None)`

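Check image file integrity: drop missing, corrupted, or empty files and keep only the paths of valid images. The image field is rewritten as a list of the valid paths; if none remain, the record is dropped (an empty list is returned).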

Parameters:

data (dict) –

single data dict
image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
input_key (str, default: None ) –

optional, overrides image_key

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.integrity_check()
res = op([{'image_path': '/path/to/image.jpg'}, {'image_path': '/nonexistent.png'}])
# only valid images retained

Source code in lazyllm/tools/data/operators/pt_op.py

@data_register('data.pt_mm', rewrite_func='forward', _concurrency_mode='thread')
def integrity_check(data, image_key='image_path', input_key=None):
    """Check image file integrity; filter out corrupted or empty files, keep paths of valid images.

Args:
    data (dict): single data dict
    image_key (str): key for image path(s), default 'image_path'
    input_key (str): optional, overrides image_key


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.integrity_check()
    res = op([{'image_path': '/path/to/image.jpg'}, {'image_path': '/nonexistent.png'}])
    # only valid images retained
    ```
    """
    assert isinstance(data, dict)
    if input_key:
        image_key = input_key
    paths = _normalize_image_paths(data.get(image_key, ''))
    if not paths:
        return []
    valid_paths = []
    try:
        for image_path in paths:
            if not os.path.exists(image_path):
                LOG.warning(f'Image path not found: {image_path}')
                continue
            try:
                with PIL.Image.open(image_path) as img:
                    img.verify()
                if os.path.getsize(image_path) == 0:
                    continue
                valid_paths.append(image_path)
            except Exception as e:
                LOG.warning(f'Failed to check file integrity for {image_path}: {e}')
                continue
        if not valid_paths:
            return []
        data[image_key] = valid_paths
        return data
    except Exception as e:
        LOG.warning(f'Failed to check file integrity: {e}')
        return []

`resolution_filter(data, image_key='image_path', min_width=256, min_height=256, max_width=4096, max_height=4096, input_key=None)`

Filter images by min/max width and height, keeping only those whose width and height both fall within the specified range (bounds inclusive).

Parameters:

data (dict) –

single data dict
image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
min_width (int, default: 256 ) –

minimum width, default 256
min_height (int, default: 256 ) –

minimum height, default 256
max_width (int, default: 4096 ) –

maximum width, default 4096
max_height (int, default: 4096 ) –

maximum height, default 4096
input_key (str, default: None ) –

optional, overrides image_key

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.resolution_filter(min_width=256, min_height=256, max_width=4096, max_height=4096)
res = op([{'image_path': '/path/to/image.jpg'}])

Source code in lazyllm/tools/data/operators/pt_op.py

@data_register('data.pt_mm', rewrite_func='forward', _concurrency_mode='thread')
def resolution_filter(data, image_key='image_path', min_width=256, min_height=256,
                      max_width=4096, max_height=4096, input_key=None):
    """Filter images by min/max width and height, keeping only those within the specified resolution range.

Args:
    data (dict): single data dict
    image_key (str): key for image path(s), default 'image_path'
    min_width (int): minimum width, default 256
    min_height (int): minimum height, default 256
    max_width (int): maximum width, default 4096
    max_height (int): maximum height, default 4096
    input_key (str): optional, overrides image_key


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.resolution_filter(min_width=256, min_height=256, max_width=4096, max_height=4096)
    res = op([{'image_path': '/path/to/image.jpg'}])
    ```
    """
    assert isinstance(data, dict)
    if input_key:
        image_key = input_key
    paths = _normalize_image_paths(data.get(image_key, ''))
    if not paths:
        return []
    valid_paths = []
    try:
        for image_path in paths:
            if not os.path.exists(image_path):
                LOG.warning(f'Image path not found or invalid: {image_path}')
                continue
            with PIL.Image.open(image_path) as img:
                width, height = img.size
                if width < min_width or height < min_height:
                    continue
                if width > max_width or height > max_height:
                    continue
                valid_paths.append(image_path)
        if not valid_paths:
            return []
        data[image_key] = valid_paths
        return data
    except Exception as e:
        LOG.warning(f'Failed to check image resolution: {e}')
        return []
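
The bounds check itself is independent of Pillow. A minimal sketch of the comparison, assuming the width and height have already been read from the image; the function name within_resolution is hypothetical:

```python
def within_resolution(width, height, min_width=256, min_height=256,
                      max_width=4096, max_height=4096):
    """Return True when both dimensions fall inside the inclusive bounds."""
    return min_width <= width <= max_width and min_height <= height <= max_height


# Made-up (width, height) pairs for illustration.
sizes = [(256, 256), (255, 600), (4096, 4097), (1920, 1080)]
kept = [s for s in sizes if within_resolution(*s)]
print(kept)
```

An image exactly at a bound is kept; one pixel outside either bound in either dimension is enough to drop it.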

`resolution_resize(data, image_key='image_path', max_side=1024, input_key=None, inplace=True)`

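Downscale images whose longest side exceeds max_side, preserving the aspect ratio; images already within the limit are left untouched. The resized image either overwrites the original file or is saved to a new file.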

Parameters:

data (dict) –

single data dict
image_key (str, default: 'image_path' ) –

key for image path(s), default 'image_path'
max_side (int, default: 1024 ) –

max length of longest side, default 1024
inplace (bool, default: True ) –

overwrite original file if True; if False, save with _resized suffix
input_key (str, default: None ) –

optional, overrides image_key

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.resolution_resize(max_side=400, inplace=False)
res = op([{'image_path': '/path/to/image.jpg'}])
# resized file saved as image_resized.jpg in same directory

Source code in lazyllm/tools/data/operators/pt_op.py

@data_register('data.pt_mm', rewrite_func='forward', _concurrency_mode='thread')
def resolution_resize(data, image_key='image_path', max_side=1024, input_key=None, inplace=True):
    """Resize image so the longest side does not exceed max_side. Can overwrite in place or save to a new file.

Args:
    data (dict): single data dict
    image_key (str): key for image path(s), default 'image_path'
    max_side (int): max length of longest side, default 1024
    inplace (bool): overwrite original file if True; if False, save with _resized suffix
    input_key (str): optional, overrides image_key


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.resolution_resize(max_side=400, inplace=False)
    res = op([{'image_path': '/path/to/image.jpg'}])
    # resized file saved as image_resized.jpg in same directory
    ```
    """
    assert isinstance(data, dict)
    if input_key:
        image_key = input_key
    paths = _normalize_image_paths(data.get(image_key, ''))
    if not paths:
        return []
    valid_paths = []
    try:
        for image_path in paths:
            if not os.path.exists(image_path):
                LOG.warning(f'Image path not found or invalid: {image_path}')
                continue
            with PIL.Image.open(image_path) as img:
                img.load()
                w, h = img.size
                if max(w, h) <= max_side:
                    valid_paths.append(image_path)
                    continue
                scale = max_side / max(w, h)
                new_w, new_h = int(round(w * scale)), int(round(h * scale))
                if new_w < 1 or new_h < 1:
                    continue
                resample = getattr(
                    getattr(PIL.Image, 'Resampling', None), 'LANCZOS', PIL.Image.LANCZOS
                )
                out = img.resize((new_w, new_h), resample)
                if inplace:
                    save_path = image_path
                else:
                    base, ext = os.path.splitext(image_path)
                    save_path = f'{base}_resized{ext}'
                out.save(save_path, quality=95)
                valid_paths.append(save_path)
        if not valid_paths:
            return []
        data[image_key] = valid_paths
        return data
    except Exception as e:
        LOG.warning(f'Failed to resize image resolution: {e}')
        return []
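
The target size computation can be checked in isolation. A minimal sketch of the arithmetic used above, with a hypothetical helper name resized_dims and no image I/O:

```python
def resized_dims(w, h, max_side=1024):
    """Return the target (width, height), or None when rounding collapses a side to zero."""
    longest = max(w, h)
    if longest <= max_side:
        return w, h  # already small enough: unchanged
    scale = max_side / longest
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    if new_w < 1 or new_h < 1:
        return None  # corresponds to the image being skipped
    return new_w, new_h


landscape = resized_dims(2000, 1000)
small = resized_dims(800, 600)
sliver = resized_dims(5000, 1, max_side=400)
print(landscape, small, sliver)
```

The same scale factor is applied to both sides, so the aspect ratio is preserved up to rounding. Extremely thin images can round to a zero-length side, which is why they are skipped rather than saved.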

Refine Operators

`lazyllm.tools.data.operators.refine_op`

`remove_emoji(data, input_key='content')`

Remove emoji characters from the specified text field.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'

Examples:

from lazyllm.tools.data import refine

func = refine.remove_emoji(input_key='content')
inputs = [{'content': 'Hello 😊 World 🌍!'}]
res = func(inputs)
print(res)
# [{'content': 'Hello  World !'}]

Source code in lazyllm/tools/data/operators/refine_op.py

@data_register('data.refine', rewrite_func='forward', _concurrency_mode='process')
def remove_emoji(data, input_key='content'):
    """Remove emoji characters from the specified text field.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'


Examples:
    ```python
    from lazyllm.tools.data import refine

    func = refine.remove_emoji(input_key='content')
    inputs = [{'content': 'Hello 😊 World 🌍!'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Hello  World !'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key, '')
    if text:
        data[input_key] = EMOJIS.sub('', text)
    return data
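
The EMOJIS pattern is defined elsewhere in the module and is not shown here. As a rough illustration of the approach only, the sketch below uses an assumed regex covering a few common emoji blocks; the real pattern may cover more ranges.

```python
import re

# Assumed pattern: a few common emoji blocks only; the library's EMOJIS regex may differ.
EMOJI_SKETCH = re.compile(
    '['
    '\U0001F300-\U0001F5FF'  # symbols and pictographs
    '\U0001F600-\U0001F64F'  # emoticons
    '\U0001F680-\U0001F6FF'  # transport and map symbols
    '\u2600-\u27BF'          # miscellaneous symbols and dingbats
    ']+'
)


def strip_emoji(text):
    """Delete every character matched by the sketch pattern."""
    return EMOJI_SKETCH.sub('', text)


cleaned = strip_emoji('Hello 😊 World 🌍!')
print(repr(cleaned))
```

Emoji are deleted rather than replaced with a space, so the whitespace on either side remains and can leave a double space.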

`remove_extra_spaces(data, input_key='content')`

Normalize whitespace by collapsing multiple spaces, newlines and tabs into single spaces.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'

Examples:

from lazyllm.tools.data import refine

func = refine.remove_extra_spaces(input_key='content')
inputs = [{'content': 'hello   world\n\n  foo\tbar'}]
res = func(inputs)
print(res)
# [{'content': 'hello world foo bar'}]

Source code in lazyllm/tools/data/operators/refine_op.py

@data_register('data.refine', rewrite_func='forward', _concurrency_mode='process')
def remove_extra_spaces(data, input_key='content'):
    """Normalize whitespace by collapsing multiple spaces, newlines and tabs into single spaces.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'


Examples:
    ```python
    from lazyllm.tools.data import refine

    func = refine.remove_extra_spaces(input_key='content')
    inputs = [{'content': 'hello   world\\n\\n  foo\\tbar'}]
    res = func(inputs)
    print(res)
    # [{'content': 'hello world foo bar'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key, '')
    if text:
        text = text.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')
        data[input_key] = ' '.join(text.split())
    return data
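
The collapsing step relies on str.split() with no arguments, which splits on any run of whitespace and discards empty pieces. A minimal standalone sketch (the function name collapse_whitespace is illustrative):

```python
def collapse_whitespace(text):
    """Join the whitespace-separated pieces of text with single spaces."""
    return ' '.join(text.split())


collapsed = collapse_whitespace('  hello   world\n\n  foo\tbar ')
print(repr(collapsed))
```

Leading and trailing whitespace is removed as a side effect, because split() produces no empty pieces at either end.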

`remove_html_entity(data, input_key='content')`

Remove HTML entities (e.g. &nbsp;, &lt;, &amp;) from the specified text field.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'

Examples:

from lazyllm.tools.data import refine

func = refine.remove_html_entity(input_key='content')
inputs = [{'content': 'Hello&nbsp;World &amp; &lt;tag&gt;'}]
res = func(inputs)
print(res)
# [{'content': 'HelloWorld  tag'}]

Source code in lazyllm/tools/data/operators/refine_op.py

@data_register('data.refine', rewrite_func='forward', _concurrency_mode='process')
def remove_html_entity(data, input_key='content'):
    """Remove HTML entities (e.g. &nbsp;, &lt;, &amp;) from the specified text field.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'


Examples:
    ```python
    from lazyllm.tools.data import refine

    func = refine.remove_html_entity(input_key='content')
    inputs = [{'content': 'Hello&nbsp;World &amp; &lt;tag&gt;'}]
    res = func(inputs)
    print(res)
    # [{'content': 'HelloWorld  tag'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key, '')
    if text:
        data[input_key] = HTML_ENTITY_PATTERN.sub('', text)
    return data
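
HTML_ENTITY_PATTERN is defined elsewhere in the module and is not shown here. The sketch below uses an assumed regex for named, decimal, and hexadecimal character references to illustrate the behavior; the library's actual pattern may differ.

```python
import re

# Assumed pattern: named (&amp;), decimal (&#169;) and hex (&#xA9;) character references.
ENTITY_SKETCH = re.compile(r'&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);')


def strip_entities(text):
    """Delete entities outright; they are not decoded to characters."""
    return ENTITY_SKETCH.sub('', text)


stripped = strip_entities('Hello&nbsp;World &amp; &lt;tag&gt;')
print(repr(stripped))
```

Entities are removed, not decoded, so 'Hello&nbsp;World' becomes 'HelloWorld' with no space. If decoded characters are wanted instead, the standard library's html.unescape is the usual tool.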

`remove_html_url(data, input_key='content')`

Remove HTTP/HTTPS URLs and HTML tags from the specified text field.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'

Examples:

from lazyllm.tools.data import refine

func = refine.remove_html_url(input_key='content')
inputs = [{'content': 'Check https://example.com and <b>bold</b>'}]
res = func(inputs)
print(res)
# [{'content': 'Check  and bold'}]

Source code in lazyllm/tools/data/operators/refine_op.py

@data_register('data.refine', rewrite_func='forward', _concurrency_mode='process')
def remove_html_url(data, input_key='content'):
    """Remove HTTP/HTTPS URLs and HTML tags from the specified text field.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'


Examples:
    ```python
    from lazyllm.tools.data import refine

    func = refine.remove_html_url(input_key='content')
    inputs = [{'content': 'Check https://example.com and <b>bold</b>'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Check  and bold'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key, '')
    if text:
        text = URL_PATTERN.sub('', text)
        text = HTML_PATTERN.sub('', text)
        data[input_key] = text
    return data
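
URL_PATTERN and HTML_PATTERN are defined elsewhere in the module and are not shown here. A minimal sketch of the two-pass removal, using deliberately simple assumed patterns:

```python
import re

# Assumed patterns for illustration; the library's own regexes may be stricter.
URL_SKETCH = re.compile(r'https?://\S+')
TAG_SKETCH = re.compile(r'<[^>]+>')


def strip_urls_and_tags(text):
    """Remove URLs first, then HTML tags, in the same order as the operator."""
    text = URL_SKETCH.sub('', text)
    return TAG_SKETCH.sub('', text)


no_markup = strip_urls_and_tags('Check https://example.com and <b>bold</b>')
print(repr(no_markup))
```

Only the tags are removed; the text between an opening and closing tag is kept.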

Filter Operators

`lazyllm.tools.data.operators.filter_op`

`CapitalWordFilter`

Bases: Filter

Filter out text in which the ratio of all-caps words exceeds max_ratio.

Parameters:

input_key (str, default: 'content' ) –

key of the text field, default 'content'
max_ratio (float, default: 0.5 ) –

max ratio of all-caps words, default 0.5
use_tokenizer (bool, default: False ) –

if True, tokenize with NLTK word_tokenize instead of splitting on whitespace, default False
_concurrency_mode (str, default: 'thread' ) –

optional concurrency mode

Examples:

from lazyllm.tools.data import filter

func = filter.CapitalWordFilter(input_key='content', max_ratio=0.5)
inputs = [{'content': 'Normal text with Some Capitals'}, {'content': 'MOSTLY UPPERCASE'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text with Some Capitals'}]

Source code in lazyllm/tools/data/operators/filter_op.py

class CapitalWordFilter(Filter):
    """Filter text with too high ratio of all-caps words.

Args:
    input_key (str): key of the text field, default 'content'
    max_ratio (float): max ratio of all-caps words, default 0.5
    use_tokenizer (bool): use tokenizer, default False
    _concurrency_mode (str): optional concurrency mode


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.CapitalWordFilter(input_key='content', max_ratio=0.5)
    inputs = [{'content': 'Normal text with Some Capitals'}, {'content': 'MOSTLY UPPERCASE'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal text with Some Capitals'}]
    ```
    """
    def __init__(self, input_key='content', max_ratio=0.5, use_tokenizer=False,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.max_ratio = max_ratio
        self.use_tokenizer = use_tokenizer

        if self.use_tokenizer:
            nltk_data_dir = _setup_nltk_data_dir()
            try:
                nltk.data.find('tokenizers/punkt_tab')
            except LookupError:
                LOG.info('Downloading NLTK punkt_tab tokenizer...')
                nltk.download('punkt_tab', quiet=True, download_dir=nltk_data_dir)

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return []

        if self.use_tokenizer:
            words = nltk.word_tokenize(text)
        else:
            words = text.split()

        num_words = len(words)
        if num_words == 0:
            return []

        num_caps_words = sum(1 for word in words if word.isupper())
        ratio = num_caps_words / num_words

        if ratio <= self.max_ratio:
            return data
        else:
            return []
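
The decision rule with whitespace splitting (use_tokenizer=False) can be reproduced with the standard library. A minimal sketch; the helper names caps_ratio and keep_by_caps are illustrative:

```python
def caps_ratio(text):
    """Fraction of whitespace-separated words that are all upper case, or None if empty."""
    words = text.split()
    if not words:
        return None
    return sum(1 for w in words if w.isupper()) / len(words)


def keep_by_caps(text, max_ratio=0.5):
    """Keep the text when the ratio is at or below max_ratio (inclusive)."""
    ratio = caps_ratio(text)
    return ratio is not None and ratio <= max_ratio


samples = ['Normal text with Some Capitals', 'MOSTLY UPPERCASE', 'NASA launched rockets TODAY']
decisions = [keep_by_caps(s) for s in samples]
print(decisions)
```

A word counts as all-caps only when str.isupper() is true, so title-cased words such as 'Some' do not contribute. The comparison is inclusive: a ratio exactly equal to max_ratio is kept.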

`MinHashDeduplicator`

Bases: Filter

Remove near-duplicate texts using MinHash LSH. Operates on the whole batch and keeps the first occurrence of each group of near-duplicate texts.

Parameters:

input_key (str, default: 'content' ) –

key of the text field, default 'content'
threshold (float, default: 0.85 ) –

Jaccard similarity threshold passed to MinHash LSH, default 0.85
num_perm (int, default: 128 ) –

number of MinHash permutations, default 128
use_n_gram (bool, default: True ) –

if True, hash character n-grams; if False, hash individual characters. Default True
ngram (int, default: 5 ) –

n-gram size, default 5

Examples:

from lazyllm.tools.data import filter

func = filter.MinHashDeduplicator(input_key='content', threshold=0.85)
inputs = [{'uid': '0', 'content': '这是第一段不同的内容。'}, {'uid': '1', 'content': '这是第一段不同的内容。'}]
res = func(inputs)
print(res)
# [{'uid': '0', 'content': '这是第一段不同的内容。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

class MinHashDeduplicator(Filter):
    """Remove near-duplicate texts using MinHash LSH. For batch input, keeps first occurrence of each unique text.

Args:
    input_key (str): key of the text field, default 'content'
    threshold (float): similarity threshold, default 0.85
    num_perm (int): number of MinHash permutations, default 128
    use_n_gram (bool): use n-gram, default True
    ngram (int): n-gram size, default 5


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.MinHashDeduplicator(input_key='content', threshold=0.85)
    inputs = [{'uid': '0', 'content': '这是第一段不同的内容。'}, {'uid': '1', 'content': '这是第一段不同的内容。'}]
    res = func(inputs)
    print(res)
    # [{'uid': '0', 'content': '这是第一段不同的内容。'}]
    ```
    """
    __reg_overwrite__ = 'forward_batch_input'

    def __init__(self, input_key='content', threshold=0.85, num_perm=128, use_n_gram=True, ngram=5, **kwargs):
        super().__init__(**kwargs)
        self.input_key = input_key
        self.threshold = threshold
        self.num_perm = num_perm
        self.use_n_gram = use_n_gram
        self.ngram = ngram
        self.lsh = datasketch.MinHashLSH(threshold=self.threshold, num_perm=self.num_perm)
        self._item_counter = 0
        self._minhash_map = {}

    def _create_minhash(self, text):
        minhash = datasketch.MinHash(num_perm=self.num_perm)
        if self.use_n_gram:
            for i in range(len(text) - self.ngram + 1):
                minhash.update(text[i:i + self.ngram].encode('utf8'))
        else:
            for char in text:
                minhash.update(char.encode('utf8'))
        return minhash

    def forward_batch_input(self, data, **kwargs):
        assert isinstance(data, list)

        kept_items = []
        for item in data:
            if not isinstance(item, dict) or self.input_key not in item:
                continue

            text = item[self.input_key]
            if not isinstance(text, str) or not text.strip():
                continue

            minhash = self._create_minhash(text)
            result = self.lsh.query(minhash)

            if len(result) == 0:
                self.lsh.insert(self._item_counter, minhash)
                self._minhash_map[self._item_counter] = minhash
                self._item_counter += 1
                kept_items.append(item)

        return kept_items
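
To illustrate the keep-first logic without the datasketch dependency, the sketch below swaps MinHash LSH for an exact Jaccard comparison over character n-gram sets. This is a different, slower technique (every kept item is compared with each new one) and is meant only to show the idea that MinHash approximates. All names are hypothetical.

```python
def shingles(text, n=5):
    """Character n-grams; texts shorter than n become one whole-text shingle (a choice of this sketch)."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def dedup_keep_first(items, key='content', threshold=0.85, n=5):
    """Keep an item only if it is not similar to any previously kept item."""
    kept, kept_shingles = [], []
    for item in items:
        text = item.get(key)
        if not isinstance(text, str) or not text.strip():
            continue  # skip missing or blank text, as the operator does
        s = shingles(text, n)
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of something already kept
        kept.append(item)
        kept_shingles.append(s)
    return kept


records = [
    {'uid': '0', 'content': 'the quick brown fox jumps over the lazy dog'},
    {'uid': '1', 'content': 'the quick brown fox jumps over the lazy dog'},
    {'uid': '2', 'content': 'an entirely different sentence about cats'},
]
unique = dedup_keep_first(records)
print([r['uid'] for r in unique])
```

MinHash LSH serves the same purpose at scale: it estimates Jaccard similarity from fixed-size signatures and retrieves candidates from an index instead of comparing all pairs.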

`StopWordFilter`

Bases: Filter

Filter out text whose stopword ratio is too high (e.g. low-value content made up mostly of stopwords).

Parameters:

input_key (str, default: 'content' ) –

key of the text field, default 'content'
max_ratio (float, default: 0.5 ) –

stopword ratio at or above which the text is dropped, default 0.5
use_tokenizer (bool, default: True ) –

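if True, tokenize with jieba (Chinese) or NLTK word_tokenize (English); if False, split into characters (Chinese) or on whitespace (English). Default True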
language (str, default: 'zh' ) –

language of the text: 'zh' (also 'cn', 'chinese') or 'en' (also 'english'); other values fall back to English stopwords. Default 'zh'
_concurrency_mode (str, default: 'thread' ) –

optional concurrency mode

Examples:

from lazyllm.tools.data import filter

func = filter.StopWordFilter(input_key='content', max_ratio=0.5, language='zh')
inputs = [{'content': '这是一段包含实际内容的正常文本。'}, {'content': '的了吗呢吧啊'}]
res = func(inputs)
print(res)
# [{'content': '这是一段包含实际内容的正常文本。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

class StopWordFilter(Filter):
    """Filter text with too high stopword ratio (e.g. invalid content mostly stopwords).

Args:
    input_key (str): key of the text field, default 'content'
    max_ratio (float): max stopword ratio, filter if exceeded, default 0.5
    use_tokenizer (bool): use tokenizer, default True
    language (str): language, 'zh' or 'en', default 'zh'
    _concurrency_mode (str): optional concurrency mode


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.StopWordFilter(input_key='content', max_ratio=0.5, language='zh')
    inputs = [{'content': '这是一段包含实际内容的正常文本。'}, {'content': '的了吗呢吧啊'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段包含实际内容的正常文本。'}]
    ```
    """
    def __init__(self, input_key='content', max_ratio=0.5, use_tokenizer=True, language='zh',
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.max_ratio = max_ratio
        self.use_tokenizer = use_tokenizer
        self.language = language.lower()

        nltk_data_dir = _setup_nltk_data_dir()
        try:
            nltk.data.find('corpora/stopwords')
        except LookupError:
            LOG.info('Downloading NLTK stopwords...')
            nltk.download('stopwords', quiet=True, download_dir=nltk_data_dir)

        if self.language in ['en', 'english']:
            self.stopwords = set(nltk.corpus.stopwords.words('english'))
        elif self.language in ['zh', 'cn', 'chinese']:
            self.stopwords = set(nltk.corpus.stopwords.words('chinese'))
        else:
            LOG.warning(f'Unsupported language: {self.language}, using English stopwords')
            self.stopwords = set(nltk.corpus.stopwords.words('english'))

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return []

        if self.language in ['zh', 'cn', 'chinese']:
            if self.use_tokenizer:
                words = list(jieba.cut(text.lower()))
            else:
                words = list(text)
        elif self.language in ['en', 'english']:
            if self.use_tokenizer:
                words = nltk.word_tokenize(text.lower())
            else:
                words = text.lower().split()
        else:
            words = text.lower().split()

        num_words = len(words)
        if num_words == 0:
            return []

        num_stop_words = sum(1 for w in words if w in self.stopwords)
        ratio = num_stop_words / num_words

        if ratio < self.max_ratio:
            return data
        else:
            return []
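
The ratio test can be sketched with a hand-written stopword set in place of NLTK's lists. The set below is a hypothetical, tiny English sample for illustration only, and the sketch uses whitespace splitting (the use_tokenizer=False path for English).

```python
# Hypothetical, tiny stopword list; the operator loads NLTK's stopword corpus instead.
STOPWORDS = {'the', 'a', 'an', 'of', 'is', 'and', 'to', 'in'}


def stopword_ratio(text):
    """Fraction of lower-cased, whitespace-separated words that are stopwords."""
    words = text.lower().split()
    if not words:
        return None
    return sum(1 for w in words if w in STOPWORDS) / len(words)


def keep_by_stopwords(text, max_ratio=0.5):
    """Keep the text only when the ratio is strictly below max_ratio."""
    ratio = stopword_ratio(text)
    return ratio is not None and ratio < max_ratio


samples = ['the cat sat on the mat', 'the of and to', 'the cat']
decisions = [keep_by_stopwords(s) for s in samples]
print(decisions)
```

The comparison is strict, so a text whose ratio equals max_ratio exactly ('the cat' at 0.5) is dropped. This differs from CapitalWordFilter, where the bound is inclusive.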

`SymbolRatioFilter`

Bases: Filter

Filter out text in which the ratio of specified symbols (e.g. #, ..., …) to words is too high.

Parameters:

input_key (str, default: 'content' ) –

key of the text field, default 'content'
max_ratio (float, default: 0.3 ) –

symbol-to-word ratio at or above which the text is dropped, default 0.3
symbols (list | None, default: None ) –

symbols to count, default ['#', '...', '…']
_concurrency_mode (str, default: 'process' ) –

optional concurrency mode

Examples:

from lazyllm.tools.data import filter

func = filter.SymbolRatioFilter(input_key='content', max_ratio=0.3)
inputs = [{'content': 'Normal text without symbols'}, {'content': '### ... … ###'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text without symbols'}]

Source code in lazyllm/tools/data/operators/filter_op.py

class SymbolRatioFilter(Filter):
    """Filter text with too high ratio of specified symbols (e.g. #, ..., …) to words.

Args:
    input_key (str): key of the text field, default 'content'
    max_ratio (float): max ratio of symbols to words, default 0.3
    symbols (list|None): symbols to count, default ['#', '...', '…']
    _concurrency_mode (str): optional concurrency mode


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.SymbolRatioFilter(input_key='content', max_ratio=0.3)
    inputs = [{'content': 'Normal text without symbols'}, {'content': '### ... … ###'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal text without symbols'}]
    ```
    """
    def __init__(self, input_key='content', max_ratio=0.3, symbols=None, _concurrency_mode='process', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.max_ratio = max_ratio
        self.symbols = symbols or ['#', '...', '…']
        self.tokenizer = nltk.WordPunctTokenizer()

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return []

        tokens = self.tokenizer.tokenize(text)
        word_tokens = [t for t in tokens if t not in self.symbols]
        num_words = len(word_tokens)
        if num_words == 0:
            return []

        num_symbols = sum(text.count(symbol) for symbol in self.symbols)
        ratio = num_symbols / num_words
        if ratio < self.max_ratio:
            return data
        else:
            return []
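
The ratio can be reproduced without NLTK by approximating the word/punctuation tokenizer with a regular expression. The pattern below is an assumption standing in for nltk.WordPunctTokenizer, and the function name symbol_ratio is illustrative.

```python
import re

# Assumed stand-in for nltk.WordPunctTokenizer: runs of word chars or runs of punctuation.
TOKEN_RE = re.compile(r'\w+|[^\w\s]+')


def symbol_ratio(text, symbols=('#', '...', '…')):
    """Occurrences of the symbols in the raw text divided by the number of non-symbol tokens."""
    tokens = TOKEN_RE.findall(text)
    words = [t for t in tokens if t not in symbols]
    if not words:
        return None
    return sum(text.count(s) for s in symbols) / len(words)


clean_ratio = symbol_ratio('Normal text without symbols')
noisy_ratio = symbol_ratio('### ... … ###')
print(clean_ratio, noisy_ratio)
```

In this sketch a token such as '###' is not an exact match for any listed symbol, so it counts as a word in the denominator, while its three '#' characters are still counted in the numerator. The ratio can therefore exceed 1.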

`TargetLanguageFilter`

Bases: Filter

Filter text by language using FastText. Keeps only texts in the specified language(s).

Parameters:

input_key (str, default: 'content' ) –

key of the text field, default 'content'
target_language (str | list, default: 'zho_Hans' ) –

target language code(s), e.g. 'zho_Hans', 'eng_Latn'
threshold (float, default: 0.6 ) –

confidence threshold, default 0.6
model_path (str | None, default: None ) –

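path to the FastText model file, or to a directory containing model.bin. If None, defaults to fasttext-language-identification/model.bin under the configured model_cache_dir; the model is downloaded when the path does not exist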
_concurrency_mode (str, default: 'thread' ) –

optional concurrency mode

Examples:

from lazyllm.tools.data import filter

func = filter.TargetLanguageFilter(input_key='content', target_language='zho_Hans', threshold=0.3)
inputs = [{'content': '这是一段中文文本。'}, {'content': 'This is English.'}]
res = func(inputs)
print(res)
# [{'content': '这是一段中文文本。'}]
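
The matching rule can be sketched without loading a FastText model. The example assumes that a record is kept as soon as one predicted label is in the allowed set with a score at or above threshold. The predictions are made-up values in the shape of parallel label and score sequences, and the function name matches_target is hypothetical.

```python
def matches_target(labels, scores, allowed, threshold=0.6):
    """Return True if any predicted label is allowed and confident enough."""
    for label, score in zip(labels, scores):
        if label.replace('__label__', '') in allowed and score >= threshold:
            return True
    return False


# Made-up predictions for illustration; real values come from the model.
labels = ('__label__zho_Hans', '__label__jpn', '__label__eng_Latn')
scores = (0.91, 0.05, 0.01)

is_chinese = matches_target(labels, scores, {'zho_Hans'})
is_english = matches_target(labels, scores, {'eng_Latn'})
strict = matches_target(labels, scores, {'zho_Hans'}, threshold=0.95)
print(is_chinese, is_english, strict)
```

Lowering threshold, as the example above does with 0.3, admits shorter or mixed-language texts for which the model is less confident.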

Source code in lazyllm/tools/data/operators/filter_op.py

class TargetLanguageFilter(Filter):
    """Filter text by language using FastText. Keeps only texts in the specified language(s).

Args:
    input_key (str): key of the text field, default 'content'
    target_language (str|list): target language code(s), e.g. 'zho_Hans', 'eng_Latn'
    threshold (float): confidence threshold, default 0.6
    model_path (str|None): path to FastText model
    _concurrency_mode (str): optional concurrency mode


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.TargetLanguageFilter(input_key='content', target_language='zho_Hans', threshold=0.3)
    inputs = [{'content': '这是一段中文文本。'}, {'content': 'This is English.'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段中文文本。'}]
    ```
    """
    COMMON_LANGUAGES = {
        'zho_Hans', 'zho_Hant', 'eng_Latn', 'spa_Latn', 'fra_Latn',
        'deu_Latn', 'jpn', 'kor', 'rus_Cyrl', 'ara', 'por_Latn',
        'ita_Latn', 'nld_Latn', 'pol_Latn', 'tur_Latn', 'vie',
        'tha', 'hin', 'ind_Latn', 'msa_Latn'
    }

    def __init__(self, input_key='content', target_language='zho_Hans', threshold=0.6, model_path=None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        if isinstance(target_language, str):
            self.allowed_languages = {target_language}
        else:
            self.allowed_languages = set(target_language)
        self.threshold = threshold
        if model_path is None:
            try:
                default_cache_dir = config['model_cache_dir']
            except (KeyError, TypeError):
                default_cache_dir = os.path.join(os.path.expanduser('~'), '.lazyllm', 'models')
            model_path = os.path.join(default_cache_dir, 'fasttext-language-identification', 'model.bin')
        self.model_path = model_path
        self._validate_languages()
        self.model = self._load_model()

    def _validate_languages(self):
        invalid_langs = self.allowed_languages - self.COMMON_LANGUAGES
        if invalid_langs:
            LOG.warning(
                f'TargetLanguageFilter: Invalid language codes: {invalid_langs}\n'
                f'Common language codes:\n'
                f'  - zho_Hans (Simplified Chinese), zho_Hant (Traditional Chinese)\n'
                f'  - eng_Latn (English)\n'
                f'  - spa_Latn (Spanish), fra_Latn (French), deu_Latn (German)\n'
                f'  - jpn (Japanese), kor (Korean)\n'
                f'  - rus_Cyrl (Russian), ara (Arabic)\n'
                f'  - por_Latn (Portuguese), ita_Latn (Italian)\n'
                f'Full list: {sorted(self.COMMON_LANGUAGES)}'
            )

    def _load_model(self):
        try:
            if os.path.isfile(self.model_path):
                model_file = self.model_path
            elif os.path.isdir(self.model_path):
                model_file = os.path.join(self.model_path, 'model.bin')
            else:
                model_file = self._download_model()

            if not os.path.exists(model_file):
                raise FileNotFoundError(f'Model file not found at {model_file}')

            LOG.info(f'Loading FastText language model from {model_file}...')
            model = fasttext.load_model(model_file)
            LOG.info('FastText language model loaded successfully.')
            return model
        except Exception as e:
            LOG.error(f'Error loading FastText model: {e}')
            raise

    def _download_model(self):
        LOG.info('Downloading FastText language identification model...')
        model_repo = 'facebook/fasttext-language-identification'
        try:
            model_source = config['model_source']
        except (KeyError, TypeError):
            model_source = 'modelscope'

        # Use model_path itself when it is an existing directory, otherwise its parent directory.
        model_dir = self.model_path if os.path.isdir(self.model_path) else os.path.dirname(self.model_path)

        os.makedirs(model_dir, exist_ok=True)
        model_manager = ModelManager(model_source=model_source)
        downloaded_path = model_manager.hub_downloader.download(model_repo, model_dir)

        if not downloaded_path:
            raise RuntimeError(f'Failed to download model: {model_repo}')
        model_file = os.path.join(downloaded_path, 'model.bin')
        if not os.path.exists(model_file):
            raise FileNotFoundError(f'Model file not found at {model_file}')

        self.model_path = model_file
        return model_file

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return []

        k = max(5, len(self.allowed_languages))
        labels, scores = self.model.predict(text.replace('\n', ' ').strip(), k=k)
        if len(labels) > 0 and len(scores) > 0:
            for label, score in zip(labels, scores):
                pred_label = label.replace('__label__', '')
                if pred_label in self.allowed_languages and score >= self.threshold:
                    return data

        return []
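
The acceptance rule in forward can be sketched without loading a model. This is a minimal illustration, not the operator itself: passes_language_check is a hypothetical helper, and the labels and scores are made-up values standing in for what model.predict might return.

```python
def passes_language_check(labels, scores, allowed_languages, threshold):
    """Return True if any predicted label is allowed and scores at least `threshold`."""
    for label, score in zip(labels, scores):
        # Strip the '__label__' prefix, as forward does, before comparing codes.
        pred_label = label.replace('__label__', '')
        if pred_label in allowed_languages and score >= threshold:
            return True
    return False


# Illustrative prediction only; real scores come from the FastText model.
labels = ('__label__eng_Latn', '__label__deu_Latn')
scores = (0.93, 0.04)

print(passes_language_check(labels, scores, {'eng_Latn'}, 0.6))
print(passes_language_check(labels, scores, {'deu_Latn'}, 0.6))
```

Because every one of the top-k predictions is checked, a record is kept as soon as any allowed language reaches the threshold, even if it is not the top-ranked label.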

`WordBlocklistFilter`

Bases: Filter

Filters out text in which blocked words occur more than threshold times, matched with an Aho-Corasick automaton.

Parameters:

input_key (str, default: 'content' ) –

key of the text field, default 'content'
blocklist (list | None, default: None ) –

list of blocked words
blocklist_path (str | None, default: None ) –

path to blocklist file
language (str, default: 'zh' ) –

language, 'zh' or 'en', default 'zh'
threshold (int, default: 1 ) –

max allowed occurrences of blocked words, default 1
_concurrency_mode (str, default: 'thread' ) –

optional concurrency mode

Examples:

from lazyllm.tools.data import filter

func = filter.WordBlocklistFilter(input_key='content', blocklist=['敏感', '违禁'], threshold=0)
inputs = [{'content': '这是正常的文本内容。'}, {'content': '这里包含敏感词。'}]
res = func(inputs)
print(res)
# [{'content': '这是正常的文本内容。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

class WordBlocklistFilter(Filter):
    """Filters out text in which blocked words occur more than threshold times, matched with an Aho-Corasick automaton.

Args:
    input_key (str): key of the text field, default 'content'
    blocklist (list|None): list of blocked words
    blocklist_path (str|None): path to blocklist file
    language (str): language, 'zh' or 'en', default 'zh'
    threshold (int): max allowed occurrences of blocked words, default 1
    _concurrency_mode (str): optional concurrency mode


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.WordBlocklistFilter(input_key='content', blocklist=['敏感', '违禁'], threshold=0)
    inputs = [{'content': '这是正常的文本内容。'}, {'content': '这里包含敏感词。'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是正常的文本内容。'}]
    ```
    """
    def __init__(self, input_key='content', blocklist=None, blocklist_path=None,
                 language='zh', threshold=1, _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.threshold = threshold
        self.language = language.lower()

        if blocklist is not None:
            words = [w.strip().lower() for w in blocklist if w and w.strip()]
        elif blocklist_path is not None:
            words = self._load_blocklist_from_file(blocklist_path)
        else:
            default_path = self._get_default_blocklist_path()
            words = self._load_blocklist_from_file(default_path)

        self._blocklist_words = words
        self._automaton = self._build_automaton(words)

        LOG.info(f'WordBlocklistFilter initialized with {len(words)} blocked words (AC automaton), '
                 f'language={self.language}')

    def _build_automaton(self, words):
        A = ahocorasick.Automaton()
        for idx, word in enumerate(words):
            A.add_word(word, (idx, word))
        A.make_automaton()
        return A

    def __getstate__(self):
        state = self.__dict__.copy()
        # automaton may not pickle well in process mode; keep words to rebuild
        state['_automaton'] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        if self._automaton is None and self._blocklist_words:
            self._automaton = self._build_automaton(self._blocklist_words)

    def _get_default_blocklist_path(self):
        current_dir = os.path.dirname(os.path.abspath(__file__))
        if self.language in ['zh', 'cn', 'chinese']:
            filename = 'zh.txt'
        elif self.language in ['en', 'english']:
            filename = 'en.txt'
        else:
            LOG.warning(f'Unsupported language: {self.language}, defaulting to zh.txt')
            filename = 'zh.txt'
        blocklist_path = os.path.join(current_dir, 'blocklist', filename)
        return blocklist_path

    def _load_blocklist_from_file(self, file_path):
        LOG.info(f'Loading blocklist from {file_path}...')
        with open(file_path, 'r', encoding='utf-8') as f:
            words = list(dict.fromkeys(line.strip().lower() for line in f if line.strip()))
        LOG.info(f'Loaded {len(words)} words from blocklist')
        return words

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return data

        text_lower = text.lower()
        blocklist_count = sum(1 for _ in self._automaton.iter(text_lower))

        if blocklist_count <= self.threshold:
            return data
        else:
            return []
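
The threshold semantics can be tried without installing pyahocorasick. The sketch below swaps the automaton for a plain str.find loop, which should give the same count (every occurrence of every distinct word, overlaps included) but is much slower on large blocklists. Both function names are hypothetical.

```python
def count_blocked_occurrences(text, words):
    """Count every occurrence (overlaps included) of each distinct blocked word."""
    text = text.lower()
    total = 0
    for word in set(words):
        if not word:
            continue  # empty entries are skipped, as the constructor strips them
        start = text.find(word)
        while start != -1:
            total += 1
            start = text.find(word, start + 1)
    return total


def keep(text, words, threshold):
    """Mirror of the filter rule: keep when the count does not exceed threshold."""
    return count_blocked_occurrences(text, words) <= threshold


blocklist = ['敏感', '违禁']
print(keep('这是正常的文本内容。', blocklist, threshold=0))
print(keep('这里包含敏感词。', blocklist, threshold=0))
```

With threshold=0 a single hit is enough to drop the record; with the default threshold=1 one hit is still tolerated.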

`bullet_point_filter(data, input_key='content', max_ratio=0.9)`

Filter text with too many bullet-point lines (e.g. TOC, pure lists).

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
max_ratio (float, default: 0.9 ) –

max ratio of bullet lines, default 0.9

Examples:

from lazyllm.tools.data import filter

func = filter.bullet_point_filter(input_key='content', max_ratio=0.5)
inputs = [{'content': 'Normal paragraph text'}, {'content': '- Item 1\n- Item 2\n- Item 3'}]
res = func(inputs)
print(res)
# [{'content': 'Normal paragraph text'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def bullet_point_filter(data, input_key='content', max_ratio=0.9):
    """Filter text with too many bullet-point lines (e.g. TOC, pure lists).

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    max_ratio (float): max ratio of bullet lines, default 0.9


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.bullet_point_filter(input_key='content', max_ratio=0.5)
    inputs = [{'content': 'Normal paragraph text'}, {'content': '- Item 1\\n- Item 2\\n- Item 3'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal paragraph text'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    num_lines = len(lines)
    if num_lines == 0:
        return []
    num_bullet_lines = sum(
        1 for line in lines
        if any(line.startswith(bullet) for bullet in BULLET_CHARS)
    )
    ratio = num_bullet_lines / num_lines
    if ratio <= max_ratio:
        return data
    else:
        return []
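
A small sketch of the ratio computed above. BULLET_CHARS is not shown in this section, so the tuple below is an assumed stand-in, and bullet_line_ratio is a hypothetical helper.

```python
# Assumed stand-in; the real BULLET_CHARS constant may contain other markers.
BULLET_CHARS = ('-', '*', '•')


def bullet_line_ratio(text):
    """Share of non-empty lines that start with a bullet marker."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    if not lines:
        return None
    bullets = sum(1 for line in lines if line.startswith(BULLET_CHARS))
    return bullets / len(lines)


print(bullet_line_ratio('Normal paragraph text'))
print(bullet_line_ratio('- Item 1\n- Item 2\n- Item 3'))
print(bullet_line_ratio('Intro\n- a'))
```

The comparison is inclusive (ratio <= max_ratio), so a text whose ratio equals max_ratio exactly is kept.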

`char_count_filter(data, input_key='content', min_chars=100, max_chars=100000)`

Filter by character count, ignoring spaces, tabs and newlines. Keeps text whose count is in [min_chars, max_chars].

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
min_chars (int, default: 100 ) –

min chars, default 100
max_chars (int, default: 100000 ) –

max chars, default 100000

Examples:

from lazyllm.tools.data import filter

func = filter.char_count_filter(input_key='content', min_chars=10, max_chars=100)
inputs = [{'content': '短'}, {'content': '这是一段中等长度的文本内容。'}]
res = func(inputs)
print(res)
# [{'content': '这是一段中等长度的文本内容。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def char_count_filter(data, input_key='content', min_chars=100, max_chars=100000):
    """Filter by character count, ignoring spaces, tabs and newlines. Keeps text whose count is in [min_chars, max_chars].

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    min_chars (int): min chars, default 100
    max_chars (int): max chars, default 100000


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.char_count_filter(input_key='content', min_chars=10, max_chars=100)
    inputs = [{'content': '短'}, {'content': '这是一段中等长度的文本内容。'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段中等长度的文本内容。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    text_no_space = text.strip().replace(' ', '').replace('\n', '').replace('\t', '')
    num_chars = len(text_no_space)
    if min_chars <= num_chars <= max_chars:
        return data
    else:
        return []
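
The counting step can be reproduced in isolation to check which characters contribute to the total. count_chars is a hypothetical helper that copies the expression used above.

```python
def count_chars(text):
    """Length after removing ASCII spaces, newlines and tabs (other whitespace stays)."""
    return len(text.strip().replace(' ', '').replace('\n', '').replace('\t', ''))


print(count_chars('短'))
print(count_chars('这是一段中等长度的文本内容。'))
print(count_chars(' a b\tc\nd '))
print(count_chars('a\u3000b'))  # an interior full-width space is still counted
```

Only the three listed characters are removed, so carriage returns or full-width spaces inside the text still count toward the total.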

`colon_end_filter(data, input_key='content')`

Filter text ending with colon.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'

Examples:

from lazyllm.tools.data import filter

func = filter.colon_end_filter(input_key='content')
inputs = [{'content': '这是正常结尾。'}, {'content': '这是冒号结尾：'}]
res = func(inputs)
print(res)
# [{'content': '这是正常结尾。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def colon_end_filter(data, input_key='content'):
    """Filter text ending with colon.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.colon_end_filter(input_key='content')
    inputs = [{'content': '这是正常结尾。'}, {'content': '这是冒号结尾：'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是正常结尾。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return data
    if text.rstrip().endswith(':') or text.rstrip().endswith('：'):
        return []
    else:
        return data

`curly_bracket_filter(data, input_key='content', max_ratio=0.08)`

Filter text with too high ratio of curly brackets {}.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
max_ratio (float, default: 0.08 ) –

ratio of curly brackets to total characters; text is kept only when its ratio is strictly below this value, default 0.08

Examples:

from lazyllm.tools.data import filter

func = filter.curly_bracket_filter(input_key='content', max_ratio=0.08)
inputs = [{'content': 'Normal text'}, {'content': '{{{{{' * 10}]
res = func(inputs)
print(res)
# [{'content': 'Normal text'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def curly_bracket_filter(data, input_key='content', max_ratio=0.08):
    """Filter text with too high ratio of curly brackets {}.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    max_ratio (float): ratio of curly brackets to total characters; text is kept only when its ratio is strictly below this value, default 0.08


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.curly_bracket_filter(input_key='content', max_ratio=0.08)
    inputs = [{'content': 'Normal text'}, {'content': '{{{{{' * 10}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal text'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    num_brackets = text.count('{') + text.count('}')
    ratio = num_brackets / len(text) if len(text) > 0 else 0
    if ratio < max_ratio:
        return data
    else:
        return []

`ellipsis_end_filter(data, input_key='content', max_ratio=0.3)`

Filter text with too many lines ending in an ellipsis (..., … or ……).

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
max_ratio (float, default: 0.3 ) –

ratio of lines ending with an ellipsis; text is kept only when its ratio is strictly below this value, default 0.3

Examples:

from lazyllm.tools.data import filter

func = filter.ellipsis_end_filter(input_key='content', max_ratio=0.3)
inputs = [{'content': '第一行。\n第二行。\n第三行。'}, {'content': '第一行...\n第二行...'}]
res = func(inputs)
print(res)
# [{'content': '第一行。\n第二行。\n第三行。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def ellipsis_end_filter(data, input_key='content', max_ratio=0.3):
    """Filter text with too many lines ending in an ellipsis (..., … or ……).

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    max_ratio (float): ratio of lines ending with an ellipsis; text is kept only when its ratio is strictly below this value, default 0.3


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.ellipsis_end_filter(input_key='content', max_ratio=0.3)
    inputs = [{'content': '第一行。\\n第二行。\\n第三行。'}, {'content': '第一行...\\n第二行...'}]
    res = func(inputs)
    print(res)
    # [{'content': '第一行。\\n第二行。\\n第三行。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return data
    ellipsis = ['...', '…', '……']
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    num_lines = len(lines)
    if num_lines == 0:
        return data
    num_occurrences = sum(
        1 for line in lines
        if any(line.endswith(e) for e in ellipsis)
    )
    ratio = num_occurrences / num_lines
    if ratio < max_ratio:
        return data
    else:
        return []
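
A short sketch of the line ratio used above, useful for seeing where the 0.3 boundary falls. ellipsis_line_ratio is a hypothetical helper, and returning 0.0 for empty input is a simplification.

```python
ELLIPSIS_SUFFIXES = ('...', '…', '……')


def ellipsis_line_ratio(text):
    """Share of non-empty lines that end with an ellipsis."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    if not lines:
        return 0.0
    return sum(line.endswith(ELLIPSIS_SUFFIXES) for line in lines) / len(lines)


print(ellipsis_line_ratio('第一行。\n第二行。\n第三行。'))
print(ellipsis_line_ratio('第一行...\n第二行...'))
print(ellipsis_line_ratio('a...\nb\nc\nd'))
```

Since the check is strict (ratio < max_ratio), one ellipsis line out of four (0.25) passes the default, while one out of three does not.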

`idcard_filter(data, input_key='content', threshold=3)`

Filter text containing too many ID card / identity document related terms.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
threshold (int, default: 3 ) –

number of matched terms at which the text is filtered (dropped when matches >= threshold), default 3

Examples:

from lazyllm.tools.data import filter

func = filter.idcard_filter(input_key='content', threshold=1)
inputs = [{'content': '这是正常文本'}, {'content': '请提供身份证号码和ID number'}]
res = func(inputs)
print(res)
# [{'content': '这是正常文本'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def idcard_filter(data, input_key='content', threshold=3):
    """Filter text containing too many ID card / identity document related terms.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    threshold (int): number of matched terms at which the text is filtered (dropped when matches >= threshold), default 3


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.idcard_filter(input_key='content', threshold=1)
    inputs = [{'content': '这是正常文本'}, {'content': '请提供身份证号码和ID number'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是正常文本'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    all_patterns = ID_CARD_CHINESE_TERMS + ID_CARD_ENGLISH_TERMS
    pattern = re.compile('|'.join(f'({p})' for p in all_patterns), re.I)
    matches = pattern.findall(text)
    has_too_many_id_terms = len(matches) >= threshold
    if not has_too_many_id_terms:
        return data
    else:
        return []
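
The matching step can be sketched as follows. The term list is a hypothetical stand-in for ID_CARD_CHINESE_TERMS and ID_CARD_ENGLISH_TERMS, which are not shown in this section, and both helper names are made up for illustration.

```python
import re

# Hypothetical terms; the real constants may differ.
ID_TERMS = ['身份证', r'id\s*number', r'identity\s*card']


def count_id_terms(text):
    """Number of non-overlapping, case-insensitive matches of any term."""
    pattern = re.compile('|'.join(f'({p})' for p in ID_TERMS), re.I)
    return len(pattern.findall(text))


def keep(text, threshold=3):
    """Mirror of the filter rule: drop once the match count reaches threshold."""
    return count_id_terms(text) < threshold


sample = '请提供身份证号码和ID number'
print(count_id_terms(sample))
print(keep(sample, threshold=1))
print(keep(sample, threshold=3))
```

The comparison is matches >= threshold, so a text with exactly threshold matches is already dropped.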

`javascript_filter(data, input_key='content', min_non_script_lines=3)`

Filter text containing many JavaScript patterns (code, script fragments). Short text (<=3 lines) is passed through to avoid false positives on normal short sentences.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
min_non_script_lines (int, default: 3 ) –

minimum number of lines without JavaScript patterns required to keep the text, default 3

Examples:

from lazyllm.tools.data import filter

func = filter.javascript_filter(input_key='content', min_non_script_lines=2)
inputs = [{'content': 'Short normal text'}, {'content': 'function() { return 1; }\nconst x = 1;\nvar y = 2;\nlet z = 3;'}]
res = func(inputs)
print(res)
# [{'content': 'Short normal text'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def javascript_filter(data, input_key='content', min_non_script_lines=3):
    """Filter text containing many JavaScript patterns (code, script fragments). Short text (<=3 lines) is passed through to avoid false positives on normal short sentences.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    min_non_script_lines (int): minimum number of lines without JavaScript patterns required to keep the text, default 3


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.javascript_filter(input_key='content', min_non_script_lines=2)
    inputs = [{'content': 'Short normal text'}, {'content': 'function() { return 1; }\\nconst x = 1;\\nvar y = 2;\\nlet z = 3;'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Short normal text'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    num_lines = len(lines)
    if num_lines == 0:
        return []
    if num_lines <= 3:
        return data
    num_script_lines = sum(
        1 for line in lines
        if any(pattern in line.lower() for pattern in JAVASCRIPT_PATTERNS)
    )
    num_non_script_lines = num_lines - num_script_lines
    if num_non_script_lines >= min_non_script_lines:
        return data
    else:
        return []
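
The line classification can be sketched as below. JAVASCRIPT_PATTERNS is not shown in this section, so the tuple here is an assumed stand-in and line_counts is a hypothetical helper.

```python
# Assumed stand-in; the real JAVASCRIPT_PATTERNS constant may differ.
JAVASCRIPT_PATTERNS = ('function', 'var ', 'const ', 'let ')


def line_counts(text):
    """Return (total non-empty lines, lines without any JavaScript pattern)."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    script = sum(
        1 for line in lines
        if any(pattern in line.lower() for pattern in JAVASCRIPT_PATTERNS)
    )
    return len(lines), len(lines) - script


code = 'function() { return 1; }\nconst x = 1;\nvar y = 2;\nlet z = 3;'
print(line_counts(code))
print(line_counts('a\nb\nc\nd'))
```

With these assumed patterns the four-line snippet has no non-script lines, so it falls below any positive min_non_script_lines and is dropped; texts of three lines or fewer bypass the check entirely.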

`lorem_ipsum_filter(data, input_key='content', max_ratio=3e-08)`

Filter Lorem ipsum, placeholder text, etc.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
max_ratio (float, default: 3e-08 ) –

max ratio of placeholder pattern matches to text length in characters, default 3e-8

Examples:

from lazyllm.tools.data import filter

func = filter.lorem_ipsum_filter(input_key='content')
inputs = [{'content': 'This is real content'}, {'content': 'Lorem ipsum dolor sit amet'}]
res = func(inputs)
print(res)
# [{'content': 'This is real content'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def lorem_ipsum_filter(data, input_key='content', max_ratio=3e-8):
    """Filter Lorem ipsum, placeholder text, etc.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    max_ratio (float): max ratio of placeholder pattern matches to text length in characters, default 3e-8


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.lorem_ipsum_filter(input_key='content')
    inputs = [{'content': 'This is real content'}, {'content': 'Lorem ipsum dolor sit amet'}]
    res = func(inputs)
    print(res)
    # [{'content': 'This is real content'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    pattern_str = '|'.join(f'({p})' for p in LOREM_PATTERNS)
    pattern = re.compile(pattern_str, re.IGNORECASE)
    matches = pattern.findall(text)
    num_occurrences = len(matches)
    ratio = num_occurrences / len(text) if len(text) > 0 else 0
    if ratio <= max_ratio:
        return data
    else:
        return []
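
A quick calculation shows what the default max_ratio means in practice. LOREM_PATTERNS is not shown in this section, so the single pattern below is an assumption, and placeholder_ratio is a hypothetical helper.

```python
import re

# Assumed stand-in; the real LOREM_PATTERNS constant may contain more patterns.
LOREM_PATTERNS = [r'lorem ipsum']


def placeholder_ratio(text):
    """Pattern matches divided by text length in characters."""
    pattern = re.compile('|'.join(f'({p})' for p in LOREM_PATTERNS), re.IGNORECASE)
    return len(pattern.findall(text)) / len(text) if text else 0


ratio = placeholder_ratio('Lorem ipsum dolor sit amet')
print(ratio, ratio <= 3e-8)

# One match only stays under the default once the text is tens of millions of characters long.
print(1 / 33_333_334 <= 3e-8)
```

In other words, with the default value a single placeholder match is enough to drop any text of realistic length.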

`no_punc_filter(data, input_key='content', max_length_between_punct=112, language='zh')`

Filter text with too long segments between punctuation marks.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
max_length_between_punct (int, default: 112 ) –

max length of a segment between punctuation marks (characters for 'zh', words for 'en'), default 112
language (str, default: 'zh' ) –

language, 'zh' or 'en', default 'zh'

Examples:

from lazyllm.tools.data import filter

func = filter.no_punc_filter(input_key='content', max_length_between_punct=20, language='zh')
inputs = [{'content': '这是。正常。文本。'}, {'content': '这是一段没有标点符号的超长文本' * 10}]
res = func(inputs)
print(res)
# [{'content': '这是。正常。文本。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def no_punc_filter(data, input_key='content', max_length_between_punct=112, language='zh'):
    """Filter text with too long segments between punctuation marks.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    max_length_between_punct (int): max length of a segment between punctuation marks (characters for 'zh', words for 'en'), default 112
    language (str): language, 'zh' or 'en', default 'zh'


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.no_punc_filter(input_key='content', max_length_between_punct=20, language='zh')
    inputs = [{'content': '这是。正常。文本。'}, {'content': '这是一段没有标点符号的超长文本' * 10}]
    res = func(inputs)
    print(res)
    # [{'content': '这是。正常。文本。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    language = language.lower()
    if language in ['zh', 'cn', 'chinese']:
        punct_pattern = r'[。！？；，、：""''（）《》【】…—.!?,;:]'
    elif language in ['en', 'english']:
        punct_pattern = r'[–.!?,;•/|…:\'\"]'
    else:
        LOG.warning(f'Unsupported language: {language}, using Chinese punctuation')
        punct_pattern = r'[。！？；，、：""''（）《》【】…—.!?,;:]'
    paragraphs = text.split('\n')
    max_length = 0
    for paragraph in paragraphs:
        if len(paragraph.strip()) == 0:
            continue
        segments = re.split(punct_pattern, paragraph)
        for segment in segments:
            segment = segment.strip()
            if not segment:
                continue
            if language in ['en', 'english']:
                length = len(segment.split())
            else:
                length = len(segment)
            if length > max_length:
                max_length = length
    if max_length <= max_length_between_punct:
        return data
    else:
        return []
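
The segment measurement can be sketched as below. PUNCT is a simplified subset of the punctuation classes above (quotes and brackets are left out), and longest_segment is a hypothetical helper.

```python
import re

# Simplified subset of the punctuation used by the filter.
PUNCT = r'[。！？；，、：…—.!?,;:]'


def longest_segment(text, count_words=False):
    """Longest run between punctuation marks, in characters or in words."""
    longest = 0
    for paragraph in text.split('\n'):
        for segment in re.split(PUNCT, paragraph):
            segment = segment.strip()
            if not segment:
                continue
            length = len(segment.split()) if count_words else len(segment)
            longest = max(longest, length)
    return longest


print(longest_segment('这是。正常。文本。'))
print(longest_segment('这是一段没有标点符号的超长文本' * 10))
print(longest_segment('one two three, four', count_words=True))
```

Newlines also end a segment, because each paragraph is split separately before the punctuation split.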

`null_content_filter(data, input_key='content')`

Filter null or whitespace-only content.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'

Examples:

from lazyllm.tools.data import filter

func = filter.null_content_filter(input_key='content')
inputs = [{'content': 'Valid content'}, {'content': ''}, {'content': '   '}]
res = func(inputs)
print(res)
# [{'content': 'Valid content'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def null_content_filter(data, input_key='content'):
    """Filter null or whitespace-only content.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.null_content_filter(input_key='content')
    inputs = [{'content': 'Valid content'}, {'content': ''}, {'content': '   '}]
    res = func(inputs)
    print(res)
    # [{'content': 'Valid content'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if text is not None and isinstance(text, str) and text.strip() != '':
        return data
    else:
        return []

`sentence_count_filter(data, input_key='content', min_sentences=3, max_sentences=1000, language='zh')`

Filter by sentence count. Keeps text with sentences in [min_sentences, max_sentences].

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
min_sentences (int, default: 3 ) –

min sentence count, default 3
max_sentences (int, default: 1000 ) –

max sentence count, default 1000
language (str, default: 'zh' ) –

language, 'zh' or 'en', default 'zh'

Examples:

from lazyllm.tools.data import filter

func = filter.sentence_count_filter(input_key='content', min_sentences=2, max_sentences=10, language='zh')
inputs = [{'content': '单句。'}, {'content': '第一句。第二句。'}]
res = func(inputs)
print(res)
# [{'content': '第一句。第二句。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def sentence_count_filter(data, input_key='content', min_sentences=3, max_sentences=1000, language='zh'):
    """Filter by sentence count. Keeps text with sentences in [min_sentences, max_sentences].

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    min_sentences (int): min sentence count, default 3
    max_sentences (int): max sentence count, default 1000
    language (str): language, 'zh' or 'en', default 'zh'


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.sentence_count_filter(input_key='content', min_sentences=2, max_sentences=10, language='zh')
    inputs = [{'content': '单句。'}, {'content': '第一句。第二句。'}]
    res = func(inputs)
    print(res)
    # [{'content': '第一句。第二句。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    language = language.lower()
    if language in ['zh', 'cn', 'chinese']:
        sentences = re.split(r'[。！？]+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        num_sentences = len(sentences)
    elif language in ['en', 'english']:
        sentences = nltk.sent_tokenize(text)
        num_sentences = len(sentences)
    else:
        LOG.warning(f'Unsupported language: {language}, using Chinese punctuation')
        sentences = re.split(r'[。！？]+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        num_sentences = len(sentences)
    if min_sentences <= num_sentences <= max_sentences:
        return data
    else:
        return []
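
The Chinese branch can be tried on its own to see how sentences are counted. count_sentences_zh is a hypothetical helper that copies the split used above.

```python
import re


def count_sentences_zh(text):
    """Count non-empty pieces after splitting on runs of 。！？."""
    return len([s for s in re.split(r'[。！？]+', text) if s.strip()])


print(count_sentences_zh('单句。'))
print(count_sentences_zh('第一句。第二句。'))
print(count_sentences_zh('真的吗？！是的。'))
print(count_sentences_zh('Hello. World.'))
```

Only full-width terminators split sentences in this branch, so text punctuated with ASCII periods counts as a single sentence unless language='en' is passed.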

`special_char_filter(data, input_key='content')`

Filter text containing special invisible characters (zero-width, replacement char, etc.).

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'

Examples:

from lazyllm.tools.data import filter

func = filter.special_char_filter(input_key='content')
inputs = [{'content': 'Normal text 正常文本'}, {'content': 'Text with \u200b zero width'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text 正常文本'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def special_char_filter(data, input_key='content'):
    """Filter text containing special invisible characters (zero-width, replacement char, etc.).

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.special_char_filter(input_key='content')
    inputs = [{'content': 'Normal text 正常文本'}, {'content': 'Text with \\u200b zero width'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal text 正常文本'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    has_special_char = any(re.search(pattern, text) for pattern in SPECIAL_CHAR_PATTERNS)
    if not has_special_char:
        return data
    else:
        return []

`unique_word_filter(data, input_key='content', min_ratio=0.1, use_tokenizer=True, language='zh')`

Filter text with too low unique word ratio (excessive repetition).

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
min_ratio (float, default: 0.1 ) –

unique word ratio; text is kept only when its ratio is strictly above this value, default 0.1
use_tokenizer (bool, default: True ) –

use tokenizer, default True
language (str, default: 'zh' ) –

language, 'zh' or 'en', default 'zh'

Examples:

from lazyllm.tools.data import filter

func = filter.unique_word_filter(input_key='content', min_ratio=0.4, language='zh')
inputs = [{'content': '这是一段包含多个不同词汇的文本。'}, {'content': '重复重复重复'}]
res = func(inputs)
print(res)
# [{'content': '这是一段包含多个不同词汇的文本。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='thread')
def unique_word_filter(data, input_key='content', min_ratio=0.1, use_tokenizer=True, language='zh'):
    """Filter text with too low unique word ratio (excessive repetition).

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    min_ratio (float): unique word ratio; text is kept only when its ratio is strictly above this value, default 0.1
    use_tokenizer (bool): use tokenizer, default True
    language (str): language, 'zh' or 'en', default 'zh'


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.unique_word_filter(input_key='content', min_ratio=0.4, language='zh')
    inputs = [{'content': '这是一段包含多个不同词汇的文本。'}, {'content': '重复重复重复'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段包含多个不同词汇的文本。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    language = language.lower()
    if language in ['zh', 'cn', 'chinese']:
        if use_tokenizer:
            words = list(jieba.cut(text.lower()))
        else:
            words = list(text)
    elif language in ['en', 'english']:
        if use_tokenizer:
            nltk_data_dir = _setup_nltk_data_dir()
            try:
                nltk.data.find('tokenizers/punkt_tab')
            except LookupError:
                LOG.info('Downloading NLTK punkt_tab tokenizer...')
                nltk.download('punkt_tab', quiet=True, download_dir=nltk_data_dir)
            words = nltk.word_tokenize(text.lower())
        else:
            words = text.lower().split()
    else:
        LOG.warning(f'Unsupported language: {language}, using simple split')
        words = text.lower().split()
    num_words = len(words)
    if num_words == 0:
        return []
    num_unique_words = len(set(words))
    ratio = num_unique_words / num_words
    if ratio > min_ratio:
        return data
    else:
        return []
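
The ratio check at the end of the listing can be tried in isolation. The sketch below is a simplified stand-in, not the library code: it assumes plain whitespace tokenization (roughly the `use_tokenizer=False` English path), and `unique_word_ratio` and `keep` are hypothetical helper names. Note that the comparison is strict, so a text whose ratio equals `min_ratio` exactly is dropped.

```python
def unique_word_ratio(text):
    """Unique-word ratio using whitespace splitting (an assumption of this sketch)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def keep(text, min_ratio=0.1):
    # Same strict comparison as the listing: the ratio must exceed min_ratio.
    return unique_word_ratio(text) > min_ratio


print(unique_word_ratio('the quick brown fox'))     # 1.0
print(unique_word_ratio('spam spam spam spam'))     # 0.25
print(keep('spam spam spam spam', min_ratio=0.25))  # False
```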

`watermark_filter(data, input_key='content', watermarks=None)`

Filter text containing copyright/watermark related terms.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
watermarks (list | None, default: None ) –

custom watermark terms, default uses built-in list

Examples:

from lazyllm.tools.data import filter

func = filter.watermark_filter(input_key='content')
inputs = [{'content': 'Normal content'}, {'content': 'This document contains Copyright notice'}]
res = func(inputs)
print(res)
# [{'content': 'Normal content'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def watermark_filter(data, input_key='content', watermarks=None):
    """Filter text containing copyright/watermark related terms.

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    watermarks (list|None): custom watermark terms, default uses built-in list


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.watermark_filter(input_key='content')
    inputs = [{'content': 'Normal content'}, {'content': 'This document contains Copyright notice'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal content'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    watermarks = watermarks or DEFAULT_WATERMARKS
    matches = re.search('|'.join(watermarks), text)
    if matches is None:
        return data
    else:
        return []
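
Because the listing joins the terms with `|` and passes the result to `re.search` without flags, each custom term is interpreted as a regular expression rather than a literal string. The sketch below illustrates the difference with a hypothetical term list. Escaping terms with `re.escape` before passing them in is a suggestion for callers, not something the operator does.

```python
import re

terms = ['(c)', 'All rights reserved']  # hypothetical custom watermark terms

raw_pattern = '|'.join(terms)
escaped_pattern = '|'.join(re.escape(t) for t in terms)

# Unescaped, '(c)' is a capture group that matches a bare 'c'.
print(re.search(raw_pattern, 'abc') is not None)                # True
# Escaped, only the literal three characters '(c)' match.
print(re.search(escaped_pattern, 'abc') is not None)            # False
print(re.search(escaped_pattern, 'Logo (c) 2024') is not None)  # True
```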

`word_count_filter(data, input_key='content', min_words=10, max_words=10000, language='zh')`

Filter by word/char count: Chinese by char count, English by word count. Keeps text in [min_words, max_words).

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
min_words (int, default: 10 ) –

min count, default 10
max_words (int, default: 10000 ) –

max count, default 10000
language (str, default: 'zh' ) –

language, 'zh' or 'en', default 'zh'

Examples:

from lazyllm.tools.data import filter

func = filter.word_count_filter(input_key='content', min_words=5, max_words=20, language='zh')
inputs = [{'content': '短文本'}, {'content': '这是一段适中长度的中文文本内容。'}]
res = func(inputs)
print(res)
# [{'content': '这是一段适中长度的中文文本内容。'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def word_count_filter(data, input_key='content', min_words=10, max_words=10000, language='zh'):
    """Filter by word/char count: Chinese by char count, English by word count. Keeps text in [min_words, max_words).

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    min_words (int): min count, default 10
    max_words (int): max count, default 10000
    language (str): language, 'zh' or 'en', default 'zh'


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.word_count_filter(input_key='content', min_words=5, max_words=20, language='zh')
    inputs = [{'content': '短文本'}, {'content': '这是一段适中长度的中文文本内容。'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段适中长度的中文文本内容。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    language = language.lower()
    if language in ['zh', 'cn', 'chinese']:
        count = len(text.replace(' ', '').replace('\n', '').replace('\t', ''))
    elif language in ['en', 'english']:
        count = len(text.split())
    else:
        LOG.warning(f'Unsupported language: {language}, using character count')
        count = len(text.replace(' ', '').replace('\n', '').replace('\t', ''))
    if min_words <= count < max_words:
        return data
    else:
        return []
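
The counting rule and the half-open range can be checked without the library. This is a minimal sketch with hypothetical helper names (`count_units`, `in_range`). It simplifies language handling to `'en'` versus everything else, whereas the listing also accepts aliases such as `'chinese'` and `'english'`.

```python
def count_units(text, language='zh'):
    """Count characters for Chinese, whitespace-separated words for English."""
    if language == 'en':
        return len(text.split())
    # Chinese and the fallback: drop spaces, newlines and tabs, then count characters.
    return len(text.replace(' ', '').replace('\n', '').replace('\t', ''))


def in_range(count, min_words=10, max_words=10000):
    # Half-open interval: min_words is included, max_words is excluded.
    return min_words <= count < max_words


print(count_units('短文本'))                     # 3
print(count_units('hello big world', 'en'))      # 3
print(in_range(5, min_words=5, max_words=20))    # True
print(in_range(20, min_words=5, max_words=20))   # False
```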

`word_length_filter(data, input_key='content', min_length=3, max_length=20)`

Filter by average word length. Keeps text with mean word length in [min_length, max_length).

Parameters:

data (dict) –

single data dict
input_key (str, default: 'content' ) –

key of the text field, default 'content'
min_length (float, default: 3 ) –

min avg word length, default 3
max_length (float, default: 20 ) –

max avg word length, default 20

Examples:

from lazyllm.tools.data import filter

func = filter.word_length_filter(input_key='content', min_length=3, max_length=10)
inputs = [{'content': 'I am ok'}, {'content': 'This is a normal sentence'}]
res = func(inputs)
print(res)
# [{'content': 'This is a normal sentence'}]

Source code in lazyllm/tools/data/operators/filter_op.py

@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def word_length_filter(data, input_key='content', min_length=3, max_length=20):
    """Filter by average word length. Keeps text with mean word length in [min_length, max_length).

Args:
    data (dict): single data dict
    input_key (str): key of the text field, default 'content'
    min_length (float): min avg word length, default 3
    max_length (float): max avg word length, default 20


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.word_length_filter(input_key='content', min_length=3, max_length=10)
    inputs = [{'content': 'I am ok'}, {'content': 'This is a normal sentence'}]
    res = func(inputs)
    print(res)
    # [{'content': 'This is a normal sentence'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    words = text.split()
    num_words = len(words)
    if num_words == 0:
        return []
    num_chars = sum(len(word) for word in words)
    mean_length = num_chars / num_words
    if min_length <= mean_length < max_length:
        return data
    else:
        return []
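
The mean is computed over whitespace-separated words, which the following sketch reproduces (`mean_word_length` is a hypothetical helper name). One consequence worth knowing: text without spaces, such as a typical Chinese sentence, counts as a single word, so its mean word length equals its character count.

```python
def mean_word_length(text):
    """Average length of whitespace-separated words; 0.0 for empty input."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


print(round(mean_word_length('I am ok'), 2))          # 1.67
print(mean_word_length('This is a normal sentence'))  # 4.2
print(mean_word_length('这是一段中文'))                # 6.0
```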

Token Chunker

`lazyllm.tools.data.operators.token_chunker`

`TokenChunker`

Bases: Chunker

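Split long text into chunks by token count. Text is split by paragraph first, and a paragraph that exceeds max_tokens on its own is split further by sentence. Units are packed so that a chunk stays within max_tokens (a single sentence longer than max_tokens is kept whole); the final chunk is discarded if it falls below min_tokens and at least one other chunk exists.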

Parameters:

input_key (str, default: 'content' ) –

key of the text field, default 'content'
model_path (str | None, default: None ) –

path to tokenizer model, default Qwen2.5-0.5B-Instruct
max_tokens (int, default: 1024 ) –

max tokens per chunk, default 1024
min_tokens (int, default: 200 ) –

min tokens per chunk, smaller chunks may be discarded, default 200
_concurrency_mode (str, default: 'process' ) –

optional concurrency mode
_max_workers (int | None) –

optional max concurrency

Examples:

from lazyllm.tools.data import chunker

func = chunker.TokenChunker(input_key='content', max_tokens=50, min_tokens=10)
inputs = [{'content': '人工智能是计算机科学的一个分支。' * 20, 'meta_data': {'source': 'doc_1'}}]
res = func(inputs)
print(res)
# [{'uid': '...', 'content': '...', 'meta_data': {'source': 'doc_1', 'index': 0, 'total': N, 'length': ...}}, ...]
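
The paragraph-then-sentence splitting can be sketched on its own. The two helpers below are standalone stand-ins with names chosen for this illustration (`split_paragraphs`, `split_sentences`). They use the same regular-expression idea as the operator's private methods: capture the separator so it stays attached to the preceding unit.

```python
import re
from itertools import zip_longest


def split_paragraphs(text):
    # Capture blank-line separators so each unit keeps its trailing newlines.
    parts = re.split(r'(\n{2,})', text)
    units = []
    for i in range(0, len(parts), 2):
        unit = parts[i] + (parts[i + 1] if i + 1 < len(parts) else '')
        if unit:
            units.append(unit)
    return units


def split_sentences(text):
    # Capture the terminator so it stays attached to its sentence.
    parts = re.split(r'([。！？.!?])', text)
    pairs = zip_longest(parts[0::2], parts[1::2], fillvalue='')
    return [s for s in (body + end for body, end in pairs) if s]


print(split_paragraphs('First para.\n\nSecond para.'))
# ['First para.\n\n', 'Second para.']
print(split_sentences('One. Two! 三。'))
# ['One.', ' Two!', ' 三。']
```

Joining the returned units reproduces the input text, since nothing is dropped. Every `.` counts as a terminator, so a decimal such as `3.14` is split into `'3.'` and `'14'`.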

Source code in lazyllm/tools/data/operators/token_chunker.py

class TokenChunker(Chunker):
    """Split long text into chunks by token count. Splits by paragraph first, then by sentence.
Ensures each chunk does not exceed max_tokens; chunks below min_tokens may be discarded.

Args:
    input_key (str): key of the text field, default 'content'
    model_path (str|None): path to tokenizer model, default Qwen2.5-0.5B-Instruct
    max_tokens (int): max tokens per chunk, default 1024
    min_tokens (int): min tokens per chunk, smaller chunks may be discarded, default 200
    _concurrency_mode (str): optional concurrency mode
    _max_workers (int|None): optional max concurrency


Examples:
    ```python
    from lazyllm.tools.data import chunker

    func = chunker.TokenChunker(input_key='content', max_tokens=50, min_tokens=10)
    inputs = [{'content': '人工智能是计算机科学的一个分支。' * 20, 'meta_data': {'source': 'doc_1'}}]
    res = func(inputs)
    print(res)
    # [{'uid': '...', 'content': '...', 'meta_data': {'source': 'doc_1', 'index': 0, 'total': N, 'length': ...}}, ...]
    ```
    """
    def __init__(self, input_key='content', model_path=None,
                 max_tokens=1024, min_tokens=200, _concurrency_mode='process', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.max_tokens = max_tokens
        self.min_tokens = min_tokens
        self.model_path = model_path
        self.tokenizer = self._load_tokenizer()

    def _try_load_tokenizer(self, path, is_local, default_cache_dir):
        if is_local or os.path.isdir(path) or os.path.isfile(path):
            return transformers.AutoTokenizer.from_pretrained(
                path, trust_remote_code=True
            )
        else:
            return transformers.AutoTokenizer.from_pretrained(
                path, cache_dir=default_cache_dir, trust_remote_code=True
            )

    def _try_load_from_config_path(self, default_model_name, default_cache_dir):
        try:
            config_model_path = config['model_path']
            if not config_model_path:
                return None
            if os.path.isdir(config_model_path):
                joined_path = os.path.join(config_model_path, default_model_name)
                if os.path.exists(joined_path):
                    LOG.info(f'Loading tokenizer from config model_path: {joined_path}')
                    try:
                        return self._try_load_tokenizer(joined_path, True, default_cache_dir)
                    except Exception as e:
                        LOG.warning(f'Failed to load from {joined_path}: {e}, trying cache directory')
            elif os.path.exists(config_model_path):
                LOG.info(f'Loading tokenizer from config model_path: {config_model_path}')
                try:
                    return self._try_load_tokenizer(config_model_path, True, default_cache_dir)
                except Exception as e:
                    LOG.warning(f'Failed to load from {config_model_path}: {e}, trying cache directory')
        except (KeyError, TypeError):
            pass
        return None

    def _try_load_from_cache(self, default_model_name, default_cache_dir):
        try:
            cache_model_path = os.path.join(default_cache_dir, default_model_name)
            if os.path.exists(cache_model_path):
                LOG.info(f'Loading tokenizer from cache directory: {cache_model_path}')
                return self._try_load_tokenizer(cache_model_path, True, default_cache_dir)
        except Exception:
            pass
        return None

    def _load_tokenizer(self):
        default_model = 'Qwen/Qwen2.5-0.5B-Instruct'
        default_model_name = 'qwen2.5-0.5b-instruct'
        model_or_path = self.model_path

        try:
            default_cache_dir = config['model_cache_dir']
        except (KeyError, TypeError):
            default_cache_dir = os.path.join(os.path.expanduser('~'), '.lazyllm', 'models')

        if model_or_path:
            try:
                is_local = os.path.isdir(model_or_path) or os.path.isfile(model_or_path)
                if is_local:
                    log_msg = f'Loading tokenizer from local path: {model_or_path}'
                else:
                    log_msg = f'Loading tokenizer from model: {model_or_path}'
                LOG.info(log_msg)
                return self._try_load_tokenizer(model_or_path, is_local, default_cache_dir)
            except Exception as e:
                LOG.warning(f'Failed to load from {model_or_path}: {e}, trying config model_path')

        if model_or_path is None:
            result = self._try_load_from_config_path(default_model_name, default_cache_dir)
            if result:
                return result

        result = self._try_load_from_cache(default_model_name, default_cache_dir)
        if result:
            return result

        LOG.info(f'Loading default tokenizer: {default_model} (will download to cache)')
        try:
            return self._try_load_tokenizer(default_model, False, default_cache_dir)
        except Exception as e:
            LOG.error(f'Failed to load default tokenizer: {e}')
            raise

    def _split_paragraphs(self, text):
        paragraphs = re.split(r'(\n{2,})', text)
        processed_paragraphs = []
        for i in range(0, len(paragraphs), 2):
            unit = paragraphs[i]
            if i + 1 < len(paragraphs):
                unit += paragraphs[i + 1]
            if unit:
                processed_paragraphs.append(unit)
        return processed_paragraphs

    def _split_sentences(self, text):
        sentences = re.split(r'([。！？\.!\?])', text)
        return [s for s in (''.join(filter(None, t)) for t in zip_longest(sentences[0::2], sentences[1::2])) if s]

    def _process_chunks(self, processed_paragraphs):
        chunks = []
        current_chunk_text_parts = []
        current_chunk_tokens = 0

        for p_text in processed_paragraphs:
            p_tokens = self.tokenizer.encode(p_text)

            if current_chunk_tokens + len(p_tokens) <= self.max_tokens:
                current_chunk_text_parts.append(p_text)
                current_chunk_tokens += len(p_tokens)
            else:
                if current_chunk_text_parts:
                    chunks.append(''.join(current_chunk_text_parts))

                if len(p_tokens) > self.max_tokens:
                    sentences = self._split_sentences(p_text)

                    sub_chunk_parts = []
                    sub_chunk_tokens = 0
                    for sent in sentences:
                        sent_tokens_count = len(self.tokenizer.encode(sent))
                        if sub_chunk_tokens + sent_tokens_count <= self.max_tokens:
                            sub_chunk_parts.append(sent)
                            sub_chunk_tokens += sent_tokens_count
                        else:
                            if sub_chunk_parts:
                                chunks.append(''.join(sub_chunk_parts))
                            sub_chunk_parts = [sent]
                            sub_chunk_tokens = sent_tokens_count

                    current_chunk_text_parts = sub_chunk_parts
                    current_chunk_tokens = sub_chunk_tokens
                else:
                    current_chunk_text_parts = [p_text]
                    current_chunk_tokens = len(p_tokens)

        if current_chunk_text_parts:
            final_chunk_text = ''.join(current_chunk_text_parts)
            final_chunk_tokens = len(self.tokenizer.encode(final_chunk_text))
            if len(chunks) > 0 and final_chunk_tokens < self.min_tokens:
                LOG.warning(f'Discarding small chunk (tokens: {final_chunk_tokens}, threshold: {self.min_tokens})')
            else:
                chunks.append(final_chunk_text)

        return chunks

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        text = data.get(self.input_key, '')
        orig_meta = data.get('meta_data', {})

        if not text:
            return []

        paragraphs = self._split_paragraphs(text)
        chunks = self._process_chunks(paragraphs)

        if not chunks and text:
            chunks = [text]

        ts = datetime.now().strftime('%Y%m%d%H%M%S')
        total = len(chunks)

        return [
            {
                'uid': f'{ts}_{uuid.uuid4().hex}',
                'content': chunk,
                'meta_data': {
                    **orig_meta,
                    'index': idx,
                    'total': total,
                    'length': len(chunk),
                },
            }
            for idx, chunk in enumerate(chunks)
        ]
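
The packing logic in `_process_chunks` is easier to follow in reduced form. The sketch below is an approximation under stated assumptions: `pack` is a hypothetical name, token counting is replaced by a whitespace word count instead of a real tokenizer, and the sentence-level fallback for oversized paragraphs is left out. It keeps the two rules that matter most: greedy filling up to `max_tokens`, and the `min_tokens` check that applies only to the trailing chunk.

```python
def pack(units, max_tokens, min_tokens, count=lambda s: len(s.split())):
    """Greedy packing; `count` is a whitespace stand-in for a real tokenizer."""
    chunks, current, used = [], [], 0
    for unit in units:
        n = count(unit)
        if used + n <= max_tokens:
            current.append(unit)
            used += n
        else:
            if current:
                chunks.append(''.join(current))
            current, used = [unit], n
    if current:
        tail = ''.join(current)
        # Only the trailing chunk is checked against min_tokens, and only
        # when at least one earlier chunk exists.
        if not chunks or count(tail) >= min_tokens:
            chunks.append(tail)
    return chunks


units = ['a b c ', 'd e f ', 'g ']
print(pack(units, max_tokens=6, min_tokens=2))   # ['a b c d e f ']
print(pack(units, max_tokens=6, min_tokens=1))   # ['a b c d e f ', 'g ']
print(pack(['g '], max_tokens=6, min_tokens=2))  # ['g ']
```

The last call shows why short inputs still produce output: a lone chunk is kept even when it is below `min_tokens`.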

Code Generation Operators

`lazyllm.tools.data.operators.codegen_ops`

`CodeInstructionGenerator`

Bases: CodeGenOps

Code-gen pipeline operator: CodeInstructionGenerator.

Extracts the user instruction from raw messages and rewrites it into a standardized English instruction plus a Python function skeleton code block.

Typical output structure (default input_key='messages', output_key='generated_instruction'):

messages: original multi-turn messages (unchanged)
generated_instruction (str): standardized English instruction + Python code block

Parameters:

model –

a LazyLLM model object (required), shared via share().
prompt_template (str | None, default: None ) –

optional custom system prompt (overrides default).
input_key (str, default: 'messages' ) –

input conversation field name, default 'messages'.
output_key (str, default: 'generated_instruction' ) –

output standardized instruction field name, default 'generated_instruction'.
**kwargs –

extra args forwarded to the base operator (e.g. _max_workers, _save_data).

Examples:

from lazyllm.tools.data.operators.codegen_ops import CodeInstructionGenerator

op = CodeInstructionGenerator(model=model,
                              input_key='messages',
                              output_key='generated_instruction')
item = {
    'messages': [
        {'role': 'user', 'content': '写一个 Python 函数，打印 hello'}
    ]
}
res = op(item)
print(res)

# Output Example:
# {
#    'messages': [...],
#    'generated_instruction': "Write a Python function that prints 'hello'.\n"
#                             "```python\n"
#                             "def solution():\n"
#                             "    print('hello')\n"
#                             "```"
# }

Source code in lazyllm/tools/data/operators/codegen_ops.py

class CodeInstructionGenerator(CodeGenOps):
    """Code-gen pipeline operator: CodeInstructionGenerator.

Extracts the user instruction from raw messages and rewrites it into a standardized English instruction plus a Python function skeleton code block.

Typical output structure (default input_key='messages', output_key='generated_instruction'):

- messages: original multi-turn messages (unchanged)
- generated_instruction (str): standardized English instruction + Python code block

Args:
    model: a LazyLLM model object (required), shared via share().
    prompt_template (str|None): optional custom system prompt (overrides default).
    input_key (str): input conversation field name, default 'messages'.
    output_key (str): output standardized instruction field name, default 'generated_instruction'.
    **kwargs: extra args forwarded to the base operator (e.g. _max_workers, _save_data).


Examples:

    from lazyllm.tools.data.operators.codegen_ops import CodeInstructionGenerator

    op = CodeInstructionGenerator(model=model,
                                             input_key='messages',
                                             output_key='generated_instruction')
    item = {
        'messages': [
            {'role': 'user', 'content': '写一个 Python 函数，打印 hello'}
        ]
    }
    res = op(item)
    print(res)

    # Output Example:
    # {
    #    'messages': [...],
    #    'generated_instruction': "Write a Python function that prints 'hello'.\\n"
    #                             "```python\\n"
    #                             "def solution():\\n"
    #                             "    print('hello')\\n"
    #                             "```"
    # }
    """
    def __init__(self, model=None, prompt_template=None, input_key='messages', output_key='generated_instruction',
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = prompt_template or (
            'You are a code instruction standardization assistant.\n'
            'Rewrite the given instruction into a consistent format for Python code generation tasks.\n'
            'Output must be English and contain exactly two parts:\n'
            '1) A single concise instruction sentence in English.\n'
            '2) A Python code block in Markdown with a complete function skeleton.\n'
            'Do not add explanations, do not add extra sections.\n'
            'Example output format:\n'
            'Write a Python function that ...\n'
            '```python\n'
            'def solution(...):\n'
            '    \"\"\"...\"\"\"\n'
            '    ...\n'
            '```\n'
        )
        self.model = model.share().prompt(sys_prompt)

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.input_key not in data:
            raise ValueError(f'Missing required key: {self.input_key}')
        if self.output_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_key}')
        raw_instruction = _extract_human_instruction(data.get(self.input_key))
        response = self.model(raw_instruction)
        data[self.output_key] = response.strip() if isinstance(response, str) else response
        return data
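
Two details of `forward` are easy to miss: the input dict is updated in place, and an existing output key raises instead of being overwritten. The sketch below reproduces only the key checks and the assignment. `run_step` and `fake_model` are hypothetical stand-ins, and the message-extraction step (`_extract_human_instruction`) is skipped because its behavior is not shown here.

```python
def run_step(data, model, input_key='messages', output_key='generated_instruction'):
    """Stand-in for forward(): validate keys, call the model, store the result."""
    if input_key not in data:
        raise ValueError(f'Missing required key: {input_key}')
    if output_key in data:
        raise ValueError(f'The following key already exists and would be overwritten: {output_key}')
    response = model(data[input_key])
    data[output_key] = response.strip() if isinstance(response, str) else response
    return data


def fake_model(messages):
    # Hypothetical stub; a real LazyLLM model would be called here.
    return "  Write a Python function that prints 'hello'.  "


item = {'messages': [{'role': 'user', 'content': 'print hello'}]}
out = run_step(item, fake_model)
print(out is item)  # True: the input dict itself is updated

second_error = None
try:
    run_step(item, fake_model)  # second pass over the same dict
except ValueError as err:
    second_error = str(err)
print(second_error)
```

Re-running a step on already-processed items therefore needs either fresh copies of the input or a different `output_key`.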

`LogicIntegrityAuditor`

Bases: CodeGenOps

Code-gen pipeline operator: LogicIntegrityAuditor.

Evaluates a single (instruction, code) sample, producing a quality score (0–10) and textual feedback parsed from a JSON-formatted model response.

Typical output structure (default input_instruction_key='instruction', input_code_key='new_code'):

instruction: standardized instruction
new_code: generated code
quality_score: numeric quality score (int/float depending on JsonFormatter parsing)
feedback: textual review feedback

Parameters:

model –

a LazyLLM model object (required), wrapped with JsonFormatter.
prompt_template (str | None, default: None ) –

optional custom system prompt.
input_instruction_key (str, default: 'instruction' ) –

input instruction field name, default 'instruction'.
input_code_key (str, default: 'new_code' ) –

input code field name, default 'new_code'.
output_score_key (str, default: 'quality_score' ) –

output score field name, default 'quality_score'.
output_feedback_key (str, default: 'feedback' ) –

output feedback field name, default 'feedback'.
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.codegen_ops import LogicIntegrityAuditor

op = LogicIntegrityAuditor(model=model)
item = {
    'instruction': "Write a Python function that prints 'hello'.",
    'new_code': "def solution():\n    print('hello')"
}
res = op(item)
print(res)
# {
#   'instruction': "Write a Python function that prints 'hello'.",
#   'new_code': "def solution():\n    print('hello')",
#   'quality_score': 8,
#   'feedback': 'Good code. The logic is clear and follows PEP8.'
# }
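
The score and feedback come from a JSON reply, with fixed fallbacks when parsing fails. The sketch below mirrors that fallback logic in plain Python. It assumes `json.loads` as a stand-in for `JsonFormatter`, whose exact parsing rules may differ, and `parse_review` is a hypothetical helper name.

```python
import json


def parse_review(raw):
    """Parse a reviewer reply; fall back to score 0 when it is not a JSON object."""
    try:
        res = json.loads(raw)
    except (TypeError, ValueError):
        res = None
    if isinstance(res, dict):
        return res.get('score', 0), res.get('feedback', 'No feedback provided.')
    return 0, 'Failed to parse LLM evaluation output.'


print(parse_review('{"score": 8, "feedback": "Clear and correct."}'))
# (8, 'Clear and correct.')
print(parse_review('The code looks fine to me.'))
# (0, 'Failed to parse LLM evaluation output.')
print(parse_review('{"feedback": "No score given."}'))
# (0, 'No score given.')
```

A score of 0 can mean a genuinely poor sample, a missing `score` field, or an unparsable reply. Only the feedback text distinguishes these cases.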

Source code in lazyllm/tools/data/operators/codegen_ops.py

class LogicIntegrityAuditor(CodeGenOps):
    """Code-gen pipeline operator: LogicIntegrityAuditor.

Evaluates a single (generated_instruction, generated_code) sample, producing a quality score (0–10) and textual feedback, parsed from a JSON-formatted model response.

Typical output structure (default input_instruction_key='instruction', input_code_key='new_code'):

- instruction: standardized instruction
- new_code: generated code
- quality_score: numeric quality score (int/float depending on JsonFormatter parsing)
- feedback: textual review feedback

Args:
    model: a LazyLLM model object (required), wrapped with JsonFormatter.
    prompt_template (str|None): optional custom system prompt.
    input_instruction_key (str): input instruction field name, default 'instruction'.
    input_code_key (str): input code field name, default 'new_code'.
    output_score_key (str): output score field name, default 'quality_score'.
    output_feedback_key (str): output feedback field name, default 'feedback'.
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.codegen_ops import LogicIntegrityAuditor

    op = LogicIntegrityAuditor(model=model)
    item = {
        'instruction': "Write a Python function that prints 'hello'.",
        'new_code': "def solution():\n    print('hello')"
    }
    res = op(item)
    print(res)
    # {
    #   'instruction': "Write a Python function that prints 'hello'.",
    #   'new_code': "def solution():\n    print('hello')",
    #   'quality_score': 8,
    #   'feedback': 'Good code. The logic is clear and follows PEP8.'
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        prompt_template=None,
        input_instruction_key='instruction',
        input_code_key='new_code',
        output_score_key='quality_score',
        output_feedback_key='feedback',
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_instruction_key = input_instruction_key
        self.input_code_key = input_code_key
        self.output_score_key = output_score_key
        self.output_feedback_key = output_feedback_key
        sys_prompt = prompt_template or (
            'You are an automated code reviewer.\n'
            'Evaluate the generated Python code against the given instruction.\n'
            'Please provide a score (0-10) and feedback.\n'
            'Output must be in JSON format:\n'
            '{\n'
            '  "score": <0-10>,\n'
            '  "feedback": "..."\n'
            '}'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.input_instruction_key not in data:
            raise ValueError(f'Missing required key: {self.input_instruction_key}')
        if self.input_code_key not in data:
            raise ValueError(f'Missing required key: {self.input_code_key}')
        if self.output_score_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_score_key}')
        if self.output_feedback_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_feedback_key}')
        instruction = data.get(self.input_instruction_key, '')
        code = data.get(self.input_code_key, '')
        user_input = f'Instruction:\n{instruction}\n\nCode:\n```python\n{code}\n```'
        res = self.model(user_input)

        if isinstance(res, dict):
            score = res.get('score', 0)
            feedback = res.get('feedback', 'No feedback provided.')
        else:
            from lazyllm import LOG
            LOG.warning(f'Failed to extract JSON from response: {res}')
            score, feedback = 0, 'Failed to parse LLM evaluation output.'

        data[self.output_score_key] = score
        data[self.output_feedback_key] = feedback
        return data

`ScriptSynthesizer`

Bases: CodeGenOps

Code-gen pipeline operator: ScriptSynthesizer.

Given a natural language code instruction (often from the previous generated_instruction or a cleaned instruction field), generates the corresponding Python source code, stripping Markdown code fences when present.

Typical output structure (default input_key='instruction', output_key='new_code'):

instruction: natural language code instruction
new_code (str): generated Python code string

Parameters:

model –

a LazyLLM model object (required).
prompt_template (str | None, default: None ) –

optional custom system prompt.
input_key (str, default: 'instruction' ) –

input instruction field name, default 'instruction'.
output_key (str, default: 'new_code' ) –

output code field name, default 'new_code'.
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.codegen_ops import ScriptSynthesizer

op = ScriptSynthesizer(model=model,
                       input_key='instruction',
                       output_key='new_code')
item = {
    'instruction': 'Write a Python function that prints "hello".'
}
res = op(item)
print(res)
# {
#   'instruction': 'Write a Python function that prints "hello".',
#   'new_code': "def solution():\n    print('hello')"
# }
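
Stripping Markdown fences from a model reply might look like the sketch below. This is an illustration only: the operator delegates to a private helper (`_parse_code`) whose implementation is not shown here, so `strip_code_fence` and its regular expression are assumptions and may not match the library's behavior in edge cases.

```python
import re

FENCE = '`' * 3  # three backticks, built this way so the example nests cleanly


def strip_code_fence(response):
    """Return the body of the first fenced block, or the trimmed text if none."""
    pattern = re.escape(FENCE) + r'(?:python)?[ \t]*\n(.*?)' + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    if match:
        return match.group(1).strip('\n')
    return response.strip('\n')


fenced = FENCE + "python\ndef solution():\n    print('hello')\n" + FENCE
print(strip_code_fence(fenced))
# def solution():
#     print('hello')
print(strip_code_fence("x = 1\n"))
# x = 1
```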

Source code in lazyllm/tools/data/operators/codegen_ops.py

class ScriptSynthesizer(CodeGenOps):
    """Code-gen pipeline operator: ScriptSynthesizer.

Given a natural language code instruction (often from the previous generated_instruction or a cleaned instruction field), generates the corresponding Python source code, stripping Markdown code fences when present.

Typical output structure (default input_key='instruction', output_key='new_code'):

- instruction: natural language code instruction
- new_code (str): generated Python code string

Args:
    model: a LazyLLM model object (required).
    prompt_template (str|None): optional custom system prompt.
    input_key (str): input instruction field name, default 'instruction'.
    output_key (str): output code field name, default 'new_code'.
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.codegen_ops import ScriptSynthesizer

    op = ScriptSynthesizer(model=model,
                                        input_key='instruction',
                                        output_key='new_code')
    item = {
        'instruction': 'Write a Python function that prints "hello".'
    }
    res = op(item)
    print(res)
    # {
    #   'instruction': 'Write a Python function that prints "hello".',
    #   'new_code': "def solution():\n    print('hello')"
    # }
    ```
    """
    def __init__(self, model=None, prompt_template=None, input_key='instruction', output_key='new_code', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = prompt_template or (
            'You are a senior Python engineer.\n'
            'Given a natural language instruction, generate the corresponding Python code.\n'
            'Return only the code. If you include a Markdown code block, use ```python ... ```.\n'
        )
        self.model = model.share().prompt(sys_prompt)

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.input_key not in data:
            raise ValueError(f'Missing required key: {self.input_key}')
        if self.output_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_key}')
        instruction = data.get(self.input_key, '')
        response = self.model(instruction)
        data[self.output_key] = _parse_code(response)
        return data

`ThresholdSieve`

Bases: CodeGenOps

Code-gen pipeline operator: ThresholdSieve.

Filters samples based on code quality scores produced by LogicIntegrityAuditor:

If quality_score/feedback are missing, it first calls the internal scorer.
If the score is within [min_score, max_score], the sample is kept and labeled.
Otherwise, it returns an empty list [], effectively dropping the sample from the pipeline.

Typical output structure (default output_key='quality_score_filter_label'):

instruction: ...
new_code: ...
quality_score: 8
feedback: 'Good code. ...'
quality_score_filter_label: 1 (1 for passed, 0 otherwise; non-passed samples are dropped)

Parameters:

model –

a LazyLLM model object (required), used by the internal scorer.
min_score (int, default: 7 ) –

minimum score (inclusive) to pass the filter, default 7.
max_score (int, default: 10 ) –

maximum score (inclusive) to pass the filter, default 10.
input_instruction_key (str, default: 'instruction' ) –

input instruction field, default 'instruction'.
input_code_key (str, default: 'new_code' ) –

input code field, default 'new_code'.
output_score_key (str, default: 'quality_score' ) –

score field name, default 'quality_score'.
output_feedback_key (str, default: 'feedback' ) –

feedback field name, default 'feedback'.
output_key (str, default: 'quality_score_filter_label' ) –

filter label field name, default 'quality_score_filter_label'.
**kwargs –

extra args forwarded to the base operator.

Examples:

from lazyllm.tools.data.operators.codegen_ops import ThresholdSieve

op = ThresholdSieve(model=model, min_score=7, max_score=10)
item = {
    'instruction': "Write a Python function that prints 'hello'.",
    'new_code': "def solution():\n    print('hello')"
}
res = op(item)
print(res)
# {
#   'instruction': '...',
#   'new_code': '...',
#   'quality_score': 8,
#   'feedback': 'Good code. The logic is clear and follows PEP8.',
#   'quality_score_filter_label': 1
# }

Source code in lazyllm/tools/data/operators/codegen_ops.py

class ThresholdSieve(CodeGenOps):
    """Code-gen pipeline operator: ThresholdSieve.

Filters samples based on code quality scores produced by LogicIntegrityAuditor:

- If quality_score is missing from the sample, it first calls the internal scorer.
- If the score is within [min_score, max_score], the sample is kept and labeled.
- Otherwise, it returns an empty list [], effectively dropping the sample from the pipeline.

Typical output structure (default output_key='quality_score_filter_label'):

- instruction: ...
- new_code: ...
- quality_score: 8
- feedback: 'Good code. ...'
- quality_score_filter_label: 1  (1 for passed, 0 otherwise; non-passed samples are dropped)

Args:
    model: a LazyLLM model object (required), used by the internal scorer.
    min_score (int): minimum score (inclusive) to pass the filter, default 7.
    max_score (int): maximum score (inclusive) to pass the filter, default 10.
    input_instruction_key (str): input instruction field, default 'instruction'.
    input_code_key (str): input code field, default 'new_code'.
    output_score_key (str): score field name, default 'quality_score'.
    output_feedback_key (str): feedback field name, default 'feedback'.
    output_key (str): filter label field name, default 'quality_score_filter_label'.
    **kwargs: extra args forwarded to the base operator.


Examples:
    ```python
    from lazyllm.tools.data.operators.codegen_ops import ThresholdSieve

    op = ThresholdSieve(model=model, min_score=7, max_score=10)
    item = {
        'instruction': "Write a Python function that prints 'hello'.",
        'new_code': "def solution():\\n    print('hello')"
    }
    res = op(item)
    print(res)
    # {
    #   'instruction': '...',
    #   'new_code': '...',
    #   'quality_score': 8,
    #   'feedback': 'Good code. The logic is clear and follows PEP8.',
    #   'quality_score_filter_label': 1
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        min_score: int = 7,
        max_score: int = 10,
        input_instruction_key: str = 'instruction',
        input_code_key: str = 'new_code',
        output_score_key: str = 'quality_score',
        output_feedback_key: str = 'feedback',
        output_key: str = 'quality_score_filter_label',
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.model = model
        self.min_score = min_score
        self.max_score = max_score
        self.input_instruction_key = input_instruction_key
        self.input_code_key = input_code_key
        self.output_score_key = output_score_key
        self.output_feedback_key = output_feedback_key
        self.output_key = output_key
        self.scorer = LogicIntegrityAuditor(
            model=model,
            input_instruction_key=input_instruction_key,
            input_code_key=input_code_key,
            output_score_key=output_score_key,
            output_feedback_key=output_feedback_key,
        )

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.output_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_key}')

        if self.output_score_key not in data:
            data = self.scorer.forward(data)

        score = data.get(self.output_score_key, 0)
        try:
            score_int = int(score)
        except (ValueError, TypeError):
            score_int = 0
        pass_filter = (self.min_score <= score_int <= self.max_score)
        data[self.output_key] = 1 if pass_filter else 0
        if pass_filter:
            return data
        return []

Text to QA pairs Operators

`lazyllm.tools.data.operators.text2qa_ops`

`ChunkToQA`

Bases: Text2qa

Use an LLM to generate one QA pair (question + answer) per text chunk. Output format is constrained via JsonFormatter; user_prompt can be customized or left as default.

Parameters:

input_key (str, default: 'chunk' ) –

key of the input chunk, default 'chunk'
query_key (str, default: 'query' ) –

key to write the generated question, default 'query'
answer_key (str, default: 'answer' ) –

key to write the generated answer, default 'answer'
model –

optional TrainableModule or compatible; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt prefix; None uses default
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import Text2qa
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = Text2qa.ChunkToQA(input_key='chunk', query_key='query', answer_key='answer', model=llm)
data = [{'chunk': '今天是晴天！'}]
res = op(data)
print(res)
# [{'chunk': '今天是晴天！', 'query': '今天的天气怎么样？', 'answer': '今天是晴天！'}]

Source code in lazyllm/tools/data/operators/text2qa_ops.py

class ChunkToQA(Text2qa):
    """Use an LLM to generate one QA pair (question + answer) per text chunk. Output format is constrained via JsonFormatter; user_prompt can be customized or left as default.

Args:
    input_key (str): key of the input chunk, default 'chunk'
    query_key (str): key to write the generated question, default 'query'
    answer_key (str): key to write the generated answer, default 'answer'
    model: optional TrainableModule or compatible; None uses default Qwen model
    user_prompt (str|None): optional user prompt prefix; None uses default
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import Text2qa
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = Text2qa.ChunkToQA(input_key='chunk', query_key='query', answer_key='answer', model=llm)
    data = [{'chunk': '今天是晴天！'}]
    res = op(data)
    print(res)
    # [{'chunk': '今天是晴天！', 'query': '今天的天气怎么样？', 'answer': '今天是晴天！'}]
    ```
    """
    def __init__(self,
                 input_key='chunk',
                 query_key='query',
                 answer_key='answer',
                 model=None,
                 user_prompt=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.query_key = query_key
        self.answer_key = answer_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求：
        {{
            "{self.query_key}": "生成的问题",
            "{self.answer_key}": "答案"
        }}
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()
        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data: dict):
        assert self.input_key in data
        chunk = data.get(self.input_key, '')

        if not chunk:
            data[self.query_key] = ''
            data[self.answer_key] = ''
            return data

        if self.user_prompt is None:
            user_prompt = '根据下面文本生成一个 QA 对：\n'
        else:
            user_prompt = self.user_prompt

        inp = f'{user_prompt}\n{chunk}'

        qa = self.model(inp)

        data[self.query_key] = qa.get(self.query_key, '')
        data[self.answer_key] = qa.get(self.answer_key, '')
        return data

`QAScorer`

Bases: Text2qa

Use an LLM to score QA pairs: whether the answer is strictly grounded in the source chunk. Outputs 1 (grounded) or 0 (otherwise). Output format constrained via JsonFormatter.

Parameters:

input_key (str, default: 'chunk' ) –

key of the source chunk, default 'chunk'
output_key (str, default: 'score' ) –

key to write the score, default 'score'
query_key (str, default: 'query' ) –

key of the question, default 'query'
answer_key (str, default: 'answer' ) –

key of the answer, default 'answer'
model –

optional TrainableModule or compatible; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt; None uses default rules
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import Text2qa
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = Text2qa.QAScorer(input_key='chunk', output_key='score', query_key='query', answer_key='answer', model=llm)
data = [
{'chunk': '今天是晴天！', 'query': '今天的天气怎么样？', 'answer': '今天是晴天！'},
{'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3'}
]
res = op(data)
print(res)
# [{'chunk': '今天是晴天！', 'query': '今天的天气怎么样？', 'answer': '今天是晴天！', 'score': 1}, {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3', 'score': 0}]

Source code in lazyllm/tools/data/operators/text2qa_ops.py

class QAScorer(Text2qa):
    """Use an LLM to score QA pairs: whether the answer is strictly grounded in the source chunk. Outputs 1 (grounded) or 0 (otherwise). Output format constrained via JsonFormatter.

Args:
    input_key (str): key of the source chunk, default 'chunk'
    output_key (str): key to write the score, default 'score'
    query_key (str): key of the question, default 'query'
    answer_key (str): key of the answer, default 'answer'
    model: optional TrainableModule or compatible; None uses default Qwen model
    user_prompt (str|None): optional user prompt; None uses default rules
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import Text2qa
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = Text2qa.QAScorer(input_key='chunk', output_key='score', query_key='query', answer_key='answer', model=llm)
    data = [
    {'chunk': '今天是晴天！', 'query': '今天的天气怎么样？', 'answer': '今天是晴天！'},
    {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3'}
    ]
    res = op(data)
    print(res)
    # [{'chunk': '今天是晴天！', 'query': '今天的天气怎么样？', 'answer': '今天是晴天！', 'score': 1}, {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3', 'score': 0}]
    ```
    """
    def __init__(self,
                 input_key='chunk',
                 output_key='score',
                 query_key='query',
                 answer_key='answer',
                 model=None,
                 user_prompt=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.query_key = query_key
        self.answer_key = answer_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求：
        {{
            "{self.output_key}": 0 or 1
        }}
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()
        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data: dict):
        assert self.input_key in data
        assert self.query_key in data
        assert self.answer_key in data

        chunk = data.get(self.input_key, '')
        query = data.get(self.query_key, '')
        answer = data.get(self.answer_key, '')

        if not (chunk and query and answer):
            data[self.output_key] = 0
            return data

        qa = f'问题{query}; 答案{answer}'
        if self.user_prompt is None:
            user_prompt = f'''
        请根据下面内容对 QA 打分：

        原文：
        {chunk}

        {qa}

        规则：
        - 严格基于原文 → 1
        - 否则 → 0
        '''
        else:
            user_prompt = self.user_prompt + qa
        res = self.model(user_prompt)

        data[self.output_key] = res.get(self.output_key, 0)
        return data

`TextToChunks`

Bases: Text2qa

Split input text into chunks by lines, with size controlled by token or character count. One input item may expand into multiple output items. Supports optional tokenizer or character-based length.
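
The line-packing behaviour can be sketched as below. This is a simplified, standalone illustration: `chunk_lines` is a hypothetical function, and `length` defaults to `len` (character count) as a stand-in for a tokenizer-based counter. Newline separators are not counted toward the chunk size.

```python
def chunk_lines(text, chunk_size=10, length=len):
    """Greedy line packing; `length` is a stand-in for a token counter."""
    lines = [ln.strip() for ln in text.split('\n') if ln.strip()]
    chunks, parts, size = [], [], 0
    for ln in lines:
        n = length(ln)
        if size + n <= chunk_size:
            parts.append(ln)
            size += n
        else:
            if parts:
                chunks.append('\n'.join(parts))
            parts, size = [ln], n  # an over-long line becomes its own chunk
    if parts:
        chunks.append('\n'.join(parts))
    return chunks


chunks = chunk_lines('line1\nline2\nline3\nline4', chunk_size=10)
long_line = chunk_lines('abcdefghijklmnop', chunk_size=10)
```

A single line longer than `chunk_size` is never split; it is emitted as a chunk of its own.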

Parameters:

input_key (str, default: 'content' ) –

key of the input text field, default 'content'
output_key (str, default: 'chunk' ) –

key to write each chunk into, default 'chunk'
chunk_size (int, default: 10 ) –

max length per chunk (tokens or chars), default 10
tokenize (bool, default: True ) –

whether to count by tokens; if True and no tokenizer is provided, loads the default Qwen tokenizer and falls back to character count if loading fails
tokenizer –

optional tokenizer for counting; if None and tokenize=True, loads default
**kwargs –

other base-class args (e.g. _concurrency_mode, _max_workers)

Examples:

from lazyllm.tools.data import Text2qa

op = Text2qa.TextToChunks(input_key='content', output_key='chunk', chunk_size=10, tokenize=False)
data = [{'content': 'line1\nline2\nline3\nline4'}]
res = op(data)
print(res)
# [{'content': 'line1\nline2\nline3\nline4', 'chunk': 'line1\nline2'}, {'content': 'line1\nline2\nline3\nline4', 'chunk': 'line3\nline4'}]

Source code in lazyllm/tools/data/operators/text2qa_ops.py

class TextToChunks(Text2qa):
    """Split input text into chunks by lines, with size controlled by token or character count. One input item may expand into multiple output items. Supports optional tokenizer or character-based length.

Args:
    input_key (str): key of the input text field, default 'content'
    output_key (str): key to write each chunk into, default 'chunk'
    chunk_size (int): max length per chunk (tokens or chars), default 10
    tokenize (bool): whether to count by tokens; if True and no tokenizer is provided, loads the default Qwen tokenizer and falls back to character count if loading fails
    tokenizer: optional tokenizer for counting; if None and tokenize=True, loads default
    **kwargs: other base-class args (e.g. _concurrency_mode, _max_workers)


Examples:
    ```python
    from lazyllm.tools.data import Text2qa

    op = Text2qa.TextToChunks(input_key='content', output_key='chunk', chunk_size=10, tokenize=False)
    data = [{'content': 'line1\\nline2\\nline3\\nline4'}]
    res = op(data)
    print(res)
    # [{'content': 'line1\\nline2\\nline3\\nline4', 'chunk': 'line1\\nline2'}, {'content': 'line1\\nline2\\nline3\\nline4', 'chunk': 'line3\\nline4'}]
    ```
    """
    def __init__(self,
                 input_key='content',
                 output_key='chunk',
                 chunk_size=10,
                 tokenize=True,
                 tokenizer=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        self.chunk_size = chunk_size
        self.tokenizer = tokenizer
        if tokenize and tokenizer is None:
            LOG.warning(
                f'tokenize=True but tokenizer is None, '
                f'loading tokenizer from default model: {DEFAULT_TOKENIZER}'
            )
            try:
                self.tokenizer = transformers.AutoTokenizer.from_pretrained(
                    DEFAULT_TOKENIZER,
                    trust_remote_code=True
                )
                self.tokenize = True
            except Exception as e:
                LOG.warning(
                    f'failed to load tokenizer from {DEFAULT_TOKENIZER}, '
                    f'falling back to char count, error: {e}'
                )
                self.tokenize = False
                self.tokenizer = None
        else:
            self.tokenizer = tokenizer
            self.tokenize = tokenize

    def _get_len(self, text: str):
        if self.tokenize:
            return len(
                self.tokenizer.encode(text, add_special_tokens=False)
            )
        return len(text)

    def forward(self, data: dict):
        text = data.get(self.input_key, '')
        if not text:
            return []

        lines = [line.strip() for line in text.split('\n') if line.strip()]

        chunks = []
        cur_parts = []
        cur_len = 0

        for line in lines:
            l_len = self._get_len(line)
            if cur_len + l_len <= self.chunk_size:
                cur_parts.append(line)
                cur_len += l_len
            else:
                if cur_parts:
                    chunks.append('\n'.join(cur_parts))
                cur_parts = [line]
                cur_len = l_len

        if cur_parts:
            chunks.append('\n'.join(cur_parts))

        results = []
        for c in chunks:
            item = data.copy()
            item[self.output_key] = c
            results.append(item)

        return results

`empty_or_noise_filter(data, input_key='chunk')`

Filter out empty or noise-only items. If the specified field is empty or contains no word/CJK characters, the item is dropped (returns empty list); otherwise the item is kept. Registered as a single-item forward.
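
The check can be tried on plain strings as follows. This is a minimal sketch; `is_noise` and `WORD_OR_CJK` are hypothetical names used only for illustration, built around the same character class the operator searches for.

```python
import re

# One word character (letters, digits, underscore) or one CJK ideograph
WORD_OR_CJK = re.compile(r'[\w\u4e00-\u9fff]')


def is_noise(text):
    """True when the text is empty or has no word/CJK character at all."""
    return not text or WORD_OR_CJK.search(text) is None


samples = ['hello', '', '\n', '!!! ---', '晴天']
kept = [s for s in samples if not is_noise(s)]
```

Note that a single digit or underscore is enough for a string to count as non-noise.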

Parameters:

data (dict) –

single data dict
input_key (str, default: 'chunk' ) –

key to check, default 'chunk'

Examples:

from lazyllm.tools.data import Text2qa

op = Text2qa.empty_or_noise_filter(input_key='chunk')
data = [{'chunk': 'hello'}, {'chunk': ''}, {'chunk': '\n'}]
res = op(data)
print(res)
# [{'chunk': 'hello'}]

Source code in lazyllm/tools/data/operators/text2qa_ops.py

@data_register('data.Text2qa', rewrite_func='forward', _concurrency_mode='process')
def empty_or_noise_filter(data: dict, input_key='chunk'):
    """Filter out empty or noise-only items. If the specified field is empty or contains no word/CJK characters, the item is dropped (returns empty list); otherwise the item is kept. Registered as a single-item forward.

Args:
    data (dict): single data dict
    input_key (str): key to check, default 'chunk'


Examples:
    ```python
    from lazyllm.tools.data import Text2qa

    op = Text2qa.empty_or_noise_filter(input_key='chunk')
    data = [{'chunk': 'hello'}, {'chunk': ''}, {'chunk': '\\n'}]
    res = op(data)
    print(res)
    # [{'chunk': 'hello'}]
    ```
    """
    text = data.get(input_key, '')
    if not text:
        return []

    if not re.search(r'[\w\u4e00-\u9fff]', text):
        return []

    return data

`invalid_unicode_cleaner(data, input_key='chunk')`

Remove Unicode noncharacters (U+FDD0–U+FDEF, plus the code points ending in FFFE or FFFF in every plane, from U+FFFE/U+FFFF up to U+10FFFE/U+10FFFF) from the specified text field in place. Registered as a single-item forward.
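
The same set of code points can be described arithmetically instead of with an enumerated regular expression. The sketch below is an alternative formulation for illustration only, not the operator's implementation; `strip_noncharacters` is a hypothetical name.

```python
def strip_noncharacters(text):
    """Drop U+FDD0..U+FDEF and every code point whose low 16 bits are FFFE or FFFF."""
    def is_nonchar(cp):
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    return ''.join(ch for ch in text if not is_nonchar(ord(ch)))


cleaned = strip_noncharacters('valid text\ufffe tail\U0001ffff')
```

Masking with `0xFFFE` ignores the lowest bit, so one comparison covers both the FFFE and FFFF code point of each plane.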

Parameters:

data (dict) –

single data dict
input_key (str, default: 'chunk' ) –

key of the text field to clean, default 'chunk'

Examples:

from lazyllm.tools.data import Text2qa

op = Text2qa.invalid_unicode_cleaner(input_key='chunk')
data = {'chunk': 'valid text\ufffe tail'}
res = op(data)  # strips the invalid code point U+FFFE
print(res)
# [{'chunk': 'valid text tail'}]

Source code in lazyllm/tools/data/operators/text2qa_ops.py

@data_register('data.Text2qa', rewrite_func='forward', _concurrency_mode='process')
def invalid_unicode_cleaner(data: dict, input_key='chunk'):
    """Remove invalid Unicode code points (e.g. FDD0–FDEF, FFFE/FFFF and certain Supplementary Special Purpose ranges) from the specified text field in place. Registered as a single-item forward.

Args:
    data (dict): single data dict
    input_key (str): key of the text field to clean, default 'chunk'


Examples:
    ```python
    from lazyllm.tools.data import Text2qa

    op = Text2qa.invalid_unicode_cleaner(input_key='chunk')
    data = {'chunk': 'valid text\\ufffe tail'}
    res = op(data)  # strips the invalid code point U+FFFE
    print(res)
    # [{'chunk': 'valid text tail'}]
    ```
    """
    text = data.get(input_key, '')
    if not text:
        return data

    text = re.sub(
        r'[\uFDD0-\uFDEF\uFFFE\uFFFF'
        r'\U0001FFFE\U0001FFFF'
        r'\U0002FFFE\U0002FFFF'
        r'\U0003FFFE\U0003FFFF'
        r'\U0004FFFE\U0004FFFF'
        r'\U0005FFFE\U0005FFFF'
        r'\U0006FFFE\U0006FFFF'
        r'\U0007FFFE\U0007FFFF'
        r'\U0008FFFE\U0008FFFF'
        r'\U0009FFFE\U0009FFFF'
        r'\U000AFFFE\U000AFFFF'
        r'\U000BFFFE\U000BFFFF'
        r'\U000CFFFE\U000CFFFF'
        r'\U000DFFFE\U000DFFFF'
        r'\U000EFFFE\U000EFFFF'
        r'\U000FFFFE\U000FFFFF'
        r'\U0010FFFE\U0010FFFF]',
        '',
        text
    )

    data[input_key] = text
    return data

CoT QA Operators

`lazyllm.tools.data.operators.cot_ops`

`CoTGenerator`

Bases: GenCot

Use an LLM to generate chain-of-thought reasoning for a question, with the final answer wrapped in \boxed{ANSWER}. Writes the result to the specified output key.

Parameters:

input_key (str, default: 'query' ) –

key of the input question, default 'query'
output_key (str, default: 'cot_answer' ) –

key to write the CoT answer, default 'cot_answer'
model –

optional TrainableModule or compatible; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt prefix; None uses default
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import genCot
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = genCot.CoTGenerator(input_key='query', output_key='cot_answer', model=llm)
data = {'query': 'What is 2+2?'}
res = op(data)  # the item gets 'cot_answer' containing the CoT and \boxed{4}
print(res)
# {'query': 'What is 2+2?', 'cot_answer': '首先，我们需要理解加法的基本概念，即两个或多个数值的总和。在这个问题中，我们需要计算 2 和另一个 2 的和。
#
# 第一步，我们识别出第一个数值是 2。
#
# 第二步，我们识别出第二个数值也是 2。
#
# 第三步，我们将这两个数值相加：2 + 2。
#
# 第四步，我们进行计算：2 + 2 = 4。
#
# 因此，最终答案是 4，使用规定的格式包裹答案。
#
# 最终答案：\boxed{4}'}

Source code in lazyllm/tools/data/operators/cot_ops.py

class CoTGenerator(GenCot):
    """Use an LLM to generate chain-of-thought reasoning for a question, with final answer wrapped in \\boxed{{ANSWER}}. Writes result to the specified output key.

Args:
    input_key (str): key of the input question, default 'query'
    output_key (str): key to write the CoT answer, default 'cot_answer'
    model: optional TrainableModule or compatible; None uses default Qwen model
    user_prompt (str|None): optional user prompt prefix; None uses default
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import genCot
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = genCot.CoTGenerator(input_key='query', output_key='cot_answer', model=llm)
    data = {'query': 'What is 2+2?'}
    res = op(data)  # the item gets 'cot_answer' containing the CoT and \\boxed{4}
    print(res)
    # {'query': 'What is 2+2?', 'cot_answer': '首先，我们需要理解加法的基本概念，即两个或多个数值的总和。在这个问题中，我们需要计算 2 和另一个 2 的和。
    #
    # 第一步，我们识别出第一个数值是 2。
    #
    # 第二步，我们识别出第二个数值也是 2。
    #
    # 第三步，我们将这两个数值相加：2 + 2。
    #
    # 第四步，我们进行计算：2 + 2 = 4。
    #
    # 因此，最终答案是 4，使用规定的格式包裹答案。
    #
    # 最终答案：\\boxed{4}'}
    ```
    """
    def __init__(self,
                 input_key='query',
                 output_key='cot_answer',
                 model=None,
                 user_prompt=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求：
        {{
            "{self.output_key}": "包含CoT推理过程和最终boxed答案"
        }}
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):
        question = data.get(self.input_key, '')
        if not question:
            data[self.output_key] = None
            return data

        base_prompt = f'''
        问题：
        {question}

        规则：
        - 输出详细CoT
        - 最终答案必须使用 \\boxed{{ANSWER}} 包裹
        '''

        if self.user_prompt is None:
            user_prompt = '请为这个问题生成带有思维链（Chain-of-Thought, CoT）的输出结果：\n' + base_prompt
        else:
            user_prompt = self.user_prompt + '\n' + f'问题：{question}'

        res = self.model(user_prompt)
        data[self.output_key] = res.get(self.output_key, None)
        return data

`SelfConsistencyCoTGenerator`

Bases: GenCot

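Sample multiple CoT answers for the same question, extract the \boxed{} answer from each, take a majority vote, and output the first sampled CoT whose answer matches the majority. When at least one answer is extracted, the extracted answers are also stored under the 'candidates' key.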
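
The voting step can be sketched without a model by feeding in pre-written responses. This is a minimal illustration under stated assumptions: `extract_boxed` is a hypothetical, simplified extractor (it does not handle nested braces) and is not the library's `boxed_res_extractor`; `pick_majority` is likewise a hypothetical name.

```python
import re
from collections import Counter


def extract_boxed(text):
    """Hypothetical extractor: last \\boxed{...} without nested braces."""
    found = re.findall(r'\\boxed\{([^{}]*)\}', text)
    return found[-1].strip() if found else None


def pick_majority(responses):
    """Return (majority answer, first response that reached it)."""
    scored = [(r, extract_boxed(r)) for r in responses]
    answers = [a for _, a in scored if a is not None]
    if not answers:
        return None, None
    majority = Counter(answers).most_common(1)[0][0]
    cot = next(r for r, a in scored if a == majority)
    return majority, cot


responses = [
    r'3 * 4 = 12, so \boxed{12}',
    r'3 + 4 = 7, so \boxed{7}',
    r'Four threes make \boxed{12}',
    'no final answer given',
]
answer, cot = pick_majority(responses)
```

Responses without a boxed answer do not take part in the vote, so a single well-formed sample is enough to produce a result.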

Parameters:

input_key (str, default: 'query' ) –

key of the input question, default 'query'
output_key (str, default: 'cot_answer' ) –

key to write the CoT answer, default 'cot_answer'
num_samples (int, default: 5 ) –

number of samples, default 5
model –

optional; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import genCot
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = genCot.SelfConsistencyCoTGenerator(
    input_key='query',
    output_key='cot_answer',
    num_samples=3,
    model=llm
)

data = {'query': 'What is 3*4?'}
res = op(data)
print(res)
# {'query': 'What is 3*4?', 'candidates': ['12', '12', '12'], 'cot_answer': '首先，我们需要理解问题的核心，即计算3乘以4的结果。
#
# 1. 确定操作：这是一个乘法问题，我们需要将两个数相乘。
# 2. 识别数字：问题中给出的两个数字是3和4。
# 3. 执行乘法：将3乘以4，计算过程如下：
#    - 3 * 4 = 12
#
# 因此，3乘以4的结果是12。
#
# 最终答案为：\boxed{12}'}

Source code in lazyllm/tools/data/operators/cot_ops.py

class SelfConsistencyCoTGenerator(GenCot):
    """Sample multiple CoT answers for the same question, extract \\boxed{{}} answers, take majority vote, and output one CoT that matches the majority answer.

Args:
    input_key (str): key of the input question, default 'query'
    output_key (str): key to write the CoT answer, default 'cot_answer'
    num_samples (int): number of samples, default 5
    model: optional; None uses default Qwen model
    user_prompt (str|None): optional user prompt
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import genCot
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = genCot.SelfConsistencyCoTGenerator(
        input_key='query',
        output_key='cot_answer',
        num_samples=3,
        model=llm
    )

    data = {'query': 'What is 3*4?'}
    res = op(data)
    print(res)
    # {'query': 'What is 3*4?', 'candidates': ['12', '12', '12'], 'cot_answer': '首先，我们需要理解问题的核心，即计算3乘以4的结果。
    #
    # 1. 确定操作：这是一个乘法问题，我们需要将两个数相乘。
    # 2. 识别数字：问题中给出的两个数字是3和4。
    # 3. 执行乘法：将3乘以4，计算过程如下：
    #    - 3 * 4 = 12
    #
    # 因此，3乘以4的结果是12。
    #
    # 最终答案为：\\boxed{12}'}
    ```
    """
    def __init__(self,
                 input_key='query',
                 output_key='cot_answer',
                 num_samples=5,
                 model=None,
                 user_prompt=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.num_samples = num_samples
        self.user_prompt = user_prompt

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()
        self.model.start()

    def _build_prompt(self, question):
        base_prompt = f'''
        问题：
        {question}

        规则：
        - 输出详细CoT
        - 最终答案必须使用 \\boxed{{ANSWER}} 包裹
        '''
        if self.user_prompt is None:
            return '请为这个问题生成带有思维链（Chain-of-Thought, CoT）的输出结果：\n' + base_prompt
        return self.user_prompt + '\n' + f'问题：{question};'

    def forward(self, data):
        question = data.get(self.input_key, '')
        if not question:
            data[self.output_key] = None
            return data

        cot_list = []
        boxed_answers = []

        prompt = self._build_prompt(question)
        candidates = []
        for _ in range(self.num_samples):
            response = self.model(prompt)
            cot = response
            boxed = boxed_res_extractor(response)
            candidates.append(boxed)
            if boxed is not None:
                cot_list.append(cot)
                boxed_answers.append(boxed)

        if not boxed_answers:
            data[self.output_key] = None
            return data

        counter = Counter(boxed_answers)
        majority_answer = counter.most_common(1)[0][0]
        data['candidates'] = candidates
        for cot, ans in zip(cot_list, boxed_answers):
            if ans == majority_answer:
                data[self.output_key] = cot
                return data

        data[self.output_key] = None
        return data

`answer_verify(data, answer_key='reference', infer_key='llm_extracted', output_key='is_equal')`

Compare reference answer and model-extracted answer for mathematical equality. Uses math_verify to parse and verify; result written to the specified key. Registered as single-item forward.
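
To see why a plain string comparison is not enough here, the sketch below uses the standard library's `fractions.Fraction` as a simplified stand-in for the parse-then-verify flow. It is not `math_verify` and only handles plain numeric strings; `numerically_equal` is a hypothetical name.

```python
from fractions import Fraction


def numerically_equal(reference, inferred):
    """Stand-in check: exact rational comparison of two numeric strings."""
    if reference is None or inferred is None:
        return False
    try:
        return Fraction(str(reference).strip()) == Fraction(str(inferred).strip())
    except (ValueError, ZeroDivisionError):
        return False  # not a plain number, e.g. a symbolic expression


results = [
    numerically_equal('1/2', '0.5'),
    numerically_equal('3', '3.0'),
    numerically_equal('1/3', '0.33'),
    numerically_equal('x+1', '1+x'),
]
```

The last case shows the limit of this stand-in: symbolic expressions cannot be compared this way, which is the kind of input a dedicated math verifier is meant to handle.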

Parameters:

data (dict) –

single data dict
answer_key (str, default: 'reference' ) –

key of reference answer, default 'reference'
infer_key (str, default: 'llm_extracted' ) –

key of LLM-extracted answer, default 'llm_extracted'
output_key (str, default: 'is_equal' ) –

key to write equality result, default 'is_equal'

Examples:

from lazyllm.tools.data import genCot

data = {'reference': '1/2', 'llm_extracted': '0.5'}
op = genCot.answer_verify(answer_key='reference', infer_key='llm_extracted', output_key='is_equal')
print(op(data))  # Add key/value: 'is_equal': True
# {'reference': '1/2', 'llm_extracted': '0.5', 'is_equal': True}

Source code in lazyllm/tools/data/operators/cot_ops.py

@data_register('data.genCot', rewrite_func='forward')
def answer_verify(data, answer_key='reference', infer_key='llm_extracted', output_key='is_equal'):
    """Compare reference answer and model-extracted answer for mathematical equality. Uses math_verify to parse and verify; result written to the specified key. Registered as single-item forward.

Args:
    data (dict): single data dict
    answer_key (str): key of reference answer, default 'reference'
    infer_key (str): key of LLM-extracted answer, default 'llm_extracted'
    output_key (str): key to write equality result, default 'is_equal'


Examples:
    ```python
    from lazyllm.tools.data import genCot

    data = {'reference': '1/2', 'llm_extracted': '0.5'}
    op = genCot.answer_verify(answer_key='reference', infer_key='llm_extracted', output_key='is_equal')
    print(op(data))  # Add key/value: 'is_equal': True
    # {'reference': '1/2', 'llm_extracted': '0.5', 'is_equal': True}
    ```
    """
    real_answer = data.get(answer_key, None)
    llm_answer = data.get(infer_key, None)

    if real_answer is None or llm_answer is None:
        data[output_key] = False
        return data

    try:
        parsed_real = math_verify.parse(str(real_answer))
        parsed_llm = math_verify.parse(str(llm_answer))
        data[output_key] = math_verify.verify(parsed_real, parsed_llm)

    except Exception as e:
        LOG.error(f'Error verifying answers: {e}')
        data[output_key] = False

    return data

Enhanced QA Operators

`lazyllm.tools.data.operators.enQa_ops`

`DiversityScorer`

Bases: EnQA

Score diversity of a list of questions; output list matches input order, each item has rewritten_query and diversity_score (0 similar / 1 diverse).

Parameters:

input_key (str, default: 'rewrite_querys' ) –

key of the question list, default 'rewrite_querys'
output_key (str, default: 'diversity_querys' ) –

key to write the scored list, default 'diversity_querys'
model –

optional; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import EnQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = EnQA.DiversityScorer(input_key='rewrite_querys', output_key='diversity_querys', model=llm)
data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!']}
res = op(data)
print(data)
# {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
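
The pairing step in forward can be tried on its own, without a model. The snippet below is a minimal sketch, assuming the model reply has already been parsed into a list of scores; `pair_scores` is a hypothetical helper name, not part of lazyllm. It shows that questions without a matching score fall back to 0.

```python
def pair_scores(querys, scores):
    """Pair each question with its score, keeping input order.

    Hypothetical helper mirroring the loop in DiversityScorer.forward:
    a question with no matching score gets 0.
    """
    paired = []
    for i, q in enumerate(querys):
        score = scores[i] if i < len(scores) else 0
        paired.append({'rewritten_query': q, 'diversity_score': score})
    return paired


# Only two scores for three questions: the third one defaults to 0.
print(pair_scores(['q1', 'q2', 'q3'], [1, 0]))
# [{'rewritten_query': 'q1', 'diversity_score': 1}, {'rewritten_query': 'q2', 'diversity_score': 0}, {'rewritten_query': 'q3', 'diversity_score': 0}]
```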

Source code in lazyllm/tools/data/operators/enQa_ops.py

class DiversityScorer(EnQA):
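    """Score the diversity of a list of questions with an LLM. The output list follows the input order, and each item holds rewritten_query and diversity_score (0 = repeated or highly similar, 1 = clearly different). A question for which the model returns no score gets 0.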

Args:
    input_key (str): key of the question list, default 'rewrite_querys'
    output_key (str): key to write the scored list, default 'diversity_querys'
    model: optional; None uses default Qwen model
    user_prompt (str|None): optional user prompt
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import EnQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = EnQA.DiversityScorer(input_key='rewrite_querys', output_key='diversity_querys', model=llm)
    data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!']}
    res = op(data)
    print(data)
    # {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
    ```
    """

    def __init__(self,
                 input_key='rewrite_querys',
                 output_key='diversity_querys',
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.user_prompt = user_prompt

        output_structure = '''
        输出格式要求：
        {
            "diversity_scores": [0,1]
        }
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):
        querys = data.get(self.input_key)
        if not querys:
            return None

        if data.get(self.output_key) is not None:
            return None

        base_prompt = f'''
        问题列表：
        {querys}

        规则：
        - 表达重复或相似度高：score = 0
        - 表达差异明显：score = 1
        - 输出与输入顺序一致
        '''

        if self.user_prompt is None:
            prompt = '判断下面问题列表的表达多样性。\n' + base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'问题列表：{querys};'

        res = self.model(prompt)

        scores = res.get('diversity_scores', [])

        new_list = []
        for i, q in enumerate(querys):
            score = scores[i] if i < len(scores) else 0
            new_list.append({
                'rewritten_query': q,
                'diversity_score': score
            })

        data[self.output_key] = new_list
        return data

`QueryRewriter`

Bases: EnQA

Use an LLM to rewrite the original question into multiple semantically equivalent formulations. Writes a list to the specified output key.

Parameters:

input_key (str, default: 'query' ) –

key of the input question, default 'query'
output_key (str, default: 'rewrite_querys' ) –

key to write the list of rewrites, default 'rewrite_querys'
rewrite_num (int, default: 3 ) –

number of rewrites to generate, default 3
model –

optional; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import EnQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = EnQA.QueryRewriter(input_key='query', output_key='rewrite_querys', rewrite_num=2, model=llm)
data = {'query': 'What is machine learning?'}
res = op(data)  # data gets 'rewrite_querys': [str, str, ...]
print(res)
# [{'query': 'What is machine learning?', 'rewrite_querys': ['Could you explain what machine learning is?', 'What does the term machine learning refer to?']}]
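
How user_prompt changes the request sent to the model can be sketched as follows. This is an illustration only: `build_rewrite_prompt` is a hypothetical helper, and its English strings paraphrase the operator's built-in Chinese prompt text. The point it shows is that a custom prompt replaces the default instruction and rules, while the question and the requested count are still appended.

```python
def build_rewrite_prompt(query, rewrite_num=3, user_prompt=None):
    """Assemble a prompt roughly the way QueryRewriter.forward does.

    Hypothetical helper; the English strings are paraphrases of the
    operator's Chinese prompt, not the text it actually sends.
    """
    if user_prompt is None:
        rules = (
            f'Original question:\n{query}\n\n'
            'Rules:\n'
            f'- Generate {rewrite_num} different phrasings\n'
            '- Keep the meaning unchanged\n'
            '- Do not explain\n'
        )
        return 'Rewrite the question below.\n' + rules
    # A custom prompt replaces the default rules; the question and the
    # requested count are still appended.
    return (f'{user_prompt}\nOriginal question: {query}\n'
            f'Generate {rewrite_num} different phrasings')


print(build_rewrite_prompt('What is machine learning?', rewrite_num=2))
print(build_rewrite_prompt('What is machine learning?', 2, user_prompt='Use a formal tone.'))
```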

Source code in lazyllm/tools/data/operators/enQa_ops.py

class QueryRewriter(EnQA):
    """Use an LLM to rewrite the original question into multiple semantically equivalent formulations. Writes a list to the specified output key.

Args:
    input_key (str): key of the input question, default 'query'
    output_key (str): key to write the list of rewrites, default 'rewrite_querys'
    rewrite_num (int): number of rewrites to generate, default 3
    model: optional; None uses default Qwen model
    user_prompt (str|None): optional user prompt
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import EnQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = EnQA.QueryRewriter(input_key='query', output_key='rewrite_querys', rewrite_num=2, model=llm)
    data = {'query': 'What is machine learning?'}
    res = op(data)  # data gets 'rewrite_querys': [str, str, ...]
    print(res)
    # [{'query': 'What is machine learning?', 'rewrite_querys': ['Could you explain what machine learning is?', 'What does the term machine learning refer to?']}]
    ```
    """

    def __init__(self,
                 input_key='query',
                 output_key='rewrite_querys',
                 rewrite_num=3,
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.rewrite_num = rewrite_num
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求：
        {{
            "{self.output_key}": ["rewrite1","rewrite2"]
        }}
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):

        query = data.get(self.input_key)
        if not query:
            return None

        if data.get(self.output_key) is not None:
            return None

        base_prompt = f'''
        原问题：
        {query}

        规则：
        - 生成 {self.rewrite_num} 个不同表达
        - 保持语义一致
        - 不要解释
        '''

        if self.user_prompt is None:
            prompt = '请重写下面的问题，使其语义一致但表达不同。\n' + base_prompt
        else:
            prompt = self.user_prompt + \
                '\n' + f'原问题：{query} \n 生成 {self.rewrite_num} 个不同表达'

        res = self.model(prompt)

        data[self.output_key] = res.get(self.output_key, [])
        return data

`diversity_filter(data, input_key, min_score)`

Filter by diversity score. If the value at input_key is below min_score (a missing key counts as 0), the item is dropped (returns []); otherwise it is kept (returns None, which leaves the original data unchanged). Registered as single-item forward.

Parameters:

data (dict) –

single data dict
input_key (str) –

key holding the score
min_score –

minimum score threshold

Examples:

from lazyllm.tools.data import EnQA

data = {'query': 'a and b', 'rewritten_query': 'b', 'diversity_score': 0}
op = EnQA.diversity_filter(input_key='diversity_score', min_score=1)
print(op(data))  # score 0 is below min_score 1, so the item is dropped
# []
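
The return convention (None keeps the row, [] drops it) can be seen in a small stand-alone sketch. Both function names below are hypothetical, and `run_filter` is only an assumed stand-in for how a caller might interpret the two return values; it is not the base class's dispatch code.

```python
def diversity_filter_sketch(data, input_key, min_score):
    """Same decision as diversity_filter: None keeps the row, [] drops it."""
    score = data.get(input_key, 0)  # a missing score counts as 0
    return None if score >= min_score else []


def run_filter(rows, input_key, min_score):
    """Hypothetical driver: treat None as 'keep unchanged', [] as 'drop'."""
    kept = []
    for row in rows:
        if diversity_filter_sketch(row, input_key, min_score) is None:
            kept.append(row)
    return kept


rows = [
    {'rewritten_query': 'a', 'diversity_score': 1},
    {'rewritten_query': 'b', 'diversity_score': 0},
    {'rewritten_query': 'c'},  # no score at all
]
print(run_filter(rows, 'diversity_score', 1))
# [{'rewritten_query': 'a', 'diversity_score': 1}]
```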

Source code in lazyllm/tools/data/operators/enQa_ops.py

@data_register('data.enQA', rewrite_func='forward')
def diversity_filter(data, input_key, min_score):
    """Filter by diversity score. If the value at input_key is below min_score (a missing key counts as 0), the item is dropped (returns []); otherwise it is kept (returns None, which leaves the original data unchanged). Registered as single-item forward.

Args:
    data (dict): single data dict
    input_key (str): key holding the score
    min_score: minimum score threshold


Examples:
    ```python
    from lazyllm.tools.data import EnQA

    data = {'query': 'a and b', 'rewritten_query': 'b', 'diversity_score': 0}
    op = EnQA.diversity_filter(input_key='diversity_score', min_score=1)
    print(op(data))  # score 0 is below min_score 1, so the item is dropped
    # []
    ```
    """
    score = data.get(input_key, 0)
    if score >= min_score:
        return None
    return []

`post_processor(data, input_key)`

Expand the specified key (list of dicts) into multiple rows: each dict merged with original data as one row, list key removed. Returns list of rows or None if no data. Registered as single-item forward.

Parameters:

data (dict) –

single data dict
input_key (str) –

key of the list of dicts to expand

Examples:

from lazyllm.tools.data import EnQA

data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
op = EnQA.post_processor(input_key='diversity_querys')
print(op(data))  
# [{'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]
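
The expansion can be reproduced on a smaller input with plain Python. The sketch below uses a hypothetical `expand_rows` function that follows the same steps as post_processor; it also shows that list entries that are not dicts are skipped.

```python
def expand_rows(data, input_key):
    """Hypothetical re-statement of post_processor's row expansion."""
    items = data.get(input_key)
    if not items:
        return None  # nothing to expand
    rows = []
    for obj in items:
        if not isinstance(obj, dict):
            continue  # non-dict entries are ignored
        # Copy every original field except the list being expanded,
        # then overlay the fields of the current list entry.
        row = {k: v for k, v in data.items() if k != input_key}
        row.update(obj)
        rows.append(row)
    return rows


data = {
    'query': 'q',
    'diversity_querys': [
        {'rewritten_query': 'a', 'diversity_score': 1},
        'not a dict',
        {'rewritten_query': 'b', 'diversity_score': 0},
    ],
}
print(expand_rows(data, 'diversity_querys'))
# [{'query': 'q', 'rewritten_query': 'a', 'diversity_score': 1}, {'query': 'q', 'rewritten_query': 'b', 'diversity_score': 0}]
```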

Source code in lazyllm/tools/data/operators/enQa_ops.py

@data_register('data.enQA', rewrite_func='forward')
def post_processor(data, input_key):
    """Expand the specified key (list of dicts) into multiple rows: each dict merged with original data as one row, list key removed. Returns list of rows or None if no data. Registered as single-item forward.

Args:
    data (dict): single data dict
    input_key (str): key of the list of dicts to expand


Examples:
    ```python
    from lazyllm.tools.data import EnQA

    data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
    op = EnQA.post_processor(input_key='diversity_querys')
    print(op(data))  
    # [{'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]
    ```
    """
    items = data.get(input_key)
    if not items:
        return None

    result = []
    for obj in items:

        if not isinstance(obj, dict):
            continue

        new_row = data.copy()
        new_row.pop(input_key, None)
        for k, v in obj.items():
            new_row[k] = v

        result.append(new_row)

    return result

Math QA Operators

`lazyllm.tools.data.operators.math_ops`

`DifficultyEvaluator`

Bases: MathQA

Use an LLM to evaluate math question difficulty; output Easy | Medium | Hard. Skips if difficulty already present.

Parameters:

input_key (str, default: 'question' ) –

key of the question, default 'question'
output_key (str, default: 'difficulty' ) –

key to write difficulty, default 'difficulty'
model –

optional; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = MathQA.DifficultyEvaluator(input_key='question', output_key='difficulty', model=llm)
data = {'question': '1+1=?'}
res = op(data)  # each item gets 'difficulty': 'Easy'|'Medium'|'Hard'
print(res)
# [{'question': '1+1=?', 'difficulty': 'Easy'}]

Source code in lazyllm/tools/data/operators/math_ops.py

class DifficultyEvaluator(MathQA):
    """Use an LLM to evaluate math question difficulty; output Easy | Medium | Hard. Skips if difficulty already present.

Args:
    input_key (str): key of the question, default 'question'
    output_key (str): key to write difficulty, default 'difficulty'
    model: optional; None uses default Qwen model
    user_prompt (str|None): optional user prompt
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import MathQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = MathQA.DifficultyEvaluator(input_key='question', output_key='difficulty', model=llm)
    data = {'question': '1+1=?'}
    res = op(data)  # each item gets 'difficulty': 'Easy'|'Medium'|'Hard'
    print(res)
    # [{'question': '1+1=?', 'difficulty': 'Easy'}]
    ```
    """
    def __init__(self,
                 input_key='question',
                 output_key='difficulty',
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求：
        {{
            "{self.output_key}": "难度"
        }}
        '''

        # Check for None first: calling share() on None would raise AttributeError
        self.model = model.share() if model is not None else TrainableModule(DEFAULT_MODEL)

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):

        if data.get(self.output_key) is not None:
            return None

        question = data.get(self.input_key)

        base_prompt = f'''
        问题：
        {question}

        难度级别：
        - Easy : 小学
        - Medium : 初中/高中
        - Hard : 大学及以上

        '''

        if self.user_prompt is None:
            prompt = '判断下面数学问题的难度。\n' + base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'问题：{question}'

        res = self.model(prompt)

        data[self.output_key] = res.get(self.output_key)
        return data

`DuplicateAnswerDetector`

Bases: MathQA

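Detect repetitive answers without calling a model. Three checks run in order: the whole answer is a periodic repetition of one unit, a sentence of at least 10 characters occurs 3 or more times in the answer, or a substring of min_repeat_len characters occurs at least repeat_threshold times in question + answer. Sets output_key to True if any check fires, otherwise False.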

Parameters:

question_key (str, default: 'question' ) –

key of the question, default 'question'
answer_key (str, default: 'answer' ) –

key of the answer, default 'answer'
output_key (str, default: 'duplicate' ) –

key to write duplicate flag, default 'duplicate'
min_repeat_len (int, default: 15 ) –

min substring length for long repeat, default 15
repeat_threshold (int, default: 2 ) –

occurrence threshold for substring, default 2
periodic_min_repeat (int, default: 3 ) –

min period repeats for periodic, default 3
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import MathQA

op = MathQA.DuplicateAnswerDetector(question_key='question', answer_key='answer', output_key='duplicate')
data = {'question': 'Q', 'answer': 'A' * 50}
res = op(data)  # data['duplicate'] True
print(res)
# [{'question': 'Q', 'answer': 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 'duplicate': True}]
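
Two of the checks are easy to try in isolation. The sketch below restates the periodic and long-substring checks as free functions; the function names and keyword names are hypothetical, and the defaults copy the operator's documented defaults.

```python
def is_periodic(text, min_repeat=3):
    """True if text is one unit repeated at least min_repeat times."""
    n = len(text)
    if n < 6:
        return False  # very short answers are never flagged
    for size in range(1, n // 2 + 1):
        if n % size:
            continue  # the unit length must divide the text length
        if text[:size] * (n // size) == text and n // size >= min_repeat:
            return True
    return False


def has_long_repeat(text, min_len=15, threshold=2):
    """True if any min_len-character window occurs threshold times."""
    seen = {}
    for i in range(len(text) - min_len + 1):
        window = text[i:i + min_len]
        if not window.strip():
            continue  # ignore windows that are only whitespace
        seen[window] = seen.get(window, 0) + 1
        if seen[window] >= threshold:
            return True
    return False


print(is_periodic('A' * 50))   # True
print(is_periodic('abcabc'))   # False: the unit 'abc' repeats only twice
print(has_long_repeat('the quick brown fox. the quick brown fox.'))  # True
print(has_long_repeat('hello world'))  # False: shorter than one window
```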

Source code in lazyllm/tools/data/operators/math_ops.py

class DuplicateAnswerDetector(MathQA):
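    """Detect repetitive answers without calling a model. Three checks run in order: the whole answer is a periodic repetition of one unit, a sentence of at least 10 characters occurs 3 or more times in the answer, or a substring of min_repeat_len characters occurs at least repeat_threshold times in question + answer. Sets output_key to True if any check fires, otherwise False.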

Args:
    question_key (str): key of the question, default 'question'
    answer_key (str): key of the answer, default 'answer'
    output_key (str): key to write duplicate flag, default 'duplicate'
    min_repeat_len (int): min substring length for long repeat, default 15
    repeat_threshold (int): occurrence threshold for substring, default 2
    periodic_min_repeat (int): min period repeats for periodic, default 3
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import MathQA

    op = MathQA.DuplicateAnswerDetector(question_key='question', answer_key='answer', output_key='duplicate')
    data = {'question': 'Q', 'answer': 'A' * 50}
    res = op(data)  # data['duplicate'] True
    print(res)
    # [{'question': 'Q', 'answer': 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 'duplicate': True}]
    ```
    """
    def __init__(self,
                 question_key='question',
                 answer_key='answer',
                 output_key='duplicate',
                 min_repeat_len=15,
                 repeat_threshold=2,
                 periodic_min_repeat=3,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.question_key = question_key
        self.answer_key = answer_key
        self.output_key = output_key

        self.min_repeat_len = min_repeat_len
        self.repeat_threshold = repeat_threshold
        self.periodic_min_repeat = periodic_min_repeat

    def _is_periodic(self, text):
        n = len(text)
        if n < 6:
            return False
        for size in range(1, n // 2 + 1):
            if n % size != 0:
                continue

            unit = text[:size]
            if unit * (n // size) == text:
                if (n // size) >= self.periodic_min_repeat:
                    return True

        return False

    def _has_long_repeat(self, merged_text):
        seen = {}
        text_len = len(merged_text)

        for i in range(text_len - self.min_repeat_len + 1):

            substr = merged_text[i:i + self.min_repeat_len]

            if not substr.strip():
                continue

            seen[substr] = seen.get(substr, 0) + 1

            if seen[substr] >= self.repeat_threshold:
                return True

        return False

    def _sentence_repeat(self, answer):
        sentences = re.split(r'[。！？.!?\n]', answer)
        counter = {}
        for s in sentences:
            s = s.strip()
            if len(s) < 10:
                continue
            counter[s] = counter.get(s, 0) + 1
            if counter[s] >= 3:
                return True
        return False

    def forward(self, data):
        assert isinstance(data, dict)
        question = str(data.get(self.question_key, ''))
        answer = str(data.get(self.answer_key, ''))
        data[self.output_key] = False
        if not answer:
            return data

        merged = question + '\n' + answer
        if self._is_periodic(answer):
            data[self.output_key] = True
            return data

        if self._sentence_repeat(answer):
            data[self.output_key] = True
            return data

        if self._has_long_repeat(merged):
            data[self.output_key] = True
            return data

        return data

`MathAnswerGenerator`

Bases: MathQA

Use an LLM to generate step-by-step reasoning and an answer for a math question, with the final result wrapped in \boxed{ANSWER}. Skips items that already have an answer unless the regenerate flag is set.

Parameters:

input_key (str, default: 'question' ) –

key of the question, default 'question'
output_key (str, default: 'answer' ) –

key to write the answer, default 'answer'
regenerate_key (str, default: 'regenerate' ) –

key for force-regenerate flag, default 'regenerate'
model –

optional; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = MathQA.MathAnswerGenerator(input_key='question', output_key='answer', model=llm)
data = [{'question': 'Solve 10 * 10'}]
res = op(data) 
print(res)
# [{'question': 'Solve 10 * 10', 'answer': '首先，我们需要计算 \(10 \times 10\)。这是一个简单的乘法运算，其中两个乘数都是10。
#
# 步骤1：写下乘数10和另一个乘数10。
# 步骤2：将两个10相乘。
#
# 计算过程如下：
# \[ 10 \times 10 = 100 \]
#
# 因此，最终结果是 \(\boxed{100}\)。', 'regenerate': False}]
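
The skip rule can be checked without a model. The helper below is a hypothetical restatement of the condition at the top of forward; `should_generate` is not part of lazyllm.

```python
def should_generate(data, output_key='answer', regenerate_key='regenerate'):
    """Return True when MathAnswerGenerator would call the model."""
    answer = data.get(output_key)
    regenerate = data.get(regenerate_key, False)
    # The item is skipped only when an answer exists and the flag
    # is exactly False (identity check, as in the operator).
    return not (answer is not None and regenerate is False)


print(should_generate({'question': 'q'}))                                     # True
print(should_generate({'question': 'q', 'answer': '4'}))                      # False
print(should_generate({'question': 'q', 'answer': '4', 'regenerate': True}))  # True
```

Because the comparison is `regenerate is False`, a falsy value that is not the `False` object (for example `0`) does not suppress regeneration; storing a real boolean under the regenerate key avoids surprises.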

Source code in lazyllm/tools/data/operators/math_ops.py

class MathAnswerGenerator(MathQA):
    """Use an LLM to generate step-by-step reasoning and an answer for a math question, with the final result wrapped in \\boxed{ANSWER}. Skips items that already have an answer unless the regenerate flag is set.

Args:
    input_key (str): key of the question, default 'question'
    output_key (str): key to write the answer, default 'answer'
    regenerate_key (str): key for force-regenerate flag, default 'regenerate'
    model: optional; None uses default Qwen model
    user_prompt (str|None): optional user prompt
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import MathQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = MathQA.MathAnswerGenerator(input_key='question', output_key='answer', model=llm)
    data = [{'question': 'Solve 10 * 10'}]
    res = op(data) 
    print(res)
    # [{'question': 'Solve 10 * 10', 'answer': '首先，我们需要计算 \\(10 \\times 10\\)。这是一个简单的乘法运算，其中两个乘数都是10。
    #
    # 步骤1：写下乘数10和另一个乘数10。
    # 步骤2：将两个10相乘。
    #
    # 计算过程如下：
    # \\[ 10 \\times 10 = 100 \\]
    #
    # 因此，最终结果是 \\(\\boxed{100}\\)。', 'regenerate': False}]
    ```
    """
    def __init__(self,
                 input_key='question',
                 output_key='answer',
                 regenerate_key='regenerate',
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.regenerate_key = regenerate_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求：
        {{
            "{self.output_key}": "推理结果boxed"
        }}
        '''

        # Check for None first: calling share() on None would raise AttributeError
        self.model = model.share() if model is not None else TrainableModule(DEFAULT_MODEL)

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):

        answer = data.get(self.output_key)
        regenerate = data.get(self.regenerate_key, False)

        if answer is not None and regenerate is False:
            return None

        question = data.get(self.input_key)

        base_prompt = f'''
        问题：
        {question}

        规则：
        - 输出详细的过程
        - 最终结果使用 \\boxed{{ANSWER}} 包裹
        '''

        if self.user_prompt is None:
            prompt = '请为这个数学问题生成推理结果。\n' + base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'问题：{question}'

        res = self.model(prompt)

        data[self.output_key] = res.get(self.output_key)
        data[self.regenerate_key] = False

        return data

`QualityEvaluator`

Bases: MathQA

Use an LLM to score question-answer quality: 0 = regenerate, 1 = acceptable. Skips if output_key already present.

Parameters:

question_key (str, default: 'question' ) –

key of the question, default 'question'
answer_key (str, default: 'answer' ) –

key of the answer, default 'answer'
output_key (str, default: 'score' ) –

key to write score, default 'score'
model –

optional; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = MathQA.QualityEvaluator(question_key='question', answer_key='answer', output_key='score', model=llm)
data = {'question': '今天天气如何', 'answer': '大家好~'}
res = op(data)  # low-quality pairs get a score of 0
print(res)
# [{'question': '今天天气如何', 'answer': '大家好~', 'score': 0}]

Source code in lazyllm/tools/data/operators/math_ops.py

class QualityEvaluator(MathQA):
    """Use an LLM to score question-answer quality: 0 = regenerate, 1 = acceptable. Skips if output_key already present.

Args:
    question_key (str): key of the question, default 'question'
    answer_key (str): key of the answer, default 'answer'
    output_key (str): key to write score, default 'score'
    model: optional; None uses default Qwen model
    user_prompt (str|None): optional user prompt
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import MathQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = MathQA.QualityEvaluator(question_key='question', answer_key='answer', output_key='score', model=llm)
    data = {'question': '今天天气如何', 'answer': '大家好~'}
    res = op(data)  # low-quality pairs get a score of 0
    print(res)
    # [{'question': '今天天气如何', 'answer': '大家好~', 'score': 0}]
    ```
    """
    def __init__(self,
                 question_key='question',
                 answer_key='answer',
                 output_key='score',
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.question_key = question_key
        self.answer_key = answer_key
        self.output_key = output_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求：
        {{
            "{self.output_key}": 0
        }}
        '''

        # Check for None first: calling share() on None would raise AttributeError
        self.model = model.share() if model is not None else TrainableModule(DEFAULT_MODEL)

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):

        if data.get(self.output_key) is not None:
            return None

        question = data.get(self.question_key)
        answer = data.get(self.answer_key)

        base_prompt = f'''
        问题：
        {question}

        答案：
        {answer}

        规则：
        - 输出 0 表示需要重新生成
        - 输出 1 表示质量合格
        '''

        if self.user_prompt is None:
            prompt = '请检查问题和答案的质量。\n' + base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'问题：{question}; 答案: {answer}'

        res = self.model(prompt)

        data[self.output_key] = res.get(self.output_key)
        return data

`QuestionFusionGenerator`

Bases: MathQA

Use an LLM to fuse multiple questions into a single, more complex question and generate its reasoning with a \boxed{} answer. Requires at least 2 questions under list_key; otherwise the item is returned unchanged and a warning is logged.

Parameters:

input_key (str, default: 'question' ) –

key for fused question, default 'question'
output_key (str, default: 'answer' ) –

key to write answer, default 'answer'
list_key (str, default: 'question_list' ) –

key of the question list, default 'question_list'
model –

optional; None uses default Qwen model
user_prompt (str | None, default: None ) –

optional user prompt
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = MathQA.QuestionFusionGenerator(input_key='new_question', list_key='question_list', output_key='new_answer', model=llm)
data = {'question_list': [
    {'question': '1加1等于几？', 'answer': '1+1 = 2'}, 
    {'question': '2的平方等于几？', 'answer': '2*2 = 4'}]}
res = op(data) 
print(res)
# [{'question_list': [{'question': '1加1等于几？', 'answer': '1+1 = 2'}, {'question': '2的平方等于几？', 'answer': '2*2 = 4'}], 
# 'new_question': '如果1加1的结果与2的平方相比较，哪个更大？', 
# 'new_answer': '首先，我们解决第一个问题：1加1等于几？计算得到 1+1 = 2。然后，解决第二个问题：2的平方等于几？计算得到 2*2 = 4。最后，我们比较这两个结果，2和4。显然，4大于2。所以，2的平方更大。'}]

Source code in lazyllm/tools/data/operators/math_ops.py

class QuestionFusionGenerator(MathQA):
    """Use an LLM to fuse multiple questions into a single, more complex question and generate its reasoning with a \\boxed{} answer. Requires at least 2 questions under list_key; otherwise the item is returned unchanged and a warning is logged.

Args:
    input_key (str): key for fused question, default 'question'
    output_key (str): key to write answer, default 'answer'
    list_key (str): key of the question list, default 'question_list'
    model: optional; None uses default Qwen model
    user_prompt (str|None): optional user prompt
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import MathQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = MathQA.QuestionFusionGenerator(input_key='new_question', list_key='question_list', output_key='new_answer', model=llm)
    data = {'question_list': [
        {'question': '1加1等于几？', 'answer': '1+1 = 2'}, 
        {'question': '2的平方等于几？', 'answer': '2*2 = 4'}]}
    res = op(data) 
    print(res)
    # [{'question_list': [{'question': '1加1等于几？', 'answer': '1+1 = 2'}, {'question': '2的平方等于几？', 'answer': '2*2 = 4'}], 
    # 'new_question': '如果1加1的结果与2的平方相比较，哪个更大？', 
    # 'new_answer': '首先，我们解决第一个问题：1加1等于几？计算得到 1+1 = 2。然后，解决第二个问题：2的平方等于几？计算得到 2*2 = 4。最后，我们比较这两个结果，2和4。显然，4大于2。所以，2的平方更大。'}]
    ```
    """
    def __init__(self,
                 input_key='question',
                 output_key='answer',
                 model=None,
                 user_prompt=None,
                 list_key='question_list',
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.user_prompt = user_prompt
        self.list_key = list_key

        output_structure = f'''
        输出格式要求：
        {{
            "{self.input_key}": "融合后的问题",
            "{self.output_key}": "推理结果"
        }}
        '''

        # Check for None first: calling share() on None would raise AttributeError
        self.model = model.share() if model is not None else TrainableModule(DEFAULT_MODEL)

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):
        questions = data.get(self.list_key, [])
        if len(questions) <= 1:
            LOG.warning(f'QuestionFusionGenerator requires more than one question, but got {len(questions)}. Skipping.')
            return data
        base_prompt = f'''
        问题列表：
        {questions}

        规则：
        - 融合列表中的问题，生成一个更复杂的新问题
        - 输出详细的过程
        '''

        if self.user_prompt is None:
            prompt = base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'融合列表中的问题，生成一个更复杂的新问题：{questions}'

        res = self.model(prompt)
        data[self.input_key] = res.get(self.input_key)
        data[self.output_key] = res.get(self.output_key)

        return data

`ReasoningAnswerTokenLengthFilter`

Bases: MathQA

Filter by answer length (tokens or chars): if over max_answer_token_length, clear the field and return modified data; if within limit return None to keep; if empty return []. Supports tokenizer or char count.

Parameters:

input_key (str, default: 'answer' ) –

key of the answer, default 'answer'
max_answer_token_length (int, default: 300 ) –

max allowed length, default 300
tokenize (bool, default: True ) –

whether to count by tokens; uses default Qwen tokenizer if True and tokenizer not provided
tokenizer –

optional
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import MathQA

op = MathQA.ReasoningAnswerTokenLengthFilter(input_key='answer', max_answer_token_length=100, tokenize=False)
data = [{'answer': 'short'}]
print(op(data))  # less than the max_length, keep the original input
# [{'answer': 'short'}]
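
How the length is measured can be sketched as below. This is a simplified illustration under stated assumptions: `answer_length` and `WhitespaceTokenizer` are hypothetical names, the character-count branch is assumed from the operator's description, and the stand-in tokenizer only imitates the `encode(text, add_special_tokens=False)` call used by the operator.

```python
def answer_length(text, max_len, tokenizer=None):
    """Hypothetical helper: the length compared against the limit.

    Empty or missing text is reported as max_len + 1 so that it never
    passes the limit; otherwise tokens are counted when a tokenizer is
    given, and characters when it is not.
    """
    if text is None or (isinstance(text, str) and text.strip() == ''):
        return max_len + 1
    if tokenizer is not None:
        return len(tokenizer.encode(text, add_special_tokens=False))
    return len(text)


class WhitespaceTokenizer:
    """Stand-in tokenizer for the example; splits on whitespace."""

    def encode(self, text, add_special_tokens=False):
        return text.split()


print(answer_length('short', 100))                             # 5
print(answer_length('   ', 100))                               # 101
print(answer_length('two words', 100, WhitespaceTokenizer()))  # 2
```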

Source code in lazyllm/tools/data/operators/math_ops.py

class ReasoningAnswerTokenLengthFilter(MathQA):
    """Filter by answer length (tokens or chars): if over max_answer_token_length, clear the field and return modified data; if within limit return None to keep; if empty return []. Supports tokenizer or char count.

Args:
    input_key (str): key of the answer, default 'answer'
    max_answer_token_length (int): max allowed length, default 300
    tokenize (bool): whether to count by tokens; uses default Qwen tokenizer if True and tokenizer not provided
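    tokenizer: optional tokenizer instance; if omitted while tokenize is True, the default tokenizer is loaded, with a fallback to character counting if loading fails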
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import MathQA

    op = MathQA.ReasoningAnswerTokenLengthFilter(input_key='answer', max_answer_token_length=100, tokenize=False)
    data = [{'answer': 'short'}]
    print(op(data))  # within the limit, so the original input is kept
    # [{'answer': 'short'}]
    ```
    """
    def __init__(self,
                 input_key='answer',
                 max_answer_token_length=300,
                 tokenize=True,
                 tokenizer=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.max_answer_token_length = max_answer_token_length
        self.tokenizer = tokenizer

        if tokenize and tokenizer is None:
            LOG.warning(
                f'tokenize=True but tokenizer is None, '
                f'loading tokenizer from default model: {DEFAULT_TOKENIZER}'
            )
            try:
                self.tokenizer = transformers.AutoTokenizer.from_pretrained(
                    DEFAULT_TOKENIZER,
                    trust_remote_code=True
                )
                self.tokenize = True
            except Exception as e:
                LOG.warning(
                    f'failed to load tokenizer from {DEFAULT_TOKENIZER}, '
                    f'falling back to char count, error: {e}'
                )
                self.tokenize = False
                self.tokenizer = None
        else:
            self.tokenizer = tokenizer
            self.tokenize = tokenize

        self.empty_count = 0

    def _get_len(self, text: str):
        if text is None or (isinstance(text, str) and text.strip() == ''):
            self.empty_count += 1
            return self.max_answer_token_length + 1

        try:
            if self.tokenize:
                return len(
                    self.tokenizer.encode(
                        text,
                        add_special_tokens=False
                    )
                )
            return len(text)

        except Exception as e:
            LOG.warning(f'token encode failed: {e}')
            self.empty_count += 1
            return self.max_answer_token_length + 1

    def forward(self, data: dict):
        text = data.get(self.input_key, '')
        if not text:
            self.empty_count += 1
            return []

        token_len = self._get_len(text)

        if token_len <= self.max_answer_token_length:
            return None

        # clear the over-length answer
        data[self.input_key] = ''
        return data

`DifficultyEvaluatorBatch(data, input_key='difficulty')`

Batch operator: counts how many times each value of the specified key (e.g. difficulty) occurs in the input list and returns a single-element list [{value: count}]. Registered as forward_batch_input.

Parameters:

data (list[dict]) –

list of input dicts
input_key (str, default: 'difficulty' ) –

key to aggregate, default 'difficulty'

Examples:

from lazyllm.tools.data import MathQA

op = MathQA.DifficultyEvaluatorBatch(input_key='difficulty')
data = [{'difficulty': 'Easy'}, {'difficulty': 'Hard'}, {'difficulty': 'Easy'}]
print(op(data))  
# [{'Easy': 2, 'Hard': 1}]
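
As a rough standalone equivalent of the counting step, the sketch below uses `collections.Counter`. The helper name `count_by_key` is made up for this illustration, and the registration machinery of `data_register` is deliberately left out.

```python
from collections import Counter


def count_by_key(rows, input_key='difficulty'):
    """Count how often each value of input_key occurs; returns a single-element list."""
    counts = Counter(row.get(input_key) for row in rows)
    return [dict(counts)]


rows = [{'difficulty': 'Easy'}, {'difficulty': 'Hard'}, {'difficulty': 'Easy'}, {}]
summary = count_by_key(rows)
print(summary)  # [{'Easy': 2, 'Hard': 1, None: 1}]
```

Rows that lack the key are counted under `None`, which matches the `entry.get(input_key)` lookup in the source shown below.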

Source code in lazyllm/tools/data/operators/math_ops.py

@data_register(
    'data.mathQA',
    rewrite_func='forward_batch_input'
)
def DifficultyEvaluatorBatch(data, input_key='difficulty'):
    """Batch operator: counts how many times each value of the specified key (e.g. difficulty) occurs in the input list and returns a single-element list [{value: count}]. Registered as forward_batch_input.

Args:
    data (list[dict]): list of input dicts
    input_key (str): key to aggregate, default 'difficulty'


Examples:
    ```python
    from lazyllm.tools.data import MathQA

    op = MathQA.DifficultyEvaluatorBatch(input_key='difficulty')
    data = [{'difficulty': 'Easy'}, {'difficulty': 'Hard'}, {'difficulty': 'Easy'}]
    print(op(data))  
    # [{'Easy': 2, 'Hard': 1}]
    ```
    """
    result = {}
    for entry in data:
        key = entry.get(input_key)
        if key in result:
            result[key] += 1
        else:
            result[key] = 1
    return [result]

`math_answer_extractor(data, input_key='answer', output_key='math_answer')`

Extract the math answer inside \boxed{} from text and write it to the specified output key. Registered as single-item forward.

Parameters:

data (dict) –

single data dict
input_key (str, default: 'answer' ) –

key of the text containing the answer, default 'answer'
output_key (str, default: 'math_answer' ) –

key to write the extracted value, default 'math_answer'

Examples:

from lazyllm.tools.data import MathQA

data = {'answer': r'So the answer is \boxed{42}.'}
op = MathQA.math_answer_extractor(input_key='answer', output_key='math_answer')
print(op(data))  # data['math_answer'] == '42'
# [{'answer': 'So the answer is \\boxed{42}.', 'math_answer': '42'}]
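
The source of `boxed_extractor` is not shown on this page, so the following is only a sketch of one way such an extractor could work: find the last `\boxed{` and walk forward while tracking brace depth. The name `extract_boxed` and the choice of the last occurrence are assumptions made for this illustration.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None if absent or unbalanced."""
    marker = '\\boxed{'
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return ''.join(chars)  # matching close brace found
        chars.append(ch)
    return None  # braces never closed


simple = extract_boxed(r'So the answer is \boxed{42}.')
nested = extract_boxed(r'Thus \boxed{\frac{1}{2}} holds.')
missing = extract_boxed('no box here')
print(simple, nested, missing)
```

Raw strings (`r'...'`) keep `\b` from being read as a backspace escape, which matters for any input that contains `\boxed`.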

Source code in lazyllm/tools/data/operators/math_ops.py

@data_register('data.mathQA', rewrite_func='forward')
def math_answer_extractor(data, input_key='answer', output_key='math_answer'):
    """Extract the math answer inside \\boxed{} from text and write it to the specified output key. Registered as single-item forward.

Args:
    data (dict): single data dict
    input_key (str): key of the text containing the answer, default 'answer'
    output_key (str): key to write the extracted value, default 'math_answer'


Examples:
    ```python
    from lazyllm.tools.data import MathQA

    data = {'answer': r'So the answer is \\boxed{42}.'}
    op = MathQA.math_answer_extractor(input_key='answer', output_key='math_answer')
    print(op(data))  # data['math_answer'] == '42'
    # [{'answer': 'So the answer is \\\\boxed{42}.', 'math_answer': '42'}]
    ```
    """
    assert isinstance(data, dict)
    answer = data[input_key]
    math_answer = boxed_extractor(answer)
    data[output_key] = math_answer
    return data

PDF QA Operators

`lazyllm.tools.data.operators.pdf_ops`

`Pdf2Md`

Bases: Pdf2Qa

Convert PDF to a list of Markdown documents. Uses MineruPDFReader (reader_url required). Supports cache.

Parameters:

input_key (str, default: 'pdf_path' ) –

key of the PDF path, default 'pdf_path'
output_key (str, default: 'docs' ) –

key to write the document list, default 'docs'
reader_url –

required, Mineru reader service URL
backend (str, default: 'vlm-vllm-async-engine' ) –

backend type, default 'vlm-vllm-async-engine'
upload_mode (bool, default: True ) –

whether to use upload mode, default True
use_cache (bool, default: False ) –

whether to use cache, default False
**kwargs –

other base-class args

Examples:

from lazyllm.tools.data import Pdf2Qa
from lazyllm.tools.data.operators.pdf_ops import Pdf2Md

op = Pdf2Qa.Pdf2Md(input_key='pdf_path', output_key='docs', reader_url='http://...')
data = [{'pdf_path': '/path/to/file.pdf'}]
res = op(data)  # each item gets 'docs' (list of doc content)

Source code in lazyllm/tools/data/operators/pdf_ops.py

class Pdf2Md(Pdf2Qa):
    """Convert PDF to a list of Markdown documents. Uses MineruPDFReader (reader_url required). Supports cache.

Args:
    input_key (str): key of the PDF path, default 'pdf_path'
    output_key (str): key to write the document list, default 'docs'
    reader_url: required, Mineru reader service URL
    backend (str): backend type, default 'vlm-vllm-async-engine'
    upload_mode (bool): whether to use upload mode, default True
    use_cache (bool): whether to use cache, default False
    **kwargs: other base-class args


Examples:
    ```python
    from lazyllm.tools.data import Pdf2Qa
    from lazyllm.tools.data.operators.pdf_ops import Pdf2Md

    op = Pdf2Qa.Pdf2Md(input_key='pdf_path', output_key='docs', reader_url='http://...')
    data = [{'pdf_path': '/path/to/file.pdf'}]
    res = op(data)  # each item gets 'docs' (list of doc content)
    ```"""
    def __init__(self,
                 input_key='pdf_path',
                 output_key='docs',
                 reader_url=None,
                 backend='vlm-vllm-async-engine',
                 upload_mode=True,
                 use_cache=False,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)
        if not reader_url:
            raise ValueError('You must pass in a reader_url.')

        self.input_key = input_key
        self.output_key = output_key
        self.use_cache = use_cache

        self.reader = MineruPDFReader(
            url=reader_url,
            backend=backend,
            upload_mode=upload_mode
        )

    def forward(self, data):
        pdf_path = data.get(self.input_key)
        if not pdf_path:
            return None

        try:
            docs = self.reader(
                file=pdf_path,
                use_cache=self.use_cache
            )
            data[self.output_key] = docs

        except Exception as e:
            LOG.warning(f'PDF read failed: {e}')
            data[self.output_key] = None
        return data

Agentic RAG

`lazyllm.tools.data.operators.agentic_rag.agenticrag_atomic_task_generator`

`AgenticRAGCleanQA`

Bases: agenticrag

Cleans and refines a generated QA pair by calling the LLM to produce a refined_answer.

Parameters:

llm –

language model service instance
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGCleanQA(llm=my_llm)
result = op({'question': 'What is...', 'answer': 'Raw answer'})
print(result['refined_answer'])

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGCleanQA(agenticrag):
    """Cleans and refines a generated QA pair by calling the LLM to produce a refined_answer.

Args:
    llm: language model service instance
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGCleanQA(llm=my_llm)
    result = op({'question': 'What is...', 'answer': 'Raw answer'})
    print(result['refined_answer'])
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = RAGQARefinementPrompt()
        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')
        question = data.get('question', '')
        answer = data.get('answer', '')

        user_prompt = self.prompt_template.build_prompt(
            {'question': question, 'original_answer': answer}
        )

        try:
            result = self._llm_serve(user_prompt)
            if isinstance(result, dict):
                data['refined_answer'] = str(result.get('refined_answer', ''))
            else:
                data['refined_answer'] = ''
        except Exception as e:
            LOG.warning(f'Failed to clean QA: {e}')
            data['refined_answer'] = ''

        return data

`AgenticRAGExpandConclusions`

Bases: agenticrag

Parses the JSON conclusion list in raw_conclusion and expands it into multiple candidate task records.

Only items containing 'conclusion' and 'R' are kept. Each valid item produces a new data row with candidate_tasks_str.

Parameters:

max_per_task (int, default: 10 ) –

maximum number of candidate tasks per sample
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGExpandConclusions(max_per_task=5)
rows = op({
    'raw_conclusion': '[{"conclusion":"A","R":"rel"}]',
    'identifier': 'doc1'
})
print(rows)
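
To make the expansion rule concrete, here is a small standalone sketch of the parse-and-filter step. It skips the library's `_extract_json_content` preprocessing and assumes `raw_conclusion` already holds clean JSON; `expand_conclusions` is a hypothetical name used only in this example.

```python
import json


def expand_conclusions(row: dict, max_per_task: int = 10) -> list:
    """Expand a JSON list stored in row['raw_conclusion'] into one row per valid item."""
    try:
        parsed = json.loads(row.get('raw_conclusion', ''))
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(parsed, list):
        return []
    expanded = []
    for item in parsed[:max_per_task]:
        # keep only dict items that carry both required fields
        if isinstance(item, dict) and 'conclusion' in item and 'R' in item:
            new_row = dict(row)
            new_row['candidate_tasks_str'] = json.dumps(item, ensure_ascii=False)
            expanded.append(new_row)
    return expanded


rows = expand_conclusions({
    'raw_conclusion': '[{"conclusion": "A", "R": "rel"}, {"conclusion": "B"}]',
    'identifier': 'doc1',
})
print(len(rows), rows[0]['candidate_tasks_str'])  # 1 {"conclusion": "A", "R": "rel"}
```

The second item is discarded because it has no 'R' field, so one input row yields one output row here.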

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGExpandConclusions(agenticrag):
    """Parses the JSON conclusion list in raw_conclusion
and expands it into multiple candidate task records.

Only items containing 'conclusion' and 'R' are kept.
Each valid item produces a new data row with candidate_tasks_str.

Args:
    max_per_task (int): maximum number of candidate tasks per sample
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGExpandConclusions(max_per_task=5)
    rows = op({
        'raw_conclusion': '[{"conclusion":"A","R":"rel"}]',
        'identifier': 'doc1'
    })
    print(rows)
    ```
    """

    def __init__(self, max_per_task: int = 10, **kwargs):
        super().__init__(**kwargs)
        self.max_per_task = max_per_task

    def forward(self, data: dict) -> List[dict]:
        conclusion_str = data.get('raw_conclusion', '')
        identifier = data.get('identifier', '')

        if not conclusion_str:
            return []

        try:
            parsed = json.loads(_extract_json_content(conclusion_str))
            if isinstance(parsed, list):
                parsed = parsed[:self.max_per_task]
            else:
                return []
        except Exception as e:
            LOG.warning(f'Failed to parse conclusion JSON: {e}')
            return []

        expanded_rows = []
        for item in parsed:
            if isinstance(item, dict) and 'conclusion' in item and 'R' in item:
                new_row = data.copy()
                new_row['candidate_tasks_str'] = json.dumps(item, ensure_ascii=False)
                new_row['identifier'] = str(identifier)
                expanded_rows.append(new_row)

        return expanded_rows

`AgenticRAGGenerateQuestion`

Bases: agenticrag

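Generates a question-answer pair from the task identifier (ID), relationship (R), and conclusion (A). The LLM writes the question, and the conclusion is stored as the answer. If the call fails or the output has no 'Q' field, the sample is dropped (forward returns []).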

Parameters:

llm –

language model service instance
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGenerateQuestion(llm=my_llm)
result = op({
    'candidate_tasks_str': '{"conclusion":"Paris","R":"capital_of"}',
    'identifier': 'France'
})
print(result['question'], result['answer'])
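
The example above needs a running model. For a feel of the control flow without one, the sketch below swaps in a stub: `generate_question` and `stub_llm` are hypothetical stand-ins, the stub takes the three fields directly instead of a built prompt, and the sketch returns a new dict where the operator updates the input in place.

```python
import json
from typing import Callable, Union


def generate_question(row: dict, llm: Callable[[str, str, str], object]) -> Union[dict, list]:
    """Turn one candidate task into a QA row, or return [] to drop it."""
    try:
        task = json.loads(row.get('candidate_tasks_str', ''))
        conclusion = task.get('conclusion', '')
        result = llm(row.get('identifier', ''), conclusion, task.get('R', ''))
    except Exception:
        return []  # bad JSON or a failed model call: drop the sample
    if isinstance(result, dict) and 'Q' in result:
        # the conclusion, not the model output, becomes the answer
        return {**row, 'question': str(result['Q']), 'answer': str(conclusion)}
    return []


def stub_llm(identifier: str, conclusion: str, relation: str) -> dict:
    """Hypothetical model stub that returns the JSON shape the operator expects."""
    return {'Q': f'Which city is the {relation} {identifier}?'}


row = {'candidate_tasks_str': '{"conclusion": "Paris", "R": "capital_of"}', 'identifier': 'France'}
qa = generate_question(row, stub_llm)
dropped = generate_question(row, lambda *args: 'not json')
print(qa['question'], '->', qa['answer'])
print(dropped)
```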

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGGenerateQuestion(agenticrag):
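    """Generates a question-answer pair from the task identifier (ID), relationship (R), and conclusion (A). The LLM writes the question, and the conclusion is stored as the answer. If the call fails or the output has no 'Q' field, the sample is dropped (forward returns []).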

Args:
    llm: language model service instance
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGenerateQuestion(llm=my_llm)
    result = op({
        'candidate_tasks_str': '{"conclusion":"Paris","R":"capital_of"}',
        'identifier': 'France'
    })
    print(result['question'], result['answer'])
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = RAGTaskToQuestionPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict):
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')
        candidate_str = data.get('candidate_tasks_str', '')
        identifier = data.get('identifier', '')
        try:
            task_item = json.loads(_extract_json_content(candidate_str))
            conclusion = task_item.get('conclusion', '')
            relation = task_item.get('R', '')
            user_prompt = self.prompt_template.build_prompt(
                identifier, conclusion, relation
            )

            result = self._llm_serve(user_prompt)
            if isinstance(result, dict) and 'Q' in result:
                data['question'] = str(result['Q'])
                data['answer'] = str(conclusion)
                return data
        except Exception as e:
            LOG.warning(f'Failed to generate question: {e}')

        return []

`AgenticRAGGetConclusion`

Bases: agenticrag

An operator that extracts conclusions and generates relationships using an LLM.

It builds prompts from the input text and stores the raw model output in data['raw_conclusion'] for downstream parsing and task expansion. If generation fails, an empty string is assigned.

Parameters:

llm –

language model service instance
input_key (str, default: 'prompts' ) –

name of the input text field, default 'prompts'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGetConclusion(llm=my_llm)
result = op({'prompts': 'Some document content'})
print(result['raw_conclusion'])

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGGetConclusion(agenticrag):
    """An operator that extracts conclusions and generates relationships using an LLM.

It builds prompts from the input text and stores the raw model output
in data['raw_conclusion'] for downstream parsing and task expansion.
If generation fails, an empty string is assigned.

Args:
    llm: language model service instance
    input_key (str): name of the input text field, default 'prompts'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGetConclusion(llm=my_llm)
    result = op({'prompts': 'Some document content'})
    print(result['raw_conclusion'])
    ```
    """

    def __init__(self, llm=None, input_key: str = 'prompts', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.prompt_template = RAGFactsConclusionPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt)
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')
        content = data.get(self.input_key, '')
        user_prompt = self.prompt_template.build_prompt(content)

        try:
            result = self._llm_serve(user_prompt)
            data['raw_conclusion'] = result
        except Exception as e:
            LOG.warning(f'Failed to extract conclusion: {e}')
            data['raw_conclusion'] = ''

        return data

`AgenticRAGGetIdentifier`

Bases: agenticrag

An operator that extracts a content identifier from the input text using an LLM.

Parameters:

llm –

language model service instance
input_key (str, default: 'prompts' ) –

name of the input text field, default 'prompts'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGetIdentifier(llm=my_llm, input_key='prompts')
result = op({'prompts': 'What is the third movie in the Avatar series?'})
print('identifier:', result['identifier'])
# e.g. identifier: Avatar series

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGGetIdentifier(agenticrag):
    """An operator that extracts a content identifier from the input text using an LLM.


Args:
    llm: language model service instance
    input_key (str): name of the input text field, default 'prompts'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGetIdentifier(llm=my_llm, input_key='prompts')
    result = op({'prompts': 'What is the third movie in the Avatar series?'})
    print('identifier:', result['identifier'])
    # e.g. identifier: Avatar series
    ```
    """

    def __init__(self, llm=None, input_key: str = 'prompts', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.prompt_template = RAGContentIdExtractorPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        content = data.get(self.input_key, '')
        user_prompt = self.prompt_template.build_prompt(content)

        try:
            result = self._llm_serve(user_prompt)
            if isinstance(result, dict):
                data['identifier'] = result.get('content_identifier', '')
            else:
                data['identifier'] = ''
        except Exception as e:
            LOG.warning(f'Failed to extract identifier: {e}')
            data['identifier'] = ''

        return data

`AgenticRAGGoldenDocAnswer`

Bases: agenticrag

Generates answers from a golden document and verifies via recall scoring.

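It generates an answer from the golden document (read from input_key) and the question, then scores that answer against refined_answer. Samples whose answer_score is below 1, or for which generation or scoring fails, are filtered out (forward returns []).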

Parameters:

llm –

language model service instance
input_key (str, default: 'prompts' ) –

golden document field name
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGoldenDocAnswer(llm=my_llm)
result = op({
    'prompts': 'Golden document text',
    'question': 'Q?',
    'refined_answer': 'Expected A'
})
print(result)

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGGoldenDocAnswer(agenticrag):
    """Generates answers from a golden document and verifies via recall scoring.

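It generates an answer from the golden document (read from input_key) and the question,
then scores that answer against refined_answer.
Samples whose answer_score is below 1, or for which generation or scoring fails,
are filtered out (forward returns []).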

Args:
    llm: language model service instance
    input_key (str): golden document field name
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGoldenDocAnswer(llm=my_llm)
    result = op({
        'prompts': 'Golden document text',
        'question': 'Q?',
        'refined_answer': 'Expected A'
    })
    print(result)
    ```
    """

    def __init__(self, llm=None, input_key: str = 'prompts', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.prompt_template = RAGDocGroundedAnswerPrompt()
        self.score_template = RAGConsistencyScoringPrompt()
        if llm is not None:
            self._llm_answer_serve = llm.share()
            self._llm_answer_serve.start()
            score_system_prompt = self.score_template.build_system_prompt()
            self._llm_score_serve = llm.share().prompt(score_system_prompt).formatter(JsonFormatter())
            self._llm_score_serve.start()
        else:
            self._llm_answer_serve = None
            self._llm_score_serve = None

    def forward(self, data: dict):
        if self._llm_answer_serve is None or self._llm_score_serve is None:
            raise ValueError('LLM is not configured')
        golden_doc = data.get(self.input_key, '')
        question = data.get('question', '')
        refined_answer = data.get('refined_answer', '')

        user_prompt = self.prompt_template.build_prompt(
            golden_doc, question
        )
        try:
            golden_doc_answer = self._llm_answer_serve(user_prompt)
            data['golden_doc_answer'] = golden_doc_answer
        except Exception as e:
            LOG.warning(f'Failed to get golden doc answer: {e}')
            return []

        score_prompt = self.score_template.build_prompt(
            refined_answer, golden_doc_answer
        )

        try:
            score_result = self._llm_score_serve(score_prompt)
            if isinstance(score_result, dict):
                score = score_result.get('answer_score', 0)
                data['golden_doc_score'] = score

                if score < 1:
                    return []
            else:
                return []
        except Exception as e:
            LOG.warning(f'Failed to calculate golden doc score: {e}')
            return []

        return data

`AgenticRAGGroupAndLimit`

Bases: agenticrag

Groups data by a specified key and limits the number of QA pairs per group.

It groups batch input by input_key and retains up to max_question items per group to control sample distribution.

Parameters:

input_key (str, default: 'prompts' ) –

grouping field name
max_question (int, default: 10 ) –

maximum QA pairs per group

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGroupAndLimit(input_key='prompts', max_question=2)
result = op([
    {'prompts': 'doc1', 'question': 'Q1'},
    {'prompts': 'doc1', 'question': 'Q2'},
    {'prompts': 'doc1', 'question': 'Q3'}
])
print(result)  # only 2 kept for doc1
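
The grouping logic can be tried without lazyllm. The sketch below is a standalone approximation of the batch step shown in the source that follows; the function name `group_and_limit` is an assumption for this example.

```python
def group_and_limit(rows, input_key='prompts', max_question=10):
    """Keep at most max_question rows per distinct value of input_key."""
    grouped = {}
    for row in rows:
        bucket = grouped.setdefault(row.get(input_key, ''), [])
        if len(bucket) < max_question:
            bucket.append(row)
    # flatten group by group, in the order each key was first seen
    return [row for bucket in grouped.values() for row in bucket]


rows = [
    {'prompts': 'doc1', 'question': 'Q1'},
    {'prompts': 'doc2', 'question': 'Q2'},
    {'prompts': 'doc1', 'question': 'Q3'},
    {'prompts': 'doc1', 'question': 'Q4'},
]
limited = group_and_limit(rows, max_question=2)
print([r['question'] for r in limited])  # ['Q1', 'Q3', 'Q2']
```

The result is ordered group by group rather than in the original input order, and the earliest rows of each group are the ones retained.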

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGGroupAndLimit(agenticrag):
    """Groups data by a specified key and limits the number of QA pairs per group.

It groups batch input by input_key and retains up to max_question
items per group to control sample distribution.

Args:
    input_key (str): grouping field name
    max_question (int): maximum QA pairs per group


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGroupAndLimit(input_key='prompts', max_question=2)
    result = op([
        {'prompts': 'doc1', 'question': 'Q1'},
        {'prompts': 'doc1', 'question': 'Q2'},
        {'prompts': 'doc1', 'question': 'Q3'}
    ])
    print(result)  # only 2 kept for doc1
    ```
    """

    def __init__(
        self,
        input_key: str = 'prompts',
        max_question: int = 10,
        **kwargs,
    ):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.input_key = input_key
        self.max_question = max_question

    def forward_batch_input(self, data: List[dict]) -> List[dict]:
        grouped_data = {}

        for item in data:
            key_value = item.get(self.input_key, '')
            grouped_data.setdefault(key_value, [])

            if len(grouped_data[key_value]) < self.max_question:
                grouped_data[key_value].append(item)

        result_list = []
        for items in grouped_data.values():
            result_list.extend(items)

        LOG.info(f'Grouped and limited to {len(result_list)} QA pairs')
        return result_list

`AgenticRAGLLMVerify`

Bases: agenticrag

Verifies QA quality via LLM answering and recall scoring.

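The model first answers the question to produce llm_answer, then scores refined_answer against llm_answer. If the score is >= 1, or the answering call fails, the sample is filtered out (forward returns []); otherwise it is retained with llm_answer and llm_score set. If scoring fails or does not return a dict, llm_score is set to 0 and the sample is retained.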

Parameters:

llm –

language model service instance
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGLLMVerify(llm=my_llm)
result = op({'question': 'Q?', 'refined_answer': 'A'})
print(result)
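
AgenticRAGGoldenDocAnswer and AgenticRAGLLMVerify apply opposite thresholds to the same kind of score, which is easy to misread. The sketch below contrasts the two rules on hand-written scores. The sample data and helper names are invented for illustration, and the inline comments describing why a sample is dropped are one interpretation, not something the source states.

```python
def golden_doc_keep(score: int) -> bool:
    """GoldenDocAnswer-style rule: keep only if the document-grounded answer matches."""
    return score >= 1


def llm_verify_keep(score: int) -> bool:
    """LLMVerify-style rule: keep only if the model's own answer does NOT match."""
    return score < 1


samples = [
    {'question': 'Q1', 'golden_doc_score': 1, 'llm_score': 0},  # needs the document
    {'question': 'Q2', 'golden_doc_score': 1, 'llm_score': 1},  # model answers it unaided
    {'question': 'Q3', 'golden_doc_score': 0, 'llm_score': 0},  # document does not support it
]
kept = [
    s['question']
    for s in samples
    if golden_doc_keep(s['golden_doc_score']) and llm_verify_keep(s['llm_score'])
]
print(kept)  # ['Q1']
```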

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGLLMVerify(agenticrag):
    """Verifies QA quality via LLM answering and recall scoring.

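The model first answers the question to produce llm_answer,
then scores refined_answer against llm_answer.
If the score is >= 1, or the answering call fails, the sample is filtered out (forward returns []);
otherwise it is retained with llm_answer and llm_score set.
If scoring fails or does not return a dict, llm_score is set to 0 and the sample is retained.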

Args:
    llm: language model service instance
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGLLMVerify(llm=my_llm)
    result = op({'question': 'Q?', 'refined_answer': 'A'})
    print(result)
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = RAGTaskSolverPrompt()
        self.score_template = RAGConsistencyScoringPrompt()
        if llm is not None:
            self._llm_answer_serve = llm.share()
            self._llm_answer_serve.start()
            score_system_prompt = self.score_template.build_system_prompt()
            self._llm_score_serve = llm.share().prompt(score_system_prompt).formatter(JsonFormatter())
            self._llm_score_serve.start()
        else:
            self._llm_answer_serve = None
            self._llm_score_serve = None

    def forward(self, data: dict):
        if self._llm_answer_serve is None or self._llm_score_serve is None:
            raise ValueError('LLM is not configured')
        question = data.get('question', '')
        refined_answer = data.get('refined_answer', '')

        user_prompt = self.prompt_template.build_prompt(question)
        try:
            llm_answer = self._llm_answer_serve(user_prompt)
            data['llm_answer'] = llm_answer
        except Exception as e:
            LOG.warning(f'Failed to get LLM answer: {e}')
            return []

        score_prompt = self.score_template.build_prompt(
            refined_answer, llm_answer
        )

        try:
            score_result = self._llm_score_serve(score_prompt)
            if isinstance(score_result, dict):
                score = score_result.get('answer_score', 0)
                data['llm_score'] = score

                if score >= 1:
                    return []
            else:
                data['llm_score'] = 0
        except Exception as e:
            LOG.warning(f'Failed to calculate recall score: {e}')
            data['llm_score'] = 0

        return data

`AgenticRAGOptionalAnswers`

Bases: agenticrag

Generates multiple optional answers for a refined answer.

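It calls the LLM to produce semantically equivalent or similar variants and stores them in optional_answer. If the call fails or does not return a list, optional_answer falls back to [refined_answer].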

Parameters:

llm –

language model service instance

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGOptionalAnswers(llm=my_llm)
result = op({'refined_answer': 'Paris'})
print(result['optional_answer'])
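
The fallback behaviour can be illustrated with stub models in place of a real LLM service. Everything named below (`optional_answers`, `good_llm`, `bad_llm`, `failing_llm`) is hypothetical and exists only for this sketch.

```python
def optional_answers(refined_answer, llm):
    """Return the model's list of variants, or [refined_answer] if that is not possible."""
    try:
        result = llm(refined_answer)
    except Exception:
        return [refined_answer]  # call failed: fall back to the original answer
    return result if isinstance(result, list) else [refined_answer]


def good_llm(answer):
    return [answer, 'Paris, France']


def bad_llm(answer):
    return 'not a list'


def failing_llm(answer):
    raise RuntimeError('service unavailable')


from_good = optional_answers('Paris', good_llm)
from_bad = optional_answers('Paris', bad_llm)
from_failing = optional_answers('Paris', failing_llm)
print(from_good, from_bad, from_failing)
```

With this pattern the field is always a non-empty list, so downstream steps need no type check.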

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py

class AgenticRAGOptionalAnswers(agenticrag):
    """Generates multiple optional answers for a refined answer.

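It calls the LLM to produce semantically equivalent or similar variants
and stores them in optional_answer.
If the call fails or does not return a list, optional_answer falls back to [refined_answer].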

Args:
    llm: language model service instance


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGOptionalAnswers(llm=my_llm)
    result = op({'refined_answer': 'Paris'})
    print(result['optional_answer'])
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = RAGAnswerVariantsPrompt()
        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')
        refined_answer = data.get('refined_answer', '')

        user_prompt = self.prompt_template.build_prompt(refined_answer)

        try:
            result = self._llm_serve(user_prompt)
            if isinstance(result, list):
                data['optional_answer'] = result
            else:
                data['optional_answer'] = [refined_answer]
        except Exception as e:
            LOG.warning(f'Failed to generate optional answers: {e}')
            data['optional_answer'] = [refined_answer]

        return data
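
The fallback behaviour of `forward` can be exercised without a real model. The sketch below is a standalone re-implementation of that branch logic, not the library class; `optional_answers` and the three stub callables are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, Dict


def optional_answers(data: Dict[str, Any], llm: Callable[[str], Any]) -> Dict[str, Any]:
    """Mirror the branch logic of forward(): keep a list result, else fall back."""
    refined_answer = data.get('refined_answer', '')
    try:
        result = llm(refined_answer)
        # Only a list is accepted as a set of answer variants.
        data['optional_answer'] = result if isinstance(result, list) else [refined_answer]
    except Exception:
        # Any failure degrades to the single original answer.
        data['optional_answer'] = [refined_answer]
    return data


def good_llm(prompt: str) -> list:
    # Hypothetical stub that returns two variants.
    return [prompt, f'City of {prompt}']


def bad_llm(prompt: str) -> str:
    # Hypothetical stub that returns something other than a list.
    return 'not a list'


def failing_llm(prompt: str):
    # Hypothetical stub that simulates a service error.
    raise RuntimeError('service unavailable')


for stub in (good_llm, bad_llm, failing_llm):
    print(stub.__name__, optional_answers({'refined_answer': 'Paris'}, stub)['optional_answer'])
```

In every case `optional_answer` ends up as a list, so downstream code can iterate over it without checking its type.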

`lazyllm.tools.data.operators.agentic_rag.agenticrag_depth_qa_generator`

`DepthQAGBackwardTask`

Bases: agenticrag

Generates a backward task from the existing identifier, producing a new identifier and relation.

This operator infers backwards from the given identifier to generate a new identifier and corresponding relation for building depth QA tasks.

Parameters:

llm –

language model service instance
identifier_key (str, default: 'identifier' ) –

original identifier field name, default 'identifier'
new_identifier_key (str, default: 'new_identifier' ) –

new identifier field name, default 'new_identifier'
relation_key (str, default: 'relation' ) –

relation field name, default 'relation'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.DepthQAGBackwardTask(llm=my_llm)
result = op({'identifier': 'machine learning'})
print(result)
# {'identifier': 'machine learning', 'new_identifier': '...', 'relation': '...'}

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py

class DepthQAGBackwardTask(agenticrag):
    """Generates a backward task from the existing identifier, producing a new identifier and relation.

This operator infers backwards from the given identifier to generate a new identifier
and corresponding relation for building depth QA tasks.

Args:
    llm: language model service instance
    identifier_key (str): original identifier field name, default 'identifier'
    new_identifier_key (str): new identifier field name, default 'new_identifier'
    relation_key (str): relation field name, default 'relation'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.DepthQAGBackwardTask(llm=my_llm)
    result = op({'identifier': 'machine learning'})
    print(result)
    # {'identifier': 'machine learning', 'new_identifier': '...', 'relation': '...'}
    ```
    """

    def __init__(self, llm=None, identifier_key: str = 'identifier',
                 new_identifier_key: str = 'new_identifier', relation_key: str = 'relation', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.identifier_key = identifier_key
        self.new_identifier_key = new_identifier_key
        self.relation_key = relation_key
        self.prompt_template = RAGDepthBackwardSupersetPrompt()

        if llm is not None:
            self._llm_serve = llm.share().formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        identifier = data.get(self.identifier_key, '')
        user_prompt = self.prompt_template.build_prompt(identifier)

        try:
            result = self._llm_serve(user_prompt)
            parsed = self._parse_backward_result(result)
            if parsed is not None:
                data[self.new_identifier_key] = parsed['identifier']
                data[self.relation_key] = parsed['relation']
                return data
        except Exception as e:
            LOG.warning(f'Failed to generate backward task: {e}')

        return []

    def _parse_backward_result(self, result) -> Optional[dict]:
        try:
            if isinstance(result, dict) and 'identifier' in result and 'relation' in result:
                return result
            LOG.warning('[Skipped]: Invalid backward result')
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse backward result: {e}')
        return None
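
One detail that is easy to miss in the source above: the `identifier` field of the LLM result is written to `new_identifier_key`, while the input identifier is left untouched. The sketch below reproduces only the parsing and key mapping, assuming the LLM result has already been obtained; `apply_backward` is a hypothetical helper, not part of the library.

```python
from typing import Any, Dict, Optional, Union


def parse_backward_result(result: Any) -> Optional[dict]:
    """Accept only a dict that carries both 'identifier' and 'relation'."""
    if isinstance(result, dict) and 'identifier' in result and 'relation' in result:
        return result
    return None


def apply_backward(data: Dict[str, Any], result: Any,
                   new_identifier_key: str = 'new_identifier',
                   relation_key: str = 'relation') -> Union[Dict[str, Any], list]:
    """Write the parsed fields into data, or return [] to drop the sample."""
    parsed = parse_backward_result(result)
    if parsed is None:
        return []
    # The LLM's 'identifier' becomes the *new* identifier of the sample.
    data[new_identifier_key] = parsed['identifier']
    data[relation_key] = parsed['relation']
    return data


sample = {'identifier': 'machine learning'}
llm_result = {'identifier': 'artificial intelligence', 'relation': 'subfield_of'}
print(apply_backward(dict(sample), llm_result))
print(apply_backward(dict(sample), 'free text instead of JSON'))
```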

`DepthQAGCheckSuperset`

Bases: agenticrag

Checks whether the newly generated query is a superset of the original identifier.

Verifies if the combination of new_identifier and relation constitutes a valid superset query of the original identifier. If validation passes, the data is retained; otherwise, an empty list is returned to filter out the sample.

Parameters:

llm –

language model service instance
new_identifier_key (str, default: 'new_identifier' ) –

new identifier field name, default 'new_identifier'
relation_key (str, default: 'relation' ) –

relation field name, default 'relation'
identifier_key (str, default: 'identifier' ) –

original identifier field name, default 'identifier'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.DepthQAGCheckSuperset(llm=my_llm)
result = op({
    'identifier': 'Paris',
    'new_identifier': 'France',
    'relation': 'capital_of'
})
print(result)  # returns data if valid, empty list if invalid

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py

class DepthQAGCheckSuperset(agenticrag):
    """Checks whether the newly generated query is a superset of the original identifier.

Verifies if the combination of new_identifier and relation constitutes a valid superset query
of the original identifier. If validation passes, the data is retained; otherwise,
an empty list is returned to filter out the sample.

Args:
    llm: language model service instance
    new_identifier_key (str): new identifier field name, default 'new_identifier'
    relation_key (str): relation field name, default 'relation'
    identifier_key (str): original identifier field name, default 'identifier'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.DepthQAGCheckSuperset(llm=my_llm)
    result = op({
        'identifier': 'Paris',
        'new_identifier': 'France',
        'relation': 'capital_of'
    })
    print(result)  # returns data if valid, empty list if invalid
    ```
    """

    def __init__(self, llm=None, new_identifier_key: str = 'new_identifier',
                 relation_key: str = 'relation', identifier_key: str = 'identifier', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.new_identifier_key = new_identifier_key
        self.relation_key = relation_key
        self.identifier_key = identifier_key
        self.prompt_template = RAGDepthSupersetValidationPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        new_identifier = data.get(self.new_identifier_key, '')
        relation = data.get(self.relation_key, '')
        identifier = data.get(self.identifier_key, '')

        user_prompt = self.prompt_template.build_prompt(new_identifier, relation, identifier)

        try:
            result = self._llm_serve(user_prompt)
            if self._is_valid_superset(result):
                return data
        except Exception as e:
            LOG.warning(f'Failed to check superset: {e}')

        return []

    def _is_valid_superset(self, result) -> bool:
        try:
            if isinstance(result, dict):
                return result.get('new_query') == 'valid'
        except Exception as e:
            LOG.warning(f'[Error]: Failed to check superset: {e}')
        return False
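
The validity test and the resulting filter can be sketched in isolation. This is an illustrative reconstruction under the assumption that a dropped sample simply does not appear in the output; `filter_samples` and `stub_judge` are hypothetical names, and the stub replaces the real LLM call.

```python
from typing import Any, Callable, Dict, List


def is_valid_superset(result: Any) -> bool:
    """Accept only a dict whose 'new_query' field equals 'valid'."""
    return isinstance(result, dict) and result.get('new_query') == 'valid'


def filter_samples(samples: List[Dict[str, Any]],
                   judge: Callable[[Dict[str, Any]], Any]) -> List[Dict[str, Any]]:
    """Keep samples the judge marks valid; everything else is dropped."""
    kept = []
    for sample in samples:
        try:
            verdict = judge(sample)
        except Exception:
            continue  # a failed call drops the sample, as in forward()
        if is_valid_superset(verdict):
            kept.append(sample)
    return kept


def stub_judge(sample: Dict[str, Any]) -> Any:
    # Hypothetical stand-in for the LLM: only 'capital_of' relations pass.
    if sample['relation'] == 'capital_of':
        return {'new_query': 'valid'}
    return 'unparseable text'


samples = [
    {'identifier': 'Paris', 'new_identifier': 'France', 'relation': 'capital_of'},
    {'identifier': 'Paris', 'new_identifier': 'Europe', 'relation': 'unknown'},
]
kept = filter_samples(samples, stub_judge)
print(len(kept), 'of', len(samples), 'samples kept')
```

Anything other than the exact string `'valid'` under `new_query` counts as a rejection, so malformed LLM output fails closed.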

`DepthQAGGenerateQuestion`

Bases: agenticrag

Generates a depth question based on the new identifier, relation, and original identifier.

Uses an LLM to generate a question for depth QA tasks based on new_identifier, relation, and identifier, storing the result in the specified question_key field.

Parameters:

llm –

language model service instance
new_identifier_key (str, default: 'new_identifier' ) –

new identifier field name, default 'new_identifier'
relation_key (str, default: 'relation' ) –

relation field name, default 'relation'
identifier_key (str, default: 'identifier' ) –

original identifier field name, default 'identifier'
question_key (str, default: 'depth_question' ) –

field name to store generated question, default 'depth_question'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.DepthQAGGenerateQuestion(llm=my_llm)
result = op({
    'identifier': 'Paris',
    'new_identifier': 'France',
    'relation': 'capital_of'
})
print(result['depth_question'])
# 'What is the capital of France?'

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py

class DepthQAGGenerateQuestion(agenticrag):
    """Generates a depth question based on the new identifier, relation, and original identifier.

Uses an LLM to generate a question for depth QA tasks based on new_identifier, relation,
and identifier, storing the result in the specified question_key field.

Args:
    llm: language model service instance
    new_identifier_key (str): new identifier field name, default 'new_identifier'
    relation_key (str): relation field name, default 'relation'
    identifier_key (str): original identifier field name, default 'identifier'
    question_key (str): field name to store generated question, default 'depth_question'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.DepthQAGGenerateQuestion(llm=my_llm)
    result = op({
        'identifier': 'Paris',
        'new_identifier': 'France',
        'relation': 'capital_of'
    })
    print(result['depth_question'])
    # 'What is the capital of France?'
    ```
    """

    def __init__(self, llm=None, new_identifier_key: str = 'new_identifier',
                 relation_key: str = 'relation', identifier_key: str = 'identifier',
                 question_key: str = 'depth_question', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.new_identifier_key = new_identifier_key
        self.relation_key = relation_key
        self.identifier_key = identifier_key
        self.question_key = question_key
        self.prompt_template = RAGDepthQuestionFromContextPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        new_identifier = data.get(self.new_identifier_key, '')
        relation = data.get(self.relation_key, '')
        identifier = data.get(self.identifier_key, '')

        user_prompt = self.prompt_template.build_prompt(new_identifier, relation, identifier)

        try:
            result = self._llm_serve(user_prompt)
            parsed = self._parse_question_result(result)
            if parsed is not None:
                data[self.question_key] = parsed
                return data
        except Exception as e:
            LOG.warning(f'Failed to generate question: {e}')

        return []

    def _parse_question_result(self, result) -> Optional[str]:
        try:
            if isinstance(result, dict) and 'new_query' in result:
                return result['new_query']
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse question: {e}')
        return None

`DepthQAGGetIdentifier`

Bases: agenticrag

An operator that extracts a content identifier from the input text using an LLM.

If the identifier field already exists in the data, processing is skipped.

Parameters:

llm –

language model service instance
input_key (str, default: 'question' ) –

name of the input text field, default 'question'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.DepthQAGGetIdentifier(llm=my_llm, input_key='question')
result = op({'question': 'What is the capital of France?'})
print('identifier:', result['identifier'])
# identifier: capital of France  (illustrative; the actual text depends on the LLM)

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py

class DepthQAGGetIdentifier(agenticrag):
    """An operator that extracts a content identifier from the input text using an LLM.

If the identifier field already exists in the data, processing is skipped.

Args:
    llm: language model service instance
    input_key (str): name of the input text field, default 'question'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.DepthQAGGetIdentifier(llm=my_llm, input_key='question')
    result = op({'question': 'What is the capital of France?'})
    print('identifier:', result['identifier'])
    # identifier: capital of France  (illustrative; the actual text depends on the LLM)
    ```
    """

    def __init__(self, llm=None, input_key: str = 'question', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.prompt_template = RAGDepthQueryIdPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt)
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        # Skip if identifier already exists
        if 'identifier' in data:
            return data

        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        content = data.get(self.input_key, '')
        user_prompt = self.prompt_template.build_prompt(content)

        try:
            result = self._llm_serve(user_prompt)
            data['identifier'] = result
        except Exception as e:
            LOG.warning(f'Failed to get identifier: {e}')
            data['identifier'] = ''

        return data

`DepthQAGVerifyQuestion`

Bases: agenticrag

Verifies the quality of generated questions and filters out overly easy ones.

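First has the LLM answer the question to produce llm_answer, then calculates a recall score against refined_answer (falling back to answer when refined_answer is absent). If score >= 1 (indicating the question is too easy), or if either LLM call fails, the sample is filtered out; otherwise the data is retained with the temporary llm_answer and llm_score fields removed.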

Parameters:

llm –

language model service instance
question_key (str, default: 'depth_question' ) –

question field name, default 'depth_question'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.DepthQAGVerifyQuestion(llm=my_llm)
result = op({
    'depth_question': 'What is the capital of France?',
    'refined_answer': 'Paris'
})
# Returns data if question is challenging, empty list if too easy
print(result)

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py

class DepthQAGVerifyQuestion(agenticrag):
    """Verifies the quality of generated questions and filters out overly easy ones.

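First has the LLM answer the question to produce llm_answer, then calculates a recall score
against refined_answer (falling back to answer when refined_answer is absent). If score >= 1
(indicating the question is too easy), or if either LLM call fails, the sample is filtered out;
otherwise the data is retained with the temporary llm_answer and llm_score fields removed.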

Args:
    llm: language model service instance
    question_key (str): question field name, default 'depth_question'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.DepthQAGVerifyQuestion(llm=my_llm)
    result = op({
        'depth_question': 'What is the capital of France?',
        'refined_answer': 'Paris'
    })
    # Returns data if question is challenging, empty list if too easy
    print(result)
    ```
    """

    def __init__(self, llm=None, question_key: str = 'depth_question', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.question_key = question_key
        self.answer_template = RAGDepthSolverPrompt()
        self.score_template = RAGDepthConsistencyScoringPrompt()

        if llm is not None:
            self._llm_answer_serve = llm.share()
            self._llm_answer_serve.start()

            score_system_prompt = self.score_template.build_system_prompt()
            self._llm_score_serve = llm.share().prompt(score_system_prompt).formatter(JsonFormatter())
            self._llm_score_serve.start()
        else:
            self._llm_answer_serve = None
            self._llm_score_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_answer_serve is None or self._llm_score_serve is None:
            raise ValueError('LLM is not configured')

        question = data.get(self.question_key, '')

        if 'refined_answer' not in data and 'answer' in data:
            data['refined_answer'] = data['answer']

        refined_answer = data.get('refined_answer', '')

        user_prompt = self.answer_template.build_prompt(question)
        try:
            llm_answer = self._llm_answer_serve(user_prompt)
            data['llm_answer'] = llm_answer
        except Exception as e:
            LOG.warning(f'Failed to get LLM answer: {e}')
            return []

        score_prompt = self.score_template.build_prompt(refined_answer, llm_answer)

        try:
            score_result = self._llm_score_serve(score_prompt)
            if isinstance(score_result, dict):
                score = score_result.get('answer_score', 0)
            else:
                score = 0
            data['llm_score'] = score

            # Filter out easy questions (score >= 1)
            if score >= 1:
                data.pop('llm_answer', None)
                data.pop('llm_score', None)
                return []

            # Clean up temporary fields
            data.pop('llm_answer', None)
            data.pop('llm_score', None)
        except Exception as e:
            LOG.warning(f'Failed to calculate recall score: {e}')
            return []

        return data
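
The answer-then-score flow can be traced with two stub callables in place of the LLM services. The sketch below is a simplified stand-in, not the library code: `verify_question`, `stub_answer`, and `stub_score` are hypothetical, and the substring-based scorer is only a placeholder for the LLM-based consistency scoring.

```python
from typing import Any, Callable, Dict, Union


def verify_question(data: Dict[str, Any],
                    answer_fn: Callable[[str], str],
                    score_fn: Callable[[str, str], Any],
                    question_key: str = 'depth_question') -> Union[Dict[str, Any], list]:
    """Return [] for questions the model already answers correctly."""
    # Fall back to 'answer' when 'refined_answer' is missing.
    if 'refined_answer' not in data and 'answer' in data:
        data['refined_answer'] = data['answer']
    llm_answer = answer_fn(data.get(question_key, ''))
    result = score_fn(data.get('refined_answer', ''), llm_answer)
    score = result.get('answer_score', 0) if isinstance(result, dict) else 0
    # A score of 1 or more means the question is too easy to keep.
    return [] if score >= 1 else data


def stub_answer(question: str) -> str:
    # Hypothetical solver: only knows one fact.
    return 'Paris' if 'capital of France' in question else 'I do not know'


def stub_score(golden: str, answer: str) -> dict:
    # Placeholder scorer: 1 if the golden answer appears in the model answer.
    return {'answer_score': 1 if golden.lower() in answer.lower() else 0}


easy = {'depth_question': 'What is the capital of France?', 'refined_answer': 'Paris'}
hard = {'depth_question': 'Which city hosted the 1900 Summer Olympics?', 'answer': 'Paris'}
print(verify_question(easy, stub_answer, stub_score))
print(verify_question(hard, stub_answer, stub_score))
```

The second sample is retained, and its `refined_answer` is filled in from `answer` as a side effect of the fallback.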

`lazyllm.tools.data.operators.agentic_rag.agenticrag_qaf1_sample_evaluator`

`qaf1_calculate_score(data, result_key='F1Score')`

A function that calculates the F1 score for QA pairs.

Calculates the F1 score (combining precision and recall) based on normalized prediction and ground truth answers. Supports multiple ground truth answers, taking the highest F1 score as the final result. Cleans up temporary fields after calculation.

Parameters:

data (dict) –

single data dictionary
result_key (str, default: 'F1Score' ) –

output field name for F1 score, default 'F1Score'

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.qaf1_calculate_score(result_key='F1Score')
result = op({
    '_normalized_prediction': 'paris is capital',
    '_normalized_ground_truths': ['capital is paris', 'paris capital france']
})
print(result['F1Score'])  # F1 score value between 0.0 and 1.0

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_qaf1_sample_evaluator.py

@data_register('data.agenticrag', rewrite_func='forward', _concurrency_mode='process')
def qaf1_calculate_score(data: dict, result_key: str = 'F1Score') -> dict:
    """A function that calculates the F1 score for QA pairs.

Calculates the F1 score (combining precision and recall) based on normalized
prediction and ground truth answers. Supports multiple ground truth answers,
taking the highest F1 score as the final result. Cleans up temporary fields after calculation.

Args:
    data (dict): single data dictionary
    result_key (str): output field name for F1 score, default 'F1Score'


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.qaf1_calculate_score(result_key='F1Score')
    result = op({
        '_normalized_prediction': 'paris is capital',
        '_normalized_ground_truths': ['capital is paris', 'paris capital france']
    })
    print(result['F1Score'])  # F1 score value between 0.0 and 1.0
    ```
    """
    normalized_prediction = data.get('_normalized_prediction', None)
    normalized_ground_truths = data.get('_normalized_ground_truths', None)

    if normalized_prediction is None or not normalized_ground_truths:
        data[result_key] = 0.0
    else:
        max_f1 = 0.0
        for normalized_ground_truth in normalized_ground_truths:
            f1 = _compute_f1_score(normalized_prediction, normalized_ground_truth)
            max_f1 = max(max_f1, f1)
        data[result_key] = max_f1

    # Clean up temporary fields
    data.pop('_normalized_prediction', None)
    data.pop('_normalized_ground_truths', None)

    return data
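
The helper `_compute_f1_score` is not shown in this section. A common way to compute F1 for QA answers is token-level overlap between prediction and ground truth, with F1 = 2PR / (P + R). The sketch below assumes that definition; `token_f1` and `best_f1` are hypothetical names, and the library's own helper may differ in detail.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between two already-normalized strings."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, ground_truths: List[str]) -> float:
    """Take the highest F1 over all ground truths, 0.0 if there are none."""
    return max((token_f1(prediction, gt) for gt in ground_truths), default=0.0)


prediction = 'paris is capital'
ground_truths = ['capital is paris', 'paris capital france']
for gt in ground_truths:
    print(gt, '->', round(token_f1(prediction, gt), 3))
print('best:', best_f1(prediction, ground_truths))
```

Under this definition word order does not matter, so the first ground truth matches the prediction perfectly, while the second shares only two of three tokens.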

`qaf1_normalize_texts(data, predicted_key='refined_answer', reference_key='golden_doc_answer')`

A function that normalizes prediction and ground truth answer texts.

Performs standardization on prediction and ground truth answers, including: converting to lowercase, removing punctuation, removing articles (a/an/the), and normalizing whitespace. Normalized results are stored in temporary fields for subsequent F1 score calculation.

Parameters:

data (dict) –

single data dictionary
predicted_key (str, default: 'refined_answer' ) –

prediction answer field name, default 'refined_answer'
reference_key (str, default: 'golden_doc_answer' ) –

ground truth answer field name, default 'golden_doc_answer'

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.qaf1_normalize_texts(predicted_key='refined_answer', reference_key='golden_doc_answer')
result = op({
    'refined_answer': 'Paris is the capital.',
    'golden_doc_answer': 'The capital is Paris!'
})
print(result['_normalized_prediction'])  # 'paris is capital'
print(result['_normalized_ground_truths'])  # ['capital is paris']

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_qaf1_sample_evaluator.py

@data_register('data.agenticrag', rewrite_func='forward', _concurrency_mode='process')
def qaf1_normalize_texts(data: dict,
                         predicted_key: str = 'refined_answer',
                         reference_key: str = 'golden_doc_answer') -> dict:
    """A function that normalizes prediction and ground truth answer texts.

Performs standardization on prediction and ground truth answers, including:
converting to lowercase, removing punctuation, removing articles (a/an/the),
and normalizing whitespace. Normalized results are stored in temporary fields
for subsequent F1 score calculation.

Args:
    data (dict): single data dictionary
    predicted_key (str): prediction answer field name, default 'refined_answer'
    reference_key (str): ground truth answer field name, default 'golden_doc_answer'


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.qaf1_normalize_texts(predicted_key='refined_answer', reference_key='golden_doc_answer')
    result = op({
        'refined_answer': 'Paris is the capital.',
        'golden_doc_answer': 'The capital is Paris!'
    })
    print(result['_normalized_prediction'])  # 'paris is capital'
    print(result['_normalized_ground_truths'])  # ['capital is paris']
    ```
    """
    prediction = data.get(predicted_key, None)
    ground_truths = data.get(reference_key, None)

    if prediction is None or ground_truths is None:
        data['_normalized_prediction'] = None
        data['_normalized_ground_truths'] = None
        return data

    # Normalize prediction
    data['_normalized_prediction'] = _normalize_response(str(prediction))

    # Normalize ground truths (handle both string and list)
    if isinstance(ground_truths, str):
        data['_normalized_ground_truths'] = [_normalize_response(str(ground_truths))]
    else:
        data['_normalized_ground_truths'] = [
            _normalize_response(str(gt)) for gt in ground_truths if gt is not None
        ]

    return data
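
The helper `_normalize_response` is not shown here. The four steps named in the description (lowercase, strip punctuation, drop articles, collapse whitespace) can be sketched as follows; `normalize_answer` and `normalize_ground_truths` are hypothetical names, and the exact order of steps in the library may differ.

```python
import re
import string
from typing import Any, List


def normalize_answer(text: str) -> str:
    """Lowercase, remove ASCII punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())


def normalize_ground_truths(value: Any) -> List[str]:
    """Accept a single string or an iterable of answers; skip None entries."""
    if isinstance(value, str):
        return [normalize_answer(value)]
    return [normalize_answer(str(gt)) for gt in value if gt is not None]


print(normalize_answer('Paris is the capital.'))
print(normalize_ground_truths('The capital is Paris!'))
print(normalize_ground_truths(['The capital is Paris!', None, 'Paris, France']))
```

Wrapping a single string in a list keeps the downstream scoring step uniform: it can always iterate over ground truths, whether one or several were supplied.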

`lazyllm.tools.data.operators.agentic_rag.agenticrag_width_qa_generator`

`WidthQAGCheckDecomposition`

Bases: agenticrag

An operator that verifies whether the merged question effectively decomposes the original questions.

This operator checks if the complex question generated by LLM correctly decomposes and includes the original questions. If validation passes, the data is retained; otherwise an empty list is returned to filter out the sample.

Parameters:

llm –

language model service instance
output_question_key (str, default: 'generated_width_task' ) –

field name for the generated question, default 'generated_width_task'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.WidthQAGCheckDecomposition(llm=my_llm)
result = op({
    'question': 'What are the capitals of France and UK?',
    'original_question': ['What is Paris?', 'What is London?'],
    'index': 0
})
print(result)  # Returns data if valid, empty list if invalid

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py

class WidthQAGCheckDecomposition(agenticrag):
    """An operator that verifies whether the merged question effectively decomposes the original questions.

This operator checks if the complex question generated by LLM correctly decomposes
and includes the original questions. If validation passes, the data is retained;
otherwise an empty list is returned to filter out the sample.

Args:
    llm: language model service instance
    output_question_key (str): field name for the generated question, default 'generated_width_task'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.WidthQAGCheckDecomposition(llm=my_llm)
    result = op({
        'question': 'What are the capitals of France and UK?',
        'original_question': ['What is Paris?', 'What is London?'],
        'index': 0
    })
    print(result)  # Returns data if valid, empty list if invalid
    ```
    """

    def __init__(self, llm=None, output_question_key: str = 'generated_width_task', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_question_key = output_question_key
        self.prompt_template = RAGWidthDecompositionCheckPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def _build_check_input(self, item: dict) -> dict:
        ori_q = item.get('original_question', [])
        return {
            'index': item.get('index', 0),
            'complex_question': item.get('question', ''),
            'original_questions': ori_q if isinstance(ori_q, list) else [ori_q]
        }

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        check_input = self._build_check_input(data)
        user_prompt = self.prompt_template.build_prompt(check_input)

        try:
            result = self._llm_serve(user_prompt)

            if isinstance(result, dict):
                state = result.get('state', None)
                complex_question = result.get('complex_question', data.get('question'))

                if state == 1:
                    data['state'] = state
                    data[self.output_question_key] = complex_question
                    return data
                else:
                    return []
            else:
                LOG.warning('[Skipped]: Invalid check result')
                return []
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse check result: {e}')
            return []

`WidthQAGFilterByScore`

Bases: agenticrag

An operator that filters width questions based on recall score.

This operator scores llm_answer against the golden answers stored in original_answer to obtain a recall score. If either field is missing or empty, or if score >= 1 (indicating the question is too easy or the LLM answered too well), the sample is filtered out; otherwise the data is retained and the temporary fields llm_answer, llm_score, and state are removed.

Parameters:

llm –

language model service instance
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.WidthQAGFilterByScore(llm=my_llm)
result = op({
    'original_answer': ['Paris', 'London'],
    'llm_answer': 'Paris is the capital of France and London is the capital of UK',
    'state': 1
})
# Returns data if score < 1, empty list if score >= 1
print(result)

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py

class WidthQAGFilterByScore(agenticrag):
    """An operator that filters width questions based on recall score.

This operator scores llm_answer against the golden answers stored in original_answer
to obtain a recall score. If either field is missing or empty, or if score >= 1
(indicating the question is too easy or the LLM answered too well), the sample is filtered out;
otherwise the data is retained and the temporary fields llm_answer, llm_score, and state are removed.

Args:
    llm: language model service instance
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.WidthQAGFilterByScore(llm=my_llm)
    result = op({
        'original_answer': ['Paris', 'London'],
        'llm_answer': 'Paris is the capital of France and London is the capital of UK',
        'state': 1
    })
    # Returns data if score < 1, empty list if score >= 1
    print(result)
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.score_template = RAGWidthConsistencyScoringPrompt()

        if llm is not None:
            system_prompt = self.score_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        golden_answer = data.get('original_answer', [])
        llm_answer = data.get('llm_answer', '')

        if not golden_answer or not llm_answer:
            return []

        user_prompt = self.score_template.build_prompt(golden_answer, llm_answer)

        try:
            score_result = self._llm_serve(user_prompt)

            if isinstance(score_result, dict):
                score = score_result.get('answer_score', 0)
            else:
                score = 0

            data['llm_score'] = score

            if score >= 1:
                data.pop('llm_answer', None)
                data.pop('llm_score', None)
                data.pop('state', None)
                return []

            data.pop('llm_answer', None)
            data.pop('llm_score', None)
            data.pop('state', None)
            return data
        except Exception as e:
            LOG.warning(f'Failed to calculate recall score: {e}')
            return []

`WidthQAGMergePairs`

Bases: agenticrag

An operator that merges adjacent QA pairs to generate width questions.

This operator receives a batch of QA data and uses an LLM to merge adjacent pairs into more complex width questions. Requires at least 2 items to perform merging.

Parameters:

llm –

language model service instance
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.WidthQAGMergePairs(llm=my_llm)
result = op([
    {'question': 'What is Paris?', 'golden_answer': 'Capital of France'},
    {'question': 'What is London?', 'golden_answer': 'Capital of UK'}
])
print(result[0]['question'])  # Merged complex question
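
The merge step pairs each item with its immediate successor, so a batch of n items yields n - 1 prompts. The following is a minimal standalone sketch of that pairing, not the library's code; `adjacent_pairs` is a hypothetical helper name used only for illustration.

```python
from typing import List, Tuple


def adjacent_pairs(items: List[dict]) -> List[Tuple[dict, dict]]:
    """Pair each item with its successor: (0, 1), (1, 2), ..."""
    return [(items[i], items[i + 1]) for i in range(len(items) - 1)]


qa_items = [
    {'question': 'What is Paris?', 'golden_answer': 'Capital of France'},
    {'question': 'What is London?', 'golden_answer': 'Capital of UK'},
    {'question': 'What is Rome?', 'golden_answer': 'Capital of Italy'},
]

pairs = adjacent_pairs(qa_items)
for first, second in pairs:
    # Each pair would become one merge prompt for the LLM.
    print(first['question'], '+', second['question'])
```

Because the pairs overlap, every item except the first and the last appears in two prompts, and a single-item batch produces no pairs at all.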

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py

class WidthQAGMergePairs(agenticrag):
    """An operator that merges adjacent QA pairs to generate width questions.

This operator receives a batch of QA data and uses an LLM to merge adjacent pairs
into more complex width questions. Requires at least 2 items to perform merging.

Args:
    llm: language model service instance
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.WidthQAGMergePairs(llm=my_llm)
    result = op([
        {'question': 'What is Paris?', 'golden_answer': 'Capital of France'},
        {'question': 'What is London?', 'golden_answer': 'Capital of UK'}
    ])
    print(result[0]['question'])  # Merged complex question
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.prompt_template = RAGWidthQuestionSynthesisPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def _build_prompts(self, data: List[dict]) -> list:
        user_prompts = []
        for i in range(len(data) - 1):
            pair = [data[i], data[i + 1]]
            user_prompts.append(self.prompt_template.build_prompt(pair))
        return user_prompts

    def _parse_merge_result(self, result, idx: int, input_batch: List[dict]) -> Optional[dict]:
        try:
            # The formatter may return a list of candidates; keep the first one.
            if isinstance(result, list) and len(result) > 0:
                result = result[0]

            if not isinstance(result, dict) or 'question' not in result or 'index' not in result:
                LOG.warning(f'[Skipped]: Invalid merge result at index {idx}')
                return None

            indices = result['index'] if isinstance(result['index'], list) else [result['index']]
            group_items = [input_batch[i] for i in indices if i < len(input_batch)]

            if not group_items:
                return None

            return {
                'question': result['question'],
                'content_identifier': result.get('content_identifier', ''),
                'qa_index': indices,
                'index': idx,
                'original_answer': [item['golden_answer'] for item in group_items],
                'original_question': [item['question'] for item in group_items],
            }
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse merge result at index {idx}: {e}')
            return None

    def forward_batch_input(self, data: List[dict]) -> List[dict]:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        if len(data) < 2:
            LOG.warning('Need at least 2 items to merge.')
            return []

        LOG.info(f'Merging {len(data)} items into width questions...')
        user_prompts = self._build_prompts(data)

        if not user_prompts:
            return []

        merge_results = []
        for prompt in user_prompts:
            merge_results.append(self._llm_serve(prompt))

        merged_data_list = []
        for idx, result in enumerate(merge_results):
            parsed = self._parse_merge_result(result, idx, data)
            if parsed is not None:
                merged_data_list.append(parsed)

        LOG.info(f'Generated {len(merged_data_list)} merged questions.')
        return merged_data_list

`WidthQAGVerifyQuestion`

Bases: agenticrag

An operator that verifies if the generated question can be properly answered.

This operator uses an LLM to attempt answering the generated question and stores the answer in the llm_answer field for subsequent scoring.

Parameters:

llm –

language model service instance
output_question_key (str, default: 'generated_width_task' ) –

question field name, default 'generated_width_task'
**kwargs (dict, default: {} ) –

additional user-provided arguments.

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.WidthQAGVerifyQuestion(llm=my_llm)
result = op({
    'generated_width_task': 'What are the capitals of France and UK?',
    'index': 0
})
print(result['llm_answer'])  # LLM's answer to the question
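
To make the verification flow concrete, here is a small sketch that runs it against a stubbed model. `verify_question` and `fake_llm` are hypothetical names, and the stub receives the request dict directly, whereas the operator first renders it through its prompt template. Unlike the operator, which writes into the input dict, the sketch returns a copy.

```python
from typing import Callable, Optional


def verify_question(data: dict, llm: Callable[[dict], object],
                    question_key: str = 'generated_width_task') -> dict:
    """Ask `llm` the question and store its answer under 'llm_answer'."""
    request = {
        'index': data.get('index', 0),
        'complex_question': data.get(question_key, ''),
    }
    reply = llm(request)
    # Only a dict reply carries an answer; anything else is recorded as None.
    answer: Optional[str] = reply.get('llm_answer') if isinstance(reply, dict) else None
    return {**data, 'llm_answer': answer}


def fake_llm(request: dict) -> dict:
    """Hypothetical stand-in for a JSON-formatted LLM service."""
    return {'index': request['index'], 'llm_answer': 'Paris and London'}


record = verify_question(
    {'generated_width_task': 'What are the capitals of France and UK?', 'index': 0},
    fake_llm,
)
print(record['llm_answer'])
```

A reply that is not a dict leaves `llm_answer` set to `None`, which a downstream scoring step can treat as a missing answer.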

Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py

class WidthQAGVerifyQuestion(agenticrag):
    """An operator that verifies if the generated question can be properly answered.

This operator uses an LLM to attempt answering the generated question and stores
the answer in the llm_answer field for subsequent scoring.

Args:
    llm: language model service instance
    output_question_key (str): question field name, default 'generated_width_task'
    **kwargs (dict): additional user-provided arguments.


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.WidthQAGVerifyQuestion(llm=my_llm)
    result = op({
        'generated_width_task': 'What are the capitals of France and UK?',
        'index': 0
    })
    print(result['llm_answer'])  # LLM's answer to the question
    ```
    """

    def __init__(self, llm=None, output_question_key: str = 'generated_width_task', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_question_key = output_question_key
        self.prompt_template = RAGWidthVerificationPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def _parse_verify_result(self, result) -> Optional[str]:
        try:
            if isinstance(result, dict):
                return result.get('llm_answer', None)
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse verification result: {e}')
        return None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        question = data.get(self.output_question_key, '')

        verify_input = {
            'index': data.get('index', 0),
            'complex_question': question
        }

        user_prompt = self.prompt_template.build_prompt(verify_input)

        try:
            result = self._llm_serve(user_prompt)
            llm_answer = self._parse_verify_result(result)
            data['llm_answer'] = llm_answer
            return data
        except Exception as e:
            LOG.warning(f'Failed to verify question: {e}')
            return []

Embedding synthesis

`lazyllm.tools.data.operators.embedding_synthesis`

`EmbeddingFormatFlagEmbedding`

Bases: embedding

An operator that formats data into FlagEmbedding training format.

This operator formats the input query, pos (positive samples), and neg (negative samples) into the training data format required by the FlagEmbedding framework. Supports adding an instruction field for supervised Embedding training.

Parameters:

instruction (str, default: None ) –

Instruction text for supervised training scenarios. Defaults to None.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

A dictionary containing query, pos, neg, and optional prompt fields.

Examples:

from lazyllm.tools.data import embedding

op = embedding.EmbeddingFormatFlagEmbedding(instruction='Represent this sentence for searching relevant passages:')
result = op({'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe']})
# Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe'], 'prompt': 'Represent this sentence for searching relevant passages:'}
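
FlagEmbedding fine-tuning data is commonly stored as one JSON object per line. The sketch below mirrors the formatting rule described above in plain Python and serializes the result as JSONL; `to_flag_embedding` is a hypothetical helper, not part of lazyllm, and it returns `None` for skipped samples where the operator returns an empty list.

```python
import json
from typing import Optional


def to_flag_embedding(sample: dict, instruction: Optional[str] = None) -> Optional[dict]:
    """Format one sample; return None when query or positives are missing."""
    query = sample.get('query', '')
    pos = sample.get('pos', [])
    neg = sample.get('neg', [])
    if not query or not pos:
        return None
    # Wrap scalar values so pos and neg are always lists.
    pos = pos if isinstance(pos, list) else [pos]
    neg = neg if isinstance(neg, list) else ([neg] if neg else [])
    record = {'query': query, 'pos': pos, 'neg': neg}
    if instruction:
        record['prompt'] = instruction
    return record


samples = [
    {'query': 'machine learning', 'pos': 'ML tutorial', 'neg': ['cooking recipe']},
    {'query': '', 'pos': ['orphan passage']},  # skipped: no query
]
instruction = 'Represent this sentence for searching relevant passages:'
records = [to_flag_embedding(s, instruction) for s in samples]
jsonl_lines = [json.dumps(r, ensure_ascii=False) for r in records if r is not None]
print('\n'.join(jsonl_lines))
```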

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py

class EmbeddingFormatFlagEmbedding(embedding):
    """An operator that formats data into FlagEmbedding training format.

This operator formats the input query, pos (positive samples), and neg (negative samples)
into the training data format required by the FlagEmbedding framework.
Supports adding an instruction field for supervised Embedding training.

Args:
    instruction (str, optional): Instruction text for supervised training scenarios. Defaults to None.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: A dictionary containing query, pos, neg, and optional prompt fields.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    op = embedding.EmbeddingFormatFlagEmbedding(instruction='Represent this sentence for searching relevant passages:')
    result = op({'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe']})
    # Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe'], 'prompt': 'Represent this sentence for searching relevant passages:'}
    ```
    """
    def __init__(self, instruction: Optional[str] = None, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.instruction = instruction

    def forward(self, data: dict) -> dict:
        query = data.get('query', '')
        pos = data.get('pos', [])
        neg = data.get('neg', [])

        if not query or not pos:
            return []

        # Ensure pos and neg are lists
        if not isinstance(pos, list):
            pos = [pos]
        if not isinstance(neg, list):
            neg = [neg] if neg else []

        result = {
            'query': query,
            'pos': pos,
            'neg': neg,
        }
        if self.instruction:
            result['prompt'] = self.instruction

        return result

`EmbeddingFormatSentenceTransformers`

Bases: embedding

An operator that formats data into SentenceTransformers triplet training format.

This operator converts the input query, pos (positive samples), and neg (negative samples) into the anchor-positive-negative triplet format required by the SentenceTransformers framework. Suitable for training with losses like MultipleNegativesRankingLoss.

Parameters:

**kwargs (dict, default: {} ) –

Optional arguments passed to the parent class.

Returns:

List[dict] –

A list of dictionaries containing anchor, positive, and negative fields, with one triplet generated for each positive-negative pair.

Examples:

from lazyllm.tools.data import embedding

op = embedding.EmbeddingFormatSentenceTransformers()
result = op({'query': 'machine learning', 'pos': ['ML basics'], 'neg': ['cooking tips']})
# Returns: [{'anchor': 'machine learning', 'positive': 'ML basics', 'negative': 'cooking tips'}]
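
The triplet expansion is a cross product of positives and negatives, so the output size is `len(pos) * len(neg)`. A minimal standalone sketch of that rule follows; `to_triplets` is a hypothetical name and the function only approximates the operator's behavior.

```python
from itertools import product
from typing import List, Union


def to_triplets(query: str, pos: Union[str, List[str]],
                neg: Union[str, List[str]]) -> List[dict]:
    """Build one anchor-positive-negative triplet per (positive, negative) pair."""
    if not query or not pos:
        return []
    pos_list = pos if isinstance(pos, list) else [pos]
    neg_list = neg if isinstance(neg, list) else ([neg] if neg else [])
    return [
        {'anchor': query, 'positive': p, 'negative': n}
        for p, n in product(pos_list, neg_list)
    ]


triplets = to_triplets('machine learning', ['ML basics', 'ML course'], ['cooking tips', 'travel'])
print(len(triplets))
print(to_triplets('machine learning', ['ML basics'], []))
```

Note the edge case in the last line: a sample with positives but no negatives produces no triplets, so such samples drop out of the formatted output.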

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py

class EmbeddingFormatSentenceTransformers(embedding):
    """An operator that formats data into SentenceTransformers triplet training format.

This operator converts the input query, pos (positive samples), and neg (negative samples)
into the anchor-positive-negative triplet format required by the SentenceTransformers framework.
Suitable for training with losses like MultipleNegativesRankingLoss.

Args:
    **kwargs (dict): Optional arguments passed to the parent class.

Returns:
    List[dict]: A list of dictionaries containing anchor, positive, and negative fields,
    with one triplet generated for each positive-negative pair.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    op = embedding.EmbeddingFormatSentenceTransformers()
    result = op({'query': 'machine learning', 'pos': ['ML basics'], 'neg': ['cooking tips']})
    # Returns: [{'anchor': 'machine learning', 'positive': 'ML basics', 'negative': 'cooking tips'}]
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)

    def forward(self, data: dict) -> List[dict]:
        query = data.get('query', '')
        pos = data.get('pos', [])
        neg = data.get('neg', [])

        if not query or not pos:
            return []

        # Ensure pos and neg are lists
        pos_list = pos if isinstance(pos, list) else [pos]
        neg_list = neg if isinstance(neg, list) else [neg] if neg else []

        # Create anchor-positive-negative triplets
        results = []
        for p in pos_list:
            for n in neg_list:
                results.append(
                    {
                        'anchor': query,
                        'positive': p,
                        'negative': n,
                    }
                )

        return results

`EmbeddingFormatTriplet`

Bases: embedding

An operator that formats data into generic triplet format.

This operator converts the input query, pos (positive samples), and neg (negative samples) into a standard triplet format with field names query, positive, and negative. Compatible with various Embedding training frameworks.

Parameters:

**kwargs (dict, default: {} ) –

Optional arguments passed to the parent class.

Returns:

List[dict] –

A list of dictionaries containing query, positive, and negative fields, with one triplet generated for each positive-negative pair.

Examples:

from lazyllm.tools.data import embedding

op = embedding.EmbeddingFormatTriplet()
result = op({'query': 'deep learning', 'pos': ['neural networks', 'AI'], 'neg': ['history', 'geography']})
# Returns 4 triplets (2 positives x 2 negatives), e.g. {'query': 'deep learning', 'positive': 'neural networks', 'negative': 'history'}

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py

class EmbeddingFormatTriplet(embedding):
    """An operator that formats data into generic triplet format.

This operator converts the input query, pos (positive samples), and neg (negative samples)
into a standard triplet format with field names query, positive, and negative.
Compatible with various Embedding training frameworks.

Args:
    **kwargs (dict): Optional arguments passed to the parent class.

Returns:
    List[dict]: A list of dictionaries containing query, positive, and negative fields,
    with one triplet generated for each positive-negative pair.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    op = embedding.EmbeddingFormatTriplet()
    result = op({'query': 'deep learning', 'pos': ['neural networks', 'AI'], 'neg': ['history', 'geography']})
    # Returns list of triplets combining each positive with each negative
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)

    def forward(self, data: dict) -> List[dict]:
        query = data.get('query', '')
        pos = data.get('pos', [])
        neg = data.get('neg', [])

        if not query or not pos:
            return []

        # Ensure pos and neg are lists
        pos_list = pos if isinstance(pos, list) else [pos]
        neg_list = neg if isinstance(neg, list) else [neg] if neg else []

        # Create query-positive-negative triplets
        results = []
        for p in pos_list:
            for n in neg_list:
                results.append(
                    {
                        'query': query,
                        'positive': p,
                        'negative': n,
                    }
                )

        return results

`EmbeddingGenerateQueries`

Bases: embedding

An operator that generates queries using LLM.

This operator builds a prompt from the input passage (read from the 'passage' field by default) and calls a language model service to generate queries. The response is stored as a JSON string in the '_query_response' field.

Parameters:

llm –

LLM service instance for generating queries.
num_queries (int, default: 3 ) –

Number of queries to generate, defaults to 3.
lang (str, default: 'zh' ) –

Language, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
query_types (List[str], default: None ) –

List of query types, defaults to ['factual', 'semantic', 'inferential'].
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Input data with '_query_response' field added containing the generated query response.

Examples:

from lazyllm.tools.data import embedding

# Assuming llm is an LLM service instance
generator = embedding.EmbeddingGenerateQueries(llm=llm, lang='zh')
data = {'passage': 'Machine learning is a field of AI that learns patterns from data.'}
result = generator(data)
# Returns data with '_query_response' field containing JSON queries
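
The '_query_response' field always holds a string: a string reply is stored unchanged, and any other reply is JSON-encoded without escaping non-ASCII characters. The sketch below illustrates that normalization on its own; `serialize_response` is a hypothetical helper, and the 'query' and 'type' keys in the sample reply are assumptions rather than a documented schema.

```python
import json
from typing import Any


def serialize_response(result: Any) -> str:
    """Return strings unchanged; JSON-encode anything else, keeping non-ASCII readable."""
    if isinstance(result, str):
        return result
    return json.dumps(result, ensure_ascii=False)


# Assumed shape of a parsed LLM reply, for illustration only.
parsed_reply = [{'query': '什么是机器学习？', 'type': 'factual'}]
response_text = serialize_response(parsed_reply)
print(response_text)
print(json.loads(response_text) == parsed_reply)
```

Keeping `ensure_ascii=False` matters for the default `lang='zh'` setting, since Chinese queries would otherwise be stored as `\uXXXX` escapes.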

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_query_generator.py

class EmbeddingGenerateQueries(embedding):
    """An operator that generates queries using LLM.

This operator builds a prompt from the input passage (read from the 'passage' field by default) and calls a language model service to generate queries. The response is stored as a JSON string in the '_query_response' field.

Args:
    llm: LLM service instance for generating queries.
    num_queries (int): Number of queries to generate, defaults to 3.
    lang (str): Language, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
    query_types (List[str], optional): List of query types, defaults to ['factual', 'semantic', 'inferential'].
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Input data with '_query_response' field added containing the generated query response.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    # Assuming llm is an LLM service instance
    generator = embedding.EmbeddingGenerateQueries(llm=llm, lang='zh')
    data = {'passage': 'Machine learning is a field of AI that learns patterns from data.'}
    result = generator(data)
    # Returns data with '_query_response' field containing JSON queries
    ```
    """
    def __init__(
        self,
        llm=None,
        num_queries: int = 3,
        lang: str = 'zh',
        query_types: Optional[List[str]] = None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = EmbeddingQueryGeneratorPrompt(lang=lang)
        self.num_queries = num_queries
        self.query_types = query_types or ['factual', 'semantic', 'inferential']
        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = (
                llm.share()
                .prompt(system_prompt)
                .formatter(JsonFormatter())
            )
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(
        self,
        data: dict,
        input_key: str = 'passage',
        **kwargs,
    ) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        passage = data.get(input_key, '')
        if not passage:
            return {**data, '_query_response': ''}

        user_prompt = self.prompt_template.build_prompt(
            passage=passage,
            num_queries=self.num_queries,
            query_types=self.query_types,
        )
        if not user_prompt:
            return {**data, '_query_response': ''}

        try:
            result = self._llm_serve(user_prompt)

            if isinstance(result, str):
                response = result
            else:
                response = json.dumps(result, ensure_ascii=False)

            return {**data, '_query_response': response}

        except Exception as e:
            LOG.warning(f'Failed to generate queries: {e}')
            return {**data, '_query_response': ''}

`EmbeddingInitBM25`

Bases: embedding

An operator that initializes a BM25 index.

This operator builds a BM25 index from the corpus for subsequent keyword retrieval and hard negative mining. It supports Chinese and English, using jieba to segment Chinese text and Stemmer for English stemming.

Parameters:

language (str, default: 'zh' ) –

Language type, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

Input data with BM25 index and related configuration added to each item.

Examples:

from lazyllm.tools.data import embedding

# First build corpus, then initialize BM25
corpus_op = embedding.build_embedding_corpus(input_pos_key='pos')
bm25_op = embedding.EmbeddingInitBM25(language='zh')
# Returns data with '_bm25' index and tokenizer configuration
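
For readers unfamiliar with BM25, the following standard-library sketch shows the kind of keyword scoring such an index provides. It is not the bm25s implementation the operator relies on: tokenization is a plain whitespace split with no stopwords or stemming, and the `k1` and `b` values are common textbook defaults chosen here as assumptions. The corpus must be non-empty.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    avg_len = sum(len(d) for d in docs) / len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    'machine learning tutorial for beginners',
    'cooking recipe for pasta',
    'deep learning with neural networks',
]
scores = bm25_scores('machine learning', corpus)
ranked = sorted(zip(scores, corpus), reverse=True)
print(ranked[0][1])
```

In hard negative mining, the highest-scoring documents that are not known positives are the candidates of interest: they share keywords with the query without being labeled relevant.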

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py

class EmbeddingInitBM25(embedding):
    """An operator that initializes BM25 index.

This operator builds a BM25 index from the corpus for subsequent keyword retrieval and hard negative mining.
It supports Chinese and English, using jieba to segment Chinese text and Stemmer for English stemming.

Args:
    language (str): Language type, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: Input data with BM25 index and related configuration added to each item.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    # First build corpus, then initialize BM25
    corpus_op = embedding.build_embedding_corpus(input_pos_key='pos')
    bm25_op = embedding.EmbeddingInitBM25(language='zh')
    # Returns data with '_bm25' index and tokenizer configuration
    ```
    """

    def __init__(self, language: str = 'zh', **kwargs):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.language = language
        self._setup_tokenizer(language)

    def _setup_tokenizer(self, language: str):
        if language == 'en':
            self._stemmer = Stemmer.Stemmer('english')
            self._stopwords = language
            self._tokenizer = lambda t: t
        elif language == 'zh':
            self._stemmer = None
            self._stopwords = STOPWORDS_CHINESE
            self._tokenizer = lambda t: ' '.join(jieba.lcut(t))
        else:
            self._stemmer = None
            self._stopwords = None
            self._tokenizer = lambda t: t

    def forward_batch_input(self, inputs: List[dict], **kwargs) -> List[dict]:
        if not inputs:
            return inputs

        # Load corpus from file path instead of memory
        corpus_path = inputs[0].get('_corpus', '')
        if not corpus_path:
            LOG.warning('No corpus path found for BM25 initialization.')
            return [
                {**item, '_bm25': None, '_bm25_corpus': []}
                for item in inputs
            ]

        corpus = _load_corpus_from_path(corpus_path)
        if not corpus:
            LOG.warning(f'Failed to load corpus from {corpus_path}')
            return [
                {**item, '_bm25': None, '_bm25_corpus': []}
                for item in inputs
            ]

        LOG.info(f'Initializing BM25 index for {len(corpus)} documents...')

        corpus_tokens = bm25s.tokenize(
            [self._tokenizer(doc) for doc in corpus],
            stopwords=self._stopwords,
            stemmer=self._stemmer,
        )

        bm25_index = bm25s.BM25()
        bm25_index.index(corpus_tokens)

        LOG.info('BM25 index initialized.')

        return [
            {
                **item,
                '_bm25': bm25_index,
                '_bm25_corpus': corpus,
                '_bm25_tokenizer': self._tokenizer,
                '_bm25_stopwords': self._stopwords,
                '_bm25_stemmer': self._stemmer,
            }
            for item in inputs
        ]

`EmbeddingInitSemantic`

Bases: embedding

An operator that initializes semantic embeddings.

This operator uses an embedding service to compute vector representations for all documents in the corpus and saves them to a .npy file. The saved vectors are used later for semantic similarity calculation and hard negative mining.

Parameters:

embedding_serving (Callable, default: None ) –

Embedding service callable for computing text vectors.
embeddings_dir (str, default: None ) –

Directory to save embedding files, defaults to corpus directory.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

Input data with semantic embedding file paths and corpus information added.

Examples:

from lazyllm.tools.data import embedding

# Assuming my_embedding_fn is an embedding service
semantic_op = embedding.EmbeddingInitSemantic(embedding_serving=my_embedding_fn)
# Returns data with '_semantic_embeddings_path' pointing to saved embeddings

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py

class EmbeddingInitSemantic(embedding):
    """An operator that initializes semantic embeddings.

This operator uses an embedding service to compute vector representations for all documents in the corpus
and saves them to a .npy file. The saved vectors are used later for semantic similarity calculation and hard negative mining.

Args:
    embedding_serving (Callable): Embedding service callable for computing text vectors.
    embeddings_dir (str, optional): Directory to save embedding files, defaults to corpus directory.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: Input data with semantic embedding file paths and corpus information added.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    # Assuming my_embedding_fn is an embedding service
    semantic_op = embedding.EmbeddingInitSemantic(embedding_serving=my_embedding_fn)
    # Returns data with '_semantic_embeddings_path' pointing to saved embeddings
    ```
    """

    def __init__(
        self,
        embedding_serving: Optional[Callable] = None,
        embeddings_dir: Optional[str] = None,
        **kwargs,
    ):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.embedding_serving = embedding_serving
        self.embeddings_dir = embeddings_dir

    def forward_batch_input(self, inputs: List[dict], **kwargs) -> List[dict]:
        if not inputs:
            return inputs

        # Load corpus from file path instead of memory
        corpus_path = inputs[0].get('_corpus', '')
        if not corpus_path:
            LOG.warning('No corpus path found for semantic initialization.')
            return [
                {
                    **item,
                    '_semantic_embeddings_path': '',
                    '_semantic_corpus': [],
                }
                for item in inputs
            ]

        # Verify all inputs share the same corpus path for consistency
        if not all(item.get('_corpus') == corpus_path for item in inputs):
            LOG.warning('Not all inputs share the same corpus path. Using corpus from first item.')

        corpus = _load_corpus_from_path(corpus_path)
        if not corpus or self.embedding_serving is None:
            LOG.warning(
                'No corpus or embedding_serving for semantic initialization.'
            )
            return [
                {
                    **item,
                    '_semantic_embeddings_path': '',
                    '_semantic_corpus': corpus or [],
                }
                for item in inputs
            ]

        LOG.info(f'Computing embeddings for {len(corpus)} documents...')
        embeddings = np.array(self.embedding_serving(corpus))
        LOG.info('Embeddings computed.')

        # Save embeddings to file instead of storing in memory for each item
        if self.embeddings_dir is None:
            embeddings_dir = os.path.dirname(corpus_path)
        else:
            embeddings_dir = self.embeddings_dir
        os.makedirs(embeddings_dir, exist_ok=True)

        embeddings_path = os.path.join(
            embeddings_dir, f'embeddings_{id(inputs)}.npy'
        )
        np.save(embeddings_path, embeddings)
        LOG.info(f'Saved embeddings to {embeddings_path}')

        return [
            {
                **item,
                '_semantic_embeddings_path': embeddings_path,
                '_semantic_corpus': corpus,
            }
            for item in inputs
        ]

`EmbeddingMineSemanticNegatives`

Bases: embedding

An operator that mines hard negative samples using semantic similarity.

This operator finds the documents most similar to the query, excluding the positive samples, based on cosine similarity between embedding vectors. It is suited to mining hard negatives that are semantically close to the query but actually irrelevant, and it usually performs better than the BM25 method.

Parameters:

num_negatives (int, default: 7 ) –

Number of negative samples to mine, defaults to 7.
embedding_serving (Callable, default: None ) –

Embedding service callable for computing query vectors.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Input data with negative samples mined based on semantic similarity added.

Examples:

from lazyllm.tools.data import embedding

# Assuming embeddings are initialized
semantic_miner = embedding.EmbeddingMineSemanticNegatives(num_negatives=5, embedding_serving=my_embedding_fn)
data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
result = semantic_miner(data)
# Returns data with 'neg' field containing semantically similar negative samples
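
The selection rule can be sketched without any dependencies: rank corpus documents by cosine similarity to the query vector, skip the positives, and keep the top `k`. The operator itself works on NumPy arrays loaded from disk; `cosine` and `mine_negatives` below are hypothetical helpers, and the two-dimensional vectors are made-up toy values rather than real embeddings.

```python
import math
from typing import List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; a zero vector is treated as similarity 0."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mine_negatives(query_vec: Sequence[float], corpus: List[str],
                   corpus_vecs: List[Sequence[float]], positives: Set[str], k: int) -> List[str]:
    """Return the k non-positive documents most similar to the query."""
    scored = [
        (cosine(query_vec, vec), doc)
        for doc, vec in zip(corpus, corpus_vecs)
        if doc not in positives
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


corpus = ['ML tutorial', 'deep learning guide', 'statistics primer', 'cooking recipe']
corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.4], [0.0, 1.0]]
negatives = mine_negatives([1.0, 0.0], corpus, corpus_vecs, {'ML tutorial'}, k=2)
print(negatives)
```

The positive document has the highest similarity but is filtered out, so the result contains the next two closest documents, which is what makes them hard negatives.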

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py

class EmbeddingMineSemanticNegatives(embedding):
    """An operator that mines hard negative samples using semantic similarity.

This operator finds the documents most similar to the query, excluding the positive samples, based on cosine similarity
between embedding vectors. It is suited to mining hard negatives that are semantically close to the query but actually irrelevant,
and it usually performs better than the BM25 method.

Args:
    num_negatives (int): Number of negative samples to mine, defaults to 7.
    embedding_serving (Callable): Embedding service callable for computing query vectors.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Input data with negative samples mined based on semantic similarity added.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    # Assuming embeddings are initialized
    semantic_miner = embedding.EmbeddingMineSemanticNegatives(num_negatives=5, embedding_serving=my_embedding_fn)
    data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
    result = semantic_miner(data)
    # Returns data with 'neg' field containing semantically similar negative samples
    ```
    """

    def __init__(
        self,
        num_negatives: int = 7,
        embedding_serving: Optional[Callable] = None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_negatives = num_negatives
        self.embedding_serving = embedding_serving

    @staticmethod
    def _cosine_similarity(
        query_emb: np.ndarray,
        corpus_embs: np.ndarray,
    ) -> np.ndarray:
        query_norm = np.linalg.norm(query_emb)
        if query_norm > 0:
            query_emb = query_emb / query_norm

        corpus_norms = np.linalg.norm(
            corpus_embs,
            axis=1,
            keepdims=True,
        )
        corpus_norms = np.where(corpus_norms > 0, corpus_norms, 1)
        corpus_normalized = corpus_embs / corpus_norms

        return np.dot(corpus_normalized, query_emb)

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs,
    ) -> dict:
        # Load embeddings from file path
        embeddings_path = data.get('_semantic_embeddings_path', '')
        corpus_embeddings = _load_embeddings_from_path(embeddings_path)
        corpus = data.get('_semantic_corpus') or []

        if corpus_embeddings is None:
            LOG.warning('Semantic embeddings not initialized.')
            return {**data, output_neg_key: []}

        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        if self.embedding_serving is None:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)

        query_embedding = np.array(
            self.embedding_serving([query])[0]
        )

        similarities = self._cosine_similarity(
            query_embedding,
            corpus_embeddings,
        )

        scored_docs = [
            (sim, doc)
            for sim, doc in zip(similarities, corpus)
            if doc not in pos_set
        ]

        scored_docs.sort(key=lambda x: x[0], reverse=True)

        negatives = [
            doc for _, doc in scored_docs[: self.num_negatives]
        ]

        return {**data, output_neg_key: negatives}
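
The ranking step in the source above can be tried in isolation. The following is a minimal sketch, assuming the corpus embeddings are already in memory as a 2-D array. The function name `mine_semantic_negatives` and the toy two-dimensional vectors are illustrative and not part of lazyllm.

```python
import numpy as np


def mine_semantic_negatives(query_emb, corpus_embs, corpus, positives, num_negatives=7):
    """Return the corpus documents most similar to the query, skipping positives."""
    query_emb = np.asarray(query_emb, dtype=float)
    corpus_embs = np.asarray(corpus_embs, dtype=float)

    # Normalize the query; a zero vector is left as-is to avoid division by zero.
    query_norm = np.linalg.norm(query_emb)
    if query_norm > 0:
        query_emb = query_emb / query_norm

    # Normalize each corpus row, substituting 1 for zero norms.
    corpus_norms = np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    corpus_norms = np.where(corpus_norms > 0, corpus_norms, 1)
    similarities = (corpus_embs / corpus_norms) @ query_emb

    # Rank by similarity (highest first) and drop documents that are positives.
    positives = set(positives)
    order = np.argsort(-similarities, kind='stable')
    ranked = [corpus[i] for i in order if corpus[i] not in positives]
    return ranked[:num_negatives]


# Toy data: hypothetical 2-D embeddings chosen so the ranking is easy to follow.
corpus = ['ML tutorial', 'deep learning guide', 'cooking recipes', 'statistics primer']
corpus_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
negatives = mine_semantic_negatives(
    [1.0, 0.0], corpus_embs, corpus, positives=['ML tutorial'], num_negatives=2
)
print(negatives)  # ['deep learning guide', 'statistics primer']
```

The positive 'ML tutorial' has the highest similarity but is filtered out, so the two closest remaining documents are returned as hard negatives.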

`EmbeddingParseQueries`

Bases: embedding

An operator that parses generated queries.

This operator parses the LLM-generated query response stored in the `_query_response` field and expands each query into an independent data record.

Parameters:

input_key (str, default: 'passage' ) –

Input field name, defaults to 'passage'.
output_query_key (str, default: 'query' ) –

Output query field name, defaults to 'query'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

List of parsed queries, each query as an independent data record.

Examples:

from lazyllm.tools.data import embedding

parser = embedding.EmbeddingParseQueries(input_key='passage', output_query_key='query')
data = {'_query_response': '[{"query": "what is ML?", "type": "factual"}]', 'passage': 'Machine learning is...'}
result = parser(data)
# Returns list of expanded query records with 'query' and 'pos' fields
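
The expansion logic can be approximated without lazyllm. This is a minimal sketch, assuming the response is already plain JSON. The real operator first passes it through a cleaning helper that is not shown on this page, and the function name `expand_queries` is hypothetical.

```python
import json


def expand_queries(row, input_key='passage', output_query_key='query'):
    """Expand a JSON query response into one record per non-empty query."""
    response = row.get('_query_response', '')
    if not response:
        return []
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return []

    # Accept either a bare list or an object holding a 'queries' list.
    if isinstance(parsed, list):
        items = parsed
    elif isinstance(parsed, dict):
        items = parsed.get('queries') or []
    else:
        return []

    passage = row.get(input_key, '')
    expanded = []
    for item in items:
        if isinstance(item, dict):
            query, query_type = str(item.get('query', '')), item.get('type', 'unknown')
        else:
            query, query_type = str(item), 'unknown'
        if not query.strip():
            continue  # skip blank queries
        # Copy the row without the intermediate prompt/response fields.
        new_row = {k: v for k, v in row.items() if k not in ('_query_prompt', '_query_response')}
        new_row.update({output_query_key: query.strip(), 'query_type': query_type, 'pos': [passage]})
        expanded.append(new_row)
    return expanded


row = {
    '_query_response': '[{"query": "what is ML?", "type": "factual"}, " ", "define overfitting"]',
    'passage': 'Machine learning is...',
}
rows = expand_queries(row)
for r in rows:
    print(r['query'], r['query_type'], r['pos'])
```

Each surviving query becomes its own record, with the source passage as its single positive sample.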

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_query_generator.py

class EmbeddingParseQueries(embedding):
    """An operator that parses generated queries.

This operator parses the query response generated by an LLM and expands each query into an independent data record.

Args:
    input_key (str): Input field name, defaults to 'passage'.
    output_query_key (str): Output query field name, defaults to 'query'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: List of parsed queries, each query as an independent data record.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    parser = embedding.EmbeddingParseQueries(input_key='passage', output_query_key='query')
    data = {'_query_response': '[{"query": "what is ML?", "type": "factual"}]', 'passage': 'Machine learning is...'}
    result = parser(data)
    # Returns list of expanded query records with 'query' and 'pos' fields
    ```
    """
    def __init__(
        self,
        input_key: str = 'passage',
        output_query_key: str = 'query',
        **kwargs,
    ):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.input_key = input_key
        self.output_query_key = output_query_key

    def forward(
        self,
        data: dict,
        **kwargs,
    ) -> List[dict]:
        response = data.get('_query_response', '')
        if not response:
            return []

        passage = data.get(self.input_key, '')
        expanded_rows = []

        try:
            parsed = json.loads(_clean_json_block(response))
            queries = (
                parsed if isinstance(parsed, list)
                else parsed.get('queries', [])
            )

            for query_item in queries:
                if isinstance(query_item, dict):
                    query = query_item.get('query', '')
                    query_type = query_item.get('type', 'unknown')
                else:
                    query = str(query_item)
                    query_type = 'unknown'

                if query.strip():
                    new_row = data.copy()
                    new_row[self.output_query_key] = query.strip()
                    new_row['query_type'] = query_type
                    new_row['pos'] = [passage]

                    new_row.pop('_query_prompt', None)
                    new_row.pop('_query_response', None)

                    expanded_rows.append(new_row)

        except Exception as e:
            LOG.warning(f'Failed to parse query response: {e}')
            return []

        return expanded_rows

`EmbeddingTrainTestSplitter`

Bases: embedding

An operator that splits dataset into training and test sets.

This operator shuffles the input data with a fixed random seed and splits it into training and test sets according to the specified ratio. Each split can also be saved to a JSONL file. A stratify_key argument is accepted, but the source shown below stores it without using it when splitting.

Parameters:

test_size (float, default: 0.1 ) –

Proportion of test set, defaults to 0.1 (i.e., 10%).
seed (int, default: 42 ) –

Random seed for reproducible splitting, defaults to 42.
stratify_key (str, default: None ) –

Key name for stratified sampling, defaults to None.
train_output_file (str, default: None ) –

Output file path for training set, defaults to None.
test_output_file (str, default: None ) –

Output file path for test set, defaults to None.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

All samples from both training and test sets, with a 'split' field added to indicate which set each sample belongs to.

Examples:

from lazyllm.tools.data import embedding

op = embedding.EmbeddingTrainTestSplitter(test_size=0.2, seed=123, train_output_file='train.jsonl', test_output_file='test.jsonl')
data = [{'query': 'q1', 'pos': 'p1'}, {'query': 'q2', 'pos': 'p2'}, {'query': 'q3', 'pos': 'p3'}]
result = op(data)
# Returns all samples with 'split' field ('train' or 'test')
# Saves train data to train.jsonl and test data to test.jsonl
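
A seeded shuffle-and-split along these lines can be sketched in plain Python. This is an illustration rather than the operator's implementation: the function name `train_test_split` is hypothetical, and the sketch uses a local `random.Random` instance and copies each row, which are choices made here to avoid touching global random state or the caller's data.

```python
import random


def train_test_split(rows, test_size=0.1, seed=42):
    """Shuffle rows with a fixed seed and label each one 'train' or 'test'."""
    rng = random.Random(seed)           # local RNG, global random state is untouched
    shuffled = [dict(r) for r in rows]  # copy rows so the caller's dicts are not mutated
    rng.shuffle(shuffled)

    # int() truncates, so any fractional sample goes to the test side.
    split_idx = int(len(shuffled) * (1 - test_size))
    for i, row in enumerate(shuffled):
        row['split'] = 'train' if i < split_idx else 'test'
    return shuffled


rows = [{'query': f'q{i}'} for i in range(10)]
labeled = train_test_split(rows, test_size=0.2, seed=123)
n_train = sum(r['split'] == 'train' for r in labeled)
n_test = sum(r['split'] == 'test' for r in labeled)
print(n_train, n_test)  # 8 2
```

Calling the function again with the same seed reproduces the same assignment, which is what makes the split repeatable across runs.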

Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py

class EmbeddingTrainTestSplitter(embedding):
    """An operator that splits dataset into training and test sets.

This operator randomly shuffles the input data and splits it into training and test sets
according to the specified ratio. Supports saving split data to JSONL files
and stratified sampling by a specified key.

Args:
    test_size (float): Proportion of test set, defaults to 0.1 (i.e., 10%).
    seed (int): Random seed for reproducible splitting, defaults to 42.
    stratify_key (str, optional): Key name for stratified sampling, defaults to None.
    train_output_file (str, optional): Output file path for training set, defaults to None.
    test_output_file (str, optional): Output file path for test set, defaults to None.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: All samples from both training and test sets, with a 'split' field added
to indicate which set each sample belongs to.


Examples:
    ```python
    from lazyllm.tools.data import embedding

    op = embedding.EmbeddingTrainTestSplitter(test_size=0.2, seed=123, train_output_file='train.jsonl', test_output_file='test.jsonl')
    data = [{'query': 'q1', 'pos': 'p1'}, {'query': 'q2', 'pos': 'p2'}, {'query': 'q3', 'pos': 'p3'}]
    result = op(data)
    # Returns all samples with 'split' field ('train' or 'test')
    # Saves train data to train.jsonl and test data to test.jsonl
    ```
    """
    def __init__(
        self,
        test_size: float = 0.1,
        seed: int = 42,
        stratify_key: Optional[str] = None,
        train_output_file: Optional[str] = None,
        test_output_file: Optional[str] = None,
        **kwargs,
    ):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.test_size = test_size
        self.seed = seed
        self.stratify_key = stratify_key
        self.train_output_file = train_output_file
        self.test_output_file = test_output_file
        LOG.info(
            f'Initializing {self.__class__.__name__} with test_size: {test_size}'
        )

    def forward_batch_input(
        self,
        inputs: List[dict],
        **kwargs,
    ) -> List[dict]:
        assert isinstance(inputs, list), 'inputs must be a list of dict'

        LOG.info(
            f'Splitting {len(inputs)} samples with test_size={self.test_size}'
        )

        # Shuffle and split
        random.seed(self.seed)
        shuffled = inputs.copy()
        random.shuffle(shuffled)

        split_idx = int(len(shuffled) * (1 - self.test_size))
        train_data = shuffled[:split_idx]
        test_data = shuffled[split_idx:]

        # Add split labels
        for item in train_data:
            item['split'] = 'train'
        for item in test_data:
            item['split'] = 'test'

        LOG.info(
            f'Split completed: {len(train_data)} train, {len(test_data)} test'
        )

        if self.train_output_file:
            output_path = Path(self.train_output_file)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                for item in train_data:
                    item_copy = {
                        k: v for k, v in item.items() if k != 'split'
                    }
                    f.write(
                        json.dumps(item_copy, ensure_ascii=False) + '\n'
                    )
            LOG.info(f'Saved train data to {output_path}')

        if self.test_output_file:
            output_path = Path(self.test_output_file)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                for item in test_data:
                    item_copy = {
                        k: v for k, v in item.items() if k != 'split'
                    }
                    f.write(
                        json.dumps(item_copy, ensure_ascii=False) + '\n'
                    )
            LOG.info(f'Saved test data to {output_path}')

        return train_data + test_data

Knowledge Cleaning

`lazyllm.tools.data.operators.knowledge_cleaning.file_or_url_to_markdown_converter_api`

`FileOrURLNormalizer`

Bases: kbc

File or URL normalizer operator.

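This operator classifies each input as a local file or a URL, determines its format, and normalizes it into a common record. It supports PDF, HTML/XML, and TXT/MD files as well as web URLs. Local image files (.png, .jpg, .jpeg, .webp, .gif) are also assigned the 'pdf' type. A PDF referenced by URL is downloaded to the intermediate directory first.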

Parameters:

intermediate_dir (str, default: 'intermediate' ) –

Directory for intermediate files, defaults to 'intermediate'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Normalized data containing the following fields:

_type: File type ('pdf', 'html', 'text', 'invalid', 'unsupported')
_raw_path: Local file path (if available)
_url: URL address (if web page)
_output_path: Expected Markdown output path
_error: Error message (if any)

Examples:

from lazyllm.tools.data import kbc

normalizer = kbc.FileOrURLNormalizer(intermediate_dir='./temp')

# For file input
data = {'source': '/path/to/document.pdf'}
result = normalizer(data)
# Returns: {'source': '/path/to/document.pdf', '_type': 'pdf', '_raw_path': '/path/to/document.pdf', '_output_path': './temp/document.md'}

# For URL input
data = {'source': 'https://example.com/page.html'}
result = normalizer(data)
# Returns: {'source': 'https://example.com/page.html', '_type': 'html', '_url': 'https://example.com/page.html', '_output_path': './temp/url_xxx.md'}
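
The type-detection rules can be summarized as a small standalone function. This is a sketch under stated assumptions: the function name `classify_source` is hypothetical, URL detection via `urlparse` and the '.pdf' path check stand in for helpers that are not shown on this page, and the file-existence check performed by the operator is left out so the example runs anywhere.

```python
from pathlib import Path
from urllib.parse import urlparse

PDF_LIKE = {'.pdf', '.png', '.jpg', '.jpeg', '.webp', '.gif'}
HTML_LIKE = {'.html', '.xml'}
TEXT_LIKE = {'.txt', '.md'}


def classify_source(src):
    """Map a path or URL to 'pdf', 'html', 'text', 'unsupported' or 'invalid'."""
    if not src:
        return 'invalid'

    parsed = urlparse(src)
    if parsed.scheme in ('http', 'https'):
        # Assumption: a URL whose path ends in .pdf is treated as a PDF download.
        return 'pdf' if parsed.path.lower().endswith('.pdf') else 'html'

    # Local path: decide by lowercase file extension.
    ext = Path(src).suffix.lower()
    if ext in PDF_LIKE:
        return 'pdf'
    if ext in HTML_LIKE:
        return 'html'
    if ext in TEXT_LIKE:
        return 'text'
    return 'unsupported'


for src in ['/path/to/document.pdf', 'scan.JPEG', 'notes.md',
            'https://example.com/page.html', 'data.csv']:
    print(src, '->', classify_source(src))
```

Images share the 'pdf' type because they are routed to the same PDF conversion step as real PDF files.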

Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py

class FileOrURLNormalizer(kbc):
    """File or URL normalizer operator.

This operator automatically identifies file format based on input type (file or URL) and performs normalization.
Supports PDF, HTML/XML, TXT/MD files, and web URLs. For network PDFs, they will be downloaded locally first.

Args:
    intermediate_dir (str): Directory for intermediate files, defaults to 'intermediate'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Normalized data containing the following fields:
    _type: File type ('pdf', 'html', 'text', 'invalid', 'unsupported')
    _raw_path: Local file path (if available)
    _url: URL address (if web page)
    _output_path: Expected Markdown output path
    _error: Error message (if any)


Examples:
    ```python
    from lazyllm.tools.data import kbc

    normalizer = kbc.FileOrURLNormalizer(intermediate_dir='./temp')

    # For file input
    data = {'source': '/path/to/document.pdf'}
    result = normalizer(data)
    # Returns: {'source': '/path/to/document.pdf', '_type': 'pdf', '_raw_path': '/path/to/document.pdf', '_output_path': './temp/document.md'}

    # For URL input
    data = {'source': 'https://example.com/page.html'}
    result = normalizer(data)
    # Returns: {'source': 'https://example.com/page.html', '_type': 'html', '_url': 'https://example.com/page.html', '_output_path': './temp/url_xxx.md'}
    ```
    """
    def __init__(self, intermediate_dir: str = 'intermediate', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.intermediate_dir = intermediate_dir
        os.makedirs(self.intermediate_dir, exist_ok=True)

    def forward(
        self,
        data: dict,
        input_key: str = 'source',
        **kwargs,
    ) -> dict:
        src = data.get(input_key, '')
        if not src:
            return {**data, '_type': 'invalid', '_error': 'Empty source'}

        result = data.copy()

        if _is_url(src):
            if _is_pdf_url(src):
                pdf_path = os.path.join(
                    self.intermediate_dir,
                    f'crawled_{id(data)}.pdf',
                )
                downloaded_path = _download_pdf(src, pdf_path)

                if downloaded_path:
                    result['_type'] = 'pdf'
                    result['_raw_path'] = downloaded_path
                else:
                    result['_type'] = 'invalid'
                    result['_error'] = 'Failed to download PDF from URL'
            else:
                result['_type'] = 'html'
                result['_url'] = src

        else:
            if not os.path.exists(src):
                result['_type'] = 'invalid'
                result['_error'] = f'File not found: {src}'
            else:
                ext = Path(src).suffix.lower()

                if ext in [
                    '.pdf',
                    '.png',
                    '.jpg',
                    '.jpeg',
                    '.webp',
                    '.gif',
                ]:
                    result['_type'] = 'pdf'
                    result['_raw_path'] = src

                elif ext in ['.html', '.xml']:
                    result['_type'] = 'html'
                    result['_raw_path'] = src

                elif ext in ['.txt', '.md']:
                    result['_type'] = 'text'
                    result['_raw_path'] = src

                else:
                    result['_type'] = 'unsupported'
                    result['_error'] = f'Unsupported file type: {ext}'

        if '_raw_path' in result:
            name = Path(result['_raw_path']).stem
            result['_output_path'] = os.path.join(
                self.intermediate_dir,
                f'{name}.md',
            )

        elif '_url' in result:
            result['_output_path'] = os.path.join(
                self.intermediate_dir,
                f'url_{id(data)}.md',
            )

        return result

`HTMLToMarkdownConverter`

Bases: kbc

HTML to Markdown converter operator.

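This operator uses the trafilatura library to extract content from HTML or XML and convert it to Markdown, including page metadata. It accepts local HTML files and web URLs. Records whose _type is not 'html' are returned unchanged. If a URL cannot be fetched, an error message is written to the output file instead of the page content.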

Parameters:

**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Converted data containing the following fields:

_markdown_path: Path to the generated Markdown file

Examples:

from lazyllm.tools.data import kbc

converter = kbc.HTMLToMarkdownConverter()

# After normalization
data = {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md'}
result = converter(data)
# Returns: {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}

Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py

class HTMLToMarkdownConverter(kbc):
    """HTML to Markdown converter operator.

This operator uses the trafilatura library to extract content from HTML or XML files and convert to Markdown format.
Supports local HTML files and web URLs, automatically handles page metadata.

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Converted data containing the following fields:
    _markdown_path: Path to the generated Markdown file


Examples:
    ```python
    from lazyllm.tools.data import kbc

    converter = kbc.HTMLToMarkdownConverter()

    # After normalization
    data = {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md'}
    result = converter(data)
    # Returns: {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(self, data: dict, **kwargs) -> dict:
        if data.get('_type', '') != 'html':
            return data

        url = data.get('_url')
        raw_path = data.get('_raw_path')
        output_path = data.get('_output_path', '')

        try:
            if url:
                downloaded = trafilatura.fetch_url(url)
                if not downloaded:
                    error_msg = (
                        'fail to fetch this url. '
                        'Please check your Internet Connection or URL correctness'
                    )
                    with open(output_path, 'w', encoding='utf-8') as f:
                        f.write(error_msg)
                    return {**data, '_markdown_path': output_path}

            elif raw_path:
                with open(raw_path, 'r', encoding='utf-8') as f:
                    downloaded = f.read()
            else:
                return {**data, '_markdown_path': ''}

            result = trafilatura.extract(
                downloaded,
                output_format='markdown',
                with_metadata=True,
            )

            if result:
                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(result)

                LOG.info(f'Extracted content written to {output_path}')
                return {**data, '_markdown_path': output_path}

            return {**data, '_markdown_path': ''}

        except Exception as e:
            LOG.error(f'Error extracting HTML/XML: {e}')
            return {**data, '_markdown_path': ''}

`PDFToMarkdownConverterAPI`

Bases: kbc

PDF to Markdown converter API operator.

This operator calls a MinerU service over its API to convert PDF files, including scanned documents and images, to Markdown. The backend engine and upload mode are configurable. Records whose _type is not 'pdf' are returned unchanged, and _markdown_path is an empty string when mineru_url is not set or the conversion fails.

Parameters:

mineru_url (str, default: None ) –

MinerU service URL. Required for conversion; defaults to None.
mineru_backend (str, default: 'vlm-vllm-async-engine' ) –

MinerU backend engine type, defaults to 'vlm-vllm-async-engine'.
upload_mode (bool, default: True ) –

Whether to use upload mode, defaults to True.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Converted data containing the following fields:

_markdown_path: Path to the generated Markdown file

Examples:

from lazyllm.tools.data import kbc

converter = kbc.PDFToMarkdownConverterAPI(
    mineru_url='your_mineru_url',
    mineru_backend='vlm-vllm-async-engine',
    upload_mode=True
)

# After normalization
data = {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md'}
result = converter(data)
# Returns: {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
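
Both the HTML and the PDF converter act only on records of their own `_type` and pass everything else through, so they can be applied one after another to a mixed batch. The sketch below illustrates that gating pattern with stand-in converters. `make_stage` and the lambda converters are hypothetical and do no real conversion.

```python
def make_stage(handled_type, convert):
    """Build a stage that converts records of one `_type` and passes others through."""
    def stage(record):
        if record.get('_type', '') != handled_type:
            return record  # not handled here: return the record unchanged
        try:
            return {**record, '_markdown_path': convert(record)}
        except Exception:
            return {**record, '_markdown_path': ''}  # empty path signals failure
    return stage


# Hypothetical converters: they only echo the expected output path.
html_stage = make_stage('html', lambda r: r['_output_path'])
pdf_stage = make_stage('pdf', lambda r: r['_output_path'])

records = [
    {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/doc.md'},
    {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/url.md'},
    {'_type': 'unsupported', '_error': 'Unsupported file type: .csv'},
]

results = [pdf_stage(html_stage(r)) for r in records]
converted = [r for r in results if r.get('_markdown_path')]
print(len(converted), 'of', len(results), 'records converted')
```

Filtering on a non-empty `_markdown_path` afterwards separates converted records from those that were skipped or failed.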

Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py

class PDFToMarkdownConverterAPI(kbc):
    """PDF to Markdown converter API operator.

This operator uses the MinerU service to convert PDF files (including scanned documents and images) to Markdown format.
Supports calling MinerU via API for PDF parsing, with configurable backend engine and upload mode.

Args:
    mineru_url (str): MinerU service URL address.
    mineru_backend (str): MinerU backend engine type, defaults to 'vlm-vllm-async-engine'.
    upload_mode (bool): Whether to use upload mode, defaults to True.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Converted data containing the following fields:
    _markdown_path: Path to the generated Markdown file


Examples:
    ```python
    from lazyllm.tools.data import kbc

    converter = kbc.PDFToMarkdownConverterAPI(
        mineru_url='your_mineru_url',
        mineru_backend='vlm-vllm-async-engine',
        upload_mode=True
    )

    # After normalization
    data = {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md'}
    result = converter(data)
    # Returns: {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
    ```
    """
    def __init__(
        self,
        mineru_url: str = None,
        mineru_backend: str = 'vlm-vllm-async-engine',
        upload_mode: bool = True,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.mineru_url = mineru_url
        self.mineru_backend = mineru_backend
        self.upload_mode = upload_mode

    def forward(self, data: dict, **kwargs) -> dict:
        if data.get('_type', '') != 'pdf':
            return data

        if self.mineru_url is None:
            LOG.error('mineru_url is required for PDF processing')
            return {**data, '_markdown_path': ''}

        try:
            from lazyllm.tools.rag import MineruPDFReader
        except ImportError:
            LOG.error('MineruPDFReader not available')
            return {**data, '_markdown_path': ''}

        raw_path = data.get('_raw_path')
        output_path = data.get('_output_path', '')

        if not raw_path:
            return {**data, '_markdown_path': ''}

        try:
            reader = MineruPDFReader(
                url=self.mineru_url,
                backend=self.mineru_backend,
                upload_mode=self.upload_mode,
                split_doc=False,
            )

            docs = reader(file=raw_path, use_cache=False)

            if not docs:
                LOG.warning(f'MinerU returned no documents for: {raw_path}')
                return {**data, '_markdown_path': ''}

            md_content = '\n'.join(
                doc.text for doc in docs if doc.text
            )

            if not md_content.strip():
                LOG.warning(
                    f'MinerU returned empty content for: {raw_path}',
                )
                return {**data, '_markdown_path': ''}

            os.makedirs(os.path.dirname(output_path), exist_ok=True)

            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(md_content)

            LOG.info(f'MinerU parsed: {raw_path} -> {output_path}')
            return {**data, '_markdown_path': output_path}

        except Exception as e:
            LOG.error(f'MinerU API failed for {raw_path}: {e}')
            return {**data, '_markdown_path': ''}

`lazyllm.tools.data.operators.knowledge_cleaning.kbc_chunk_generator_batch`

`KBCChunkText`

Bases: kbc

Text chunking operator.

This operator splits long text into chunks, supporting multiple chunking strategies:

- token: Token-based chunking
- sentence: Sentence boundary-based chunking
- semantic: Semantic similarity-based chunking
- recursive: Recursive chunking

Parameters:

chunk_size (int, default: 512 ) –

Maximum size of each chunk, defaults to 512.
chunk_overlap (int, default: 50 ) –

Overlap size between chunks, defaults to 50.
split_method (str, default: 'token' ) –

Chunking method, options: 'token', 'sentence', 'semantic', 'recursive', defaults to 'token'.
tokenizer_name (str, default: 'bert-base-uncased' ) –

Name of the tokenizer to use, defaults to 'bert-base-uncased'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing chunking results:

_chunks: List of chunked texts
_chunk_error: Chunking error message (if any)

Examples:

from lazyllm.tools.data import kbc

chunker = kbc.KBCChunkText(chunk_size=512, chunk_overlap=50, split_method='token')

data = {'_text_content': 'Long text content that needs to be chunked...'}
result = chunker(data)
# Returns: {'_text_content': 'Long text content...', '_chunks': ['chunk1', 'chunk2', ...]}
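
To make the interaction of chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch. It is not how the operator chunks text: whitespace-separated words stand in for tokenizer tokens, the function name `chunk_tokens` is hypothetical, and the exact boundaries produced by the underlying chunking library may differ.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be in [0, chunk_size)')

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


words = [f'w{i}' for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With ten tokens, a window of four and an overlap of one, each chunk starts three tokens after the previous one, and the last token of one chunk is repeated as the first token of the next.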

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py

class KBCChunkText(kbc):
    """Text chunking operator.

This operator splits long text into chunks, supporting multiple chunking strategies:
- token: Token-based chunking
- sentence: Sentence boundary-based chunking
- semantic: Semantic similarity-based chunking
- recursive: Recursive chunking

Args:
    chunk_size (int): Maximum size of each chunk, defaults to 512.
    chunk_overlap (int): Overlap size between chunks, defaults to 50.
    split_method (str): Chunking method, options: 'token', 'sentence', 'semantic', 'recursive', defaults to 'token'.
    tokenizer_name (str): Name of the tokenizer to use, defaults to 'bert-base-uncased'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing chunking results:
    _chunks: List of chunked texts
    _chunk_error: Chunking error message (if any)


Examples:
    ```python
    from lazyllm.tools.data import kbc

    chunker = kbc.KBCChunkText(chunk_size=512, chunk_overlap=50, split_method='token')

    data = {'_text_content': 'Long text content that needs to be chunked...'}
    result = chunker(data)
    # Returns: {'_text_content': 'Long text content...', '_chunks': ['chunk1', 'chunk2', ...]}
    ```
    """
    def __init__(
        self,
        chunk_size: int = 512,
        chunk_overlap: int = 50,
        split_method: str = 'token',
        tokenizer_name: str = 'bert-base-uncased',
        **kwargs,
    ):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.split_method = split_method
        self.tokenizer_name = tokenizer_name
        self._chunker = None
        self._tokenizer = None

    def _ensure_initialized(self):
        if self._tokenizer is None:
            self._tokenizer = transformers.AutoTokenizer.from_pretrained(self.tokenizer_name)
            self._chunker = self._initialize_chunker()

    def _initialize_chunker(self):
        if self.split_method == 'token':
            return chonkie.TokenChunker(
                tokenizer=self._tokenizer,
                chunk_size=self.chunk_size,
                chunk_overlap=self.chunk_overlap,
            )

        if self.split_method == 'sentence':
            return chonkie.SentenceChunker(
                chunk_size=self.chunk_size,
                chunk_overlap=self.chunk_overlap,
            )

        if self.split_method == 'semantic':
            return chonkie.SemanticChunker(
                chunk_size=self.chunk_size,
            )

        if self.split_method == 'recursive':
            return chonkie.RecursiveChunker(
                chunk_size=self.chunk_size,
                chunk_overlap=self.chunk_overlap,
            )

        raise ValueError(f'Unsupported split method: {self.split_method}')

    def forward(
        self,
        data: dict,
        **kwargs,
    ) -> dict:
        text = data.get('_text_content', '')
        if not text:
            return {**data, '_chunks': []}

        self._ensure_initialized()

        try:
            tokens = self._tokenizer.encode(text)
            total_tokens = len(tokens)
            max_tokens = self._tokenizer.model_max_length

            if total_tokens <= max_tokens:
                chunks = self._chunker(text)
            else:
                x = (total_tokens + max_tokens - 1) // max_tokens
                words = text.split()
                words_per_chunk = (len(words) + x - 1) // x

                chunks = []
                for j in range(0, len(words), words_per_chunk):
                    chunk_text = ' '.join(words[j:j + words_per_chunk])
                    chunks.extend(self._chunker(chunk_text))

            chunk_texts = [chunk.text for chunk in chunks]
            LOG.info(f'Split text into {len(chunks)} chunks.')
            return {**data, '_chunks': chunk_texts}

        except Exception as e:
            LOG.error(f'Error chunking text: {e}')
            return {**data, '_chunks': [], '_chunk_error': str(e)}
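
When the text exceeds the tokenizer's model_max_length, the forward method above first cuts it into word-based segments and chunks each segment separately. The arithmetic can be sketched on its own as follows. The function name `presplit_by_words` is hypothetical, and the guard for empty input is an addition made here.

```python
def presplit_by_words(text, total_tokens, max_tokens):
    """Split text into roughly equal word groups when it exceeds the token limit."""
    if total_tokens <= max_tokens:
        return [text]

    words = text.split()
    if not words:
        return []  # guard added in this sketch: nothing to split

    # Ceiling divisions: number of segments, then words per segment.
    n_segments = (total_tokens + max_tokens - 1) // max_tokens
    words_per_segment = (len(words) + n_segments - 1) // n_segments
    return [
        ' '.join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


text = ' '.join(f'w{i}' for i in range(10))
segments = presplit_by_words(text, total_tokens=25, max_tokens=10)
print([len(s.split()) for s in segments])  # [4, 4, 2]
```

Because the split counts words rather than tokens, a segment is not guaranteed to stay under the token limit. Rejoining with single spaces also collapses the original whitespace, and chunks never overlap across a segment boundary.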

`KBCLoadText`

Bases: kbc

Operator for loading text file content.

This operator loads text file content from the specified path, supporting multiple file formats:

- .txt, .md, .xml: Direct text content reading
- .json, .jsonl: Extract and merge content from specified text fields

Parameters:

**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing loading results:

_text_content: Loaded text content
_load_error: Loading error message (if any)

Examples:

from lazyllm.tools.data import kbc

loader = kbc.KBCLoadText()

# Load text file
data = {'text_path': '/path/to/document.txt'}
result = loader(data)
# Returns: {'text_path': '/path/to/document.txt', '_text_content': 'file content...'}

# Load JSON file
data = {'text_path': '/path/to/data.json'}
result = loader(data)
# Returns: {'text_path': '/path/to/data.json', '_text_content': 'extracted text...'}
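
For JSON and JSONL inputs, the text is taken from the first of the fields 'text', 'content' or 'body' that is present. The following sketch shows that lookup on a temporary JSONL file. The function name `extract_text` is hypothetical, and skipping blank lines and records that lack the field are choices made in this sketch.

```python
import json
import os
import tempfile

TEXT_FIELDS = ('text', 'content', 'body')


def extract_text(path):
    """Return merged text from a .json or .jsonl file, or None if no known field exists."""
    with open(path, 'r', encoding='utf-8') as f:
        if path.endswith('.jsonl'):
            # One JSON object per line; blank lines are skipped in this sketch.
            records = [json.loads(line) for line in f if line.strip()]
        else:
            records = json.load(f)

    for field in TEXT_FIELDS:
        if isinstance(records, list) and records and isinstance(records[0], dict) \
                and field in records[0]:
            # Join the field across records, skipping records that lack it.
            return '\n'.join(str(r[field]) for r in records if field in r)
        if isinstance(records, dict) and field in records:
            return records[field]
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.jsonl')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('{"content": "first"}\n{"content": "second"}\n')
    text = extract_text(path)

print(repr(text))  # 'first\nsecond'
```

The field is chosen by inspecting the first record only, so a list whose first record uses 'content' is read entirely through that field.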

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py

class KBCLoadText(kbc):
    """Operator for loading text file content.

This operator loads text file content from the specified path, supporting multiple file formats:
- .txt, .md, .xml: Direct text content reading
- .json, .jsonl: Extract and merge content from specified text fields

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing loading results:
    _text_content: Loaded text content
    _load_error: Loading error message (if any)


Examples:
    ```python
    from lazyllm.tools.data import kbc

    loader = kbc.KBCLoadText()

    # Load text file
    data = {'text_path': '/path/to/document.txt'}
    result = loader(data)
    # Returns: {'text_path': '/path/to/document.txt', '_text_content': 'file content...'}

    # Load JSON file
    data = {'text_path': '/path/to/data.json'}
    result = loader(data)
    # Returns: {'text_path': '/path/to/data.json', '_text_content': 'extracted text...'}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(
        self,
        data: dict,
        input_key: str = 'text_path',
        **kwargs,
    ) -> dict:
        text_path = data.get(input_key, '')
        if not text_path:
            return {**data, '_text_content': '', '_load_error': 'Empty text path'}

        if not os.path.exists(text_path):
            LOG.error(f'Input file not found: {text_path}')
            return {
                **data,
                '_text_content': '',
                '_load_error': f'File not found: {text_path}',
            }

        try:
            if text_path.endswith(('.txt', '.md', '.xml')):
                with open(text_path, 'r', encoding='utf-8') as f:
                    text = f.read()
                return {**data, '_text_content': text}

            if text_path.endswith(('.json', '.jsonl')):
                with open(text_path, 'r', encoding='utf-8') as f:
                    if text_path.endswith('.json'):
                        file_data = json.load(f)
                    else:
                        file_data = [json.loads(line) for line in f]

                text_fields = ['text', 'content', 'body']
                for field in text_fields:
                    if isinstance(file_data, list) and file_data and field in file_data[0]:
                        text = '\n'.join(item[field] for item in file_data)
                        return {**data, '_text_content': text}
                    if isinstance(file_data, dict) and field in file_data:
                        text = file_data[field]
                        return {**data, '_text_content': text}

                LOG.error(f'No text field found in {text_path}')
                return {
                    **data,
                    '_text_content': '',
                    '_load_error': 'No text field found',
                }

            LOG.error(f'Unsupported file format for {text_path}')
            return {
                **data,
                '_text_content': '',
                '_load_error': 'Unsupported format',
            }

        except Exception as e:
            LOG.error(f'Error loading {text_path}: {e}')
            return {
                **data,
                '_text_content': '',
                '_load_error': str(e),
            }

`KBCSaveChunks`

Bases: kbc

Operator for saving text chunking results.

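This operator saves the chunks of each input record to a JSON file named after the original file with a '_chunk.json' suffix, storing each chunk as an object of the form {'raw_chunk': ...}. When output_dir is given, the original file's directory structure is reproduced beneath it; otherwise the file is written to an 'extract' subdirectory next to the original file. If there are no chunks to save, or saving fails, chunk_path is set to an empty string. In all cases the intermediate keys _text_content, _load_error, _chunks, and _chunk_error are removed from the returned record.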

Parameters:

output_dir (str, default: None ) –

Output directory path, defaults to None (save to 'extract' subdirectory of the original file's directory).
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing save results:
chunk_path –

Path to the saved JSON file

Examples:

from lazyllm.tools.data import kbc

saver = kbc.KBCSaveChunks(output_dir='./output')

data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1', 'chunk2']}
result = saver(data)
# Returns: {'text_path': '/path/to/doc.txt', 'chunk_path': './output/path/to/doc_chunk.json'}
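
How the output location is derived may be easier to follow in isolation. The sketch below reproduces the path logic from the source listing on a POSIX system without writing any files; `resolve_chunk_path` is a hypothetical helper, and its `cwd` parameter is an assumption added here so the result does not depend on the current working directory.

```python
import os


def resolve_chunk_path(text_path, output_dir=None, cwd=None):
    """Compute where the chunk file would be written (no I/O is performed)."""
    cwd = os.path.abspath(cwd or os.getcwd())
    if output_dir:
        abs_text_path = os.path.abspath(text_path)
        src_dir = os.path.dirname(abs_text_path)
        if abs_text_path.startswith(cwd):
            # File lives under the working directory: keep the relative layout.
            rel_path = os.path.relpath(src_dir, cwd)
        else:
            # File lives elsewhere: reuse its absolute directory without the leading '/'.
            rel_path = src_dir.lstrip('/')
        target_dir = os.path.join(output_dir, rel_path)
    else:
        # Default: an 'extract' folder next to the original file.
        target_dir = os.path.join(os.path.dirname(text_path), 'extract')
    stem = os.path.splitext(os.path.basename(text_path))[0]
    return os.path.join(target_dir, stem + '_chunk.json')


outside = resolve_chunk_path('/path/to/doc.txt', './output', cwd='/work')
inside = resolve_chunk_path('/work/docs/a.md', 'out', cwd='/work')
default = resolve_chunk_path('/path/to/doc.txt')
print(outside, inside, default, sep='\n')
```

The three calls cover a file outside the working directory, a file inside it, and the default 'extract' location.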

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py

class KBCSaveChunks(kbc):
    """Operator for saving text chunking results.

This operator saves chunked texts as JSON files, with each chunk as a JSON object.
Supports specifying output directory, preserving the relative path structure of the original file.

Args:
    output_dir (str, optional): Output directory path, defaults to None (save to 'extract' subdirectory of the original file's directory).
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing save results:
    chunk_path: Path to the saved JSON file


Examples:
    ```python
    from lazyllm.tools.data import kbc

    saver = kbc.KBCSaveChunks(output_dir='./output')

    data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1', 'chunk2']}
    result = saver(data)
    # Returns: {'text_path': '/path/to/doc.txt', 'chunk_path': './output/path/to/doc_chunk.json'}
    ```
    """
    def __init__(self, output_dir: Optional[str] = None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_dir = output_dir

    def forward(
        self,
        data: dict,
        input_key: str = 'text_path',
        output_key: str = 'chunk_path',
        **kwargs,
    ) -> dict:
        chunks = data.get('_chunks', [])
        text_path = data.get(input_key, '')

        result = data.copy()

        if not chunks:
            LOG.warning(f'No chunks to save for {text_path}')
            result[output_key] = ''
            for key in ['_text_content', '_load_error', '_chunks', '_chunk_error']:
                result.pop(key, None)
            return result

        try:
            # Determine output directory
            if self.output_dir:
                # Use specified output directory, preserving relative structure
                abs_text_path = os.path.abspath(text_path)
                abs_cwd = os.path.abspath(os.getcwd())

                if abs_text_path.startswith(abs_cwd):
                    rel_path = os.path.relpath(os.path.dirname(abs_text_path), abs_cwd)
                else:
                    rel_path = os.path.dirname(abs_text_path).lstrip('/')

                output_dir = os.path.join(self.output_dir, rel_path)
            else:
                # Default: save to 'extract' subdirectory
                output_dir = os.path.join(os.path.dirname(text_path), 'extract')

            os.makedirs(output_dir, exist_ok=True)

            file_name = os.path.splitext(os.path.basename(text_path))[0] + '_chunk.json'
            output_path = os.path.join(output_dir, file_name)

            json_chunks = [{'raw_chunk': chunk} for chunk in chunks]

            with open(output_path, 'w', encoding='utf-8') as f:
                json.dump(json_chunks, f, ensure_ascii=False, indent=4)

            LOG.info(f'Saved {len(chunks)} chunks to {output_path}')

            result[output_key] = output_path
            for key in ['_text_content', '_load_error', '_chunks', '_chunk_error']:
                result.pop(key, None)

            return result

        except Exception as e:
            LOG.error(f'Error saving chunks: {e}')
            result[output_key] = ''
            for key in ['_text_content', '_load_error', '_chunks', '_chunk_error']:
                result.pop(key, None)
            return result

`lazyllm.tools.data.operators.knowledge_cleaning.kbc_chunk_generator`

`KBCExpandChunks`

Bases: kbc

Operator that expands chunked text into independent records.

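This operator expands a data record containing multiple text chunks into multiple independent records, each holding one chunk under output_key. The _text_content and _chunks keys are dropped from the expanded records, and a record without chunks produces an empty list. Suitable for scenarios where chunked texts need to be processed as independent samples.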

Parameters:

output_key (str, default: 'raw_chunk' ) –

Output key name for storing chunk text, defaults to 'raw_chunk'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

List of expanded independent data records, each containing one chunk.

Examples:

from lazyllm.tools.data import kbc

expander = kbc.KBCExpandChunks(output_key='raw_chunk')

data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1 content', 'chunk2 content', 'chunk3 content']}
result = expander(data)
# Returns: [
#   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk1 content'},
#   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk2 content'},
#   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk3 content'}
# ]
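
The expansion itself is a small loop. As a rough standalone sketch of the behaviour described above (the function name `expand_chunks` is hypothetical and the code does not call lazyllm):

```python
def expand_chunks(record, output_key='raw_chunk'):
    """Turn one record with a '_chunks' list into one record per chunk."""
    rows = []
    for chunk_text in record.get('_chunks', []):
        row = record.copy()              # shallow copy keeps the other fields
        row[output_key] = chunk_text
        row.pop('_text_content', None)   # intermediate keys are dropped
        row.pop('_chunks', None)
        rows.append(row)
    return rows


record = {
    'text_path': '/path/to/doc.txt',
    '_text_content': 'full text',
    '_chunks': ['chunk1 content', 'chunk2 content'],
}
expanded = expand_chunks(record)
for row in expanded:
    print(row)
```

Each output row keeps `text_path`, so downstream operators can still trace a chunk back to its source file.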

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator.py

class KBCExpandChunks(kbc):
    """Operator that expands chunked text into independent records.

This operator expands data records containing multiple text chunks into multiple independent data records,
with each record containing one chunk. Suitable for scenarios where chunked texts need to be processed
as independent samples.

Args:
    output_key (str): Output key name for storing chunk text, defaults to 'raw_chunk'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: List of expanded independent data records, each containing one chunk.


Examples:
    ```python
    from lazyllm.tools.data import kbc

    expander = kbc.KBCExpandChunks(output_key='raw_chunk')

    data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1 content', 'chunk2 content', 'chunk3 content']}
    result = expander(data)
    # Returns: [
    #   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk1 content'},
    #   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk2 content'},
    #   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk3 content'}
    # ]
    ```
    """
    def __init__(self, output_key: str = 'raw_chunk', **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.output_key = output_key

    def forward(
        self,
        data: dict,
        **kwargs,
    ) -> List[dict]:
        chunks = data.get('_chunks', [])

        if not chunks:
            return []

        new_records = []
        for chunk_text in chunks:
            new_row = data.copy()
            new_row[self.output_key] = chunk_text
            new_row.pop('_text_content', None)
            new_row.pop('_chunks', None)
            new_records.append(new_row)

        return new_records

`lazyllm.tools.data.operators.knowledge_cleaning.kbc_multihop_qa_generator_batch`

`KBCExtractInfoPairs`

Bases: kbc

Information pair extraction operator.

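This operator extracts information pairs from preprocessed text for multi-hop QA generation. Text is split into sentences on '.' when lang is 'en' and on '。' otherwise. Each run of three consecutive sentences yields a premise-intermediate-conclusion triple, provided the premise and the intermediate sentence are each longer than 10 characters. Up to two sentences other than the premise and intermediate, each longer than 10 characters, are attached as related contexts.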

Parameters:

lang (str, default: 'en' ) –

Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing information pairs:
_info_pairs –

List of information pairs, each containing premise, intermediate, conclusion, and related_contexts

Examples:

from lazyllm.tools.data import kbc

extractor = kbc.KBCExtractInfoPairs(lang='en')

data = {'_processed_chunks': [{'text': 'First sentence. Second sentence. Third sentence.', 'original_data': {}}]}
result = extractor(data)
# Returns: {'_processed_chunks': [...], '_info_pairs': [{'premise': 'First sentence', 'intermediate': 'Second sentence', 'conclusion': 'Third sentence', 'related_contexts': ['Third sentence'], 'original_data': {}}]}
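
The sliding-window extraction can be sketched on its own. The following standalone function follows the logic in the source listing for a single text; `extract_info_pairs` is a hypothetical name, and the `min_len` parameter is an assumption introduced here (the operator hard-codes the threshold of 10 characters).

```python
def extract_info_pairs(text, lang='en', min_len=10):
    """Build premise/intermediate/conclusion triples from consecutive sentences."""
    delimiter = '.' if lang == 'en' else '。'
    sentences = [s.strip() for s in text.split(delimiter) if s.strip()]

    pairs = []
    # Window of three sentences: i, i + 1, i + 2.
    for i in range(len(sentences) - 2):
        if len(sentences[i]) > min_len and len(sentences[i + 1]) > min_len:
            pairs.append({
                'premise': sentences[i],
                'intermediate': sentences[i + 1],
                'conclusion': sentences[i + 2],
                # Any other sufficiently long sentence counts as context,
                # including the conclusion itself; at most two are kept.
                'related_contexts': [
                    s for j, s in enumerate(sentences)
                    if j not in (i, i + 1) and len(s) > min_len
                ][:2],
            })
    return pairs


pairs = extract_info_pairs('First sentence. Second sentence. Third sentence.')
print(pairs)
```

Texts with fewer than three sentences produce no pairs, since the window never fits.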

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py

class KBCExtractInfoPairs(kbc):
    """Information pair extraction operator.

This operator extracts information pairs from preprocessed text for multi-hop QA generation.
Uses different sentence delimiters based on language type (Chinese or English),
extracting premise-intermediate-conclusion triples and related contexts.

Args:
    lang (str): Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing information pairs:
    _info_pairs: List of information pairs, each containing premise, intermediate, conclusion, and related_contexts


Examples:
    ```python
    from lazyllm.tools.data import kbc

    extractor = kbc.KBCExtractInfoPairs(lang='en')

    data = {'_processed_chunks': [{'text': 'First sentence. Second sentence. Third sentence.', 'original_data': {}}]}
    result = extractor(data)
    # Returns: {'_processed_chunks': [...], '_info_pairs': [{'premise': 'First sentence', 'intermediate': 'Second sentence', 'conclusion': 'Third sentence', 'related_contexts': ['Third sentence'], 'original_data': {}}]}
    ```
    """
    def __init__(self, lang: str = 'en', **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.lang = lang

    def forward(self, data: dict, **kwargs) -> dict:
        processed_chunks = data.get('_processed_chunks', [])
        if not processed_chunks:
            return {**data, '_info_pairs': []}

        all_info_pairs = []

        for chunk in processed_chunks:
            text = chunk.get('text', '')
            original_data = chunk.get('original_data', {})

            if self.lang == 'en':
                sentences = [s.strip() for s in text.split('.') if s.strip()]
            else:
                sentences = [s.strip() for s in text.split('。') if s.strip()]

            for i in range(len(sentences) - 2):
                if len(sentences[i]) > 10 and len(sentences[i + 1]) > 10:
                    info_pair = {
                        'premise': sentences[i],
                        'intermediate': sentences[i + 1],
                        'conclusion': (
                            sentences[i + 2]
                            if i + 2 < len(sentences)
                            else ''
                        ),
                        'related_contexts': [
                            s
                            for j, s in enumerate(sentences)
                            if j not in (i, i + 1) and len(s) > 10
                        ][:2],
                        'original_data': original_data,
                    }
                    all_info_pairs.append(info_pair)

        return {**data, '_info_pairs': all_info_pairs}

`KBCGenerateMultiHopQA`

Bases: kbc

Multi-hop QA generation operator.

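This operator uses an LLM to generate multi-hop QA pairs from extracted information pairs. A multi-hop question requires several reasoning steps to answer, which makes such pairs suitable for training complex QA models. Information pairs for which the LLM call fails are skipped with a warning and do not appear in the results.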

Parameters:

llm –

LLM service instance for generating QA pairs.
lang (str, default: 'en' ) –

Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing generated QA results:
_qa_results –

List of QA results, each containing response and info_pair

Examples:

from lazyllm.tools.data import kbc

# Assuming llm is an LLM service instance
generator = kbc.KBCGenerateMultiHopQA(llm=llm, lang='en')

data = {'_info_pairs': [{'premise': 'A', 'intermediate': 'B', 'conclusion': 'C', 'original_data': {}}]}
result = generator(data)
# Returns: {'_info_pairs': [...], '_qa_results': [{'response': {...}, 'info_pair': {...}}]}
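
To show the per-pair loop without a real model, the sketch below substitutes a stub for the LLM. It is a simplified illustration under stated assumptions: `generate_qa` and `fake_llm` are hypothetical, and the prompt-building step (`MultiHopQABuilderPrompt.build_prompt` in the source listing) is omitted, so the raw context string is passed to the stub directly.

```python
def generate_qa(info_pairs, llm_call):
    """Call llm_call once per info pair; pairs whose call fails are skipped."""
    results = []
    for pair in info_pairs:
        # Context is the three sentences joined in order.
        context = f"{pair['premise']}. {pair['intermediate']}. {pair['conclusion']}"
        try:
            results.append({'response': llm_call(context), 'info_pair': pair})
        except Exception as exc:
            # No fallback entry is recorded for a failed call.
            print(f'Failed to generate QA: {exc}')
    return results


def fake_llm(context):
    """Stand-in for an LLM service that returns already-parsed JSON."""
    if 'boom' in context:
        raise RuntimeError('simulated failure')
    return {'question': f'What follows from: {context}?', 'answer': 'C'}


info_pairs = [
    {'premise': 'A', 'intermediate': 'B', 'conclusion': 'C', 'original_data': {}},
    {'premise': 'boom', 'intermediate': 'B', 'conclusion': 'C', 'original_data': {}},
]
qa_results = generate_qa(info_pairs, fake_llm)
print(len(qa_results))
```

Only the first pair produces a result here; the second is dropped because the stub raises.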

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py

class KBCGenerateMultiHopQA(kbc):
    """Multi-hop QA generation operator.

This operator uses LLM to generate multi-hop QA pairs based on extracted information pairs.
Multi-hop QA requires multiple reasoning steps to answer, suitable for training complex QA models.

Args:
    llm: LLM service instance for generating QA pairs.
    lang (str): Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing generated QA results:
    _qa_results: List of QA results, each containing response and info_pair


Examples:
    ```python
    from lazyllm.tools.data import kbc

    # Assuming llm is an LLM service instance
    generator = kbc.KBCGenerateMultiHopQA(llm=llm, lang='en')

    data = {'_info_pairs': [{'premise': 'A', 'intermediate': 'B', 'conclusion': 'C', 'original_data': {}}]}
    result = generator(data)
    # Returns: {'_info_pairs': [...], '_qa_results': [{'response': {...}, 'info_pair': {...}}]}
    ```
    """
    def __init__(self, llm=None, lang: str = 'en', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.prompt_template = MultiHopQABuilderPrompt(lang=lang)

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = (
                llm.share()
                .prompt(system_prompt)
                .formatter(JsonFormatter())
            )
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict, **kwargs) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        info_pairs = data.get('_info_pairs', [])
        if not info_pairs:
            return {**data, '_qa_results': []}

        qa_results = []

        for pair in info_pairs:
            # Build context from info pair
            context = (
                f"{pair['premise']}. "
                f"{pair['intermediate']}. "
                f"{pair['conclusion']}"
            )

            # Build prompt for this info pair
            user_prompt = self.prompt_template.build_prompt(context)

            try:
                response = self._llm_serve(user_prompt)

                qa_results.append(
                    {
                        'response': response,
                        'info_pair': pair,
                    }
                )

            except Exception as e:
                LOG.warning(f'Failed to generate QA: {e}')

        return {**data, '_qa_results': qa_results}

`KBCLoadChunkFile`

Bases: kbc

Chunk file loading operator.

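This operator loads JSON or JSONL format chunk files from the specified path. Supports chunk result files generated from the knowledge base cleaning process. If the path is empty, does not exist, has an unsupported extension, or cannot be parsed, _chunks_data is an empty list and _chunk_path is not set.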

Parameters:

**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing chunk data:
_chunks_data –

List of chunk data
_chunk_path –

Chunk file path

Examples:

from lazyllm.tools.data import kbc

loader = kbc.KBCLoadChunkFile()

data = {'chunk_path': '/path/to/chunks.json'}
result = loader(data)
# Returns: {'chunk_path': '/path/to/chunks.json', '_chunks_data': [...], '_chunk_path': '/path/to/chunks.json'}
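
The difference between the two accepted formats is only in how the file is parsed. A minimal standalone sketch, assuming temporary files created just for the demonstration (`load_chunk_file` is a hypothetical helper, not the operator itself):

```python
import json
import os
import tempfile


def load_chunk_file(chunk_path):
    """Return the parsed chunk list, or [] for any invalid input."""
    if not chunk_path or not os.path.exists(chunk_path):
        return []
    try:
        if chunk_path.endswith('.json'):
            # One JSON document holding the whole list.
            with open(chunk_path, 'r', encoding='utf-8') as f:
                return json.load(f)
        if chunk_path.endswith('.jsonl'):
            # One JSON object per line.
            with open(chunk_path, 'r', encoding='utf-8') as f:
                return [json.loads(line) for line in f]
        return []  # unsupported extension
    except Exception:
        return []  # unreadable or malformed file


with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, 'chunks.json')
    jsonl_path = os.path.join(tmp, 'chunks.jsonl')
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump([{'raw_chunk': 'a'}, {'raw_chunk': 'b'}], f)
    with open(jsonl_path, 'w', encoding='utf-8') as f:
        f.write('{"raw_chunk": "a"}\n{"raw_chunk": "b"}\n')

    from_json = load_chunk_file(json_path)
    from_jsonl = load_chunk_file(jsonl_path)
    missing = load_chunk_file(os.path.join(tmp, 'absent.json'))

print(from_json == from_jsonl, missing)
```

Both files yield the same list of records, and a missing path yields an empty list rather than an exception.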

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py

class KBCLoadChunkFile(kbc):
    """Chunk file loading operator.

This operator loads JSON or JSONL format chunk files from the specified path.
Supports chunk result files generated from the knowledge base cleaning process.

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing chunk data:
    _chunks_data: List of chunk data
    _chunk_path: Chunk file path


Examples:
    ```python
    from lazyllm.tools.data import kbc

    loader = kbc.KBCLoadChunkFile()

    data = {'chunk_path': '/path/to/chunks.json'}
    result = loader(data)
    # Returns: {'chunk_path': '/path/to/chunks.json', '_chunks_data': [...], '_chunk_path': '/path/to/chunks.json'}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(
        self,
        data: dict,
        input_key: str = 'chunk_path',
        **kwargs,
    ) -> dict:
        import os

        chunk_path = data.get(input_key, '')

        if not chunk_path or not os.path.exists(chunk_path):
            LOG.warning(f'Invalid chunk path: {chunk_path}')
            return {**data, '_chunks_data': []}

        try:
            if str(chunk_path).endswith('.json'):
                with open(chunk_path, 'r', encoding='utf-8') as f:
                    file_data = json.load(f)
            elif str(chunk_path).endswith('.jsonl'):
                with open(chunk_path, 'r', encoding='utf-8') as f:
                    file_data = [json.loads(line) for line in f]
            else:
                LOG.warning(f'Unsupported file format: {chunk_path}')
                return {**data, '_chunks_data': []}

            return {
                **data,
                '_chunks_data': file_data,
                '_chunk_path': chunk_path,
            }

        except Exception as e:
            LOG.error(f'Error loading chunk file {chunk_path}: {e}')
            return {**data, '_chunks_data': []}

`KBCPreprocessText`

Bases: kbc

Text preprocessing operator.

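This operator preprocesses loaded chunk texts by filtering them on length. After surrounding whitespace is stripped, a chunk is kept only if its length lies between min_length and max_length (both inclusive); values that are not strings are skipped. This avoids processing text that is too short or too long.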

Parameters:

min_length (int, default: 100 ) –

Minimum text length, defaults to 100.
max_length (int, default: 200000 ) –

Maximum text length, defaults to 200000.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing preprocessing results:
_processed_chunks –

List of preprocessed chunks

Examples:

from lazyllm.tools.data import kbc

processor = kbc.KBCPreprocessText(min_length=50, max_length=10000)

data = {'_chunks_data': [{'cleaned_chunk': 'Short text.'}, {'cleaned_chunk': 'A much longer text that meets the length requirements and will be processed.'}]}
result = processor(data, text_field='cleaned_chunk')
# Returns: {'_chunks_data': [...], '_processed_chunks': [{'text': 'A much longer text...', 'original_data': {...}}]}
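
The length filter can be sketched as a plain function. This is a standalone illustration of the rule in the source listing, with `filter_chunks` as a hypothetical name; it does not depend on lazyllm.

```python
def filter_chunks(chunks_data, text_field='cleaned_chunk',
                  min_length=100, max_length=200000):
    """Keep chunks whose stripped text length is within [min_length, max_length]."""
    processed = []
    for item in chunks_data:
        text = item.get(text_field, '')
        if not isinstance(text, str):
            continue  # non-string values are ignored
        text = text.strip()
        if min_length <= len(text) <= max_length:
            processed.append({'text': text, 'original_data': item})
    return processed


chunks_data = [
    {'cleaned_chunk': 'Short text.'},
    {'cleaned_chunk': 'A much longer text that meets the length requirements and will be processed.'},
    {'cleaned_chunk': 12345},
]
kept = filter_chunks(chunks_data, min_length=50, max_length=10000)
print(len(kept))
```

Both bounds are inclusive, so a chunk of exactly `min_length` characters is retained.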

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py

class KBCPreprocessText(kbc):
    """Text preprocessing operator.

This operator preprocesses loaded chunk texts, filtering chunks based on length.
Only retains chunks within the specified length range, avoiding processing text that is too short or too long.

Args:
    min_length (int): Minimum text length, defaults to 100.
    max_length (int): Maximum text length, defaults to 200000.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing preprocessing results:
    _processed_chunks: List of preprocessed chunks


Examples:
    ```python
    from lazyllm.tools.data import kbc

    processor = kbc.KBCPreprocessText(min_length=50, max_length=10000)

    data = {'_chunks_data': [{'cleaned_chunk': 'Short text.'}, {'cleaned_chunk': 'A much longer text that meets the length requirements and will be processed.'}]}
    result = processor(data, text_field='cleaned_chunk')
    # Returns: {'_chunks_data': [...], '_processed_chunks': [{'text': 'A much longer text...', 'original_data': {...}}]}
    ```
    """
    def __init__(self, min_length: int = 100, max_length: int = 200000, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.min_length = min_length
        self.max_length = max_length

    def forward(
        self,
        data: dict,
        text_field: str = 'cleaned_chunk',
        **kwargs,
    ) -> dict:
        chunks_data = data.get('_chunks_data', [])
        if not chunks_data:
            return {**data, '_processed_chunks': []}

        processed = []
        for item in chunks_data:
            text = item.get(text_field, '')
            if not isinstance(text, str):
                continue

            text = text.strip()
            if self.min_length <= len(text) <= self.max_length:
                processed.append(
                    {
                        'text': text,
                        'original_data': item,
                    }
                )

        return {**data, '_processed_chunks': processed}

`KBCSaveEnhanced`

Bases: kbc

Enhanced data saving operator.

This operator merges generated QA pairs with original chunk data and saves them as enhanced chunk files. Supports specifying output directory, preserving the relative path structure of the original file.

Parameters:

output_dir (str, default: None ) –

Output directory path, defaults to None (save to the original file's directory).
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing save results:
enhanced_chunk_path –

Path to the enhanced chunk file

Examples:

from lazyllm.tools.data import kbc

saver = kbc.KBCSaveEnhanced(output_dir='./enhanced_output')

data = {'_chunk_path': '/path/to/chunks.json', '_chunks_data': [{'id': 1, 'text': 'chunk1'}], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'Q1', 'answer': 'A1'}}]}
result = saver(data, output_key='enhanced_chunk_path')
# Returns: {'enhanced_chunk_path': './enhanced_output/path/to/chunks_enhanced.json'}

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py

class KBCSaveEnhanced(kbc):
    """Enhanced data saving operator.

This operator merges generated QA pairs with original chunk data and saves them as enhanced chunk files.
Supports specifying output directory, preserving the relative path structure of the original file.

Args:
    output_dir (str, optional): Output directory path, defaults to None (save to the original file's directory).
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing save results:
    enhanced_chunk_path: Path to the enhanced chunk file


Examples:
    ```python
    from lazyllm.tools.data import kbc

    saver = kbc.KBCSaveEnhanced(output_dir='./enhanced_output')

    data = {'_chunk_path': '/path/to/chunks.json', '_chunks_data': [{'id': 1, 'text': 'chunk1'}], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'Q1', 'answer': 'A1'}}]}
    result = saver(data, output_key='enhanced_chunk_path')
    # Returns: {'enhanced_chunk_path': './enhanced_output/path/to/chunks_enhanced.json'}
    ```
    """
    def __init__(self, output_dir: Optional[str] = None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_dir = output_dir

    def forward(self, data: dict, output_key: str = 'enhanced_chunk_path', **kwargs) -> dict:
        chunk_path = data.get('_chunk_path', '')
        result = data.copy()

        if not chunk_path:
            return _clean_enhanced_result(result, output_key)

        try:
            enhanced_data = _build_enhanced_data(
                data.get('_chunks_data', []),
                data.get('_qa_pairs', [])
            )
            output_path = _get_output_path(chunk_path, self.output_dir)
            _save_enhanced_data(enhanced_data, output_path)
            LOG.info(f'Saved enhanced chunks to {output_path}')
            return _clean_enhanced_result(result, output_key, output_path)
        except Exception as e:
            LOG.error(f'Error saving enhanced chunks: {e}')
            return _clean_enhanced_result(result, output_key)

`parse_qa_pairs(data)`

QA pair parsing function.

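This function parses LLM-generated QA responses, extracting valid QA pairs. A dict response containing a 'question' key yields one QA pair, and a list response yields one pair for each dict item containing a 'question' key. String responses, which indicate that JSON parsing failed, are skipped with a warning. Each extracted pair is merged with the original chunk data.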

Parameters:

data (dict) –

Data containing QA results.

Returns:

dict –

Data containing parsed QA pairs:
_qa_pairs –

List of parsed QA pairs

Examples:

from lazyllm.tools.data.operators.knowledge_cleaning.kbc_multihop_qa_generator_batch import parse_qa_pairs

data = {'_qa_results': [{'response': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}, 'info_pair': {'original_data': {'id': 1}}}]}
result = parse_qa_pairs(data)
# Returns: {'_qa_results': [...], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}}]}
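
How each response type is handled can be seen in a compact standalone sketch. `collect_qa_pairs` is a hypothetical function that follows the branching in the source listing; it returns only the list that the operator stores under `_qa_pairs`.

```python
def collect_qa_pairs(qa_results):
    """Extract QA dicts from responses and merge them with the original data."""
    pairs = []
    for qa_result in qa_results:
        response = qa_result.get('response', '')
        original = qa_result.get('info_pair', {}).get('original_data', {})

        if isinstance(response, dict):
            candidates = [response]
        elif isinstance(response, list):
            candidates = [item for item in response if isinstance(item, dict)]
        else:
            candidates = []  # e.g. a string that could not be parsed as JSON

        for item in candidates:
            if 'question' in item:
                pairs.append({**original, 'qa_pairs': item})
    return pairs


qa_results = [
    {'response': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'},
     'info_pair': {'original_data': {'id': 1}}},
    {'response': [{'question': 'Q2', 'answer': 'A2'}, {'note': 'no question'}],
     'info_pair': {'original_data': {'id': 2}}},
    {'response': 'not valid json', 'info_pair': {'original_data': {'id': 3}}},
]
qa_pairs = collect_qa_pairs(qa_results)
print(qa_pairs)
```

The string response for `id` 3 contributes nothing, and the list item without a 'question' key is ignored.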

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py

@data_register('data.kbc', rewrite_func='forward', _concurrency_mode='process')
def parse_qa_pairs(data: dict) -> dict:
    """QA pair parsing function.

This function parses LLM-generated QA responses, extracting valid QA pairs.
Supports multiple response formats (dict, list, string) and merges parsing results with original data.

Args:
    data (dict): Data containing QA results.

Returns:
    dict: Data containing parsed QA pairs:
    _qa_pairs: List of parsed QA pairs


Examples:
    ```python
    from lazyllm.tools.data.operators.knowledge_cleaning.kbc_multihop_qa_generator_batch import parse_qa_pairs

    data = {'_qa_results': [{'response': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}, 'info_pair': {'original_data': {'id': 1}}}]}
    result = parse_qa_pairs(data)
    # Returns: {'_qa_results': [...], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}}]}
    ```
    """
    qa_results = data.get('_qa_results', [])
    if not qa_results:
        return {**data, '_qa_pairs': []}

    all_qa_pairs = []

    for qa_result in qa_results:
        response = qa_result.get('response', '')
        info_pair = qa_result.get('info_pair', {})
        original_data = info_pair.get('original_data', {})

        if isinstance(response, dict):
            if 'question' in response:
                all_qa_pairs.append(
                    {**original_data, 'qa_pairs': response}
                )

        elif isinstance(response, list):
            for item in response:
                if isinstance(item, dict) and 'question' in item:
                    all_qa_pairs.append(
                        {**original_data, 'qa_pairs': item}
                    )

        elif isinstance(response, str):
            LOG.warning(
                f'JsonFormatter failed to parse response, '
                f'skipping: {response[:100]}...'
            )

    return {**data, '_qa_pairs': all_qa_pairs}

`lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner_batch`

`KBCGenerateCleanedText`

Bases: kbc

Cleaned text generation operator.

This operator uses an LLM to clean raw chunk text, removing noise and formatting the content. Prompts are available for English and Chinese. If the LLM call fails for a chunk, the raw chunk text is used as the response; items with an empty raw_chunk are skipped.

Parameters:

llm –

LLM service instance for cleaning text.
lang (str, default: 'en' ) –

Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing cleaning results:
_cleaned_results –

List of cleaning results, each containing response, raw_chunk, and original_item

Examples:

from lazyllm.tools.data import kbc

# Assuming llm is an LLM service instance
cleaner = kbc.KBCGenerateCleanedText(llm=llm, lang='en')

data = {'_chunks_data': [{'raw_chunk': 'Noisy text with errors...'}]}
result = cleaner(data)
# Returns: {'_chunks_data': [...], '_cleaned_results': [{'response': 'Cleaned text', 'raw_chunk': '...', 'original_item': {...}}]}
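
The fallback behaviour is the main difference from the QA generator, so it is worth a short illustration. The sketch below uses a stub in place of the LLM; `clean_chunks` and `flaky_llm` are hypothetical names, and prompt construction via `DocRefinementPrompt` is left out as a simplification.

```python
def clean_chunks(chunks_data, llm_call):
    """Clean each raw chunk; on failure, keep the raw text as the response."""
    results = []
    for item in chunks_data:
        raw_chunk = item.get('raw_chunk', '')
        if not raw_chunk:
            continue  # nothing to clean
        try:
            response = llm_call(raw_chunk)
        except Exception:
            response = raw_chunk  # fall back to the uncleaned text
        results.append({
            'response': response,
            'raw_chunk': raw_chunk,
            'original_item': item,
        })
    return results


def flaky_llm(text):
    """Stand-in for an LLM service: fails on some inputs, tidies the rest."""
    if 'bad' in text:
        raise TimeoutError('simulated timeout')
    return text.strip().capitalize()


cleaned = clean_chunks(
    [{'raw_chunk': '  noisy text '}, {'raw_chunk': 'bad input'}, {'raw_chunk': ''}],
    flaky_llm,
)
print([entry['response'] for entry in cleaned])
```

Every non-empty chunk produces exactly one result entry, so the output stays aligned with the input even when individual calls fail.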

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py

class KBCGenerateCleanedText(kbc):
    """Cleaned text generation operator.

This operator uses LLM to clean raw chunk text, removing noise and formatting content.
Supports multiple languages, falls back to original text when LLM call fails.

Args:
    llm: LLM service instance for cleaning text.
    lang (str): Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing cleaning results:
    _cleaned_results: List of cleaning results, each containing response, raw_chunk, and original_item


Examples:
    ```python
    from lazyllm.tools.data import kbc

    # Assuming llm is an LLM service instance
    cleaner = kbc.KBCGenerateCleanedText(llm=llm, lang='en')

    data = {'_chunks_data': [{'raw_chunk': 'Noisy text with errors...'}]}
    result = cleaner(data)
    # Returns: {'_chunks_data': [...], '_cleaned_results': [{'response': 'Cleaned text', 'raw_chunk': '...', 'original_item': {...}}]}
    ```
    """
    def __init__(self, llm=None, lang: str = 'en', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompts = DocRefinementPrompt(lang=lang)
        if llm is not None:
            # Note: DocRefinementPrompt may not have system prompt, use empty string
            system_prompt = getattr(self.prompts, 'build_system_prompt', lambda: '')()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(
        self,
        data: dict,
        **kwargs
    ) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        chunks_data = data.get('_chunks_data', [])
        if not chunks_data:
            return {**data, '_cleaned_results': []}

        cleaned_results = []
        for item in chunks_data:
            raw_chunk = item.get('raw_chunk', '')
            if not raw_chunk:
                continue

            # Build prompt for this chunk
            user_prompt = self.prompts.build_prompt(raw_chunk)

            try:
                # Call LLM (system prompt and formatter already set in __init__)
                response = self._llm_serve(user_prompt)

                cleaned_results.append({
                    'response': response,
                    'raw_chunk': raw_chunk,
                    'original_item': item
                })
            except Exception as e:
                LOG.warning(f'Failed to clean text: {e}')
                # Use raw chunk as fallback
                cleaned_results.append({
                    'response': raw_chunk,
                    'raw_chunk': raw_chunk,
                    'original_item': item
                })

        return {**data, '_cleaned_results': cleaned_results}

`KBCLoadRAWChunkFile`

Bases: kbc

Raw chunk file loading operator.

This operator loads JSON or JSONL files containing raw chunks (raw_chunk) from the specified path. Used in the knowledge base cleaning process to load raw chunk data that needs cleaning.

Parameters:

**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing raw chunk data:
_chunks_data –

List of raw chunk data
_chunk_path –

Chunk file path

Examples:

from lazyllm.tools.data import kbc

loader = kbc.KBCLoadRAWChunkFile()

data = {'chunk_path': '/path/to/raw_chunks.json'}
result = loader(data)
# Returns: {'chunk_path': '/path/to/raw_chunks.json', '_chunks_data': [{'raw_chunk': '...'}], '_chunk_path': '/path/to/raw_chunks.json'}
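
To make the JSON versus JSONL handling concrete, here is a small standalone sketch. It is an illustration only: `load_raw_chunks` is a hypothetical helper, and unlike the operator it skips blank lines in JSONL input and returns a bare list rather than the full result dict.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_raw_chunks(path: str) -> List[Dict]:
    """Return the chunk list from a .json or .jsonl file, or [] if unusable."""
    if not path or not os.path.exists(path):
        return []
    try:
        with open(path, 'r', encoding='utf-8') as f:
            if path.endswith('.jsonl'):
                # One JSON object per line; blank lines are skipped in this sketch
                records = [json.loads(line) for line in f if line.strip()]
            elif path.endswith('.json'):
                records = json.load(f)
            else:
                return []
    except (OSError, json.JSONDecodeError):
        return []
    # Only the first record is checked for the 'raw_chunk' field
    if not isinstance(records, list) or not records:
        return []
    if not isinstance(records[0], dict) or 'raw_chunk' not in records[0]:
        return []
    return records


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, 'chunks.jsonl')
    with open(jsonl_path, 'w', encoding='utf-8') as f:
        f.write('{"raw_chunk": "first"}\n\n{"raw_chunk": "second"}\n')
    loaded = load_raw_chunks(jsonl_path)
    missing = load_raw_chunks(os.path.join(tmp, 'absent.json'))
```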

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py

class KBCLoadRAWChunkFile(kbc):
    """Raw chunk file loading operator.

This operator loads JSON or JSONL files containing raw chunks (raw_chunk) from the specified path.
Used in the knowledge base cleaning process to load raw chunk data that needs cleaning.

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing raw chunk data:
    _chunks_data: List of raw chunk data
    _chunk_path: Chunk file path


Examples:
    ```python
    from lazyllm.tools.data import kbc

    loader = kbc.KBCLoadRAWChunkFile()

    data = {'chunk_path': '/path/to/raw_chunks.json'}
    result = loader(data)
    # Returns: {'chunk_path': '/path/to/raw_chunks.json', '_chunks_data': [{'raw_chunk': '...'}], '_chunk_path': '/path/to/raw_chunks.json'}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(
        self,
        data: dict,
        input_key: str = 'chunk_path',
        **kwargs
    ) -> dict:
        chunk_path = data.get(input_key, '')
        if not chunk_path or not os.path.exists(chunk_path):
            LOG.warning(f'Invalid chunk path: {chunk_path}')
            return {**data, '_chunks_data': [], '_chunk_path': chunk_path}

        try:
            if chunk_path.endswith('.json'):
                with open(chunk_path, 'r', encoding='utf-8') as f:
                    file_data = json.load(f)
            elif chunk_path.endswith('.jsonl'):
                with open(chunk_path, 'r', encoding='utf-8') as f:
                    file_data = [json.loads(line) for line in f]
            else:
                LOG.warning(f'Unsupported file format: {chunk_path}')
                return {**data, '_chunks_data': [], '_chunk_path': chunk_path}

            if not file_data or 'raw_chunk' not in file_data[0]:
                LOG.warning(f"'raw_chunk' field not found in: {chunk_path}")
                return {**data, '_chunks_data': [], '_chunk_path': chunk_path}

            return {**data, '_chunks_data': file_data, '_chunk_path': chunk_path}

        except Exception as e:
            LOG.error(f'Error loading chunk file {chunk_path}: {e}')
            return {**data, '_chunks_data': [], '_chunk_path': chunk_path}

`KBCSaveCleaned`

Bases: kbc

Cleaned data saving operator.

This operator saves cleaned chunk data as JSON files, preserving the correspondence between raw and cleaned chunks. Supports specifying output directory, preserving the relative path structure of the original file.

Parameters:

output_dir (str, default: None ) –

Output directory path, defaults to None (save to the original file's directory).
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing save results:
cleaned_chunk_path –

Path to the cleaned chunk file

Examples:

from lazyllm.tools.data import kbc

saver = kbc.KBCSaveCleaned(output_dir='./cleaned_output')

data = {'_chunk_path': '/path/to/raw_chunks.json', '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'cleaned'}]}
result = saver(data, output_key='cleaned_chunk_path')
# Returns: {'cleaned_chunk_path': './cleaned_output/path/to/raw_chunks_cleaned.json'}

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py

class KBCSaveCleaned(kbc):
    """Cleaned data saving operator.

This operator saves cleaned chunk data as JSON files, preserving the correspondence between raw and cleaned chunks.
Supports specifying output directory, preserving the relative path structure of the original file.

Args:
    output_dir (str, optional): Output directory path, defaults to None (save to the original file's directory).
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing save results:
    cleaned_chunk_path: Path to the cleaned chunk file


Examples:
    ```python
    from lazyllm.tools.data import kbc

    saver = kbc.KBCSaveCleaned(output_dir='./cleaned_output')

    data = {'_chunk_path': '/path/to/raw_chunks.json', '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'cleaned'}]}
    result = saver(data, output_key='cleaned_chunk_path')
    # Returns: {'cleaned_chunk_path': './cleaned_output/path/to/raw_chunks_cleaned.json'}
    ```
    """
    def __init__(self, output_dir: Optional[str] = None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_dir = output_dir

    def forward(self, data: dict, output_key: str = 'cleaned_chunk_path', **kwargs) -> dict:
        cleaned_chunks = data.get('_cleaned_chunks', [])
        chunk_path = data.get('_chunk_path', '')
        result = data.copy()

        if not chunk_path:
            return _clean_save_result(result, output_key)

        if not cleaned_chunks:
            LOG.warning(f'No cleaned chunks to save for {chunk_path}')
            return _clean_save_result(result, output_key, chunk_path)

        try:
            json_items = _build_json_items(cleaned_chunks)
            output_path = _get_save_output_path(chunk_path, self.output_dir)

            with open(output_path, 'w', encoding='utf-8') as f:
                json.dump(json_items, f, ensure_ascii=False, indent=4)

            LOG.info(f'Successfully saved cleaned chunks to {output_path}')
            return _clean_save_result(result, output_key, output_path)

        except Exception as e:
            LOG.error(f'Error saving cleaned chunks: {e}')
            return _clean_save_result(result, output_key)

`extract_cleaned_content(data)`

Extract cleaned content function.

This function extracts cleaned text content from LLM cleaning results, handling different response formats. Supports extracting content between <cleaned_start> and <cleaned_end> tags.

Parameters:

data (dict) –

Data containing cleaning results.

Returns:

dict –

Data containing extracted cleaned content:
_cleaned_chunks –

List of cleaned chunks, each containing raw_chunk, cleaned_chunk, and original_item

Examples:

from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner_batch import extract_cleaned_content

data = {'_cleaned_results': [{'response': '<cleaned_start>Clean text<cleaned_end>', 'raw_chunk': 'raw', 'original_item': {}}]}
result = extract_cleaned_content(data)
# Returns: {'_cleaned_results': [...], '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'Clean text', 'original_item': {}}]}
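
The tag handling can be sketched in isolation. This is a minimal illustration, assuming the response has already been reduced to a string; `extract_between_tags` is a hypothetical helper name, not part of the library.

```python
def extract_between_tags(text: str,
                         start: str = '<cleaned_start>',
                         end: str = '<cleaned_end>') -> str:
    """Return the text after the first `start` tag up to the next `end` tag.

    If either tag is missing, the whole text is returned stripped.
    """
    if start in text and end in text:
        after_start = text.split(start, 1)[1]
        return after_start.split(end, 1)[0].strip()
    return text.strip()


tagged = extract_between_tags('noise <cleaned_start> Clean text <cleaned_end> trailer')
untagged = extract_between_tags('  plain response  ')
```

Text outside the tags is discarded, and a response without tags is returned unchanged apart from surrounding whitespace.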

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py

@data_register('data.kbc', rewrite_func='forward', _concurrency_mode='process')
def extract_cleaned_content(data: dict) -> dict:
    """Extract cleaned content function.

This function extracts cleaned text content from LLM cleaning results, handling different response formats.
Supports extracting content between <cleaned_start> and <cleaned_end> tags.

Args:
    data (dict): Data containing cleaning results.

Returns:
    dict: Data containing extracted cleaned content:
    _cleaned_chunks: List of cleaned chunks, each containing raw_chunk, cleaned_chunk, and original_item


Examples:
    ```python
    from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner_batch import extract_cleaned_content

    data = {'_cleaned_results': [{'response': '<cleaned_start>Clean text<cleaned_end>', 'raw_chunk': 'raw', 'original_item': {}}]}
    result = extract_cleaned_content(data)
    # Returns: {'_cleaned_results': [...], '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'Clean text', 'original_item': {}}]}
    ```
    """
    cleaned_results = data.get('_cleaned_results', [])
    if not cleaned_results:
        return {**data, '_cleaned_chunks': []}

    cleaned_chunks = []
    for result in cleaned_results:
        response = result.get('response', '')
        raw_chunk = result.get('raw_chunk', '')
        original_item = result.get('original_item', {})

        # Handle different response types from JsonFormatter
        if isinstance(response, dict):
            # JsonFormatter returned a dict, extract text field or convert to string
            text = response.get('text', '') or response.get('content', '') or str(response)
        elif isinstance(response, list):
            # JsonFormatter returned a list, take the first item
            text = response[0] if response else ''
            if isinstance(text, dict):
                text = text.get('text', '') or text.get('content', '') or str(text)
        elif isinstance(response, str):
            # JsonFormatter failed to parse, use as-is
            text = response
        else:
            text = str(response)

        # Extract content between tags
        if '<cleaned_start>' in text and '<cleaned_end>' in text:
            try:
                cleaned_text = text.split('<cleaned_start>')[1].split('<cleaned_end>')[0].strip()
            except IndexError:
                cleaned_text = text.strip()
        else:
            cleaned_text = text.strip()

        cleaned_chunks.append({
            'raw_chunk': raw_chunk,
            'cleaned_chunk': cleaned_text,
            'original_item': original_item
        })

    return {**data, '_cleaned_chunks': cleaned_chunks}

`lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner`

`KBCGenerateCleanedTextSingle`

Bases: kbc

Single text cleaning generation operator.

This operator uses an LLM to clean a single raw text item, removing noise and formatting the content. It is suitable for real-time cleaning of individual data items and falls back to the original text when the LLM call fails.

Parameters:

llm –

LLM service instance for cleaning text.
lang (str, default: 'en' ) –

Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing cleaning response:
_cleaned_response –

LLM's cleaning response

Examples:

from lazyllm.tools.data import kbc

# Assuming llm is an LLM service instance
cleaner = kbc.KBCGenerateCleanedTextSingle(llm=llm, lang='en')

data = {'raw_chunk': 'Noisy text with errors...'}
result = cleaner(data, input_key='raw_chunk')
# Returns: {'raw_chunk': '...', '_cleaned_response': 'Cleaned text result'}

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner.py

class KBCGenerateCleanedTextSingle(kbc):
    """Single text cleaning generation operator.

This operator uses LLM to clean single raw text, removing noise and formatting content.
Suitable for real-time cleaning of individual data items, falls back to original text when LLM call fails.

Args:
    llm: LLM service instance for cleaning text.
    lang (str): Language type, 'en' for English, 'zh' for Chinese, defaults to 'en'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing cleaning response:
    _cleaned_response: LLM's cleaning response


Examples:
    ```python
    from lazyllm.tools.data import kbc

    # Assuming llm is an LLM service instance
    cleaner = kbc.KBCGenerateCleanedTextSingle(llm=llm, lang='en')

    data = {'raw_chunk': 'Noisy text with errors...'}
    result = cleaner(data, input_key='raw_chunk')
    # Returns: {'raw_chunk': '...', '_cleaned_response': 'Cleaned text result'}
    ```
    """

    def __init__(self, llm=None, lang: str = 'en', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        # Initialize prompt template
        self.prompts = DocRefinementPrompt(lang=lang)

        # Initialize LLM serve with system prompt and formatter
        if llm is not None:
            # Note: DocRefinementPrompt may not have system prompt, use empty string
            system_prompt = getattr(self.prompts, 'build_system_prompt', lambda: '')()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(
        self,
        data: dict,
        input_key: str = 'raw_chunk',
        **kwargs
    ) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        raw_content = data.get(input_key, '')
        if not raw_content:
            return {**data, '_cleaned_response': raw_content}

        # Build prompt for the raw content
        user_prompt = self.prompts.build_prompt(raw_content)

        try:
            # Call LLM (system prompt and formatter already set in __init__)
            response = self._llm_serve(user_prompt)
            return {**data, '_cleaned_response': response}
        except Exception as e:
            LOG.warning(f'Failed to clean text: {e}')
            # Use raw content as fallback
            return {**data, '_cleaned_response': raw_content}

`extract_cleaned_content_single(data, output_key='cleaned_chunk')`

Single cleaned content extraction function.

This function extracts cleaned text content from a single LLM cleaning response, handling different response formats. Supports extracting content between <cleaned_start> and <cleaned_end> tags and removes the intermediate _cleaned_response field.

Parameters:

data (dict) –

Data containing cleaning response.
output_key (str, default: 'cleaned_chunk' ) –

Output key name, defaults to 'cleaned_chunk'.

Returns:

dict –

Data containing extracted cleaned content with field specified by output_key added.

Examples:

from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner import extract_cleaned_content_single

data = {'_cleaned_response': '<cleaned_start>Clean text<cleaned_end>'}
result = extract_cleaned_content_single(data, output_key='cleaned_chunk')
# Returns: {'cleaned_chunk': 'Clean text'}
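
Because the formatter may hand back a dict, a list, or a plain string, the response is flattened to text before it is stored. The sketch below illustrates that flattening and the removal of the intermediate field; `response_to_text` and `finalize` are hypothetical names, and tag extraction is left out to keep the example short.

```python
from typing import Any


def response_to_text(response: Any) -> str:
    """Flatten a dict, list, or string response into plain text."""
    if isinstance(response, list):
        # Only the first list item is considered
        response = response[0] if response else ''
    if isinstance(response, dict):
        # Prefer 'text', then 'content', then the dict's string form
        return response.get('text', '') or response.get('content', '') or str(response)
    return response if isinstance(response, str) else str(response)


def finalize(data: dict, output_key: str = 'cleaned_chunk') -> dict:
    """Store the flattened text under `output_key` and drop '_cleaned_response'."""
    result = dict(data)
    result[output_key] = response_to_text(result.pop('_cleaned_response', '')).strip()
    return result


from_dict = finalize({'id': 1, '_cleaned_response': {'content': ' tidy '}})
from_list = finalize({'_cleaned_response': [{'text': 'first'}, {'text': 'second'}]})
from_empty = finalize({'_cleaned_response': []})
```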

Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner.py

@data_register('data.kbc', rewrite_func='forward', _concurrency_mode='process')
def extract_cleaned_content_single(
    data: dict,
    output_key: str = 'cleaned_chunk',
) -> dict:
    """Single cleaned content extraction function.

This function extracts cleaned text content from single LLM cleaning response, handling different response formats.
Supports extracting content between <cleaned_start> and <cleaned_end> tags and cleans intermediate fields.

Args:
    data (dict): Data containing cleaning response.
    output_key (str): Output key name, defaults to 'cleaned_chunk'.

Returns:
    dict: Data containing extracted cleaned content with field specified by output_key added.


Examples:
    ```python
    from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner import extract_cleaned_content_single

    data = {'_cleaned_response': '<cleaned_start>Clean text<cleaned_end>'}
    result = extract_cleaned_content_single(data, output_key='cleaned_chunk')
    # Returns: {'cleaned_chunk': 'Clean text'}
    ```
    """
    response = data.get('_cleaned_response', '')

    # Handle different response types from JsonFormatter
    if isinstance(response, dict):
        # JsonFormatter returned a dict, extract text field or convert to string
        text = response.get('text', '') or response.get('content', '') or str(response)
    elif isinstance(response, list):
        # JsonFormatter returned a list, take the first item
        text = response[0] if response else ''
        if isinstance(text, dict):
            text = text.get('text', '') or text.get('content', '') or str(text)
    elif isinstance(response, str):
        # JsonFormatter failed to parse, use as-is
        text = response
    else:
        text = str(response)

    # Extract content between tags
    if '<cleaned_start>' in text and '<cleaned_end>' in text:
        try:
            cleaned_text = text.split('<cleaned_start>')[1].split('<cleaned_end>')[0].strip()
        except IndexError:
            cleaned_text = text.strip()
    else:
        cleaned_text = text.strip()

    result = data.copy()
    result[output_key] = cleaned_text
    # Clean intermediate fields
    for key in ['_cleaned_response']:
        result.pop(key, None)
    return result

`lazyllm.tools.data.operators.knowledge_cleaning.qa_extract`

`KBCExtractQAPairs`

Bases: kbc

QA pairs extraction operator.

This operator extracts QA pairs from loaded QA data and converts them to standard format. Supports customizing output field names for instruction, question, and answer.

Parameters:

qa_key (str, default: 'QA_pairs' ) –

QA data field name, defaults to 'QA_pairs'.
instruction (str, default: 'Please answer the following question based on the provided information.' ) –

Instruction text, defaults to 'Please answer the following question based on the provided information.'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

List of extracted QA pairs, each containing instruction, input, and output fields.

Examples:

from lazyllm.tools.data import kbc

extractor = kbc.KBCExtractQAPairs(
    qa_key='QA_pairs',
    instruction='Please answer based on the context.'
)

data = {'_qa_data': {'qa_pairs': [{'question': 'What is AI?', 'answer': 'Artificial Intelligence'}]}}
result = extractor(
    data,
    output_instruction_key='instruction',
    output_question_key='input',
    output_answer_key='output'
)
# Returns: [{'instruction': 'Please answer based on the context.', 'input': 'What is AI?', 'output': 'Artificial Intelligence'}]
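
The conversion itself is a simple mapping from question/answer pairs to training records. The following standalone sketch shows the idea under stated assumptions: `to_training_records` is a hypothetical function, and it coerces values to strings before stripping, which the operator may not do.

```python
from typing import Dict, List


def to_training_records(qa_data, instruction: str,
                        keys=('instruction', 'input', 'output')) -> List[Dict]:
    """Convert QA pairs into records, skipping entries with an empty question or answer."""
    pairs = qa_data.get('qa_pairs', []) if isinstance(qa_data, dict) else qa_data
    if isinstance(pairs, dict):
        pairs = [pairs]  # a single pair is wrapped in a list
    if not isinstance(pairs, list):
        return []
    instruction_key, question_key, answer_key = keys
    records = []
    for pair in pairs:
        if not isinstance(pair, dict):
            continue
        question = str(pair.get('question', '')).strip()
        answer = str(pair.get('answer', '')).strip()
        if question and answer:
            records.append({instruction_key: instruction,
                            question_key: question,
                            answer_key: answer})
    return records


qa_data = {'qa_pairs': [
    {'question': ' What is AI? ', 'answer': 'Artificial Intelligence'},
    {'question': 'Unanswered?', 'answer': ''},
]}
records = to_training_records(qa_data, 'Please answer based on the context.')
```

The pair with an empty answer is dropped, so only one record is produced.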

Source code in lazyllm/tools/data/operators/knowledge_cleaning/qa_extract.py

class KBCExtractQAPairs(kbc):
    """QA pairs extraction operator.

This operator extracts QA pairs from loaded QA data and converts them to standard format.
Supports customizing output field names for instruction, question, and answer.

Args:
    qa_key (str): QA data field name, defaults to 'QA_pairs'.
    instruction (str): Instruction text, defaults to 'Please answer the following question based on the provided information.'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: List of extracted QA pairs, each containing instruction, input, and output fields.


Examples:
    ```python
    from lazyllm.tools.data import kbc

    extractor = kbc.KBCExtractQAPairs(
        qa_key='QA_pairs',
        instruction='Please answer based on the context.'
    )

    data = {'_qa_data': {'qa_pairs': [{'question': 'What is AI?', 'answer': 'Artificial Intelligence'}]}}
    result = extractor(
        data,
        output_instruction_key='instruction',
        output_question_key='input',
        output_answer_key='output'
    )
    # Returns: [{'instruction': 'Please answer based on the context.', 'input': 'What is AI?', 'output': 'Artificial Intelligence'}]
    ```
    """
    def __init__(
        self,
        qa_key: str = 'QA_pairs',
        instruction: str = 'Please answer the following question based on the provided information.',
        **kwargs
    ):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.qa_key = qa_key
        self.instruction = instruction

    def forward(
        self,
        data: dict,
        output_instruction_key: str = 'instruction',
        output_question_key: str = 'input',
        output_answer_key: str = 'output',
        **kwargs
    ) -> List[dict]:
        qa_data = data.get('_qa_data')
        if not qa_data:
            return []

        # Extract qa_pairs - handle both dict with 'qa_pairs' key and direct list
        qa_list = qa_data.get('qa_pairs', []) if isinstance(qa_data, dict) else qa_data
        if not isinstance(qa_list, list):
            qa_list = [qa_list] if isinstance(qa_list, dict) else []

        results = []
        for qa in qa_list:
            if not isinstance(qa, dict):
                continue

            question = qa.get('question', '').strip()
            answer = qa.get('answer', '').strip()

            if not question or not answer:
                continue

            item = {
                output_instruction_key: self.instruction,
                output_question_key: question,
                output_answer_key: answer
            }
            results.append(item)

        return results

`KBCLoadQAData`

Bases: kbc

QA data loading operator.

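This operator loads QA data from input data or chunk files. It first checks whether the input already contains QA data under qa_key. If not, it tries the files referenced by enhanced_chunk_path, cleaned_chunk_path, and chunk_path, in that order, and returns the QA data of the first chunk that contains qa_key. If nothing is found, _qa_data is None.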

Parameters:

qa_key (str, default: 'QA_pairs' ) –

QA data field name, defaults to 'QA_pairs'.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Data containing QA data:
_qa_data –

Loaded QA data
_source_file –

Data source file path (if loaded from file)

Examples:

from lazyllm.tools.data import kbc

loader = kbc.KBCLoadQAData(qa_key='QA_pairs')

# From existing data
data = {'QA_pairs': [{'question': 'Q1', 'answer': 'A1'}]}
result = loader(data)
# Returns: {'QA_pairs': [...], '_qa_data': [...]}

# From file
data = {'enhanced_chunk_path': '/path/to/enhanced.json'}
result = loader(data)
# Returns: {'enhanced_chunk_path': '...', '_qa_data': [...], '_source_file': '/path/to/enhanced.json'}
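
The lookup order (inline data first, then the enhanced, cleaned, and regular chunk files) can be sketched as follows. This is an illustrative standalone version, not the operator itself: `find_qa_data` and `PATH_KEYS` are hypothetical names, and the function returns a tuple instead of the result dict.

```python
import json
import os
import tempfile

PATH_KEYS = ('enhanced_chunk_path', 'cleaned_chunk_path', 'chunk_path')


def find_qa_data(data: dict, qa_key: str = 'QA_pairs'):
    """Return (qa_data, source_file); source_file is None for inline data or no match."""
    if qa_key in data:
        return data[qa_key], None
    for path_key in PATH_KEYS:
        path = data.get(path_key)
        if not path or not os.path.exists(path):
            continue  # try the next candidate path
        try:
            with open(path, 'r', encoding='utf-8') as f:
                chunks = json.load(f)
        except (OSError, json.JSONDecodeError):
            continue
        chunks = chunks if isinstance(chunks, list) else [chunks]
        for chunk in chunks:
            if isinstance(chunk, dict) and qa_key in chunk:
                return chunk[qa_key], path
    return None, None


with tempfile.TemporaryDirectory() as tmp:
    cleaned_path = os.path.join(tmp, 'cleaned.json')
    with open(cleaned_path, 'w', encoding='utf-8') as f:
        json.dump([{'raw_chunk': 'x'},
                   {'QA_pairs': [{'question': 'Q1', 'answer': 'A1'}]}], f)
    request = {'enhanced_chunk_path': os.path.join(tmp, 'missing.json'),
               'cleaned_chunk_path': cleaned_path}
    qa_data, source = find_qa_data(request)
    source_name = os.path.basename(source)
```

The missing enhanced file is skipped, so the data comes from the cleaned file.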

Source code in lazyllm/tools/data/operators/knowledge_cleaning/qa_extract.py

class KBCLoadQAData(kbc):
    """QA data loading operator.

This operator loads QA data from input data or chunk files. First checks if QA data already exists in input data,
if not, tries to load from enhanced chunk files, cleaned chunk files, or regular chunk files.

Args:
    qa_key (str): QA data field name, defaults to 'QA_pairs'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing QA data:
    _qa_data: Loaded QA data
    _source_file: Data source file path (if loaded from file)


Examples:
    ```python
    from lazyllm.tools.data import kbc

    loader = kbc.KBCLoadQAData(qa_key='QA_pairs')

    # From existing data
    data = {'QA_pairs': [{'question': 'Q1', 'answer': 'A1'}]}
    result = loader(data)
    # Returns: {'QA_pairs': [...], '_qa_data': [...]}

    # From file
    data = {'enhanced_chunk_path': '/path/to/enhanced.json'}
    result = loader(data)
    # Returns: {'enhanced_chunk_path': '...', '_qa_data': [...], '_source_file': '/path/to/enhanced.json'}
    ```
    """
    def __init__(self, qa_key: str = 'QA_pairs', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.qa_key = qa_key

    def forward(
        self,
        data: dict,
        **kwargs
    ) -> dict:
        # Check if QA data already exists in the data
        if self.qa_key in data:
            return {**data, '_qa_data': data.get(self.qa_key)}

        # Try to load from chunk files
        path_keys = ['enhanced_chunk_path', 'cleaned_chunk_path', 'chunk_path']

        for path_key in path_keys:
            file_path = data.get(path_key)
            if not file_path or not Path(file_path).exists():
                continue

            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    chunks = json.load(f)
                    chunks = chunks if isinstance(chunks, list) else [chunks]

                    for chunk in chunks:
                        if self.qa_key in chunk:
                            return {
                                **data,
                                '_qa_data': chunk[self.qa_key],
                                '_source_file': file_path
                            }
            except Exception as e:
                LOG.error(f'Failed to load {file_path}: {e}')
                continue

        # No QA data found
        return {**data, '_qa_data': None}

Reranker synthesis

`lazyllm.tools.data.operators.reranker_synthesis`

`RerankerAdjustNegatives`

Bases: reranker

Reranker negative sample adjustment operator.

This operator adjusts the number of negative samples to match the target count. It truncates the list if there are too many and pads it by randomly re-sampling existing negatives if there are too few; an empty list is left unchanged. Padding uses a random generator seeded from seed and the query text, so results are reproducible.

Parameters:

adjust_neg_count (int, default: 7 ) –

Target negative sample count, defaults to 7.
seed (int, default: 42 ) –

Random seed for random selection during padding, defaults to 42.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Adjusted data with updated _neg field.

Examples:

from lazyllm.tools.data import reranker

adjuster = reranker.RerankerAdjustNegatives(adjust_neg_count=5, seed=123)

# Too many negatives
data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8']}
result = adjuster(data)
# Returns: {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5']}

# Too few negatives
data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2']}
result = adjuster(data)
# Returns: {'_is_valid': True, '_query': 'ML', '_neg': [...]}  # 5 items: 'n1', 'n2', then 3 padded entries drawn from the existing negatives
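
The truncate-or-pad rule with query-dependent seeding can be sketched on its own. This is a minimal illustration with a hypothetical `adjust_negatives` function. It deliberately copies the input list and samples only from the original negatives, which may differ in detail from the operator's own sampling.

```python
import random
from typing import List


def adjust_negatives(neg: List[str], target: int, seed: int, query: str) -> List[str]:
    """Truncate or pad `neg` to `target` items without mutating the input list."""
    adjusted = list(neg)[:target]
    if adjusted and len(adjusted) < target:
        # A string seed built from the seed and the query makes padding repeatable
        rng = random.Random(f'{seed}_{query}')
        pool = list(adjusted)  # sample only from the original negatives
        while len(adjusted) < target:
            adjusted.append(rng.choice(pool))
    return adjusted


original = ['n1', 'n2']
padded = adjust_negatives(original, target=5, seed=123, query='ML')
padded_again = adjust_negatives(original, target=5, seed=123, query='ML')
truncated = adjust_negatives(['n1', 'n2', 'n3', 'n4', 'n5', 'n6'],
                             target=5, seed=123, query='ML')
```

Calling the function twice with the same seed and query yields the same padded list, while a different query would seed a different sequence.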

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_from_embedding_converter.py

class RerankerAdjustNegatives(reranker):
    """Reranker negative sample adjustment operator.

This operator adjusts the number of negative samples to match the target count. Truncates if there are too many, or pads by random sampling if there are too few. Uses deterministic random seed based on query content for reproducibility.

Args:
    adjust_neg_count (int): Target negative sample count, defaults to 7.
    seed (int): Random seed for random selection during padding, defaults to 42.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Adjusted data with updated _neg field.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    adjuster = reranker.RerankerAdjustNegatives(adjust_neg_count=5, seed=123)

    # Too many negatives
    data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8']}
    result = adjuster(data)
    # Returns: {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5']}

    # Too few negatives
    data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2']}
    result = adjuster(data)
    # Returns: {'_is_valid': True, '_query': 'ML', '_neg': [...]}  # 5 items: 'n1', 'n2', then 3 padded entries drawn from the existing negatives
    ```
    """
    def __init__(self, adjust_neg_count: int = 7, seed: int = 42, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.adjust_neg_count = adjust_neg_count
        self.seed = seed

    def forward(self, data: dict, **kwargs) -> dict:
        if not data.get('_is_valid'):
            return data

        neg = data.get('_neg', [])

        if len(neg) > self.adjust_neg_count:
            # Truncate to target count
            neg = neg[:self.adjust_neg_count]
        elif len(neg) < self.adjust_neg_count and neg:
            # Pad with duplicates if needed (when we have some negatives)
            local_random = random.Random(f'{self.seed}_{data["_query"]}')
            while len(neg) < self.adjust_neg_count:
                neg.append(local_random.choice(neg))

        return {**data, '_neg': neg}

`RerankerBuildFormat`

Bases: reranker

Reranker format builder operator.

This operator converts validated data to standard reranker training format. Outputs a dictionary containing query, pos, and neg fields without prompts or instructions.

Parameters:

**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

dict –

Reranker format data containing query, pos, and neg fields. Returns empty dict if data is invalid.

Examples:

from lazyllm.tools.data import reranker

builder = reranker.RerankerBuildFormat()

data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = builder(data)
# Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking']}

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_from_embedding_converter.py

class RerankerBuildFormat(reranker):
    """Reranker format builder operator.

This operator converts validated data to standard reranker training format. Outputs a dictionary containing query, pos, and neg fields without prompts or instructions.

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Reranker format data containing query, pos, and neg fields. Returns empty dict if data is invalid.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    builder = reranker.RerankerBuildFormat()

    data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
    result = builder(data)
    # Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking']}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(self, data: dict, **kwargs) -> dict:
        if not data.get('_is_valid'):
            return {}

        # Build reranker format (no prompt/instruction)
        reranker_item = {
            'query': data['_query'],
            'pos': data['_pos'],
            'neg': data['_neg'],
        }

        return reranker_item

`RerankerFormatCrossEncoder`

Bases: reranker

CrossEncoder format conversion operator.

This operator converts validated data to CrossEncoder training format. Each query-document pair is an independent sample, with positive samples labeled 1 and negative samples labeled 0.

Parameters:

**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

List of converted data, each containing query, document, and label fields.

Examples:

from lazyllm.tools.data import reranker

formatter = reranker.RerankerFormatCrossEncoder()

data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'document': 'ML tutorial', 'label': 1}, {'query': 'machine learning', 'document': 'cooking', 'label': 0}]
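
With several positives and negatives, the output grows to one row per document. The sketch below reproduces that expansion in plain Python; `to_cross_encoder` is a hypothetical helper written for this illustration, and the extra documents are made-up sample data.

```python
from typing import List


def to_cross_encoder(query: str, pos: List[str], neg: List[str]) -> List[dict]:
    """Hypothetical helper: one labeled row per query-document pair."""
    # Positives come first and carry label 1
    rows = [{'query': query, 'document': p, 'label': 1} for p in pos]
    # Negatives follow and carry label 0
    rows += [{'query': query, 'document': n, 'label': 0} for n in neg]
    return rows


rows = to_cross_encoder(
    'machine learning',
    ['ML tutorial', 'intro to ML'],
    ['cooking', 'history', 'travel'],
)
labels = [row['label'] for row in rows]
```

Two positives and three negatives give five rows, so the output size is `len(pos) + len(neg)` per input record.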

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py

class RerankerFormatCrossEncoder(reranker):
    """CrossEncoder format conversion operator.

This operator converts validated data to CrossEncoder training format. Each query-document pair is an independent sample, with positive samples labeled 1 and negative samples labeled 0.

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: List of converted data, each containing query, document, and label fields.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    formatter = reranker.RerankerFormatCrossEncoder()

    data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
    result = formatter(data)
    # Returns: [{'query': 'machine learning', 'document': 'ML tutorial', 'label': 1}, {'query': 'machine learning', 'document': 'cooking', 'label': 0}]
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(self, data: dict, **kwargs) -> List[dict]:
        if not data.get('_is_valid'):
            return []

        query = data['_query']
        pos = data['_pos']
        neg = data['_neg']

        results = []

        # Positive samples with label 1
        for p in pos:
            results.append({'query': query, 'document': p, 'label': 1})

        # Negative samples with label 0
        for n in neg:
            results.append({'query': query, 'document': n, 'label': 0})

        return results

`RerankerFormatFlagReranker`

Bases: reranker

FlagReranker format conversion operator.

This operator converts validated data to FlagReranker training format. It fixes the number of negative samples at train_group_size - 1, repeating the existing negatives when there are too few and truncating the list when there are too many.

Parameters:

train_group_size (int, default: 8 ) –

Training group size (including 1 positive sample), defaults to 8.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

List of converted data, each containing query, pos, and neg fields.

Examples:

from lazyllm.tools.data import reranker

formatter = reranker.RerankerFormatFlagReranker(train_group_size=8)

data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking', 'history']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking', 'history', ...]}]
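
The padding and truncation rule is easier to see in isolation. The following sketch mirrors the logic described above in a hypothetical stand-alone function, `fit_negatives`; the document names are sample data.

```python
from typing import List


def fit_negatives(neg: List[str], train_group_size: int = 8) -> List[str]:
    """Hypothetical helper: force the negative list to train_group_size - 1 items."""
    needed = train_group_size - 1  # one slot in the group is reserved for the positive
    if not neg:
        return []  # nothing to repeat, so the list stays empty
    if len(neg) < needed:
        # Repeat the list enough times to cover the gap, then cut to size
        repeats = needed // len(neg) + 1
        return (neg * repeats)[:needed]
    return neg[:needed]


padded = fit_negatives(['cooking', 'history'], train_group_size=8)
truncated = fit_negatives([f'doc{i}' for i in range(10)], train_group_size=4)
empty = fit_negatives([], train_group_size=8)
```

Padding cycles through the existing negatives in order, so two negatives padded to seven alternate between the two values. An empty negative list cannot be padded and stays empty.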

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py

class RerankerFormatFlagReranker(reranker):
    """FlagReranker format conversion operator.

This operator converts validated data to FlagReranker training format. Ensures the number of negative samples meets training group size requirements, padding with duplicates if insufficient or truncating if excessive.

Args:
    train_group_size (int): Training group size (including 1 positive sample), defaults to 8.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: List of converted data, each containing query, pos, and neg fields.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    formatter = reranker.RerankerFormatFlagReranker(train_group_size=8)

    data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking', 'history']}
    result = formatter(data)
    # Returns: [{'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking', 'history', ...]}]
    ```
    """
    def __init__(self, train_group_size: int = 8, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.train_group_size = train_group_size

    def forward(self, data: dict, **kwargs) -> List[dict]:
        if not data.get('_is_valid'):
            return []

        query = data['_query']
        pos = data['_pos']
        neg = data['_neg']

        # Ensure neg has exactly train_group_size - 1 samples
        num_neg_needed = self.train_group_size - 1
        if len(neg) < num_neg_needed:
            # Pad with duplicates if needed
            neg = (neg * (num_neg_needed // len(neg) + 1))[:num_neg_needed] if neg else []
        else:
            neg = neg[:num_neg_needed]

        return [{
            'query': query,
            'pos': pos,
            'neg': neg,
        }]

`RerankerFormatPairwise`

Bases: reranker

Pairwise format conversion operator.

This operator converts validated data to Pairwise training format. Creates pairwise combinations of positive and negative samples for training ranking models to distinguish relevant from irrelevant documents.

Parameters:

**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

List of converted data, each containing query, doc_pos, and doc_neg fields.

Examples:

from lazyllm.tools.data import reranker

formatter = reranker.RerankerFormatPairwise()

data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'doc_pos': 'ML tutorial', 'doc_neg': 'cooking'}]
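
The pairwise output is the Cartesian product of positives and negatives. The sketch below expresses this with `itertools.product`; `to_pairwise` is a hypothetical helper and the documents are sample data.

```python
from itertools import product
from typing import List


def to_pairwise(query: str, pos: List[str], neg: List[str]) -> List[dict]:
    """Hypothetical helper: one row per (positive, negative) combination."""
    # product() iterates positives in the outer loop and negatives in the inner loop
    return [{'query': query, 'doc_pos': p, 'doc_neg': n} for p, n in product(pos, neg)]


pairs = to_pairwise(
    'machine learning',
    ['ML tutorial', 'intro to ML'],
    ['cooking', 'history', 'travel'],
)
```

The output holds `len(pos) * len(neg)` rows, so records with many positives and negatives can expand quickly.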

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py

class RerankerFormatPairwise(reranker):
    """Pairwise format conversion operator.

This operator converts validated data to Pairwise training format. Creates pairwise combinations of positive and negative samples for training ranking models to distinguish relevant from irrelevant documents.

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: List of converted data, each containing query, doc_pos, and doc_neg fields.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    formatter = reranker.RerankerFormatPairwise()

    data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
    result = formatter(data)
    # Returns: [{'query': 'machine learning', 'doc_pos': 'ML tutorial', 'doc_neg': 'cooking'}]
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(self, data: dict, **kwargs) -> List[dict]:
        if not data.get('_is_valid'):
            return []

        query = data['_query']
        pos = data['_pos']
        neg = data['_neg']

        results = []

        # Create pairwise comparisons
        for p in pos:
            for n in neg:
                results.append({'query': query, 'doc_pos': p, 'doc_neg': n})

        return results

`RerankerGenerateQueries`

Bases: reranker

Generates multiple retrieval queries from a given passage.

This operator builds prompts using RerankerQueryGeneratorPrompt and calls the LLM to produce queries with different difficulty levels. The result is parsed by JsonFormatter and stored as a JSON string in the '_query_response' field.

If the passage is empty or generation fails, '_query_response' is set to an empty string.

Parameters:

llm_serving –

Language model serving instance. It must be provided for generation; forward raises ValueError when it is None.
lang (str, default: 'zh' ) –

Language of the generated queries. Defaults to 'zh'.
num_queries (int, default: 3 ) –

Number of queries to generate. Defaults to 3.
difficulty_levels (List[str], default: None ) –

List of difficulty levels. When None, ['easy', 'medium', 'hard'] is used.
**kwargs (dict, default: {} ) –

Additional optional parameters passed to the parent class.

Examples:

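from lazyllm.tools.data import reranker

# my_llm is assumed to be an already-configured LLM serving instance
op = reranker.RerankerGenerateQueries(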
    llm_serving=my_llm,
    lang='en',
    num_queries=5,
    difficulty_levels=['easy', 'hard']
)

result = op({'passage': 'Large language models are widely used in NLP.'})
print(result['_query_response'])
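
Because '_query_response' is always a string (possibly empty), consumers need to decode it themselves. The sketch below shows the normalization step and one way to read the value back. `to_query_response` is a hypothetical helper, and the shape of the parsed result (a list of dicts with 'query' and 'difficulty' keys) is an assumption made for illustration, not a documented output schema.

```python
import json


def to_query_response(result) -> str:
    """Hypothetical helper mirroring the str-versus-parsed handling."""
    # A raw string is passed through unchanged
    if isinstance(result, str):
        return result
    # Parsed JSON is re-serialized; ensure_ascii=False keeps non-ASCII text readable
    return json.dumps(result, ensure_ascii=False)


# Assumed shape of a parsed LLM result, for illustration only
parsed = [{'query': '什么是大语言模型?', 'difficulty': 'easy'}]
response = to_query_response(parsed)

# An empty string signals an empty passage or a failed generation
queries = json.loads(response) if response else []
failed = to_query_response('')
```

Guarding the `json.loads` call matters because an empty string is not valid JSON.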

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_query_generator.py

class RerankerGenerateQueries(reranker):
    """Generates multiple retrieval queries from a given passage.

This operator builds prompts using RerankerQueryGeneratorPrompt
and calls the LLM to produce queries with different difficulty levels.
The result is parsed by JsonFormatter and stored as a JSON string
in the '_query_response' field.

If the passage is empty or generation fails, an empty response is returned.

Args:
    llm_serving: language model serving instance
    lang (str): language of generated queries, default 'zh'
    num_queries (int): number of queries to generate, default 3
    difficulty_levels (List[str]): list of difficulty levels, default ['easy', 'medium', 'hard']
    **kwargs (dict): Additional optional parameters passed to parent class.


Examples:
    ```python
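    from lazyllm.tools.data import reranker

    # my_llm is assumed to be an already-configured LLM serving instance
    op = reranker.RerankerGenerateQueries(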
        llm_serving=my_llm,
        lang='en',
        num_queries=5,
        difficulty_levels=['easy', 'hard']
    )

    result = op({'passage': 'Large language models are widely used in NLP.'})
    print(result['_query_response'])
    ```
    """
    def __init__(
        self,
        llm_serving=None,
        lang: str = 'zh',
        num_queries: int = 3,
        difficulty_levels: Optional[List[str]] = None,
        **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_queries = num_queries
        self.difficulty_levels = difficulty_levels or ['easy', 'medium', 'hard']
        self.prompt_template = RerankerQueryGeneratorPrompt(lang=lang)

        # Initialize LLM serve with system prompt and formatter
        if llm_serving is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm_serving.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(
        self,
        data: dict,
        input_key: str = 'passage',
        **kwargs
    ) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM serving is not configured')

        passage = data.get(input_key, '')
        if not passage:
            return {**data, '_query_response': ''}

        # Build user prompt from passage
        user_prompt = self.prompt_template.build_prompt(
            passage=passage,
            num_queries=self.num_queries,
            difficulty_levels=self.difficulty_levels
        )

        try:
            result = self._llm_serve(user_prompt)
            # JsonFormatter already parses JSON, handle both str and parsed result
            if isinstance(result, str):
                response = result
            else:
                response = json.dumps(result, ensure_ascii=False)
            return {**data, '_query_response': response}
        except Exception as e:
            LOG.warning(f'Failed to generate queries: {e}')
            return {**data, '_query_response': ''}

`RerankerInitBM25`

Bases: reranker

Initialize BM25 index operator.

This operator builds a BM25 index over the corpus for keyword-based negative sample mining. It supports Chinese and English: Chinese text is segmented with jieba, and English text is stemmed with Stemmer.

Parameters:

language (str, default: 'zh' ) –

Language type, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
**kwargs (dict, default: {} ) –

Additional optional parameters passed to parent class.

Returns:

List[dict] –

The input list, with the BM25 index, the corpus, and the tokenizer configuration added to each item.

Examples:

from lazyllm.tools.data import reranker

init_bm25 = reranker.RerankerInitBM25(language='zh')

# Build the corpus first
data_with_corpus = reranker.build_reranker_corpus(inputs)
# Then initialize BM25
result = init_bm25(data_with_corpus)
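
The index is built once per batch and then attached to every item, so all returned items reference the same objects. The sketch below illustrates only that sharing pattern; `FakeIndex` and `attach_index` are hypothetical placeholders and do not call lazyllm or bm25s.

```python
from typing import Any, List


class FakeIndex:
    """Placeholder standing in for a built BM25 index."""


def attach_index(inputs: List[dict], index: Any, corpus: List[str]) -> List[dict]:
    """Hypothetical helper: copy each item and add shared references."""
    # Each output item is a shallow copy; the index and corpus are not duplicated
    return [{**item, '_bm25': index, '_bm25_corpus': corpus} for item in inputs]


shared_index = FakeIndex()
corpus = ['ML tutorial', 'cooking basics']
items = attach_index([{'query': 'a'}, {'query': 'b'}], shared_index, corpus)
```

Since every item carries a reference to the same index, these fields are best treated as read-only by later operators.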

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py

class RerankerInitBM25(reranker):
    """Initialize BM25 index operator.

This operator builds a BM25 index over the corpus for keyword-based negative sample mining.
It supports Chinese and English: Chinese text is segmented with jieba, and English text is stemmed with Stemmer.

Args:
    language (str): Language type, 'zh' for Chinese, 'en' for English, defaults to 'zh'.
    **kwargs (dict): Additional optional parameters passed to parent class.

Returns:
    List[dict]: Input data list, each data adds BM25 index and tokenizer configuration.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    init_bm25 = reranker.RerankerInitBM25(language='zh')

    # Build the corpus first
    data_with_corpus = reranker.build_reranker_corpus(inputs)
    # Then initialize BM25
    result = init_bm25(data_with_corpus)
    ```
    """
    def __init__(self, language: str = 'zh', **kwargs):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.language = language
        self._setup_tokenizer(language)

    def _setup_tokenizer(self, language: str):
        if language == 'en':
            self._stemmer = Stemmer.Stemmer('english')
            self._stopwords = language
            self._tokenizer = lambda t: t
        elif language == 'zh':
            self._stemmer = None
            self._stopwords = STOPWORDS_CHINESE
            self._tokenizer = lambda t: ' '.join(jieba.lcut(t))
        else:
            self._stemmer = None
            self._stopwords = None
            self._tokenizer = lambda t: t

    def forward_batch_input(self, inputs: List[dict], **kwargs) -> List[dict]:
        if not inputs:
            return inputs

        # Load corpus from file path instead of memory
        corpus_path = inputs[0].get('_corpus', '')
        if not corpus_path:
            LOG.warning('No corpus path found for BM25 initialization.')
            return [{**item, '_bm25': None, '_bm25_corpus': []} for item in inputs]

        corpus = _load_corpus_from_path(corpus_path)
        if not corpus:
            LOG.warning(f'Failed to load corpus from {corpus_path}')
            return [{**item, '_bm25': None, '_bm25_corpus': []} for item in inputs]

        LOG.info(f'Initializing BM25 index for {len(corpus)} documents...')
        corpus_tokens = bm25s.tokenize(
            [self._tokenizer(doc) for doc in corpus],
            stopwords=self._stopwords,
            stemmer=self._stemmer,
        )
        bm25_index = bm25s.BM25()
        bm25_index.index(corpus_tokens)
        LOG.info('BM25 index initialized.')

        return [{
            **item,
            '_bm25': bm25_index,
            '_bm25_corpus': corpus,
            '_bm25_tokenizer': self._tokenizer,
            '_bm25_stopwords': self._stopwords,
            '_bm25_stemmer': self._stemmer
        } for item in inputs]

`RerankerInitSemantic`

Bases: reranker

Initialize semantic embeddings operator.

This operator uses an embedding service to compute vector representations for all documents in the corpus and saves them to a .npy file. The saved embeddings are used for subsequent semantic similarity calculation and negative sample mining.

Parameters:

embedding_serving (Callable, default: None ) –

Embedding service callable function.
embeddings_dir (str, default: None ) –

Directory in which to save the embedding file. Defaults to the directory containing the corpus file.
**kwargs (dict, default: {} ) –

Additional optional parameters passed to parent class.

Returns:

List[dict] –

The input list, with the embedding file path and the corpus added to each item.

Examples:

from lazyllm.tools.data import reranker

# Assume embedding_fn is the embedding service
init_semantic = reranker.RerankerInitSemantic(embedding_serving=embedding_fn)

# Build the corpus first
data_with_corpus = reranker.build_reranker_corpus(inputs)
# Then compute the semantic embeddings
result = init_semantic(data_with_corpus)
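
Where the embedding file ends up depends on whether embeddings_dir is set. The sketch below mirrors that choice in a hypothetical helper, `resolve_embeddings_dir`; the paths are made-up examples.

```python
import os
from typing import Optional


def resolve_embeddings_dir(corpus_path: str, embeddings_dir: Optional[str] = None) -> str:
    """Hypothetical helper: pick the directory for the saved embeddings."""
    # Fall back to the corpus file's own directory when none is given
    if embeddings_dir is None:
        return os.path.dirname(corpus_path)
    return embeddings_dir


corpus_path = os.path.join('data', 'run1', 'corpus.json')  # made-up path
default_dir = resolve_embeddings_dir(corpus_path)
custom_dir = resolve_embeddings_dir(corpus_path, 'cache')
bare_dir = resolve_embeddings_dir('corpus.json')
```

A corpus path with no directory component resolves to an empty string, so passing embeddings_dir explicitly is probably the safer choice in that situation.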

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py

class RerankerInitSemantic(reranker):
    """Initialize semantic embeddings operator.

This operator uses embedding service to compute vector representations for all documents in the corpus and saves them to files.
Used for subsequent semantic similarity calculation and negative sample mining.

Args:
    embedding_serving (Callable): Embedding service callable function.
    embeddings_dir (str, optional): Embedding file save directory, defaults to corpus directory.
    **kwargs (dict): Additional optional parameters passed to parent class.

Returns:
    List[dict]: Input data list, each data adds embedding file path and corpus information.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    # Assume embedding_fn is the embedding service
    init_semantic = reranker.RerankerInitSemantic(embedding_serving=embedding_fn)

    # Build the corpus first
    data_with_corpus = reranker.build_reranker_corpus(inputs)
    # Then compute the semantic embeddings
    result = init_semantic(data_with_corpus)
    ```
    """
    def __init__(self, embedding_serving: Optional[Callable] = None, embeddings_dir: Optional[str] = None, **kwargs):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.embedding_serving = embedding_serving
        self.embeddings_dir = embeddings_dir

    def forward_batch_input(self, inputs: List[dict], **kwargs) -> List[dict]:
        if not inputs:
            return inputs

        # Load corpus from file path instead of memory
        corpus_path = inputs[0].get('_corpus', '')
        if not corpus_path:
            LOG.warning('No corpus path found for semantic initialization.')
            return [{**item, '_semantic_embeddings_path': '', '_semantic_corpus': []}
                    for item in inputs]

        # Verify all inputs share the same corpus path for consistency
        if not all(item.get('_corpus') == corpus_path for item in inputs):
            LOG.warning('Not all inputs share the same corpus path. Using corpus from first item.')

        corpus = _load_corpus_from_path(corpus_path)
        if not corpus or self.embedding_serving is None:
            LOG.warning('No corpus or embedding_serving for semantic initialization.')
            return [{**item, '_semantic_embeddings_path': '', '_semantic_corpus': corpus or []}
                    for item in inputs]

        LOG.info(f'Computing embeddings for {len(corpus)} documents...')
        embeddings = np.array(self.embedding_serving(corpus))
        LOG.info('Embeddings computed.')

        # Save embeddings to file instead of storing in memory for each item
        if self.embeddings_dir is None:
            embeddings_dir = os.path.dirname(corpus_path)
        else:
            embeddings_dir = self.embeddings_dir
        os.makedirs(embeddings_dir, exist_ok=True)

        embeddings_path = os.path.join(embeddings_dir, f'reranker_embeddings_{id(inputs)}.npy')
        np.save(embeddings_path, embeddings)
        LOG.info(f'Saved embeddings to {embeddings_path}')

        return [{
            **item,
            '_semantic_embeddings_path': embeddings_path,
            '_semantic_corpus': corpus
        } for item in inputs]

`RerankerMineBM25Negatives`

Bases: reranker

BM25 negative sample mining operator.

Using the BM25 index, this operator retrieves the documents most relevant to the query that are not among the positive samples. It is suited to mining hard negatives that overlap lexically with the query but differ in meaning.

Parameters:

num_negatives (int, default: 7 ) –

Number of negative samples to mine, defaults to 7.
**kwargs (dict, default: {} ) –

Additional optional parameters passed to parent class.

Returns:

dict –

Input data with mined negative samples list added.

Examples:

from lazyllm.tools.data import reranker

miner = reranker.RerankerMineBM25Negatives(num_negatives=5)

data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_bm25': bm25_index, '_bm25_corpus': corpus}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': ['bm25_neg1', 'bm25_neg2', ...]}
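
Retrieval deliberately over-fetches so that enough candidates remain once positives are filtered out. The sketch below separates the two steps, sizing the fetch and filtering the ranked hits. `fetch_size` and `pick_negatives` are hypothetical helpers, and the ranking is a hand-written stand-in for real BM25 output.

```python
from typing import List, Sequence, Set


def fetch_size(corpus_len: int, num_negatives: int, num_pos: int, slack: int = 10) -> int:
    """Hypothetical helper: how many hits to request from the index."""
    # Ask for extra hits to absorb positives, but never more than the corpus holds
    return min(corpus_len, num_negatives + num_pos + slack)


def pick_negatives(ranked: Sequence[int], corpus: List[str],
                   pos_set: Set[str], num_negatives: int) -> List[str]:
    """Hypothetical helper: walk the ranking and keep the first non-positive docs."""
    negatives: List[str] = []
    for idx in ranked:
        doc = corpus[idx]
        if doc in pos_set:
            continue  # positives must never be returned as negatives
        negatives.append(doc)
        if len(negatives) >= num_negatives:
            break
    return negatives


corpus = ['ML tutorial', 'ML cookbook', 'deep learning intro', 'cooking basics']
ranked = [0, 1, 2, 3]  # assumed ranking, best match first
k = fetch_size(len(corpus), num_negatives=2, num_pos=1)
negatives = pick_negatives(ranked, corpus, {'ML tutorial'}, num_negatives=2)
```

When the corpus is small, fewer than num_negatives documents may be returned.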

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py

class RerankerMineBM25Negatives(reranker):
    """BM25 negative sample mining operator.

This operator retrieves documents most relevant to the query but not in positive samples based on BM25 index.
Suitable for mining hard negatives that have lexical overlap but different semantics.

Args:
    num_negatives (int): Number of negative samples to mine, defaults to 7.
    **kwargs (dict): Additional optional parameters passed to parent class.

Returns:
    dict: Input data with mined negative samples list added.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    miner = reranker.RerankerMineBM25Negatives(num_negatives=5)

    data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_bm25': bm25_index, '_bm25_corpus': corpus}
    result = miner(data)
    # Returns: {'query': '...', 'pos': [...], 'neg': ['bm25_neg1', 'bm25_neg2', ...]}
    ```
    """
    def __init__(self, num_negatives: int = 7, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_negatives = num_negatives

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs
    ) -> dict:
        bm25_index = data.get('_bm25')
        corpus = data.get('_bm25_corpus') or []
        tokenizer = data.get('_bm25_tokenizer', lambda t: t)
        stopwords = data.get('_bm25_stopwords')
        stemmer = data.get('_bm25_stemmer')

        if bm25_index is None:
            LOG.warning('BM25 index not initialized.')
            return {**data, output_neg_key: []}

        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)
        tokenized_query = bm25s.tokenize(
            tokenizer(query), stopwords=stopwords, stemmer=stemmer
        )

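        # Nothing to retrieve from an empty corpus; return before querying the index
        if not corpus:
            return {**data, output_neg_key: []}

        # Over-fetch so enough candidates remain after positives are filtered out
        k = min(len(corpus), self.num_negatives + len(pos_set) + 10)
        indices, scores = bm25_index.retrieve(tokenized_query, k=k)

        negatives = []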

        for idx in indices[0]:
            doc = corpus[idx]
            if doc not in pos_set:
                negatives.append(doc)
                if len(negatives) >= self.num_negatives:
                    break

        result = {k: v for k, v in data.items() if k not in (
            '_bm25', '_bm25_corpus', '_bm25_tokenizer', '_bm25_stopwords', '_bm25_stemmer'
        )}
        result[output_neg_key] = negatives
        return result

`RerankerMineMixedNegatives`

Bases: reranker

Mixed strategy negative sample mining operator.

This operator combines BM25 and semantic similarity methods to mine negative samples. It splits the requested number of negatives between the two methods according to the specified ratio, which yields more diverse hard negatives.

Parameters:

embedding_serving (Callable, default: None ) –

Embedding service callable function.
num_negatives (int, default: 7 ) –

Number of negative samples to mine, defaults to 7.
bm25_ratio (float, default: 0.5 ) –

Fraction of the negatives mined with BM25; the remainder are mined with the semantic method. At least one BM25 negative is always requested. Defaults to 0.5.
**kwargs (dict, default: {} ) –

Additional optional parameters passed to parent class.

Returns:

dict –

Input data with mixed strategy mined negative samples list added.

Examples:

from lazyllm.tools.data import reranker

# Assume embedding_fn is the embedding service
miner = reranker.RerankerMineMixedNegatives(
    embedding_serving=embedding_fn,
    num_negatives=6,
    bm25_ratio=0.5  # 3 BM25 negatives + 3 semantic negatives
)

data = {
    'query': 'machine learning',
    'pos': ['ML tutorial'],
    '_bm25': bm25_index,
    '_bm25_corpus': corpus,
    '_semantic_embeddings_path': emb_path,
    '_semantic_corpus': corpus
}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': [...]} with 3 BM25 negatives and 3 semantic negatives
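
How the budget is divided between the two strategies is simple arithmetic. The sketch below mirrors it in a hypothetical helper, `split_budget`, and evaluates a few settings.

```python
from typing import Tuple


def split_budget(num_negatives: int, bm25_ratio: float) -> Tuple[int, int]:
    """Hypothetical helper: (BM25 count, semantic count) for a given ratio."""
    # The BM25 share is rounded down, with a floor of one negative
    num_bm25 = max(1, int(num_negatives * bm25_ratio))
    # Whatever is left goes to the semantic strategy
    return num_bm25, num_negatives - num_bm25


even = split_budget(6, 0.5)
odd = split_budget(7, 0.5)
zero_ratio = split_budget(7, 0.0)
```

With an odd budget the extra slot goes to the semantic side, and a ratio of 0.0 still requests one BM25 negative. These are requested counts; the number actually returned can be lower when either source runs out of candidates.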

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py

class RerankerMineMixedNegatives(reranker):
    """Mixed strategy negative sample mining operator.

This operator combines BM25 and semantic similarity methods to mine negative samples. Uses both methods according to specified ratio to obtain more diverse hard negatives.

Args:
    embedding_serving (Callable): Embedding service callable function.
    num_negatives (int): Number of negative samples to mine, defaults to 7.
    bm25_ratio (float): BM25 method ratio, remaining portion uses semantic method, defaults to 0.5.
    **kwargs (dict): Additional optional parameters passed to parent class.

Returns:
    dict: Input data with mixed strategy mined negative samples list added.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    # Assume embedding_fn is the embedding service
    miner = reranker.RerankerMineMixedNegatives(
        embedding_serving=embedding_fn,
        num_negatives=6,
        bm25_ratio=0.5  # 3 BM25 negatives + 3 semantic negatives
    )

    data = {
        'query': 'machine learning',
        'pos': ['ML tutorial'],
        '_bm25': bm25_index,
        '_bm25_corpus': corpus,
        '_semantic_embeddings_path': emb_path,
        '_semantic_corpus': corpus
    }
    result = miner(data)
    # Returns: {'query': '...', 'pos': [...], 'neg': [...]} with 3 BM25 negatives and 3 semantic negatives
    ```
    """
    def __init__(self, embedding_serving: Optional[Callable] = None,
                 num_negatives: int = 7, bm25_ratio: float = 0.5, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_negatives = num_negatives
        self.bm25_ratio = bm25_ratio
        self.embedding_serving = embedding_serving

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs
    ) -> dict:
        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)

        # Calculate number of negatives for each strategy
        num_bm25 = max(1, int(self.num_negatives * self.bm25_ratio))
        num_semantic = self.num_negatives - num_bm25

        # Mine BM25 negatives first
        bm25_negatives = []
        bm25_index = data.get('_bm25')
        corpus_bm25 = data.get('_bm25_corpus') or []

        if bm25_index and corpus_bm25:
            tokenizer = data.get('_bm25_tokenizer', lambda t: t)
            stopwords = data.get('_bm25_stopwords')
            stemmer = data.get('_bm25_stemmer')

            tokenized_query = bm25s.tokenize(
                tokenizer(query), stopwords=stopwords, stemmer=stemmer
            )
            k = min(len(corpus_bm25), num_bm25 + len(pos_set) + 5)
            indices, scores = bm25_index.retrieve(tokenized_query, k=k)

            for idx in indices[0]:
                doc = corpus_bm25[idx]
                if doc not in pos_set:
                    bm25_negatives.append(doc)
                    if len(bm25_negatives) >= num_bm25:
                        break

        # Mine semantic negatives
        semantic_negatives = []
        # Load embeddings from file path
        embeddings_path = data.get('_semantic_embeddings_path', '')
        corpus_embeddings = _load_embeddings_from_path(embeddings_path)
        corpus_semantic = data.get('_semantic_corpus') or []

        if corpus_embeddings is not None and corpus_semantic and self.embedding_serving is not None:
            # Update pos_set to exclude BM25 negatives
            pos_set_extended = pos_set | set(bm25_negatives)

            query_embedding = np.array(self.embedding_serving([query])[0])

            # Compute cosine similarity using shared function
            similarities = _compute_cosine_similarity(query_embedding, corpus_embeddings)

            scored_docs = [(sim, doc) for sim, doc in zip(similarities, corpus_semantic)
                           if doc not in pos_set_extended]
            scored_docs.sort(key=lambda x: x[0], reverse=True)

            semantic_negatives = [doc for _, doc in scored_docs[:num_semantic]]

        negatives = bm25_negatives + semantic_negatives
        result = {k: v for k, v in data.items() if k not in (
            '_bm25', '_bm25_corpus', '_bm25_tokenizer', '_bm25_stopwords', '_bm25_stemmer'
        )}
        result[output_neg_key] = negatives
        return result

`RerankerMineRandomNegatives`

Bases: reranker

Random negative sample mining operator.

This operator randomly selects documents from corpus that are not in positive samples as negative samples. Suitable for baseline comparison or scenarios requiring random negative samples.

Parameters:

num_negatives (int, default: 7 ) –

Number of negative samples to mine, defaults to 7.
seed (int, default: 42 ) –

Random seed. It is combined with the query text so that selection is reproducible for each query. Defaults to 42.
**kwargs (dict, default: {} ) –

Additional optional parameters passed to parent class.

Returns:

dict –

Input data with mined negative samples list added.

Examples:

from lazyllm.tools.data import reranker

miner = reranker.RerankerMineRandomNegatives(num_negatives=5, seed=123)

data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_corpus': corpus_path}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], '_corpus': '...', 'neg': ['random_neg1', 'random_neg2', ...]}
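
Reproducibility comes from seeding a local generator with the seed and the query text together. The sketch below shows that idea with the standard `random` module; `sample_negatives` is a hypothetical helper and the candidate documents are sample data.

```python
import random
from typing import List


def sample_negatives(candidates: List[str], query: str,
                     num_negatives: int, seed: int = 42) -> List[str]:
    """Hypothetical helper: draw a reproducible random sample per query."""
    # With too few candidates there is nothing to choose between
    if len(candidates) <= num_negatives:
        return list(candidates)
    # A local generator avoids touching the global random state
    rng = random.Random(f'{seed}_{query}')
    return rng.sample(candidates, num_negatives)


candidates = [f'doc{i}' for i in range(20)]
first = sample_negatives(candidates, 'machine learning', 5)
second = sample_negatives(candidates, 'machine learning', 5)
small = sample_negatives(['a', 'b'], 'machine learning', 5)
```

Repeated calls for the same query return the same sample, while different queries get independent draws even though they share one seed.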

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py

class RerankerMineRandomNegatives(reranker):
    """Random negative sample mining operator.

This operator randomly selects documents from corpus that are not in positive samples as negative samples.
Suitable for baseline comparison or scenarios requiring random negative samples.

Args:
    num_negatives (int): Number of negative samples to mine, defaults to 7.
    seed (int): Random seed for reproducible selection, defaults to 42.
    **kwargs (dict): Additional optional parameters passed to parent class.

Returns:
    dict: Input data with mined negative samples list added.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    miner = reranker.RerankerMineRandomNegatives(num_negatives=5, seed=123)

    data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_corpus': corpus_path}
    result = miner(data)
    # Returns: {'query': '...', 'pos': [...], '_corpus': '...', 'neg': ['random_neg1', 'random_neg2', ...]}
    ```
    """
    def __init__(self, num_negatives: int = 7, seed: int = 42, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.num_negatives = num_negatives
        self.seed = seed

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs
    ) -> dict:
        # Load corpus from file path
        corpus_path = data.get('_corpus', '')
        if isinstance(corpus_path, str) and corpus_path:
            corpus = _load_corpus_from_path(corpus_path)
        elif isinstance(corpus_path, list):
            # Backward compatibility: corpus stored directly
            corpus = corpus_path
        else:
            corpus = []

        if not corpus:
            return {**data, output_neg_key: []}

        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)
        candidates = [doc for doc in corpus if doc not in pos_set]

        if len(candidates) <= self.num_negatives:
            negatives = candidates
        else:
            # Use instance seed combined with query content for reproducibility
            local_random = random.Random(f'{self.seed}_{query}')
            negatives = local_random.sample(candidates, self.num_negatives)

        return {**data, output_neg_key: negatives}
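The per-query seeding used in `forward` can be tried in isolation. The following is a minimal sketch using only the standard library; `sample_negatives` is a hypothetical helper, not part of lazyllm, and it filters against a plain set of positives instead of calling `_normalize_pos_samples`, whose behaviour is not shown here.

```python
import random


def sample_negatives(corpus, positives, query, num_negatives=7, seed=42):
    """Pick random negatives for one query, reproducibly (hypothetical helper)."""
    pos_set = set(positives)
    candidates = [doc for doc in corpus if doc not in pos_set]
    if len(candidates) <= num_negatives:
        # Not enough candidates to sample from: return all of them.
        return candidates
    # A private Random seeded with '<seed>_<query>' makes the draw depend only
    # on the seed and the query text, not on global RNG state or call order.
    rng = random.Random(f'{seed}_{query}')
    return rng.sample(candidates, num_negatives)


corpus = [f'doc{i}' for i in range(20)]
first = sample_negatives(corpus, ['doc0'], 'machine learning', num_negatives=5)
second = sample_negatives(corpus, ['doc0'], 'machine learning', num_negatives=5)
```

Repeating the call with the same seed and query yields the same negatives, which is the property the seed parameter is documented to provide.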

`RerankerMineSemanticNegatives`

Bases: reranker

Semantic similarity negative sample mining operator.

This operator finds the documents most similar to the query, excluding positive samples, based on semantic vector similarity. It is suited to mining hard negatives that are semantically similar to the query but actually irrelevant, and usually performs better than the BM25 method.

Parameters:

num_negatives (int, default: 7 ) –

Number of negative samples to mine, defaults to 7.
embedding_serving (Callable, default: None ) –

Embedding service callable function for computing query vectors.
**kwargs (dict, default: {} ) –

Additional optional parameters passed to parent class.

Returns:

dict –

Input data with negative samples mined based on semantic similarity added.

Examples:

from lazyllm.tools.data import reranker

# Assume embedding_fn is an embedding service
miner = reranker.RerankerMineSemanticNegatives(num_negatives=5, embedding_serving=embedding_fn)

data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': ['semantic_neg1', 'semantic_neg2', ...]}

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py

class RerankerMineSemanticNegatives(reranker):
    """Semantic similarity negative sample mining operator.

This operator finds documents most similar to the query but not in positive samples based on semantic vector similarity.
Suitable for mining hard negatives that are semantically similar but actually irrelevant, usually performs better than BM25 method.

Args:
    num_negatives (int): Number of negative samples to mine, defaults to 7.
    embedding_serving (Callable): Embedding service callable function for computing query vectors.
    **kwargs (dict): Additional optional parameters passed to parent class.

Returns:
    dict: Input data with negative samples mined based on semantic similarity added.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    # Assume embedding_fn is an embedding service
    miner = reranker.RerankerMineSemanticNegatives(num_negatives=5, embedding_serving=embedding_fn)

    data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
    result = miner(data)
    # Returns: {'query': '...', 'pos': [...], 'neg': ['semantic_neg1', 'semantic_neg2', ...]}
    ```
    """
    def __init__(self, num_negatives: int = 7,
                 embedding_serving: Optional[Callable] = None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_negatives = num_negatives
        self.embedding_serving = embedding_serving

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs
    ) -> dict:
        # Load embeddings from file path
        embeddings_path = data.get('_semantic_embeddings_path', '')
        corpus_embeddings = _load_embeddings_from_path(embeddings_path)
        corpus = data.get('_semantic_corpus') or []

        if corpus_embeddings is None:
            LOG.warning('Semantic embeddings not initialized.')
            return {**data, output_neg_key: []}

        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)

        if self.embedding_serving is None:
            return {**data, output_neg_key: []}

        query_embedding = np.array(self.embedding_serving([query])[0])
        similarities = _compute_cosine_similarity(query_embedding, corpus_embeddings)

        scored_docs = [(sim, doc) for sim, doc in zip(similarities, corpus)
                       if doc not in pos_set]
        scored_docs.sort(key=lambda x: x[0], reverse=True)

        negatives = [doc for _, doc in scored_docs[:self.num_negatives]]
        return {**data, output_neg_key: negatives}
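The ranking step can be sketched without an embedding service. The example below is a pure-Python stand-in under stated assumptions: `cosine_similarity` and `top_semantic_negatives` are hypothetical names, the vectors are hand-written toy embeddings, and the library's own `_compute_cosine_similarity` (not shown here) may differ in detail.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_semantic_negatives(query_vec, corpus, corpus_vecs, positives, k):
    """Return the k documents most similar to the query, skipping positives."""
    pos_set = set(positives)
    scored = [
        (cosine_similarity(query_vec, vec), doc)
        for doc, vec in zip(corpus, corpus_vecs)
        if doc not in pos_set
    ]
    # Highest similarity first, so the hardest negatives come out on top.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


corpus = ['ML tutorial', 'deep learning intro', 'cooking recipes']
corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
negatives = top_semantic_negatives([1.0, 0.0], corpus, corpus_vecs, ['ML tutorial'], k=1)
```

Here the positive 'ML tutorial' is excluded even though it is the closest match, and the next most similar document is selected as the hard negative.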

`RerankerParseQueries`

Bases: reranker

Parses LLM-generated query results and expands them into multiple training samples.

It reads the '_query_response' JSON content and extracts the query list (supporting both list and {'queries': [...]} structures). Each query generates a new data record containing:

query: query text
difficulty: difficulty level (default 'medium')
pos: positive sample list (original passage)

Intermediate fields like '_query_response' are removed.

Parameters:

input_key (str, default: 'passage' ) –

source passage field name, default 'passage'
output_query_key (str, default: 'query' ) –

output query field name, default 'query'
**kwargs (dict, default: {} ) –

Additional optional parameters passed to parent class.

Examples:

from lazyllm.tools.data import reranker

op = reranker.RerankerParseQueries(input_key='passage', output_query_key='query')

data = {
    'passage': 'Large language models are widely used in NLP.',
    '_query_response': '[{"query": "What are LLMs used for?", "difficulty": "easy"}]'
}

rows = op(data)
for row in rows:
    print(row['query'], row['difficulty'], row['pos'])

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_query_generator.py

class RerankerParseQueries(reranker):
    """Parses LLM-generated query results and expands them into multiple training samples.

It reads the '_query_response' JSON content and extracts the query list
(supporting both list and {'queries': [...]} structures).
Each query generates a new data record containing:

- query: query text
- difficulty: difficulty level (default 'medium')
- pos: positive sample list (original passage)

Intermediate fields like '_query_response' are removed.

Args:
    input_key (str): source passage field name, default 'passage'
    output_query_key (str): output query field name, default 'query'
    **kwargs (dict): Additional optional parameters passed to parent class.


Examples:
    ```python
    from lazyllm.tools.data import reranker

    op = reranker.RerankerParseQueries(input_key='passage', output_query_key='query')

    data = {
        'passage': 'Large language models are widely used in NLP.',
        '_query_response': '[{"query": "What are LLMs used for?", "difficulty": "easy"}]'
    }

    rows = op(data)
    for row in rows:
        print(row['query'], row['difficulty'], row['pos'])
    ```
    """
    def __init__(
        self,
        input_key: str = 'passage',
        output_query_key: str = 'query',
        **kwargs
    ):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.input_key = input_key
        self.output_query_key = output_query_key

    def forward(
        self,
        data: dict,
        **kwargs
    ) -> List[dict]:
        response = data.get('_query_response', '')
        if not response:
            return []

        passage = data.get(self.input_key, '')
        expanded_rows = []

        try:
            parsed = json.loads(_clean_json_block(response))
            queries = parsed if isinstance(parsed, list) else parsed.get('queries', [])

            for query_item in queries:
                if isinstance(query_item, dict):
                    query = query_item.get('query', '')
                    difficulty = query_item.get('difficulty', 'medium')
                else:
                    query = str(query_item)
                    difficulty = 'medium'

                if query.strip():
                    new_row = data.copy()
                    new_row[self.output_query_key] = query.strip()
                    new_row['difficulty'] = difficulty
                    new_row['pos'] = [passage]  # Positive sample is the source passage
                    # Clean up intermediate fields
                    new_row.pop('_query_prompt', None)
                    new_row.pop('_query_response', None)
                    expanded_rows.append(new_row)

        except Exception as e:
            LOG.warning(f'Failed to parse LLM response: {e}')
            return []

        return expanded_rows
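The two accepted response shapes (a bare list, or a dict with a 'queries' key) can be exercised with a small standalone function. This is a sketch, not the library implementation: `expand_queries` is a hypothetical name, and it assumes the response is already bare JSON, so the `_clean_json_block` step from the listing is skipped.

```python
import json


def expand_queries(data, input_key='passage', output_query_key='query'):
    """Turn one record with a JSON '_query_response' into one row per query."""
    try:
        parsed = json.loads(data.get('_query_response', ''))
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, list):
        queries = parsed
    elif isinstance(parsed, dict):
        queries = parsed.get('queries', [])
    else:
        queries = []

    rows = []
    for item in queries:
        if isinstance(item, dict):
            query, difficulty = str(item.get('query', '')), item.get('difficulty', 'medium')
        else:
            # Plain strings carry no difficulty, so fall back to 'medium'.
            query, difficulty = str(item), 'medium'
        if not query.strip():
            continue
        # Copy the record without the intermediate response field.
        row = {k: v for k, v in data.items() if k != '_query_response'}
        row[output_query_key] = query.strip()
        row['difficulty'] = difficulty
        row['pos'] = [data.get(input_key, '')]
        rows.append(row)
    return rows


record = {
    'passage': 'Large language models are widely used in NLP.',
    '_query_response': '{"queries": ["What are LLMs?", {"query": "Where are LLMs used?", "difficulty": "easy"}]}',
}
rows = expand_queries(record)
```

One input record with two queries becomes two rows, each carrying the source passage as its only positive sample.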

`RerankerTrainTestSplitter`

Bases: reranker

Reranker train/test splitter operator.

This operator randomly splits a dataset into training and test sets according to a configurable split ratio and random seed. It can also save each set to a file; in the test file, each record's pos field is written as corpus for compatibility with evaluation.

Parameters:

test_size (float, default: 0.1 ) –

Test set proportion, defaults to 0.1 (i.e., 10%).
seed (int, default: 42 ) –

Random seed for reproducible splitting, defaults to 42.
train_output_file (str, default: None ) –

Training set output file path, defaults to None.
test_output_file (str, default: None ) –

Test set output file path, defaults to None.
**kwargs (dict, default: {} ) –

Additional optional arguments passed to the parent class.

Returns:

List[dict] –

List of split data; each sample contains a split field marking its set ('train' or 'test').

Examples:

from lazyllm.tools.data import reranker

splitter = reranker.RerankerTrainTestSplitter(
    test_size=0.2,
    seed=123,
    train_output_file='train.jsonl',
    test_output_file='test.jsonl'
)

data = [
    {'query': 'q1', 'pos': ['p1'], 'neg': ['n1']},
    {'query': 'q2', 'pos': ['p2'], 'neg': ['n2']}
]
result = splitter(data)
# Returns the samples in shuffled order, each with a 'split' field, e.g.:
# [{'query': 'q1', 'pos': ['p1'], 'neg': ['n1'], 'split': 'train'}, {'query': 'q2', 'pos': ['p2'], 'neg': ['n2'], 'split': 'test'}]

Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py

class RerankerTrainTestSplitter(reranker):
    """Reranker train/test splitter operator.

This operator randomly splits a dataset into training and test sets according to a configurable split ratio and random seed. It can also save each set to a file; in the test file, each record's pos field is written as corpus for compatibility with evaluation.

Args:
    test_size (float): Test set proportion, defaults to 0.1 (i.e., 10%).
    seed (int): Random seed for reproducible splitting, defaults to 42.
    train_output_file (str, optional): Training set output file path, defaults to None.
    test_output_file (str, optional): Test set output file path, defaults to None.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: List of split data, each sample contains split field marking its set ('train' or 'test').


Examples:
    ```python
    from lazyllm.tools.data import reranker

    splitter = reranker.RerankerTrainTestSplitter(
        test_size=0.2,
        seed=123,
        train_output_file='train.jsonl',
        test_output_file='test.jsonl'
    )

    data = [
        {'query': 'q1', 'pos': ['p1'], 'neg': ['n1']},
        {'query': 'q2', 'pos': ['p2'], 'neg': ['n2']}
    ]
    result = splitter(data)
    # Returns the samples in shuffled order, each with a 'split' field, e.g.:
    # [{'query': 'q1', 'pos': ['p1'], 'neg': ['n1'], 'split': 'train'}, {'query': 'q2', 'pos': ['p2'], 'neg': ['n2'], 'split': 'test'}]
    ```
    """
    def __init__(
            self,
            test_size: float = 0.1,
            seed: int = 42,
            train_output_file: Optional[str] = None,
            test_output_file: Optional[str] = None,
            **kwargs
    ):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.test_size = test_size
        self.seed = seed
        self.train_output_file = train_output_file
        self.test_output_file = test_output_file
        LOG.info(f'Initializing {self.__class__.__name__} with test_size: {test_size}')

    def forward_batch_input(self, data: List[dict]) -> List[dict]:
        assert isinstance(data, list), 'Input data must be a list'
        records = list(data)

        LOG.info(f'Splitting {len(records)} samples with test_size={self.test_size}')

        # Shuffle and split
        random.seed(self.seed)
        shuffled = records.copy()
        random.shuffle(shuffled)

        split_idx = int(len(shuffled) * (1 - self.test_size))
        train_data = shuffled[:split_idx]
        test_data = shuffled[split_idx:]

        # Add split labels
        for item in train_data:
            item['split'] = 'train'
        for item in test_data:
            item['split'] = 'test'

        LOG.info(f'Split completed: {len(train_data)} train, {len(test_data)} test')

        # Save to files if specified
        if self.train_output_file:
            output_path = Path(self.train_output_file)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                for item in train_data:
                    item_copy = {k: v for k, v in item.items() if k != 'split'}
                    f.write(json.dumps(item_copy, ensure_ascii=False) + '\n')
            LOG.info(f'Saved train data to {output_path}')

        if self.test_output_file:
            output_path = Path(self.test_output_file)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                for item in test_data:
                    # For eval data, rename pos to corpus for compatibility
                    item_copy = {
                        'query': item.get('query', ''),
                        'corpus': item.get('pos', []),
                        'neg': item.get('neg', [])
                    }
                    f.write(json.dumps(item_copy, ensure_ascii=False) + '\n')
            LOG.info(f'Saved test data to {output_path}')

        return train_data + test_data
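The split arithmetic and the test-file row layout can be checked with a short standalone sketch. It deliberately differs from the listing in two ways, both assumptions of this example: it uses a local `random.Random` instead of `random.seed`, and it copies each record before labelling, whereas the listing seeds the global generator and adds the split field to the input dicts in place. `split_records` and `to_eval_row` are hypothetical names.

```python
import random


def split_records(records, test_size=0.1, seed=42):
    """Shuffle copies of the records and label each as 'train' or 'test'."""
    rng = random.Random(seed)              # local generator; global state untouched
    shuffled = [dict(r) for r in records]  # copy so the caller's dicts are not mutated
    rng.shuffle(shuffled)
    # Same cut point as the listing: the first (1 - test_size) share is train.
    split_idx = int(len(shuffled) * (1 - test_size))
    for i, item in enumerate(shuffled):
        item['split'] = 'train' if i < split_idx else 'test'
    return shuffled


def to_eval_row(item):
    """Mirror the test-file layout from the listing: 'pos' is written as 'corpus'."""
    return {
        'query': item.get('query', ''),
        'corpus': item.get('pos', []),
        'neg': item.get('neg', []),
    }


records = [{'query': f'q{i}', 'pos': [f'p{i}'], 'neg': [f'n{i}']} for i in range(10)]
labelled = split_records(records, test_size=0.2, seed=123)
test_rows = [to_eval_row(r) for r in labelled if r['split'] == 'test']
```

With 10 records and test_size=0.2 the cut falls at index 8, giving 8 training and 2 test samples. Because `int()` truncates, small datasets can end up with a test set larger than the nominal proportion.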

LLM JSON Operators

`lazyllm.tools.data.operators.llm_base_ops`

`LLMDataJson`

Base class for LLM-based JSON data processing operators. Provides foundational logic for structured output, including automatic JsonFormatter configuration, retry mechanisms, and a pre/verify/post-processing lifecycle.

Constructor args:

model: a LazyLLM model instance.
prompt: optional, ChatPrompter or string to guide the LLM.
max_retries: maximum number of inference attempts, default 3.
**kwargs: additional concurrency or persistence arguments for the base class.

Source code in lazyllm/tools/data/operators/llm_base_ops.py

class LLMDataJson:
    """Base class for LLM-based JSON data processing operators. Provides foundational logic for structured output,
including automatic JsonFormatter configuration, retry mechanisms, and a pre/verify/post-processing lifecycle.

Constructor args:

- model: a LazyLLM model instance.
- prompt: optional, ChatPrompter or string to guide the LLM.
- max_retries: maximum number of inference attempts, default 3.
- **kwargs: additional concurrency or persistence arguments for the base class.
"""
    _default_prompt: Optional[Union[ChatPrompter, str]] = None
    _default_inference_kwargs = {
        'max_new_tokens': 512,
        'temperature': 0.2,
    }

    def __init__(self, model, prompt=None, max_retries=3, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        assert prompt is not None or self._default_prompt is not None, 'Prompt must be provided'
        prompt = prompt if prompt is not None else self._default_prompt
        self.model = model.share().prompt(prompt).formatter(JsonFormatter())
        self._max_retries = max_retries

    def preprocess(self, data: dict, **kwargs) -> Tuple[dict, dict]:
        raise NotImplementedError()

    def verify_output(self, output: dict, data: dict) -> bool:
        raise NotImplementedError()

    def postprocess(self, output: dict, data: dict) -> dict:
        raise NotImplementedError()

    def forward(self, data: dict, **kwargs) -> dict:
        prepared_data, infer_kwargs = self.preprocess(data, **kwargs)
        for key, default_val in self._default_inference_kwargs.items():
            infer_kwargs[key] = infer_kwargs.get(key, default_val)
        error_log = []
        for i in range(self._max_retries):
            try:
                res = self.model(prepared_data, **infer_kwargs)
                if self.verify_output(res, data):
                    return self.postprocess(res, data)
            except Exception as e:
                LOG.warning(f'LLM inference failed, try {i+1}/{self._max_retries}, Error: {e}')
                error_log.append(str(e))
                continue
        else:
            raise RuntimeError(f'LLM inference failed after {self._max_retries} retries. Errors: {"; ".join(error_log)}')
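The preprocess, verify and postprocess lifecycle with retries can be illustrated without lazyllm or a real model. The sketch below is a stand-in under stated assumptions: `RetryingJsonOp` and `FlakyModel` are hypothetical classes, the model is a plain callable, and, unlike the listing, verification failures are also recorded in the error list.

```python
class RetryingJsonOp:
    """Stand-in for the preprocess -> infer -> verify -> postprocess loop (not lazyllm code)."""

    def __init__(self, model, max_retries=3):
        self.model = model
        self.max_retries = max_retries

    def preprocess(self, data):
        return {'text': data['text']}

    def verify_output(self, output, data):
        return isinstance(output, dict) and 'label' in output

    def postprocess(self, output, data):
        return {**data, 'label': output['label']}

    def forward(self, data):
        prepared = self.preprocess(data)
        errors = []
        for attempt in range(1, self.max_retries + 1):
            try:
                output = self.model(prepared)
            except Exception as exc:  # an inference error consumes one attempt
                errors.append(f'attempt {attempt}: {exc}')
                continue
            if self.verify_output(output, data):
                return self.postprocess(output, data)
            errors.append(f'attempt {attempt}: output failed verification')
        raise RuntimeError(f'failed after {self.max_retries} attempts: {"; ".join(errors)}')


class FlakyModel:
    """Hypothetical model: raises once, returns a malformed reply once, then succeeds."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prepared):
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError('simulated timeout')
        if self.calls == 2:
            return 'not a dict'
        return {'label': 'positive'}


model = FlakyModel()
result = RetryingJsonOp(model).forward({'text': 'great product'})
```

Both an exception and an output that fails verification use up one attempt, so with the default of three attempts the third call is the last chance before a RuntimeError is raised.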

`lazyllm.tools.data.operators.llm_json_ops`

`FieldExtractor`

Bases: LLMDataJson, LLMJsonBase

Field extractor. Uses LLM to extract specific information from input text based on a provided list of fields.

Parameters:

model –

a LazyLLM model instance.
prompt –

optional custom extraction prompt.
input_keys –

list of exactly three input keys, in the order persona, text, fields; defaults to ['persona', 'text', 'fields'].
output_key –

key name to store results in the data dict, default 'structured_data'.

Examples:

from lazyllm import OnlineChatModule
from lazyllm.tools.data.operators.llm_json_ops import FieldExtractor
model = OnlineChatModule(source='sensenova')
op = FieldExtractor(model=model)
inputs = [{
    'text': '张三，28岁，目前在上海',
    'fields': ['name', 'age', 'location']
}]
res = op(inputs)
print(res[0]['structured_data']) # {'name': '张三', 'age': '28', 'location': '上海'}

Source code in lazyllm/tools/data/operators/llm_json_ops.py

class FieldExtractor(LLMDataJson, LLMJsonBase):
    """Field extractor. Uses LLM to extract specific information from input text based on a provided list of fields.

Args:
    model: a LazyLLM model instance.
    prompt: optional custom extraction prompt.
    input_keys: list of exactly three input keys, in the order persona, text, fields; defaults to ['persona', 'text', 'fields'].
    output_key: key name to store results in the data dict, default 'structured_data'.


Examples:
    ```python
    from lazyllm import OnlineChatModule
    from lazyllm.tools.data.operators.llm_json_ops import FieldExtractor
    model = OnlineChatModule(source='sensenova')
    op = FieldExtractor(model=model)
    inputs = [{
        'text': '张三，28岁，目前在上海',
        'fields': ['name', 'age', 'location']
    }]
    res = op(inputs)
    print(res[0]['structured_data']) # {'name': '张三', 'age': '28', 'location': '上海'}
    ```
    """
    _default_prompt = DataPrompt('zh')('field_extractor')
    _default_inference_kwargs = {
        'max_new_tokens': 1024,
        'temperature': 0.2,
    }

    def __init__(self, model, prompt=None, input_keys=None, output_key=None, **kwargs):
        super().__init__(model, prompt, **kwargs)
        self.input_keys = input_keys or ['persona', 'text', 'fields']
        assert len(self.input_keys) == 3, 'input_keys must contain exactly three keys.'
        self.output_key = output_key or 'structured_data'

    def preprocess(self, data: dict, **kwargs) -> Tuple[dict, dict]:
        raw_values = [data.get(k) for k in self.input_keys]
        persona, text, fields = ['' if v is None else str(v) for v in raw_values]
        if not text or not fields:
            raise ValueError(
                f'Missing required input keys. Received persona: "{persona}", '
                f'text: "{text}", fields: "{fields}"')
        return {'persona': persona or 'Extractor', 'text': text, 'fields': fields}, kwargs

    def verify_output(self, output: dict, data: dict) -> bool:
        if not isinstance(output, dict):
            return False
        for key in data.get(self.input_keys[2], []):
            if key not in output:
                return False
        return True

    def postprocess(self, output: dict, data: dict) -> dict:
        processed_output = {k: v.strip() if isinstance(v, str) else v for k, v in output.items()}
        data[self.output_key] = processed_output
        return data

`SchemaExtractor`

Bases: LLMDataJson, LLMJsonBase

Schema extractor. Uses an LLM to extract structured data from text according to a schema (dict or Pydantic model) read from each item's 'schema' key; a default schema is used when the key is absent.

Parameters:

model –

a LazyLLM model instance.
prompt –

optional custom extraction prompt.
input_key –

key name for input text, default 'text'.
output_key –

key name to store results in the data dict, default 'structured_data'.

Examples:

from lazyllm import OnlineChatModule
from lazyllm.tools.data.operators.llm_json_ops import SchemaExtractor
model = OnlineChatModule(source='sensenova')
op = SchemaExtractor(model=model)
inputs = [{'text': 'Math score is 95', 'schema': {'subject': 'str', 'score': 'int'}}]
res = op(inputs)
print(res[0]['structured_data']) # {'subject': 'Math', 'score': 95}

Source code in lazyllm/tools/data/operators/llm_json_ops.py

class SchemaExtractor(LLMDataJson, LLMJsonBase):
    """Schema extractor. Uses LLM to extract structured data from text according to a specified schema (dict or Pydantic model).

Args:
    model: a LazyLLM model instance.
    prompt: optional custom extraction prompt.
    input_key: key name for input text, default 'text'.
    output_key: key name to store results in the data dict, default 'structured_data'.


Examples:
    ```python
    from lazyllm import OnlineChatModule
    from lazyllm.tools.data.operators.llm_json_ops import SchemaExtractor
    model = OnlineChatModule(source='sensenova')
    op = SchemaExtractor(model=model)
    inputs = [{'text': 'Math score is 95', 'schema': {'subject': 'str', 'score': 'int'}}]
    res = op(inputs)
    print(res[0]['structured_data']) # {'subject': 'Math', 'score': 95}
    ```
    """
    _default_prompt = DataPrompt('zh')('schema_extractor')
    _default_inference_kwargs = {
        'max_new_tokens': 1024,
        'temperature': 0.2,
    }
    _default_schema = {'subject': 'subject of the event', 'description': 'detailed description of the event'}

    def __init__(self, model, prompt=None, input_key=None, output_key=None, **kwargs):
        super().__init__(model, prompt, **kwargs)
        self.input_key = input_key or 'text'
        self.output_key = output_key or 'structured_data'

    def _get_schema_dict(self, schema: Union[dict, type]) -> dict:
        if isinstance(schema, dict):
            return schema
        elif isinstance(schema, type) and issubclass(schema, BaseModel):
            return schema.model_json_schema()
        else:
            raise ValueError(
                f'Invalid schema format. Expected dict or BaseModel, got {type(schema)}. '
                f'Received schema: "{schema}"'
            )

    def preprocess(self, data: dict, **kwargs) -> Tuple[dict, dict]:
        text = data.get(self.input_key)
        schema = data.get('schema', self._default_schema)
        if not text:
            raise ValueError(f'Missing required input key "{self.input_key}". Received text: "{text}"')
        schema_dict = self._get_schema_dict(schema)
        return {'text': text, 'schema': str(schema_dict)}, kwargs

    def verify_output(self, output: dict, data: dict) -> bool:
        if not isinstance(output, dict):
            return False
        schema = data.get('schema', self._default_schema)
        if isinstance(schema, type) and issubclass(schema, BaseModel):
            try:
                schema(**output)
                return True
            except ValidationError:
                return False
        for key in schema:
            if key not in output:
                return False
        return True

    def postprocess(self, output: dict, data: dict) -> dict:
        processed_output = {k: v.strip() if isinstance(v, str) else v for k, v in output.items()}
        data[self.output_key] = processed_output
        return data
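For dict schemas, the verification and clean-up steps reduce to a key check and a whitespace trim. The sketch below covers only that branch, with hypothetical function names; the Pydantic branch in the listing instead instantiates the model and treats a ValidationError as a failed check.

```python
def verify_against_schema(output, schema):
    """Accept the output only if it is a dict containing every schema key."""
    return isinstance(output, dict) and all(key in output for key in schema)


def strip_string_values(output):
    """Trim surrounding whitespace from string values; leave other types unchanged."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in output.items()}


schema = {'subject': 'str', 'score': 'int'}
good = {'subject': ' Math ', 'score': 95}
bad = {'subject': 'Math'}
cleaned = strip_string_values(good)
```

Note that the key check confirms presence only; it does not validate that a value matches the type named in the schema.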

Data Processing Pipeline

Demo Pipeline

`lazyllm.tools.data.pipelines.demo_pipelines`

`build_demo_pipeline(input_key='text')`

Build a demo data processing pipeline composed of several example operators.

Parameters:

input_key (str, default: 'text' ) –

the text field name to process, default 'text'

Returns:

A callable pipeline object that executes registered operators in sequence.

Examples:

from lazyllm.tools.data.pipelines.demo_pipelines import build_demo_pipeline

ppl = build_demo_pipeline(input_key='text')
data = [{'text': 'lazyLLM'}]
res = ppl(data)
print(res)  # demonstrates how operators are combined and applied

Source code in lazyllm/tools/data/pipelines/demo_pipelines.py

def build_demo_pipeline(input_key='text'):
    """Build a demo data processing pipeline composed of several example operators.

Args:
    input_key (str): the text field name to process, default 'text'

Returns:
    A callable pipeline object that executes registered operators in sequence.


Examples:
    ```python
    from lazyllm.tools.data.pipelines.demo_pipelines import build_demo_pipeline

    ppl = build_demo_pipeline(input_key='text')
    data = [{'text': 'lazyLLM'}]
    res = ppl(data)
    print(res)  # demonstrates how operators are combined and applied
    ```
    """
    with pipeline() as ppl:
        ppl.build_pre_suffix = demo1.build_pre_suffix(input_key=input_key, prefix='Hello, ', suffix='!')
        ppl.process_uppercase = demo1.process_uppercase(input_key=input_key)
        ppl.add_suffix = demo2.AddSuffix(input_key=input_key, suffix='!!!', _max_workers=4)
        ppl.rich_content = demo2.rich_content(input_key=input_key, _concurrency_mode='single')
    return ppl