Data Processing

Data Processing Operators

Base Operator

lazyllm.tools.data.LazyLLMDataBase

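Base class for data processing operators. It gives every operator registered through data_register the same behavior: concurrent execution, saving and restoring results, progress tracking, and error collection.

Main methods and behavior:

  • forward(self, input_data, **kwargs): processes a single item (implemented by the subclass or registered function).
  • forward_batch_input(self, inputs, **kwargs): processes the whole batch and returns the final results (implemented by the subclass or registered function).
  • __call__(self, inputs): the unified entry point. It checks forward_batch_input first and then forward, runs whichever the subclass implements, and handles concurrent execution, resuming from saved progress, and saving results.
  • set_output(self, path): sets the export path. Once set, __call__ returns the path of the exported file instead of the in-memory results.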
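
The dispatch rule described for __call__ can be sketched in isolation. This is a simplified stand-in, not the real LazyLLMDataBase: the class names Base, Upper and Count and the helper _overrides are hypothetical, and concurrency, saving and resume are left out.

```python
class Base:
    """Stand-in for the operator base class (hypothetical, heavily simplified)."""

    def forward(self, item):
        raise NotImplementedError

    def forward_batch_input(self, items):
        raise NotImplementedError

    def _overrides(self, name):
        # A method counts as implemented when the subclass attribute
        # is a different object from the one defined on Base.
        return getattr(type(self), name) is not getattr(Base, name)

    def __call__(self, inputs):
        if not isinstance(inputs, list):
            inputs = [inputs]  # a single item is wrapped in a list
        if self._overrides('forward_batch_input'):
            return self.forward_batch_input(inputs)  # batch path is checked first
        if self._overrides('forward'):
            return [self.forward(x) for x in inputs]  # per-item path
        raise RuntimeError('Must implement forward or forward_batch_input')


class Upper(Base):
    def forward(self, item):
        return {'text': item['text'].upper()}


class Count(Base):
    def forward_batch_input(self, items):
        return [{'count': len(items)}]


print(Upper()({'text': 'a'}))                   # [{'text': 'A'}]
print(Count()([{'text': 'a'}, {'text': 'b'}]))  # [{'count': 2}]
```

An operator that implements neither method raises RuntimeError when called.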

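Constructor parameters:

  • _concurrency_mode (str): concurrency mode, one of 'process', 'thread', or 'single'. If not given, the class-level value is used, falling back to 'process'.
  • _save_data (bool): whether to save intermediate results to disk so that a run can be resumed. Defaults to True.
  • _max_workers (int|None): maximum number of worker processes or threads. None selects a default based on the concurrency mode.
  • _ignore_errors (bool): whether to ignore task exceptions. Defaults to True.
  • **kwargs (dict): additional arguments passed through to the operator.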
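
How the default worker count appears to be chosen when _max_workers is None can be sketched as a standalone function. The function name default_max_workers is hypothetical; the arithmetic follows the constructor shown in the source listing on this page.

```python
import os


def default_max_workers(concurrency_mode, max_workers=None):
    """Return the worker count for a given mode (hypothetical helper)."""
    if max_workers:
        return max_workers  # an explicit value always wins
    if concurrency_mode == 'process':
        return os.cpu_count()  # one worker per CPU for process pools
    # Thread (and single) mode: 5 workers per CPU, at least 32, at most 128.
    return min(max(32, (os.cpu_count() or 1) * 5), 128)


print(default_max_workers('thread', 4))  # 4
print(default_max_workers('process'))    # the CPU count of this machine
print(default_max_workers('thread'))     # between 32 and 128
```

In 'single' mode the value is still computed but no executor is created, so it has no effect.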

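Configuration (via lazyllm.config):

  • data_process_path (str): root path under which processing results are stored.
  • data_process_resume (bool): whether to enable resume, which continues processing from the progress file.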
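
Resume can be pictured as skipping the indices that are already recorded as processed. The sketch below assumes the set of processed indices has already been read from the progress file; the function name pending_indices is hypothetical and the on-disk format is not shown on this page.

```python
def pending_indices(total, processed, done=False):
    """Return the input indices that still need processing (hypothetical helper)."""
    if done:
        return []  # the whole run finished earlier, nothing left to do
    return [i for i in range(total) if i not in processed]


# Items 0 and 2 were processed before the interruption.
print(pending_indices(5, {0, 2}))             # [1, 3, 4]
print(pending_indices(5, {0, 2}, done=True))  # []
```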

Examples:

from lazyllm.tools.data import LazyLLMDataBase

# simple usage: subclass and implement forward
class EchoOp(LazyLLMDataBase):
    def forward(self, data):
        return {'text': data.get('text', '')}

op = EchoOp(_save_data=True)
res = op([{'text': 'hello'}])  # returns list or exported path depending on set_output
Source code in lazyllm/tools/data/base_data.py
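class LazyLLMDataBase(metaclass=LazyLLMRegisterMetaClass):
    """Base class for data processing operators. It gives every operator registered through data_register the same behavior: concurrent execution, saving and restoring results, progress tracking, and error collection.

Main methods and behavior:

- forward(self, input_data, **kwargs): processes a single item (implemented by the subclass or registered function).
- forward_batch_input(self, inputs, **kwargs): processes the whole batch and returns the final results (implemented by the subclass or registered function).
- __call__(self, inputs): the unified entry point. It checks forward_batch_input first and then forward, runs whichever the subclass implements, and handles concurrent execution, resuming from saved progress, and saving results.
- set_output(self, path): sets the export path. Once set, __call__ returns the path of the exported file instead of the in-memory results.

Constructor parameters:

- _concurrency_mode (str): concurrency mode, one of 'process', 'thread', or 'single'.
- _save_data (bool): whether to save intermediate results to disk so that a run can be resumed.
- _max_workers (int|None): maximum number of worker processes or threads. None selects a default.
- _ignore_errors (bool): whether to ignore task exceptions.
- **kwargs (dict): additional arguments passed through to the operator.

Configuration (via lazyllm.config):

- data_process_path (str): root path under which processing results are stored.
- data_process_resume (bool): whether to enable resume, which continues processing from the progress file.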


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    # simple usage: subclass and implement forward
    class EchoOp(LazyLLMDataBase):
        def forward(self, data):
            return {'text': data.get('text', '')}

    op = EchoOp(_save_data=True)
    res = op([{'text': 'hello'}])  # returns list or exported path depending on set_output
    ```
    """
    def __init__(self, _concurrency_mode=None, _save_data=True, _max_workers=None,
                 _ignore_errors=True, **kwargs):
        self._concurrency_mode = _concurrency_mode or getattr(self, '_concurrency_mode', 'process')
        if _max_workers:
            self._max_workers = _max_workers
        elif self._concurrency_mode == 'process':
            self._max_workers = os.cpu_count()
        else:
            self._max_workers = min(max(32, (os.cpu_count() or 1) * 5), 128)
        self._ignore_errors = _ignore_errors
        self._store = DataStateStore(self.__class__.__name__, _save_data)
        self._lazyllm_kwargs = kwargs
        self._export_path = None

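    def set_output(self, output_path):
        """Set the output path so that the final results are exported to a jsonl file and the file path is returned.

Args:
    output_path (str): a directory path or a specific .jsonl file path.

Behavior:

- If a directory path is given, a jsonl file named after the class is created in that directory.
- If the path ends with .jsonl, the results are written to that file (directories are created as needed).
- set_output itself returns the operator, so calls can be chained; the following __call__ returns the absolute path of the written file.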


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    # export to a directory (creates <ClassName>.jsonl inside it)
    op = Demo2.rich_content(input_key='text').set_output('./out_dir')
    path = op([{'text': 'sample'}])
    print(path)  # absolute path of out_dir/<ClassName>.jsonl

    # export to a specific file
    op = Demo2.rich_content(input_key='text').set_output('./out_dir/results.jsonl')
    path = op([{'text': 'sample'}])
    print(path)  # absolute path of out_dir/results.jsonl
    ```
    """
        self._export_path = output_path
        return self

    def _overwrote(self, f):
        return getattr(self.__class__, f) is not getattr(__class__, f) or \
            getattr(self.__class__, '__reg_overwrite__', None) == f

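    def forward(self, input_data, **kwargs):
        """Method that subclasses implement to process a single item. Supported return values:

- dict: the processed result for this item.
- list: one input expanded into multiple outputs; an empty list drops the item.
- None: the original input is kept unchanged.
- A raised exception or a returned error object is recorded in the error file and the item is skipped (depending on configuration and the caller).

Args:
    input_data (dict): a single input data dict.
    **kwargs (dict): additional user-supplied arguments.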


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    class MyOp(LazyLLMDataBase):
        def forward(self, data):
            # return dict or list or None
            return {'text': data.get('text', '').upper()}

    op = MyOp()
    print(op([{'text': 'a'}]))
    ```
    """
        raise NotImplementedError()

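    def forward_batch_input(self, inputs, **kwargs):
        """Batch method that subclasses may implement. Instead of per-item concurrency, it receives the whole input list and returns the final result list (useful for custom batch logic or for handing everything to an external service in one call).

Args:
    inputs (list[dict]): list of input data.
    **kwargs (dict): additional user-supplied arguments.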


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    class BatchOp(LazyLLMDataBase):
        def forward_batch_input(self, inputs):
            # implement batch processing and return a list
            return [{'text': i.get('text', '').lower()} for i in inputs]

    op = BatchOp()
    print(op([{'text': 'A'}, {'text': 'B'}]))
    ```
    """
        raise NotImplementedError()

    def _run_one(self, data):
        try:
            kwargs = getattr(self, '_lazyllm_kwargs', {})
            return self.forward(data, **kwargs)
        except Exception as e:
            err_msg = str(e)
            if isinstance(data, dict):
                return {**data, 'infer_error': err_msg}
            return {'input': data, 'infer_error': err_msg}

    def _process_forward_common(self, data):
        self._store.load_progress()
        results = []
        pbar = tqdm(total=len(data), desc=f'Processing {self.__class__.__name__}', unit='item')

        if self._store.is_done:
            pbar.update(len(data))
            pending_indices = []
        else:
            if len(self._store.processed_indices) > 0:
                pbar.update(len(self._store.processed_indices))

            pending_indices = [idx for idx in range(len(data)) if idx not in self._store.processed_indices]

        if not pending_indices:
            pbar.close()
            return self._store.load_results()

        if self._concurrency_mode == 'single':
            for idx in pending_indices:
                res = self._run_one(data[idx])
                self._handle_result(res, data[idx], results, [idx])
                pbar.update(1)
        else:
            self._process_parallel(data, pending_indices, results, pbar)

        pbar.close()
        # Flush remaining
        if self._store.save_data:
            self._store.save_results([], force=True)  # Flush
            return self._store.load_results()
        return results

    def _process_parallel(self, data, pending_indices, results, pbar):

        executor_cls = ProcessPoolExecutor if self._concurrency_mode == 'process' else ThreadPoolExecutor
        idx_iter = iter(pending_indices)
        futures = {}

        with executor_cls(max_workers=self._max_workers) as executor:
            # 1. Submit initial batch
            for _ in range(self._max_workers):
                try:
                    idx = next(idx_iter)
                    fut = executor.submit(self._run_one, data[idx])
                    futures[fut] = idx
                except StopIteration:
                    break

            # 2. Loop
            while futures:
                done, _ = wait(futures.keys(), return_when=FIRST_COMPLETED)

                for fut in done:
                    idx = futures.pop(fut)
                    try:
                        res = fut.result()
                        self._handle_result(res, data[idx], results, [idx])
                    except Exception as e:
                        if not self._ignore_errors:
                            raise e
                        LOG.error(f'Task failed: {e}')

                    pbar.update(1)

                    # Submit next
                    try:
                        next_idx = next(idx_iter)
                        new_fut = executor.submit(self._run_one, data[next_idx])
                        futures[new_fut] = next_idx
                    except StopIteration:
                        pass

    def _handle_result(self, res, original_data, results, indices):
        if isinstance(res, dict) and 'infer_error' in res:
            if self._store.save_data:
                self._store.save_errors(res)
                self._store.save_results([], indices)
            return

        # Logic to interpret return value
        final_res = []
        if res is None:
            final_res.append(original_data)  # Keep original
        elif isinstance(res, list):
            if res:  # Not empty
                final_res.extend(res)
            # Empty list means delete (do nothing)
        elif isinstance(res, dict):
            final_res.append(res)
        else:
            # Treat unexpected return types as errors
            err_msg = f'Invalid return type {type(res)} from {self.__class__.__name__}, expect dict or list or None'
            LOG.error(err_msg)
            if isinstance(original_data, dict):
                error_res = original_data.copy()
                error_res['infer_error'] = err_msg
            else:
                error_res = {'input': original_data, 'infer_error': err_msg}

            if self._store.save_data:
                self._store.save_errors(error_res)
                self._store.save_results([], indices)
            return

        if self._store.save_data:
            self._store.save_results(final_res, indices)
        else:
            results.extend(final_res)

    def _export_file(self, result):
        if not self._export_path or result is None:
            return result

        path = self._export_path
        if not path.endswith('.jsonl'):
            os.makedirs(path, exist_ok=True)
            path = os.path.join(path, f'{self.__class__.__name__}.jsonl')
        else:
            dir_name = os.path.dirname(path)
            if dir_name:
                os.makedirs(dir_name, exist_ok=True)

        abs_path = os.path.abspath(path)
        with open(abs_path, 'w', encoding='utf-8') as f:
            for item in result:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
        return abs_path

    def __call__(self, inputs):
        if not isinstance(inputs, list):
            inputs = [inputs]

        kwargs = getattr(self, '_lazyllm_kwargs', {})
        res = []

        if self._overwrote('forward_batch_input'):
            self._store.load_progress()
            if self._store.save_data and self._store.resume and self._store.is_done:
                LOG.warning(f'skip {self.__class__.__name__} and load data from {self._store.save_path}')
                res = self._store.load_results()
            else:
                res = self.forward_batch_input(inputs, **kwargs)

                if self._store.save_data and res is not None:
                    self._store.save_results(res if isinstance(res, list) else [res], indices='Done', force=True)

        elif self._overwrote('forward'):
            res = self._process_forward_common(inputs)
        else:
            raise RuntimeError('Must implement forward or forward_batch_input')

        return self._export_file(res)

forward(input_data, **kwargs)

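Method that subclasses implement to process a single item. Supported return values:

  • dict: the processed result for this item.
  • list: one input expanded into multiple outputs; an empty list drops the item.
  • None: the original input is kept unchanged.
  • A raised exception or a returned error object is recorded in the error file and the item is skipped (depending on configuration and the caller).

Parameters:

  • input_data (dict) –

    A single input data dict.

  • **kwargs (dict, default: {} ) –

    Additional user-supplied arguments.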
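
The return-value rules for forward can be sketched as one pure function. The name interpret_result is hypothetical and the logic is a simplified reading of the result handling in the source listing; the real operator also writes results and errors to disk when saving is enabled.

```python
def interpret_result(res, original):
    """Map one forward() return value to (kept_items, error) (hypothetical helper)."""
    if isinstance(res, dict) and 'infer_error' in res:
        return [], res          # error object: nothing kept, error reported
    if res is None:
        return [original], None  # None keeps the original input
    if isinstance(res, list):
        return list(res), None   # list expands to many items; [] drops the item
    if isinstance(res, dict):
        return [res], None       # dict is the single processed result
    # Any other type is treated as an error.
    return [], {'input': original, 'infer_error': f'Invalid return type {type(res)}'}


item = {'text': 'a'}
print(interpret_result({'text': 'A'}, item))  # ([{'text': 'A'}], None)
print(interpret_result(None, item))           # ([{'text': 'a'}], None)
print(interpret_result([], item))             # ([], None)
```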

Examples:

from lazyllm.tools.data import LazyLLMDataBase

class MyOp(LazyLLMDataBase):
    def forward(self, data):
        # return dict or list or None
        return {'text': data.get('text', '').upper()}

op = MyOp()
print(op([{'text': 'a'}]))
Source code in lazyllm/tools/data/base_data.py
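    def forward(self, input_data, **kwargs):
        """Method that subclasses implement to process a single item. Supported return values:

- dict: the processed result for this item.
- list: one input expanded into multiple outputs; an empty list drops the item.
- None: the original input is kept unchanged.
- A raised exception or a returned error object is recorded in the error file and the item is skipped (depending on configuration and the caller).

Args:
    input_data (dict): a single input data dict.
    **kwargs (dict): additional user-supplied arguments.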


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    class MyOp(LazyLLMDataBase):
        def forward(self, data):
            # return dict or list or None
            return {'text': data.get('text', '').upper()}

    op = MyOp()
    print(op([{'text': 'a'}]))
    ```
    """
        raise NotImplementedError()

forward_batch_input(inputs, **kwargs)

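Batch method that subclasses may implement. Instead of per-item concurrency, it receives the whole input list and returns the final result list (useful for custom batch logic or for handing everything to an external service in one call).

Parameters:

  • inputs (list[dict]) –

    List of input data.

  • **kwargs (dict, default: {} ) –

    Additional user-supplied arguments.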

Examples:

from lazyllm.tools.data import LazyLLMDataBase

class BatchOp(LazyLLMDataBase):
    def forward_batch_input(self, inputs):
        # implement batch processing and return a list
        return [{'text': i.get('text', '').lower()} for i in inputs]

op = BatchOp()
print(op([{'text': 'A'}, {'text': 'B'}]))
Source code in lazyllm/tools/data/base_data.py
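    def forward_batch_input(self, inputs, **kwargs):
        """Batch method that subclasses may implement. Instead of per-item concurrency, it receives the whole input list and returns the final result list (useful for custom batch logic or for handing everything to an external service in one call).

Args:
    inputs (list[dict]): list of input data.
    **kwargs (dict): additional user-supplied arguments.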


Examples:
    ```python
    from lazyllm.tools.data import LazyLLMDataBase

    class BatchOp(LazyLLMDataBase):
        def forward_batch_input(self, inputs):
            # implement batch processing and return a list
            return [{'text': i.get('text', '').lower()} for i in inputs]

    op = BatchOp()
    print(op([{'text': 'A'}, {'text': 'B'}]))
    ```
    """
        raise NotImplementedError()

set_output(output_path)

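Sets the output path so that the final results are exported to a jsonl file and the file path is returned.

Parameters:

  • output_path (str) –

    A directory path or a specific .jsonl file path.

Behavior:

  • If a directory path is given, a jsonl file named after the class is created in that directory.
  • If the path ends with .jsonl, the results are written to that file (directories are created as needed).
  • set_output itself returns the operator, so calls can be chained; the following __call__ returns the absolute path of the written file.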
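
The path handling can be sketched as a standalone function. The name export_jsonl and its class_name argument are hypothetical; the logic follows the export step in the source listing, and the demo writes into a temporary directory.

```python
import json
import os
import tempfile


def export_jsonl(result, export_path, class_name):
    """Write result as JSON Lines and return the absolute file path (hypothetical helper)."""
    path = export_path
    if not path.endswith('.jsonl'):
        # Treated as a directory: the file is named after the operator class.
        os.makedirs(path, exist_ok=True)
        path = os.path.join(path, f'{class_name}.jsonl')
    else:
        # Treated as a file: make sure its parent directory exists.
        dir_name = os.path.dirname(path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
    abs_path = os.path.abspath(path)
    with open(abs_path, 'w', encoding='utf-8') as f:
        for item in result:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    return abs_path


with tempfile.TemporaryDirectory() as tmp:
    p1 = export_jsonl([{'text': 'a'}], os.path.join(tmp, 'out_dir'), 'MyOp')
    p2 = export_jsonl([{'text': 'a'}], os.path.join(tmp, 'out_dir', 'results.jsonl'), 'MyOp')
    print(os.path.basename(p1))  # MyOp.jsonl
    print(os.path.basename(p2))  # results.jsonl
```

Because the file is opened in write mode, exporting to the same path again overwrites the earlier file.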

Examples:

from lazyllm.tools.data import Demo2

# export to a directory (creates <ClassName>.jsonl inside it)
op = Demo2.rich_content(input_key='text').set_output('./out_dir')
path = op([{'text': 'sample'}])
print(path)  # absolute path of out_dir/<ClassName>.jsonl

# export to a specific file
op = Demo2.rich_content(input_key='text').set_output('./out_dir/results.jsonl')
path = op([{'text': 'sample'}])
print(path)  # absolute path of out_dir/results.jsonl
Source code in lazyllm/tools/data/base_data.py
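    def set_output(self, output_path):
        """Set the output path so that the final results are exported to a jsonl file and the file path is returned.

Args:
    output_path (str): a directory path or a specific .jsonl file path.

Behavior:

- If a directory path is given, a jsonl file named after the class is created in that directory.
- If the path ends with .jsonl, the results are written to that file (directories are created as needed).
- set_output itself returns the operator, so calls can be chained; the following __call__ returns the absolute path of the written file.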


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    # export to a directory (creates <ClassName>.jsonl inside it)
    op = Demo2.rich_content(input_key='text').set_output('./out_dir')
    path = op([{'text': 'sample'}])
    print(path)  # absolute path of out_dir/<ClassName>.jsonl

    # export to a specific file
    op = Demo2.rich_content(input_key='text').set_output('./out_dir/results.jsonl')
    path = op([{'text': 'sample'}])
    print(path)  # absolute path of out_dir/results.jsonl
    ```
    """
        self._export_path = output_path
        return self

Demo Operators

lazyllm.tools.data.operators.demo_ops

AddSuffix

Bases: Demo2

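An operator implemented as a class that appends a suffix to the specified field. Concurrency is configured through constructor arguments.

Parameters:

  • suffix (str) –

    The suffix to append.

  • input_key (str, default: 'content' ) –

    Name of the text field.

  • _max_workers (int | None) –

    Optional. Maximum number of concurrent workers.

  • _concurrency_mode (str, default: 'process' ) –

    Optional. Concurrency mode.

  • _save_data (bool) –

    Optional. Whether to save results.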

Examples:

from lazyllm.tools.data import Demo2

op = Demo2.AddSuffix(suffix='!!!', input_key='text', _max_workers=2)
data = [{'text': 'wow'}]
res = op(data)
print(res)
# [{'text': 'wow!!!'}]
Source code in lazyllm/tools/data/operators/demo_ops.py
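class AddSuffix(Demo2):
    """An operator implemented as a class that appends a suffix to the specified field. Concurrency is configured through constructor arguments.

Args:
    suffix (str): the suffix to append.
    input_key (str): name of the text field.
    _max_workers (int|None): optional, maximum number of concurrent workers.
    _concurrency_mode (str): optional, concurrency mode.
    _save_data (bool): optional, whether to save results.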


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    op = Demo2.AddSuffix(suffix='!!!', input_key='text', _max_workers=2)
    data = [{'text': 'wow'}]
    res = op(data)
    print(res)
    # [{'text': 'wow!!!'}]
    ```
    """
    def __init__(self, suffix, input_key='content', _concurrency_mode='process', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.suffix = suffix
        self.input_key = input_key

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        data[self.input_key] = f'{data.get(self.input_key, "")}{self.suffix}'
        return data

build_pre_suffix(data, input_key='content', prefix='', suffix='')

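Adds a prefix and a suffix to the specified field of every item in the input list. This operator is registered as a batch function (forward_batch_input).

Parameters:

  • data (list[dict]) –

    The input list.

  • input_key (str, default: 'content' ) –

    Name of the text field.

  • prefix (str, default: '' ) –

    The prefix to add.

  • suffix (str, default: '' ) –

    The suffix to add.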

Examples:

from lazyllm.tools.data import Demo1

op = Demo1.build_pre_suffix(input_key='text', prefix='Hello, ', suffix='!')
data = [{'text': 'world'}]
res = op(data)
print(res)
# [{'text': 'Hello, world!'}]
Source code in lazyllm/tools/data/operators/demo_ops.py
@data_register('data.demo1', rewrite_func='forward_batch_input')
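def build_pre_suffix(data, input_key='content', prefix='', suffix=''):
    """Add a prefix and a suffix to the specified field of every item in the input list. This operator is registered as a batch function (forward_batch_input).

Args:
    data (list[dict]): the input list.
    input_key (str): name of the text field.
    prefix (str): the prefix to add.
    suffix (str): the suffix to add.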


Examples:
    ```python
    from lazyllm.tools.data import Demo1

    op = Demo1.build_pre_suffix(input_key='text', prefix='Hello, ', suffix='!')
    data = [{'text': 'world'}]
    res = op(data)
    print(res)
    # [{'text': 'Hello, world!'}]
    ```
    """
    assert isinstance(data, list)
    for item in data:
        item[input_key] = f'{prefix}{item.get(input_key, "")}{suffix}'
    return data

error_prone_op(data, input_key='content')

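A test operator: it raises an exception when the value of the input_key field is 'fail' and otherwise returns the processed dict. It is used to verify error collection and the skipping of failed items.

Parameters:

  • data (dict) –

    A single data dict.

  • input_key (str, default: 'content' ) –

    Name of the text field.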
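
The error-collection behavior this operator exercises can be sketched without the library. The helper run_one is a hypothetical stand-in for the wrapper in the source listing that turns an exception into a dict carrying an infer_error key; error_prone mirrors the operator body but copies the input instead of mutating it.

```python
def run_one(fn, item):
    """Call fn(item); on failure return the item annotated with the error message."""
    try:
        return fn(item)
    except Exception as exc:
        if isinstance(item, dict):
            return {**item, 'infer_error': str(exc)}
        return {'input': item, 'infer_error': str(exc)}


def error_prone(item, input_key='content'):
    content = item.get(input_key, '')
    if content == 'fail':
        raise ValueError('Intentional error for testing.')
    return {**item, input_key: f'Processed: {content}'}


data = [{'content': 'ok'}, {'content': 'fail'}, {'content': 'ok2'}]
outcomes = [run_one(error_prone, d) for d in data]
results = [o for o in outcomes if 'infer_error' not in o]  # valid items only
errors = [o for o in outcomes if 'infer_error' in o]       # would go to the error file
print(results)  # [{'content': 'Processed: ok'}, {'content': 'Processed: ok2'}]
print(errors)   # [{'content': 'fail', 'infer_error': 'Intentional error for testing.'}]
```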

Examples:

from lazyllm.tools.data import Demo2

op = Demo2.error_prone_op(input_key='text', _save_data=True, _concurrency_mode='single')
data = [{'text': 'ok'}, {'text': 'fail'}, {'text': 'ok2'}]
res = op(data)
print(res)
# [{'text': 'Processed: ok'}, {'text': 'Processed: ok2'}]
# valid results skip the failed item; error details written to error file
Source code in lazyllm/tools/data/operators/demo_ops.py
@data_register('data.demo2', rewrite_func='forward')
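def error_prone_op(data, input_key='content'):
    """A test operator: it raises an exception when the value of the input_key field is 'fail' and otherwise returns the processed dict. It is used to verify error collection and the skipping of failed items.

Args:
    data (dict): a single data dict.
    input_key (str): name of the text field.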


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    op = Demo2.error_prone_op(input_key='text', _save_data=True, _concurrency_mode='single')
    data = [{'text': 'ok'}, {'text': 'fail'}, {'text': 'ok2'}]
    res = op(data)
    print(res)
    # [{'text': 'Processed: ok'}, {'text': 'Processed: ok2'}]
    # valid results skip the failed item; error details written to error file
    ```
    """
    assert isinstance(data, dict)
    content = data.get(input_key, '')
    if content == 'fail':
        raise ValueError('Intentional error for testing.')
    data[input_key] = f'Processed: {content}'
    return data

process_uppercase(data, input_key='content')

Converts the input text field to uppercase. Registered as a single-item function (forward).

Parameters:

  • data (dict) –

    A single data dict.

  • input_key (str, default: 'content' ) –

    Name of the text field, 'content' by default.

Examples:

from lazyllm.tools.data import Demo1

op = Demo1.process_uppercase(input_key='text')
data = [{'text': 'hello'}]
res = op(data)
print(res)
# [{'text': 'HELLO'}]
Source code in lazyllm/tools/data/operators/demo_ops.py
@data_register('data.demo1', rewrite_func='forward', _concurrency_mode='process')
def process_uppercase(data, input_key='content'):
    """Convert the input text field to uppercase. Registered as a single-item function (forward).

Args:
    data (dict): a single data dict.
    input_key (str): name of the text field, 'content' by default.


Examples:
    ```python
    from lazyllm.tools.data import Demo1

    op = Demo1.process_uppercase(input_key='text')
    data = [{'text': 'hello'}]
    res = op(data)
    print(res)
    # [{'text': 'HELLO'}]
    ```
    """
    assert isinstance(data, dict)
    data[input_key] = data.get(input_key, '').upper()
    return data

rich_content(data, input_key='content')

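Splits a single input into multiple outputs, producing a rich-content representation (the original plus two derived items). It illustrates a forward that returns a list.

Parameters:

  • data (dict) –

    A single data dict.

  • input_key (str, default: 'content' ) –

    Name of the text field.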

Examples:

from lazyllm.tools.data import Demo2

op = Demo2.rich_content(input_key='text')
data = [{'text': 'This is a test.'}]
res = op(data)
print(res)
# [
#   {'text': 'This is a test.'},
#   {'text': 'This is a test. - part 1'},
#   {'text': 'This is a test. - part 2'}
# ]
Source code in lazyllm/tools/data/operators/demo_ops.py
@data_register('data.demo2', rewrite_func='forward', _concurrency_mode='process')
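def rich_content(data, input_key='content'):
    """Split a single input into multiple outputs, producing a rich-content representation (the original plus two derived items). It illustrates a forward that returns a list.

Args:
    data (dict): a single data dict.
    input_key (str): name of the text field.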


Examples:
    ```python
    from lazyllm.tools.data import Demo2

    op = Demo2.rich_content(input_key='text')
    data = [{'text': 'This is a test.'}]
    res = op(data)
    print(res)
    # [
    #   {'text': 'This is a test.'},
    #   {'text': 'This is a test. - part 1'},
    #   {'text': 'This is a test. - part 2'}
    # ]
    ```
    """
    assert isinstance(data, dict)
    content = data.get(input_key, '')
    new_res = [data]
    for i in range(2):
        new_data = data.copy()
        new_data[input_key] = f'{content} - part {i+1}'
        new_res.append(new_data)
    return new_res

Preference Data Processing Operators

lazyllm.tools.data.operators.preference_ops

IntentExtractor

Bases: PreferenceOps

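Preference data operator: intent extractor.

Extracts the "core intent" from the specified field of the input dict and writes it to the output field, ready for generating candidate responses and constructing preference pairs.

Notes:

  • The operator uses a model together with a JSON formatter and expects the model output to be a JSON dict; if the output cannot be parsed as a dict, the output field is set to None.
  • The default concurrency mode is thread.

Parameters:

  • model

    LazyLLM model object (required). It is reused after share() is called on it.

  • input_key (str, default: 'content' ) –

    Name of the input text field, 'content' by default.

  • output_key (str, default: 'intent' ) –

    Name of the output intent field, 'intent' by default.

  • **kwargs

    Additional arguments passed to the base operator (such as _max_workers and _save_data).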
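
The "dict or None" rule can be illustrated with stub models. The function extract_intent and the two stubs are hypothetical; a real LazyLLM model with a JSON formatter is replaced here by plain callables.

```python
def extract_intent(model, raw_text):
    """Return the model output if it is a dict, otherwise None (hypothetical helper)."""
    res = model(raw_text)
    return res if isinstance(res, dict) else None


def json_like_model(text):
    # Stub standing in for a model whose output parsed as a JSON dict.
    return {'intent': 'book_hotel'}


def plain_text_model(text):
    # Stub standing in for a model whose output could not be parsed.
    return 'Sorry, I cannot answer in JSON.'


print(extract_intent(json_like_model, 'I want a hotel'))   # {'intent': 'book_hotel'}
print(extract_intent(plain_text_model, 'I want a hotel'))  # None
```

Downstream steps should therefore be prepared for an intent field that is None.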

Examples:

from lazyllm.tools.data.operators.preference_ops import IntentExtractor

# model must be provided by your project environment, e.g. a model object obtained from lazyllm.xxx(...)
op = IntentExtractor(model=model, input_key='content', output_key='intent')
print(op({'content': 'I want to stay at a hotel in Beijing.'}))
# [{
#   'content': 'I want to stay at a hotel in Beijing.',
#   'intent': {
#     'intent': 'book_hotel',
#     'entities': [{'entity': 'location', 'value': 'Beijing'}]
#   }
# }]
Source code in lazyllm/tools/data/operators/preference_ops.py
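class IntentExtractor(PreferenceOps):
    """Preference data operator: intent extractor.

Extracts the "core intent" from the specified field of the input dict and writes it to the output field, ready for generating candidate responses and constructing preference pairs.

Notes:

- The operator uses a model together with a JSON formatter and expects the model output to be a JSON dict; if the output cannot be parsed as a dict, the output field is set to None.
- The default concurrency mode is thread.

Args:
    model: LazyLLM model object (required). It is reused after share() is called on it.
    input_key (str): name of the input text field, 'content' by default.
    output_key (str): name of the output intent field, 'intent' by default.
    **kwargs: additional arguments passed to the base operator (such as _max_workers and _save_data).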


Examples:
    ```python
    from lazyllm.tools.data.operators.preference_ops import IntentExtractor

    # model must be provided by your project environment, e.g. a model object obtained from lazyllm.xxx(...)
    op = IntentExtractor(model=model, input_key='content', output_key='intent')
    print(op({'content': 'I want to stay at a hotel in Beijing.'}))
    # [{
    #   'content': 'I want to stay at a hotel in Beijing.',
    #   'intent': {
    #     'intent': 'book_hotel',
    #     'entities': [{'entity': 'location', 'value': 'Beijing'}]
    #   }
    # }]
    ```
    """
    def __init__(self, model=None, input_key='content', output_key='intent', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = '你是一个意图提取助手,请从用户文本中提取核心意图,并以 JSON 格式返回。'
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.input_key in data:
            data[self.output_key] = self.extract(data[self.input_key])
        return data

    def extract(self, raw_text):
        instruction = f'提炼以下用户文本的核心意图: \n{raw_text}'
        res = self.model(instruction)
        return res if isinstance(res, dict) else None

PreferencePairConstructor

Bases: PreferenceOps

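Preference data operator: preference pair constructor (chosen / rejected).

Builds one (chosen, rejected) pair from a list of candidate responses and the matching list of scores, and outputs it in preference-data format:

  • instruction: the instruction text (taken from the intent field by default)
  • chosen: the better response
  • rejected: the worse response

Two strategies are supported:

  • max_min: the highest-scoring response becomes chosen and the lowest-scoring one becomes rejected (the highest score must be strictly greater than the lowest).
  • threshold: pairs are scanned from the highest score downward, and the first pair whose score difference is >= threshold is returned.

Note: if the input is empty, the two lists differ in length, or no valid pair can be constructed, an empty list [] is returned, which filters the invalid sample out of the pipeline.

Parameters:

  • strategy (str, default: 'max_min' ) –

    'max_min' or 'threshold', 'max_min' by default.

  • threshold (float, default: 0.5 ) –

    Minimum score difference used when strategy == 'threshold', 0.5 by default.

  • instruction_key (str, default: 'intent' ) –

    Name of the instruction field, 'intent' by default.

  • response_key (str, default: 'responses' ) –

    Name of the candidate response list field, 'responses' by default.

  • score_key (str, default: 'evaluation' ) –

    Name of the score list field, 'evaluation' by default.

  • output_chosen_key (str, default: 'chosen' ) –

    Name of the output chosen field, 'chosen' by default.

  • output_rejected_key (str, default: 'rejected' ) –

    Name of the output rejected field, 'rejected' by default.

  • **kwargs

    Additional arguments passed to the base operator.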
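
The two strategies can be compared with a standalone function that follows the pairing logic in the source listing. The free function construct_pair below is a sketch for experimentation and takes the strategy and threshold as arguments rather than reading them from an operator instance.

```python
def construct_pair(responses, scores, strategy='max_min', threshold=0.5):
    """Return (chosen, rejected), or (None, None) when no valid pair exists."""
    if not responses or len(responses) != len(scores) or len(responses) < 2:
        return None, None
    # Sort (response, score) pairs from highest to lowest score.
    pairs = sorted(zip(responses, scores), key=lambda p: p[1], reverse=True)
    if strategy == 'max_min':
        if pairs[0][1] > pairs[-1][1]:
            return pairs[0][0], pairs[-1][0]
    elif strategy == 'threshold':
        # First pair, scanning from the top, whose gap reaches the threshold.
        for i in range(len(pairs)):
            for j in range(i + 1, len(pairs)):
                if pairs[i][1] - pairs[j][1] >= threshold:
                    return pairs[i][0], pairs[j][0]
    return None, None


print(construct_pair(['a', 'b', 'c'], [7, 9, 8]))                    # ('b', 'a')
print(construct_pair(['a', 'b', 'c'], [7, 9, 8], 'threshold', 1.0))  # ('b', 'c')
print(construct_pair(['a', 'b', 'c'], [7, 9, 8], 'threshold', 1.5))  # ('b', 'a')
print(construct_pair(['a', 'b'], [5, 5]))                            # (None, None)
```

With threshold the rejected response is the closest lower-scoring one that still clears the gap, so it can differ from the max_min choice.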

Examples:

from lazyllm.tools.data.operators.preference_ops import PreferencePairConstructor

op = PreferencePairConstructor(strategy='max_min', instruction_key='intent',
                              response_key='responses', score_key='evaluation')
data = {
    'intent': 'book a hotel',
    'responses': ['good response', 'bad response'],
    'evaluation': [10, 6],
}
print(op(data))
# [{
#   'instruction': 'book a hotel',
#   'chosen': 'good response',
#   'rejected': 'bad response'
# }]
Source code in lazyllm/tools/data/operators/preference_ops.py
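class PreferencePairConstructor(PreferenceOps):
    """Preference data operator: preference pair constructor (chosen / rejected).

Builds one (chosen, rejected) pair from a list of candidate responses and the matching list of scores, and outputs it in preference-data format:

- instruction: the instruction text (taken from the intent field by default)
- chosen: the better response
- rejected: the worse response

Two strategies are supported:

- max_min: the highest-scoring response becomes chosen and the lowest-scoring one becomes rejected (the highest score must be strictly greater than the lowest).
- threshold: pairs are scanned from the highest score downward, and the first pair whose score difference is >= threshold is returned.

Note: if the input is empty, the two lists differ in length, or no valid pair can be constructed, an empty list [] is returned, which filters the invalid sample out of the pipeline.

Args:
    strategy (str): 'max_min' or 'threshold', 'max_min' by default.
    threshold (float): minimum score difference used when strategy == 'threshold', 0.5 by default.
    instruction_key (str): name of the instruction field, 'intent' by default.
    response_key (str): name of the candidate response list field, 'responses' by default.
    score_key (str): name of the score list field, 'evaluation' by default.
    output_chosen_key (str): name of the output chosen field, 'chosen' by default.
    output_rejected_key (str): name of the output rejected field, 'rejected' by default.
    **kwargs: additional arguments passed to the base operator.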


Examples:
    ```python
    from lazyllm.tools.data.operators.preference_ops import PreferencePairConstructor

    op = PreferencePairConstructor(strategy='max_min', instruction_key='intent',
                                  response_key='responses', score_key='evaluation')
    data = {
        'intent': 'book a hotel',
        'responses': ['good response', 'bad response'],
        'evaluation': [10, 6],
    }
    print(op(data))
    # [{
    #   'instruction': 'book a hotel',
    #   'chosen': 'good response',
    #   'rejected': 'bad response'
    # }]
    ```
    """
    def __init__(self, strategy='max_min', threshold=0.5,
                 instruction_key='intent', response_key='responses', score_key='evaluation',
                 output_chosen_key='chosen', output_rejected_key='rejected', **kwargs):
        super().__init__(**kwargs)
        self.strategy = strategy
        self.threshold = threshold
        self.instruction_key = instruction_key
        self.response_key = response_key
        self.score_key = score_key
        self.output_chosen_key = output_chosen_key
        self.output_rejected_key = output_rejected_key

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.response_key in data and self.score_key in data:
            responses = data[self.response_key]
            scores = data[self.score_key]

            if not responses or not scores or len(responses) != len(scores):
                return []

            chosen, rejected = self.construct_pair(responses, scores)

            if chosen is not None and rejected is not None:
                return {
                    'instruction': data.get(self.instruction_key, ''),
                    self.output_chosen_key: chosen,
                    self.output_rejected_key: rejected
                }

        return []

    def construct_pair(self, responses, scores):
        if len(responses) < 2:
            return None, None

        pairs = list(zip(responses, scores))
        pairs.sort(key=lambda x: x[1], reverse=True)

        if self.strategy == 'max_min':
            chosen_pair = pairs[0]
            rejected_pair = pairs[-1]

            if chosen_pair[1] > rejected_pair[1]:
                return chosen_pair[0], rejected_pair[0]

        elif self.strategy == 'threshold':
            for i in range(len(pairs)):
                for j in range(i + 1, len(pairs)):
                    score_diff = pairs[i][1] - pairs[j][1]
                    if score_diff >= self.threshold:
                        return pairs[i][0], pairs[j][0]

        return None, None

PreferenceResponseGenerator

Bases: PreferenceOps

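Preference data operator: multi-candidate response generator.

Generates n candidate responses from the intent produced by the previous step (or from any instruction text) and writes them as a list to the output field.

Parameters:

  • model

    LazyLLM model object (required). It is reused after share() is called on it.

  • n (int, default: 3 ) –

    Number of candidate responses to generate, 3 by default.

  • temperature (float, default: 1.0 ) –

    Sampling temperature, 1.0 by default.

  • system_prompt (str | None, default: None ) –

    Optional system prompt; if provided, .prompt(system_prompt) is called on the model.

  • input_key (str, default: 'intent' ) –

    Name of the input field, 'intent' by default.

  • output_key (str, default: 'responses' ) –

    Name of the output field, 'responses' by default.

  • **kwargs

    Additional arguments passed to the base operator.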
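
The generation step amounts to calling the model n times with the same input and temperature. The sketch below uses a hypothetical stub model (a plain function with a counter) in place of a real LazyLLM model, so the replies are placeholders.

```python
import itertools


def generate(model, prompt, n=3, temperature=1.0):
    """Collect n independent responses for one prompt (hypothetical helper)."""
    return [model(prompt, temperature=temperature) for _ in range(n)]


_counter = itertools.count(1)


def stub_model(prompt, temperature=1.0):
    # Stand-in for a sampling model: each call returns a different string.
    return f'reply {next(_counter)} to {prompt!r}'


responses = generate(stub_model, 'book a hotel', n=3, temperature=0.8)
print(len(responses))  # 3
print(responses[0])    # reply 1 to 'book a hotel'
```

The calls for one item run sequentially, so n multiplies the per-item latency; parallelism comes from processing several items at once.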

Examples:

from lazyllm.tools.data.operators.preference_ops import PreferenceResponseGenerator

op = PreferenceResponseGenerator(model=model, n=3, temperature=0.8, input_key='intent', output_key='responses')
print(op({'intent': 'book a hotel'}))
# [{
#   'intent': 'book a hotel',
#   'responses': [
#     "<think>Okay, the user wants to book a hotel. ...",
#     "<think>Okay, the user wants to book a hotel. ...",
#     ...  (three responses in total, since n=3)
#   ]
# }]
Source code in lazyllm/tools/data/operators/preference_ops.py
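class PreferenceResponseGenerator(PreferenceOps):
    """Preference data operator: multi-candidate response generator.

Generates n candidate responses from the intent produced by the previous step (or from any instruction text) and writes them as a list to the output field.

Args:
    model: LazyLLM model object (required). It is reused after share() is called on it.
    n (int): number of candidate responses to generate, 3 by default.
    temperature (float): sampling temperature, 1.0 by default.
    system_prompt (str|None): optional system prompt; if provided, .prompt(system_prompt) is called on the model.
    input_key (str): name of the input field, 'intent' by default.
    output_key (str): name of the output field, 'responses' by default.
    **kwargs: additional arguments passed to the base operator.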


Examples:
    ```python
    from lazyllm.tools.data.operators.preference_ops import PreferenceResponseGenerator

    op = PreferenceResponseGenerator(model=model, n=3, temperature=0.8, input_key='intent', output_key='responses')
    print(op({'intent': 'book a hotel'}))
    # [{
    #   'intent': 'book a hotel',
    #   'responses': [
    #     "<think>Okay, the user wants to book a hotel. ...",
    #     "<think>Okay, the user wants to book a hotel. ...",
    #     ...  (three responses in total, since n=3)
    #   ]
    # }]
    ```
    """
    def __init__(self, model=None, n=3, temperature=1.0, system_prompt=None,
                 input_key='intent', output_key='responses', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.n = n
        self.temperature = temperature
        self.input_key = input_key
        self.output_key = output_key
        self.model = model.share()
        if system_prompt:
            self.model = self.model.prompt(system_prompt)

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.input_key in data:
            data[self.output_key] = self.generate(data[self.input_key])
        return data

    def generate(self, x):
        responses = []
        for _ in range(self.n):
            response = self.model(x, temperature=self.temperature)
            responses.append(response)
        return responses

ResponseEvaluator

Bases: PreferenceOps

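Preference-data operator: candidate response evaluator.

Scores each candidate response to the same instruction and outputs one score per response, so that chosen/rejected pairs can be built downstream.

Scoring dimensions (10 points in total):

  • Helpfulness: 4 points
  • Truthfulness: 3 points
  • Fluency: 3 points

Notes:

  • The operator wraps the model with a JSON formatter; for each response the model is expected to return a dict containing total_score.
  • If the model output for a response is not a dict, a warning is logged and that response is scored 0. If the output is a dict without total_score, the response is scored 0 without a warning.

Parameters:

  • model

    LazyLLM model object (required); it is reused via share().

  • input_key (str, default: 'content' ) –

    Name of the field holding the instruction or original content.

  • response_key (str, default: 'responses' ) –

    Name of the field holding the list of candidate responses.

  • output_key (str, default: 'evaluation' ) –

    Name of the output field that receives the list of scores.

  • **kwargs

    Additional arguments passed through to the base operator.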

Examples:

from lazyllm.tools.data.operators.preference_ops import ResponseEvaluator

op = ResponseEvaluator(model=model, input_key='intent', response_key='responses', output_key='evaluation')
data = {
    'intent': {'intent': 'book a hotel'},
    'responses': [
        'I can help you book a hotel in Beijing.',
        'Here are some hotels for you.'
    ],
}
print(op(data))
# [{
#   'intent': {'intent': 'book a hotel'},
#   'responses': [
#     'I can help you book a hotel in Beijing.',
#     'Here are some hotels for you.'
#   ],
#   'evaluation': [10, 8]
# }]
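
The score list is meant to feed chosen/rejected construction. Below is a minimal sketch of that step, assuming the highest-scoring response becomes chosen and the lowest becomes rejected. `build_preference_pair` and the `prompt`/`chosen`/`rejected` key names are assumptions made for illustration and are not part of lazyllm.

```python
def build_preference_pair(record, input_key='intent',
                          response_key='responses', score_key='evaluation'):
    """Hypothetical helper: pick the best and worst response by score."""
    responses = record.get(response_key) or []
    scores = record.get(score_key) or []
    # At least two aligned candidates are needed to form a pair.
    if len(responses) < 2 or len(responses) != len(scores):
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    # Equal scores carry no preference signal.
    if scores[best] == scores[worst]:
        return None
    return {'prompt': record.get(input_key),
            'chosen': responses[best],
            'rejected': responses[worst]}


record = {
    'intent': {'intent': 'book a hotel'},
    'responses': ['I can help you book a hotel in Beijing.', 'Here are some hotels for you.'],
    'evaluation': [10, 8],
}
pair = build_preference_pair(record)
print(pair['chosen'])
```

Since a score of 0 can also mean that the judge output could not be parsed, callers may prefer to drop zero-scored entries before pairing.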
Source code in lazyllm/tools/data/operators/preference_ops.py
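class ResponseEvaluator(PreferenceOps):
    """Preference-data operator: candidate response evaluator.

Scores each candidate response to the same instruction and outputs one score per response, so that chosen/rejected pairs can be built downstream.

Scoring dimensions (10 points in total):

- Helpfulness: 4 points
- Truthfulness: 3 points
- Fluency: 3 points

Notes:

- The operator wraps the model with a JSON formatter; for each response the model is expected to return a dict containing total_score.
- If the model output for a response is not a dict, a warning is logged and that response is scored 0. If the output is a dict without total_score, the response is scored 0 without a warning.

Args:
    model: LazyLLM model object (required); it is reused via share().
    input_key (str): Name of the field holding the instruction or original content. Defaults to 'content'.
    response_key (str): Name of the field holding the list of candidate responses. Defaults to 'responses'.
    output_key (str): Name of the output field that receives the list of scores. Defaults to 'evaluation'.
    **kwargs: Additional arguments passed through to the base operator.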


Examples:
    ```python
    from lazyllm.tools.data.operators.preference_ops import ResponseEvaluator

    op = ResponseEvaluator(model=model, input_key='intent', response_key='responses', output_key='evaluation')
    data = {
        'intent': {'intent': 'book a hotel'},
        'responses': [
            'I can help you book a hotel in Beijing.',
            'Here are some hotels for you.'
        ],
    }
    print(op(data))
    # [{
    #   'intent': {'intent': 'book a hotel'},
    #   'responses': [
    #     'I can help you book a hotel in Beijing.',
    #     'Here are some hotels for you.'
    #   ],
    #   'evaluation': [10, 8]
    # }]
    ```
    """
    def __init__(self, model=None, input_key='content', response_key='responses', output_key='evaluation', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.response_key = response_key
        self.output_key = output_key
        sys_prompt = (
            '你是一个专业的回复评测判官。请针对用户提供的指令和回复,从以下三个维度进行打分,总分为 10 分:\n'
            '1. 有用性 (Helpfulness): 满分 4 分。回复是否解决了用户的问题。\n'
            '2. 真实性 (Truthfulness): 满分 3 分。回复内容是否准确、无误导。\n'
            '3. 流畅度 (Fluency): 满分 3 分。回复是否自然、逻辑清晰。\n'
            '请先给出详细的理由 (Rationale),然后以 JSON 格式输出各项得分及总分。\n'
            '输出示例:\n'
            '{\n'
            '  "rationale": "回复简洁且准确...",\n'
            '  "scores": {"helpfulness": 4, "truthfulness": 3, "fluency": 3},\n'
            '  "total_score": 10\n'
            '}'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.input_key in data and self.response_key in data:
            data[self.output_key] = self.evaluate(data[self.input_key], data[self.response_key])
        return data

    def evaluate(self, instruction, responses):
        scores = []
        for resp in responses:
            prompt = (
                f'指令: {instruction}\n\n'
                f'回复: {resp}\n\n'
                '请对上述回复进行打分。'
            )
            res = self.model(prompt)
            if isinstance(res, dict):
                scores.append(res.get('total_score', 0))
            else:
                LOG.warning(f'Failed to extract total_score from response: {res}')
                scores.append(0)
        return scores

Tool-use data processing operators

lazyllm.tools.data.operators.tool_use_ops

ChainedLogicAssembler

Bases: ToolUseOps

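Tool-use data generation operator: sequential task generator.

Given a list of atomic tasks, generates successor relations between tasks together with the corresponding composed tasks, for building linear or dependency-ordered task chains.

Typical structure of the model's JSON output:

  • items: a list whose entries each contain:
      • task: the current atomic task
      • next_task: the task that immediately follows it
      • composed_task: a description combining task and next_task

Parameters:

  • model

    LazyLLM model object (required).

  • input_key (str, default: 'atomic_tasks' ) –

    Name of the input field holding the atomic tasks.

  • output_key (str, default: 'sequential_tasks' ) –

    Name of the output field that receives the list of sequential tasks (the items list from the model output).

  • system_prompt (str | None, default: None ) –

    Optional system prompt; when omitted, the built-in prompt is used.

  • **kwargs

    Additional arguments passed through to the base operator.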

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ChainedLogicAssembler

atomic_tasks = [
    {'task': '获取出发地与目的地'},
    {'task': '确认出行日期'},
    {'task': '筛选符合条件的车次'},
]
op = ChainedLogicAssembler(model=model, input_key='atomic_tasks', output_key='sequential_tasks')
print(op({'atomic_tasks': atomic_tasks}))
# {
#   'atomic_tasks': [...],
#   'sequential_tasks': [
#     {'task': '获取出发地与目的地', 'next_task': '确认出行日期', 'composed_task': '先获取站点再确认日期'},
#     {'task': '确认出行日期', 'next_task': '筛选符合条件的车次', 'composed_task': '在已知日期基础上筛选车次'},
#     ...
#   ]
# }
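
One way to turn the `task`/`next_task` pairs into a single linear chain is to follow the links from the task that has no predecessor. This is a sketch under stated assumptions: `link_chain` is a hypothetical helper, tasks are plain strings as in the example output, and each task has at most one successor.

```python
def link_chain(items):
    """Hypothetical helper: order tasks by following task -> next_task links."""
    next_of = {it['task']: it['next_task'] for it in items
               if isinstance(it, dict) and it.get('task') and it.get('next_task')}
    successors = set(next_of.values())
    # A start task is one that never appears as another task's next_task.
    starts = [task for task in next_of if task not in successors]
    if not starts:
        return []  # empty input or a pure cycle
    chain, seen = [starts[0]], {starts[0]}
    # Follow links until the chain ends or would revisit a task.
    while chain[-1] in next_of and next_of[chain[-1]] not in seen:
        chain.append(next_of[chain[-1]])
        seen.add(chain[-1])
    return chain


items = [
    {'task': 'confirm date', 'next_task': 'filter trains', 'composed_task': '...'},
    {'task': 'get stations', 'next_task': 'confirm date', 'composed_task': '...'},
]
print(link_chain(items))
```

The `seen` set guards against a model output that contains a cycle, which would otherwise loop forever.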
Source code in lazyllm/tools/data/operators/tool_use_ops.py
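class ChainedLogicAssembler(ToolUseOps):
    """Tool-use data generation operator: sequential task generator.

Given a list of atomic tasks, generates successor relations between tasks together with the corresponding composed tasks, for building linear or dependency-ordered task chains.

Typical structure of the model's JSON output:

- items: a list whose entries each contain:
  - task: the current atomic task
  - next_task: the task that immediately follows it
  - composed_task: a description combining task and next_task

Args:
    model: LazyLLM model object (required).
    input_key (str): Name of the input field holding the atomic tasks. Defaults to 'atomic_tasks'.
    output_key (str): Name of the output field that receives the list of sequential tasks. Defaults to 'sequential_tasks'.
    system_prompt (str|None): Optional system prompt.
    **kwargs: Additional arguments passed through to the base operator.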


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import ChainedLogicAssembler

    atomic_tasks = [
        {'task': '获取出发地与目的地'},
        {'task': '确认出行日期'},
        {'task': '筛选符合条件的车次'},
    ]
    op = ChainedLogicAssembler(model=model, input_key='atomic_tasks', output_key='sequential_tasks')
    print(op({'atomic_tasks': atomic_tasks}))
    # {
    #   'atomic_tasks': [...],
    #   'sequential_tasks': [
    #     {'task': '获取出发地与目的地', 'next_task': '确认出行日期', 'composed_task': '先获取站点再确认日期'},
    #     {'task': '确认出行日期', 'next_task': '筛选符合条件的车次', 'composed_task': '在已知日期基础上筛选车次'},
    #     ...
    #   ]
    # }
    ```
    """
    def __init__(
        self, model=None, input_key='atomic_tasks', output_key='sequential_tasks', system_prompt=None, **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个任务编排助手。你的任务是根据原子任务集合,生成:\n'
            '1) 每个任务的后继任务(next_task)\n'
            '2) 由两者组合形成的组合任务(composed_task)\n'
            '只输出 JSON,不要输出任何额外文本。\n'
            'JSON 结构:\n'
            '{\n'
            '  "items": [\n'
            '    {"task": "...", "next_task": "...", "composed_task": "..."}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        tasks = data.get(self.input_key, None)
        if not tasks:
            data[self.output_key] = []
            return data
        tasks_text = json.dumps(tasks, ensure_ascii=False) if not isinstance(tasks, str) else tasks
        instruction = f'原子任务列表:\n{tasks_text}\n\n请生成后继与组合任务并输出 JSON。'
        parsed = self.model(instruction)
        items = parsed.get('items') if isinstance(parsed, dict) else None
        data[self.output_key] = items if isinstance(items, list) else (parsed if parsed else [])
        return data

ContextualBeacon

Bases: ToolUseOps

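Tool-use data generation operator: scenario extractor.

Extracts scenario information from a piece of dialogue text, for use in downstream task and tool-call data generation, and writes it to the output field as structured JSON.

Typical structure of the JSON output:

  • scene: one-sentence description of the scenario
  • domain: domain or topic
  • user_profile: the user's role or background (may be empty)
  • assistant_goal: the goal the assistant should accomplish
  • constraints: list of constraints
  • key_entities: list of key entities

Parameters:

  • model

    LazyLLM model object (required); it is reused via share() and wrapped with a JSON formatter.

  • input_key (str, default: 'content' ) –

    Name of the input field holding the dialogue content.

  • output_key (str, default: 'scenario' ) –

    Name of the output field that receives the scenario.

  • system_prompt (str | None, default: None ) –

    Optional custom system prompt; when omitted, the built-in Chinese prompt is used.

  • **kwargs

    Additional arguments passed through to the base operator (for example _max_workers or _save_data).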

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ContextualBeacon

op = ContextualBeacon(model=model, input_key='content', output_key='scenario')
item = {
    'content': 'User: 我想订一张从北京到上海的高铁票,下午出发最好。\nAssistant: 好的,请问具体日期?'
}
print(op(item))

# Output Example:
# {
#   'content': 'User: 我想订一张从北京到上海的高铁票,下午出发最好。\nAssistant: 好的,请问具体日期?',
#   'scenario': {
#     'scene': '用户咨询高铁购票服务',
#     'domain': '出行/购票',
#     'user_profile': '普通出行乘客',
#     'assistant_goal': '帮助用户完成车次与时间筛选并完成购票',
#     'constraints': ['出发地为北京', '目的地为上海', '尽量下午出发'],
#     'key_entities': ['北京', '上海', '高铁', '下午']
#   }
# }
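
Because the scenario is model-generated, a downstream step may want to check it against the documented structure before use. The check below is a sketch only: `scenario_problems` is a hypothetical helper, and the key list is taken from the structure described above.

```python
EXPECTED_KEYS = ('scene', 'domain', 'user_profile',
                 'assistant_goal', 'constraints', 'key_entities')
LIST_KEYS = ('constraints', 'key_entities')


def scenario_problems(scenario):
    """Hypothetical check: list what deviates from the documented structure."""
    # Per the source listing, forward stores None for empty input
    # and '' when the model returns None, so non-dict values can occur.
    if not isinstance(scenario, dict):
        return ['scenario is not a dict']
    problems = [f'missing key: {k}' for k in EXPECTED_KEYS if k not in scenario]
    problems += [f'{k} is not a list' for k in LIST_KEYS
                 if k in scenario and not isinstance(scenario[k], list)]
    return problems


good = {'scene': 's', 'domain': 'd', 'user_profile': '', 'assistant_goal': 'g',
        'constraints': ['c'], 'key_entities': ['e']}
print(scenario_problems(good))
print(scenario_problems({'scene': 's', 'constraints': 'c'}))
```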
Source code in lazyllm/tools/data/operators/tool_use_ops.py
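class ContextualBeacon(ToolUseOps):
    """Tool-use data generation operator: scenario extractor.

Extracts scenario information from a piece of dialogue text, for use in downstream task and tool-call data generation, and writes it to the output field as structured JSON.

Typical structure of the JSON output:

- scene: one-sentence description of the scenario
- domain: domain or topic
- user_profile: the user's role or background (may be empty)
- assistant_goal: the goal the assistant should accomplish
- constraints: list of constraints
- key_entities: list of key entities

Args:
    model: LazyLLM model object (required); it is reused via share() and wrapped with a JSON formatter.
    input_key (str): Name of the input field holding the dialogue content. Defaults to 'content'.
    output_key (str): Name of the output field that receives the scenario. Defaults to 'scenario'.
    system_prompt (str|None): Optional custom system prompt; when omitted, the built-in Chinese prompt is used.
    **kwargs: Additional arguments passed through to the base operator (for example _max_workers or _save_data).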


Examples:

    from lazyllm.tools.data.operators.tool_use_ops import ContextualBeacon

    op = ContextualBeacon(model=model, input_key='content', output_key='scenario')
    item = {
        'content': 'User: 我想订一张从北京到上海的高铁票,下午出发最好。\\nAssistant: 好的,请问具体日期?'
    }
    print(op(item))

    # Output Example:
    # {
    #   'content': 'User: 我想订一张从北京到上海的高铁票,下午出发最好。\\nAssistant: 好的,请问具体日期?',
    #   'scenario': {
    #     'scene': '用户咨询高铁购票服务',
    #     'domain': '出行/购票',
    #     'user_profile': '普通出行乘客',
    #     'assistant_goal': '帮助用户完成车次与时间筛选并完成购票',
    #     'constraints': ['出发地为北京', '目的地为上海', '尽量下午出发'],
    #     'key_entities': ['北京', '上海', '高铁', '下午']
    #   }
    # }
    """
    def __init__(self, model=None, input_key='content', output_key='scenario', system_prompt=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个对话场景分析助手。你的任务是从对话内容中提取可用于数据生成的场景信息。\n'
            '只输出 JSON,不要输出任何额外文本。\n'
            'JSON 结构:\n'
            '{\n'
            '  "scene": "一句话场景描述",\n'
            '  "domain": "领域/主题",\n'
            '  "user_profile": "用户角色/背景(可为空)",\n'
            '  "assistant_goal": "助手应完成的目标",\n'
            '  "constraints": ["约束1","约束2"],\n'
            '  "key_entities": ["关键实体1","关键实体2"]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        content = data.get(self.input_key, '')
        if not content:
            data[self.output_key] = None
            return data
        instruction = f'对话内容如下:\n{content}\n\n请提取场景信息并输出 JSON。'
        parsed = self.model(instruction)
        data[self.output_key] = parsed if parsed is not None else ''
        return data

DecompositionKernel

Bases: ToolUseOps

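Tool-use data generation operator: atomic task generator.

Given a single scenario, generates a list of fine-grained, single-goal atomic tasks for downstream task orchestration and tool design.

Typical structure of the model's JSON output:

  • tasks: a list of atomic tasks, each containing:
      • task: task description
      • input: task input (may be empty)
      • output: task output (may be empty)
      • constraints: list of related constraints

Parameters:

  • model

    LazyLLM model object (required).

  • input_key (str, default: 'scenario' ) –

    Name of the input field holding the scenario.

  • output_key (str, default: 'atomic_tasks' ) –

    Name of the output field that receives the list of atomic tasks.

  • n (int, default: 5 ) –

    Upper bound on the number of atomic tasks. It is passed to the model in the instruction and is not enforced on the returned list.

  • system_prompt (str | None, default: None ) –

    Optional system prompt; when omitted, the built-in prompt is used.

  • **kwargs

    Additional arguments passed through to the base operator.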

Examples:

from lazyllm.tools.data.operators.tool_use_ops import DecompositionKernel

scenario = {
    'scene': '用户咨询高铁购票服务',
    'assistant_goal': '帮助用户完成车次筛选并购票',
}
op = DecompositionKernel(model=model, input_key='scenario', output_key='atomic_tasks', n=4)
print(op({'scenario': scenario}))
# {
#   'scenario': {...},
#   'atomic_tasks': [
#     {'task': '获取用户出发地和目的地', 'input': '', 'output': '出发地与目的地', 'constraints': [...]},
#     {'task': '确认出行日期与大致时间', ...},
#     ...
#   ]
# }
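
The value stored under `output_key` is not always a list. The standalone sketch below reproduces the selection expression used by `forward` in the source listing, so the possible outcomes are easy to see; `pick_list` is a hypothetical name, and nothing here calls lazyllm.

```python
def pick_list(parsed, key='tasks'):
    # Same expression as in forward: prefer parsed[key] when it is a list,
    # otherwise fall back to the raw parsed value, or [] when it is falsy.
    value = parsed.get(key) if isinstance(parsed, dict) else None
    return value if isinstance(value, list) else (parsed if parsed else [])


print(pick_list({'tasks': [{'task': 'a'}]}))  # list found under the expected key
print(pick_list({'unexpected': 1}))           # dict without 'tasks' comes back as-is
print(pick_list(None))                        # nothing parsed
```

Downstream code that needs a list can therefore check `isinstance(value, list)` before iterating, and apply its own slice if the limit n has to be strict.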
Source code in lazyllm/tools/data/operators/tool_use_ops.py
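class DecompositionKernel(ToolUseOps):
    """Tool-use data generation operator: atomic task generator.

Given a single scenario, generates a list of fine-grained, single-goal atomic tasks for downstream task orchestration and tool design.

Typical structure of the model's JSON output:

- tasks: a list of atomic tasks, each containing:
  - task: task description
  - input: task input (may be empty)
  - output: task output (may be empty)
  - constraints: list of related constraints

Args:
    model: LazyLLM model object (required).
    input_key (str): Name of the input field holding the scenario. Defaults to 'scenario'.
    output_key (str): Name of the output field that receives the list of atomic tasks. Defaults to 'atomic_tasks'.
    n (int): Upper bound on the number of atomic tasks. Defaults to 5.
    system_prompt (str|None): Optional system prompt.
    **kwargs: Additional arguments passed through to the base operator.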


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import DecompositionKernel

    scenario = {
        'scene': '用户咨询高铁购票服务',
        'assistant_goal': '帮助用户完成车次筛选并购票',
    }
    op = DecompositionKernel(model=model, input_key='scenario', output_key='atomic_tasks', n=4)
    print(op({'scenario': scenario}))
    # {
    #   'scenario': {...},
    #   'atomic_tasks': [
    #     {'task': '获取用户出发地和目的地', 'input': '', 'output': '出发地与目的地', 'constraints': [...]},
    #     {'task': '确认出行日期与大致时间', ...},
    #     ...
    #   ]
    # }
    ```
    """
    def __init__(
        self, model=None, input_key='scenario', output_key='atomic_tasks', n=5, system_prompt=None, **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        self.n = n
        sys_prompt = system_prompt or (
            '你是一个任务分解助手。你的任务是根据给定场景,生成一组可执行的原子任务(粒度小、单目标)。\n'
            '只输出 JSON,不要输出任何额外文本。\n'
            'JSON 结构:\n'
            '{\n'
            '  "tasks": [\n'
            '    {"task": "任务描述", "input": "输入(可为空)", "output": "输出(可为空)", "constraints": ["..."]}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        scenario = data.get(self.input_key, None)
        if scenario is None or scenario == '':
            data[self.output_key] = []
            return data
        scenario_text = json.dumps(scenario, ensure_ascii=False) if not isinstance(scenario, str) else scenario
        instruction = f'场景:\n{scenario_text}\n\n请生成不超过 {self.n} 个原子任务并输出 JSON。'
        parsed = self.model(instruction)
        tasks = parsed.get('tasks') if isinstance(parsed, dict) else None
        data[self.output_key] = tasks if isinstance(tasks, list) else (parsed if parsed else [])
        return data

DialogueSimulator

Bases: ToolUseOps

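Tool-use data generation operator: multi-turn dialogue generator (with tool calls).

Given a composed task and the list of available functions, generates a multi-turn dialogue in JSON with three roles (User, Assistant, Tool), for building tool-call training data.

Typical structure of the JSON output:

  • messages: a list whose entries each contain:
      • role: 'user' | 'assistant' | 'tool'
      • content: text content
      • name: tool name (optional; only applies when role == 'tool')

Parameters:

  • model

    LazyLLM model object (required).

  • input_composition_key (str, default: 'composition_task' ) –

    Name of the input field holding the composed task.

  • input_functions_key (str, default: 'functions' ) –

    Name of the input field holding the function list.

  • output_key (str, default: 'conversation' ) –

    Name of the output field that receives the multi-turn dialogue.

  • n_turns (int, default: 6 ) –

    Desired number of turns; it is given to the model as a hint in the instruction.

  • system_prompt (str | None, default: None ) –

    Optional system prompt; when omitted, the built-in prompt is used.

  • **kwargs

    Additional arguments passed through to the base operator.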

Examples:

from lazyllm.tools.data.operators.tool_use_ops import DialogueSimulator

composition_task = '根据用户需求查询并推荐合适的高铁车次'
functions = [
    {
        'name': 'query_train_tickets',
        'description': '查询高铁车次',
        'args': [...],
        'returns': {...},
    }
]
op = DialogueSimulator(model=model,
                       input_composition_key='composition_task',
                       input_functions_key='functions',
                       output_key='conversation',
                       n_turns=6)
print(op({'composition_task': composition_task, 'functions': functions}))
# {
#   'composition_task': '根据用户需求查询并推荐合适的高铁车次',
#   'functions': [...],
#   'conversation': {
#     'messages': [
#       {'role': 'user', 'content': '我想订一张明天下午从北京到上海的高铁票'},
#       {'role': 'assistant', 'content': '好的,我先为您确认出发时间与车次。'},
#       {'role': 'tool', 'name': 'query_train_tickets', 'content': '{...工具返回...}'},
#       ...
#     ]
#   }
# }
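
A generated conversation can be checked against the documented message shape before it is used as training data. The validator below is a sketch: `conversation_problems` is a hypothetical helper, and it assumes tool names should match the `name` fields of the supplied function list.

```python
ALLOWED_ROLES = {'user', 'assistant', 'tool'}


def conversation_problems(conversation, functions=None):
    """Hypothetical check of a generated conversation against the documented shape."""
    messages = conversation.get('messages') if isinstance(conversation, dict) else None
    if not isinstance(messages, list):
        return ['no messages list']
    known = {f.get('name') for f in (functions or []) if isinstance(f, dict)}
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get('role') if isinstance(msg, dict) else None
        if role not in ALLOWED_ROLES:
            problems.append(f'message {i}: unexpected role {role!r}')
        elif role == 'tool' and msg.get('name') is not None and msg['name'] not in known:
            # name is optional on tool messages, so it is only checked when present.
            problems.append(f'message {i}: unknown tool {msg["name"]!r}')
    return problems


functions = [{'name': 'query_train_tickets'}]
conversation = {'messages': [
    {'role': 'user', 'content': 'hi'},
    {'role': 'tool', 'name': 'query_train_tickets', 'content': '{}'},
    {'role': 'tool', 'name': 'book_flight', 'content': '{}'},
]}
print(conversation_problems(conversation, functions))
```

Note that `forward` stores an empty list when the composed task is missing, which this check reports as having no messages list.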
Source code in lazyllm/tools/data/operators/tool_use_ops.py
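class DialogueSimulator(ToolUseOps):
    """Tool-use data generation operator: multi-turn dialogue generator (with tool calls).

Given a composed task and the list of available functions, generates a multi-turn dialogue in JSON with three roles (User, Assistant, Tool), for building tool-call training data.

Typical structure of the JSON output:

- messages: a list whose entries each contain:
  - role: 'user' | 'assistant' | 'tool'
  - content: text content
  - name: tool name (optional; only applies when role == 'tool')

Args:
    model: LazyLLM model object (required).
    input_composition_key (str): Name of the input field holding the composed task. Defaults to 'composition_task'.
    input_functions_key (str): Name of the input field holding the function list. Defaults to 'functions'.
    output_key (str): Name of the output field that receives the multi-turn dialogue. Defaults to 'conversation'.
    n_turns (int): Desired number of turns (given to the model as a hint). Defaults to 6.
    system_prompt (str|None): Optional system prompt.
    **kwargs: Additional arguments passed through to the base operator.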


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import DialogueSimulator

    composition_task = '根据用户需求查询并推荐合适的高铁车次'
    functions = [
        {
            'name': 'query_train_tickets',
            'description': '查询高铁车次',
            'args': [...],
            'returns': {...},
        }
    ]
    op = DialogueSimulator(model=model,
                           input_composition_key='composition_task',
                           input_functions_key='functions',
                           output_key='conversation',
                           n_turns=6)
    print(op({'composition_task': composition_task, 'functions': functions}))
    # {
    #   'composition_task': '根据用户需求查询并推荐合适的高铁车次',
    #   'functions': [...],
    #   'conversation': {
    #     'messages': [
    #       {'role': 'user', 'content': '我想订一张明天下午从北京到上海的高铁票'},
    #       {'role': 'assistant', 'content': '好的,我先为您确认出发时间与车次。'},
    #       {'role': 'tool', 'name': 'query_train_tickets', 'content': '{...工具返回...}'},
    #       ...
    #     ]
    #   }
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        input_composition_key='composition_task',
        input_functions_key='functions',
        output_key='conversation',
        n_turns=6,
        system_prompt=None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.task_key = input_composition_key
        self.functions_key = input_functions_key
        self.output_key = output_key
        self.n_turns = n_turns
        sys_prompt = system_prompt or (
            '你是一个多轮对话数据生成助手。你需要根据组合任务与可用函数,模拟一段多轮对话。\n'
            '对话由 User/Assistant/Tool 三种角色组成:\n'
            '- User 提出需求与补充信息\n'
            '- Assistant 规划并在适当时机调用 Tool\n'
            '- Tool 返回函数执行结果\n'
            '只输出 JSON,不要输出任何额外文本。\n'
            'JSON 结构:\n'
            '{\n'
            '  "messages": [\n'
            '    {"role":"user","content":"..."},\n'
            '    {"role":"assistant","content":"..."},\n'
            '    {"role":"tool","name":"function_name","content":"..."}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        task = data.get(self.task_key, None)
        functions = data.get(self.functions_key, None)
        if task is None or task == '':
            data[self.output_key] = []
            return data
        task_text = json.dumps(task, ensure_ascii=False) if not isinstance(task, str) else task
        functions_text = (json.dumps(functions, ensure_ascii=False) if functions is not None
                          and not isinstance(functions, str) else (functions or ''))
        instruction = (
            f'组合任务:\n{task_text}\n\n'
            f'函数列表:\n{functions_text}\n\n'
            f'请生成约 {self.n_turns} 轮对话的 messages 并输出 JSON。'
        )
        parsed = self.model(instruction)
        data[self.output_key] = parsed if parsed is not None else []
        return data

ProtocolSpecifier

Bases: ToolUseOps

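Tool-use data generation operator: function specification generator.

Given a composed task and its subtasks, generates a list of function specifications suitable for tool calling (function calling).

Typical structure of the model's JSON output:

  • functions: a list whose entries each contain:
      • name: function name
      • description: what the function does
      • args: list of parameters, each with name/type/description
      • returns: return type and description

Parameters:

  • model

    LazyLLM model object (required).

  • input_composition_key (str, default: 'composition_task' ) –

    Name of the input field holding the composed task.

  • input_atomic_key (str, default: 'atomic_tasks' ) –

    Name of the input field holding the atomic tasks.

  • output_key (str, default: 'functions' ) –

    Name of the output field that receives the list of function specifications.

  • system_prompt (str | None, default: None ) –

    Optional system prompt; when omitted, the built-in prompt is used.

  • **kwargs

    Additional arguments passed through to the base operator.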

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ProtocolSpecifier

composition_task = '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表'
atomic_tasks = [
    {'task': '获取出发地与目的地'},
    {'task': '确认出行日期'},
    {'task': '调用车次查询接口并过滤结果'},
]
op = ProtocolSpecifier(model=model,
                       input_composition_key='composition_task',
                       input_atomic_key='atomic_tasks',
                       output_key='functions')
print(op({'composition_task': composition_task, 'atomic_tasks': atomic_tasks}))
# {
#   'composition_task': '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表',
#   'atomic_tasks': [...],
#   'functions': [
#     {
#       'name': 'query_train_tickets',
#       'description': '根据出发地、目的地与日期查询高铁车次',
#       'args': [{'name': 'from_city', 'type': 'string', ...}, ...],
#       'returns': {'type': 'TrainList', 'description': '符合条件的车次列表'}
#     },
#     ...
#   ]
# }
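
For quick review of generated specifications, each entry can be rendered as a one-line signature. This is only an illustrative sketch: `render_signature` is a hypothetical helper, and the placeholder strings used for missing fields are assumptions.

```python
def render_signature(spec):
    """Hypothetical helper: render a function spec as 'name(arg: type, ...) -> type'."""
    args = ', '.join(f"{a.get('name', '?')}: {a.get('type', 'any')}"
                     for a in (spec.get('args') or []))
    returns = (spec.get('returns') or {}).get('type', 'any')
    return f"{spec.get('name', '<unnamed>')}({args}) -> {returns}"


spec = {
    'name': 'query_train_tickets',
    'description': 'query trains by origin, destination and date',
    'args': [{'name': 'from_city', 'type': 'string'}, {'name': 'date', 'type': 'string'}],
    'returns': {'type': 'TrainList', 'description': 'matching trains'},
}
print(render_signature(spec))
```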
Source code in lazyllm/tools/data/operators/tool_use_ops.py
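class ProtocolSpecifier(ToolUseOps):
    """Tool-use data generation operator: function specification generator.

Given a composed task and its subtasks, generates a list of function specifications suitable for tool calling (function calling).

Typical structure of the model's JSON output:

- functions: a list whose entries each contain:
  - name: function name
  - description: what the function does
  - args: list of parameters, each with name/type/description
  - returns: return type and description

Args:
    model: LazyLLM model object (required).
    input_composition_key (str): Name of the input field holding the composed task. Defaults to 'composition_task'.
    input_atomic_key (str): Name of the input field holding the atomic tasks. Defaults to 'atomic_tasks'.
    output_key (str): Name of the output field that receives the list of function specifications. Defaults to 'functions'.
    system_prompt (str|None): Optional system prompt.
    **kwargs: Additional arguments passed through to the base operator.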


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import ProtocolSpecifier

    composition_task = '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表'
    atomic_tasks = [
        {'task': '获取出发地与目的地'},
        {'task': '确认出行日期'},
        {'task': '调用车次查询接口并过滤结果'},
    ]
    op = ProtocolSpecifier(model=model,
                           input_composition_key='composition_task',
                           input_atomic_key='atomic_tasks',
                           output_key='functions')
    print(op({'composition_task': composition_task, 'atomic_tasks': atomic_tasks}))
    # {
    #   'composition_task': '根据用户出发地、目的地和日期查询可选高铁车次并返回候选列表',
    #   'atomic_tasks': [...],
    #   'functions': [
    #     {
    #       'name': 'query_train_tickets',
    #       'description': '根据出发地、目的地与日期查询高铁车次',
    #       'args': [{'name': 'from_city', 'type': 'string', ...}, ...],
    #       'returns': {'type': 'TrainList', 'description': '符合条件的车次列表'}
    #     },
    #     ...
    #   ]
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        input_composition_key='composition_task',
        input_atomic_key='atomic_tasks',
        output_key='functions',
        system_prompt=None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.task_key = input_composition_key
        self.subtask_key = input_atomic_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个函数设计助手。给定组合任务及其子任务,请生成一组函数规格,便于后续工具调用。\n'
            '只输出 JSON,不要输出任何额外文本。\n'
            'JSON 结构:\n'
            '{\n'
            '  "functions": [\n'
            '    {"name": "function_name", "description": "...", '
            '"args": [{"name":"...","type":"...","description":"..."}], '
            '"returns": {"type":"...","description":"..."}}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        task = data.get(self.task_key, None)
        subtasks = data.get(self.subtask_key, None)
        if task is None or task == '':
            data[self.output_key] = []
            return data
        task_text = json.dumps(task, ensure_ascii=False) if not isinstance(task, str) else task
        subtasks_text = (json.dumps(subtasks, ensure_ascii=False) if subtasks is not None
                         and not isinstance(subtasks, str) else (subtasks or ''))
        instruction = (
            f'组合任务:\n{task_text}\n\n'
            f'子任务(可选):\n{subtasks_text}\n\n'
            '请生成函数列表并输出 JSON。'
        )
        parsed = self.model(instruction)
        funcs = parsed.get('functions') if isinstance(parsed, dict) else None
        data[self.output_key] = funcs if isinstance(funcs, list) else (parsed if parsed else [])
        return data

ScenarioDiverger

Bases: ToolUseOps

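Tool-use data generation operator: scenario expander.

Starting from an existing base scenario, generates several alternative scenarios that are semantically related but differ in detail, to increase data diversity.

Typical structure of the model's JSON output:

  • scenarios: a list of scenarios, each a dict with fields such as scene/domain/assistant_goal/constraints/key_entities.

Parameters:

  • model

    LazyLLM model object (required).

  • input_key (str, default: 'scenario' ) –

    Name of the input field holding the scenario; the value may be a dict or a str.

  • output_key (str, default: 'expanded_scenarios' ) –

    Name of the output field that receives the list of expanded scenarios.

  • n (int, default: 3 ) –

    Number of alternative scenarios to request. It is passed to the model in the instruction and is not enforced on the returned list.

  • system_prompt (str | None, default: None ) –

    Optional system prompt; when omitted, the built-in prompt is used.

  • **kwargs

    Additional arguments passed through to the base operator.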

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ScenarioDiverger

base = {
    'scene': '用户咨询高铁购票服务',
    'domain': '出行/购票',
    'assistant_goal': '帮助用户完成车次筛选并购票',
}
op = ScenarioDiverger(model=model, input_key='scenario', output_key='expanded_scenarios', n=3)
print(op({'scenario': base}))
# {
#   'scenario': {...},
#   'expanded_scenarios': [
#     {'scene': '用户预订跨城商务出差火车票', ...},
#     {'scene': '用户为家人购买回乡火车票', ...},
#     ...
#   ]
# }
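
To feed each expanded scenario into an operator that reads a single `scenario` field, the list can be fanned out into one record per scenario. The sketch below assumes that layout; `fan_out` is a hypothetical helper and not part of lazyllm.

```python
def fan_out(record, source_key='expanded_scenarios', target_key='scenario'):
    """Hypothetical helper: one output record per expanded scenario."""
    scenarios = record.get(source_key)
    if not isinstance(scenarios, list):
        return []
    # Carry over every other field, then overwrite the scenario slot.
    base = {k: v for k, v in record.items() if k != source_key}
    return [{**base, target_key: s} for s in scenarios]


record = {
    'id': 7,
    'scenario': {'scene': 'base'},
    'expanded_scenarios': [{'scene': 'variant 1'}, {'scene': 'variant 2'}],
}
for item in fan_out(record):
    print(item)
```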
Source code in lazyllm/tools/data/operators/tool_use_ops.py
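class ScenarioDiverger(ToolUseOps):
    """Tool-use data generation operator: scenario expander.

Starting from an existing base scenario, generates several alternative scenarios that are semantically related but differ in detail, to increase data diversity.

Typical structure of the model's JSON output:

- scenarios: a list of scenarios, each a dict with fields such as scene/domain/assistant_goal/constraints/key_entities.

Args:
    model: LazyLLM model object (required).
    input_key (str): Name of the input field holding the scenario (dict or str). Defaults to 'scenario'.
    output_key (str): Name of the output field that receives the list of expanded scenarios. Defaults to 'expanded_scenarios'.
    n (int): Number of alternative scenarios to request. Defaults to 3.
    system_prompt (str|None): Optional system prompt.
    **kwargs: Additional arguments passed through to the base operator.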


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import ScenarioDiverger

    base = {
        'scene': '用户咨询高铁购票服务',
        'domain': '出行/购票',
        'assistant_goal': '帮助用户完成车次筛选并购票',
    }
    op = ScenarioDiverger(model=model, input_key='scenario', output_key='expanded_scenarios', n=3)
    print(op({'scenario': base}))
    # {
    #   'scenario': {...},
    #   'expanded_scenarios': [
    #     {'scene': '用户预订跨城商务出差火车票', ...},
    #     {'scene': '用户为家人购买回乡火车票', ...},
    #     ...
    #   ]
    # }
    ```
    """
    def __init__(
        self, model=None, input_key='scenario', output_key='expanded_scenarios', n=3, system_prompt=None, **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        self.n = n
        sys_prompt = system_prompt or (
            '你是一个场景扩展助手。你的任务是基于给定的原始场景,生成多个可替代的新场景,语义相关但细节不同。\n'
            '只输出 JSON,不要输出任何额外文本。\n'
            'JSON 结构:\n'
            '{\n'
            '  "scenarios": [\n'
            '    {"scene": "...", "domain": "...", "assistant_goal": "...", "constraints": ["..."], '
            '"key_entities": ["..."]}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        base = data.get(self.input_key, None)
        if base is None or base == '':
            data[self.output_key] = []
            return data
        base_text = json.dumps(base, ensure_ascii=False) if not isinstance(base, str) else base
        instruction = f'原始场景:\n{base_text}\n\n请生成 {self.n} 个替代场景并输出 JSON。'
        parsed = self.model(instruction)
        scenarios = parsed.get('scenarios') if isinstance(parsed, dict) else None
        data[self.output_key] = scenarios if isinstance(scenarios, list) else (parsed if parsed else [])
        return data

TopologyArchitect

Bases: ToolUseOps

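Tool-use data generation operator: parallel/sequential/hybrid task composition generator.

Given a list of atomic tasks, generates three kinds of task compositions:

  • parallel_tasks: compositions of tasks that can run in parallel
  • sequential_tasks: compositions with an explicit ordering dependency
  • hybrid_tasks: compositions that mix parallel and sequential relations

Parameters:

  • model

    LazyLLM model object (required).

  • input_key (str, default: 'atomic_tasks' ) –

    Name of the input field holding the atomic tasks.

  • output_key (str, default: 'para_seq_tasks' ) –

    Name of the output field that receives the task compositions.

  • system_prompt (str | None, default: None ) –

    Optional system prompt; when omitted, the built-in prompt is used.

  • **kwargs

    Additional arguments passed through to the base operator.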

Examples:

from lazyllm.tools.data.operators.tool_use_ops import TopologyArchitect

atomic_tasks = [
    {'task': '收集出行需求'},
    {'task': '查询可选车次'},
    {'task': '对比价格与时间'},
    {'task': '完成下单支付'},
]
op = TopologyArchitect(model=model, input_key='atomic_tasks', output_key='para_seq_tasks')
print(op({'atomic_tasks': atomic_tasks}))
# {
#   'atomic_tasks': [...],
#   'para_seq_tasks': {
#     'parallel_tasks': ['同时查询不同日期/车次方案', ...],
#     'sequential_tasks': ['先确认日期再选车次', ...],
#     'hybrid_tasks': ['并行对比多个方案后统一决策并下单', ...]
#   }
# }
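
According to the source listing, `forward` stores the parsed model output unchanged whenever it is not None, so one of the three keys may be missing or hold a non-list value. The sketch below shows one way a consumer could normalize the result; `normalize_topology` is a hypothetical helper.

```python
TOPOLOGY_KEYS = ('parallel_tasks', 'sequential_tasks', 'hybrid_tasks')


def normalize_topology(value):
    """Hypothetical helper: guarantee all three keys exist and hold lists."""
    source = value if isinstance(value, dict) else {}
    return {k: (source[k] if isinstance(source.get(k), list) else [])
            for k in TOPOLOGY_KEYS}


print(normalize_topology({'parallel_tasks': ['compare plans'], 'hybrid_tasks': 'oops'}))
print(normalize_topology('unparsed text'))
```

Extra keys returned by the model are dropped by this sketch; a caller who wants to keep them would need to copy them over explicitly.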
Source code in lazyllm/tools/data/operators/tool_use_ops.py
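class TopologyArchitect(ToolUseOps):
    """Tool-use data generation operator: parallel/sequential/hybrid task composition generator.

Given a list of atomic tasks, generates three kinds of task compositions:

- parallel_tasks: compositions of tasks that can run in parallel
- sequential_tasks: compositions with an explicit ordering dependency
- hybrid_tasks: compositions that mix parallel and sequential relations

Args:
    model: LazyLLM model object (required).
    input_key (str): Name of the input field holding the atomic tasks. Defaults to 'atomic_tasks'.
    output_key (str): Name of the output field that receives the task compositions. Defaults to 'para_seq_tasks'.
    system_prompt (str|None): Optional system prompt.
    **kwargs: Additional arguments passed through to the base operator.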


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import TopologyArchitect

    atomic_tasks = [
        {'task': '收集出行需求'},
        {'task': '查询可选车次'},
        {'task': '对比价格与时间'},
        {'task': '完成下单支付'},
    ]
    op = TopologyArchitect(model=model, input_key='atomic_tasks', output_key='para_seq_tasks')
    print(op({'atomic_tasks': atomic_tasks}))
    # {
    #   'atomic_tasks': [...],
    #   'para_seq_tasks': {
    #     'parallel_tasks': ['同时查询不同日期/车次方案', ...],
    #     'sequential_tasks': ['先确认日期再选车次', ...],
    #     'hybrid_tasks': ['并行对比多个方案后统一决策并下单', ...]
    #   }
    # }
    ```
    """
    def __init__(
        self, model=None, input_key='atomic_tasks', output_key='para_seq_tasks', system_prompt=None, **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个任务组合生成助手。你的任务是基于原子任务生成三类任务:\n'
            '1) 并行任务(parallel_tasks):可以同时进行的任务组合\n'
            '2) 后继任务(sequential_tasks):有明确先后依赖的任务组合\n'
            '3) 组合任务(hybrid_tasks):包含并行与先后依赖的混合组合\n'
            '只输出 JSON,不要输出任何额外文本。\n'
            'JSON 结构:\n'
            '{\n'
            '  "parallel_tasks": ["..."],\n'
            '  "sequential_tasks": ["..."],\n'
            '  "hybrid_tasks": ["..."]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        tasks = data.get(self.input_key, None)
        if not tasks:
            data[self.output_key] = {'parallel_tasks': [], 'sequential_tasks': [], 'hybrid_tasks': []}
            return data
        tasks_text = json.dumps(tasks, ensure_ascii=False) if not isinstance(tasks, str) else tasks
        instruction = f'原子任务列表:\n{tasks_text}\n\n请生成三类任务并输出 JSON。'
        parsed = self.model(instruction)
        default_val = {'parallel_tasks': [], 'sequential_tasks': [], 'hybrid_tasks': []}
        data[self.output_key] = parsed if parsed is not None else default_val
        return data

ViabilitySieve

Bases: ToolUseOps

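Tool-use data generation operator: feasibility filter for composed tasks.

Reviews a set of composed tasks for feasibility and completeness, and keeps the ones judged reasonable and workable.

Intermediate JSON structure expected from the model:

  • items: a list whose entries carry composed_task, is_valid, reason, and similar fields.

The output keeps only the composed_task values of entries whose is_valid is true. If no entry qualifies, the operator falls back to the raw items list when the model returned one, and otherwise to the raw parsed result (or an empty list if that is empty too). An empty or missing input list yields an empty output list.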
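
The filtering and fallback rule can be written as a standalone function. This is a sketch that follows the logic in the source listing; `select_viable` is a hypothetical name, and `parsed` stands for whatever the JSON-formatted model returned.

```python
def select_viable(parsed):
    """Return the accepted composed tasks, or the documented fallbacks."""
    items = parsed.get('items') if isinstance(parsed, dict) else None
    valid = []
    if isinstance(items, list):
        for it in items:
            # is_valid must be the boolean True; a string such as 'true' is rejected.
            if isinstance(it, dict) and it.get('is_valid') is True and it.get('composed_task'):
                valid.append(it['composed_task'])
    if valid:
        return valid
    if isinstance(items, list):
        return items          # nothing passed: hand back the raw review items
    return parsed if parsed else []  # malformed output: raw result, or an empty list


reviewed = {'items': [
    {'composed_task': 'get origin and destination, then filter trains', 'is_valid': True, 'reason': 'complete'},
    {'composed_task': 'recommend a random train', 'is_valid': False, 'reason': 'ignores constraints'},
]}
accepted = select_viable(reviewed)
```

Note that the fallback means the output field is not guaranteed to be a list of strings: downstream code may receive review dictionaries or raw model output.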

Parameters:

  • model

    LazyLLM model object (required).

  • input_composition_key (str, default: 'composition_tasks' ) –

    Name of the input field that holds the composed tasks.

  • input_atomic_key (str, default: 'atomic_tasks' ) –

    Name of the input field that holds the atomic tasks; the field itself is optional.

  • output_key (str, default: 'filtered_composition_tasks' ) –

    Name of the output field that receives the filtered composed tasks.

  • system_prompt (str | None, default: None ) –

    Optional system prompt.

  • **kwargs

    Additional arguments passed to the base operator.

Examples:

from lazyllm.tools.data.operators.tool_use_ops import ViabilitySieve

composition_tasks = ['先获取出发地和目的地再筛选车次', '直接随机推荐一个车次']
atomic_tasks = [
    {'task': '获取出发地与目的地'}, {'task': '确认出行日期'}, {'task': '筛选符合条件的车次'}
]
op = ViabilitySieve(model=model,
                    input_composition_key='composition_tasks',
                    input_atomic_key='atomic_tasks',
                    output_key='filtered_composition_tasks')
print(op({'composition_tasks': composition_tasks, 'atomic_tasks': atomic_tasks}))
# {
#   'composition_tasks': [...],
#   'atomic_tasks': [...],
#   'filtered_composition_tasks': ['先获取出发地和目的地再筛选车次', ...]
# }
Source code in lazyllm/tools/data/operators/tool_use_ops.py
class ViabilitySieve(ToolUseOps):
    """工具调用数据生成算子:组合任务可行性过滤器。

对一组“组合任务”进行可运行性与完备性评审,筛选出被认为合理可行的组合任务列表。

模型内部期望的中间 JSON 结构:

- items: 列表,每项包含 composed_task、is_valid、reason 等字段。

在算子输出中,仅保留 is_valid 为 true 且含有 composed_task 的项;如果模型未按预期输出,则尽量回退返回原 items 或原始 parsed 结果。

Args:
    model: LazyLLM 模型对象(必需)。
    input_composition_key (str): 输入组合任务字段名,默认 'composition_tasks'。
    input_atomic_key (str): 输入原子任务字段名(可选),默认 'atomic_tasks'。
    output_key (str): 输出过滤后组合任务字段名,默认 'filtered_composition_tasks'。
    system_prompt (str|None): 可选系统提示词。
    **kwargs: 传递给基类算子的其它参数。


Examples:
    ```python
    from lazyllm.tools.data.operators.tool_use_ops import ViabilitySieve

    composition_tasks = ['先获取出发地和目的地再筛选车次', '直接随机推荐一个车次']
    atomic_tasks = [
        {'task': '获取出发地与目的地'}, {'task': '确认出行日期'}, {'task': '筛选符合条件的车次'}
    ]
    op = ViabilitySieve(model=model,
                               input_composition_key='composition_tasks',
                               input_atomic_key='atomic_tasks',
                               output_key='filtered_composition_tasks')
    print(op({'composition_tasks': composition_tasks, 'atomic_tasks': atomic_tasks}))
    # {
    #   'composition_tasks': [...],
    #   'atomic_tasks': [...],
    #   'filtered_composition_tasks': ['先获取出发地和目的地再筛选车次', ...]
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        input_composition_key='composition_tasks',
        input_atomic_key='atomic_tasks',
        output_key='filtered_composition_tasks',
        system_prompt=None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.composition_key = input_composition_key
        self.subtask_key = input_atomic_key
        self.output_key = output_key
        sys_prompt = system_prompt or (
            '你是一个任务可运行性评审助手。你需要判断组合任务是否具备可行性与完备性:\n'
            '- 可行性:子任务是否能支撑组合任务目标\n'
            '- 完备性:是否缺少关键步骤或前置条件\n'
            '只输出 JSON,不要输出任何额外文本。\n'
            'JSON 结构:\n'
            '{\n'
            '  "items": [\n'
            '    {"composed_task": "...", "is_valid": true, "reason": "..."}\n'
            '  ]\n'
            '}\n'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        composition_tasks = data.get(self.composition_key, None)
        subtasks = data.get(self.subtask_key, None)
        if not composition_tasks:
            data[self.output_key] = []
            return data
        composition_text = (json.dumps(composition_tasks, ensure_ascii=False)
                            if not isinstance(composition_tasks, str) else composition_tasks)
        subtasks_text = (json.dumps(subtasks, ensure_ascii=False) if subtasks is not None
                         and not isinstance(subtasks, str) else (subtasks or ''))
        instruction = (
            f'组合任务:\n{composition_text}\n\n'
            f'子任务(可选):\n{subtasks_text}\n\n'
            '请逐条判断并输出 JSON。'
        )
        parsed = self.model(instruction)
        items = parsed.get('items') if isinstance(parsed, dict) else None
        valid = []
        if isinstance(items, list):
            for it in items:
                if isinstance(it, dict) and it.get('is_valid') is True and it.get('composed_task'):
                    valid.append(it.get('composed_task'))
        data[self.output_key] = valid if valid else (
            items if isinstance(items, list) else (parsed if parsed else [])
        )
        return data

Text2SQL data processing operators

lazyllm.tools.data.operators.text2sql_ops

SQLConsensusUnifier

Bases: Text2SQLOps

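Text2SQL data processing operator: vote-based selector over CoT traces.

Parses and executes the SQL in each CoT trace (cot_responses), then picks the "best" CoT and its SQL based on the consistency and correctness of the execution results.

Behavior:

  • Parses the SQL from each CoT (same parsing logic as SQLForge).
  • Calls database_manager.batch_execute_queries to run the SQL, recording a result signature and a success flag for each candidate.
  • Selects the final CoT with the voting strategy _vote_select, then:
      • sets output_cot_key (default 'cot_reasoning') to the best CoT text;
      • overwrites the 'SQL' field of the data with the best SQL.
  • Sets output_cot_key to an empty string when cot_responses is empty, db_id is missing, or no SQL can be parsed.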
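
The body of `_vote_select` is not shown on this page, so the following is only one plausible reading of "voting with a shortest-SQL tie-breaker", not the library's implementation. It assumes that failed candidates are discarded, that the most frequent result signature wins, and that ties inside the winning group go to the shortest SQL string. `vote_select` and the sample signatures are hypothetical.

```python
from collections import Counter


def vote_select(candidates, tie_breaker='shortest_sql'):
    # Assumption: only candidates whose query executed successfully may vote.
    valid = [c for c in candidates if c.get('is_valid')]
    if not valid:
        return None
    counts = Counter(c['signature'] for c in valid)
    top = max(counts.values())
    winners = [c for c in valid if counts[c['signature']] == top]
    if tie_breaker == 'shortest_sql':
        return min(winners, key=lambda c: len(c['sql']))
    return winners[0]


candidates = [
    {'cot': 'a', 'sql': "SELECT count(*) FROM orders WHERE status = 'paid'",
     'signature': 'rows=3', 'is_valid': True},
    {'cot': 'b', 'sql': "SELECT count(id) FROM orders WHERE status = 'paid'",
     'signature': 'rows=3', 'is_valid': True},
    {'cot': 'c', 'sql': 'SELECT count(*) FROM orders',
     'signature': 'rows=7', 'is_valid': True},
    {'cot': 'd', 'sql': 'SELECT nope', 'signature': None, 'is_valid': False},
]
best = vote_select(candidates)
```

Here two candidates agree on the result, so that group wins even though the third query is shorter overall.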

Parameters:

  • database_manager

    Manager that executes SQL (required). It must implement batch_execute_queries(list[(db_id, sql)]).

  • **kwargs

    Additional arguments passed to the base operator.

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLConsensusUnifier

op = SQLConsensusUnifier(database_manager=database_manager)
item = {
    'db_id': 'db_1',
    'cot_responses': [
        "...CoT + ```sql SELECT count(*) FROM orders WHERE status = 'paid'```",
        '...CoT + ```sql SELECT count(*) FROM orders```',
    ]
}
res = op(item)
print(res['cot_reasoning'][:200])
print(res['SQL'])
# "...首先识别需要统计已支付订单数量,其次在 orders 表中过滤 status = 'paid' ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
# "SELECT count(*) FROM orders WHERE status = 'paid';"
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class SQLConsensusUnifier(Text2SQLOps):
    """Text2SQL 数据处理算子:CoT 轨迹投票选择器。

对一组 CoT 轨迹(cot_responses)进行 SQL 解析与执行,基于执行结果的一致性与正确性,从中选出“最佳” CoT 及对应 SQL。

行为:

- 从每条 CoT 中解析 SQL(使用与 SQLForge 相同的解析逻辑)。
- 调用 database_manager.batch_execute_queries 执行 SQL,计算结果 signature 与 success。
- 使用投票策略 _vote_select 选择最终 CoT,并将:
  - output_cot_key(默认 'cot_reasoning')设置为最佳 CoT 文本;
  - 同时覆盖数据中的 'SQL' 字段为最佳 SQL。

Args:
    database_manager: 提供 SQL 执行能力的管理器(必需),需实现:
        - batch_execute_queries(list[(db_id, sql)])
    **kwargs: 其它传递给基类算子的参数。


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLConsensusUnifier

    op = SQLConsensusUnifier(database_manager=database_manager)
    item = {
        'db_id': 'db_1',
        'cot_responses': [
            '...CoT + ```sql SELECT count(*) FROM orders WHERE status = \'paid\'```',
            '...CoT + ```sql SELECT count(*) FROM orders```',
        ]
    }
    res = op(item)
    print(res['cot_reasoning'][:200])
    print(res['SQL'])
    # "...首先识别需要统计已支付订单数量,其次在 orders 表中过滤 status = 'paid' ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
    # "SELECT count(*) FROM orders WHERE status = 'paid';"
    ```
    """
    def __init__(self, database_manager=None, **kwargs):
        super().__init__(**kwargs)
        self.database_manager = database_manager
        self.tie_breaker = 'shortest_sql'

    def forward(
        self,
        data,
        input_cot_responses_key='cot_responses',
        input_db_id_key='db_id',
        output_cot_key='cot_reasoning',
        output_sql_key='SQL',
        **kwargs,
    ):
        assert isinstance(data, dict)
        if self.database_manager is None:
            raise ValueError('database_manager is required')

        cot_responses = data.get(input_cot_responses_key, [])
        if not isinstance(cot_responses, list) or not cot_responses:
            data[output_cot_key] = ''
            return data

        db_id = data.get(input_db_id_key)
        if not db_id:
            data[output_cot_key] = ''
            return data

        candidates = []
        queries = []
        for resp in cot_responses:
            sql = _parse_sql_response(resp)
            if sql:
                queries.append((str(db_id).strip(), sql))
                candidates.append({'cot': resp, 'sql': sql})

        if not queries:
            data[output_cot_key] = ''
            return data

        try:
            query_results = self.database_manager.batch_execute_queries(queries)
            for cand, result in zip(candidates, query_results):
                cand['signature'] = _result_to_signature(result)
                cand['is_valid'] = result.success if hasattr(result, 'success') else False
        except Exception as e:
            LOG.error(f'Failed to execute queries for voting: {e}')

        best = _vote_select(candidates, self.tie_breaker)
        if best:
            data[output_cot_key] = best.get('cot', '')
            data[output_sql_key] = best.get('sql', '')
        else:
            data[output_cot_key] = ''

        return data

SQLContextAssembler

Bases: Text2SQLOps

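Text2SQL data generation operator: prompt builder.

Builds the input prompt for a downstream Text2SQL model from the database schema, the natural-language question (read from the intent field by default), and the evidence.

Behavior:

  • Prefers database_manager.get_db_details(db_id) for the schema text; if the manager has no such method, falls back to get_create_statements_and_insert_statements.
  • Uses prompt_template when it exposes build_prompt; otherwise uses a simple built-in English template.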
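
A custom template only needs a `build_prompt` method that accepts the four keyword arguments the operator passes. The sketch below is illustrative: `CompactTemplate` and `render` are hypothetical names, and `default_prompt` reproduces the layout of the built-in fallback shown in the source listing so the two can be compared without a database.

```python
class CompactTemplate:
    """Hypothetical template; only the build_prompt signature comes from the source."""

    def build_prompt(self, db_details, intent, evidence, db_engine):
        lines = [f'-- Engine: {db_engine}', db_details, f'Task: {intent}']
        if evidence:
            lines.append(f'Hint: {evidence}')
        lines.append('Answer with a single SQL statement.')
        return '\n'.join(lines)


def default_prompt(db_details, intent, evidence, db_engine):
    # Same layout as the operator's built-in English template.
    return (
        f'Database Schema:\n{db_details}\n\n'
        f'Intent: {intent}\n'
        f'Evidence: {evidence}\n'
        f'Generate a SQL query for {db_engine}.'
    )


def render(template, **fields):
    # Mirrors the dispatch rule: use the template only if it has build_prompt.
    if template is not None and hasattr(template, 'build_prompt'):
        return str(template.build_prompt(**fields))
    return default_prompt(**fields)


fields = dict(
    db_details='CREATE TABLE orders (id INT, status TEXT);',
    intent='How many paid orders are there?',
    evidence='status marks the order state.',
    db_engine='sqlite',
)
default_text = render(None, **fields)
custom_text = render(CompactTemplate(), **fields)
```

In real use an instance would be passed as `prompt_template=` when constructing the operator.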

Parameters:

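  • database_manager

    Manager that supplies the schema (required). It should implement get_db_details(db_id) (optional) and get_create_statements_and_insert_statements(db_id).

  • prompt_template

    Optional custom prompt builder exposing build_prompt(db_details, intent, evidence, db_engine).

  • **kwargs

    Additional arguments passed to the base operator.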

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLContextAssembler

op = SQLContextAssembler(database_manager=database_manager)
item = {
    'db_id': 'db_1',
    'intent': '有多少已支付的订单?',
    'evidence': '订单表中 status 字段标记订单状态。'
}
res = op(item)
print(res['prompt'])
# Database Schema:
# CREATE TABLE orders (id INT, status TEXT, ...);
# ...
#
# Intent: 有多少已支付的订单?
# Evidence: 订单表中 status 字段标记订单状态。
# Generate a SQL query for postgres.
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class SQLContextAssembler(Text2SQLOps):
    """Text2SQL 数据生成算子:Prompt 构造器。

根据数据库 Schema、自然语言问题与证据,构造下游 Text2SQL 模型的输入提示词(prompt)。

行为:

- 优先调用 database_manager.get_db_details(db_id) 获取 Schema 文本;若不存在则回退到 get_create_statements_and_insert_statements。
- 支持自定义 prompt_template;否则使用简单英文模板。

Args:
    database_manager: 提供 Schema 的管理器(必需),需实现:
        - get_db_details(db_id)(可选)
        - get_create_statements_and_insert_statements(db_id)
    prompt_template: 可选,自定义 prompt 构造器。
    **kwargs: 其它传递给基类算子的参数。


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLContextAssembler

    op = SQLContextAssembler(database_manager=database_manager)
    item = {
        'db_id': 'db_1',
        'intent': '有多少已支付的订单?',
        'evidence': '订单表中 status 字段标记订单状态。'
    }
    res = op(item)
    print(res['prompt'])
    # Database Schema:
    # CREATE TABLE orders (id INT, status TEXT, ...);
    # ...
    #
    # Intent: 有多少已支付的订单?
    # Evidence: 订单表中 status 字段标记订单状态。
    # Generate a SQL query for postgres.
    ```
    """
    def __init__(self, database_manager=None, prompt_template=None, **kwargs):
        super().__init__(**kwargs)
        self.database_manager = database_manager
        self.prompt_template = prompt_template

    def get_create_statements_and_insert_statements(self, db_id):
        return self.database_manager.get_create_statements_and_insert_statements(db_id)

    def _build_prompt(self, db_details, intent, evidence, db_engine):
        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            return str(template.build_prompt(
                db_details=db_details,
                intent=intent,
                evidence=evidence,
                db_engine=db_engine
            ))

        return (
            f'Database Schema:\n{db_details}\n\n'
            f'Intent: {intent}\n'
            f'Evidence: {evidence}\n'
            f'Generate a SQL query for {db_engine}.'
        )

    def forward(self, data, input_intent_key='intent', input_db_id_key='db_id',
                input_evidence_key='evidence', output_prompt_key='prompt', **kwargs):
        assert isinstance(data, dict)
        if self.database_manager is None:
            raise ValueError('database_manager is required')

        db_engine = getattr(self.database_manager, 'db_type', 'unknown')
        db_id = data.get(input_db_id_key)
        intent = data.get(input_intent_key)
        evidence = data.get(input_evidence_key, '')

        if not db_id or not intent:
            LOG.warning(f'Missing db_id or intent for item: {data}')
            data[output_prompt_key] = ''
            return data

        db_id = str(db_id).replace('\n', '').replace('\r', '').strip()

        try:
            if hasattr(self.database_manager, 'get_db_details'):
                db_details = self.database_manager.get_db_details(db_id)
            else:
                create_statements, _ = self.database_manager.get_create_statements_and_insert_statements(db_id)
                db_details = '\n\n'.join([str(s) for s in create_statements])

            prompt = self._build_prompt(
                db_details=db_details,
                intent=intent,
                evidence=evidence,
                db_engine=db_engine
            )
            data[output_prompt_key] = prompt
        except Exception as e:
            LOG.error(f'Failed to generate context for db_id={db_id}: {e}')
            data[output_prompt_key] = ''

        return data

SQLEffortRanker

Bases: Text2SQLOps

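Text2SQL data classification operator: SQL execution difficulty classifier.

Starting from a prompt produced by SQLContextAssembler, the operator samples SQL from the model several times, compares each result with the gold SQL by executing both on the database, and classifies the sample by how many generations match the gold result.

Main flow:

  1. Call the model num_generations times with the input prompt and parse the SQL text out of each response.
  2. Pair each predicted SQL with the gold SQL as (db_id, predicted_sql, gold_sql) and pass the pairs to database_manager.batch_compare_queries.
  3. Map the match count cnt_true onto difficulty_labels using difficulty_thresholds: the first threshold with cnt_true <= threshold selects the label, and counts above the last threshold get the final label. With the defaults, 0 to 2 matches is extra, 3 to 5 is hard, 6 to 9 is medium, and 10 or more is easy. A failed comparison yields 'gold error'; a missing prompt, gold SQL, or db_id yields 'unknown'.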
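
The threshold mapping and the automatic adjustment of num_generations can be checked in isolation. The functions below restate the logic from the source listing as free functions; `adjust_num_generations` is a hypothetical name, and the length check lives in the constructor in the real class rather than in the classifier.

```python
def adjust_num_generations(num_generations, thresholds):
    # If sampling could never exceed the last threshold, raise the count
    # to the next multiple of 5 above that threshold.
    if num_generations <= thresholds[-1]:
        return (thresholds[-1] // 5 + 1) * 5
    return num_generations


def classify_difficulty(cnt_true, thresholds=(2, 5, 9),
                        labels=('extra', 'hard', 'medium', 'easy')):
    if len(thresholds) != len(labels) - 1:
        raise ValueError('Thresholds and labels configuration mismatch')
    if cnt_true == -1:
        return 'gold error'
    for threshold, label in zip(thresholds, labels):
        if cnt_true <= threshold:
            return label
    return labels[-1]


# Label for every possible match count when sampling 10 times.
table = {n: classify_difficulty(n) for n in range(0, 11)}
```

With the default thresholds a sample is only labelled easy when all 10 generations match, which is why num_generations must exceed the last threshold.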

Parameters:

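  • model

    LazyLLM model object (required).

  • database_manager

    Manager that provides batch_compare_queries (required).

  • num_generations (int, default: 10 ) –

    Number of SQL samples to generate. If it is less than or equal to the largest threshold, it is raised to the next multiple of 5 above that threshold and a warning is logged.

  • difficulty_thresholds (list[int] | None, default: None ) –

    List of difficulty thresholds; defaults to [2, 5, 9].

  • difficulty_labels (list[str] | None, default: None ) –

    List of difficulty labels; defaults to ['extra', 'hard', 'medium', 'easy']. It must contain exactly one more entry than difficulty_thresholds, otherwise a ValueError is raised.

  • system_prompt (str | None, default: None ) –

    Optional system prompt.

  • **kwargs

    Additional arguments passed to the base operator.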

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLEffortRanker

op = SQLEffortRanker(model=model, database_manager=database_manager, num_generations=15)
item = {
    'db_id': 'db_1',
    'prompt': 'Database Schema: ... Intent: 有多少已支付的订单?',
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"
}
res = op(item)
print(res)
# {
#   'db_id': 'db_1',
#   'prompt': 'Database Schema: ... Intent: 有多少已支付的订单?',
#   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#   'sql_execution_difficulty': 'medium'
# }
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class SQLEffortRanker(Text2SQLOps):
    """Text2SQL 数据分类算子:SQL 执行难度分类器。

基于 SQLContextAssembler 生成的 prompt、多次采样生成 SQL 并与金标 SQL 在数据库上对比执行结果,从“可被模型正确生成的次数”角度对样本执行难度进行分类。

主要流程:

1. 使用输入的 prompt,重复调用模型生成 num_generations 条 SQL,并解析出 SQL 文本。
2. 对每条 SQL 与金标 SQL 组成比较对 (db_id, predicted_sql, gold_sql),调用 database_manager.batch_compare_queries。
3. 根据匹配次数 cnt_true 与难度阈值 difficulty_thresholds,将样本分类为 easy/medium/hard/extra/gold error。

Args:
    model: LazyLLM 模型对象(必需)。
    database_manager: 提供 batch_compare_queries 能力的管理器(必需)。
    num_generations (int): 采样生成 SQL 的次数,默认 10;若小于最大阈值会被自动上调为某个 5 的倍数。
    difficulty_thresholds (list[int]|None): 难度阈值列表,默认 [2, 5, 9]。
    difficulty_labels (list[str]|None): 难度标签列表,默认 ['extra', 'hard', 'medium', 'easy']。
    system_prompt (str|None): 可选系统提示词。
    **kwargs: 其它传递给基类算子的参数。


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLEffortRanker

    op = SQLEffortRanker(model=model, database_manager=database_manager, num_generations=15)
    item = {
        'db_id': 'db_1',
        'prompt': 'Database Schema: ... Question: 有多少已支付的订单?',
        'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';'
    }
    res = op(item)
    print(res)
    # {
    #   'db_id': 'db_1',
    #   'prompt': 'Database Schema: ... Question: 有多少已支付的订单?',
    #   'SQL': 'SELECT count(*) FROM orders WHERE status = \'paid\';',
    #   'sql_execution_difficulty': 'medium'
    # }
    ```
    """
    def __init__(self, model=None, database_manager=None, num_generations=10,
                 difficulty_thresholds=None, difficulty_labels=None,
                 system_prompt=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.database_manager = database_manager
        self.num_generations = int(num_generations)
        sys_prompt = system_prompt or (
            'You are a SQL generator. '
            'Return ONLY the SQL query inside a Markdown code block: ```sql ... ```.'
        )
        self.model = model.share().prompt(sys_prompt) if model else None
        if difficulty_thresholds is None:
            difficulty_thresholds = [2, 5, 9]
        if difficulty_labels is None:
            difficulty_labels = ['extra', 'hard', 'medium', 'easy']

        self.difficulty_config = {
            'thresholds': difficulty_thresholds,
            'labels': difficulty_labels,
        }
        self.timeout = 5.0

        if self.num_generations <= self.difficulty_config['thresholds'][-1]:
            nearest_multiple = ((self.difficulty_config['thresholds'][-1] // 5) + 1) * 5
            LOG.warning(f'num_generations is less than the last threshold, will be set to {nearest_multiple}')
            self.num_generations = nearest_multiple
        if len(self.difficulty_config['thresholds']) != len(self.difficulty_config['labels']) - 1:
            raise ValueError('Thresholds and labels configuration mismatch')

    @staticmethod
    def parse_response(response):
        return _parse_sql_response(response)

    @staticmethod
    def _prepare_comparisons(predicted_sqls_list, ground_truth_list, db_ids, idxs):
        comparisons = []
        for predicted_sqls, ground_truth, db_id in zip(predicted_sqls_list, ground_truth_list, db_ids):
            for predicted_sql in predicted_sqls:
                comparisons.append((db_id, predicted_sql, ground_truth))
        return comparisons

    def classify_difficulty(self, score):
        if score == -1:
            return 'gold error'
        thresholds = self.difficulty_config['thresholds']
        labels = self.difficulty_config['labels']
        for i, threshold in enumerate(thresholds):
            if score <= threshold:
                return labels[i]
        return labels[-1]

    def report_statistics(self, inputs, output_difficulty_key):
        difficulties = [item.get(output_difficulty_key) for item in inputs]
        counts = pd.Series(difficulties).value_counts()
        LOG.info('SQL Difficulty Statistics')
        stats = [f'{d.title()}: {counts.get(d, 0)}' for d in ['easy', 'medium', 'hard', 'extra', 'gold error']]
        LOG.info(', '.join(stats))

    def _generate_and_parse_sqls(self, input_prompts):
        prompts = [q for q in input_prompts for _ in range(self.num_generations)]
        responses = []
        try:
            responses = self.model(prompts)
            if isinstance(responses, str):
                responses = [responses]
        except Exception as e:
            LOG.error(f'Generation failed: {e}')
            responses = [''] * len(prompts)

        all_parsed_sqls = []
        for response in responses:
            all_parsed_sqls.append(_parse_sql_response(response) if response else '')
        return all_parsed_sqls

    def forward(self, data, input_db_id_key='db_id', input_sql_key='SQL',
                input_prompt_key='prompt', output_difficulty_key='sql_execution_difficulty',
                **kwargs):
        assert isinstance(data, dict)
        if self.model is None or self.database_manager is None:
            raise ValueError('model and database_manager are required')

        prompt = data.get(input_prompt_key)
        ground_truth = data.get(input_sql_key)
        db_id = data.get(input_db_id_key)

        if not prompt or not ground_truth or not db_id:
            data[output_difficulty_key] = 'unknown'
            return data

        prompts = [prompt] * self.num_generations
        try:
            responses = self.model(prompts)
            if isinstance(responses, str):
                responses = [responses]
        except Exception as e:
            LOG.error(f'Generation failed: {e}')
            responses = [''] * self.num_generations

        parsed_sqls = [_parse_sql_response(r) if r else '' for r in responses]

        comparisons = [(db_id, sql, ground_truth) for sql in parsed_sqls]
        try:
            batch_results = self.database_manager.batch_compare_queries(comparisons)
            cnt_true = sum(1 for res in batch_results if res.res == 1)
            data[output_difficulty_key] = self.classify_difficulty(cnt_true)
        except Exception as e:
            LOG.error(f'Comparison failed: {e}')
            data[output_difficulty_key] = 'gold error'

        return data

SQLForge

Bases: Text2SQLOps

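Text2SQL data generation operator: SQL generator.

Using the database schema and sample data, generates a set of executable SQL statements for the given database, or for all databases, and tags each with an approximate complexity type.

Main behavior:

  • Generates output_num SQL statements per database.
  • Uses a built-in default prompt (replaceable through prompt_template) that sets the complexity label (easy/medium/hard, etc.).
  • Extracts the SQL text from the Markdown sql fenced code block in the model response.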
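
The helper that extracts the SQL (`_parse_sql_response`) is not listed on this page. The function below is a guess at the general technique, assuming the first fenced sql block wins and that a response without such a block yields an empty string; `extract_sql` is a hypothetical name and the real helper may behave differently.

```python
import re

FENCE = '`' * 3  # three backticks, built this way to keep the example readable

# Assumption: capture everything between an opening sql fence and the next fence.
SQL_BLOCK = re.compile(FENCE + r'sql\s*(.*?)' + FENCE, re.DOTALL | re.IGNORECASE)


def extract_sql(response):
    if not isinstance(response, str):
        return ''
    match = SQL_BLOCK.search(response)
    return match.group(1).strip() if match else ''


reply = f'Here is the query:\n{FENCE}sql\nSELECT count(*) FROM orders;\n{FENCE}\nDone.'
sql = extract_sql(reply)
```

The system prompt asks the model for exactly one such block, so a single non-greedy search is usually enough.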

Parameters:

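  • model

    LazyLLM model object (required); it is reused after share().

  • database_manager

    Manager that supplies the database schema and sample data (required). It must implement list_databases() and get_create_statements_and_insert_statements(db_name).

  • output_num (int, default: 300 ) –

    Number of SQL statements generated per database.

  • prompt_template

    Optional custom prompt builder object implementing build_prompt(...).

  • system_prompt (str | None, default: None ) –

    Optional system prompt; the built-in English prompt is used when omitted.

  • **kwargs

    Additional arguments passed to the base classes Text2SQLOps/LazyLLMDataBase.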

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLForge

# Assumes database_manager already wraps your SQLite / Postgres (or other) databases
op = SQLForge(model=model, database_manager=database_manager, output_num=10)

# If data carries no db_id, output_num SQL statements are generated for every database
res = op({})
print(res[0])
# {
#   'db_id': 'database_1',
#   'SQL': 'SELECT ...',
#   'sql_complexity_type': 'easy'
# }
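
A custom prompt_template for this operator can return either a plain prompt or a (prompt, complexity) pair. The sketch below shows both sides under stated assumptions: `SeededComplexityTemplate` is a hypothetical template, the keyword names follow the call made in `_build_prompt`, and `normalize` restates how the operator interprets the return value.

```python
import random


class SeededComplexityTemplate:
    """Hypothetical template that picks a complexity label per prompt."""

    LEVELS = ('easy', 'medium', 'hard')

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def build_prompt(self, insert_statements, create_statements, db_engine):
        # Sample rows (insert_statements) are ignored in this sketch.
        complexity = self._rng.choice(self.LEVELS)
        schema = '\n'.join(str(s) for s in create_statements)
        prompt = f'Schema ({db_engine}):\n{schema}\n\nWrite one {complexity} SQL query.'
        return prompt, complexity


def normalize(built):
    # A tuple with at least two entries is read as (prompt, complexity);
    # anything else becomes the prompt with complexity 'unknown'.
    if isinstance(built, tuple) and len(built) >= 2:
        return str(built[0]), str(built[1])
    return str(built), 'unknown'


template = SeededComplexityTemplate()
prompt, complexity = normalize(
    template.build_prompt([], ['CREATE TABLE orders (id INT, status TEXT);'], 'sqlite')
)
```

Returning the label alongside the prompt is what fills the sql_complexity_type field in the output records.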
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class SQLForge(Text2SQLOps):
    """Text2SQL 数据生成算子:SQL 生成器。

基于数据库 Schema 与样例数据,为给定或全部数据库自动生成可执行的 SQL 语句集合,并标注大致复杂度类型。

主要行为:

- 对每个数据库生成 output_num 条 SQL。
- 内置默认提示词(可自定义 prompt_template),控制难度标签(easy/medium/hard 等)。
- 从模型返回中解析出 ```sql ... ``` 代码块中的 SQL 文本。

Args:
    model: LazyLLM 模型对象(必需),会被 share() 后复用。
    database_manager: 提供数据库 Schema 与样例数据的管理器(必需),需实现:
        - list_databases()
        - get_create_statements_and_insert_statements(db_name)
    output_num (int): 每个数据库生成的 SQL 数量,默认 300。
    prompt_template: 可选,自定义 prompt 构造器对象,需实现 build_prompt(...)。
    system_prompt (str|None): 可选系统提示词,不传则使用内置英文提示。
    **kwargs: 传递给基类 Text2SQLOps/LazyLLMDataBase 的其它参数。


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLForge

    # 假设 database_manager 已封装了你的 SQLite / Postgres 等数据库
    op = SQLForge(model=model, database_manager=database_manager, output_num=10)

    # 如果 data 中不指定 db_id,则为所有数据库各生成若干条 SQL
    res = op({})
    print(res[0])
    # {
    #   'db_id': 'database_1',
    #   'SQL': 'SELECT ...',
    #   'sql_complexity_type': 'easy'
    # }
    ```
    """
    def __init__(self, model=None, database_manager=None, output_num=300,
                 prompt_template=None, system_prompt=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.database_manager = database_manager
        self.output_num = output_num
        self.prompt_template = prompt_template
        sys_prompt = system_prompt or (
            'You are a SQL generator for Text2SQL tasks.\n'
            'Return ONLY one SQL query inside a Markdown code block: ```sql ... ```.\n'
        )
        self.model = model.share().prompt(sys_prompt) if model else None

    def _build_prompt(self, create_statements, insert_statements, db_engine):
        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            built = template.build_prompt(
                insert_statements=insert_statements,
                create_statements=create_statements,
                db_engine=db_engine,
            )
            if isinstance(built, tuple) and len(built) >= 2:
                return str(built[0]), str(built[1])
            return str(built), 'unknown'
        return _default_build_prompt(create_statements, insert_statements, db_engine)

    def _validate_manager(self):
        if self.model is None:
            raise ValueError('model is required')
        if self.database_manager is None:
            raise ValueError('database_manager is required')
        if not hasattr(self.database_manager, 'list_databases'):
            raise ValueError('database_manager.list_databases is required')
        if not hasattr(self.database_manager, 'get_create_statements_and_insert_statements'):
            raise ValueError('database_manager.get_create_statements_and_insert_statements is required')

    def forward(self, data, output_sql_key='SQL', output_db_id_key='db_id',
                output_complexity_type_key='sql_complexity_type', **kwargs):
        assert isinstance(data, dict)
        self._validate_manager()

        db_engine = getattr(self.database_manager, 'db_type', 'unknown')
        db_id_in_data = data.get(output_db_id_key)
        if db_id_in_data:
            db_names = [db_id_in_data]
        else:
            db_names = self.database_manager.list_databases() or []

        if not db_names:
            LOG.warning('No databases found in database_manager.list_databases()')
            return []

        prompts, db_ids, complexity_types = self._collect_prompts(db_names, db_engine)

        responses = []
        for p in prompts:
            try:
                responses.append(self.model(p))
            except Exception as e:
                LOG.error(f'Failed to generate SQL: {e}')
                responses.append('')

        return [
            {
                output_db_id_key: db_id,
                output_sql_key: _parse_sql_response(resp),
                output_complexity_type_key: complexity,
            }
            for db_id, resp, complexity in zip(db_ids, responses, complexity_types)
        ]

    def _collect_prompts(self, db_names, db_engine):
        prompts = []
        db_ids = []
        complexity_types = []

        LOG.info(f'Generating {self.output_num} SQLs for each database')
        for db_name in tqdm(db_names, desc='Processing Databases'):
            create_statements, insert_statements = self.database_manager.get_create_statements_and_insert_statements(
                db_name
            )
            for _ in range(int(self.output_num)):
                prompt, complexity = self._build_prompt(create_statements, insert_statements, db_engine=db_engine)
                prompts.append(prompt)
                db_ids.append(db_name)
                complexity_types.append(complexity)
        return prompts, db_ids, complexity_types

SQLIntentSynthesizer

Bases: Text2SQLOps

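Text2SQL data generation operator: natural-language question generator.

Given a SQL statement, the database schema, and column comments, generates a natural-language question that matches the SQL semantics, optionally with an "external knowledge" hint, to support Text2SQL training.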
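
Column comments are read from the CREATE statements with a regular expression. The sketch below reuses the pattern from `extract_column_descriptions` in the source listing as a free function; the sample DDL is made up for illustration. The pattern only matches a double-quoted column name followed by a single-word type and a `/* ... */` comment.

```python
import re

# Pattern taken from the source listing: "column" TYPE /* comment */
COLUMN_COMMENT = re.compile(r'"(\w+)"\s+\w+\s*/\*\s*(.*?)\s*\*/')


def extract_column_descriptions(create_statements):
    descriptions = {}
    for statement in create_statements or []:
        for name, text in COLUMN_COMMENT.findall(str(statement)):
            # Column names are lower-cased and the first description wins.
            descriptions.setdefault(name.lower(), text)
    return descriptions


ddl = '''CREATE TABLE "orders" (
  "id" INTEGER /* primary key */,
  "Status" TEXT /* order status: paid or unpaid */
)'''
column_docs = extract_column_descriptions([ddl])
```

Columns declared without a comment, or with a multi-word type before the comment, are simply skipped.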

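Main features:

  • Generates several candidate questions (input_query_num) and selects among them by embedding-based deduplication/diversity.
  • Uses built-in output markers: [QUESTION-START]/[QUESTION-END] and [EXTERNAL-KNOWLEDGE-START]/[EXTERNAL-KNOWLEDGE-END].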
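
The marker format can be parsed with two regular expressions. The patterns below are the ones compiled in `parse_llm_response`; the rest of that method is not shown on this page, so the return shape here (a dictionary, or `None` when the question is missing) is an assumption, and `split_response` is a hypothetical name.

```python
import re

QUESTION = re.compile(r'\[QUESTION-START\](.*?)\[QUESTION-END\]', re.DOTALL)
KNOWLEDGE = re.compile(
    r'\[EXTERNAL-KNOWLEDGE-START\](.*?)\[EXTERNAL-KNOWLEDGE-END\]', re.DOTALL
)


def split_response(response):
    if not isinstance(response, str):
        return None  # matches the visible type check in the source
    question = QUESTION.search(response)
    knowledge = KNOWLEDGE.search(response)
    if question is None:
        return None  # assumption: a response without a question is unusable
    return {
        'question': question.group(1).strip(),
        'evidence': knowledge.group(1).strip() if knowledge else '',
    }


reply = (
    '[QUESTION-START]\nHow many paid orders are there?\n[QUESTION-END]\n'
    '[EXTERNAL-KNOWLEDGE-START]\nstatus = paid marks a paid order.\n[EXTERNAL-KNOWLEDGE-END]'
)
parsed = split_response(reply)
```

`re.DOTALL` lets a question or knowledge block span several lines.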

Parameters:

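  • model

    LazyLLM text generation model (required).

  • embedding_model

    Optional embedding model used for diversity selection among candidate questions. It must support generate_embedding_from_input(texts) or be directly callable on texts.

  • database_manager

    Manager that supplies the schema (required). It must implement get_create_statements_and_insert_statements(db_id).

  • input_query_num (int, default: 5 ) –

    Number of candidate questions generated per SQL statement.

  • prompt_template

    Optional custom prompt builder.

  • system_prompt (str | None, default: None ) –

    Optional system prompt; defaults to a short English assistant prompt.

  • **kwargs

    Additional arguments passed to the base operator.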

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLIntentSynthesizer

op = SQLIntentSynthesizer(model=model,
                          embedding_model=embedding_model,
                          database_manager=database_manager,
                          input_query_num=5)
item = {'db_id': 'db_1', 'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"}
res = op(item)
print(res)
# {
#   'db_id': 'db_1',
#   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#   'question_type': 'default',
#   'question': '有多少已支付的订单?',
#   'evidence': '...optional external knowledge...'
# }
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class SQLIntentSynthesizer(Text2SQLOps):
    """Text2SQL 数据生成算子:自然语言问题生成器。

基于给定 SQL + 数据库 Schema 以及列注释信息,生成与 SQL 语义对应的自然语言问题,并可附带“外部知识”提示,以支持 Text2SQL 训练。

主要特性:

- 支持多候选问题生成(input_query_num),并通过 embedding 去重/多样性选择。
- 内置输出格式标记:[QUESTION-START]/[QUESTION-END] 与 [EXTERNAL-KNOWLEDGE-START]/[...-END]。

Args:
    model: LazyLLM 文本生成模型(必需)。
    embedding_model: 可选向量模型,用于对候选问题做多样性选择;需支持:
        - generate_embedding_from_input(texts) 或直接可调用(texts)。
    database_manager: 提供 Schema 的管理器(必需),需实现:
        - get_create_statements_and_insert_statements(db_id)
    input_query_num (int): 每条 SQL 生成候选问题的数量,默认 5。
    prompt_template: 可选,自定义 prompt 构造器。
    system_prompt (str|None): 可选系统提示词,默认简要英文助手提示。
    **kwargs: 其它传递给基类算子的参数。


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLIntentSynthesizer

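    # model, embedding_model and database_manager are assumed to be constructed beforehand
    op = SQLIntentSynthesizer(model=model,
                              embedding_model=embedding_model,
                              database_manager=database_manager,
                              input_query_num=5)
    item = {'db_id': 'db_1', 'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"}
    res = op(item)
    print(res)
    # Illustrative output (the text depends on the model; with the default
    # input_intent_key='intent' the generated question is stored under 'intent'):
    # {
    #   'db_id': 'db_1',
    #   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    #   'question_type': 'default',
    #   'intent': 'How many paid orders are there?',
    #   'evidence': '...optional external knowledge...'
    # }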
    ```
    """
    def __init__(self, model=None, embedding_model=None, database_manager=None,
                 input_query_num=5, prompt_template=None, system_prompt=None,
                 input_intent_key='intent', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.embedding_model = embedding_model
        self.database_manager = database_manager
        self.question_candidates_num = int(input_query_num)
        self.prompt_template = prompt_template
        self.input_intent_key = input_intent_key
        sys_prompt = system_prompt or 'You are a helpful assistant.'
        self.model = model.share().prompt(sys_prompt) if model else None

    @staticmethod
    def _is_non_empty_text(x):
        return isinstance(x, str) and x.strip() != ''

    def extract_column_descriptions(self, create_statements):
        column_name2column_desc = {}
        pattern = r'"(\w+)"\s+\w+\s*/\*\s*(.*?)\s*\*/'
        if not create_statements:
            return column_name2column_desc
        for create_statement in create_statements:
            for column_name, description in re.findall(pattern, str(create_statement)):
                col = str(column_name).lower()
                if col not in column_name2column_desc:
                    column_name2column_desc[col] = str(description)
        return column_name2column_desc

    def parse_llm_response(self, response):
        if not isinstance(response, str):
            LOG.warning(f'Invalid response type: {type(response)}, expected str. Response: {response}')
            return None

        question_pattern = re.compile(r'\[QUESTION-START\](.*?)\[QUESTION-END\]', re.DOTALL)
        external_knowledge_pattern = re.compile(
            r'\[EXTERNAL-KNOWLEDGE-START\](.*?)\[EXTERNAL-KNOWLEDGE-END\]', re.DOTALL
        )

        question_match = question_pattern.search(response)
        external_knowledge_match = external_knowledge_pattern.search(response)

        question_content = question_match.group(1).strip() if question_match else ''
        external_knowledge_content = external_knowledge_match.group(1).strip() if external_knowledge_match else ''

        if question_content == '':
            return None
        return {'question': question_content, 'external_knowledge': external_knowledge_content}

    @staticmethod
    def _cosine_distance(a, b):
        if not a or not b:
            return 1.0
        n = min(len(a), len(b))
        dot = 0.0
        na = 0.0
        nb = 0.0
        for i in range(n):
            x = float(a[i])
            y = float(b[i])
            dot += x * y
            na += x * x
            nb += y * y
        denom = math.sqrt(na) * math.sqrt(nb)
        if denom == 0.0:
            return 1.0
        return 1.0 - (dot / denom)

    def _select_best_question(self, question_candidates, start_idx, embeddings):
        if not question_candidates:
            return None
        if len(question_candidates) == 1:
            return question_candidates[0]
        # Without usable embeddings, fall back to a random candidate.
        if embeddings is None or start_idx < 0:
            return random.choice(question_candidates)

        end_idx = start_idx + len(question_candidates)
        if end_idx > len(embeddings):
            return random.choice(question_candidates)

        # Pick the candidate whose summed cosine distance to all other candidates is smallest.
        candidate_embeddings = embeddings[start_idx:end_idx]
        distance_sums = []
        for i in range(len(candidate_embeddings)):
            s = 0.0
            for j in range(len(candidate_embeddings)):
                if i == j:
                    continue
                s += self._cosine_distance(candidate_embeddings[i], candidate_embeddings[j])
            distance_sums.append(s)
        min_index = min(range(len(distance_sums)), key=distance_sums.__getitem__)
        return question_candidates[min_index]

    def _build_prompt(self, sql, db_id, db_id2column_info, db_engine):
        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            built = template.build_prompt(sql, db_id, db_id2column_info, db_engine)
            if isinstance(built, tuple) and len(built) >= 2:
                return str(built[0]), str(built[1])
            return str(built), 'unknown'

        column_info = db_id2column_info.get(db_id, {})
        column_info_text = '\n'.join([f'- {k}: {v}' for k, v in list(column_info.items())[:200]])
        prompt = (
            f'You are a Text2SQL intent synthesizer.\n'
            f'Database engine: {db_engine}\n'
            f'db_id: {db_id}\n\n'
            f'Given a SQL query, generate a natural language intent that matches it.\n'
            f'If helpful, you may use the following column descriptions:\n{column_info_text}\n\n'
            f'Output format:\n'
            # Markers must match the patterns used by parse_llm_response.
            f'[QUESTION-START] ... [QUESTION-END]\n'
            f'[EXTERNAL-KNOWLEDGE-START] ... [EXTERNAL-KNOWLEDGE-END]\n\n'
            f'SQL:\n{sql}\n'
        )
        return prompt, 'default'

    def _generate_embeddings(self, texts):
        if not texts:
            return []
        emb = self.embedding_model
        if emb is None:
            return None
        try:
            if hasattr(emb, 'generate_embedding_from_input'):
                vectors = emb.generate_embedding_from_input(texts)
            elif callable(emb):
                vectors = emb(texts)
            else:
                return None
            if not isinstance(vectors, list):
                return None
            return vectors
        except Exception as e:
            LOG.warning(f'Embedding generation failed: {e}')
            return None

    def _validate_generator_manager(self):
        if self.model is None:
            raise ValueError('model is required')
        if self.database_manager is None:
            raise ValueError('database_manager is required')
        if not hasattr(self.database_manager, 'get_create_statements_and_insert_statements'):
            raise ValueError('database_manager.get_create_statements_and_insert_statements is required')

    def forward(self, data, input_sql_key='SQL', input_db_id_key='db_id',
                output_intent_key=None, output_evidence_key='evidence', **kwargs):
        assert isinstance(data, dict)
        self._validate_generator_manager()

        if output_intent_key is None:
            output_intent_key = self.input_intent_key

        if self._is_non_empty_text(data.get(self.input_intent_key)):
            return data

        db_engine = getattr(self.database_manager, 'db_type', 'unknown')
        sql = data.get(input_sql_key, '')
        db_id = data.get(input_db_id_key, '')

        try:
            create_statements, _ = self.database_manager.get_create_statements_and_insert_statements(db_id)
            column_info = self.extract_column_descriptions(create_statements)
        except Exception as e:
            LOG.warning(f'Failed to extract schema for db_id={db_id}: {e}')
            column_info = {}

        prompt, question_type = self._build_prompt(str(sql), str(db_id), {db_id: column_info}, db_engine)
        data['question_type'] = question_type

        responses = []
        for _ in range(self.question_candidates_num):
            try:
                responses.append(self.model(prompt))
            except Exception as e:
                LOG.error(f'Failed to generate question: {e}')
                responses.append('')

        candidates = []
        embedding_texts = []
        for resp in responses:
            parsed = self.parse_llm_response(resp)
            if parsed:
                candidates.append(parsed)
                text = f'{parsed.get("external_knowledge", "")} {parsed.get("question", "")}'.strip()
                embedding_texts.append(text)

        embeddings = self._generate_embeddings(embedding_texts) if embedding_texts else None
        best = self._select_best_question(candidates, 0, embeddings)

        if best is not None:
            data[output_intent_key] = best.get('question', '')
            data[output_evidence_key] = best.get('external_knowledge', '')

        return data
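
The selection step in `_select_best_question` keeps the candidate whose summed cosine distance to the other candidates is smallest, that is, the most central one. Below is a compact sketch of that idea under simplifying assumptions: all vectors have equal length, and `pick_most_central` plus the toy embeddings are illustrative names and data, not library API.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 for empty or zero-norm vectors."""
    if not a or not b:
        return 1.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def pick_most_central(candidates, embeddings):
    """Return the candidate with the smallest summed distance to all others."""
    sums = [
        sum(cosine_distance(e, other) for j, other in enumerate(embeddings) if j != i)
        for i, e in enumerate(embeddings)
    ]
    best = min(range(len(sums)), key=sums.__getitem__)
    return candidates[best]


# Toy data: q_a and q_b point in almost the same direction, q_c is an outlier.
candidates = ['q_a', 'q_b', 'q_c']
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
chosen = pick_most_central(candidates, embeddings)
print(chosen)
```

Choosing the central candidate favors the phrasing most of the generations agree on, so an outlier reply is unlikely to be selected.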

SQLReasoningTracer

Bases: Text2SQLOps

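Text2SQL data generation operator: CoT trace generator.

For a given (question, SQL, database schema, evidence) tuple, generates several Chain-of-Thought texts that reason from the question to the SQL, for training or analysis.

Parameters:

  • model

    LazyLLM model object (required).

  • database_manager

    Manager that provides the schema (required); must implement get_create_statements_and_insert_statements(db_id).

  • prompt_template

    Optional custom prompt builder.

  • output_num (int, default: 3 ) –

    Number of CoT traces generated per sample; defaults to 3 and must be >= 1.

  • **kwargs

    Additional arguments passed to the base operator.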

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLReasoningTracer

op = SQLReasoningTracer(model=model, database_manager=database_manager, output_num=3)
item = {
    'db_id': 'db_1',
    'intent': 'How many paid orders are there?',  # forward reads the question from 'intent' by default
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    'evidence': ''
}
res = op(item)
print(len(res['cot_responses']))
print(res['cot_responses'][0][:200])  # first 200 characters of the first CoT trace
# 3
# "Database Schema: ... Question: How many paid orders are there? ... Reasoning step 1: ... Reasoning step 2: ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class SQLReasoningTracer(Text2SQLOps):
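    """Text2SQL data generation operator: CoT trace generator.

For a given (question, SQL, database schema, evidence) tuple, generates several Chain-of-Thought texts that reason from the question to the SQL, for training or analysis.

Args:
    model: LazyLLM model object (required).
    database_manager: Manager that provides the schema (required); must implement:
        - get_create_statements_and_insert_statements(db_id)
    prompt_template: Optional custom prompt builder.
    output_num (int): Number of CoT traces generated per sample; defaults to 3 and must be >= 1.
    **kwargs: Additional arguments passed to the base operator.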


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLReasoningTracer

    op = SQLReasoningTracer(model=model, database_manager=database_manager, output_num=3)
    item = {
        'db_id': 'db_1',
        'intent': 'How many paid orders are there?',  # forward reads the question from 'intent' by default
        'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
        'evidence': ''
    }
    res = op(item)
    print(len(res['cot_responses']))
    print(res['cot_responses'][0][:200])  # first 200 characters of the first CoT trace
    # 3
    # "Database Schema: ... Question: How many paid orders are there? ... Reasoning step 1: ... Reasoning step 2: ... ```sql SELECT count(*) FROM orders WHERE status = 'paid';```"
    ```
    """
    def __init__(self, model=None, database_manager=None, prompt_template=None,
                 output_num=3, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.database_manager = database_manager
        self.prompt_template = prompt_template
        self.output_num = int(output_num)
        if self.output_num < 1:
            raise ValueError('output_num must be >= 1')
        sys_prompt = 'You are a database expert. Please generate a step-by-step reasoning ' \
                     '(Chain of Thought) and the final SQL.'
        self.model = model.share().prompt(sys_prompt) if model else None

    def _build_prompt(self, item, schema_str):
        intent = item.get(self.input_intent_key)
        gold_sql = item.get(self.input_sql_key)
        evidence = item.get(self.input_evidence_key, '')

        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            return template.build_prompt(schema_str, intent, gold_sql, evidence)

        return (
            f'Database Schema:\n{schema_str}\n\n'
            f'Intent: {intent}\n'
            f'Evidence: {evidence}\n'
            f'Target SQL: {gold_sql}\n\n'
            f'Please provide a detailed step-by-step reasoning that leads to the correct SQL query.'
        )

    def forward(self, data, input_sql_key='SQL', input_intent_key='intent',
                input_db_id_key='db_id', input_evidence_key='evidence',
                output_cot_key='cot_responses', **kwargs):
        assert isinstance(data, dict)
        self._validate_manager()

        self.input_intent_key = input_intent_key
        self.input_sql_key = input_sql_key
        self.input_db_id_key = input_db_id_key
        self.input_evidence_key = input_evidence_key

        db_id = data.get(input_db_id_key)
        if not db_id:
            LOG.warning('Missing db_id for reasoning tracing')
            return data

        try:
            create_statements, _ = self.database_manager.get_create_statements_and_insert_statements(db_id)
            schema_str = '\n\n'.join([str(s) for s in create_statements])
            prompt = self._build_prompt(data, schema_str)

            responses = []
            for _ in range(self.output_num):
                try:
                    responses.append(self.model(prompt))
                except Exception as e:
                    LOG.error(f'Failed to generate reasoning trace: {e}')
                    responses.append('')
            data[output_cot_key] = responses
        except Exception as e:
            LOG.error(f'Error during reasoning tracing for db_id={db_id}: {e}')
            data[output_cot_key] = []

        return data

    def _validate_manager(self):
        if self.model is None:
            raise ValueError('model is required')
        if self.database_manager is None:
            raise ValueError('database_manager is required')
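
According to the listing above, a custom `prompt_template` only needs a `build_prompt(schema_str, intent, gold_sql, evidence)` method. The sketch below shows one possible template; the class name `ConcisePromptTemplate` and its wording are hypothetical, and only the method signature is taken from the source.

```python
class ConcisePromptTemplate:
    """Hypothetical template; only the build_prompt signature comes from the listing."""

    def build_prompt(self, schema_str, intent, gold_sql, evidence):
        parts = [f'Schema:\n{schema_str}', f'Intent: {intent}']
        # Skip the evidence section entirely when no evidence is supplied.
        if evidence:
            parts.append(f'Evidence: {evidence}')
        parts.append(f'Target SQL: {gold_sql}')
        parts.append('Explain step by step how the intent leads to the target SQL.')
        return '\n\n'.join(parts)


template = ConcisePromptTemplate()
prompt = template.build_prompt(
    'CREATE TABLE orders (id INTEGER, status TEXT);',
    'How many paid orders are there?',
    "SELECT count(*) FROM orders WHERE status = 'paid';",
    '',
)
print(prompt)
```

Such an object could then be passed as `SQLReasoningTracer(..., prompt_template=template)`.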

SQLRuntimeSieve

Bases: Text2SQLOps

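Text2SQL data filtering operator: SQL executability filter.

Applies a simple syntactic filter to the SQL in each record (only queries starting with SELECT or WITH are kept), then asks database_manager to validate it with EXPLAIN. Only SQL that passes EXPLAIN on the target database is kept.

Parameters:

  • database_manager

    Manager that provides database connections and EXPLAIN support (required); must implement database_exists(db_id) and batch_explain_queries(list[(db_id, sql)]).

  • **kwargs

    Additional arguments passed to the base operator.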

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLRuntimeSieve

op = SQLRuntimeSieve(database_manager=database_manager)
item = {'db_id': 'db_1', 'SQL': 'SELECT * FROM users;'}
res = op(item)
print(res)  # if the SQL passes EXPLAIN on db_1 the original dict is kept; otherwise the sample is dropped (forward returns [])
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class SQLRuntimeSieve(Text2SQLOps):
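    """Text2SQL data filtering operator: SQL executability filter.

Applies a simple syntactic filter to the SQL in each record (only queries starting with SELECT or WITH are kept), then asks database_manager to validate it with EXPLAIN. Only SQL that passes EXPLAIN on the target database is kept.

Args:
    database_manager: Manager that provides database connections and EXPLAIN support (required); must implement:
        - database_exists(db_id)
        - batch_explain_queries(list[(db_id, sql)])
    **kwargs: Additional arguments passed to the base operator.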


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLRuntimeSieve

    op = SQLRuntimeSieve(database_manager=database_manager)
    item = {'db_id': 'db_1', 'SQL': 'SELECT * FROM users;'}
    res = op(item)
    print(res)  # if the SQL passes EXPLAIN on db_1 the original dict is kept; otherwise the sample is dropped (forward returns [])
    ```
    """
    def __init__(self, database_manager=None, **kwargs):
        super().__init__(**kwargs)
        self.database_manager = database_manager

    def filter_select_sql(self, sql):
        if not isinstance(sql, str):
            return False
        sql_wo_comments = re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL)
        sql_wo_comments = re.sub(r'--.*', '', sql_wo_comments)
        sql_wo_comments = sql_wo_comments.strip()

        # Only read-only query shapes (SELECT or WITH ... SELECT) pass this pre-check.
        return sql_wo_comments.lower().startswith(('select', 'with'))

    def forward(self, data, input_sql_key='SQL', input_db_id_key='db_id', **kwargs):
        assert isinstance(data, dict)
        if self.database_manager is None:
            LOG.error('database_manager is required for SQLRuntimeSieve')
            return data

        sql = data.get(input_sql_key)
        db_id = data.get(input_db_id_key)

        if not self.filter_select_sql(sql):
            return []

        if not self.database_manager.database_exists(db_id):
            LOG.warning(f'Database {db_id} not found in registry, please check the database folder')
            return []

        try:
            execution_results = self.database_manager.batch_explain_queries([(db_id, sql)])
            if execution_results and execution_results[0].success:
                return data
        except Exception as e:
            LOG.error(f'Error during explain_query: {e}')

        return []
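
The sieve only relies on two `database_manager` methods, `database_exists` and `batch_explain_queries`, and on each result exposing a `success` attribute. The following is a hypothetical in-memory stand-in built on the standard `sqlite3` module; `InMemorySQLiteManager` and `ExplainResult` are illustrative names, not LazyLLM classes, and the sketch is meant for single-threaded use.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ExplainResult:
    success: bool
    error: str = ''


class InMemorySQLiteManager:
    """Hypothetical stand-in exposing the two methods the sieve calls."""

    def __init__(self, schemas):
        # schemas: {db_id: DDL script}; every database lives in memory.
        self._conns = {}
        for db_id, ddl in schemas.items():
            conn = sqlite3.connect(':memory:')
            conn.executescript(ddl)
            self._conns[db_id] = conn

    def database_exists(self, db_id):
        return db_id in self._conns

    def batch_explain_queries(self, queries):
        results = []
        for db_id, sql in queries:
            try:
                # EXPLAIN compiles the statement without running the query itself.
                self._conns[db_id].execute(f'EXPLAIN {sql}')
                results.append(ExplainResult(True))
            except sqlite3.Error as e:
                results.append(ExplainResult(False, str(e)))
        return results


manager = InMemorySQLiteManager({'db_1': 'CREATE TABLE users (id INTEGER, name TEXT);'})
ok, bad = manager.batch_explain_queries([
    ('db_1', 'SELECT * FROM users;'),
    ('db_1', 'SELECT * FROM missing_table;'),
])
print(ok.success, bad.success)
```

A query against a missing table fails at compile time, which is the kind of error the EXPLAIN check is meant to catch before any data is read.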

SQLSyntaxProfiler

Bases: Text2SQLOps

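Text2SQL data classification operator: SQL component difficulty classifier.

Uses a structure-level SQL difficulty evaluator (EvalHardness/EvalHardnessLite) to label each SQL query by the complexity of the components it uses (easy/medium/hard/extra, etc.).

Parameters:

  • difficulty_thresholds (list[int] | None, default: None ) –

    List of difficulty thresholds; defaults to [2, 4, 6].

  • difficulty_labels (list[str] | None, default: None ) –

    List of difficulty labels; defaults to ['easy', 'medium', 'hard', 'extra']. It must contain exactly one more entry than difficulty_thresholds.

  • **kwargs

    Additional arguments passed to the base operator.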

Examples:

from lazyllm.tools.data.operators.text2sql_ops import SQLSyntaxProfiler

op = SQLSyntaxProfiler()
item = {'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"}
res = op(item)
print(res)
# {
#   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#   'sql_component_difficulty': 'easy'
# }
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class SQLSyntaxProfiler(Text2SQLOps):
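    """Text2SQL data classification operator: SQL component difficulty classifier.

Uses a structure-level SQL difficulty evaluator (EvalHardness/EvalHardnessLite) to label each SQL query by the complexity of the components it uses (easy/medium/hard/extra, etc.).

Args:
    difficulty_thresholds (list[int]|None): List of difficulty thresholds; defaults to [2, 4, 6].
    difficulty_labels (list[str]|None): List of difficulty labels; defaults to ['easy', 'medium', 'hard', 'extra'].
    **kwargs: Additional arguments passed to the base operator.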


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import SQLSyntaxProfiler

    op = SQLSyntaxProfiler()
    item = {'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';"}
    res = op(item)
    print(res)
    # {
    #   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    #   'sql_component_difficulty': 'easy'
    # }
    ```
    """
    def __init__(self, difficulty_thresholds=None, difficulty_labels=None, **kwargs):
        super().__init__(**kwargs)
        if difficulty_thresholds is None:
            difficulty_thresholds = [2, 4, 6]
        if difficulty_labels is None:
            difficulty_labels = ['easy', 'medium', 'hard', 'extra']

        self.difficulty_config = {
            'thresholds': difficulty_thresholds,
            'labels': difficulty_labels,
        }
        if len(self.difficulty_config['thresholds']) != len(self.difficulty_config['labels']) - 1:
            raise ValueError('Thresholds and labels configuration mismatch')

    def eval_component_hardness(self, sql, schema):
        evaluator = EvalHardness(Schema(schema), sql)
        return evaluator.run()

    def eval_hardness_lite(self, sql):
        evaluator = EvalHardnessLite(str(sql), self.difficulty_config)
        return evaluator.run()

    def forward(self, data, input_sql_key='SQL',
                output_difficulty_key='sql_component_difficulty', **kwargs):
        assert isinstance(data, dict)
        sql = data.get(input_sql_key)
        if not sql:
            data[output_difficulty_key] = 'unknown'
            return data
        hardness = self.eval_hardness_lite(str(sql))
        data[output_difficulty_key] = hardness

        return data
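
The constructor requires `len(thresholds) == len(labels) - 1`, which is the usual shape for mapping a numeric score onto ordered buckets. How `EvalHardnessLite` computes its score is not shown here, so the sketch below only illustrates the bucket lookup. It assumes that a score equal to a threshold falls into the higher bucket; `label_for_score` is a hypothetical helper.

```python
from bisect import bisect_right


def label_for_score(score, thresholds, labels):
    """Map a numeric score onto one of len(thresholds) + 1 ordered labels."""
    if len(thresholds) != len(labels) - 1:
        raise ValueError('Thresholds and labels configuration mismatch')
    # Assumption: a score equal to a threshold belongs to the higher bucket.
    return labels[bisect_right(thresholds, score)]


thresholds = [2, 4, 6]
labels = ['easy', 'medium', 'hard', 'extra']
buckets = [label_for_score(s, thresholds, labels) for s in (0, 2, 5, 9)]
print(buckets)
```

With three thresholds there are four buckets, which is why the default configuration pairs `[2, 4, 6]` with four labels.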

TSQLSemanticAuditor

Bases: Text2SQLOps

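Text2SQL data filtering operator: question-SQL consistency filter.

Given a natural-language question, optional evidence, a SQL query, and the database schema, judges whether the SQL correctly answers the question and keeps only the samples judged correct.

Internal logic:

  • Calls database_manager to fetch the DDL (create_statements) for db_id.
  • Asks the model for a Yes/No judgment; the sample is kept only if the reply contains 'yes', otherwise it is dropped (forward returns an empty list).

Parameters:

  • model

    LazyLLM model object (required).

  • database_manager

    Manager that provides the schema (required); must implement get_create_statements_and_insert_statements(db_id).

  • prompt_template

    Optional custom prompt builder.

  • system_prompt (str | None, default: None ) –

    Optional system prompt; defaults to an English Yes/No judging instruction.

  • **kwargs

    Additional arguments passed to the base operator.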

Examples:

from lazyllm.tools.data.operators.text2sql_ops import TSQLSemanticAuditor

op = TSQLSemanticAuditor(model=model, database_manager=database_manager)
item = {
    'db_id': 'db_1',
    'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    'question': 'How many paid orders are there?',
    'evidence': ''
}
res = op(item)
print(res)
# {
#   'db_id': 'db_1',
#   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
#   'question': 'How many paid orders are there?',
#   'evidence': ''
# }
# If the model judges that the SQL does not answer the question, the sample is dropped (forward returns [])
Source code in lazyllm/tools/data/operators/text2sql_ops.py
class TSQLSemanticAuditor(Text2SQLOps):
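    """Text2SQL data filtering operator: question-SQL consistency filter.

Given a natural-language question, optional evidence, a SQL query, and the database schema, judges whether the SQL correctly answers the question and keeps only the samples judged correct.

Internal logic:

- Calls database_manager to fetch the DDL (create_statements) for db_id.
- Asks the model for a Yes/No judgment; the sample is kept only if the reply contains 'yes', otherwise it is dropped (forward returns an empty list).

Args:
    model: LazyLLM model object (required).
    database_manager: Manager that provides the schema (required); must implement:
        - get_create_statements_and_insert_statements(db_id)
    prompt_template: Optional custom prompt builder.
    system_prompt (str|None): Optional system prompt; defaults to an English Yes/No judging instruction.
    **kwargs: Additional arguments passed to the base operator.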


Examples:
    ```python
    from lazyllm.tools.data.operators.text2sql_ops import TSQLSemanticAuditor

    op = TSQLSemanticAuditor(model=model, database_manager=database_manager)
    item = {
        'db_id': 'db_1',
        'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
        'question': 'How many paid orders are there?',
        'evidence': ''
    }
    res = op(item)
    print(res)
    # {
    #   'db_id': 'db_1',
    #   'SQL': "SELECT count(*) FROM orders WHERE status = 'paid';",
    #   'question': 'How many paid orders are there?',
    #   'evidence': ''
    # }
    # If the model judges that the SQL does not answer the question, the sample is dropped (forward returns [])
    ```
    """
    def __init__(self, model=None, database_manager=None, prompt_template=None,
                 system_prompt=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.database_manager = database_manager
        self.prompt_template = prompt_template
        sys_prompt = system_prompt or (
            'You are an expert in SQL and database analysis.\n'
            'Your task is to determine if a given SQL query correctly answers a natural language '
            'question based on the provided database schema.\n'
            'Respond ONLY with "Yes" if the SQL is correct and "No" otherwise.'
        )
        self.model = model.share().prompt(sys_prompt) if model else None

    def _parse_consistency_response(self, response):
        if not isinstance(response, str):
            return False
        response = response.strip().lower()
        if 'yes' in response:
            return True
        return False

    def _build_prompt(self, question, sql, db_details):
        template = self.prompt_template
        if template is not None and hasattr(template, 'build_prompt'):
            return str(template.build_prompt(question, sql, db_details))

        return (
            f'Database Schema:\n{db_details}\n\n'
            f'Question: {question}\n\n'
            f'SQL Query: {sql}\n\n'
            f'Does the SQL query correctly answer the question according to the schema? (Yes/No)'
        )

    def forward(self, data, input_sql_key='SQL', input_db_id_key='db_id',
                input_question_key='question', input_evidence_key='evidence', **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.database_manager is None:
            raise ValueError('database_manager is required')

        sql = data.get(input_sql_key)
        question = data.get(input_question_key)
        evidence = data.get(input_evidence_key, '')
        db_id = data.get(input_db_id_key)

        if not question or str(question).strip() == '':
            return []

        if evidence:
            question = f'{question}\n{evidence}'

        try:
            create_statements, _ = self.database_manager.get_create_statements_and_insert_statements(db_id)
            db_details = '\n\n'.join([str(s) for s in create_statements])
            prompt = self._build_prompt(str(question), str(sql), db_details)
            response = self.model(prompt)
            if self._parse_consistency_response(response):
                return data
        except Exception as e:
            LOG.warning(f'Failed to check correspondence: {e}')

        return []
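
The acceptance rule in `_parse_consistency_response` is a plain substring test, so any reply containing the letters "yes" is accepted. The sketch below contrasts it with a stricter first-word check. That check is an alternative shown for illustration only, not what the library does, and both function names are hypothetical.

```python
import re


def substring_verdict(response):
    """Mirrors the rule in the listing: any 'yes' substring counts as acceptance."""
    return isinstance(response, str) and 'yes' in response.strip().lower()


def first_word_verdict(response):
    """Alternative rule: accept only when the first word of the reply is 'yes'."""
    if not isinstance(response, str):
        return False
    match = re.match(r'\W*(\w+)', response.lower())
    return bool(match) and match.group(1) == 'yes'


# Hypothetical model reply: a rejection that happens to contain "yes" inside "Yesterday".
reply = 'No. Yesterday-only orders are missing from the filter.'
print(substring_verdict(reply), first_word_verdict(reply))
```

When the default system prompt is respected the model answers only "Yes" or "No", and the two rules agree; they differ only on longer free-form replies.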

Pretraining Operators

lazyllm.tools.data.operators.pt_op

ContextQualFilter

Bases: PT

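Uses a VLM or LLM to judge whether a context is suitable for generating QA pairs; only samples with score=1 (suitable) are kept.

Parameters:

  • llm

    Vision-language or text language model instance.

  • context_key (str, default: 'context' ) –

    Name of the context field; defaults to 'context'.

  • image_key (str, default: 'image_path' ) –

    Name of the image path field; defaults to 'image_path'.

  • prompt (str, default: None ) –

    Optional custom prompt.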

Examples:

import lazyllm
from lazyllm.tools.data import pt

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt.ContextQualFilter(vlm)
res = op([{'context': 'Good context for QA.', 'image_path': '/path/to/image.jpg'}])
# only samples with score=1 are kept
Source code in lazyllm/tools/data/operators/pt_op.py
class ContextQualFilter(PT):
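    """Uses a VLM or LLM to judge whether a context is suitable for generating QA pairs; only samples with score=1 (suitable) are kept.

Args:
    llm: Vision-language or text language model instance.
    context_key (str): Name of the context field; defaults to 'context'.
    image_key (str): Name of the image path field; defaults to 'image_path'.
    prompt (str): Optional custom prompt.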


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt.ContextQualFilter(vlm)
    res = op([{'context': 'Good context for QA.', 'image_path': '/path/to/image.jpg'}])
    # only samples with score=1 are kept
    ```
    """
    DEFAULT_PROMPT = (
        'Evaluate whether the given context (text and/or images) is suitable for generating QA pairs. '
        'Output JSON only. Do not output any other irrelevant content.\n'
        '{\n'
        '  "score": 0,\n'
        '  "reason": ""\n'
        '}\n'
        'score: MUST be 0 or 1 only. 1=suitable, 0=not suitable. Good context has sufficient info for Q&A.'
    )

    def __init__(self, llm, context_key='context', image_key='image_path',
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if llm is None:
            raise ValueError('ContextQualFilter requires llm (vision- or text-language model).')
        self.context_key = context_key
        self.image_key = image_key
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._evaluator = llm.share().prompt(self.prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        context = data.get(self.context_key, '')
        if not context:
            return []
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        try:
            query = f'Context:\n{context}\n\nIs this context suitable for generating QA pairs?'
            inputs = encode_query_with_filepaths(query, paths) if paths else query
            out = self._evaluator(inputs)
            if not isinstance(out, dict):
                return []
            score = out.get('score', out.get('suitable', 0))
            try:
                score = int(float(score))
            except (TypeError, ValueError):
                score = 0
            if score != 1:
                return []
            return data
        except Exception as e:
            LOG.warning(f'Context qualification evaluation failed: {e}')
            return []

GraphRetriever

Bases: PT_MM

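Parses Markdown image links of the form ![alt](path) from the context field, keeps the image paths that exist on disk, and writes them to img_key. The original context is not modified; if context.strip() is empty, img_key is set to [] and the sample is still kept.

Parameters:

  • context_key (str, default: 'context' ) –

    Name of the text context field; defaults to 'context'.

  • img_key (str, default: 'image_path' ) –

    Name of the output field for image paths; defaults to 'image_path'.

  • images_folder (str, default: None ) –

    Optional image root directory; when set, each link is resolved as images_folder joined with the file name of the linked path.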

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.GraphRetriever(context_key='context', img_key='img', _save_data=False)
data = {'context': 'Some content ![](/path/to/fig.png)'}
res = op([data])
# res[0]['img'] contains resolved absolute path

# empty context: res[0]['img'] == [], record kept, source context unchanged
empty_res = op([{'context': '   '}])
Source code in lazyllm/tools/data/operators/pt_op.py
class GraphRetriever(PT_MM):
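    """Parses Markdown image links of the form `![alt](path)` from the context field, keeps the image paths that exist on disk, and writes them to img_key.
The original context is not modified; if context.strip() is empty, img_key is set to [] and the sample is still kept.

Args:
    context_key (str): Name of the text context field; defaults to 'context'.
    img_key (str): Name of the output field for image paths; defaults to 'image_path'.
    images_folder (str): Optional image root directory used to resolve relative paths.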


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.GraphRetriever(context_key='context', img_key='img', _save_data=False)
    data = {'context': 'Some content ![](/path/to/fig.png)'}
    res = op([data])
    # res[0]['img'] contains resolved absolute path

    # empty context: res[0]['img'] == [], record kept, source context unchanged
    empty_res = op([{'context': '   '}])
    ```
    """
    def __init__(self, context_key='context', img_key='image_path', images_folder: Optional[str] = None,
                 _concurrency_mode='process', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.context_key = context_key
        self.img_key = img_key
        self.images_folder = images_folder

    def _parse_str_for_paths(self, s) -> list:
        matches = re.findall(r'!\[.*?\]\((.*?)\)', str(s))
        candidates = matches if matches else [str(s)] if s else []
        paths = []
        for p in candidates:
            if not p or not p.strip():
                continue
            raw = os.path.join(self.images_folder, os.path.basename(p)) if self.images_folder else p
            full = os.path.abspath(raw)
            if os.path.exists(full):
                paths.append(full)
        return paths

    def _extract_img_paths(self, img_data) -> list:
        valid_paths = []
        if isinstance(img_data, list):
            for item in img_data:
                if isinstance(item, list):
                    for sub in item:
                        valid_paths.extend(self._parse_str_for_paths(sub))
                else:
                    valid_paths.extend(self._parse_str_for_paths(item))
        else:
            valid_paths.extend(self._parse_str_for_paths(img_data))
        return list(dict.fromkeys(valid_paths))

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        context = data.get(self.context_key, '')
        if isinstance(context, list):
            context = '\n\n'.join(str(c) for c in context)
        context_stripped = context.strip() if context else ''
        if not context_stripped:
            data[self.img_key] = []
            return data
        valid_paths = self._extract_img_paths(context)
        data[self.img_key] = valid_paths
        return data
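
The core of the operator is a regular-expression scan for `![alt](path)` links followed by an existence check on disk. The sketch below reproduces that flow with a temporary directory so it can run anywhere. `existing_image_paths` is an illustrative helper and, unlike the listing, it does not fall back to treating the whole string as a path when no link is found.

```python
import os
import re
import tempfile

IMAGE_LINK_RE = re.compile(r'!\[.*?\]\((.*?)\)')


def existing_image_paths(context, images_folder=None):
    """Return de-duplicated absolute paths of linked images that exist on disk."""
    paths = []
    for target in IMAGE_LINK_RE.findall(context):
        if not target.strip():
            continue
        # With a root folder, only the file name of the link target is used.
        raw = os.path.join(images_folder, os.path.basename(target)) if images_folder else target
        full = os.path.abspath(raw)
        if os.path.exists(full):
            paths.append(full)
    return list(dict.fromkeys(paths))  # keeps first-seen order


with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, 'fig.png'), 'wb').close()
    text = 'Intro ![chart](figures/fig.png) and ![gone](figures/missing.png) and ![again](fig.png)'
    found = existing_image_paths(text, images_folder=folder)
    found_names = [os.path.basename(p) for p in found]
    print(found_names)
```

Both links to `fig.png` resolve to the same file, so it is reported once, and the link to the missing file is skipped.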

ImageDedup

Bases: PT_MM

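Deduplicates records by image file hash, keeping the first occurrence of each image and skipping duplicates. Records whose image path is missing or whose file cannot be hashed are dropped as well.

Parameters:

  • image_key (str, default: 'image_path' ) –

    Name of the image path field; defaults to 'image_path'.

  • hash_method (str, default: 'md5' ) –

    Hash algorithm; defaults to 'md5'.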

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.ImageDedup()
batch = [{'image_path': 'a.jpg', 'id': 1}, {'image_path': 'a.jpg', 'id': 2}, {'image_path': 'b.jpg', 'id': 3}]
res = op(batch)
# len(res) == 2, duplicate removed (assuming a.jpg and b.jpg exist on disk with different contents)
Source code in lazyllm/tools/data/operators/pt_op.py
class ImageDedup(PT_MM):
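    """Deduplicates records by image file hash, keeping the first occurrence of each image and skipping duplicates.

Args:
    image_key (str): Name of the image path field; defaults to 'image_path'.
    hash_method (str): Hash algorithm; defaults to 'md5'.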


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.ImageDedup()
    batch = [{'image_path': 'a.jpg', 'id': 1}, {'image_path': 'a.jpg', 'id': 2}, {'image_path': 'b.jpg', 'id': 3}]
    res = op(batch)
    # len(res) == 2, duplicate removed (assuming a.jpg and b.jpg exist on disk with different contents)
    ```
    """
    def __init__(self, image_key='image_path', hash_method='md5', **kwargs):
        super().__init__(**kwargs)
        self.image_key = image_key
        self.hash_method = hash_method

    def _calc_hash(self, image_path):
        try:
            if not os.path.exists(image_path):
                return None
            hash_obj = hashlib.new(self.hash_method)
            with open(image_path, 'rb') as f:
                for chunk in iter(lambda: f.read(4096), b''):
                    hash_obj.update(chunk)
            return hash_obj.hexdigest()
        except Exception as e:
            LOG.warning(f'Failed to calculate hash for {image_path}: {e}')
            return None

    def forward_batch_input(self, data, **kwargs):
        assert isinstance(data, list)
        seen_hashes: Set[str] = set()
        deduplicated_data = []
        for item in data:
            assert isinstance(item, dict)
            paths = _normalize_image_paths(item.get(self.image_key, ''))
            if not paths:
                continue
            image_hash = self._calc_hash(paths[0])
            if image_hash is None:
                continue
            if image_hash in seen_hashes:
                continue
            seen_hashes.add(image_hash)
            deduplicated_data.append(item)
        return deduplicated_data
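
The hash-based deduplication in the listing above can be tried without lazyllm. The following is a minimal sketch, assuming plain dicts as records; `file_digest` and `dedup_by_content` are hypothetical helper names, not part of the library.

```python
import hashlib
import os
import tempfile


def file_digest(path, method='md5', chunk_size=4096):
    """Return the hex digest of a file, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    digest = hashlib.new(method)
    with open(path, 'rb') as f:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_by_content(items, key='image_path'):
    """Keep the first record for each distinct file content; skip missing files."""
    seen, kept = set(), []
    for item in items:
        h = file_digest(item.get(key, ''))
        if h is None or h in seen:
            continue
        seen.add(h)
        kept.append(item)
    return kept


# Demo: a.bin and b.bin share content, c.bin differs, missing.bin does not exist.
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [('a.bin', b'same'), ('b.bin', b'same'), ('c.bin', b'other')]:
        with open(os.path.join(tmp, name), 'wb') as f:
            f.write(payload)
    names = ['a.bin', 'b.bin', 'c.bin', 'missing.bin']
    batch = [{'image_path': os.path.join(tmp, n), 'id': i} for i, n in enumerate(names)]
    kept_ids = [item['id'] for item in dedup_by_content(batch)]

print(kept_ids)  # [0, 2]
```

Because the comparison is on file content rather than on path, two different paths pointing at identical bytes count as duplicates.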

Phi4QAGenerator

Bases: PT

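Uses an LLM to convert a context (optionally with images) into Phi-4-style multi-turn Q&A pairs in pretraining format.

Parameters:

  • llm

    Vision or text language model instance. Passing None raises ValueError.

  • image_key (str, default: 'image_path' ) –

    Name of the field holding the image path. Defaults to 'image_path'.

  • context_key (str, default: 'context' ) –

    Name of the field holding the context. Defaults to 'context'.

  • num_qa (int, default: 5 ) –

    Number of Q&A pairs to request from the model. Defaults to 5.

  • prompt (str, default: None ) –

    Optional custom prompt; the built-in DEFAULT_PROMPT is used when omitted.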

Examples:

import lazyllm
from lazyllm.tools.data import pt

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt.Phi4QAGenerator(vlm, num_qa=2)
res = op([{'context': 'Some context.', 'image_path': '/path/to/image.jpg'}])
# res[0]['qa_pairs'] contains pretraining-format Q&A
Source code in lazyllm/tools/data/operators/pt_op.py
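class Phi4QAGenerator(PT):
    """Uses an LLM to convert a context (optionally with images) into Phi-4-style multi-turn Q&A pairs in pretraining format.

Args:
    llm: Vision or text language model instance.
    image_key (str): Name of the field holding the image path. Defaults to 'image_path'.
    context_key (str): Name of the field holding the context. Defaults to 'context'.
    num_qa (int): Number of Q&A pairs to request from the model. Defaults to 5.
    prompt (str): Optional custom prompt.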


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt.Phi4QAGenerator(vlm, num_qa=2)
    res = op([{'context': 'Some context.', 'image_path': '/path/to/image.jpg'}])
    # res[0]['qa_pairs'] contains pretraining-format Q&A
    ```
    """
    DEFAULT_PROMPT = (
        'Convert the given context (text and/or images) into pretraining-format multi-turn Q&A dialogue data. '
        'Output JSON only. Do not output any other irrelevant content.\n'
        '{\n'
        '  "qa_pairs": [\n'
        '    {"query": "", "answer": ""}\n'
        '  ]\n'
        '}\n'
        'Each item has query (question) and answer. Generate natural, instructional Q&A suitable for LM pretraining.'
    )

    def __init__(self, llm, image_key='image_path', context_key='context', num_qa=5,
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if llm is None:
            raise ValueError('Phi4QAGenerator requires llm (vision- or text-language model).')
        self.image_key = image_key
        self.context_key = context_key
        self.num_qa = num_qa
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._generator = llm.share().prompt(self.prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        context = data.get(self.context_key, '')
        if not context:
            return []
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        try:
            query = f'Context:\n{context}\n\nGenerate {self.num_qa} pretraining-format Q&A pairs (phi-4 style).'
            inputs = encode_query_with_filepaths(query, paths) if paths else query
            out = self._generator(inputs)
            if not isinstance(out, dict):
                data['qa_pairs'] = []
                return data
            raw = out.get('qa_pairs', [])
            if not isinstance(raw, list):
                data['qa_pairs'] = []
                return data
            qa_pairs = []
            for item in raw:
                if isinstance(item, dict) and 'query' in item and 'answer' in item:
                    qa_pairs.append({'query': str(item['query']), 'answer': str(item['answer'])})
                elif isinstance(item, dict) and 'question' in item:
                    qa_pairs.append({
                        'query': str(item.get('question', item.get('query', ''))),
                        'answer': str(item.get('answer', item.get('ans', ''))),
                    })
            data['qa_pairs'] = qa_pairs
            return data
        except Exception as e:
            LOG.warning(f'Phi4 Q&A generation failed: {e}')
            return []
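
The post-processing of the model output in `forward` above accepts both `query`/`answer` items and `question`/`ans` variants. A standalone sketch of that normalization, assuming the model output has already been parsed into Python objects (`normalize_qa_pairs` is a hypothetical name used only for illustration):

```python
def normalize_qa_pairs(raw):
    """Coerce parsed model output into a list of {'query', 'answer'} string pairs."""
    if not isinstance(raw, list):
        return []
    pairs = []
    for item in raw:
        if not isinstance(item, dict):
            continue  # ignore anything that is not a JSON object
        if 'query' in item and 'answer' in item:
            pairs.append({'query': str(item['query']), 'answer': str(item['answer'])})
        elif 'question' in item:
            # Tolerate alternative key names some models emit.
            answer = item.get('answer', item.get('ans', ''))
            pairs.append({'query': str(item['question']), 'answer': str(answer)})
    return pairs


raw_output = [
    {'query': 'q1', 'answer': 'a1'},
    {'question': 'q2', 'ans': 2},
    'junk',
    {'query': 'q3'},  # no answer and no 'question' key, so it is dropped
]
pairs = normalize_qa_pairs(raw_output)
print(pairs)
```

Items that match neither shape are silently skipped, so a partly malformed response still yields the usable pairs.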

TextRelevanceFilter

Bases: PT_MM

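Uses a VLM to score image-text relevance and filters out samples whose relevance falls below the threshold. For samples with several images, the mean score decides whether the sample is kept, and only images scoring at or above the threshold remain in the record.

Parameters:

  • vlm

    Vision-language model instance. Passing None raises ValueError.

  • image_key (str, default: 'image_path' ) –

    Name of the field holding the image path. Defaults to 'image_path'.

  • text_key (str, default: 'text' ) –

    Name of the field holding the text. Defaults to 'text'.

  • threshold (float, default: 0.6 ) –

    Relevance threshold in [0, 1]. Defaults to 0.6.

  • prompt (str, default: None ) –

    Optional custom prompt; the built-in DEFAULT_PROMPT is used when omitted.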

Examples:

import lazyllm
from lazyllm.tools.data import pt_mm

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.TextRelevanceFilter(vlm, threshold=0.5)
res = op([{'image_path': '/path/to/image.jpg', 'text': 'a red square'}])
# samples with relevance >= threshold are kept
Source code in lazyllm/tools/data/operators/pt_op.py
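class TextRelevanceFilter(PT_MM):
    """Uses a VLM to score image-text relevance and filters out samples whose relevance falls below the threshold.

Args:
    vlm: Vision-language model instance.
    image_key (str): Name of the field holding the image path. Defaults to 'image_path'.
    text_key (str): Name of the field holding the text. Defaults to 'text'.
    threshold (float): Relevance threshold in [0, 1]. Defaults to 0.6.
    prompt (str): Optional custom prompt.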


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt_mm

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt_mm.TextRelevanceFilter(vlm, threshold=0.5)
    res = op([{'image_path': '/path/to/image.jpg', 'text': 'a red square'}])
    # samples with relevance >= threshold are kept
    ```
    """
    DEFAULT_PROMPT = (
        'You are an image-text relevance judge.\n'
        'Given ONE image and ONE piece of text, you must output STRICT JSON and nothing else.\n'
        'JSON schema:\n'
        '{\n'
        '  "relevance": 0.0,  // float in [0, 1]\n'
        '  "reason": ""      // short string\n'
        '}\n'
        'Rules:\n'
        '- relevance=1 means fully relevant; relevance=0 means irrelevant.\n'
        '- Do not output markdown, code fences, or any extra words outside JSON.\n'
    )

    def __init__(self, vlm, image_key='image_path', text_key='text', threshold=0.6,
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if vlm is None:
            raise ValueError('TextRelevanceFilter requires vlm (vision-language model).')
        self.image_key = image_key
        self.text_key = text_key
        self.threshold = threshold
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._judge = vlm.share().prompt(self.prompt).formatter(JsonFormatter())

    def _calc_relevance(self, image_path, text):
        if not text or not image_path or not os.path.exists(image_path):
            return 0.0
        try:
            out = self._judge(encode_query_with_filepaths(text, [image_path]))
            v = out.get('relevance', 0.0) if isinstance(out, dict) else 0.0
            v = max(0.0, min(1.0, float(v))) if isinstance(v, (int, float)) else 0.0
            return v
        except Exception as e:
            LOG.warning(f'VLM relevance failed: {e}')
            return 0.0

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        text = data.get(self.text_key, '')
        if not paths or not text:
            return []
        try:
            scores = [self._calc_relevance(p, text) for p in paths]
            mean_relevance = sum(scores) / len(scores) if scores else 0.0
            if mean_relevance < self.threshold:
                return []
            valid_paths = [p for p, s in zip(paths, scores) if s >= self.threshold]
            if not valid_paths:
                return []
            data[self.image_key] = valid_paths
            data['image_text_relevance'] = mean_relevance
            return data
        except Exception as e:
            LOG.warning(f'Failed to calculate image-text relevance: {e}')
            return []
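
The keep/drop decision in `forward` above combines a mean over all images with a per-image cut. A minimal sketch of that rule, assuming the per-image scores are already available as floats (`filter_by_relevance` is a hypothetical name, and no VLM is called here):

```python
def filter_by_relevance(paths, scores, threshold=0.6):
    """Return (kept_paths, mean_score), or None when the sample should be dropped."""
    if not paths or len(paths) != len(scores):
        return None
    mean_score = sum(scores) / len(scores)
    if mean_score < threshold:
        return None  # the sample as a whole is not relevant enough
    # Keep only the individual images that meet the threshold themselves.
    kept = [p for p, s in zip(paths, scores) if s >= threshold]
    if not kept:
        return None
    return kept, mean_score


kept_result = filter_by_relevance(['a.jpg', 'b.jpg'], [1.0, 0.5], threshold=0.6)
dropped_result = filter_by_relevance(['a.jpg', 'b.jpg'], [1.0, 0.0], threshold=0.6)
print(kept_result)     # (['a.jpg'], 0.75)
print(dropped_result)  # None
```

In the first call the mean (0.75) passes, but `b.jpg` is still removed from the record because its own score is below the threshold.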

VQAGenerator

Bases: PT_MM

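Uses a VLM to generate visual question answering (VQA) pairs from a context and its image(s). Samples without an image path or without a context are dropped.

Parameters:

  • vlm

    Vision-language model instance. Passing None raises ValueError.

  • image_key (str, default: 'image_path' ) –

    Name of the field holding the image path. Defaults to 'image_path'.

  • context_key (str, default: 'context' ) –

    Name of the field holding the context. Defaults to 'context'.

  • num_qa (int, default: 5 ) –

    Number of Q&A pairs to request from the model. Defaults to 5.

  • prompt (str, default: None ) –

    Optional custom prompt; the built-in DEFAULT_PROMPT is used when omitted.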

Examples:

import lazyllm
from lazyllm.tools.data import pt_mm

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.VQAGenerator(vlm, num_qa=3)
res = op([{'image_path': '/path/to/image.jpg', 'context': 'A simple image.'}])
# res[0]['qa_pairs'] contains [{'query': '...', 'answer': '...'}, ...]
Source code in lazyllm/tools/data/operators/pt_op.py
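class VQAGenerator(PT_MM):
    """Uses a VLM to generate visual question answering (VQA) pairs from a context and its image(s).

Args:
    vlm: Vision-language model instance.
    image_key (str): Name of the field holding the image path. Defaults to 'image_path'.
    context_key (str): Name of the field holding the context. Defaults to 'context'.
    num_qa (int): Number of Q&A pairs to request from the model. Defaults to 5.
    prompt (str): Optional custom prompt.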


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt_mm

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt_mm.VQAGenerator(vlm, num_qa=3)
    res = op([{'image_path': '/path/to/image.jpg', 'context': 'A simple image.'}])
    # res[0]['qa_pairs'] contains [{'query': '...', 'answer': '...'}, ...]
    ```
    """
    DEFAULT_PROMPT = (
        'Generate Visual Question Answering (VQA) pairs from the given context and image(s). '
        'Output JSON only. Do not output any other irrelevant content.\n'
        '{\n'
        '  "qa_pairs": [\n'
        '    {"query": "", "answer": ""}\n'
        '  ]\n'
        '}\n'
        'Each item in qa_pairs has query (question) and answer. '
        'All questions should be answerable from the context and image.'
    )

    def __init__(self, vlm, image_key='image_path', context_key='context', num_qa=5,
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if vlm is None:
            raise ValueError('VQAGenerator requires vlm (vision-language model).')
        self.image_key = image_key
        self.context_key = context_key
        self.num_qa = num_qa
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._generator = vlm.share().prompt(self.prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        context = data.get(self.context_key, '')
        if not paths or not context:
            return []
        try:
            query = f'Context: {context}\n\nGenerate {self.num_qa} QA pairs based on the context and image(s).'
            out = self._generator(encode_query_with_filepaths(query, paths))
            if not isinstance(out, dict):
                data['qa_pairs'] = []
                return data
            raw = out.get('qa_pairs', [])
            if not isinstance(raw, list):
                data['qa_pairs'] = []
                return data
            qa_pairs = []
            for item in raw:
                if isinstance(item, dict) and 'query' in item and 'answer' in item:
                    qa_pairs.append({'query': str(item['query']), 'answer': str(item['answer'])})
                elif isinstance(item, dict) and 'question' in item:
                    qa_pairs.append({
                        'query': str(item.get('question', item.get('query', ''))),
                        'answer': str(item.get('answer', item.get('ans', ''))),
                    })
            data['qa_pairs'] = qa_pairs
            return data
        except Exception as e:
            LOG.warning(f'VQA generation failed: {e}')
            return []

VQAScorer

Bases: PT_MM

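Uses a VLM to rate the quality of a VQA pair (query, answer, image_path). Only the first image path of a record is scored.

Parameters:

  • vlm

    Vision-language model instance. Passing None raises ValueError.

  • image_key (str, default: 'image_path' ) –

    Name of the field holding the image path. Defaults to 'image_path'.

  • query_key (str, default: 'query' ) –

    Name of the field holding the question. Defaults to 'query'.

  • answer_key (str, default: 'answer' ) –

    Name of the field holding the answer. Defaults to 'answer'.

  • prompt (str, default: None ) –

    Optional custom prompt; the built-in DEFAULT_PROMPT is used when omitted.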

Examples:

import lazyllm
from lazyllm.tools.data import pt_mm

vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
op = pt_mm.VQAScorer(vlm)
res = op([{
    'image_path': '/path/to/image.jpg',
    'query': 'What color is it?',
    'answer': 'Red',
}])
# res[0]['quality_score'] contains score, relevance, correctness, reason
Source code in lazyllm/tools/data/operators/pt_op.py
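class VQAScorer(PT_MM):
    """Uses a VLM to rate the quality of a VQA pair (query, answer, image_path).

Args:
    vlm: Vision-language model instance.
    image_key (str): Name of the field holding the image path. Defaults to 'image_path'.
    query_key (str): Name of the field holding the question. Defaults to 'query'.
    answer_key (str): Name of the field holding the answer. Defaults to 'answer'.
    prompt (str): Optional custom prompt.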


Examples:
    ```python
    import lazyllm
    from lazyllm.tools.data import pt_mm

    vlm = lazyllm.OnlineChatModule(source='sensenova', model='SenseNova-V6-5-Turbo')
    op = pt_mm.VQAScorer(vlm)
    res = op([{
        'image_path': '/path/to/image.jpg',
        'query': 'What color is it?',
        'answer': 'Red',
    }])
    # res[0]['quality_score'] contains score, relevance, correctness, reason
    ```
    """
    DEFAULT_PROMPT = (
        'Given an image and a VQA pair (query, answer), rate the quality of this VQA. '
        'Output JSON only. Do not output any other irrelevant content.\n'
        '{\n'
        '  "score": 0.0,\n'
        '  "relevance": 0.0,\n'
        '  "correctness": 0.0,\n'
        '  "reason": ""\n'
        '}\n'
        'score: overall VQA quality [0, 1]; relevance: answer relevance to query [0, 1]; '
        'correctness: answer correctness given the image [0, 1]. All floats.'
    )

    def __init__(self, vlm, image_key='image_path', query_key='query', answer_key='answer',
                 prompt: Optional[str] = None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        if vlm is None:
            raise ValueError('VQAScorer requires vlm (vision-language model).')
        self.image_key = image_key
        self.query_key = query_key
        self.answer_key = answer_key
        self.prompt = prompt or self.DEFAULT_PROMPT
        self._scorer = vlm.share().prompt(self.prompt).formatter(JsonFormatter())

    def _clamp_score(self, v):
        try:
            return max(0.0, min(1.0, float(v)))
        except (TypeError, ValueError):
            return 0.0

    def _calc_vqa_quality(self, query, answer, image_path):
        if not query or not answer:
            return 0.0, {}
        if not image_path or not os.path.exists(image_path):
            return 0.0, {}
        try:
            eval_query = (
                f'Query: {query}\nAnswer: {answer}\n\n'
                'Rate the quality of this VQA pair given the image. How relevant and correct is the answer?'
            )
            out = self._scorer(encode_query_with_filepaths(eval_query, [image_path]))
            if not isinstance(out, dict):
                return 0.0, {}
            score = self._clamp_score(out.get('score', out.get('overall', 0.0)))
            return score, out
        except Exception as e:
            LOG.warning(f'VLM VQA scoring failed: {e}')
            return 0.0, {}

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        paths = _normalize_image_paths(data.get(self.image_key, ''))
        query = data.get(self.query_key, '')
        answer = data.get(self.answer_key, '')
        if not paths or not query or not answer:
            return []
        try:
            image_path = paths[0]
            _, out = self._calc_vqa_quality(query, answer, image_path)
            data['quality_score'] = {
                'score': self._clamp_score(out.get('score', out.get('overall', 0.0))),
                'relevance': self._clamp_score(out.get('relevance', 0.0)),
                'correctness': self._clamp_score(out.get('correctness', 0.0)),
                'reason': str(out.get('reason', '')),
            }
            return data
        except Exception as e:
            LOG.warning(f'Failed to score VQA quality: {e}')
            return []
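
The scorer never trusts the numeric fields returned by the model: every value is coerced to a float and clamped to [0, 1], with 0.0 as the fallback. A standalone sketch of that step, mirroring `_clamp_score` and the `quality_score` dict above (`build_quality_score` is a hypothetical helper name):

```python
def clamp_score(value):
    """Coerce a value to a float in [0, 1]; fall back to 0.0 if it is not numeric."""
    try:
        return max(0.0, min(1.0, float(value)))
    except (TypeError, ValueError):
        return 0.0


def build_quality_score(out):
    """Build the quality_score dict from parsed model output."""
    out = out if isinstance(out, dict) else {}
    return {
        # Accept 'overall' as an alternative key for the main score.
        'score': clamp_score(out.get('score', out.get('overall', 0.0))),
        'relevance': clamp_score(out.get('relevance', 0.0)),
        'correctness': clamp_score(out.get('correctness', 0.0)),
        'reason': str(out.get('reason', '')),
    }


quality = build_quality_score({'overall': 2, 'relevance': '0.5', 'reason': 'ok'})
print(quality)
# {'score': 1.0, 'relevance': 0.5, 'correctness': 0.0, 'reason': 'ok'}
```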

integrity_check(data, image_key='image_path', input_key=None)

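Checks image file integrity: filters out corrupted or empty files and keeps only the paths of images that open correctly. Records with no valid image left are dropped.

Parameters:

  • data (dict) –

    A single data record.

  • image_key (str, default: 'image_path' ) –

    Name of the field holding the image path. Defaults to 'image_path'.

  • input_key (str, default: None ) –

    Optional; overrides image_key when set.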

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.integrity_check()
res = op([{'image_path': '/path/to/image.jpg'}, {'image_path': '/nonexistent.png'}])
# only valid images retained
Source code in lazyllm/tools/data/operators/pt_op.py
@data_register('data.pt_mm', rewrite_func='forward', _concurrency_mode='thread')
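def integrity_check(data, image_key='image_path', input_key=None):
    """Checks image file integrity: filters out corrupted or empty files and keeps only the paths of images that open correctly.

Args:
    data (dict): A single data record.
    image_key (str): Name of the field holding the image path. Defaults to 'image_path'.
    input_key (str): Optional; overrides image_key when set.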


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.integrity_check()
    res = op([{'image_path': '/path/to/image.jpg'}, {'image_path': '/nonexistent.png'}])
    # only valid images retained
    ```
    """
    assert isinstance(data, dict)
    if input_key:
        image_key = input_key
    paths = _normalize_image_paths(data.get(image_key, ''))
    if not paths:
        return []
    valid_paths = []
    try:
        for image_path in paths:
            if not os.path.exists(image_path):
                LOG.warning(f'Image path not found: {image_path}')
                continue
            try:
                with PIL.Image.open(image_path) as img:
                    img.verify()
                if os.path.getsize(image_path) == 0:
                    continue
                valid_paths.append(image_path)
            except Exception as e:
                LOG.warning(f'Failed to check file integrity for {image_path}: {e}')
                continue
        if not valid_paths:
            return []
        data[image_key] = valid_paths
        return data
    except Exception as e:
        LOG.warning(f'Failed to check file integrity: {e}')
        return []

resolution_filter(data, image_key='image_path', min_width=256, min_height=256, max_width=4096, max_height=4096, input_key=None)

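Filters images by minimum and maximum width and height, keeping only the paths of images whose dimensions fall within the given range (bounds inclusive).

Parameters:

  • data (dict) –

    A single data record.

  • image_key (str, default: 'image_path' ) –

    Name of the field holding the image path. Defaults to 'image_path'.

  • min_width (int, default: 256 ) –

    Minimum width in pixels. Defaults to 256.

  • min_height (int, default: 256 ) –

    Minimum height in pixels. Defaults to 256.

  • max_width (int, default: 4096 ) –

    Maximum width in pixels. Defaults to 4096.

  • max_height (int, default: 4096 ) –

    Maximum height in pixels. Defaults to 4096.

  • input_key (str, default: None ) –

    Optional; overrides image_key when set.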

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.resolution_filter(min_width=256, min_height=256, max_width=4096, max_height=4096)
res = op([{'image_path': '/path/to/image.jpg'}])
Source code in lazyllm/tools/data/operators/pt_op.py
@data_register('data.pt_mm', rewrite_func='forward', _concurrency_mode='thread')
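def resolution_filter(data, image_key='image_path', min_width=256, min_height=256,
                      max_width=4096, max_height=4096, input_key=None):
    """Filters images by minimum and maximum width and height, keeping only the paths of images whose dimensions fall within the given range.

Args:
    data (dict): A single data record.
    image_key (str): Name of the field holding the image path. Defaults to 'image_path'.
    min_width (int): Minimum width in pixels. Defaults to 256.
    min_height (int): Minimum height in pixels. Defaults to 256.
    max_width (int): Maximum width in pixels. Defaults to 4096.
    max_height (int): Maximum height in pixels. Defaults to 4096.
    input_key (str): Optional; overrides image_key when set.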


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.resolution_filter(min_width=256, min_height=256, max_width=4096, max_height=4096)
    res = op([{'image_path': '/path/to/image.jpg'}])
    ```
    """
    assert isinstance(data, dict)
    if input_key:
        image_key = input_key
    paths = _normalize_image_paths(data.get(image_key, ''))
    if not paths:
        return []
    valid_paths = []
    try:
        for image_path in paths:
            if not os.path.exists(image_path):
                LOG.warning(f'Image path not found or invalid: {image_path}')
                continue
            with PIL.Image.open(image_path) as img:
                width, height = img.size
                if width < min_width or height < min_height:
                    continue
                if width > max_width or height > max_height:
                    continue
                valid_paths.append(image_path)
        if not valid_paths:
            return []
        data[image_key] = valid_paths
        return data
    except Exception as e:
        LOG.warning(f'Failed to check image resolution: {e}')
        return []
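
The size test applied to each image above is an inclusive range check on both dimensions. A minimal sketch of the predicate on its own, assuming width and height have already been read from the image (`within_resolution` is a hypothetical name):

```python
def within_resolution(width, height, min_width=256, min_height=256,
                      max_width=4096, max_height=4096):
    """Return True when both dimensions lie inside the inclusive bounds."""
    return min_width <= width <= max_width and min_height <= height <= max_height


sizes = [(256, 256), (255, 300), (4097, 300), (4096, 4096)]
verdicts = [within_resolution(w, h) for w, h in sizes]
print(verdicts)  # [True, False, False, True]
```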

resolution_resize(data, image_key='image_path', max_side=1024, input_key=None, inplace=True)

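Downscales images so that the longest side does not exceed max_side, either overwriting the original file or writing a new one. Images already within the limit are left untouched.

Parameters:

  • data (dict) –

    A single data record.

  • image_key (str, default: 'image_path' ) –

    Name of the field holding the image path. Defaults to 'image_path'.

  • max_side (int, default: 1024 ) –

    Upper limit for the longest side in pixels. Defaults to 1024.

  • inplace (bool, default: True ) –

    Whether to overwrite the original file. Defaults to True; when False, a new file with the _resized suffix is written.

  • input_key (str, default: None ) –

    Optional; overrides image_key when set.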

Examples:

from lazyllm.tools.data import pt_mm

op = pt_mm.resolution_resize(max_side=400, inplace=False)
res = op([{'image_path': '/path/to/image.jpg'}])
# resized file saved as image_resized.jpg in same directory
Source code in lazyllm/tools/data/operators/pt_op.py
@data_register('data.pt_mm', rewrite_func='forward', _concurrency_mode='thread')
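def resolution_resize(data, image_key='image_path', max_side=1024, input_key=None, inplace=True):
    """Downscales images so that the longest side does not exceed max_side, either overwriting the original file or writing a new one.

Args:
    data (dict): A single data record.
    image_key (str): Name of the field holding the image path. Defaults to 'image_path'.
    max_side (int): Upper limit for the longest side in pixels. Defaults to 1024.
    inplace (bool): Whether to overwrite the original file. Defaults to True; when False, a new file with the _resized suffix is written.
    input_key (str): Optional; overrides image_key when set.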


Examples:
    ```python
    from lazyllm.tools.data import pt_mm

    op = pt_mm.resolution_resize(max_side=400, inplace=False)
    res = op([{'image_path': '/path/to/image.jpg'}])
    # resized file saved as image_resized.jpg in same directory
    ```
    """
    assert isinstance(data, dict)
    if input_key:
        image_key = input_key
    paths = _normalize_image_paths(data.get(image_key, ''))
    if not paths:
        return []
    valid_paths = []
    try:
        for image_path in paths:
            if not os.path.exists(image_path):
                LOG.warning(f'Image path not found or invalid: {image_path}')
                continue
            with PIL.Image.open(image_path) as img:
                img.load()
                w, h = img.size
                if max(w, h) <= max_side:
                    valid_paths.append(image_path)
                    continue
                scale = max_side / max(w, h)
                new_w, new_h = int(round(w * scale)), int(round(h * scale))
                if new_w < 1 or new_h < 1:
                    continue
                resample = getattr(
                    getattr(PIL.Image, 'Resampling', None), 'LANCZOS', PIL.Image.LANCZOS
                )
                out = img.resize((new_w, new_h), resample)
                if inplace:
                    save_path = image_path
                else:
                    base, ext = os.path.splitext(image_path)
                    save_path = f'{base}_resized{ext}'
                out.save(save_path, quality=95)
                valid_paths.append(save_path)
        if not valid_paths:
            return []
        data[image_key] = valid_paths
        return data
    except Exception as e:
        LOG.warning(f'Failed to resize image resolution: {e}')
        return []
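
The target size used above keeps the aspect ratio by scaling both sides with the same factor, `max_side / max(w, h)`. A sketch of just that arithmetic, with no image I/O (`target_size` is a hypothetical name; returning None for a degenerate result is this sketch's way of mirroring the listing's skip):

```python
def target_size(width, height, max_side=1024):
    """Return the (width, height) after limiting the longest side, or None if degenerate."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, nothing to do
    scale = max_side / longest
    new_w, new_h = int(round(width * scale)), int(round(height * scale))
    if new_w < 1 or new_h < 1:
        return None  # extreme aspect ratio: one side would collapse to zero
    return new_w, new_h


print(target_size(2000, 1000, max_side=400))  # (400, 200)
print(target_size(800, 600))                  # (800, 600)
print(target_size(3000, 10, max_side=100))    # None
```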

Refine operators

lazyllm.tools.data.operators.refine_op

remove_emoji(data, input_key='content')

Removes emoji characters from the specified field.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

Examples:

from lazyllm.tools.data import refine

func = refine.remove_emoji(input_key='content')
inputs = [{'content': 'Hello 😊 World 🌍!'}]
res = func(inputs)
print(res)
# [{'content': 'Hello  World !'}]
Source code in lazyllm/tools/data/operators/refine_op.py
@data_register('data.refine', rewrite_func='forward', _concurrency_mode='process')
def remove_emoji(data, input_key='content'):
    """Removes emoji characters from the specified field.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.


Examples:
    ```python
    from lazyllm.tools.data import refine

    func = refine.remove_emoji(input_key='content')
    inputs = [{'content': 'Hello 😊 World 🌍!'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Hello  World !'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key, '')
    if text:
        data[input_key] = EMOJIS.sub('', text)
    return data
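
The `EMOJIS` pattern is not shown in this listing. As a rough illustration only, the sketch below assumes a pattern that covers two common Unicode ranges; the real pattern may cover more or fewer code points, and `EMOJI_SKETCH` and `strip_emoji` are hypothetical names.

```python
import re

# Assumed pattern: main pictograph/emoticon planes plus miscellaneous symbols and dingbats.
EMOJI_SKETCH = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')


def strip_emoji(text):
    """Delete every character matched by the assumed emoji ranges."""
    return EMOJI_SKETCH.sub('', text)


cleaned = strip_emoji('Hello \U0001F60A World \U0001F30D!')
print(repr(cleaned))  # 'Hello  World !'
```

The surrounding spaces are preserved, which is why the result contains a double space; a whitespace-normalizing step can follow if that matters.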

remove_extra_spaces(data, input_key='content')

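Normalizes extra whitespace (repeated spaces, newlines, tabs) in the specified field to single spaces.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.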

Examples:

from lazyllm.tools.data import refine

func = refine.remove_extra_spaces(input_key='content')
inputs = [{'content': 'hello   world\n\n  foo\tbar'}]
res = func(inputs)
print(res)
# [{'content': 'hello world foo bar'}]
Source code in lazyllm/tools/data/operators/refine_op.py
@data_register('data.refine', rewrite_func='forward', _concurrency_mode='process')
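def remove_extra_spaces(data, input_key='content'):
    """Normalizes extra whitespace (repeated spaces, newlines, tabs) in the specified field to single spaces.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.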


Examples:
    ```python
    from lazyllm.tools.data import refine

    func = refine.remove_extra_spaces(input_key='content')
    inputs = [{'content': 'hello   world\\n\\n  foo\\tbar'}]
    res = func(inputs)
    print(res)
    # [{'content': 'hello world foo bar'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key, '')
    if text:
        text = text.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')
        data[input_key] = ' '.join(text.split())
    return data
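
As printed, the listing above does two things: it replaces literal backslash sequences (a backslash followed by n, t, or r) with spaces, and then collapses real whitespace with `split`/`join`. A standalone sketch of both steps (`normalize_spaces` is a hypothetical name):

```python
def normalize_spaces(text):
    """Collapse all whitespace runs, real or written as literal escapes, to one space."""
    # Step 1: literal two-character sequences such as backslash + 'n'.
    for literal in ('\\n', '\\t', '\\r'):
        text = text.replace(literal, ' ')
    # Step 2: str.split() with no argument splits on any whitespace run
    # and drops leading/trailing whitespace.
    return ' '.join(text.split())


real = normalize_spaces('hello   world\n\n  foo\tbar')
literal = normalize_spaces('a\\nb')
print(real)     # hello world foo bar
print(literal)  # a b
```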

remove_html_entity(data, input_key='content')

Removes HTML entities (such as &nbsp;, &lt;, and &amp;) from the specified field.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

Examples:

from lazyllm.tools.data import refine

func = refine.remove_html_entity(input_key='content')
inputs = [{'content': 'Hello&nbsp;World &amp; &lt;tag&gt;'}]
res = func(inputs)
print(res)
# [{'content': 'HelloWorld  tag'}]
Source code in lazyllm/tools/data/operators/refine_op.py
@data_register('data.refine', rewrite_func='forward', _concurrency_mode='process')
def remove_html_entity(data, input_key='content'):
    """Removes HTML entities (such as &nbsp;, &lt;, and &amp;) from the specified field.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.


Examples:
    ```python
    from lazyllm.tools.data import refine

    func = refine.remove_html_entity(input_key='content')
    inputs = [{'content': 'Hello&nbsp;World &amp; &lt;tag&gt;'}]
    res = func(inputs)
    print(res)
    # [{'content': 'HelloWorld  tag'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key, '')
    if text:
        data[input_key] = HTML_ENTITY_PATTERN.sub('', text)
    return data
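
`HTML_ENTITY_PATTERN` is defined outside this listing. The sketch below assumes a pattern that matches named, decimal, and hexadecimal entities; the actual pattern may differ, and `ENTITY_SKETCH` and `strip_entities` are hypothetical names.

```python
import re

# Assumed pattern: &name; or &#123; or &#x1F; (the trailing semicolon is required).
ENTITY_SKETCH = re.compile(r'&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);')


def strip_entities(text):
    """Delete HTML entities instead of decoding them."""
    return ENTITY_SKETCH.sub('', text)


stripped = strip_entities('Hello&nbsp;World &amp; &lt;tag&gt;')
print(repr(stripped))  # 'HelloWorld  tag'
```

Note that the entities are deleted, not decoded: `&nbsp;` does not become a space, so `Hello` and `World` end up joined. Where decoding is wanted instead, the standard library's `html.unescape` does that.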

remove_html_url(data, input_key='content')

Removes HTTP/HTTPS links and HTML tags from the specified field.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

Examples:

from lazyllm.tools.data import refine

func = refine.remove_html_url(input_key='content')
inputs = [{'content': 'Check https://example.com and <b>bold</b>'}]
res = func(inputs)
print(res)
# [{'content': 'Check  and bold'}]
Source code in lazyllm/tools/data/operators/refine_op.py
@data_register('data.refine', rewrite_func='forward', _concurrency_mode='process')
def remove_html_url(data, input_key='content'):
    """Removes HTTP/HTTPS links and HTML tags from the specified field.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.


Examples:
    ```python
    from lazyllm.tools.data import refine

    func = refine.remove_html_url(input_key='content')
    inputs = [{'content': 'Check https://example.com and <b>bold</b>'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Check  and bold'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key, '')
    if text:
        text = URL_PATTERN.sub('', text)
        text = HTML_PATTERN.sub('', text)
        data[input_key] = text
    return data
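
`URL_PATTERN` and `HTML_PATTERN` are likewise not shown here. The following sketch assumes two simple patterns, a URL running to the next whitespace and a tag delimited by angle brackets; both are illustrative guesses, and the names ending in `_SKETCH` are hypothetical.

```python
import re

URL_SKETCH = re.compile(r'https?://\S+')  # assumed: URL ends at the next whitespace
TAG_SKETCH = re.compile(r'<[^>]+>')       # assumed: anything between < and >


def strip_urls_and_tags(text):
    """Remove links first, then tags, leaving the text between tags in place."""
    text = URL_SKETCH.sub('', text)
    return TAG_SKETCH.sub('', text)


result = strip_urls_and_tags('Check https://example.com and <b>bold</b>')
print(repr(result))  # 'Check  and bold'
```

Only the tags themselves are removed, so the inner text (`bold`) survives.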

Filter operators

lazyllm.tools.data.operators.filter_op

CapitalWordFilter

Bases: Filter

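Filters out text in which the proportion of all-uppercase words is too high. Records are kept when the ratio is at most max_ratio; empty or non-string text is dropped.

Parameters:

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • max_ratio (float, default: 0.5 ) –

    Maximum allowed proportion of all-uppercase words. Defaults to 0.5.

  • use_tokenizer (bool, default: False ) –

    Whether to tokenize with NLTK instead of splitting on whitespace. Defaults to False.

  • _concurrency_mode (str, default: 'thread' ) –

    Optional concurrency mode.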

Examples:

from lazyllm.tools.data import filter

func = filter.CapitalWordFilter(input_key='content', max_ratio=0.5)
inputs = [{'content': 'Normal text with Some Capitals'}, {'content': 'MOSTLY UPPERCASE'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text with Some Capitals'}]
Source code in lazyllm/tools/data/operators/filter_op.py
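class CapitalWordFilter(Filter):
    """Filters out text in which the proportion of all-uppercase words is too high.

Args:
    input_key (str): Name of the text field. Defaults to 'content'.
    max_ratio (float): Maximum allowed proportion of all-uppercase words. Defaults to 0.5.
    use_tokenizer (bool): Whether to tokenize with NLTK instead of splitting on whitespace. Defaults to False.
    _concurrency_mode (str): Optional concurrency mode.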


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.CapitalWordFilter(input_key='content', max_ratio=0.5)
    inputs = [{'content': 'Normal text with Some Capitals'}, {'content': 'MOSTLY UPPERCASE'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal text with Some Capitals'}]
    ```
    """
    def __init__(self, input_key='content', max_ratio=0.5, use_tokenizer=False,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.max_ratio = max_ratio
        self.use_tokenizer = use_tokenizer

        if self.use_tokenizer:
            nltk_data_dir = _setup_nltk_data_dir()
            try:
                nltk.data.find('tokenizers/punkt_tab')
            except LookupError:
                LOG.info('Downloading NLTK punkt_tab tokenizer...')
                nltk.download('punkt_tab', quiet=True, download_dir=nltk_data_dir)

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return []

        if self.use_tokenizer:
            words = nltk.word_tokenize(text)
        else:
            words = text.split()

        num_words = len(words)
        if num_words == 0:
            return []

        num_caps_words = sum(1 for word in words if word.isupper())
        ratio = num_caps_words / num_words

        if ratio <= self.max_ratio:
            return data
        else:
            return []

MinHashDeduplicator

Bases: Filter

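Removes near-duplicate texts using MinHash LSH; within a batch, the first occurrence of each text is kept.

Parameters:

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • threshold (float, default: 0.85 ) –

    Similarity threshold. Defaults to 0.85.

  • num_perm (int, default: 128 ) –

    Number of MinHash permutations. Defaults to 128.

  • use_n_gram (bool, default: True ) –

    Whether to hash character n-grams rather than single characters. Defaults to True.

  • ngram (int, default: 5 ) –

    n-gram length. Defaults to 5.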

Examples:

from lazyllm.tools.data import filter

func = filter.MinHashDeduplicator(input_key='content', threshold=0.85)
inputs = [{'uid': '0', 'content': '这是第一段不同的内容。'}, {'uid': '1', 'content': '这是第一段不同的内容。'}]
res = func(inputs)
print(res)
# [{'uid': '0', 'content': '这是第一段不同的内容。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
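class MinHashDeduplicator(Filter):
    """Removes near-duplicate texts using MinHash LSH. Within a batch, the first occurrence of each text is kept.

Args:
    input_key (str): Name of the text field. Defaults to 'content'.
    threshold (float): Jaccard similarity threshold. Defaults to 0.85.
    num_perm (int): Number of MinHash permutations. Defaults to 128.
    use_n_gram (bool): Whether to hash character n-grams; if False, single characters are hashed. Defaults to True.
    ngram (int): n-gram length. Defaults to 5.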


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.MinHashDeduplicator(input_key='content', threshold=0.85)
    inputs = [{'uid': '0', 'content': '这是第一段不同的内容。'}, {'uid': '1', 'content': '这是第一段不同的内容。'}]
    res = func(inputs)
    print(res)
    # [{'uid': '0', 'content': '这是第一段不同的内容。'}]
    ```
    """
    __reg_overwrite__ = 'forward_batch_input'

    def __init__(self, input_key='content', threshold=0.85, num_perm=128, use_n_gram=True, ngram=5, **kwargs):
        super().__init__(**kwargs)
        self.input_key = input_key
        self.threshold = threshold
        self.num_perm = num_perm
        self.use_n_gram = use_n_gram
        self.ngram = ngram
        self.lsh = datasketch.MinHashLSH(threshold=self.threshold, num_perm=self.num_perm)
        self._item_counter = 0
        self._minhash_map = {}

    def _create_minhash(self, text):
        minhash = datasketch.MinHash(num_perm=self.num_perm)
        if self.use_n_gram:
            for i in range(len(text) - self.ngram + 1):
                minhash.update(text[i:i + self.ngram].encode('utf8'))
        else:
            for char in text:
                minhash.update(char.encode('utf8'))
        return minhash

    def forward_batch_input(self, data, **kwargs):
        assert isinstance(data, list)

        kept_items = []
        for item in data:
            if not isinstance(item, dict) or self.input_key not in item:
                continue

            text = item[self.input_key]
            if not isinstance(text, str) or not text.strip():
                continue

            minhash = self._create_minhash(text)
            result = self.lsh.query(minhash)

            if len(result) == 0:
                self.lsh.insert(self._item_counter, minhash)
                self._minhash_map[self._item_counter] = minhash
                self._item_counter += 1
                kept_items.append(item)

        return kept_items
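
MinHash estimates the Jaccard similarity between two sets of shingles. As a rough illustration of what the operator approximates, the sketch below computes the exact Jaccard similarity over character n-grams and applies the same keep-first policy. It deliberately swaps MinHash LSH for a brute-force comparison and does not use datasketch; all function names are hypothetical.

```python
def char_ngrams(text, n=5):
    """Set of character n-grams, the same shingles the operator hashes."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity; MinHash only estimates this value."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_keep_first(texts, threshold=0.85, n=5):
    """Keep a text only if no previously kept text is at least `threshold` similar."""
    kept, kept_shingles = [], []
    for text in texts:
        shingles = char_ngrams(text, n)
        if all(jaccard(shingles, s) < threshold for s in kept_shingles):
            kept.append(text)
            kept_shingles.append(shingles)
    return kept


docs = ['这是第一段不同的内容。', '这是第一段不同的内容。', '完全无关的另一段文字。']
print(dedup_keep_first(docs))
```

The second entry is an exact copy of the first and is dropped, while the third shares no 5-gram with it and is kept. The brute-force loop is quadratic in the number of kept texts, which is the cost an LSH index is meant to avoid, and an LSH query is probabilistic, so borderline pairs may be treated differently by the real operator.
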

StopWordFilter

Bases: Filter

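Filters out texts whose stop-word ratio is too high (for example, low-value content made up almost entirely of 「的了呢」).

Parameters:

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • max_ratio (float, default: 0.5 ) –

    Maximum stop-word ratio; a text is filtered out once its ratio reaches this value. Defaults to 0.5.

  • use_tokenizer (bool, default: True ) –

    Whether to tokenize the text (jieba for Chinese, NLTK for English). Defaults to True.

  • language (str, default: 'zh' ) –

    Language, 'zh' or 'en'. Defaults to 'zh'.

  • _concurrency_mode (str, default: 'thread' ) –

    Optional. Concurrency mode.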

Examples:

from lazyllm.tools.data import filter

func = filter.StopWordFilter(input_key='content', max_ratio=0.5, language='zh')
inputs = [{'content': '这是一段包含实际内容的正常文本。'}, {'content': '的了吗呢吧啊'}]
res = func(inputs)
print(res)
# [{'content': '这是一段包含实际内容的正常文本。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
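class StopWordFilter(Filter):
    """Filters out texts whose stop-word ratio is too high (for example, low-value content made up almost entirely of 「的了呢」).

Args:
    input_key (str): Name of the text field. Defaults to 'content'.
    max_ratio (float): Maximum stop-word ratio; a text is filtered out once its ratio reaches this value. Defaults to 0.5.
    use_tokenizer (bool): Whether to tokenize the text (jieba for Chinese, NLTK for English). Defaults to True.
    language (str): Language, 'zh' or 'en'. Defaults to 'zh'.
    _concurrency_mode (str): Optional. Concurrency mode.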


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.StopWordFilter(input_key='content', max_ratio=0.5, language='zh')
    inputs = [{'content': '这是一段包含实际内容的正常文本。'}, {'content': '的了吗呢吧啊'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段包含实际内容的正常文本。'}]
    ```
    """
    def __init__(self, input_key='content', max_ratio=0.5, use_tokenizer=True, language='zh',
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.max_ratio = max_ratio
        self.use_tokenizer = use_tokenizer
        self.language = language.lower()

        nltk_data_dir = _setup_nltk_data_dir()
        try:
            nltk.data.find('corpora/stopwords')
        except LookupError:
            LOG.info('Downloading NLTK stopwords...')
            nltk.download('stopwords', quiet=True, download_dir=nltk_data_dir)

        if self.language in ['en', 'english']:
            self.stopwords = set(nltk.corpus.stopwords.words('english'))
        elif self.language in ['zh', 'cn', 'chinese']:
            self.stopwords = set(nltk.corpus.stopwords.words('chinese'))
        else:
            LOG.warning(f'Unsupported language: {self.language}, using English stopwords')
            self.stopwords = set(nltk.corpus.stopwords.words('english'))

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return []

        if self.language in ['zh', 'cn', 'chinese']:
            if self.use_tokenizer:
                words = list(jieba.cut(text.lower()))
            else:
                words = list(text)
        elif self.language in ['en', 'english']:
            if self.use_tokenizer:
                words = nltk.word_tokenize(text.lower())
            else:
                words = text.lower().split()
        else:
            words = text.lower().split()

        num_words = len(words)
        if num_words == 0:
            return []

        num_stop_words = sum(1 for w in words if w in self.stopwords)
        ratio = num_stop_words / num_words

        if ratio < self.max_ratio:
            return data
        else:
            return []
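
The ratio check can be reproduced with the standard library alone. The sketch below follows the Chinese branch with `use_tokenizer=False`, where the text is split into single characters. The tiny `STOPWORDS` set is a hypothetical stand-in for the NLTK list the operator loads, and the function names are illustrative only.

```python
# Hypothetical mini stop-word list; the operator loads NLTK's list instead.
STOPWORDS = {'的', '了', '吗', '呢', '吧', '啊'}


def stopword_ratio(text, stopwords=STOPWORDS):
    """Character-level ratio, as in the 'zh' branch with use_tokenizer=False."""
    chars = list(text)
    if not chars:
        return None
    return sum(1 for c in chars if c in stopwords) / len(chars)


def keep_by_stopwords(text, max_ratio=0.5):
    """Keep only when the ratio is strictly below max_ratio."""
    ratio = stopword_ratio(text)
    return ratio is not None and ratio < max_ratio


print(keep_by_stopwords('这是一段包含实际内容的正常文本。'))  # True
print(keep_by_stopwords('的了吗呢吧啊'))                      # False
```

Unlike `CapitalWordFilter`, the comparison here is strict, so a text sitting exactly at `max_ratio` is dropped.
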

SymbolRatioFilter

Bases: Filter

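Filters out texts in which the specified symbols (such as #, ..., …) occur too often relative to the number of words.

Parameters:

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • max_ratio (float, default: 0.3 ) –

    Maximum ratio of symbol occurrences to word count; a text is filtered out once its ratio reaches this value. Defaults to 0.3.

  • symbols (list | None, default: None ) –

    List of symbols to count. Defaults to ['#', '...', '…'].

  • _concurrency_mode (str, default: 'process' ) –

    Optional. Concurrency mode.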

Examples:

from lazyllm.tools.data import filter

func = filter.SymbolRatioFilter(input_key='content', max_ratio=0.3)
inputs = [{'content': 'Normal text without symbols'}, {'content': '### ... … ###'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text without symbols'}]
Source code in lazyllm/tools/data/operators/filter_op.py
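class SymbolRatioFilter(Filter):
    """Filters out texts in which the specified symbols (such as #, ..., …) occur too often relative to the number of words.

Args:
    input_key (str): Name of the text field. Defaults to 'content'.
    max_ratio (float): Maximum ratio of symbol occurrences to word count; a text is filtered out once its ratio reaches this value. Defaults to 0.3.
    symbols (list|None): List of symbols to count. Defaults to ['#', '...', '…'].
    _concurrency_mode (str): Optional. Concurrency mode.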


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.SymbolRatioFilter(input_key='content', max_ratio=0.3)
    inputs = [{'content': 'Normal text without symbols'}, {'content': '### ... … ###'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal text without symbols'}]
    ```
    """
    def __init__(self, input_key='content', max_ratio=0.3, symbols=None, _concurrency_mode='process', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.max_ratio = max_ratio
        self.symbols = symbols or ['#', '...', '…']
        self.tokenizer = nltk.WordPunctTokenizer()

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return []

        tokens = self.tokenizer.tokenize(text)
        word_tokens = [t for t in tokens if t not in self.symbols]
        num_words = len(word_tokens)
        if num_words == 0:
            return []

        num_symbols = sum(text.count(symbol) for symbol in self.symbols)
        ratio = num_symbols / num_words
        if ratio < self.max_ratio:
            return data
        else:
            return []
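
The numerator and denominator of this ratio are computed differently, which is easy to miss: symbols are counted as substrings of the raw text, while words are the tokens that are not exactly equal to a symbol. The sketch below shows this with the standard library, using a regular expression as a stand-in for `nltk.WordPunctTokenizer` (an assumption made to avoid the NLTK dependency); `symbol_ratio` is a hypothetical helper.

```python
import re

SYMBOLS = ['#', '...', '…']  # the operator's default symbol list


def symbol_ratio(text, symbols=SYMBOLS):
    """Symbol occurrences divided by the number of non-symbol tokens."""
    # Stand-in for nltk.WordPunctTokenizer: runs of word chars or of punctuation.
    tokens = re.findall(r'\w+|[^\w\s]+', text)
    word_tokens = [t for t in tokens if t not in symbols]
    if not word_tokens:
        return None  # nothing to divide by: the operator drops the record
    num_symbols = sum(text.count(s) for s in symbols)
    return num_symbols / len(word_tokens)


print(symbol_ratio('Normal text without symbols'))  # 0.0
print(symbol_ratio('### ... … ###'))                # 4.0
```

In the second input the token `###` is not an exact match for `#`, so it counts as a word, while each of its three characters still counts as a symbol occurrence. That gives 8 symbol hits over 2 word tokens.
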

TargetLanguageFilter

Bases: Filter

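Identifies the language of each text with FastText and keeps only texts in the specified language(s).

Parameters:

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • target_language (str | list, default: 'zho_Hans' ) –

    Target language code or list of codes, such as 'zho_Hans' or 'eng_Latn'.

  • threshold (float, default: 0.6 ) –

    Confidence threshold. Defaults to 0.6.

  • model_path (str | None, default: None ) –

    Path to the FastText model. If None, the model is looked up under the model cache directory and downloaded when missing.

  • _concurrency_mode (str, default: 'thread' ) –

    Optional. Concurrency mode.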

Examples:

from lazyllm.tools.data import filter

func = filter.TargetLanguageFilter(input_key='content', target_language='zho_Hans', threshold=0.3)
inputs = [{'content': '这是一段中文文本。'}, {'content': 'This is English.'}]
res = func(inputs)
print(res)
# [{'content': '这是一段中文文本。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
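class TargetLanguageFilter(Filter):
    """Identifies the language of each text with FastText and keeps only texts in the specified language(s).

Args:
    input_key (str): Name of the text field. Defaults to 'content'.
    target_language (str|list): Target language code or list of codes, such as 'zho_Hans' or 'eng_Latn'.
    threshold (float): Confidence threshold. Defaults to 0.6.
    model_path (str|None): Path to the FastText model. If None, the model is looked up under the model cache directory and downloaded when missing.
    _concurrency_mode (str): Optional. Concurrency mode.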


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.TargetLanguageFilter(input_key='content', target_language='zho_Hans', threshold=0.3)
    inputs = [{'content': '这是一段中文文本。'}, {'content': 'This is English.'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段中文文本。'}]
    ```
    """
    COMMON_LANGUAGES = {
        'zho_Hans', 'zho_Hant', 'eng_Latn', 'spa_Latn', 'fra_Latn',
        'deu_Latn', 'jpn', 'kor', 'rus_Cyrl', 'ara', 'por_Latn',
        'ita_Latn', 'nld_Latn', 'pol_Latn', 'tur_Latn', 'vie',
        'tha', 'hin', 'ind_Latn', 'msa_Latn'
    }

    def __init__(self, input_key='content', target_language='zho_Hans', threshold=0.6, model_path=None,
                 _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        if isinstance(target_language, str):
            self.allowed_languages = {target_language}
        else:
            self.allowed_languages = set(target_language)
        self.threshold = threshold
        if model_path is None:
            try:
                default_cache_dir = config['model_cache_dir']
            except (KeyError, TypeError):
                default_cache_dir = os.path.join(os.path.expanduser('~'), '.lazyllm', 'models')
            model_path = os.path.join(default_cache_dir, 'fasttext-language-identification', 'model.bin')
        self.model_path = model_path
        self._validate_languages()
        self.model = self._load_model()

    def _validate_languages(self):
        invalid_langs = self.allowed_languages - self.COMMON_LANGUAGES
        if invalid_langs:
            LOG.warning(
                f'TargetLanguageFilter: Invalid language codes: {invalid_langs}\n'
                f'Common language codes:\n'
                f'  - zho_Hans (Simplified Chinese), zho_Hant (Traditional Chinese)\n'
                f'  - eng_Latn (English)\n'
                f'  - spa_Latn (Spanish), fra_Latn (French), deu_Latn (German)\n'
                f'  - jpn (Japanese), kor (Korean)\n'
                f'  - rus_Cyrl (Russian), ara (Arabic)\n'
                f'  - por_Latn (Portuguese), ita_Latn (Italian)\n'
                f'Full list: {sorted(self.COMMON_LANGUAGES)}'
            )

    def _load_model(self):
        try:
            if os.path.isfile(self.model_path):
                model_file = self.model_path
            elif os.path.isdir(self.model_path):
                model_file = os.path.join(self.model_path, 'model.bin')
            else:
                model_file = self._download_model()

            if not os.path.exists(model_file):
                raise FileNotFoundError(f'Model file not found at {model_file}')

            LOG.info(f'Loading FastText language model from {model_file}...')
            model = fasttext.load_model(model_file)
            LOG.info('FastText language model loaded successfully.')
            return model
        except Exception as e:
            LOG.error(f'Error loading FastText model: {e}')
            raise

    def _download_model(self):
        LOG.info('Downloading FastText language identification model...')
        model_repo = 'facebook/fasttext-language-identification'
        try:
            model_source = config['model_source']
        except (KeyError, TypeError):
            model_source = 'modelscope'

        if os.path.isdir(self.model_path) or self.model_path.endswith(os.sep):
            model_dir = self.model_path if os.path.isdir(self.model_path) else os.path.dirname(self.model_path)
        else:
            model_dir = os.path.dirname(self.model_path)

        os.makedirs(model_dir, exist_ok=True)
        model_manager = ModelManager(model_source=model_source)
        downloaded_path = model_manager.hub_downloader.download(model_repo, model_dir)

        if not downloaded_path:
            raise RuntimeError(f'Failed to download model: {model_repo}')
        model_file = os.path.join(downloaded_path, 'model.bin')
        if not os.path.exists(model_file):
            raise FileNotFoundError(f'Model file not found at {model_file}')

        self.model_path = model_file
        return model_file

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return []

        k = max(5, len(self.allowed_languages))
        labels, scores = self.model.predict(text.replace('\n', ' ').strip(), k=k)
        if len(labels) > 0 and len(scores) > 0:
            for label, score in zip(labels, scores):
                pred_label = label.replace('__label__', '')
                if pred_label in self.allowed_languages and score >= self.threshold:
                    return data

        return []
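
The acceptance step in `forward` is independent of the model itself, so it can be illustrated with a stubbed prediction. In the sketch below the `labels` and `scores` values are made up for illustration and only imitate the shape of a FastText `(labels, scores)` result; `accept_language` is a hypothetical helper.

```python
def accept_language(labels, scores, allowed, threshold=0.6):
    """Return True if any allowed language is predicted with enough confidence."""
    for label, score in zip(labels, scores):
        # FastText prefixes every label with '__label__'.
        code = label.replace('__label__', '')
        if code in allowed and score >= threshold:
            return True
    return False


# Made-up prediction output, shaped like a (labels, scores) pair.
labels = ('__label__zho_Hans', '__label__yue_Hant', '__label__eng_Latn')
scores = (0.71, 0.20, 0.05)

print(accept_language(labels, scores, {'zho_Hans'}))                  # True
print(accept_language(labels, scores, {'eng_Latn'}))                  # False
print(accept_language(labels, scores, {'eng_Latn'}, threshold=0.05))  # True
```

Because every candidate among the top predictions is checked, a target language does not have to be the top prediction; it only needs to clear the threshold.
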

WordBlocklistFilter

Bases: Filter

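Filters out texts that contain more sensitive or blocked words than the threshold allows, using Aho-Corasick multi-pattern matching.

Parameters:

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • blocklist (list | None, default: None ) –

    List of blocked words.

  • blocklist_path (str | None, default: None ) –

    Path to a blocklist file with one word per line. Used only when blocklist is None.

  • language (str, default: 'zh' ) –

    Language, 'zh' or 'en'; selects the built-in blocklist when neither blocklist nor blocklist_path is given. Defaults to 'zh'.

  • threshold (int, default: 1 ) –

    Maximum number of blocked-word matches allowed. Defaults to 1.

  • _concurrency_mode (str, default: 'thread' ) –

    Optional. Concurrency mode.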

Examples:

from lazyllm.tools.data import filter

func = filter.WordBlocklistFilter(input_key='content', blocklist=['敏感', '违禁'], threshold=0)
inputs = [{'content': '这是正常的文本内容。'}, {'content': '这里包含敏感词。'}]
res = func(inputs)
print(res)
# [{'content': '这是正常的文本内容。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
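class WordBlocklistFilter(Filter):
    """Filters out texts that contain more sensitive or blocked words than the threshold allows, using Aho-Corasick multi-pattern matching.

Args:
    input_key (str): Name of the text field. Defaults to 'content'.
    blocklist (list|None): List of blocked words.
    blocklist_path (str|None): Path to a blocklist file with one word per line. Used only when blocklist is None.
    language (str): Language, 'zh' or 'en'; selects the built-in blocklist when neither blocklist nor blocklist_path is given. Defaults to 'zh'.
    threshold (int): Maximum number of blocked-word matches allowed. Defaults to 1.
    _concurrency_mode (str): Optional. Concurrency mode.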


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.WordBlocklistFilter(input_key='content', blocklist=['敏感', '违禁'], threshold=0)
    inputs = [{'content': '这是正常的文本内容。'}, {'content': '这里包含敏感词。'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是正常的文本内容。'}]
    ```
    """
    def __init__(self, input_key='content', blocklist=None, blocklist_path=None,
                 language='zh', threshold=1, _concurrency_mode='thread', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.threshold = threshold
        self.language = language.lower()

        if blocklist is not None:
            words = [w.strip().lower() for w in blocklist if w and w.strip()]
        elif blocklist_path is not None:
            words = self._load_blocklist_from_file(blocklist_path)
        else:
            default_path = self._get_default_blocklist_path()
            words = self._load_blocklist_from_file(default_path)

        self._blocklist_words = words
        self._automaton = self._build_automaton(words)

        LOG.info(f'WordBlocklistFilter initialized with {len(words)} blocked words (AC automaton), '
                 f'language={self.language}')

    def _build_automaton(self, words):
        A = ahocorasick.Automaton()
        for idx, word in enumerate(words):
            A.add_word(word, (idx, word))
        A.make_automaton()
        return A

    def __getstate__(self):
        state = self.__dict__.copy()
        # automaton may not pickle well in process mode; keep words to rebuild
        state['_automaton'] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        if self._automaton is None and self._blocklist_words:
            self._automaton = self._build_automaton(self._blocklist_words)

    def _get_default_blocklist_path(self):
        current_dir = os.path.dirname(os.path.abspath(__file__))
        if self.language in ['zh', 'cn', 'chinese']:
            filename = 'zh.txt'
        elif self.language in ['en', 'english']:
            filename = 'en.txt'
        else:
            LOG.warning(f'Unsupported language: {self.language}, defaulting to zh.txt')
            filename = 'zh.txt'
        blocklist_path = os.path.join(current_dir, 'blocklist', filename)
        return blocklist_path

    def _load_blocklist_from_file(self, file_path):
        LOG.info(f'Loading blocklist from {file_path}...')
        with open(file_path, 'r', encoding='utf-8') as f:
            words = list(dict.fromkeys(line.strip().lower() for line in f if line.strip()))
        LOG.info(f'Loaded {len(words)} words from blocklist')
        return words

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)

        text = data.get(self.input_key)
        if not isinstance(text, str) or not text.strip():
            return data

        text_lower = text.lower()
        blocklist_count = sum(1 for _ in self._automaton.iter(text_lower))

        if blocklist_count <= self.threshold:
            return data
        else:
            return []
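
To show how the threshold is applied, the sketch below counts blocked-word hits with a naive substring scan. This is a swapped-in technique, not the Aho-Corasick automaton the operator builds: it should produce the same count on small inputs but scans the text once per word rather than once in total. The function names are hypothetical.

```python
def count_blocked(text, words):
    """Count every occurrence of every blocked word, overlaps included."""
    text = text.lower()
    total = 0
    for word in words:
        start = text.find(word)
        while start != -1:
            total += 1
            start = text.find(word, start + 1)
    return total


def keep_by_blocklist(text, words, threshold=1):
    """Keep when the number of hits does not exceed the threshold."""
    return count_blocked(text, words) <= threshold


blocklist = ['敏感', '违禁']
print(keep_by_blocklist('这是正常的文本内容。', blocklist, threshold=0))  # True
print(keep_by_blocklist('这里包含敏感词。', blocklist, threshold=0))      # False
```

With the default `threshold=1` a single hit is still tolerated, which is why the documented example passes `threshold=0` to reject any match.
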

bullet_point_filter(data, input_key='content', max_ratio=0.9)

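Filters out texts in which too many lines are bullet points (such as tables of contents or pure lists).

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • max_ratio (float, default: 0.9 ) –

    Maximum ratio of lines that start with a bullet character. Defaults to 0.9.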

Examples:

from lazyllm.tools.data import filter

func = filter.bullet_point_filter(input_key='content', max_ratio=0.5)
inputs = [{'content': 'Normal paragraph text'}, {'content': '- Item 1\n- Item 2\n- Item 3'}]
res = func(inputs)
print(res)
# [{'content': 'Normal paragraph text'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
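def bullet_point_filter(data, input_key='content', max_ratio=0.9):
    """Filters out texts in which too many lines are bullet points (such as tables of contents or pure lists).

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    max_ratio (float): Maximum ratio of lines that start with a bullet character. Defaults to 0.9.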


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.bullet_point_filter(input_key='content', max_ratio=0.5)
    inputs = [{'content': 'Normal paragraph text'}, {'content': '- Item 1\\n- Item 2\\n- Item 3'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal paragraph text'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    num_lines = len(lines)
    if num_lines == 0:
        return []
    num_bullet_lines = sum(
        1 for line in lines
        if any(line.startswith(bullet) for bullet in BULLET_CHARS)
    )
    ratio = num_bullet_lines / num_lines
    if ratio <= max_ratio:
        return data
    else:
        return []
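
The line-ratio computation can be tried on its own. In the sketch below `BULLETS` is a hypothetical marker set, since the module-level `BULLET_CHARS` is not shown in this section, and `bullet_line_ratio` is an illustrative helper.

```python
# Hypothetical bullet markers; the module-level BULLET_CHARS is not shown here.
BULLETS = ('-', '*', '•')


def bullet_line_ratio(text, bullets=BULLETS):
    """Fraction of non-empty lines that start with a bullet marker."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    if not lines:
        return None
    # str.startswith accepts a tuple of prefixes.
    return sum(1 for line in lines if line.startswith(bullets)) / len(lines)


print(bullet_line_ratio('Normal paragraph text'))         # 0.0
print(bullet_line_ratio('- Item 1\n- Item 2\n- Item 3'))  # 1.0
print(bullet_line_ratio('Intro\n- Item 1'))               # 0.5
```

The operator keeps a record when the ratio is less than or equal to `max_ratio`, so with `max_ratio=0.5` the last input would still pass.
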

char_count_filter(data, input_key='content', min_chars=100, max_chars=100000)

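Filters by character count after spaces, newlines, and tabs are removed, keeping texts whose length falls within [min_chars, max_chars].

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • min_chars (int, default: 100 ) –

    Minimum number of characters. Defaults to 100.

  • max_chars (int, default: 100000 ) –

    Maximum number of characters. Defaults to 100000.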

Examples:

from lazyllm.tools.data import filter

func = filter.char_count_filter(input_key='content', min_chars=10, max_chars=100)
inputs = [{'content': '短'}, {'content': '这是一段中等长度的文本内容。'}]
res = func(inputs)
print(res)
# [{'content': '这是一段中等长度的文本内容。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
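def char_count_filter(data, input_key='content', min_chars=100, max_chars=100000):
    """Filters by character count after spaces, newlines, and tabs are removed, keeping texts whose length falls within [min_chars, max_chars].

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    min_chars (int): Minimum number of characters. Defaults to 100.
    max_chars (int): Maximum number of characters. Defaults to 100000.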


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.char_count_filter(input_key='content', min_chars=10, max_chars=100)
    inputs = [{'content': '短'}, {'content': '这是一段中等长度的文本内容。'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段中等长度的文本内容。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    text_no_space = text.strip().replace(' ', '').replace('\n', '').replace('\t', '')
    num_chars = len(text_no_space)
    if min_chars <= num_chars <= max_chars:
        return data
    else:
        return []
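
A small standalone sketch of the length rule, using the same three replacements as the function above. The helper names are hypothetical.

```python
def effective_length(text):
    """Length after dropping spaces, newlines and tabs, as the operator does."""
    return len(text.strip().replace(' ', '').replace('\n', '').replace('\t', ''))


def keep_by_length(text, min_chars=100, max_chars=100000):
    """Both bounds are inclusive."""
    return min_chars <= effective_length(text) <= max_chars


print(effective_length('a b\tc\nd'))                            # 4
print(keep_by_length('短', min_chars=10, max_chars=100))         # False
print(keep_by_length('这是一段中等长度的文本内容。', 10, 100))  # True
```

Only the three listed whitespace characters are removed, so other whitespace inside the text (for example a full-width space) still counts toward the length.
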

colon_end_filter(data, input_key='content')

Filters out texts that end with a colon.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

Examples:

from lazyllm.tools.data import filter

func = filter.colon_end_filter(input_key='content')
inputs = [{'content': '这是正常结尾。'}, {'content': '这是冒号结尾:'}]
res = func(inputs)
print(res)
# [{'content': '这是正常结尾。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def colon_end_filter(data, input_key='content'):
    """Filters out texts that end with a colon.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.colon_end_filter(input_key='content')
    inputs = [{'content': '这是正常结尾。'}, {'content': '这是冒号结尾:'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是正常结尾。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return data
    # Match both the ASCII colon and the full-width colon used in Chinese text.
    if text.rstrip().endswith((':', ':')):
        return []
    else:
        return data
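
The check itself is a one-liner. The sketch below assumes that both the ASCII colon and the full-width colon common in Chinese text are meant to count; `ends_with_colon` is a hypothetical helper.

```python
def ends_with_colon(text):
    """True if the text, ignoring trailing whitespace, ends with ':' or ':'."""
    return text.rstrip().endswith((':', ':'))


print(ends_with_colon('这是正常结尾。'))    # False
print(ends_with_colon('这是冒号结尾:  '))  # True
print(ends_with_colon('Note:\n'))           # True
```

Unlike most filters in this module, `colon_end_filter` returns the record unchanged when the field is missing, empty, or not a string, so such records pass through rather than being dropped.
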

curly_bracket_filter(data, input_key='content', max_ratio=0.08)

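Filters out texts in which curly brackets {} make up too large a share of the characters.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • max_ratio (float, default: 0.08 ) –

    Maximum ratio of curly brackets to total characters; a text is filtered out once its ratio reaches this value. Defaults to 0.08.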

Examples:

from lazyllm.tools.data import filter

func = filter.curly_bracket_filter(input_key='content', max_ratio=0.08)
inputs = [{'content': 'Normal text'}, {'content': '{{{{{' * 10}]
res = func(inputs)
print(res)
# [{'content': 'Normal text'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
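def curly_bracket_filter(data, input_key='content', max_ratio=0.08):
    """Filters out texts in which curly brackets {} make up too large a share of the characters.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    max_ratio (float): Maximum ratio of curly brackets to total characters; a text is filtered out once its ratio reaches this value. Defaults to 0.08.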


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.curly_bracket_filter(input_key='content', max_ratio=0.08)
    inputs = [{'content': 'Normal text'}, {'content': '{{{{{' * 10}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal text'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    num_brackets = text.count('{') + text.count('}')
    ratio = num_brackets / len(text) if len(text) > 0 else 0
    if ratio < max_ratio:
        return data
    else:
        return []
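
A standalone sketch of the ratio, useful for choosing `max_ratio`. `curly_ratio` is a hypothetical helper that repeats the arithmetic above.

```python
def curly_ratio(text):
    """Share of characters that are '{' or '}'."""
    if not text:
        return 0
    return (text.count('{') + text.count('}')) / len(text)


print(curly_ratio('Normal text'))  # 0.0
print(curly_ratio('{{{{{' * 10))   # 1.0
print(curly_ratio('{"a": 1}'))     # 0.25
```

With the default `max_ratio=0.08`, even a short JSON fragment like the last input is filtered out, since the text is kept only when the ratio is strictly below the limit.
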

ellipsis_end_filter(data, input_key='content', max_ratio=0.3)

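Filters out texts in which too many lines end with an ellipsis (..., …, ……).

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • max_ratio (float, default: 0.3 ) –

    Maximum ratio of lines that end with an ellipsis; a text is filtered out once its ratio reaches this value. Defaults to 0.3.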

Examples:

from lazyllm.tools.data import filter

func = filter.ellipsis_end_filter(input_key='content', max_ratio=0.3)
inputs = [{'content': '第一行。\n第二行。\n第三行。'}, {'content': '第一行...\n第二行...'}]
res = func(inputs)
print(res)
# [{'content': '第一行。\n第二行。\n第三行。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
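def ellipsis_end_filter(data, input_key='content', max_ratio=0.3):
    """Filters out texts in which too many lines end with an ellipsis (..., …, ……).

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    max_ratio (float): Maximum ratio of lines that end with an ellipsis; a text is filtered out once its ratio reaches this value. Defaults to 0.3.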


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.ellipsis_end_filter(input_key='content', max_ratio=0.3)
    inputs = [{'content': '第一行。\\n第二行。\\n第三行。'}, {'content': '第一行...\\n第二行...'}]
    res = func(inputs)
    print(res)
    # [{'content': '第一行。\\n第二行。\\n第三行。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return data
    ellipsis = ['...', '…', '……']
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    num_lines = len(lines)
    if num_lines == 0:
        return data
    num_occurrences = sum(
        1 for line in lines
        if any(line.endswith(e) for e in ellipsis)
    )
    ratio = num_occurrences / num_lines
    if ratio < max_ratio:
        return data
    else:
        return []
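
The per-line ratio can be checked in isolation with the sketch below. `ellipsis_line_ratio` is a hypothetical helper that uses the same three ellipsis forms as the function above.

```python
ELLIPSES = ('...', '…', '……')


def ellipsis_line_ratio(text):
    """Fraction of non-empty lines that end with an ellipsis."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    if not lines:
        return None
    # str.endswith accepts a tuple of suffixes.
    return sum(1 for line in lines if line.endswith(ELLIPSES)) / len(lines)


print(ellipsis_line_ratio('第一行。\n第二行。\n第三行。'))  # 0.0
print(ellipsis_line_ratio('第一行...\n第二行...'))          # 1.0
print(ellipsis_line_ratio('a...\nb\nc\nd'))                  # 0.25
```

With the default `max_ratio=0.3`, the last input (one truncated line out of four) is still kept because 0.25 is below the limit.
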

idcard_filter(data, input_key='content', threshold=3)

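Filters out texts that contain too many ID-card or identity-document terms.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • threshold (int, default: 3 ) –

    Number of matched terms at which a text is filtered out (the text is dropped when matches >= threshold). Defaults to 3.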

Examples:

from lazyllm.tools.data import filter

func = filter.idcard_filter(input_key='content', threshold=1)
inputs = [{'content': '这是正常文本'}, {'content': '请提供身份证号码和ID number'}]
res = func(inputs)
print(res)
# [{'content': '这是正常文本'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
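def idcard_filter(data, input_key='content', threshold=3):
    """Filters out texts that contain too many ID-card or identity-document terms.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    threshold (int): Number of matched terms at which a text is filtered out (the text is dropped when matches >= threshold). Defaults to 3.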


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.idcard_filter(input_key='content', threshold=1)
    inputs = [{'content': '这是正常文本'}, {'content': '请提供身份证号码和ID number'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是正常文本'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    all_patterns = ID_CARD_CHINESE_TERMS + ID_CARD_ENGLISH_TERMS
    pattern = re.compile('|'.join(f'({p})' for p in all_patterns), re.I)
    matches = pattern.findall(text)
    has_too_many_id_terms = len(matches) >= threshold
    if not has_too_many_id_terms:
        return data
    else:
        return []
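
The counting step relies on a single alternation regex. The sketch below reproduces it with a hypothetical `TERMS` list, since `ID_CARD_CHINESE_TERMS` and `ID_CARD_ENGLISH_TERMS` are not shown in this section; the helper names are illustrative.

```python
import re

# Hypothetical term list; the module-level ID_CARD_* constants are not shown here.
TERMS = ['身份证', '证件号', r'id\s*number', r'id\s*card']
PATTERN = re.compile('|'.join(f'({p})' for p in TERMS), re.I)


def count_id_terms(text):
    """Number of non-overlapping matches across all terms."""
    # findall returns one tuple per match because each term is a group.
    return len(PATTERN.findall(text))


def keep_by_id_terms(text, threshold=3):
    """Drop the text once the match count reaches the threshold."""
    return count_id_terms(text) < threshold


sample = '请提供身份证号码和ID number'
print(count_id_terms(sample))                 # 2
print(keep_by_id_terms(sample, threshold=1))  # False
print(keep_by_id_terms(sample, threshold=3))  # True
```

Because the comparison is `>=`, `threshold=1` rejects any text with a single match, and a `threshold` of 0 would reject every text.
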

javascript_filter(data, input_key='content', min_non_script_lines=3)

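Filters out texts that contain many JavaScript-related patterns (such as code or script fragments). Short texts (3 lines or fewer) are not checked and are kept as-is, to avoid dropping normal short sentences.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • min_non_script_lines (int, default: 3 ) –

    Minimum number of non-script lines required to keep a text. Defaults to 3.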

Examples:

from lazyllm.tools.data import filter

func = filter.javascript_filter(input_key='content', min_non_script_lines=2)
inputs = [{'content': 'Short normal text'}, {'content': 'function() { return 1; }\nconst x = 1;\nvar y = 2;\nlet z = 3;'}]
res = func(inputs)
print(res)
# [{'content': 'Short normal text'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
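def javascript_filter(data, input_key='content', min_non_script_lines=3):
    """Filters out texts that contain many JavaScript-related patterns (such as code or script fragments). Short texts (3 lines or fewer) are not checked and are kept as-is, to avoid dropping normal short sentences.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    min_non_script_lines (int): Minimum number of non-script lines required to keep a text. Defaults to 3.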


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.javascript_filter(input_key='content', min_non_script_lines=2)
    inputs = [{'content': 'Short normal text'}, {'content': 'function() { return 1; }\\nconst x = 1;\\nvar y = 2;\\nlet z = 3;'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Short normal text'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    num_lines = len(lines)
    if num_lines == 0:
        return []
    if num_lines <= 3:
        return data
    num_script_lines = sum(
        1 for line in lines
        if any(pattern in line.lower() for pattern in JAVASCRIPT_PATTERNS)
    )
    num_non_script_lines = num_lines - num_script_lines
    if num_non_script_lines >= min_non_script_lines:
        return data
    else:
        return []
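
The two-stage rule (skip short texts, then require enough non-script lines) is sketched below. `JS_PATTERNS` is a hypothetical pattern set, since the module-level `JAVASCRIPT_PATTERNS` is not shown in this section, and `keep_by_script_lines` is an illustrative helper.

```python
# Hypothetical patterns; the module-level JAVASCRIPT_PATTERNS is not shown here.
JS_PATTERNS = ('function', 'const ', 'var ', 'let ', '=>')


def keep_by_script_lines(text, min_non_script_lines=3):
    """Keep short texts outright; otherwise require enough non-script lines."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    if not lines:
        return False
    if len(lines) <= 3:
        return True  # short texts are never checked
    script = sum(1 for line in lines if any(p in line.lower() for p in JS_PATTERNS))
    return len(lines) - script >= min_non_script_lines


snippet = 'function() { return 1; }\nconst x = 1;\nvar y = 2;\nlet z = 3;'
print(keep_by_script_lines('Short normal text', 2))       # True
print(keep_by_script_lines(snippet, 2))                   # False
print(keep_by_script_lines('var a;\nvar b;\nvar c;', 2))  # True
```

The last call shows the short-text exemption: three lines of pure script are kept because the line count does not exceed 3, so callers that need to catch small snippets would have to pre-filter them separately.
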

lorem_ipsum_filter(data, input_key='content', max_ratio=3e-08)

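Filters out placeholder text such as Lorem ipsum.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • max_ratio (float, default: 3e-08 ) –

    Maximum ratio of placeholder-pattern matches to text length. Defaults to 3e-8.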

Examples:

from lazyllm.tools.data import filter

func = filter.lorem_ipsum_filter(input_key='content')
inputs = [{'content': 'This is real content'}, {'content': 'Lorem ipsum dolor sit amet'}]
res = func(inputs)
print(res)
# [{'content': 'This is real content'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
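def lorem_ipsum_filter(data, input_key='content', max_ratio=3e-8):
    """Filters out placeholder text such as Lorem ipsum.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    max_ratio (float): Maximum ratio of placeholder-pattern matches to text length. Defaults to 3e-8.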


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.lorem_ipsum_filter(input_key='content')
    inputs = [{'content': 'This is real content'}, {'content': 'Lorem ipsum dolor sit amet'}]
    res = func(inputs)
    print(res)
    # [{'content': 'This is real content'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    pattern_str = '|'.join(f'({p})' for p in LOREM_PATTERNS)
    pattern = re.compile(pattern_str, re.IGNORECASE)
    matches = pattern.findall(text)
    num_occurrences = len(matches)
    ratio = num_occurrences / len(text) if len(text) > 0 else 0
    if ratio <= max_ratio:
        return data
    else:
        return []
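
The default `max_ratio` is small enough that it behaves almost like a presence check. The sketch below makes the arithmetic visible; `PATTERNS` is a hypothetical list, since the module-level `LOREM_PATTERNS` is not shown in this section, and the helper names are illustrative.

```python
import re

# Hypothetical patterns; the module-level LOREM_PATTERNS is not shown here.
PATTERNS = [r'lorem ipsum', r'placeholder text']
LOREM_RE = re.compile('|'.join(f'({p})' for p in PATTERNS), re.IGNORECASE)


def lorem_ratio(text):
    """Placeholder matches per character of text."""
    return len(LOREM_RE.findall(text)) / len(text) if text else 0


def keep_by_lorem(text, max_ratio=3e-8):
    """Keep when the match density does not exceed max_ratio."""
    return lorem_ratio(text) <= max_ratio


print(keep_by_lorem('This is real content'))        # True
print(keep_by_lorem('Lorem ipsum dolor sit amet'))  # False
```

One match in a 26-character text gives a ratio of about 0.038, far above 3e-8. A single match only stays under the default limit when the text is longer than roughly 33 million characters, so in practice any match removes the record.
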

no_punc_filter(data, input_key='content', max_length_between_punct=112, language='zh')

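Filters out texts with overly long segments between punctuation marks (such as very long strings without any punctuation).

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • max_length_between_punct (int, default: 112 ) –

    Maximum segment length between punctuation marks, counted in characters for Chinese and in words for English. Defaults to 112.

  • language (str, default: 'zh' ) –

    Language, 'zh' or 'en'. Defaults to 'zh'.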

Examples:

from lazyllm.tools.data import filter

func = filter.no_punc_filter(input_key='content', max_length_between_punct=20, language='zh')
inputs = [{'content': '这是。正常。文本。'}, {'content': '这是一段没有标点符号的超长文本' * 10}]
res = func(inputs)
print(res)
# [{'content': '这是。正常。文本。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def no_punc_filter(data, input_key='content', max_length_between_punct=112, language='zh'):
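    """Filter out text whose runs between punctuation marks are too long (e.g. a very long string with no punctuation).

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    max_length_between_punct (int): Maximum length allowed between punctuation marks, counted in characters
        for 'zh' and in whitespace-separated words for 'en'. Defaults to 112.
    language (str): Language, 'zh' or 'en'. Defaults to 'zh'.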


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.no_punc_filter(input_key='content', max_length_between_punct=20, language='zh')
    inputs = [{'content': '这是。正常。文本。'}, {'content': '这是一段没有标点符号的超长文本' * 10}]
    res = func(inputs)
    print(res)
    # [{'content': '这是。正常。文本。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    language = language.lower()
    if language in ['zh', 'cn', 'chinese']:
        punct_pattern = r'[。!?;,、:""''()《》【】…—.!?,;:]'
    elif language in ['en', 'english']:
        punct_pattern = r'[–.!?,;•/|…:;\'\"]'
    else:
        LOG.warning(f'Unsupported language: {language}, using Chinese punctuation')
        punct_pattern = r'[。!?;,、:""''()《》【】…—.!?,;:]'
    paragraphs = text.split('\n')
    max_length = 0
    for paragraph in paragraphs:
        if len(paragraph.strip()) == 0:
            continue
        segments = re.split(punct_pattern, paragraph)
        for segment in segments:
            segment = segment.strip()
            if not segment:
                continue
            if language in ['en', 'english']:
                length = len(segment.split())
            else:
                length = len(segment)
            if length > max_length:
                max_length = length
    if max_length <= max_length_between_punct:
        return data
    else:
        return []
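
To make the measured quantity concrete, here is a minimal sketch of the "longest run between punctuation marks" computation. `ZH_PUNCT` is a shortened, illustrative subset of the punctuation class used above, and `longest_segment` is a hypothetical helper name.

```python
import re

# Illustrative subset of the Chinese punctuation class; an assumption of this sketch.
ZH_PUNCT = r'[。!?;,、:.!?,;:]'

def longest_segment(text, punct_pattern=ZH_PUNCT, count_words=False):
    """Length of the longest punctuation-free run, scanned paragraph by paragraph."""
    longest = 0
    for paragraph in text.split('\n'):
        for segment in re.split(punct_pattern, paragraph):
            segment = segment.strip()
            if not segment:
                continue
            # Words for English-style counting, characters otherwise.
            length = len(segment.split()) if count_words else len(segment)
            longest = max(longest, length)
    return longest

print(longest_segment('这是。正常。文本。'))                 # 2
print(longest_segment('这是一段没有标点符号的超长文本' * 10))  # 150
```

With `max_length_between_punct=20`, the first text (longest run 2) is kept and the second (longest run 150) is dropped, which matches the example output shown above.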

null_content_filter(data, input_key='content')

Filters out records whose text is empty or contains only whitespace.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

Examples:

from lazyllm.tools.data import filter

func = filter.null_content_filter(input_key='content')
inputs = [{'content': 'Valid content'}, {'content': ''}, {'content': '   '}]
res = func(inputs)
print(res)
# [{'content': 'Valid content'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def null_content_filter(data, input_key='content'):
    """Filter out records whose text is empty or contains only whitespace.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.null_content_filter(input_key='content')
    inputs = [{'content': 'Valid content'}, {'content': ''}, {'content': '   '}]
    res = func(inputs)
    print(res)
    # [{'content': 'Valid content'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if text is not None and isinstance(text, str) and text.strip() != '':
        return data
    else:
        return []

sentence_count_filter(data, input_key='content', min_sentences=3, max_sentences=1000, language='zh')

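Filters by sentence count, keeping text whose number of sentences falls within [min_sentences, max_sentences].

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • min_sentences (int, default: 3 ) –

    Minimum number of sentences. Defaults to 3.

  • max_sentences (int, default: 1000 ) –

    Maximum number of sentences. Defaults to 1000.

  • language (str, default: 'zh' ) –

    Language, 'zh' or 'en'. Defaults to 'zh'.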

Examples:

from lazyllm.tools.data import filter

func = filter.sentence_count_filter(input_key='content', min_sentences=2, max_sentences=10, language='zh')
inputs = [{'content': '单句。'}, {'content': '第一句。第二句。'}]
res = func(inputs)
print(res)
# [{'content': '第一句。第二句。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def sentence_count_filter(data, input_key='content', min_sentences=3, max_sentences=1000, language='zh'):
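    """Filter by sentence count, keeping text whose number of sentences falls within [min_sentences, max_sentences].

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    min_sentences (int): Minimum number of sentences. Defaults to 3.
    max_sentences (int): Maximum number of sentences. Defaults to 1000.
    language (str): Language, 'zh' or 'en'. Defaults to 'zh'.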


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.sentence_count_filter(input_key='content', min_sentences=2, max_sentences=10, language='zh')
    inputs = [{'content': '单句。'}, {'content': '第一句。第二句。'}]
    res = func(inputs)
    print(res)
    # [{'content': '第一句。第二句。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    language = language.lower()
    if language in ['zh', 'cn', 'chinese']:
        sentences = re.split(r'[。!?]+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        num_sentences = len(sentences)
    elif language in ['en', 'english']:
        sentences = nltk.sent_tokenize(text)
        num_sentences = len(sentences)
    else:
        LOG.warning(f'Unsupported language: {language}, using Chinese punctuation')
        sentences = re.split(r'[。!?]+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        num_sentences = len(sentences)
    if min_sentences <= num_sentences <= max_sentences:
        return data
    else:
        return []
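
The Chinese branch above counts sentences by splitting on runs of full-width terminators. A minimal sketch of just that counting step, with `count_zh_sentences` as a hypothetical helper name:

```python
import re

def count_zh_sentences(text):
    """Count non-empty pieces after splitting on runs of 。!? (full-width)."""
    parts = re.split(r'[。!?]+', text)
    return len([p for p in parts if p.strip()])

print(count_zh_sentences('单句。'))            # 1
print(count_zh_sentences('第一句。第二句。'))    # 2
print(count_zh_sentences('真的吗?!是的。'))    # 2, because '?!' is one run
print(count_zh_sentences('Hello. World.'))     # 1, ASCII periods are not terminators here
```

The last call suggests that English text should be processed with `language='en'`; under the 'zh' rules, ASCII punctuation does not end a sentence.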

special_char_filter(data, input_key='content')

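Filters out text that contains special invisible characters (zero-width characters, replacement characters, and so on).

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.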

Examples:

from lazyllm.tools.data import filter

func = filter.special_char_filter(input_key='content')
inputs = [{'content': 'Normal text 正常文本'}, {'content': 'Text with \u200b zero width'}]
res = func(inputs)
print(res)
# [{'content': 'Normal text 正常文本'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def special_char_filter(data, input_key='content'):
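    """Filter out text that contains special invisible characters (zero-width characters, replacement characters, etc.).

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.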


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.special_char_filter(input_key='content')
    inputs = [{'content': 'Normal text 正常文本'}, {'content': 'Text with \u200b zero width'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal text 正常文本'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    has_special_char = any(re.search(pattern, text) for pattern in SPECIAL_CHAR_PATTERNS)
    if not has_special_char:
        return data
    else:
        return []
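
Invisible characters are hard to spot in printed examples, so the detection step is sketched below with explicit escapes. `EXAMPLE_SPECIAL_PATTERNS` is a hypothetical stand-in for `SPECIAL_CHAR_PATTERNS`, whose actual contents are not shown in this section.

```python
import re

# Hypothetical patterns: zero-width space/joiners, BOM, and the replacement character.
EXAMPLE_SPECIAL_PATTERNS = [r'[\u200b\u200c\u200d\ufeff]', r'\ufffd']

def has_special_char(text, patterns=EXAMPLE_SPECIAL_PATTERNS):
    """True if any pattern matches anywhere in the text."""
    return any(re.search(p, text) for p in patterns)

print(has_special_char('Normal text 正常文本'))         # False
print(has_special_char('Text with \u200b zero width'))  # True
```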

unique_word_filter(data, input_key='content', min_ratio=0.1, use_tokenizer=True, language='zh')

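Filters out text whose ratio of unique words to total words is too low (content made invalid by excessive repetition).

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • min_ratio (float, default: 0.1 ) –

    Minimum ratio of unique words; text is kept only when its ratio is strictly greater than this value. Defaults to 0.1.

  • use_tokenizer (bool, default: True ) –

    Whether to use a tokenizer (jieba for 'zh', NLTK for 'en'). Defaults to True.

  • language (str, default: 'zh' ) –

    Language, 'zh' or 'en'. Defaults to 'zh'.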

Examples:

from lazyllm.tools.data import filter

func = filter.unique_word_filter(input_key='content', min_ratio=0.4, language='zh')
inputs = [{'content': '这是一段包含多个不同词汇的文本。'}, {'content': '重复重复重复'}]
res = func(inputs)
print(res)
# [{'content': '这是一段包含多个不同词汇的文本。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='thread')
def unique_word_filter(data, input_key='content', min_ratio=0.1, use_tokenizer=True, language='zh'):
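    """Filter out text whose ratio of unique words to total words is too low (content with excessive repetition).

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    min_ratio (float): Minimum ratio of unique words; text is kept only when its ratio is strictly greater
        than this value. Defaults to 0.1.
    use_tokenizer (bool): Whether to use a tokenizer (jieba for 'zh', NLTK for 'en'). Defaults to True.
    language (str): Language, 'zh' or 'en'. Defaults to 'zh'.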


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.unique_word_filter(input_key='content', min_ratio=0.4, language='zh')
    inputs = [{'content': '这是一段包含多个不同词汇的文本。'}, {'content': '重复重复重复'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段包含多个不同词汇的文本。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    language = language.lower()
    if language in ['zh', 'cn', 'chinese']:
        if use_tokenizer:
            words = list(jieba.cut(text.lower()))
        else:
            words = list(text)
    elif language in ['en', 'english']:
        if use_tokenizer:
            nltk_data_dir = _setup_nltk_data_dir()
            try:
                nltk.data.find('tokenizers/punkt_tab')
            except LookupError:
                LOG.info('Downloading NLTK punkt_tab tokenizer...')
                nltk.download('punkt_tab', quiet=True, download_dir=nltk_data_dir)
            words = nltk.word_tokenize(text.lower())
        else:
            words = text.lower().split()
    else:
        LOG.warning(f'Unsupported language: {language}, using simple split')
        words = text.lower().split()
    num_words = len(words)
    if num_words == 0:
        return []
    num_unique_words = len(set(words))
    ratio = num_unique_words / num_words
    if ratio > min_ratio:
        return data
    else:
        return []
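
The final comparison above reduces to a unique-to-total ratio over a word list. A minimal sketch of that ratio, using the tokenizer-free paths (character split for Chinese, whitespace split for English) so that it runs without jieba or NLTK; `unique_word_ratio` is a hypothetical helper name:

```python
def unique_word_ratio(words):
    """Number of distinct words divided by the total number of words."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# use_tokenizer=False path for Chinese: every character counts as a word.
char_level = list('重复重复重复')
print(unique_word_ratio(char_level))   # 2 distinct / 6 total

# use_tokenizer=False path for English: lowercase, then whitespace split.
en_words = 'The cat and the dog'.lower().split()
print(unique_word_ratio(en_words))     # 4 distinct / 5 total
```

Since the filter compares with `ratio > min_ratio`, a text whose ratio equals `min_ratio` exactly is dropped.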

watermark_filter(data, input_key='content', watermarks=None)

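Filters out text that contains copyright or watermark terms.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • watermarks (list | None, default: None ) –

    Custom list of watermark terms. The built-in list is used when omitted. Terms are joined with '|' and matched as a regular expression.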

Examples:

from lazyllm.tools.data import filter

func = filter.watermark_filter(input_key='content')
inputs = [{'content': 'Normal content'}, {'content': 'This document contains Copyright notice'}]
res = func(inputs)
print(res)
# [{'content': 'Normal content'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def watermark_filter(data, input_key='content', watermarks=None):
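    """Filter out text that contains copyright or watermark terms.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    watermarks (list|None): Custom list of watermark terms. The built-in list is used when omitted.
        Terms are joined with '|' and matched as a regular expression.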


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.watermark_filter(input_key='content')
    inputs = [{'content': 'Normal content'}, {'content': 'This document contains Copyright notice'}]
    res = func(inputs)
    print(res)
    # [{'content': 'Normal content'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    watermarks = watermarks or DEFAULT_WATERMARKS
    matches = re.search('|'.join(watermarks), text)
    if matches is None:
        return data
    else:
        return []
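
Because the terms are joined with `|` and handed to `re.search` unescaped, a custom term containing regex metacharacters is interpreted as a pattern. The sketch below illustrates the difference; `EXAMPLE_WATERMARKS` is a hypothetical list (the contents of `DEFAULT_WATERMARKS` are not shown here), and the `escape` switch is an addition of this sketch, not a parameter of the operator.

```python
import re

# Hypothetical term list for illustration only.
EXAMPLE_WATERMARKS = ['Copyright', 'All rights reserved', '(c)']

def contains_watermark(text, watermarks=EXAMPLE_WATERMARKS, escape=True):
    """Search for any term; optionally escape terms so they match literally."""
    terms = [re.escape(w) for w in watermarks] if escape else list(watermarks)
    return re.search('|'.join(terms), text) is not None

print(contains_watermark('Normal content'))                           # False
print(contains_watermark('This document contains Copyright notice'))  # True
# Unescaped, '(c)' is a group that matches any letter 'c':
print(contains_watermark('Normal content', escape=False))             # True
```

Matching is case-sensitive in both variants, so 'copyright' in lowercase would not match the term 'Copyright'.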

word_count_filter(data, input_key='content', min_words=10, max_words=10000, language='zh')

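Filters by word or character count: Chinese text is counted in characters (whitespace excluded) and English text in words. Text whose count falls within [min_words, max_words) is kept.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • min_words (int, default: 10 ) –

    Minimum count (inclusive). Defaults to 10.

  • max_words (int, default: 10000 ) –

    Maximum count (exclusive). Defaults to 10000.

  • language (str, default: 'zh' ) –

    Language, 'zh' or 'en'. Defaults to 'zh'.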

Examples:

from lazyllm.tools.data import filter

func = filter.word_count_filter(input_key='content', min_words=5, max_words=20, language='zh')
inputs = [{'content': '短文本'}, {'content': '这是一段适中长度的中文文本内容。'}]
res = func(inputs)
print(res)
# [{'content': '这是一段适中长度的中文文本内容。'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def word_count_filter(data, input_key='content', min_words=10, max_words=10000, language='zh'):
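    """Filter by word or character count: Chinese is counted in characters, English in words; text within [min_words, max_words) is kept.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    min_words (int): Minimum count (inclusive). Defaults to 10.
    max_words (int): Maximum count (exclusive). Defaults to 10000.
    language (str): Language, 'zh' or 'en'. Defaults to 'zh'.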


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.word_count_filter(input_key='content', min_words=5, max_words=20, language='zh')
    inputs = [{'content': '短文本'}, {'content': '这是一段适中长度的中文文本内容。'}]
    res = func(inputs)
    print(res)
    # [{'content': '这是一段适中长度的中文文本内容。'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    language = language.lower()
    if language in ['zh', 'cn', 'chinese']:
        count = len(text.replace(' ', '').replace('\n', '').replace('\t', ''))
    elif language in ['en', 'english']:
        count = len(text.split())
    else:
        LOG.warning(f'Unsupported language: {language}, using character count')
        count = len(text.replace(' ', '').replace('\n', '').replace('\t', ''))
    if min_words <= count < max_words:
        return data
    else:
        return []
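
The two counting rules and the half-open range can be checked with a short sketch. `count_units` and `in_range` are hypothetical helper names that mirror the logic above.

```python
def count_units(text, language='zh'):
    """Words for English; non-whitespace characters otherwise."""
    if language.lower() in ('en', 'english'):
        return len(text.split())
    return len(text.replace(' ', '').replace('\n', '').replace('\t', ''))

def in_range(count, min_words, max_words):
    """Half-open interval: min_words is included, max_words is not."""
    return min_words <= count < max_words

print(count_units('短文本'))                              # 3
print(count_units('这是一段适中长度的中文文本内容。'))       # 16
print(count_units('hello big world', language='en'))      # 3
print(in_range(16, 5, 20), in_range(20, 5, 20))           # True False
```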

word_length_filter(data, input_key='content', min_length=3, max_length=20)

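Filters by mean word length, keeping text whose mean length of whitespace-separated words falls within [min_length, max_length).

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • min_length (float, default: 3 ) –

    Minimum mean word length (inclusive). Defaults to 3.

  • max_length (float, default: 20 ) –

    Maximum mean word length (exclusive). Defaults to 20.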

Examples:

from lazyllm.tools.data import filter

func = filter.word_length_filter(input_key='content', min_length=3, max_length=10)
inputs = [{'content': 'I am ok'}, {'content': 'This is a normal sentence'}]
res = func(inputs)
print(res)
# [{'content': 'This is a normal sentence'}]
Source code in lazyllm/tools/data/operators/filter_op.py
@data_register('data.filter', rewrite_func='forward', _concurrency_mode='process')
def word_length_filter(data, input_key='content', min_length=3, max_length=20):
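    """Filter by mean word length, keeping text whose mean word length falls within [min_length, max_length).

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field. Defaults to 'content'.
    min_length (float): Minimum mean word length (inclusive). Defaults to 3.
    max_length (float): Maximum mean word length (exclusive). Defaults to 20.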


Examples:
    ```python
    from lazyllm.tools.data import filter

    func = filter.word_length_filter(input_key='content', min_length=3, max_length=10)
    inputs = [{'content': 'I am ok'}, {'content': 'This is a normal sentence'}]
    res = func(inputs)
    print(res)
    # [{'content': 'This is a normal sentence'}]
    ```
    """
    assert isinstance(data, dict)
    text = data.get(input_key)
    if not isinstance(text, str) or not text.strip():
        return []
    words = text.split()
    num_words = len(words)
    if num_words == 0:
        return []
    num_chars = sum(len(word) for word in words)
    mean_length = num_chars / num_words
    if min_length <= mean_length < max_length:
        return data
    else:
        return []
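
As a worked check of the example above, the mean word length can be computed directly. `mean_word_length` is a hypothetical helper name mirroring the arithmetic in the source.

```python
def mean_word_length(text):
    """Total characters in whitespace-separated words divided by the word count."""
    words = text.split()
    if not words:
        return None
    return sum(len(word) for word in words) / len(words)

print(mean_word_length('I am ok'))                    # (1 + 2 + 2) / 3
print(mean_word_length('This is a normal sentence'))  # (4 + 2 + 1 + 6 + 8) / 5 = 4.2
```

With `min_length=3`, the first text (mean about 1.67) is dropped and the second (mean 4.2) is kept. Text without spaces, such as unsegmented Chinese, counts as a single word whose length is the whole string.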

Chunking Operators

lazyllm.tools.data.operators.token_chunker

TokenChunker

Bases: Chunker

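Splits long text into chunks by token count. The text is first split into paragraphs; a paragraph that exceeds max_tokens is further split into sentences, so that chunks stay within max_tokens wherever sentence boundaries allow. A trailing chunk shorter than min_tokens is discarded when at least one other chunk exists.

Parameters:

  • input_key (str, default: 'content' ) –

    Name of the text field. Defaults to 'content'.

  • model_path (str | None, default: None ) –

    Path or name of the tokenizer model. Qwen2.5-0.5B-Instruct is used by default.

  • max_tokens (int, default: 1024 ) –

    Maximum number of tokens per chunk. Defaults to 1024.

  • min_tokens (int, default: 200 ) –

    Minimum number of tokens per chunk; a trailing chunk below this value may be discarded. Defaults to 200.

  • _concurrency_mode (str, default: 'process' ) –

    Optional. Concurrency mode.

  • _max_workers (int | None) –

    Optional. Maximum number of concurrent workers.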

Examples:

from lazyllm.tools.data import chunker

func = chunker.TokenChunker(input_key='content', max_tokens=50, min_tokens=10)
inputs = [{'content': '人工智能是计算机科学的一个分支。' * 20, 'meta_data': {'source': 'doc_1'}}]
res = func(inputs)
print(res)
# [{'uid': '...', 'content': '...', 'meta_data': {'source': 'doc_1', 'index': 0, 'total': N, 'length': ...}}, ...]
Source code in lazyllm/tools/data/operators/token_chunker.py
class TokenChunker(Chunker):
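    """Split long text into chunks by token count.

The text is first split into paragraphs; a paragraph that exceeds max_tokens is further split into sentences,
so that chunks stay within max_tokens wherever sentence boundaries allow. A trailing chunk shorter than
min_tokens is discarded when at least one other chunk exists.

Args:
    input_key (str): Name of the text field. Defaults to 'content'.
    model_path (str|None): Path or name of the tokenizer model. Qwen2.5-0.5B-Instruct is used by default.
    max_tokens (int): Maximum number of tokens per chunk. Defaults to 1024.
    min_tokens (int): Minimum number of tokens per chunk; a trailing chunk below this value may be discarded. Defaults to 200.
    _concurrency_mode (str): Optional. Concurrency mode.
    _max_workers (int|None): Optional. Maximum number of concurrent workers.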


Examples:
    ```python
    from lazyllm.tools.data import chunker

    func = chunker.TokenChunker(input_key='content', max_tokens=50, min_tokens=10)
    inputs = [{'content': '人工智能是计算机科学的一个分支。' * 20, 'meta_data': {'source': 'doc_1'}}]
    res = func(inputs)
    print(res)
    # [{'uid': '...', 'content': '...', 'meta_data': {'source': 'doc_1', 'index': 0, 'total': N, 'length': ...}}, ...]
    ```
    """
    def __init__(self, input_key='content', model_path=None,
                 max_tokens=1024, min_tokens=200, _concurrency_mode='process', **kwargs):
        super().__init__(_concurrency_mode=_concurrency_mode, **kwargs)
        self.input_key = input_key
        self.max_tokens = max_tokens
        self.min_tokens = min_tokens
        self.model_path = model_path
        self.tokenizer = self._load_tokenizer()

    def _try_load_tokenizer(self, path, is_local, default_cache_dir):
        if is_local or os.path.isdir(path) or os.path.isfile(path):
            return transformers.AutoTokenizer.from_pretrained(
                path, trust_remote_code=True
            )
        else:
            return transformers.AutoTokenizer.from_pretrained(
                path, cache_dir=default_cache_dir, trust_remote_code=True
            )

    def _try_load_from_config_path(self, default_model_name, default_cache_dir):
        try:
            config_model_path = config['model_path']
            if not config_model_path:
                return None
            if os.path.isdir(config_model_path):
                joined_path = os.path.join(config_model_path, default_model_name)
                if os.path.exists(joined_path):
                    LOG.info(f'Loading tokenizer from config model_path: {joined_path}')
                    try:
                        return self._try_load_tokenizer(joined_path, True, default_cache_dir)
                    except Exception as e:
                        LOG.warning(f'Failed to load from {joined_path}: {e}, trying cache directory')
            elif os.path.exists(config_model_path):
                LOG.info(f'Loading tokenizer from config model_path: {config_model_path}')
                try:
                    return self._try_load_tokenizer(config_model_path, True, default_cache_dir)
                except Exception as e:
                    LOG.warning(f'Failed to load from {config_model_path}: {e}, trying cache directory')
        except (KeyError, TypeError):
            pass
        return None

    def _try_load_from_cache(self, default_model_name, default_cache_dir):
        try:
            cache_model_path = os.path.join(default_cache_dir, default_model_name)
            if os.path.exists(cache_model_path):
                LOG.info(f'Loading tokenizer from cache directory: {cache_model_path}')
                return self._try_load_tokenizer(cache_model_path, True, default_cache_dir)
        except Exception:
            pass
        return None

    def _load_tokenizer(self):
        default_model = 'Qwen/Qwen2.5-0.5B-Instruct'
        default_model_name = 'qwen2.5-0.5b-instruct'
        model_or_path = self.model_path

        try:
            default_cache_dir = config['model_cache_dir']
        except (KeyError, TypeError):
            default_cache_dir = os.path.join(os.path.expanduser('~'), '.lazyllm', 'models')

        if model_or_path:
            try:
                is_local = os.path.isdir(model_or_path) or os.path.isfile(model_or_path)
                if is_local:
                    log_msg = f'Loading tokenizer from local path: {model_or_path}'
                else:
                    log_msg = f'Loading tokenizer from model: {model_or_path}'
                LOG.info(log_msg)
                return self._try_load_tokenizer(model_or_path, is_local, default_cache_dir)
            except Exception as e:
                LOG.warning(f'Failed to load from {model_or_path}: {e}, trying config model_path')

        if model_or_path is None:
            result = self._try_load_from_config_path(default_model_name, default_cache_dir)
            if result:
                return result

        result = self._try_load_from_cache(default_model_name, default_cache_dir)
        if result:
            return result

        LOG.info(f'Loading default tokenizer: {default_model} (will download to cache)')
        try:
            return self._try_load_tokenizer(default_model, False, default_cache_dir)
        except Exception as e:
            LOG.error(f'Failed to load default tokenizer: {e}')
            raise

    def _split_paragraphs(self, text):
        paragraphs = re.split(r'(\n{2,})', text)
        processed_paragraphs = []
        for i in range(0, len(paragraphs), 2):
            unit = paragraphs[i]
            if i + 1 < len(paragraphs):
                unit += paragraphs[i + 1]
            if unit:
                processed_paragraphs.append(unit)
        return processed_paragraphs

    def _split_sentences(self, text):
        sentences = re.split(r'([。!?\.!\?])', text)
        return [s for s in (''.join(filter(None, t)) for t in zip_longest(sentences[0::2], sentences[1::2])) if s]

    def _process_chunks(self, processed_paragraphs):
        chunks = []
        current_chunk_text_parts = []
        current_chunk_tokens = 0

        for p_text in processed_paragraphs:
            p_tokens = self.tokenizer.encode(p_text)

            if current_chunk_tokens + len(p_tokens) <= self.max_tokens:
                current_chunk_text_parts.append(p_text)
                current_chunk_tokens += len(p_tokens)
            else:
                if current_chunk_text_parts:
                    chunks.append(''.join(current_chunk_text_parts))

                if len(p_tokens) > self.max_tokens:
                    sentences = self._split_sentences(p_text)

                    sub_chunk_parts = []
                    sub_chunk_tokens = 0
                    for sent in sentences:
                        sent_tokens_count = len(self.tokenizer.encode(sent))
                        if sub_chunk_tokens + sent_tokens_count <= self.max_tokens:
                            sub_chunk_parts.append(sent)
                            sub_chunk_tokens += sent_tokens_count
                        else:
                            if sub_chunk_parts:
                                chunks.append(''.join(sub_chunk_parts))
                            sub_chunk_parts = [sent]
                            sub_chunk_tokens = sent_tokens_count

                    current_chunk_text_parts = sub_chunk_parts
                    current_chunk_tokens = sub_chunk_tokens
                else:
                    current_chunk_text_parts = [p_text]
                    current_chunk_tokens = len(p_tokens)

        if current_chunk_text_parts:
            final_chunk_text = ''.join(current_chunk_text_parts)
            final_chunk_tokens = len(self.tokenizer.encode(final_chunk_text))
            if len(chunks) > 0 and final_chunk_tokens < self.min_tokens:
                LOG.warning(f'Discarding small chunk (tokens: {final_chunk_tokens}, threshold: {self.min_tokens})')
            else:
                chunks.append(final_chunk_text)

        return chunks

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        text = data.get(self.input_key, '')
        orig_meta = data.get('meta_data', {})

        if not text:
            return []

        paragraphs = self._split_paragraphs(text)
        chunks = self._process_chunks(paragraphs)

        if not chunks and text:
            chunks = [text]

        ts = datetime.now().strftime('%Y%m%d%H%M%S')
        total = len(chunks)

        return [
            {
                'uid': f'{ts}_{uuid.uuid4().hex}',
                'content': chunk,
                'meta_data': {
                    **orig_meta,
                    'index': idx,
                    'total': total,
                    'length': len(chunk),
                },
            }
            for idx, chunk in enumerate(chunks)
        ]
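
The two core ideas of the chunker, splitting on sentence terminators while keeping them attached and then packing units greedily up to a budget, can be sketched without a tokenizer. This is a simplified sketch under stated assumptions: `count_tokens` defaults to character count as a stand-in for the Hugging Face tokenizer, and the paragraph stage and the `min_tokens` discard rule are omitted.

```python
import re
from itertools import zip_longest

def split_sentences(text):
    """Split on sentence terminators, keeping each terminator with its sentence."""
    parts = re.split(r'([。!?.!?])', text)
    pairs = zip_longest(parts[0::2], parts[1::2], fillvalue='')
    return [body + end for body, end in pairs if body + end]

def pack(units, max_tokens, count_tokens=len):
    """Greedily join units into chunks whose size stays within max_tokens.

    A single unit larger than max_tokens still becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for unit in units:
        n = count_tokens(unit)
        if current and used + n > max_tokens:
            chunks.append(''.join(current))
            current, used = [], 0
        current.append(unit)
        used += n
    if current:
        chunks.append(''.join(current))
    return chunks

sentences = split_sentences('第一句。第二句!第三句?')
print(sentences)                      # ['第一句。', '第二句!', '第三句?']
print(pack(sentences, max_tokens=8))  # ['第一句。第二句!', '第三句?']
```

Joining chunks with an empty string, as the operator does, means the concatenation of all chunks reproduces the input text when nothing is discarded.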

Code Generation Operators

lazyllm.tools.data.operators.codegen_ops

CodeInstructionGenerator

Bases: CodeGenOps

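Code generation pipeline operator: instruction standardization generator.

Extracts the user instruction from the raw conversation messages and rewrites it as a unified "code-augmented instruction": one English description followed by a Python code block containing a complete function skeleton.

Output structure (with the defaults input_key='messages', output_key='generated_instruction'):

  • messages: the original multi-turn conversation (unchanged)
  • generated_instruction (str): the standardized English instruction plus a Python code block

Parameters:

  • model

    LazyLLM model object (required). It is reused via share().

  • prompt_template (str | None, default: None ) –

    Optional. Custom system prompt; when provided, it replaces the default sys_prompt.

  • input_key (str, default: 'messages' ) –

    Name of the input conversation field. Defaults to 'messages'.

  • output_key (str, default: 'generated_instruction' ) –

    Name of the output field for the standardized instruction. Defaults to 'generated_instruction'.

  • **kwargs

    Other arguments passed to the base operator (such as _max_workers and _save_data).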

Examples:

from lazyllm.tools.data.operators.codegen_ops import CodeInstructionGenerator

op = CodeInstructionGenerator(model=model,
                              input_key='messages',
                              output_key='generated_instruction')
item = {
    'messages': [
        {'role': 'user', 'content': '写一个 Python 函数,打印 hello'}
    ]
}
res = op(item)
print(res)

# Output Example:
# {
#    'messages': [...],
#    'generated_instruction': "Write a Python function that prints 'hello'.\n"
#                             "```python\n"
#                             "def solution():\n"
#                             "    print('hello')\n"
#                             "```"
# }
Source code in lazyllm/tools/data/operators/codegen_ops.py
class CodeInstructionGenerator(CodeGenOps):
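    """Code generation pipeline operator: instruction standardization generator.

Extracts the user instruction from the raw conversation messages and rewrites it as a unified "code-augmented instruction": one English description followed by a Python code block containing a complete function skeleton.

Output structure (with the defaults input_key='messages', output_key='generated_instruction'):

- messages: the original multi-turn conversation (unchanged)
- generated_instruction (str): the standardized English instruction plus a Python code block

Args:
    model: LazyLLM model object (required). It is reused via share().
    prompt_template (str|None): Optional. Custom system prompt; when provided, it replaces the default sys_prompt.
    input_key (str): Name of the input conversation field. Defaults to 'messages'.
    output_key (str): Name of the output field for the standardized instruction. Defaults to 'generated_instruction'.
    **kwargs: Other arguments passed to the base operator (such as _max_workers and _save_data).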


Examples:

    from lazyllm.tools.data.operators.codegen_ops import CodeInstructionGenerator

    op = CodeInstructionGenerator(model=model,
                                  input_key='messages',
                                  output_key='generated_instruction')
    item = {
        'messages': [
            {'role': 'user', 'content': '写一个 Python 函数,打印 hello'}
        ]
    }
    res = op(item)
    print(res)

    # Output Example:
    # {
    #    'messages': [...],
    #    'generated_instruction': "Write a Python function that prints 'hello'.\\n"
    #                             "```python\\n"
    #                             "def solution():\\n"
    #                             "    print('hello')\\n"
    #                             "```"
    # }
    """
    def __init__(self, model=None, prompt_template=None, input_key='messages', output_key='generated_instruction',
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = prompt_template or (
            'You are a code instruction standardization assistant.\n'
            'Rewrite the given instruction into a consistent format for Python code generation tasks.\n'
            'Output must be English and contain exactly two parts:\n'
            '1) A single concise instruction sentence in English.\n'
            '2) A Python code block in Markdown with a complete function skeleton.\n'
            'Do not add explanations, do not add extra sections.\n'
            'Example output format:\n'
            'Write a Python function that ...\n'
            '```python\n'
            'def solution(...):\n'
            '    \"\"\"...\"\"\"\n'
            '    ...\n'
            '```\n'
        )
        self.model = model.share().prompt(sys_prompt)

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.input_key not in data:
            raise ValueError(f'Missing required key: {self.input_key}')
        if self.output_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_key}')
        raw_instruction = _extract_human_instruction(data.get(self.input_key))
        response = self.model(raw_instruction)
        data[self.output_key] = response.strip() if isinstance(response, str) else response
        return data
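
Downstream steps often need the sentence and the code skeleton separately. The following is a minimal sketch of one way to split a `generated_instruction` string that follows the two-part format requested by the default prompt. `split_instruction` is a hypothetical helper, not part of the operator, and real model output may deviate from the format.

```python
import re

FENCE = '`' * 3  # Markdown code fence, built programmatically for easy embedding

def split_instruction(generated):
    """Return (instruction_sentence, python_code_or_None)."""
    pattern = re.escape(FENCE) + r'python\n(.*?)' + re.escape(FENCE)
    match = re.search(pattern, generated, re.DOTALL)
    if match is None:
        # No code block found: treat the whole response as the instruction.
        return generated.strip(), None
    return generated[:match.start()].strip(), match.group(1).rstrip('\n')

text = ("Write a Python function that prints 'hello'.\n"
        + FENCE + "python\ndef solution():\n    print('hello')\n" + FENCE)
sentence, code = split_instruction(text)
print(sentence)
print(code)
```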

LogicIntegrityAuditor

Bases: CodeGenOps

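Code generation pipeline operator: code quality evaluator.

Automatically reviews a single (instruction, code) sample and outputs a quality score (0 to 10) plus textual feedback. The model response is parsed as JSON by default.

Output structure (with the defaults input_instruction_key='instruction', input_code_key='new_code'):

  • instruction: the standardized instruction
  • new_code: the generated code
  • quality_score: the quality score (int or float, depending on how JsonFormatter parses the response)
  • feedback: the textual feedback

Parameters:

  • model

    LazyLLM model object (required). It is wrapped with JsonFormatter to produce JSON output.

  • prompt_template (str | None, default: None ) –

    Optional. Custom system prompt.

  • input_instruction_key (str, default: 'instruction' ) –

    Name of the input instruction field. Defaults to 'instruction'.

  • input_code_key (str, default: 'new_code' ) –

    Name of the input code field. Defaults to 'new_code'.

  • output_score_key (str, default: 'quality_score' ) –

    Name of the output score field. Defaults to 'quality_score'.

  • output_feedback_key (str, default: 'feedback' ) –

    Name of the output feedback field. Defaults to 'feedback'.

  • **kwargs

    Other arguments passed to the base operator.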

Examples:

from lazyllm.tools.data.operators.codegen_ops import LogicIntegrityAuditor

op = LogicIntegrityAuditor(model=model)
item = {
    'instruction': "Write a Python function that prints 'hello'.",
    'new_code': "def solution():\n    print('hello')"
}
res = op(item)
print(res)
# {
#   'instruction': "Write a Python function that prints 'hello'.",
#   'new_code': "def solution():\n    print('hello')",
#   'quality_score': 8,
#   'feedback': 'Good code. The logic is clear and follows PEP8.'
# }
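
The request and response handling of this operator can be sketched without a model. The prompt layout and the default values below follow the operator's `forward` method; `build_review_prompt` and `read_review` are hypothetical helper names, and the handling of non-dict responses is a placeholder of this sketch only.

```python
FENCE = '`' * 3  # Markdown code fence, built programmatically for easy embedding

def build_review_prompt(instruction, code):
    """Compose the user message: the instruction followed by the fenced code."""
    return f'Instruction:\n{instruction}\n\nCode:\n{FENCE}python\n{code}\n{FENCE}'

def read_review(res):
    """Extract (score, feedback) from a parsed JSON dict, applying the defaults."""
    if not isinstance(res, dict):
        # Placeholder: the operator's own fallback for this case is not reproduced here.
        return None, None
    return res.get('score', 0), res.get('feedback', 'No feedback provided.')

prompt = build_review_prompt("Write a Python function that prints 'hello'.",
                             "def solution():\n    print('hello')")
print(prompt)
print(read_review({'score': 8, 'feedback': 'Good code.'}))  # (8, 'Good code.')
print(read_review({}))                                      # (0, 'No feedback provided.')
```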
Source code in lazyllm/tools/data/operators/codegen_ops.py
class LogicIntegrityAuditor(CodeGenOps):
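    """Code generation pipeline operator: code quality evaluator.

Automatically reviews a single (instruction, code) sample and outputs a quality score (0 to 10) plus textual feedback. The model response is parsed as JSON by default.

Output structure (with the defaults input_instruction_key='instruction', input_code_key='new_code'):

- instruction: the standardized instruction
- new_code: the generated code
- quality_score: the quality score (int or float, depending on how JsonFormatter parses the response)
- feedback: the textual feedback

Args:
    model: LazyLLM model object (required). It is wrapped with JsonFormatter to produce JSON output.
    prompt_template (str|None): Optional. Custom system prompt.
    input_instruction_key (str): Name of the input instruction field. Defaults to 'instruction'.
    input_code_key (str): Name of the input code field. Defaults to 'new_code'.
    output_score_key (str): Name of the output score field. Defaults to 'quality_score'.
    output_feedback_key (str): Name of the output feedback field. Defaults to 'feedback'.
    **kwargs: Other arguments passed to the base operator.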


Examples:
    ```python
    from lazyllm.tools.data.operators.codegen_ops import LogicIntegrityAuditor

    op = LogicIntegrityAuditor(model=model)
    item = {
        'instruction': "Write a Python function that prints 'hello'.",
        'new_code': "def solution():\n    print('hello')"
    }
    res = op(item)
    print(res)
    # {
    #   'instruction': "Write a Python function that prints 'hello'.",
    #   'new_code': "def solution():\n    print('hello')",
    #   'quality_score': 8,
    #   'feedback': 'Good code. The logic is clear and follows PEP8.'
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        prompt_template=None,
        input_instruction_key='instruction',
        input_code_key='new_code',
        output_score_key='quality_score',
        output_feedback_key='feedback',
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_instruction_key = input_instruction_key
        self.input_code_key = input_code_key
        self.output_score_key = output_score_key
        self.output_feedback_key = output_feedback_key
        sys_prompt = prompt_template or (
            'You are an automated code reviewer.\n'
            'Evaluate the generated Python code against the given instruction.\n'
            'Please provide a score (0-10) and feedback.\n'
            'Output must be in JSON format:\n'
            '{\n'
            '  "score": <0-10>,\n'
            '  "feedback": "..."\n'
            '}'
        )
        self.model = model.share().prompt(sys_prompt).formatter(JsonFormatter())

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.input_instruction_key not in data:
            raise ValueError(f'Missing required key: {self.input_instruction_key}')
        if self.input_code_key not in data:
            raise ValueError(f'Missing required key: {self.input_code_key}')
        if self.output_score_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_score_key}')
        if self.output_feedback_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_feedback_key}')
        instruction = data.get(self.input_instruction_key, '')
        code = data.get(self.input_code_key, '')
        user_input = f'Instruction:\n{instruction}\n\nCode:\n```python\n{code}\n```'
        res = self.model(user_input)

        if isinstance(res, dict):
            score = res.get('score', 0)
            feedback = res.get('feedback', 'No feedback provided.')
        else:
            from lazyllm import LOG
            LOG.warning(f'Failed to extract JSON from response: {res}')
            score, feedback = 0, 'Failed to parse LLM evaluation output.'

        data[self.output_score_key] = score
        data[self.output_feedback_key] = feedback
        return data
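
The fallback at the end of `LogicIntegrityAuditor.forward` can be sketched in isolation. This is a minimal standalone sketch, not the library's API: `extract_review` is a hypothetical helper name, and it only mirrors the branch that turns the formatter output into a `(score, feedback)` pair.

```python
from typing import Any, Tuple


def extract_review(res: Any) -> Tuple[Any, str]:
    """Hypothetical helper: dict responses yield their fields, anything else yields defaults."""
    if isinstance(res, dict):
        # Missing keys fall back to the same defaults used in forward().
        return res.get('score', 0), res.get('feedback', 'No feedback provided.')
    # A non-dict response means JSON extraction failed.
    return 0, 'Failed to parse LLM evaluation output.'


parsed = extract_review({'score': 8, 'feedback': 'Clear logic.'})
missing = extract_review({'feedback': 'No score given.'})
unparsed = extract_review('not json')
print(parsed, missing, unparsed)
```

Note that an unparsable response and a genuinely poor sample both end up with a score of 0, so downstream filters cannot tell them apart from the score alone.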

ScriptSynthesizer

Bases: CodeGenOps

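Code generation pipeline operator: instruction-to-code generator.

Given a natural-language coding instruction (typically the generated_instruction produced by the previous stage, or a condensed instruction), generates the corresponding Python source code and attempts to strip any surrounding Markdown code fence so that only the code itself is kept.

Example output structure (defaults: input_key='instruction', output_key='new_code'):

  • instruction: the natural-language coding instruction
  • new_code (str): the generated Python code string

Parameters:

  • model

    LazyLLM model object (required).

  • prompt_template (str | None, default: None ) –

    Optional custom system prompt.

  • input_key (str, default: 'instruction' ) –

    Name of the input instruction field. Defaults to 'instruction'.

  • output_key (str, default: 'new_code' ) –

    Name of the output code field. Defaults to 'new_code'.

  • **kwargs

    Additional arguments passed to the base operator.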

Examples:

from lazyllm.tools.data.operators.codegen_ops import ScriptSynthesizer

op = ScriptSynthesizer(model=model,
                       input_key='instruction',
                       output_key='new_code')
item = {
    'instruction': 'Write a Python function that prints "hello".'
}
res = op(item)
print(res)
# {
#   'instruction': 'Write a Python function that prints "hello".',
#   'new_code': "def solution():\n    print('hello')"
# }
Source code in lazyllm/tools/data/operators/codegen_ops.py
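class ScriptSynthesizer(CodeGenOps):
    """Code generation pipeline operator: instruction-to-code generator.

Given a natural-language coding instruction (typically the generated_instruction produced by the previous stage, or a condensed instruction), generates the corresponding Python source code and attempts to strip any surrounding Markdown code fence so that only the code itself is kept.

Example output structure (defaults: input_key='instruction', output_key='new_code'):

- instruction: the natural-language coding instruction
- new_code (str): the generated Python code string

Args:
    model: LazyLLM model object (required).
    prompt_template (str|None): Optional custom system prompt.
    input_key (str): Name of the input instruction field. Defaults to 'instruction'.
    output_key (str): Name of the output code field. Defaults to 'new_code'.
    **kwargs: Additional arguments passed to the base operator.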


Examples:
    ```python
    from lazyllm.tools.data.operators.codegen_ops import ScriptSynthesizer

    op = ScriptSynthesizer(model=model,
                           input_key='instruction',
                           output_key='new_code')
    item = {
        'instruction': 'Write a Python function that prints "hello".'
    }
    res = op(item)
    print(res)
    # {
    #   'instruction': 'Write a Python function that prints "hello".',
    #   'new_code': "def solution():\n    print('hello')"
    # }
    ```
    """
    def __init__(self, model=None, prompt_template=None, input_key='instruction', output_key='new_code', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        sys_prompt = prompt_template or (
            'You are a senior Python engineer.\n'
            'Given a natural language instruction, generate the corresponding Python code.\n'
            'Return only the code. If you include a Markdown code block, use ```python ... ```.\n'
        )
        self.model = model.share().prompt(sys_prompt)

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.input_key not in data:
            raise ValueError(f'Missing required key: {self.input_key}')
        if self.output_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_key}')
        instruction = data.get(self.input_key, '')
        response = self.model(instruction)
        data[self.output_key] = _parse_code(response)
        return data
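
The source of `_parse_code` is not shown on this page. As an illustration only, fence stripping of the kind the description mentions could look like the sketch below; `strip_code_fence` and its regular expression are assumptions, not the library's implementation.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe
# Optional language tag after the opening fence, then the body up to the next fence.
_BLOCK = re.compile(FENCE + r"[A-Za-z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def strip_code_fence(response: str) -> str:
    """Hypothetical helper: return the body of the first fenced block, else the text itself."""
    match = _BLOCK.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()


reply = f"Here is the code:\n{FENCE}python\ndef solution():\n    print('hello')\n{FENCE}"
print(strip_code_fence(reply))
```

Falling back to the stripped raw text keeps responses that already follow the "return only the code" instruction intact.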

ThresholdSieve

Bases: CodeGenOps

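Code generation pipeline operator: code quality score filter.

Filters samples by score range, based on the scores produced by LogicIntegrityAuditor:

  • If the sample does not yet contain quality_score/feedback, the internal scorer is called first to evaluate it;
  • If the score falls within [min_score, max_score], the sample is labeled and kept;
  • Otherwise an empty list [] is returned, meaning the sample is filtered out of the pipeline.

Example output structure (default output_key='quality_score_filter_label'):

  • instruction: ...
  • new_code: ...
  • quality_score: 8
  • feedback: 'Good code. ...'
  • quality_score_filter_label: 1 (1 when the sample passes the filter; samples that fail are dropped)

Parameters:

  • model

    LazyLLM model object (required), used for the internal evaluation.

  • min_score (int, default: 7 ) –

    Minimum score (inclusive) that passes the filter. Defaults to 7.

  • max_score (int, default: 10 ) –

    Maximum score (inclusive) that passes the filter. Defaults to 10.

  • input_instruction_key (str, default: 'instruction' ) –

    Name of the input instruction field. Defaults to 'instruction'.

  • input_code_key (str, default: 'new_code' ) –

    Name of the input code field. Defaults to 'new_code'.

  • output_score_key (str, default: 'quality_score' ) –

    Name of the score field. Defaults to 'quality_score'.

  • output_feedback_key (str, default: 'feedback' ) –

    Name of the feedback field. Defaults to 'feedback'.

  • output_key (str, default: 'quality_score_filter_label' ) –

    Name of the filter label field. Defaults to 'quality_score_filter_label'.

  • **kwargs

    Additional arguments passed to the base operator.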

Examples:

from lazyllm.tools.data.operators.codegen_ops import ThresholdSieve

op = ThresholdSieve(model=model, min_score=7, max_score=10)
item = {
    'instruction': "Write a Python function that prints 'hello'.",
    'new_code': "def solution():\n    print('hello')"
}
res = op(item)
print(res)
# {
#   'instruction': '...',
#   'new_code': '...',
#   'quality_score': 8,
#   'feedback': 'Good code. The logic is clear and follows PEP8.',
#   'quality_score_filter_label': 1
# }
Source code in lazyllm/tools/data/operators/codegen_ops.py
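class ThresholdSieve(CodeGenOps):
    """Code generation pipeline operator: code quality score filter.

Filters samples by score range, based on the scores produced by LogicIntegrityAuditor:

- If the sample does not yet contain quality_score/feedback, the internal scorer is called first to evaluate it;
- If the score falls within [min_score, max_score], the sample is labeled and kept;
- Otherwise an empty list [] is returned, meaning the sample is filtered out of the pipeline.

Example output structure (default output_key='quality_score_filter_label'):

- instruction: ...
- new_code: ...
- quality_score: 8
- feedback: 'Good code. ...'
- quality_score_filter_label: 1  (1 when the sample passes the filter; samples that fail are dropped)

Args:
    model: LazyLLM model object (required), used for the internal evaluation.
    min_score (int): Minimum score (inclusive) that passes the filter. Defaults to 7.
    max_score (int): Maximum score (inclusive) that passes the filter. Defaults to 10.
    input_instruction_key (str): Name of the input instruction field. Defaults to 'instruction'.
    input_code_key (str): Name of the input code field. Defaults to 'new_code'.
    output_score_key (str): Name of the score field. Defaults to 'quality_score'.
    output_feedback_key (str): Name of the feedback field. Defaults to 'feedback'.
    output_key (str): Name of the filter label field. Defaults to 'quality_score_filter_label'.
    **kwargs: Additional arguments passed to the base operator.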


Examples:
    ```python
    from lazyllm.tools.data.operators.codegen_ops import ThresholdSieve

    op = ThresholdSieve(model=model, min_score=7, max_score=10)
    item = {
        'instruction': "Write a Python function that prints 'hello'.",
        'new_code': "def solution():\n    print('hello')"
    }
    res = op(item)
    print(res)
    # {
    #   'instruction': '...',
    #   'new_code': '...',
    #   'quality_score': 8,
    #   'feedback': 'Good code. The logic is clear and follows PEP8.',
    #   'quality_score_filter_label': 1
    # }
    ```
    """
    def __init__(
        self,
        model=None,
        min_score: int = 7,
        max_score: int = 10,
        input_instruction_key: str = 'instruction',
        input_code_key: str = 'new_code',
        output_score_key: str = 'quality_score',
        output_feedback_key: str = 'feedback',
        output_key: str = 'quality_score_filter_label',
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.model = model
        self.min_score = min_score
        self.max_score = max_score
        self.input_instruction_key = input_instruction_key
        self.input_code_key = input_code_key
        self.output_score_key = output_score_key
        self.output_feedback_key = output_feedback_key
        self.output_key = output_key
        self.scorer = LogicIntegrityAuditor(
            model=model,
            input_instruction_key=input_instruction_key,
            input_code_key=input_code_key,
            output_score_key=output_score_key,
            output_feedback_key=output_feedback_key,
        )

    def forward(self, data, **kwargs):
        assert isinstance(data, dict)
        if self.model is None:
            raise ValueError('model is required')
        if self.output_key in data:
            raise ValueError(f'The following key already exists and would be overwritten: {self.output_key}')

        if self.output_score_key not in data:
            data = self.scorer.forward(data)

        score = data.get(self.output_score_key, 0)
        try:
            score_int = int(score)
        except (ValueError, TypeError):
            score_int = 0
        pass_filter = (self.min_score <= score_int <= self.max_score)
        data[self.output_key] = 1 if pass_filter else 0
        if pass_filter:
            return data
        return []
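
The range check in `ThresholdSieve.forward` can be reproduced without a model. The following is a minimal sketch under the assumption that the sample has already been scored; `sieve` is a hypothetical standalone function, not part of lazyllm.

```python
from typing import Any, Dict, List, Union


def sieve(data: Dict[str, Any], min_score: int = 7, max_score: int = 10,
          score_key: str = 'quality_score',
          label_key: str = 'quality_score_filter_label') -> Union[Dict[str, Any], List]:
    """Hypothetical helper: keep and label the sample if its score is in range, else return []."""
    try:
        score_int = int(data.get(score_key, 0))  # int() truncates floats such as 6.9 to 6
    except (ValueError, TypeError):
        score_int = 0  # unparsable scores count as 0
    passed = min_score <= score_int <= max_score
    data[label_key] = 1 if passed else 0
    return data if passed else []


kept = sieve({'quality_score': 8})
print(kept)
print(sieve({'quality_score': 6.9}), sieve({'quality_score': 'n/a'}))
```

Because the score is coerced with `int()`, a float score just below `min_score` is truncated rather than rounded, so 6.9 does not pass a threshold of 7.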

Agentic RAG

lazyllm.tools.data.operators.agentic_rag.agenticrag_atomic_task_generator

AgenticRAGCleanQA

Bases: agenticrag

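Cleans generated question-answer pairs and normalizes their answers. Calls the LLM to produce refined_answer, which is used for later verification and scoring.

Parameters:

  • llm

    Language model service instance.

  • **kwargs (dict, default: {} ) –

    Other optional arguments.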

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGCleanQA(llm=my_llm)
result = op({'question': 'What is...', 'answer': 'Raw answer'})
print(result['refined_answer'])
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
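class AgenticRAGCleanQA(agenticrag):
    """Cleans generated question-answer pairs and normalizes their answers. Calls the LLM to produce refined_answer, which is used for later verification and scoring.

Args:
    llm: Language model service instance.
    **kwargs (dict): Other optional arguments.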


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGCleanQA(llm=my_llm)
    result = op({'question': 'What is...', 'answer': 'Raw answer'})
    print(result['refined_answer'])
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = RAGQARefinementPrompt()
        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')
        question = data.get('question', '')
        answer = data.get('answer', '')

        user_prompt = self.prompt_template.build_prompt(
            {'question': question, 'original_answer': answer}
        )

        try:
            result = self._llm_serve(user_prompt)
            if isinstance(result, dict):
                data['refined_answer'] = str(result.get('refined_answer', ''))
            else:
                data['refined_answer'] = ''
        except Exception as e:
            LOG.warning(f'Failed to clean QA: {e}')
            data['refined_answer'] = ''

        return data

AgenticRAGExpandConclusions

Bases: agenticrag

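Parses the JSON list of conclusions in the raw_conclusion field and expands it into multiple candidate-task rows.

Only entries that contain both the 'conclusion' and 'R' fields are kept. Each kept entry becomes a separate data row, with the entry written to candidate_tasks_str.

Parameters:

  • max_per_task (int, default: 10 ) –

    Maximum number of candidate tasks to expand per sample.

  • **kwargs (dict, default: {} ) –

    Other optional arguments.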

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGExpandConclusions(max_per_task=5)
rows = op({
    'raw_conclusion': '[{"conclusion":"A","R":"rel"}]',
    'identifier': 'doc1'
})
print(rows)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
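class AgenticRAGExpandConclusions(agenticrag):
    """Parses the JSON list of conclusions in the raw_conclusion field
and expands it into multiple candidate-task rows.

Only entries that contain both the 'conclusion' and 'R' fields are kept.
Each kept entry becomes a separate data row, with the entry written to candidate_tasks_str.

Args:
    max_per_task (int): Maximum number of candidate tasks to expand per sample.
    **kwargs (dict): Other optional arguments.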


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGExpandConclusions(max_per_task=5)
    rows = op({
        'raw_conclusion': '[{"conclusion":"A","R":"rel"}]',
        'identifier': 'doc1'
    })
    print(rows)
    ```
    """

    def __init__(self, max_per_task: int = 10, **kwargs):
        super().__init__(**kwargs)
        self.max_per_task = max_per_task

    def forward(self, data: dict) -> List[dict]:
        conclusion_str = data.get('raw_conclusion', '')
        identifier = data.get('identifier', '')

        if not conclusion_str:
            return []

        try:
            parsed = json.loads(_extract_json_content(conclusion_str))
            if isinstance(parsed, list):
                parsed = parsed[:self.max_per_task]
            else:
                return []
        except Exception as e:
            LOG.warning(f'Failed to parse conclusion JSON: {e}')
            return []

        expanded_rows = []
        for item in parsed:
            if isinstance(item, dict) and 'conclusion' in item and 'R' in item:
                new_row = data.copy()
                new_row['candidate_tasks_str'] = json.dumps(item, ensure_ascii=False)
                new_row['identifier'] = str(identifier)
                expanded_rows.append(new_row)

        return expanded_rows
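
The expansion logic above can be tried without lazyllm. This sketch assumes `raw_conclusion` is already plain JSON, so it skips `_extract_json_content` (whose source is not shown here); `expand_conclusions` is a hypothetical standalone name.

```python
import json
from typing import List


def expand_conclusions(data: dict, max_per_task: int = 10) -> List[dict]:
    """Hypothetical helper: one output row per valid conclusion entry."""
    raw = data.get('raw_conclusion', '')
    if not raw:
        return []
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(parsed, list):
        return []
    rows = []
    # Truncation happens before filtering, as in forward().
    for item in parsed[:max_per_task]:
        if isinstance(item, dict) and 'conclusion' in item and 'R' in item:
            row = data.copy()
            row['candidate_tasks_str'] = json.dumps(item, ensure_ascii=False)
            row['identifier'] = str(data.get('identifier', ''))
            rows.append(row)
    return rows


sample = {
    'raw_conclusion': '[{"conclusion":"A","R":"rel"},{"conclusion":"B"},{"conclusion":"C","R":"rel"}]',
    'identifier': 'doc1',
}
rows = expand_conclusions(sample, max_per_task=2)
print(rows)
```

Since the list is cut to `max_per_task` entries before invalid entries are dropped, a sample can yield fewer rows than `max_per_task` even when more valid entries exist further down the list.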

AgenticRAGGenerateQuestion

Bases: agenticrag

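Operator that generates a question (question) and its reference answer (answer) from the main content identifier (ID), the relation (R), and the answer (A).

Parameters:

  • llm

    Language model service instance.

  • **kwargs (dict, default: {} ) –

    Other optional arguments.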

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGenerateQuestion(llm=my_llm)
result = op({
    'candidate_tasks_str': '{"conclusion":"Paris","R":"capital_of"}',
    'identifier': 'France'
})
print(result['question'], result['answer'])
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
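class AgenticRAGGenerateQuestion(agenticrag):
    """Operator that generates a question (question) and its reference answer (answer) from the main content identifier (ID), the relation (R), and the answer (A).

Args:
    llm: Language model service instance.
    **kwargs (dict): Other optional arguments.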


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGenerateQuestion(llm=my_llm)
    result = op({
        'candidate_tasks_str': '{"conclusion":"Paris","R":"capital_of"}',
        'identifier': 'France'
    })
    print(result['question'], result['answer'])
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = RAGTaskToQuestionPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict):
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')
        candidate_str = data.get('candidate_tasks_str', '')
        identifier = data.get('identifier', '')
        try:
            task_item = json.loads(_extract_json_content(candidate_str))
            conclusion = task_item.get('conclusion', '')
            relation = task_item.get('R', '')
            user_prompt = self.prompt_template.build_prompt(
                identifier, conclusion, relation
            )

            result = self._llm_serve(user_prompt)
            if isinstance(result, dict) and 'Q' in result:
                data['question'] = str(result['Q'])
                data['answer'] = str(conclusion)
                return data
        except Exception as e:
            LOG.warning(f'Failed to generate question: {e}')

        return []
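
The control flow of `AgenticRAGGenerateQuestion.forward` can be followed with a stubbed model. Everything here is illustrative: `fake_llm` is a hypothetical stand-in for the JSON-formatted model call, the prompt string is invented (the real one comes from `RAGTaskToQuestionPrompt`), and plain `json.loads` replaces `_extract_json_content`.

```python
import json


def fake_llm(prompt: str) -> dict:
    """Hypothetical stand-in for the model; always returns a fixed question."""
    return {'Q': 'Which city is the capital of France?'}


def generate_question(data: dict, llm=fake_llm):
    """Hypothetical helper: return the enriched sample on success, [] on any failure."""
    try:
        task = json.loads(data.get('candidate_tasks_str', ''))
        conclusion = task.get('conclusion', '')
        # Invented prompt layout, for illustration only.
        prompt = f"ID: {data.get('identifier', '')}\nR: {task.get('R', '')}\nA: {conclusion}"
        result = llm(prompt)
        if isinstance(result, dict) and 'Q' in result:
            data['question'] = str(result['Q'])
            data['answer'] = str(conclusion)  # the answer is the conclusion itself
            return data
    except Exception:
        pass
    return []


ok = generate_question({
    'candidate_tasks_str': '{"conclusion":"Paris","R":"capital_of"}',
    'identifier': 'France',
})
print(ok)
```

The answer is taken from the parsed conclusion rather than from the model, so the model only contributes the question text.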

AgenticRAGGetConclusion

Bases: agenticrag

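Operator that calls the LLM to extract conclusions and generate relations.

The operator builds a prompt from the input text and stores the model's raw output in data['raw_conclusion'] for later JSON parsing and task expansion. If generation fails, an empty string is written instead.

Parameters:

  • llm

    Language model service instance.

  • input_key (str, default: 'prompts' ) –

    Name of the input text field. Defaults to 'prompts'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments.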

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGetConclusion(llm=my_llm)
result = op({'prompts': 'Some document content'})
print(result['raw_conclusion'])
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
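class AgenticRAGGetConclusion(agenticrag):
    """Operator that calls the LLM to extract conclusions and generate relations.

The operator builds a prompt from the input text and stores the model's raw output
in data['raw_conclusion'] for later JSON parsing and task expansion.
If generation fails, an empty string is written instead.

Args:
    llm: Language model service instance.
    input_key (str): Name of the input text field. Defaults to 'prompts'.
    **kwargs (dict): Other optional arguments.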


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGetConclusion(llm=my_llm)
    result = op({'prompts': 'Some document content'})
    print(result['raw_conclusion'])
    ```
    """

    def __init__(self, llm=None, input_key: str = 'prompts', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.prompt_template = RAGFactsConclusionPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt)
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')
        content = data.get(self.input_key, '')
        user_prompt = self.prompt_template.build_prompt(content)

        try:
            result = self._llm_serve(user_prompt)
            data['raw_conclusion'] = result
        except Exception as e:
            LOG.warning(f'Failed to extract conclusion: {e}')
            data['raw_conclusion'] = ''

        return data

AgenticRAGGetIdentifier

Bases: agenticrag

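Operator that calls the LLM to extract a content identifier (identifier) from the input text.

Parameters:

  • llm

    Language model service instance.

  • input_key (str, default: 'prompts' ) –

    Name of the input text field. Defaults to 'prompts'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments.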

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGetIdentifier(llm=my_llm, input_key='prompts')
result = op({'prompts': 'What is the third movie in the Avatar series?'})
print('identifier:', result['identifier'])
# e.g. identifier: Avatar series
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
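class AgenticRAGGetIdentifier(agenticrag):
    """Operator that calls the LLM to extract a content identifier (identifier) from the input text.

Args:
    llm: Language model service instance.
    input_key (str): Name of the input text field. Defaults to 'prompts'.
    **kwargs (dict): Other optional arguments.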


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGetIdentifier(llm=my_llm, input_key='prompts')
    result = op({'prompts': 'What is the third movie in the Avatar series?'})
    print('identifier:', result['identifier'])
    # e.g. identifier: Avatar series
    ```
    """

    def __init__(self, llm=None, input_key: str = 'prompts', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.prompt_template = RAGContentIdExtractorPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        content = data.get(self.input_key, '')
        user_prompt = self.prompt_template.build_prompt(content)

        try:
            result = self._llm_serve(user_prompt)
            if isinstance(result, dict):
                data['identifier'] = result.get('content_identifier', '')
            else:
                data['identifier'] = ''
        except Exception as e:
            LOG.warning(f'Failed to extract identifier: {e}')
            data['identifier'] = ''

        return data

AgenticRAGGoldenDocAnswer

Bases: agenticrag

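Generates an answer from the golden document and validates it by scoring.

Uses golden_doc and question to generate an answer, then scores it against refined_answer. If the score is below 1, the sample is filtered out.

Parameters:

  • llm

    Language model service instance.

  • input_key (str, default: 'prompts' ) –

    Name of the golden document field.

  • **kwargs (dict, default: {} ) –

    Other optional arguments.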

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGoldenDocAnswer(llm=my_llm)
result = op({
    'prompts': 'Golden document text',
    'question': 'Q?',
    'refined_answer': 'Expected A'
})
print(result)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
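class AgenticRAGGoldenDocAnswer(agenticrag):
    """Generates an answer from the golden document and validates it by scoring.

Uses golden_doc and question to generate an answer,
then scores it against refined_answer.
If the score is below 1, the sample is filtered out.

Args:
    llm: Language model service instance.
    input_key (str): Name of the golden document field.
    **kwargs (dict): Other optional arguments.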


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGoldenDocAnswer(llm=my_llm)
    result = op({
        'prompts': 'Golden document text',
        'question': 'Q?',
        'refined_answer': 'Expected A'
    })
    print(result)
    ```
    """

    def __init__(self, llm=None, input_key: str = 'prompts', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.prompt_template = RAGDocGroundedAnswerPrompt()
        self.score_template = RAGConsistencyScoringPrompt()
        if llm is not None:
            self._llm_answer_serve = llm.share()
            self._llm_answer_serve.start()
            score_system_prompt = self.score_template.build_system_prompt()
            self._llm_score_serve = llm.share().prompt(score_system_prompt).formatter(JsonFormatter())
            self._llm_score_serve.start()
        else:
            self._llm_answer_serve = None
            self._llm_score_serve = None

    def forward(self, data: dict):
        if self._llm_answer_serve is None or self._llm_score_serve is None:
            raise ValueError('LLM is not configured')
        golden_doc = data.get(self.input_key, '')
        question = data.get('question', '')
        refined_answer = data.get('refined_answer', '')

        user_prompt = self.prompt_template.build_prompt(
            golden_doc, question
        )
        try:
            golden_doc_answer = self._llm_answer_serve(user_prompt)
            data['golden_doc_answer'] = golden_doc_answer
        except Exception as e:
            LOG.warning(f'Failed to get golden doc answer: {e}')
            return []

        score_prompt = self.score_template.build_prompt(
            refined_answer, golden_doc_answer
        )

        try:
            score_result = self._llm_score_serve(score_prompt)
            if isinstance(score_result, dict):
                score = score_result.get('answer_score', 0)
                data['golden_doc_score'] = score

                if score < 1:
                    return []
            else:
                return []
        except Exception as e:
            LOG.warning(f'Failed to calculate golden doc score: {e}')
            return []

        return data

AgenticRAGGroupAndLimit

Bases: agenticrag

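Groups rows by a given field and limits the number of QA pairs per group.

Batch data is grouped by input_key and at most max_question rows are kept per group, which limits the number of samples from the same source.

Parameters:

  • input_key (str, default: 'prompts' ) –

    Name of the field to group by.

  • max_question (int, default: 10 ) –

    Maximum number of QA pairs per group.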

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGGroupAndLimit(input_key='prompts', max_question=2)
result = op([
    {'prompts': 'doc1', 'question': 'Q1'},
    {'prompts': 'doc1', 'question': 'Q2'},
    {'prompts': 'doc1', 'question': 'Q3'}
])
print(result)  # only 2 kept for doc1
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
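class AgenticRAGGroupAndLimit(agenticrag):
    """Groups rows by a given field and limits the number of QA pairs per group.

Batch data is grouped by input_key and
at most max_question rows are kept per group,
which limits the number of samples from the same source.

Args:
    input_key (str): Name of the field to group by.
    max_question (int): Maximum number of QA pairs per group.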


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGGroupAndLimit(input_key='prompts', max_question=2)
    result = op([
        {'prompts': 'doc1', 'question': 'Q1'},
        {'prompts': 'doc1', 'question': 'Q2'},
        {'prompts': 'doc1', 'question': 'Q3'}
    ])
    print(result)  # only 2 kept for doc1
    ```
    """

    def __init__(
        self,
        input_key: str = 'prompts',
        max_question: int = 10,
        **kwargs,
    ):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.input_key = input_key
        self.max_question = max_question

    def forward_batch_input(self, data: List[dict]) -> List[dict]:
        grouped_data = {}

        for item in data:
            key_value = item.get(self.input_key, '')
            grouped_data.setdefault(key_value, [])

            if len(grouped_data[key_value]) < self.max_question:
                grouped_data[key_value].append(item)

        result_list = []
        for items in grouped_data.values():
            result_list.extend(items)

        LOG.info(f'Grouped and limited to {len(result_list)} QA pairs')
        return result_list
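
The grouping step in `forward_batch_input` is plain Python and can be exercised on its own. The sketch below uses a hypothetical standalone function, `group_and_limit`, that follows the same logic without the operator base class or logging.

```python
from typing import Dict, List


def group_and_limit(rows: List[dict], input_key: str = 'prompts',
                    max_question: int = 10) -> List[dict]:
    """Hypothetical helper: keep at most max_question rows per distinct input_key value."""
    grouped: Dict[str, List[dict]] = {}
    for row in rows:
        bucket = grouped.setdefault(row.get(input_key, ''), [])
        if len(bucket) < max_question:
            bucket.append(row)
    # Groups are emitted in first-seen order, each group contiguous.
    return [row for bucket in grouped.values() for row in bucket]


rows = [
    {'prompts': 'doc1', 'question': 'Q1'},
    {'prompts': 'doc2', 'question': 'Q2'},
    {'prompts': 'doc1', 'question': 'Q3'},
    {'prompts': 'doc1', 'question': 'Q4'},
]
limited = group_and_limit(rows, max_question=2)
print([r['question'] for r in limited])
```

With interleaved input the output is regrouped by key rather than kept in the original order, and rows missing `input_key` all share the empty-string group.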

AgenticRAGLLMVerify

Bases: agenticrag

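Uses the LLM to answer each question and verifies the QA pair with a recall score.

The model first generates llm_answer from question, then refined_answer is scored against llm_answer. If the score is >= 1, the sample is filtered out; otherwise it is kept.

Parameters:

  • llm

    Language model service instance.

  • **kwargs (dict, default: {} ) –

    Other optional arguments.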

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGLLMVerify(llm=my_llm)
result = op({'question': 'Q?', 'refined_answer': 'A'})
print(result)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
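class AgenticRAGLLMVerify(agenticrag):
    """Uses the LLM to answer each question and verifies the QA pair with a recall score.

The model first generates llm_answer from question,
then refined_answer is scored against llm_answer.
If the score is >= 1, the sample is filtered out; otherwise it is kept.

Args:
    llm: Language model service instance.
    **kwargs (dict): Other optional arguments.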


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGLLMVerify(llm=my_llm)
    result = op({'question': 'Q?', 'refined_answer': 'A'})
    print(result)
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = RAGTaskSolverPrompt()
        self.score_template = RAGConsistencyScoringPrompt()
        if llm is not None:
            self._llm_answer_serve = llm.share()
            self._llm_answer_serve.start()
            score_system_prompt = self.score_template.build_system_prompt()
            self._llm_score_serve = llm.share().prompt(score_system_prompt).formatter(JsonFormatter())
            self._llm_score_serve.start()
        else:
            self._llm_answer_serve = None
            self._llm_score_serve = None

    def forward(self, data: dict):
        if self._llm_answer_serve is None or self._llm_score_serve is None:
            raise ValueError('LLM is not configured')
        question = data.get('question', '')
        refined_answer = data.get('refined_answer', '')

        user_prompt = self.prompt_template.build_prompt(question)
        try:
            llm_answer = self._llm_answer_serve(user_prompt)
            data['llm_answer'] = llm_answer
        except Exception as e:
            LOG.warning(f'Failed to get LLM answer: {e}')
            return []

        score_prompt = self.score_template.build_prompt(
            refined_answer, llm_answer
        )

        try:
            score_result = self._llm_score_serve(score_prompt)
            if isinstance(score_result, dict):
                score = score_result.get('answer_score', 0)
                data['llm_score'] = score

                if score >= 1:
                    return []
            else:
                data['llm_score'] = 0
        except Exception as e:
            LOG.warning(f'Failed to calculate recall score: {e}')
            data['llm_score'] = 0

        return data
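
`AgenticRAGGoldenDocAnswer` and `AgenticRAGLLMVerify` share the same scoring prompt but apply opposite keep/drop rules. The sketch below isolates just those two decisions; the function names are hypothetical, and the reading that the second filter removes questions answerable without the document is an interpretation of the code, not something the source states.

```python
from typing import Any


def keep_after_golden_doc(score_result: Any) -> bool:
    """Golden-doc check: keep only when the document-grounded answer scores at least 1."""
    if not isinstance(score_result, dict):
        return False  # unparsable score: the sample is dropped
    return score_result.get('answer_score', 0) >= 1


def keep_after_llm_verify(score_result: Any) -> bool:
    """LLM check: keep only when the model's unaided answer scores below 1."""
    if not isinstance(score_result, dict):
        return True  # unparsable score: llm_score is set to 0 and the sample is kept
    return score_result.get('answer_score', 0) < 1


for result in ({'answer_score': 2}, {'answer_score': 0}, 'garbled'):
    print(result, keep_after_golden_doc(result), keep_after_llm_verify(result))
```

The two operators also differ on failure: an unparsable score drops the sample in the golden-document check but keeps it, with `llm_score` set to 0, in the LLM check.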

AgenticRAGOptionalAnswers

Bases: agenticrag

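Generates multiple optional answers for the reference answer.

Calls the LLM on refined_answer to generate a list of semantically equivalent or closely paraphrased answers, and writes it to the optional_answer field.

Parameters:

  • llm

    Language model service instance.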

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.AgenticRAGOptionalAnswers(llm=my_llm)
result = op({'refined_answer': 'Paris'})
print(result['optional_answer'])
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_atomic_task_generator.py
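class AgenticRAGOptionalAnswers(agenticrag):
    """Generates multiple optional answers for the reference answer.

Calls the LLM on refined_answer
to generate a list of semantically equivalent or closely paraphrased answers,
and writes it to the optional_answer field.

Args:
    llm: Language model service instance.

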
Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.AgenticRAGOptionalAnswers(llm=my_llm)
    result = op({'refined_answer': 'Paris'})
    print(result['optional_answer'])
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = RAGAnswerVariantsPrompt()
        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')
        refined_answer = data.get('refined_answer', '')

        user_prompt = self.prompt_template.build_prompt(refined_answer)

        try:
            result = self._llm_serve(user_prompt)
            if isinstance(result, list):
                data['optional_answer'] = result
            else:
                data['optional_answer'] = [refined_answer]
        except Exception as e:
            LOG.warning(f'Failed to generate optional answers: {e}')
            data['optional_answer'] = [refined_answer]

        return data
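
The fallback in `AgenticRAGOptionalAnswers.forward` guarantees that `optional_answer` is always a list. A minimal sketch of that rule, using a hypothetical helper name:

```python
from typing import Any, List


def optional_answers(result: Any, refined_answer: str) -> List[Any]:
    """Hypothetical helper: use the model's list as-is, else fall back to the single reference answer."""
    if isinstance(result, list):
        return result
    return [refined_answer]


variants = optional_answers(['Paris', 'City of Paris'], 'Paris')
fallback = optional_answers({'answers': ['Paris']}, 'Paris')
print(variants, fallback)
```

Only a top-level JSON list is accepted, so a model that wraps the list in an object triggers the single-answer fallback.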

lazyllm.tools.data.operators.agentic_rag.agenticrag_depth_qa_generator

DepthQAGBackwardTask

Bases: agenticrag

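Generates a backward task from an existing identifier, producing a new identifier and a relation.

This operator reasons backward from the given identifier to generate a new identifier and the corresponding relation, for building deep question-answering tasks.

Parameters:

  • llm

    Language model service instance.

  • identifier_key (str, default: 'identifier' ) –

    Name of the original identifier field. Defaults to 'identifier'.

  • new_identifier_key (str, default: 'new_identifier' ) –

    Name of the newly generated identifier field. Defaults to 'new_identifier'.

  • relation_key (str, default: 'relation' ) –

    Name of the relation field. Defaults to 'relation'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments.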

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.DepthQAGBackwardTask(llm=my_llm)
result = op({'identifier': 'machine learning'})
print(result)
# {'identifier': 'machine learning', 'new_identifier': '...', 'relation': '...'}
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
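class DepthQAGBackwardTask(agenticrag):
    """Generates a backward task from an existing identifier, producing a new identifier and a relation.

This operator reasons backward from the given identifier to produce a new identifier and the
corresponding relation, which are used to build depth QA tasks. If the LLM call fails or the reply
lacks the 'identifier' or 'relation' key, the operator returns an empty list, which drops the sample.

Args:
    llm: Language model service instance.
    identifier_key (str): Field name of the original identifier. Defaults to 'identifier'.
    new_identifier_key (str): Field name for the newly generated identifier. Defaults to 'new_identifier'.
    relation_key (str): Field name for the relation. Defaults to 'relation'.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.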


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.DepthQAGBackwardTask(llm=my_llm)
    result = op({'identifier': 'machine learning'})
    print(result)
    # {'identifier': 'machine learning', 'new_identifier': '...', 'relation': '...'}
    ```
    """

    def __init__(self, llm=None, identifier_key: str = 'identifier',
                 new_identifier_key: str = 'new_identifier', relation_key: str = 'relation', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.identifier_key = identifier_key
        self.new_identifier_key = new_identifier_key
        self.relation_key = relation_key
        self.prompt_template = RAGDepthBackwardSupersetPrompt()

        if llm is not None:
            self._llm_serve = llm.share().formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        identifier = data.get(self.identifier_key, '')
        user_prompt = self.prompt_template.build_prompt(identifier)

        try:
            result = self._llm_serve(user_prompt)
            parsed = self._parse_backward_result(result)
            if parsed is not None:
                data[self.new_identifier_key] = parsed['identifier']
                data[self.relation_key] = parsed['relation']
                return data
        except Exception as e:
            LOG.warning(f'Failed to generate backward task: {e}')

        return []

    def _parse_backward_result(self, result) -> Optional[dict]:
        try:
            if isinstance(result, dict) and 'identifier' in result and 'relation' in result:
                return result
            LOG.warning('[Skipped]: Invalid backward result')
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse backward result: {e}')
        return None

DepthQAGCheckSuperset

Bases: agenticrag

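Checks whether the newly generated query is a superset of the original identifier.

Verifies whether new_identifier combined with relation forms a valid superset query of the original identifier. If validation passes, the record is kept; otherwise an empty list is returned, which filters out the sample.

Parameters:

  • llm

    Language model service instance.

  • new_identifier_key (str, default: 'new_identifier' ) –

    Field name of the new identifier.

  • relation_key (str, default: 'relation' ) –

    Field name of the relation.

  • identifier_key (str, default: 'identifier' ) –

    Field name of the original identifier.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, forwarded to the base class constructor.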

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.DepthQAGCheckSuperset(llm=my_llm)
result = op({
    'identifier': 'Paris',
    'new_identifier': 'France',
    'relation': 'capital_of'
})
print(result)  # returns data if valid, empty list if invalid
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
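class DepthQAGCheckSuperset(agenticrag):
    """Checks whether the newly generated query is a superset of the original identifier.

Verifies whether new_identifier combined with relation forms a valid superset query of the
original identifier. If validation passes, the record is kept; otherwise an empty list is
returned, which filters out the sample.

Args:
    llm: Language model service instance.
    new_identifier_key (str): Field name of the new identifier. Defaults to 'new_identifier'.
    relation_key (str): Field name of the relation. Defaults to 'relation'.
    identifier_key (str): Field name of the original identifier. Defaults to 'identifier'.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.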


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.DepthQAGCheckSuperset(llm=my_llm)
    result = op({
        'identifier': 'Paris',
        'new_identifier': 'France',
        'relation': 'capital_of'
    })
    print(result)  # returns data if valid, empty list if invalid
    ```
    """

    def __init__(self, llm=None, new_identifier_key: str = 'new_identifier',
                 relation_key: str = 'relation', identifier_key: str = 'identifier', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.new_identifier_key = new_identifier_key
        self.relation_key = relation_key
        self.identifier_key = identifier_key
        self.prompt_template = RAGDepthSupersetValidationPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        new_identifier = data.get(self.new_identifier_key, '')
        relation = data.get(self.relation_key, '')
        identifier = data.get(self.identifier_key, '')

        user_prompt = self.prompt_template.build_prompt(new_identifier, relation, identifier)

        try:
            result = self._llm_serve(user_prompt)
            if self._is_valid_superset(result):
                return data
        except Exception as e:
            LOG.warning(f'Failed to check superset: {e}')

        return []

    def _is_valid_superset(self, result) -> bool:
        try:
            if isinstance(result, dict):
                return result.get('new_query') == 'valid'
        except Exception as e:
            LOG.warning(f'[Error]: Failed to check superset: {e}')
        return False
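
The keep-or-drop convention used by this operator (return the record to keep it, return an empty list to drop it) can be illustrated without an LLM. The following is a minimal sketch rather than LazyLLM code: `check_superset`, `run_filter`, and the stubbed `fake_llm` are hypothetical names, and the way results are collected here is only an assumption about how a caller might consume the outputs.

```python
from typing import Any, Callable, Dict, List, Union

Record = Dict[str, Any]


def check_superset(data: Record, llm: Callable[[Record], Any]) -> Union[Record, list]:
    """Keep the record only when the reply is a dict with new_query == 'valid'."""
    try:
        result = llm(data)
        if isinstance(result, dict) and result.get('new_query') == 'valid':
            return data
    except Exception:
        pass  # any failure is treated as "not valid"
    return []


def run_filter(records: List[Record], llm: Callable[[Record], Any]) -> List[Record]:
    """Apply the check to every record and discard the empty-list results."""
    kept = []
    for record in records:
        out = check_superset(record, llm)
        if out:  # [] is falsy, so dropped samples are skipped
            kept.append(out)
    return kept


def fake_llm(data: Record) -> Record:
    # Hypothetical stub: only records whose new identifier is 'France' are judged valid.
    ok = data.get('new_identifier') == 'France'
    return {'new_query': 'valid' if ok else 'invalid'}


records = [
    {'identifier': 'Paris', 'new_identifier': 'France', 'relation': 'capital_of'},
    {'identifier': 'Paris', 'new_identifier': 'Mars', 'relation': 'capital_of'},
]
kept = run_filter(records, fake_llm)
print(len(kept))
```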

DepthQAGGenerateQuestion

Bases: agenticrag

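Generates a depth question from the new identifier, the relation, and the original identifier.

Uses the LLM to generate the question for a depth QA task from new_identifier, relation, and identifier, and stores it in the field named by question_key. If the LLM call fails or the reply has no 'new_query' key, an empty list is returned and the sample is dropped.

Parameters:

  • llm

    Language model service instance.

  • new_identifier_key (str, default: 'new_identifier' ) –

    Field name of the new identifier.

  • relation_key (str, default: 'relation' ) –

    Field name of the relation.

  • identifier_key (str, default: 'identifier' ) –

    Field name of the original identifier.

  • question_key (str, default: 'depth_question' ) –

    Field name that receives the generated question.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, forwarded to the base class constructor.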

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.DepthQAGGenerateQuestion(llm=my_llm)
result = op({
    'identifier': 'Paris',
    'new_identifier': 'France',
    'relation': 'capital_of'
})
print(result['depth_question'])
# 'What is the capital of France?'
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
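class DepthQAGGenerateQuestion(agenticrag):
    """Generates a depth question from the new identifier, the relation, and the original identifier.

Uses the LLM to generate the question for a depth QA task from new_identifier, relation, and
identifier, and stores it in the field named by question_key. If the LLM call fails or the reply
has no 'new_query' key, an empty list is returned and the sample is dropped.

Args:
    llm: Language model service instance.
    new_identifier_key (str): Field name of the new identifier. Defaults to 'new_identifier'.
    relation_key (str): Field name of the relation. Defaults to 'relation'.
    identifier_key (str): Field name of the original identifier. Defaults to 'identifier'.
    question_key (str): Field name that receives the generated question. Defaults to 'depth_question'.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.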


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.DepthQAGGenerateQuestion(llm=my_llm)
    result = op({
        'identifier': 'Paris',
        'new_identifier': 'France',
        'relation': 'capital_of'
    })
    print(result['depth_question'])
    # 'What is the capital of France?'
    ```
    """

    def __init__(self, llm=None, new_identifier_key: str = 'new_identifier',
                 relation_key: str = 'relation', identifier_key: str = 'identifier',
                 question_key: str = 'depth_question', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.new_identifier_key = new_identifier_key
        self.relation_key = relation_key
        self.identifier_key = identifier_key
        self.question_key = question_key
        self.prompt_template = RAGDepthQuestionFromContextPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        new_identifier = data.get(self.new_identifier_key, '')
        relation = data.get(self.relation_key, '')
        identifier = data.get(self.identifier_key, '')

        user_prompt = self.prompt_template.build_prompt(new_identifier, relation, identifier)

        try:
            result = self._llm_serve(user_prompt)
            parsed = self._parse_question_result(result)
            if parsed is not None:
                data[self.question_key] = parsed
                return data
        except Exception as e:
            LOG.warning(f'Failed to generate question: {e}')

        return []

    def _parse_question_result(self, result) -> Optional[str]:
        try:
            if isinstance(result, dict) and 'new_query' in result:
                return result['new_query']
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse question: {e}')
        return None

DepthQAGGetIdentifier

Bases: agenticrag

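Operator that calls the LLM to extract a content identifier from the input text.

If the record already contains an identifier field, it is returned unchanged. If the LLM call fails, identifier is set to an empty string.

Parameters:

  • llm

    Language model service instance.

  • input_key (str, default: 'question' ) –

    Field name of the input text.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, forwarded to the base class constructor.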

Examples:

from lazyllm.tools.data import agenticrag
op = agenticrag.DepthQAGGetIdentifier(llm=my_llm, input_key='question')
result = op({'question': 'What is the capital of France?'})
print('identifier:', result['identifier'])
# e.g. identifier: capital of France
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
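class DepthQAGGetIdentifier(agenticrag):
    """Operator that calls the LLM to extract a content identifier from the input text.

If the record already contains an identifier field, it is returned unchanged.
If the LLM call fails, identifier is set to an empty string.

Args:
    llm: Language model service instance.
    input_key (str): Field name of the input text. Defaults to 'question'.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.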


Examples:
    ```python
    from lazyllm.tools.data import agenticrag
    op = agenticrag.DepthQAGGetIdentifier(llm=my_llm, input_key='question')
    result = op({'question': 'What is the capital of France?'})
    print('identifier:', result['identifier'])
    # e.g. identifier: capital of France
    ```
    """

    def __init__(self, llm=None, input_key: str = 'question', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.prompt_template = RAGDepthQueryIdPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt)
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        # Skip if identifier already exists
        if 'identifier' in data:
            return data

        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        content = data.get(self.input_key, '')
        user_prompt = self.prompt_template.build_prompt(content)

        try:
            result = self._llm_serve(user_prompt)
            data['identifier'] = result
        except Exception as e:
            LOG.warning(f'Failed to get identifier: {e}')
            data['identifier'] = ''

        return data

DepthQAGVerifyQuestion

Bases: agenticrag

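Verifies the quality of a generated question and filters out questions that are too easy.

The LLM first answers the question to produce llm_answer, which is then scored for recall against refined_answer (falling back to the answer field when refined_answer is missing). A score >= 1 means the question is too easy, so the sample is filtered out; otherwise the record is kept. The temporary llm_answer and llm_score fields are removed from kept records.

Parameters:

  • llm

    Language model service instance.

  • question_key (str, default: 'depth_question' ) –

    Field name of the question.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, forwarded to the base class constructor.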

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.DepthQAGVerifyQuestion(llm=my_llm)
result = op({
    'depth_question': 'What is the capital of France?',
    'refined_answer': 'Paris'
})
# Returns data if question is challenging, empty list if too easy
print(result)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_depth_qa_generator.py
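class DepthQAGVerifyQuestion(agenticrag):
    """Verifies the quality of a generated question and filters out questions that are too easy.

The LLM first answers the question to produce llm_answer, which is then scored for recall against
refined_answer (falling back to the answer field when refined_answer is missing).
A score >= 1 means the question is too easy, so the sample is filtered out; otherwise the record is
kept. The temporary llm_answer and llm_score fields are removed from kept records.

Args:
    llm: Language model service instance.
    question_key (str): Field name of the question. Defaults to 'depth_question'.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.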


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.DepthQAGVerifyQuestion(llm=my_llm)
    result = op({
        'depth_question': 'What is the capital of France?',
        'refined_answer': 'Paris'
    })
    # Returns data if question is challenging, empty list if too easy
    print(result)
    ```
    """

    def __init__(self, llm=None, question_key: str = 'depth_question', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.question_key = question_key
        self.answer_template = RAGDepthSolverPrompt()
        self.score_template = RAGDepthConsistencyScoringPrompt()

        if llm is not None:
            self._llm_answer_serve = llm.share()
            self._llm_answer_serve.start()

            score_system_prompt = self.score_template.build_system_prompt()
            self._llm_score_serve = llm.share().prompt(score_system_prompt).formatter(JsonFormatter())
            self._llm_score_serve.start()
        else:
            self._llm_answer_serve = None
            self._llm_score_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_answer_serve is None or self._llm_score_serve is None:
            raise ValueError('LLM is not configured')

        question = data.get(self.question_key, '')

        if 'refined_answer' not in data and 'answer' in data:
            data['refined_answer'] = data['answer']

        refined_answer = data.get('refined_answer', '')

        user_prompt = self.answer_template.build_prompt(question)
        try:
            llm_answer = self._llm_answer_serve(user_prompt)
            data['llm_answer'] = llm_answer
        except Exception as e:
            LOG.warning(f'Failed to get LLM answer: {e}')
            return []

        score_prompt = self.score_template.build_prompt(refined_answer, llm_answer)

        try:
            score_result = self._llm_score_serve(score_prompt)
            if isinstance(score_result, dict):
                score = score_result.get('answer_score', 0)
            else:
                score = 0
            data['llm_score'] = score

            # Filter out easy questions (score >= 1)
            if score >= 1:
                data.pop('llm_answer', None)
                data.pop('llm_score', None)
                return []

            # Clean up temporary fields
            data.pop('llm_answer', None)
            data.pop('llm_score', None)
        except Exception as e:
            LOG.warning(f'Failed to calculate recall score: {e}')
            return []

        return data
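
The answer-then-score filter above can be traced end to end with stubbed models. This is a minimal sketch under stated assumptions: `verify_question` and `substring_scorer` are hypothetical names, the lambdas stand in for the two LLM services, and a substring match replaces the LLM-based recall scoring.

```python
from typing import Any, Callable, Dict, Union

Record = Dict[str, Any]


def verify_question(
    data: Record,
    answer_llm: Callable[[str], str],
    score_llm: Callable[[str, str], Any],
    question_key: str = 'depth_question',
) -> Union[Record, list]:
    """Drop the sample when the model already answers the question correctly."""
    # Fall back to 'answer' when 'refined_answer' is absent, as in the source above.
    if 'refined_answer' not in data and 'answer' in data:
        data['refined_answer'] = data['answer']
    reference = data.get('refined_answer', '')

    try:
        llm_answer = answer_llm(data.get(question_key, ''))
        score_result = score_llm(reference, llm_answer)
    except Exception:
        return []  # a failed call also drops the sample

    score = score_result.get('answer_score', 0) if isinstance(score_result, dict) else 0
    return [] if score >= 1 else data


def substring_scorer(reference: str, answer: str) -> Dict[str, int]:
    # Hypothetical stand-in for the LLM-based scorer.
    return {'answer_score': 1 if reference.lower() in answer.lower() else 0}


easy = verify_question(
    {'depth_question': 'What is the capital of France?', 'refined_answer': 'Paris'},
    answer_llm=lambda q: 'The capital is Paris.',
    score_llm=substring_scorer,
)
hard = verify_question(
    {'depth_question': 'Which capital city lies on the Seine?', 'answer': 'Paris'},
    answer_llm=lambda q: 'I am not sure.',
    score_llm=substring_scorer,
)
print(easy)
print(hard)
```

The first record is dropped because the stub model already answers it; the second survives and gains a `refined_answer` field copied from `answer`.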

lazyllm.tools.data.operators.agentic_rag.agenticrag_qaf1_sample_evaluator

qaf1_calculate_score(data, result_key='F1Score')

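Function that computes the F1 score of a QA pair.

Computes the F1 score (which combines precision and recall) from the normalized predicted answer and the normalized reference answers. Multiple reference answers are supported, and the highest F1 score is taken as the final result. If the prediction is missing or there are no reference answers, the score is 0.0. The temporary fields are removed once the score is computed.

Parameters:

  • data (dict) –

    A single data record.

  • result_key (str, default: 'F1Score' ) –

    Field name that receives the F1 score.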

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.qaf1_calculate_score(result_key='F1Score')
result = op({
    '_normalized_prediction': 'paris is capital',
    '_normalized_ground_truths': ['capital is paris', 'paris capital france']
})
print(result['F1Score'])  # F1 score value between 0.0 and 1.0
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_qaf1_sample_evaluator.py
@data_register('data.agenticrag', rewrite_func='forward', _concurrency_mode='process')
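def qaf1_calculate_score(data: dict, result_key: str = 'F1Score') -> dict:
    """Function that computes the F1 score of a QA pair.

Computes the F1 score (which combines precision and recall) from the normalized predicted answer
and the normalized reference answers. Multiple reference answers are supported, and the highest
F1 score is taken as the final result. If the prediction is missing or there are no reference
answers, the score is 0.0. The temporary fields are removed once the score is computed.

Args:
    data (dict): A single data record.
    result_key (str): Field name that receives the F1 score. Defaults to 'F1Score'.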


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.qaf1_calculate_score(result_key='F1Score')
    result = op({
        '_normalized_prediction': 'paris is capital',
        '_normalized_ground_truths': ['capital is paris', 'paris capital france']
    })
    print(result['F1Score'])  # F1 score value between 0.0 and 1.0
    ```
    """
    normalized_prediction = data.get('_normalized_prediction', None)
    normalized_ground_truths = data.get('_normalized_ground_truths', None)

    if normalized_prediction is None or not normalized_ground_truths:
        data[result_key] = 0.0
    else:
        max_f1 = 0.0
        for normalized_ground_truth in normalized_ground_truths:
            f1 = _compute_f1_score(normalized_prediction, normalized_ground_truth)
            max_f1 = max(max_f1, f1)
        data[result_key] = max_f1

    # Clean up temporary fields
    data.pop('_normalized_prediction', None)
    data.pop('_normalized_ground_truths', None)

    return data
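
The helper `_compute_f1_score` is not shown in this listing. The following is a minimal sketch of one common definition, token-overlap F1 with whitespace tokenization, where F1 = 2PR / (P + R); the names `token_f1` and `best_f1` are hypothetical, and the library helper may differ in detail.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between two already-normalized strings."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()
    if not pred_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, ground_truths: List[str]) -> float:
    """Take the highest F1 over all reference answers (0.0 when there are none)."""
    return max((token_f1(prediction, gt) for gt in ground_truths), default=0.0)


score = best_f1('paris is capital', ['capital is paris', 'paris capital france'])
print(score)
```

Because the comparison is a bag-of-tokens overlap, word order does not matter: the first reference shares all three tokens with the prediction, while the second shares only two of three.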

qaf1_normalize_texts(data, predicted_key='refined_answer', reference_key='golden_doc_answer')

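Function that normalizes the predicted and reference answer texts.

Normalizes the predicted answer and the reference answers by lowercasing, removing punctuation, removing articles (a/an/the), and collapsing whitespace. The normalized results are stored in the temporary fields _normalized_prediction and _normalized_ground_truths for the subsequent F1 computation. The reference may be a single string or a list of strings; if either field is missing, both temporary fields are set to None.

Parameters:

  • data (dict) –

    A single data record.

  • predicted_key (str, default: 'refined_answer' ) –

    Field name of the predicted answer.

  • reference_key (str, default: 'golden_doc_answer' ) –

    Field name of the reference answer.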

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.qaf1_normalize_texts(predicted_key='refined_answer', reference_key='golden_doc_answer')
result = op({
    'refined_answer': 'Paris is the capital.',
    'golden_doc_answer': 'The capital is Paris!'
})
print(result['_normalized_prediction'])  # 'paris is capital'
print(result['_normalized_ground_truths'])  # ['capital is paris']
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_qaf1_sample_evaluator.py
@data_register('data.agenticrag', rewrite_func='forward', _concurrency_mode='process')
def qaf1_normalize_texts(data: dict,
                         predicted_key: str = 'refined_answer',
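                         reference_key: str = 'golden_doc_answer') -> dict:
    """Function that normalizes the predicted and reference answer texts.

Normalizes the predicted answer and the reference answers by lowercasing, removing punctuation,
removing articles (a/an/the), and collapsing whitespace. The normalized results are stored in the
temporary fields _normalized_prediction and _normalized_ground_truths for the subsequent F1
computation. The reference may be a single string or a list of strings; if either field is
missing, both temporary fields are set to None.

Args:
    data (dict): A single data record.
    predicted_key (str): Field name of the predicted answer. Defaults to 'refined_answer'.
    reference_key (str): Field name of the reference answer. Defaults to 'golden_doc_answer'.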


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.qaf1_normalize_texts(predicted_key='refined_answer', reference_key='golden_doc_answer')
    result = op({
        'refined_answer': 'Paris is the capital.',
        'golden_doc_answer': 'The capital is Paris!'
    })
    print(result['_normalized_prediction'])  # 'paris is capital'
    print(result['_normalized_ground_truths'])  # ['capital is paris']
    ```
    """
    prediction = data.get(predicted_key, None)
    ground_truths = data.get(reference_key, None)

    if prediction is None or ground_truths is None:
        data['_normalized_prediction'] = None
        data['_normalized_ground_truths'] = None
        return data

    # Normalize prediction
    data['_normalized_prediction'] = _normalize_response(str(prediction))

    # Normalize ground truths (handle both string and list)
    if isinstance(ground_truths, str):
        data['_normalized_ground_truths'] = [_normalize_response(str(ground_truths))]
    else:
        data['_normalized_ground_truths'] = [
            _normalize_response(str(gt)) for gt in ground_truths if gt is not None
        ]

    return data
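
The helper `_normalize_response` is not shown in this listing. The following is a minimal sketch of the four steps the description names (lowercase, strip punctuation, drop articles, collapse whitespace); `normalize_answer` is a hypothetical name, and only ASCII punctuation is handled here, which is an assumption.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, remove ASCII punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)  # articles become spaces
    return ' '.join(text.split())  # split/join collapses runs of whitespace


pred = normalize_answer('Paris is the capital.')
truth = normalize_answer('The capital is Paris!')
print(pred)
print(truth)
```

After normalization the two answers contain the same tokens in a different order, so a token-overlap F1 treats them as a perfect match.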

lazyllm.tools.data.operators.agentic_rag.agenticrag_width_qa_generator

WidthQAGCheckDecomposition

Bases: agenticrag

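Operator that verifies whether a merged question validly decomposes into the original questions.

This operator checks whether the complex question generated by the LLM correctly decomposes into and covers the original questions. If validation passes (the reply's state is 1), the record is kept and the checked question is written to output_question_key; otherwise an empty list is returned, which filters out the sample.

Parameters:

  • llm

    Language model service instance.

  • output_question_key (str, default: 'generated_width_task' ) –

    Field name that receives the generated question.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, forwarded to the base class constructor.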

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.WidthQAGCheckDecomposition(llm=my_llm)
result = op({
    'question': 'What are the capitals of France and UK?',
    'original_question': ['What is Paris?', 'What is London?'],
    'index': 0
})
print(result)  # Returns data if valid, empty list if invalid
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py
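class WidthQAGCheckDecomposition(agenticrag):
    """Operator that verifies whether a merged question validly decomposes into the original questions.

This operator checks whether the complex question generated by the LLM correctly decomposes into
and covers the original questions. If validation passes (the reply's state is 1), the record is
kept and the checked question is written to output_question_key; otherwise an empty list is
returned, which filters out the sample.

Args:
    llm: Language model service instance.
    output_question_key (str): Field name that receives the generated question. Defaults to 'generated_width_task'.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.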


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.WidthQAGCheckDecomposition(llm=my_llm)
    result = op({
        'question': 'What are the capitals of France and UK?',
        'original_question': ['What is Paris?', 'What is London?'],
        'index': 0
    })
    print(result)  # Returns data if valid, empty list if invalid
    ```
    """

    def __init__(self, llm=None, output_question_key: str = 'generated_width_task', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_question_key = output_question_key
        self.prompt_template = RAGWidthDecompositionCheckPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def _build_check_input(self, item: dict) -> dict:
        ori_q = item.get('original_question', [])
        return {
            'index': item.get('index', 0),
            'complex_question': item.get('question', ''),
            'original_questions': ori_q if isinstance(ori_q, list) else [ori_q]
        }

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        check_input = self._build_check_input(data)
        user_prompt = self.prompt_template.build_prompt(check_input)

        try:
            result = self._llm_serve(user_prompt)

            if isinstance(result, dict):
                state = result.get('state', None)
                complex_question = result.get('complex_question', data.get('question'))

                if state == 1:
                    data['state'] = state
                    data[self.output_question_key] = complex_question
                    return data
                else:
                    return []
            else:
                LOG.warning('[Skipped]: Invalid check result')
                return []
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse check result: {e}')
            return []
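
The two data-shaping steps of this operator, building the check payload and applying the verdict, can be exercised without a model. This is a minimal sketch: `build_check_input` mirrors `_build_check_input` above, while `apply_check` is a hypothetical helper that takes an already-parsed reply in place of the LLM call.

```python
from typing import Any, Dict, Union

Record = Dict[str, Any]


def build_check_input(item: Record) -> Record:
    """Wrap a single original question in a list so the payload shape is uniform."""
    original = item.get('original_question', [])
    return {
        'index': item.get('index', 0),
        'complex_question': item.get('question', ''),
        'original_questions': original if isinstance(original, list) else [original],
    }


def apply_check(data: Record, result: Any, output_key: str = 'generated_width_task') -> Union[Record, list]:
    """Keep the record only when the parsed reply reports state == 1."""
    if not isinstance(result, dict) or result.get('state') != 1:
        return []
    data['state'] = 1
    # Fall back to the input question when the reply omits 'complex_question'.
    data[output_key] = result.get('complex_question', data.get('question'))
    return data


item = {
    'question': 'What are the capitals of France and the UK?',
    'original_question': 'What is Paris?',
    'index': 3,
}
check_input = build_check_input(item)
accepted = apply_check(dict(item), {'state': 1})
rejected = apply_check(dict(item), {'state': 0})
print(check_input['original_questions'])
print(rejected)
```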

WidthQAGFilterByScore

Bases: agenticrag

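Operator that filters width questions by recall score.

This operator compares the reference answers in original_answer with llm_answer to compute a recall score. A score >= 1 filters out the sample (the question is too easy or the LLM answered it too well); otherwise the record is kept and the temporary fields llm_answer, llm_score, and state are removed. Records missing either original_answer or llm_answer are also dropped.

Parameters:

  • llm

    Language model service instance.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, forwarded to the base class constructor.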

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.WidthQAGFilterByScore(llm=my_llm)
result = op({
    'original_answer': ['Paris', 'London'],
    'llm_answer': 'Paris is the capital of France and London is the capital of UK',
    'state': 1
})
# Returns data if score < 1, empty list if score >= 1
print(result)
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py
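class WidthQAGFilterByScore(agenticrag):
    """Operator that filters width questions by recall score.

This operator compares the reference answers in original_answer with llm_answer to compute a
recall score. A score >= 1 filters out the sample (the question is too easy or the LLM answered
it too well); otherwise the record is kept and the temporary fields llm_answer, llm_score, and
state are removed. Records missing either original_answer or llm_answer are also dropped.

Args:
    llm: Language model service instance.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.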


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.WidthQAGFilterByScore(llm=my_llm)
    result = op({
        'original_answer': ['Paris', 'London'],
        'llm_answer': 'Paris is the capital of France and London is the capital of UK',
        'state': 1
    })
    # Returns data if score < 1, empty list if score >= 1
    print(result)
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.score_template = RAGWidthConsistencyScoringPrompt()

        if llm is not None:
            system_prompt = self.score_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        golden_answer = data.get('original_answer', [])
        llm_answer = data.get('llm_answer', '')

        if not golden_answer or not llm_answer:
            return []

        user_prompt = self.score_template.build_prompt(golden_answer, llm_answer)

        try:
            score_result = self._llm_serve(user_prompt)

            if isinstance(score_result, dict):
                score = score_result.get('answer_score', 0)
            else:
                score = 0

            data['llm_score'] = score

            if score >= 1:
                data.pop('llm_answer', None)
                data.pop('llm_score', None)
                data.pop('state', None)
                return []

            data.pop('llm_answer', None)
            data.pop('llm_score', None)
            data.pop('state', None)
            return data
        except Exception as e:
            LOG.warning(f'Failed to calculate recall score: {e}')
            return []

WidthQAGMergePairs

Bases: agenticrag

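Operator that merges adjacent QA pairs into width questions.

This operator takes a batch of QA records and uses the LLM to merge each pair of adjacent QA records into one more complex width question. At least 2 records are required for merging.

Parameters:

  • llm

    Language model service instance.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, forwarded to the base class constructor.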

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.WidthQAGMergePairs(llm=my_llm)
result = op([
    {'question': 'What is Paris?', 'golden_answer': 'Capital of France'},
    {'question': 'What is London?', 'golden_answer': 'Capital of UK'}
])
print(result[0]['question'])  # Merged complex question
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py
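class WidthQAGMergePairs(agenticrag):
    """Operator that merges adjacent QA pairs into width questions.

This operator takes a batch of QA records and uses the LLM to merge each pair of adjacent
QA records into one more complex width question.
At least 2 records are required for merging.

Args:
    llm: Language model service instance.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.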


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.WidthQAGMergePairs(llm=my_llm)
    result = op([
        {'question': 'What is Paris?', 'golden_answer': 'Capital of France'},
        {'question': 'What is London?', 'golden_answer': 'Capital of UK'}
    ])
    print(result[0]['question'])  # Merged complex question
    ```
    """

    def __init__(self, llm=None, **kwargs):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.prompt_template = RAGWidthQuestionSynthesisPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def _build_prompts(self, data: List[dict]) -> list:
        user_prompts = []
        for i in range(len(data) - 1):
            pair = [data[i], data[i + 1]]
            user_prompts.append(self.prompt_template.build_prompt(pair))
        return user_prompts

    def _parse_merge_result(self, result, idx: int, input_batch: List[dict]) -> Optional[dict]:
        try:
            # The formatter may return a list of candidates; unwrap the first one.
            if isinstance(result, list) and len(result) > 0:
                result = result[0]

            if not isinstance(result, dict) or 'question' not in result or 'index' not in result:
                LOG.warning(f'[Skipped]: Invalid merge result at index {idx}')
                return None

            indices = result['index'] if isinstance(result['index'], list) else [result['index']]
            group_items = [input_batch[i] for i in indices if i < len(input_batch)]

            if not group_items:
                return None

            return {
                'question': result['question'],
                'content_identifier': result.get('content_identifier', ''),
                'qa_index': indices,
                'index': idx,
                'original_answer': [item['golden_answer'] for item in group_items],
                'original_question': [item['question'] for item in group_items],
            }
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse merge result at index {idx}: {e}')
            return None

    def forward_batch_input(self, data: List[dict]) -> List[dict]:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        if len(data) < 2:
            LOG.warning('Need at least 2 items to merge.')
            return []

        LOG.info(f'Merging {len(data)} items into width questions...')
        user_prompts = self._build_prompts(data)

        if not user_prompts:
            return []

        merge_results = []
        for prompt in user_prompts:
            merge_results.append(self._llm_serve(prompt))

        merged_data_list = []
        for idx, result in enumerate(merge_results):
            parsed = self._parse_merge_result(result, idx, data)
            if parsed is not None:
                merged_data_list.append(parsed)

        LOG.info(f'Generated {len(merged_data_list)} merged questions.')
        return merged_data_list
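
The pairing rule in `_build_prompts` (item i is paired with item i + 1, so n records yield n - 1 overlapping pairs) and the record assembled by `_parse_merge_result` can be sketched without an LLM. `adjacent_pairs` and `merge_record` are hypothetical names, and the merge reply passed in below is hand-written sample data rather than real model output.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def adjacent_pairs(items: List[Record]) -> List[Tuple[Record, Record]]:
    """Pair each item with its successor: n items give n - 1 overlapping pairs."""
    return [(items[i], items[i + 1]) for i in range(len(items) - 1)]


def merge_record(result: Dict[str, Any], idx: int, batch: List[Record]) -> Record:
    """Build the merged record from a parsed reply with 'question' and 'index' keys."""
    indices = result['index'] if isinstance(result['index'], list) else [result['index']]
    # This sketch also rejects negative indices, which the source above does not check.
    group = [batch[i] for i in indices if 0 <= i < len(batch)]
    return {
        'question': result['question'],
        'qa_index': indices,
        'index': idx,
        'original_answer': [g['golden_answer'] for g in group],
        'original_question': [g['question'] for g in group],
    }


batch = [
    {'question': 'What is Paris?', 'golden_answer': 'Capital of France'},
    {'question': 'What is London?', 'golden_answer': 'Capital of UK'},
    {'question': 'What is Rome?', 'golden_answer': 'Capital of Italy'},
]
pairs = adjacent_pairs(batch)
merged = merge_record({'question': 'What are Paris and London?', 'index': [0, 1]}, 0, batch)
print(len(pairs))
print(merged['original_answer'])
```

Note that the middle record appears in two pairs, so a batch of three records can produce two merged questions that share a source question.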

WidthQAGVerifyQuestion

Bases: agenticrag

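Operator that verifies whether the generated question can be answered correctly.

This operator has the LLM attempt to answer the generated question and stores the answer in the llm_answer field for subsequent scoring.

Parameters:

  • llm

    Language model service instance.

  • output_question_key (str, default: 'generated_width_task' ) –

    Field name of the question.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, forwarded to the base class constructor.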

Examples:

from lazyllm.tools.data import agenticrag

op = agenticrag.WidthQAGVerifyQuestion(llm=my_llm)
result = op({
    'generated_width_task': 'What are the capitals of France and UK?',
    'index': 0
})
print(result['llm_answer'])  # LLM's answer to the question
Source code in lazyllm/tools/data/operators/agentic_rag/agenticrag_width_qa_generator.py
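class WidthQAGVerifyQuestion(agenticrag):
    """Operator that verifies whether the generated question can be answered correctly.

This operator has the LLM attempt to answer the generated question and stores the answer in the
llm_answer field for subsequent scoring.

Args:
    llm: Language model service instance.
    output_question_key (str): Field name of the question. Defaults to 'generated_width_task'.
    **kwargs (dict): Other optional arguments, forwarded to the base class constructor.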


Examples:
    ```python
    from lazyllm.tools.data import agenticrag

    op = agenticrag.WidthQAGVerifyQuestion(llm=my_llm)
    result = op({
        'generated_width_task': 'What are the capitals of France and UK?',
        'index': 0
    })
    print(result['llm_answer'])  # LLM's answer to the question
    ```
    """

    def __init__(self, llm=None, output_question_key: str = 'generated_width_task', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_question_key = output_question_key
        self.prompt_template = RAGWidthVerificationPrompt()

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def _parse_verify_result(self, result) -> Optional[str]:
        try:
            if isinstance(result, dict):
                return result.get('llm_answer', None)
        except Exception as e:
            LOG.warning(f'[Error]: Failed to parse verification result: {e}')
        return None

    def forward(self, data: dict) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        question = data.get(self.output_question_key, '')

        verify_input = {
            'index': data.get('index', 0),
            'complex_question': question
        }

        user_prompt = self.prompt_template.build_prompt(verify_input)

        try:
            result = self._llm_serve(user_prompt)
            llm_answer = self._parse_verify_result(result)
            data['llm_answer'] = llm_answer
            return data
        except Exception as e:
            LOG.warning(f'Failed to verify question: {e}')
            return []
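
The verification step can be tried without a model service. The sketch below is a stand-alone imitation of the forward logic shown above, not the operator itself: `verify_question`, `parse_answer`, and `fake_llm` are hypothetical names, the prompt text is made up, and the stub returns a fixed dict where a real service would return the reply parsed by JsonFormatter.

```python
from typing import Callable, Optional


def parse_answer(result) -> Optional[str]:
    # Only a dict reply carries an answer; anything else counts as "no answer".
    if isinstance(result, dict):
        return result.get('llm_answer')
    return None


def verify_question(data: dict, llm: Callable[[str], object],
                    question_key: str = 'generated_width_task') -> dict:
    # Hypothetical stand-in for forward(): ask the model, attach its answer.
    question = data.get(question_key, '')
    reply = llm(f'Answer the question: {question}')
    return {**data, 'llm_answer': parse_answer(reply)}


def fake_llm(prompt: str) -> dict:
    # Stub model (assumption): always returns the same parsed JSON object.
    return {'llm_answer': 'Paris and London'}


record = {'generated_width_task': 'What are the capitals of France and UK?', 'index': 0}
verified = verify_question(record, fake_llm)
print(verified['llm_answer'])  # Paris and London
```

A reply that is not a dict yields `llm_answer = None`, which a later scoring step can treat as a failed verification.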

Plain-text QA pair generation operators

lazyllm.tools.data.operators.text2qa_ops

ChunkToQA

Bases: Text2qa

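Uses an LLM to generate one QA pair (question + answer) from each text chunk. JsonFormatter constrains the output format. You can pass a custom user_prompt or rely on the default, '根据下面文本生成一个 QA 对' ("Generate one QA pair from the text below"). A record whose chunk is empty gets an empty question and answer without a model call.

Parameters:

  • input_key (str, default: 'chunk' ) –

    Name of the input chunk field. Defaults to 'chunk'.

  • query_key (str, default: 'query' ) –

    Name of the field the generated question is written to. Defaults to 'query'.

  • answer_key (str, default: 'answer' ) –

    Name of the field the generated answer is written to. Defaults to 'answer'.

  • model

    Optional. A TrainableModule or an object with a compatible interface; when None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional user prompt prefix; when None, the default prompt is used.

  • **kwargs

    Other base-class arguments.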

Examples:

from lazyllm.tools.data import Text2qa
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = Text2qa.ChunkToQA(input_key='chunk', query_key='query', answer_key='answer', model=llm)
data = [{'chunk': '今天是晴天!'}]
res = op(data)
print(res)
# [{'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!'}]
Source code in lazyllm/tools/data/operators/text2qa_ops.py
class ChunkToQA(Text2qa):
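    """Uses an LLM to generate one QA pair (question + answer) from each text chunk. JsonFormatter constrains the output format. You can pass a custom user_prompt or rely on the default, '根据下面文本生成一个 QA 对' ("Generate one QA pair from the text below").

Args:
    input_key (str): Name of the input chunk field. Defaults to 'chunk'.
    query_key (str): Name of the field the generated question is written to. Defaults to 'query'.
    answer_key (str): Name of the field the generated answer is written to. Defaults to 'answer'.
    model: Optional. A TrainableModule or an object with a compatible interface; when None, the default Qwen model is used.
    user_prompt (str|None): Optional user prompt prefix; when None, the default prompt is used.
    **kwargs: Other base-class arguments.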


Examples:
    ```python
    from lazyllm.tools.data import Text2qa
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = Text2qa.ChunkToQA(input_key='chunk', query_key='query', answer_key='answer', model=llm)
    data = [{'chunk': '今天是晴天!'}]
    res = op(data)
    print(res)
    # [{'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!'}]
    ```
    """
    def __init__(self,
                 input_key='chunk',
                 query_key='query',
                 answer_key='answer',
                 model=None,
                 user_prompt=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.query_key = query_key
        self.answer_key = answer_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求:
        {{
            "{self.query_key}": "生成的问题",
            "{self.answer_key}": "答案"
        }}
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()
        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data: dict):
        assert self.input_key in data
        chunk = data.get(self.input_key, '')

        if not chunk:
            data[self.query_key] = ''
            data[self.answer_key] = ''
            return data

        if self.user_prompt is None:
            user_prompt = '根据下面文本生成一个 QA 对:\n'
        else:
            user_prompt = self.user_prompt

        inp = f'{user_prompt}\n{chunk}'

        qa = self.model(inp)

        data[self.query_key] = qa.get(self.query_key, '')
        data[self.answer_key] = qa.get(self.answer_key, '')
        return data
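
The constructor above embeds the configured field names in the format instruction through an f-string, where doubled braces produce literal braces. A minimal sketch of that technique follows; `build_output_structure` is a hypothetical helper and the English wording stands in for the Chinese prompt text used by the operator.

```python
import json


def build_output_structure(query_key: str = 'query', answer_key: str = 'answer') -> str:
    # '{{' and '}}' are literal braces inside an f-string; '{query_key}' is substituted.
    return (
        'Output format:\n'
        f'{{\n'
        f'    "{query_key}": "generated question",\n'
        f'    "{answer_key}": "answer"\n'
        f'}}'
    )


spec = build_output_structure('q', 'a')
print(spec)

# Everything after the first line is itself valid JSON with the custom keys.
template = json.loads(spec.split('\n', 1)[1])
print(sorted(template))  # ['a', 'q']
```

Because the same key names are used to read the model reply (`qa.get(self.query_key, '')`), renaming the fields changes both the instruction and the lookup consistently.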

QAScorer

Bases: Text2qa

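Uses an LLM to score QA pairs: it judges whether the answer is strictly grounded in the source text and outputs 1 (grounded) or 0 (otherwise). JsonFormatter constrains the output to a score field. Records whose chunk, question, or answer is empty are scored 0 without calling the model.

Parameters:

  • input_key (str, default: 'chunk' ) –

    Name of the source chunk field. Defaults to 'chunk'.

  • output_key (str, default: 'score' ) –

    Name of the field the score is written to. Defaults to 'score'.

  • query_key (str, default: 'query' ) –

    Name of the question field. Defaults to 'query'.

  • answer_key (str, default: 'answer' ) –

    Name of the answer field. Defaults to 'answer'.

  • model

    Optional. A TrainableModule or an object with a compatible interface; when None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional user prompt; when None, the default rule is used (strictly grounded in the source text → 1, otherwise → 0). A custom prompt is followed only by the question and answer; the source chunk is not appended.

  • **kwargs

    Other base-class arguments.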

Examples:

from lazyllm.tools.data import Text2qa
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = Text2qa.QAScorer(input_key='chunk', output_key='score', query_key='query', answer_key='answer', model=llm)
data = [
    {'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!'},
    {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3'},
]
res = op(data)
print(res)
# [{'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!', 'score': 1},
#  {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3', 'score': 0}]
Source code in lazyllm/tools/data/operators/text2qa_ops.py
class QAScorer(Text2qa):
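    """Uses an LLM to score QA pairs: it judges whether the answer is strictly grounded in the source text and outputs 1 (grounded) or 0 (otherwise). JsonFormatter constrains the output to a score field.

Args:
    input_key (str): Name of the source chunk field. Defaults to 'chunk'.
    output_key (str): Name of the field the score is written to. Defaults to 'score'.
    query_key (str): Name of the question field. Defaults to 'query'.
    answer_key (str): Name of the answer field. Defaults to 'answer'.
    model: Optional. A TrainableModule or an object with a compatible interface; when None, the default Qwen model is used.
    user_prompt (str|None): Optional user prompt; when None, the default rule is used (strictly grounded in the source text → 1, otherwise → 0).
    **kwargs: Other base-class arguments.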


Examples:
    ```python
    from lazyllm.tools.data import Text2qa
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = Text2qa.QAScorer(input_key='chunk', output_key='score', query_key='query', answer_key='answer', model=llm)
    data = [
        {'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!'},
        {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3'},
    ]
    res = op(data)
    print(res)
    # [{'chunk': '今天是晴天!', 'query': '今天的天气怎么样?', 'answer': '今天是晴天!', 'score': 1},
    #  {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3', 'score': 0}]
    ```
    """
    def __init__(self,
                 input_key='chunk',
                 output_key='score',
                 query_key='query',
                 answer_key='answer',
                 model=None,
                 user_prompt=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.query_key = query_key
        self.answer_key = answer_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求:
        {{
            "{self.output_key}": 0 or 1
        }}
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()
        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data: dict):
        assert self.input_key in data
        assert self.query_key in data
        assert self.answer_key in data

        chunk = data.get(self.input_key, '')
        query = data.get(self.query_key, '')
        answer = data.get(self.answer_key, '')

        if not (chunk and query and answer):
            data[self.output_key] = 0
            return data

        qa = f'问题{query}; 答案{answer}'
        if self.user_prompt is None:
            user_prompt = f'''
        请根据下面内容对 QA 打分:

        原文:
        {chunk}

        {qa}

        规则:
        - 严格基于原文 → 1
        - 否则 → 0
        '''
        else:
            user_prompt = self.user_prompt + qa
        res = self.model(user_prompt)

        data[self.output_key] = res.get(self.output_key, 0)
        return data
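
The scoring flow above (short-circuit on incomplete records, otherwise ask a judge and read the score with a default of 0) can be sketched with a stub in place of the model. `score_record` and `substring_judge` are hypothetical names, the prompt wording is invented, and the substring rule is only a toy replacement for the LLM judgment.

```python
from typing import Callable


def score_record(data: dict, judge: Callable[[str], dict]) -> dict:
    chunk = data.get('chunk', '')
    query = data.get('query', '')
    answer = data.get('answer', '')
    if not (chunk and query and answer):
        # Same short-circuit as the operator: incomplete records get 0, no model call.
        return {**data, 'score': 0}
    prompt = f'Source text:\n{chunk}\n\nQuestion: {query}\nAnswer: {answer}'
    return {**data, 'score': judge(prompt).get('score', 0)}


def substring_judge(prompt: str) -> dict:
    # Hypothetical stub: scores 1 when the answer text occurs in the source text.
    source = prompt.split('\n\nQuestion: ')[0][len('Source text:\n'):]
    answer = prompt.rsplit('Answer: ', 1)[1]
    return {'score': int(answer in source)}


rows = [
    {'chunk': 'The sky is clear today.', 'query': 'How is the weather?', 'answer': 'The sky is clear today.'},
    {'chunk': '1+1=2', 'query': '1+1=?', 'answer': '3'},
    {'chunk': '', 'query': 'Anything?', 'answer': 'No source text.'},
]
scored = [score_record(row, substring_judge) for row in rows]
print([row['score'] for row in scored])  # [1, 0, 0]

# A typical follow-up: keep only the grounded pairs.
kept = [row for row in scored if row['score'] == 1]
print(len(kept))  # 1
```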

TextToChunks

Bases: Text2qa

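Splits the input text into chunks along line boundaries; one input record can expand into several output records. Chunk size is bounded by token count when a tokenizer is used and by character count otherwise. Lines are never split, so a single line longer than chunk_size becomes a chunk of its own.

Parameters:

  • input_key (str, default: 'content' ) –

    Name of the input text field. Defaults to 'content'.

  • output_key (str, default: 'chunk' ) –

    Name of the field the chunk text is written to. Defaults to 'chunk'.

  • chunk_size (int, default: 10 ) –

    Maximum length of each chunk (in tokens or characters). Defaults to 10.

  • tokenize (bool, default: True ) –

    Whether to count tokens. When True and no tokenizer is given, the default Qwen tokenizer is loaded; if loading fails, the operator falls back to character counts.

  • tokenizer

    Optional tokenizer used for counting; when None and tokenize=True, the default tokenizer is loaded automatically.

  • **kwargs

    Other base-class arguments (such as _concurrency_mode and _max_workers).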

Examples:

from lazyllm.tools.data import Text2qa

op = Text2qa.TextToChunks(input_key='content', output_key='chunk', chunk_size=10, tokenize=False)
data = [{'content': 'line1\nline2\nline3\nline4'}]
res = op(data)
print(res)
# [{'content': 'line1\nline2\nline3\nline4', 'chunk': 'line1\nline2'},
#  {'content': 'line1\nline2\nline3\nline4', 'chunk': 'line3\nline4'}]
Source code in lazyllm/tools/data/operators/text2qa_ops.py
class TextToChunks(Text2qa):
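    r"""Splits the input text into chunks along line boundaries; one input record can expand into several output records. Chunk size is bounded by token count when a tokenizer is used and by character count otherwise.

Args:
    input_key (str): Name of the input text field. Defaults to 'content'.
    output_key (str): Name of the field the chunk text is written to. Defaults to 'chunk'.
    chunk_size (int): Maximum length of each chunk (in tokens or characters). Defaults to 10.
    tokenize (bool): Whether to count tokens. When True and no tokenizer is given, the default Qwen tokenizer is loaded.
    tokenizer: Optional tokenizer used for counting; when None and tokenize=True, the default tokenizer is loaded automatically.
    **kwargs: Other base-class arguments (such as _concurrency_mode and _max_workers).


Examples:
    ```python
    from lazyllm.tools.data import Text2qa

    op = Text2qa.TextToChunks(input_key='content', output_key='chunk', chunk_size=10, tokenize=False)
    data = [{'content': 'line1\nline2\nline3\nline4'}]
    res = op(data)
    print(res)
    # [{'content': 'line1\nline2\nline3\nline4', 'chunk': 'line1\nline2'},
    #  {'content': 'line1\nline2\nline3\nline4', 'chunk': 'line3\nline4'}]
    ```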
    """
    def __init__(self,
                 input_key='content',
                 output_key='chunk',
                 chunk_size=10,
                 tokenize=True,
                 tokenizer=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.input_key = input_key
        self.output_key = output_key
        self.chunk_size = chunk_size
        self.tokenizer = tokenizer
        if tokenize and tokenizer is None:
            LOG.warning(
                f'tokenize=True but tokenizer is None, '
                f'loading tokenizer from default model: {DEFAULT_TOKENIZER}'
            )
            try:
                self.tokenizer = transformers.AutoTokenizer.from_pretrained(
                    DEFAULT_TOKENIZER,
                    trust_remote_code=True
                )
                self.tokenize = True
            except Exception as e:
                LOG.warning(
                    f'failed to load tokenizer from {DEFAULT_TOKENIZER}, '
                    f'falling back to char count, error: {e}'
                )
                self.tokenize = False
                self.tokenizer = None
        else:
            self.tokenizer = tokenizer
            self.tokenize = tokenize

    def _get_len(self, text: str):
        if self.tokenize:
            return len(
                self.tokenizer.encode(text, add_special_tokens=False)
            )
        return len(text)

    def forward(self, data: dict):
        text = data.get(self.input_key, '')
        if not text:
            return []

        lines = [line.strip() for line in text.split('\n') if line.strip()]

        chunks = []
        cur_parts = []
        cur_len = 0

        for line in lines:
            l_len = self._get_len(line)
            if cur_len + l_len <= self.chunk_size:
                cur_parts.append(line)
                cur_len += l_len
            else:
                if cur_parts:
                    chunks.append('\n'.join(cur_parts))
                cur_parts = [line]
                cur_len = l_len

        if cur_parts:
            chunks.append('\n'.join(cur_parts))

        results = []
        for c in chunks:
            item = data.copy()
            item[self.output_key] = c
            results.append(item)

        return results
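
The packing loop in forward is a greedy, line-preserving fill. The stand-alone sketch below reproduces it with a pluggable length function; `pack_lines` is a hypothetical name, and the word-count lambda is only a stand-in for a real tokenizer.

```python
from typing import Callable, List


def pack_lines(text: str, chunk_size: int, length: Callable[[str], int] = len) -> List[str]:
    # Blank lines are dropped and surrounding whitespace is stripped, as in forward().
    lines = [ln.strip() for ln in text.split('\n') if ln.strip()]
    chunks, parts, used = [], [], 0
    for ln in lines:
        n = length(ln)
        if used + n <= chunk_size:
            parts.append(ln)
            used += n
        else:
            if parts:
                chunks.append('\n'.join(parts))
            # Start a new chunk with this line, even if it alone exceeds the limit.
            parts, used = [ln], n
    if parts:
        chunks.append('\n'.join(parts))
    return chunks


print(pack_lines('line1\nline2\nline3\nline4', chunk_size=10))
# ['line1\nline2', 'line3\nline4']
print(pack_lines('short\na line longer than the limit\nend', chunk_size=10))
# ['short', 'a line longer than the limit', 'end']
print(pack_lines('a b c\nd e\nf', chunk_size=3, length=lambda s: len(s.split())))
# ['a b c', 'd e\nf']
```

Two details follow from the loop: the newline characters that join lines are not counted toward the limit, and an oversized line is emitted whole instead of being split.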

empty_or_noise_filter(data, input_key='chunk')

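Filters out empty or pure-noise records. If the given field is empty, or contains no word character (letter, digit, or underscore) and no CJK ideograph in U+4E00–U+9FFF, the record is dropped (an empty list is returned); otherwise the original record is kept. Registered as a per-record forward function.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'chunk' ) –

    Name of the field to check. Defaults to 'chunk'.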

Examples:

from lazyllm.tools.data import Text2qa

op = Text2qa.empty_or_noise_filter(input_key='chunk')
data = [{'chunk': 'hello'}, {'chunk': ''}, {'chunk': '\n'}]
res = op(data)
print(res)
# [{'chunk': 'hello'}]
Source code in lazyllm/tools/data/operators/text2qa_ops.py
@data_register('data.Text2qa', rewrite_func='forward', _concurrency_mode='process')
def empty_or_noise_filter(data: dict, input_key='chunk'):
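    r"""Filters out empty or pure-noise records. If the given field is empty, or contains no word character (letter, digit, or underscore) and no CJK ideograph in U+4E00–U+9FFF, the record is dropped (an empty list is returned); otherwise the original record is kept. Registered as a per-record forward function.

Args:
    data (dict): A single data record.
    input_key (str): Name of the field to check. Defaults to 'chunk'.


Examples:
    ```python
    from lazyllm.tools.data import Text2qa

    op = Text2qa.empty_or_noise_filter(input_key='chunk')
    data = [{'chunk': 'hello'}, {'chunk': ''}, {'chunk': '\n'}]
    res = op(data)
    print(res)
    # [{'chunk': 'hello'}]
    ```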
    """
    text = data.get(input_key, '')
    if not text:
        return []

    if not re.search(r'[\w\u4e00-\u9fff]', text):
        return []

    return data
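
To see which strings the pattern above treats as noise, the check can be run on its own. This is a sketch using only the standard `re` module; `is_noise` and the sample strings are illustrative and not part of the library.

```python
import re

# Same character class as in the operator: any word character or CJK ideograph.
HAS_CONTENT = re.compile(r'[\w\u4e00-\u9fff]')


def is_noise(text: str) -> bool:
    # Empty text, or text without a single word/CJK character, counts as noise.
    return not text or HAS_CONTENT.search(text) is None


samples = ['hello', '', '\n', '!!! ???', '你好', '42', '___']
print([s for s in samples if not is_noise(s)])
# ['hello', '你好', '42', '___']
```

Digits and underscores count as content because `\w` matches them, so a chunk such as `'___'` passes the filter. For `str` patterns in Python 3, `\w` is Unicode-aware and already matches CJK ideographs; the explicit range makes the intent visible.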

invalid_unicode_cleaner(data, input_key='chunk')

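Removes Unicode noncharacters from the given text field: U+FDD0–U+FDEF and the last two code points (U+xFFFE and U+xFFFF) of every plane, from the BMP through plane 16. The record is modified in place and returned. Registered as a per-record forward function.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'chunk' ) –

    Name of the text field to clean. Defaults to 'chunk'.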

Examples:

from lazyllm.tools.data import Text2qa

op = Text2qa.invalid_unicode_cleaner(input_key='chunk')
data = {'chunk': 'valid text\uFFFE tail'}
res = op(data)  # strips the noncharacter U+FFFE
print(res)
# [{'chunk': 'valid text tail'}]
Source code in lazyllm/tools/data/operators/text2qa_ops.py
@data_register('data.Text2qa', rewrite_func='forward', _concurrency_mode='process')
def invalid_unicode_cleaner(data: dict, input_key='chunk'):
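    r"""Removes Unicode noncharacters from the given text field: U+FDD0–U+FDEF and the last two code points (U+xFFFE and U+xFFFF) of every plane, from the BMP through plane 16. The record is modified in place and returned. Registered as a per-record forward function.

Args:
    data (dict): A single data record.
    input_key (str): Name of the text field to clean. Defaults to 'chunk'.


Examples:
    ```python
    from lazyllm.tools.data import Text2qa

    op = Text2qa.invalid_unicode_cleaner(input_key='chunk')
    data = {'chunk': 'valid text\uFFFE tail'}
    res = op(data)  # strips the noncharacter U+FFFE
    print(res)
    # [{'chunk': 'valid text tail'}]
    ```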
    """
    text = data.get(input_key, '')
    if not text:
        return data

    text = re.sub(
        r'[\uFDD0-\uFDEF\uFFFE\uFFFF'
        r'\U0001FFFE\U0001FFFF'
        r'\U0002FFFE\U0002FFFF'
        r'\U0003FFFE\U0003FFFF'
        r'\U0004FFFE\U0004FFFF'
        r'\U0005FFFE\U0005FFFF'
        r'\U0006FFFE\U0006FFFF'
        r'\U0007FFFE\U0007FFFF'
        r'\U0008FFFE\U0008FFFF'
        r'\U0009FFFE\U0009FFFF'
        r'\U000AFFFE\U000AFFFF'
        r'\U000BFFFE\U000BFFFF'
        r'\U000CFFFE\U000CFFFF'
        r'\U000DFFFE\U000DFFFF'
        r'\U000EFFFE\U000EFFFF'
        r'\U000FFFFE\U000FFFFF'
        r'\U0010FFFE\U0010FFFF]',
        '',
        text
    )

    data[input_key] = text
    return data
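
The long character class above enumerates the 66 Unicode noncharacters by hand. As an illustration, the same set can be generated programmatically; this sketch is an equivalent formulation and not the library's code, and `strip_noncharacters` is a hypothetical name.

```python
import re

# U+FDD0..U+FDEF (32 code points) ...
NONCHARACTERS = [chr(cp) for cp in range(0xFDD0, 0xFDF0)]
# ... plus U+xFFFE and U+xFFFF on each of the 17 planes (34 code points).
for plane in range(17):
    NONCHARACTERS += [chr(plane * 0x10000 + 0xFFFE), chr(plane * 0x10000 + 0xFFFF)]

# None of these characters is special inside a character class, so no escaping is needed.
NONCHAR_RE = re.compile('[' + ''.join(NONCHARACTERS) + ']')


def strip_noncharacters(text: str) -> str:
    return NONCHAR_RE.sub('', text)


print(len(NONCHARACTERS))  # 66
print(strip_noncharacters('valid text\uFFFE tail'))  # valid text tail
print(strip_noncharacters('a\U0010FFFFb\uFDD0c'))  # abc
```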

Embedding data synthesis

lazyllm.tools.data.operators.embedding_synthesis

EmbeddingFormatFlagEmbedding

Bases: embedding

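Operator that formats data into the FlagEmbedding training format.

It formats the input query, pos (positive samples), and neg (negative samples) into the training data layout expected by the FlagEmbedding framework, and can attach an instruction for supervised embedding training. Records with an empty query or no positives are dropped.

Parameters:

  • instruction (str, default: None ) –

    Instruction text for supervised training, written to the prompt field. Defaults to None.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    A dict with query, pos, neg, and an optional prompt field.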

Examples:

from lazyllm.tools.data import embedding

op = embedding.EmbeddingFormatFlagEmbedding(instruction='Represent this sentence for searching relevant passages:')
result = op({'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe']})
# Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe'], 'prompt': 'Represent this sentence for searching relevant passages:'}
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py
class EmbeddingFormatFlagEmbedding(embedding):
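    """Operator that formats data into the FlagEmbedding training format.

It formats the input query, pos (positive samples), and neg (negative samples) into the training data layout expected by the FlagEmbedding framework,
and can attach an instruction for supervised embedding training.

Args:
    instruction (str, optional): Instruction text for supervised training, written to the prompt field. Defaults to None.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: A dict with query, pos, neg, and an optional prompt field.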


Examples:
    ```python
    from lazyllm.tools.data import embedding

    op = embedding.EmbeddingFormatFlagEmbedding(instruction='Represent this sentence for searching relevant passages:')
    result = op({'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe']})
    # Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking recipe'], 'prompt': 'Represent this sentence for searching relevant passages:'}
    ```
    """
    def __init__(self, instruction: Optional[str] = None, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.instruction = instruction

    def forward(self, data: dict) -> dict:
        query = data.get('query', '')
        pos = data.get('pos', [])
        neg = data.get('neg', [])

        if not query or not pos:
            return []

        # Ensure pos and neg are lists
        if not isinstance(pos, list):
            pos = [pos]
        if not isinstance(neg, list):
            neg = [neg] if neg else []

        result = {
            'query': query,
            'pos': pos,
            'neg': neg,
        }
        if self.instruction:
            result['prompt'] = self.instruction

        return result
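
A sketch of how records shaped like this might be serialized follows. It assumes pos and neg are already lists, and it writes JSON Lines, a storage step that is an assumption of this example and is not performed by the operator; `to_flag_record` is a hypothetical helper.

```python
import json
from typing import List, Optional


def to_flag_record(query: str, pos: List[str], neg: List[str],
                   instruction: Optional[str] = None) -> Optional[dict]:
    if not query or not pos:
        return None  # the operator returns [] in this case, which drops the record
    record = {'query': query, 'pos': pos, 'neg': neg}
    if instruction:
        record['prompt'] = instruction
    return record


rows = [
    ('machine learning', ['ML tutorial'], ['cooking recipe']),
    ('', ['orphan passage'], []),  # no query: dropped
]
records = [r for r in (to_flag_record(*row, instruction='Represent this sentence:') for row in rows) if r]

# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
jsonl = '\n'.join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)
# {"query": "machine learning", "pos": ["ML tutorial"], "neg": ["cooking recipe"], "prompt": "Represent this sentence:"}
```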

EmbeddingFormatSentenceTransformers

Bases: embedding

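Operator that formats data into the SentenceTransformers triplet training format.

It converts the input query, pos (positive samples), and neg (negative samples) into the anchor-positive-negative triplets expected by the SentenceTransformers framework. Suitable for training with loss functions such as MultipleNegativesRankingLoss.

Parameters:

  • **kwargs (dict, default: {} ) –

    Optional arguments, passed to the parent class.

Returns:

  • List[dict]: A list of dicts with anchor, positive, and negative fields; one triplet is generated for each (positive, negative) pair.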

Examples:

from lazyllm.tools.data import embedding

op = embedding.EmbeddingFormatSentenceTransformers()
result = op({'query': 'machine learning', 'pos': ['ML basics'], 'neg': ['cooking tips']})
# Returns: [{'anchor': 'machine learning', 'positive': 'ML basics', 'negative': 'cooking tips'}]
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py
class EmbeddingFormatSentenceTransformers(embedding):
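    """Operator that formats data into the SentenceTransformers triplet training format.

It converts the input query, pos (positive samples), and neg (negative samples) into the anchor-positive-negative triplets expected by the SentenceTransformers framework.
Suitable for training with loss functions such as MultipleNegativesRankingLoss.

Args:
    **kwargs (dict): Optional arguments, passed to the parent class.

Returns:
    List[dict]: A list of dicts with anchor, positive, and negative fields; one triplet is generated for each (positive, negative) pair.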


Examples:
    ```python
    from lazyllm.tools.data import embedding

    op = embedding.EmbeddingFormatSentenceTransformers()
    result = op({'query': 'machine learning', 'pos': ['ML basics'], 'neg': ['cooking tips']})
    # Returns: [{'anchor': 'machine learning', 'positive': 'ML basics', 'negative': 'cooking tips'}]
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)

    def forward(self, data: dict) -> List[dict]:
        query = data.get('query', '')
        pos = data.get('pos', [])
        neg = data.get('neg', [])

        if not query or not pos:
            return []

        # Ensure pos and neg are lists
        pos_list = pos if isinstance(pos, list) else [pos]
        neg_list = neg if isinstance(neg, list) else [neg] if neg else []

        # Create anchor-positive-negative triplets
        results = []
        for p in pos_list:
            for n in neg_list:
                results.append(
                    {
                        'anchor': query,
                        'positive': p,
                        'negative': n,
                    }
                )

        return results
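
The nested loops above form the Cartesian product of positives and negatives. An equivalent sketch with `itertools.product` makes the output size explicit; `to_triplets` is a hypothetical helper that assumes list inputs.

```python
from itertools import product
from typing import List


def to_triplets(query: str, pos: List[str], neg: List[str]) -> List[dict]:
    # product(pos, neg) iterates positives in the outer loop, negatives in the inner loop.
    return [{'anchor': query, 'positive': p, 'negative': n} for p, n in product(pos, neg)]


triplets = to_triplets('deep learning', ['neural networks', 'AI'], ['history', 'geography'])
print(len(triplets))  # 4
print(triplets[1])
# {'anchor': 'deep learning', 'positive': 'neural networks', 'negative': 'geography'}
print(to_triplets('machine learning', ['ML basics'], []))  # []
```

The last call shows a consequence worth knowing: with an empty negative list the product is empty, so a record without negatives yields no triplets at all.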

EmbeddingFormatTriplet

Bases: embedding

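Operator that formats data into a generic triplet format.

It converts the input query, pos (positive samples), and neg (negative samples) into standard triplets with the field names query, positive, and negative. The format works with a range of embedding training frameworks.

Parameters:

  • **kwargs (dict, default: {} ) –

    Optional arguments, passed to the parent class.

Returns:

  • List[dict]: A list of dicts with query, positive, and negative fields; one triplet is generated for each (positive, negative) pair.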

Examples:

from lazyllm.tools.data import embedding

op = embedding.EmbeddingFormatTriplet()
result = op({'query': 'deep learning', 'pos': ['neural networks', 'AI'], 'neg': ['history', 'geography']})
# Returns 4 triplets: each of the 2 positives paired with each of the 2 negatives
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py
class EmbeddingFormatTriplet(embedding):
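    """Operator that formats data into a generic triplet format.

It converts the input query, pos (positive samples), and neg (negative samples) into standard triplets
with the field names query, positive, and negative. The format works with a range of embedding training frameworks.

Args:
    **kwargs (dict): Optional arguments, passed to the parent class.

Returns:
    List[dict]: A list of dicts with query, positive, and negative fields; one triplet is generated for each (positive, negative) pair.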


Examples:
    ```python
    from lazyllm.tools.data import embedding

    op = embedding.EmbeddingFormatTriplet()
    result = op({'query': 'deep learning', 'pos': ['neural networks', 'AI'], 'neg': ['history', 'geography']})
    # Returns 4 triplets: each of the 2 positives paired with each of the 2 negatives
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)

    def forward(self, data: dict) -> List[dict]:
        query = data.get('query', '')
        pos = data.get('pos', [])
        neg = data.get('neg', [])

        if not query or not pos:
            return []

        # Ensure pos and neg are lists
        pos_list = pos if isinstance(pos, list) else [pos]
        neg_list = neg if isinstance(neg, list) else [neg] if neg else []

        # Create query-positive-negative triplets
        results = []
        for p in pos_list:
            for n in neg_list:
                results.append(
                    {
                        'query': query,
                        'positive': p,
                        'negative': n,
                    }
                )

        return results
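
The normalization of neg above relies on a chained conditional expression, which Python parses right to left. The sketch below isolates that expression; `as_list` is a hypothetical helper name.

```python
def as_list(value) -> list:
    # Parsed as: value if is-list else ([value] if value else [])
    return value if isinstance(value, list) else [value] if value else []


print(as_list(['a', 'b']))  # ['a', 'b']
print(as_list('single'))    # ['single']
print(as_list(''))          # []
print(as_list(None))        # []
print(as_list([]))          # []
```

A bare string becomes a one-element list, and any falsy scalar becomes an empty list. The positive side uses the simpler form `pos if isinstance(pos, list) else [pos]`, which is safe because empty positives were already rejected.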

EmbeddingGenerateQueries

Bases: embedding

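Operator that generates queries with an LLM.

It reads a passage from the input record (the 'passage' field by default), builds a prompt from it, and calls the language model service. The response is stored as a JSON string; if the passage is empty or the call fails, the response is an empty string.

Parameters:

  • llm

    LLM service instance used to generate queries.

  • num_queries (int, default: 3 ) –

    Number of queries to generate. Defaults to 3.

  • lang (str, default: 'zh' ) –

    Language: 'zh' for Chinese, 'en' for English. Defaults to 'zh'.

  • query_types (List[str], default: None ) –

    List of query types. Defaults to ['factual', 'semantic', 'inferential'].

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input data with an added '_query_response' field holding the generated query response.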

Examples:

from lazyllm.tools.data import embedding

# Assuming llm is an LLM service instance
generator = embedding.EmbeddingGenerateQueries(llm=llm, lang='zh')
data = {'passage': 'machine learning tutorial'}  # forward reads the 'passage' field by default
result = generator(data)
# Returns data with '_query_response' field containing JSON queries
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_query_generator.py
class EmbeddingGenerateQueries(embedding):
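    """Operator that generates queries with an LLM.

It reads a passage from the input record, builds a prompt from it, and calls the language model service. The response is stored as a JSON string.

Args:
    llm: LLM service instance used to generate queries.
    num_queries (int): Number of queries to generate. Defaults to 3.
    lang (str): Language: 'zh' for Chinese, 'en' for English. Defaults to 'zh'.
    query_types (List[str], optional): List of query types. Defaults to ['factual', 'semantic', 'inferential'].
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input data with an added '_query_response' field holding the generated query response.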


Examples:
    ```python
    from lazyllm.tools.data import embedding

    # Assuming llm is an LLM service instance
    generator = embedding.EmbeddingGenerateQueries(llm=llm, lang='zh')
    data = {'passage': 'machine learning tutorial'}  # forward reads the 'passage' field by default
    result = generator(data)
    # Returns data with '_query_response' field containing JSON queries
    ```
    """
    def __init__(
        self,
        llm=None,
        num_queries: int = 3,
        lang: str = 'zh',
        query_types: Optional[List[str]] = None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompt_template = EmbeddingQueryGeneratorPrompt(lang=lang)
        self.num_queries = num_queries
        self.query_types = query_types or ['factual', 'semantic', 'inferential']
        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = (
                llm.share()
                .prompt(system_prompt)
                .formatter(JsonFormatter())
            )
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(
        self,
        data: dict,
        input_key: str = 'passage',
        **kwargs,
    ) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        passage = data.get(input_key, '')
        if not passage:
            return {**data, '_query_response': ''}

        user_prompt = self.prompt_template.build_prompt(
            passage=passage,
            num_queries=self.num_queries,
            query_types=self.query_types,
        )
        if not user_prompt:
            return {**data, '_query_response': ''}

        try:
            result = self._llm_serve(user_prompt)

            if isinstance(result, str):
                response = result
            else:
                response = json.dumps(result, ensure_ascii=False)

            return {**data, '_query_response': response}

        except Exception as e:
            LOG.warning(f'Failed to generate queries: {e}')
            return {**data, '_query_response': ''}
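
The response handling in forward normalizes whatever the model returns into a string. A minimal sketch of that step follows; `to_response_text` is a hypothetical helper, and the list-of-dicts reply shape is an assumption made for illustration, not the documented output of the prompt template.

```python
import json


def to_response_text(result) -> str:
    # Strings pass through unchanged; parsed objects are re-serialized as JSON.
    if isinstance(result, str):
        return result
    return json.dumps(result, ensure_ascii=False)


parsed = [{'query': '什么是机器学习?', 'type': 'factual'}]
text = to_response_text(parsed)
print(text)  # [{"query": "什么是机器学习?", "type": "factual"}]
print(json.loads(text) == parsed)  # True
print(to_response_text('already a string'))  # already a string
```

`ensure_ascii=False` matters for the default `lang='zh'`: without it, every Chinese character would be written as a `\uXXXX` escape, which round-trips correctly but is hard to read in saved files.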

EmbeddingInitBM25

Bases: embedding

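Operator that initializes a BM25 index.

It builds a BM25 index over the corpus for later keyword retrieval and hard negative mining. Chinese and English are supported: jieba segments Chinese text, and Stemmer stems English words.

Parameters:

  • language (str, default: 'zh' ) –

    Language: 'zh' for Chinese, 'en' for English. Defaults to 'zh'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The input data, with the BM25 index and related configuration added to each record.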

Examples:

from lazyllm.tools.data import embedding

# First build corpus, then initialize BM25
corpus_op = embedding.build_embedding_corpus(input_pos_key='pos')
bm25_op = embedding.EmbeddingInitBM25(language='zh')
# Returns data with '_bm25' index and tokenizer configuration
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py
class EmbeddingInitBM25(embedding):
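    """Operator that initializes a BM25 index.

It builds a BM25 index over the corpus for later keyword retrieval and hard negative mining.
Chinese and English are supported: jieba segments Chinese text, and Stemmer stems English words.

Args:
    language (str): Language: 'zh' for Chinese, 'en' for English. Defaults to 'zh'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The input data, with the BM25 index and related configuration added to each record.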


Examples:
    ```python
    from lazyllm.tools.data import embedding

    # First build corpus, then initialize BM25
    corpus_op = embedding.build_embedding_corpus(input_pos_key='pos')
    bm25_op = embedding.EmbeddingInitBM25(language='zh')
    # Returns data with '_bm25' index and tokenizer configuration
    ```
    """

    def __init__(self, language: str = 'zh', **kwargs):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.language = language
        self._setup_tokenizer(language)

    def _setup_tokenizer(self, language: str):
        if language == 'en':
            self._stemmer = Stemmer.Stemmer('english')
            self._stopwords = language
            self._tokenizer = lambda t: t
        elif language == 'zh':
            self._stemmer = None
            self._stopwords = STOPWORDS_CHINESE
            self._tokenizer = lambda t: ' '.join(jieba.lcut(t))
        else:
            self._stemmer = None
            self._stopwords = None
            self._tokenizer = lambda t: t

    def forward_batch_input(self, inputs: List[dict], **kwargs) -> List[dict]:
        if not inputs:
            return inputs

        # Load corpus from file path instead of memory
        corpus_path = inputs[0].get('_corpus', '')
        if not corpus_path:
            LOG.warning('No corpus path found for BM25 initialization.')
            return [
                {**item, '_bm25': None, '_bm25_corpus': []}
                for item in inputs
            ]

        corpus = _load_corpus_from_path(corpus_path)
        if not corpus:
            LOG.warning(f'Failed to load corpus from {corpus_path}')
            return [
                {**item, '_bm25': None, '_bm25_corpus': []}
                for item in inputs
            ]

        LOG.info(f'Initializing BM25 index for {len(corpus)} documents...')

        corpus_tokens = bm25s.tokenize(
            [self._tokenizer(doc) for doc in corpus],
            stopwords=self._stopwords,
            stemmer=self._stemmer,
        )

        bm25_index = bm25s.BM25()
        bm25_index.index(corpus_tokens)

        LOG.info('BM25 index initialized.')

        return [
            {
                **item,
                '_bm25': bm25_index,
                '_bm25_corpus': corpus,
                '_bm25_tokenizer': self._tokenizer,
                '_bm25_stopwords': self._stopwords,
                '_bm25_stemmer': self._stemmer,
            }
            for item in inputs
        ]
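
The operator delegates indexing to bm25s. To show what a BM25 index computes, here is a plain-Python sketch of the Okapi BM25 formula with a Lucene-style IDF. It is an illustration only, not the bm25s implementation; `bm25_scores`, the parameter defaults, and the toy corpus are assumptions of this example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    'machine learning tutorial for beginners'.split(),
    'easy cooking recipe for pasta'.split(),
    'deep learning and machine learning basics'.split(),
]
scores = bm25_scores('machine learning'.split(), corpus)
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranked)  # [2, 0, 1]
```

For hard negative mining, the highest-ranked documents that are not among a query's positives are the natural candidates, since they share keywords with the query without being labeled relevant.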

EmbeddingInitSemantic

Bases: embedding

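Operator that initializes semantic embedding vectors.

It uses an embedding service to compute a vector for every document in the corpus and saves the vectors to a file, for later semantic similarity computation and hard negative mining.

Parameters:

  • embedding_serving (Callable, default: None ) –

    Callable embedding service used to compute text vectors.

  • embeddings_dir (str, default: None ) –

    Directory where the vector file is saved. Defaults to the directory containing the corpus.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The input data, with the semantic vector file path and corpus information added to each record.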

Examples:

from lazyllm.tools.data import embedding

# Assuming my_embedding_fn is an embedding service
semantic_op = embedding.EmbeddingInitSemantic(embedding_serving=my_embedding_fn)
# Returns data with '_semantic_embeddings_path' pointing to saved embeddings
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py
class EmbeddingInitSemantic(embedding):
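    """Operator that initializes semantic embedding vectors.

It uses an embedding service to compute a vector for every document in the corpus and saves the vectors to a file,
for later semantic similarity computation and hard negative mining.

Args:
    embedding_serving (Callable): Callable embedding service used to compute text vectors.
    embeddings_dir (str, optional): Directory where the vector file is saved. Defaults to the directory containing the corpus.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The input data, with the semantic vector file path and corpus information added to each record.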


Examples:
    ```python
    from lazyllm.tools.data import embedding

    # Assuming my_embedding_fn is an embedding service
    semantic_op = embedding.EmbeddingInitSemantic(embedding_serving=my_embedding_fn)
    # Returns data with '_semantic_embeddings_path' pointing to saved embeddings
    ```
    """

    def __init__(
        self,
        embedding_serving: Optional[Callable] = None,
        embeddings_dir: Optional[str] = None,
        **kwargs,
    ):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.embedding_serving = embedding_serving
        self.embeddings_dir = embeddings_dir

    def forward_batch_input(self, inputs: List[dict], **kwargs) -> List[dict]:
        if not inputs:
            return inputs

        # Load corpus from file path instead of memory
        corpus_path = inputs[0].get('_corpus', '')
        if not corpus_path:
            LOG.warning('No corpus path found for semantic initialization.')
            return [
                {
                    **item,
                    '_semantic_embeddings_path': '',
                    '_semantic_corpus': [],
                }
                for item in inputs
            ]

        # Verify all inputs share the same corpus path for consistency
        if not all(item.get('_corpus') == corpus_path for item in inputs):
            LOG.warning('Not all inputs share the same corpus path. Using corpus from first item.')

        corpus = _load_corpus_from_path(corpus_path)
        if not corpus or self.embedding_serving is None:
            LOG.warning(
                'No corpus or embedding_serving for semantic initialization.'
            )
            return [
                {
                    **item,
                    '_semantic_embeddings_path': '',
                    '_semantic_corpus': corpus or [],
                }
                for item in inputs
            ]

        LOG.info(f'Computing embeddings for {len(corpus)} documents...')
        embeddings = np.array(self.embedding_serving(corpus))
        LOG.info('Embeddings computed.')

        # Save embeddings to file instead of storing in memory for each item
        if self.embeddings_dir is None:
            embeddings_dir = os.path.dirname(corpus_path)
        else:
            embeddings_dir = self.embeddings_dir
        os.makedirs(embeddings_dir, exist_ok=True)

        embeddings_path = os.path.join(
            embeddings_dir, f'embeddings_{id(inputs)}.npy'
        )
        np.save(embeddings_path, embeddings)
        LOG.info(f'Saved embeddings to {embeddings_path}')

        return [
            {
                **item,
                '_semantic_embeddings_path': embeddings_path,
                '_semantic_corpus': corpus,
            }
            for item in inputs
        ]

EmbeddingMineSemanticNegatives

Bases: embedding

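Operator that mines hard negatives using semantic similarity.

This operator uses embedding similarity to select the documents most similar to the query that are not among its positives, and returns them as negatives. It suits mining hard negatives that are semantically close to the query but actually irrelevant, and this usually works better than the BM25 approach.

Parameters:

  • num_negatives (int, default: 7 ) –

    Number of negatives to mine. Defaults to 7.

  • embedding_serving (Callable, default: None ) –

    Embedding service callable used to compute the query vector.

  • **kwargs (dict, default: {} ) –

    Additional optional arguments passed to the parent class.

Returns:

  • dict

    The input data with a list of negatives mined by semantic similarity added.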

Examples:

from lazyllm.tools.data import embedding

# Assuming embeddings are initialized
semantic_miner = embedding.EmbeddingMineSemanticNegatives(num_negatives=5, embedding_serving=my_embedding_fn)
data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
result = semantic_miner(data)
# Returns data with 'neg' field containing semantically similar negative samples
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_hard_negative_miner.py
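class EmbeddingMineSemanticNegatives(embedding):
    """Operator that mines hard negatives using semantic similarity.

This operator uses embedding similarity to select the documents most similar to the query that are not among its positives, and returns them as negatives.
It suits mining hard negatives that are semantically close to the query but actually irrelevant, and this usually works better than the BM25 approach.

Args:
    num_negatives (int): Number of negatives to mine. Defaults to 7.
    embedding_serving (Callable): Embedding service callable used to compute the query vector.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: The input data with a list of negatives mined by semantic similarity added.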


Examples:
    ```python
    from lazyllm.tools.data import embedding

    # Assuming embeddings are initialized
    semantic_miner = embedding.EmbeddingMineSemanticNegatives(num_negatives=5, embedding_serving=my_embedding_fn)
    data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
    result = semantic_miner(data)
    # Returns data with 'neg' field containing semantically similar negative samples
    ```
    """

    def __init__(
        self,
        num_negatives: int = 7,
        embedding_serving: Optional[Callable] = None,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_negatives = num_negatives
        self.embedding_serving = embedding_serving

    @staticmethod
    def _cosine_similarity(
        query_emb: np.ndarray,
        corpus_embs: np.ndarray,
    ) -> np.ndarray:
        query_norm = np.linalg.norm(query_emb)
        if query_norm > 0:
            query_emb = query_emb / query_norm

        corpus_norms = np.linalg.norm(
            corpus_embs,
            axis=1,
            keepdims=True,
        )
        corpus_norms = np.where(corpus_norms > 0, corpus_norms, 1)
        corpus_normalized = corpus_embs / corpus_norms

        return np.dot(corpus_normalized, query_emb)

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs,
    ) -> dict:
        # Load embeddings from file path
        embeddings_path = data.get('_semantic_embeddings_path', '')
        corpus_embeddings = _load_embeddings_from_path(embeddings_path)
        corpus = data.get('_semantic_corpus') or []

        if corpus_embeddings is None:
            LOG.warning('Semantic embeddings not initialized.')
            return {**data, output_neg_key: []}

        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        if self.embedding_serving is None:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)

        query_embedding = np.array(
            self.embedding_serving([query])[0]
        )

        similarities = self._cosine_similarity(
            query_embedding,
            corpus_embeddings,
        )

        scored_docs = [
            (sim, doc)
            for sim, doc in zip(similarities, corpus)
            if doc not in pos_set
        ]

        scored_docs.sort(key=lambda x: x[0], reverse=True)

        negatives = [
            doc for _, doc in scored_docs[: self.num_negatives]
        ]

        return {**data, output_neg_key: negatives}
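
The ranking step in `forward` can be tried out without an embedding service. The following is a minimal sketch in pure Python, assuming tiny hand-written vectors in place of real embeddings; `cosine` and `mine_negatives` are hypothetical helper names used only for illustration and are not part of lazyllm.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mine_negatives(query_emb, corpus, corpus_embs, positives, num_negatives):
    """Return the num_negatives documents most similar to the query,
    skipping every document listed as a positive."""
    scored = [
        (cosine(query_emb, emb), doc)
        for doc, emb in zip(corpus, corpus_embs)
        if doc not in positives
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_negatives]]


# Toy data: 2-dimensional "embeddings" invented for this example.
corpus = ['ML tutorial', 'deep learning intro', 'cooking pasta', 'statistics basics']
corpus_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]
query_emb = [1.0, 0.0]

negatives = mine_negatives(query_emb, corpus, corpus_embs, {'ML tutorial'}, 2)
print(negatives)  # ['deep learning intro', 'statistics basics']
```

The positive document would otherwise rank first, so it has to be filtered out before the top-k cut rather than after it, or fewer than `num_negatives` results could be returned.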

EmbeddingParseQueries

Bases: embedding

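Operator that parses generated queries.

This operator parses the query response generated by an LLM and expands each query into its own data record.

Parameters:

  • input_key (str, default: 'passage' ) –

    Input field name. Defaults to 'passage'.

  • output_query_key (str, default: 'query' ) –

    Output query field name. Defaults to 'query'.

  • **kwargs (dict, default: {} ) –

    Additional optional arguments passed to the parent class.

Returns:

  • List[dict]: The parsed queries, each one as a separate data record.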

Examples:

from lazyllm.tools.data import embedding

parser = embedding.EmbeddingParseQueries(input_key='passage', output_query_key='query')
data = {'_query_response': '[{"query": "what is ML?", "type": "factual"}]', 'passage': 'Machine learning is...'}
result = parser(data)
# Returns list of expanded query records with 'query' and 'pos' fields
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_query_generator.py
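class EmbeddingParseQueries(embedding):
    """Operator that parses generated queries.

This operator parses the query response generated by an LLM and expands each query into its own data record.

Args:
    input_key (str): Input field name. Defaults to 'passage'.
    output_query_key (str): Output query field name. Defaults to 'query'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: The parsed queries, each one as a separate data record.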


Examples:
    ```python
    from lazyllm.tools.data import embedding

    parser = embedding.EmbeddingParseQueries(input_key='passage', output_query_key='query')
    data = {'_query_response': '[{"query": "what is ML?", "type": "factual"}]', 'passage': 'Machine learning is...'}
    result = parser(data)
    # Returns list of expanded query records with 'query' and 'pos' fields
    ```
    """
    def __init__(
        self,
        input_key: str = 'passage',
        output_query_key: str = 'query',
        **kwargs,
    ):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.input_key = input_key
        self.output_query_key = output_query_key

    def forward(
        self,
        data: dict,
        **kwargs,
    ) -> List[dict]:
        response = data.get('_query_response', '')
        if not response:
            return []

        passage = data.get(self.input_key, '')
        expanded_rows = []

        try:
            parsed = json.loads(_clean_json_block(response))
            queries = (
                parsed if isinstance(parsed, list)
                else parsed.get('queries', [])
            )

            for query_item in queries:
                if isinstance(query_item, dict):
                    query = query_item.get('query', '')
                    query_type = query_item.get('type', 'unknown')
                else:
                    query = str(query_item)
                    query_type = 'unknown'

                if query.strip():
                    new_row = data.copy()
                    new_row[self.output_query_key] = query.strip()
                    new_row['query_type'] = query_type
                    new_row['pos'] = [passage]

                    new_row.pop('_query_prompt', None)
                    new_row.pop('_query_response', None)

                    expanded_rows.append(new_row)

        except Exception as e:
            LOG.warning(f'Failed to parse query response: {e}')
            return []

        return expanded_rows
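
The expansion logic can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: `strip_code_fence` is a hypothetical stand-in for the `_clean_json_block` helper (whose body is not shown on this page), and `expand_queries` is an assumed name.

```python
import json

FENCE = '`' * 3  # Markdown fence marker, built indirectly so this block stays fence-safe


def strip_code_fence(text):
    """Hypothetical cleaner: remove a Markdown code fence wrapped around JSON."""
    text = text.strip()
    if not text.startswith(FENCE):
        return text
    lines = text.splitlines()[1:]  # drop the opening fence line
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]  # drop the closing fence line
    return '\n'.join(lines)


def expand_queries(record, input_key='passage', output_query_key='query'):
    """Turn one record holding a JSON query list into one record per query."""
    try:
        parsed = json.loads(strip_code_fence(record.get('_query_response', '')))
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, list):
        queries = parsed
    elif isinstance(parsed, dict):
        queries = parsed.get('queries', [])
    else:
        queries = []

    rows = []
    for item in queries:
        if isinstance(item, dict):
            query, query_type = str(item.get('query', '')), item.get('type', 'unknown')
        else:
            query, query_type = str(item), 'unknown'
        if not query.strip():
            continue  # skip empty queries
        row = {k: v for k, v in record.items() if k != '_query_response'}
        row[output_query_key] = query.strip()
        row['query_type'] = query_type
        row['pos'] = [record.get(input_key, '')]  # the source passage is the positive
        rows.append(row)
    return rows


response = FENCE + 'json\n[{"query": "what is ML?", "type": "factual"}, "define overfitting", ""]\n' + FENCE
record = {'passage': 'Machine learning is...', '_query_response': response}
rows = expand_queries(record)
for row in rows:
    print(row)
```

One input record yields two output records here: the dict entry keeps its declared type, the bare string falls back to `'unknown'`, and the empty string is dropped.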

EmbeddingTrainTestSplitter

Bases: embedding

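Operator that splits a dataset into a training set and a test set.

This operator shuffles the input data randomly and splits it into a training set and a test set at the given ratio. It can save the split data to JSONL files, and accepts a key for stratified sampling.

Parameters:

  • test_size (float, default: 0.1 ) –

    Proportion of the test set. Defaults to 0.1 (that is, 10%).

  • seed (int, default: 42 ) –

    Random seed for reproducible splits. Defaults to 42.

  • stratify_key (str, default: None ) –

    Key name for stratified sampling. Defaults to None. The source listed on this page stores this value but does not use it when splitting.

  • train_output_file (str, default: None ) –

    Output file path for the training set. Defaults to None.

  • test_output_file (str, default: None ) –

    Output file path for the test set. Defaults to None.

  • **kwargs (dict, default: {} ) –

    Additional optional arguments passed to the parent class.

Returns:

  • List[dict]: All samples from the training and test sets, each with a 'split' field marking the set it belongs to.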

Examples:

from lazyllm.tools.data import embedding

op = embedding.EmbeddingTrainTestSplitter(test_size=0.2, seed=123, train_output_file='train.jsonl', test_output_file='test.jsonl')
data = [{'query': 'q1', 'pos': 'p1'}, {'query': 'q2', 'pos': 'p2'}, {'query': 'q3', 'pos': 'p3'}]
result = op(data)
# Returns all samples with 'split' field ('train' or 'test')
# Saves train data to train.jsonl and test data to test.jsonl
Source code in lazyllm/tools/data/operators/embedding_synthesis/embedding_data_formatter.py
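class EmbeddingTrainTestSplitter(embedding):
    """Operator that splits a dataset into a training set and a test set.

This operator shuffles the input data randomly and splits it into a training set and a test set at the given ratio.
It can save the split data to JSONL files, and accepts a key for stratified sampling.

Args:
    test_size (float): Proportion of the test set. Defaults to 0.1 (that is, 10%).
    seed (int): Random seed for reproducible splits. Defaults to 42.
    stratify_key (str, optional): Key name for stratified sampling. Defaults to None. The code below stores this value but does not use it when splitting.
    train_output_file (str, optional): Output file path for the training set. Defaults to None.
    test_output_file (str, optional): Output file path for the test set. Defaults to None.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    List[dict]: All samples from the training and test sets, each with a 'split' field marking the set it belongs to.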


Examples:
    ```python
    from lazyllm.tools.data import embedding

    op = embedding.EmbeddingTrainTestSplitter(test_size=0.2, seed=123, train_output_file='train.jsonl', test_output_file='test.jsonl')
    data = [{'query': 'q1', 'pos': 'p1'}, {'query': 'q2', 'pos': 'p2'}, {'query': 'q3', 'pos': 'p3'}]
    result = op(data)
    # Returns all samples with 'split' field ('train' or 'test')
    # Saves train data to train.jsonl and test data to test.jsonl
    ```
    """
    def __init__(
        self,
        test_size: float = 0.1,
        seed: int = 42,
        stratify_key: Optional[str] = None,
        train_output_file: Optional[str] = None,
        test_output_file: Optional[str] = None,
        **kwargs,
    ):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.test_size = test_size
        self.seed = seed
        self.stratify_key = stratify_key
        self.train_output_file = train_output_file
        self.test_output_file = test_output_file
        LOG.info(
            f'Initializing {self.__class__.__name__} with test_size: {test_size}'
        )

    def forward_batch_input(
        self,
        inputs: List[dict],
        **kwargs,
    ) -> List[dict]:
        assert isinstance(inputs, list), 'inputs must be a list of dict'

        LOG.info(
            f'Splitting {len(inputs)} samples with test_size={self.test_size}'
        )

        # Shuffle and split
        random.seed(self.seed)
        shuffled = inputs.copy()
        random.shuffle(shuffled)

        split_idx = int(len(shuffled) * (1 - self.test_size))
        train_data = shuffled[:split_idx]
        test_data = shuffled[split_idx:]

        # Add split labels
        for item in train_data:
            item['split'] = 'train'
        for item in test_data:
            item['split'] = 'test'

        LOG.info(
            f'Split completed: {len(train_data)} train, {len(test_data)} test'
        )

        if self.train_output_file:
            output_path = Path(self.train_output_file)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                for item in train_data:
                    item_copy = {
                        k: v for k, v in item.items() if k != 'split'
                    }
                    f.write(
                        json.dumps(item_copy, ensure_ascii=False) + '\n'
                    )
            LOG.info(f'Saved train data to {output_path}')

        if self.test_output_file:
            output_path = Path(self.test_output_file)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                for item in test_data:
                    item_copy = {
                        k: v for k, v in item.items() if k != 'split'
                    }
                    f.write(
                        json.dumps(item_copy, ensure_ascii=False) + '\n'
                    )
            LOG.info(f'Saved test data to {output_path}')

        return train_data + test_data
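
For readers who need stratification, the following is one possible sketch of a split that honors a stratify key. It is an assumption about how such a split could work, not lazyllm behavior; `stratified_split` is a hypothetical function name. It uses a local `random.Random` instance and returns new dicts, so neither the global random state nor the input records are modified.

```python
import random
from collections import defaultdict


def stratified_split(items, stratify_key, test_size=0.1, seed=42):
    """Split items so that each stratify_key group keeps roughly the same
    test proportion. Returns train records followed by test records."""
    rng = random.Random(seed)  # local generator, leaves the global state alone
    groups = defaultdict(list)
    for item in items:
        groups[item.get(stratify_key)].append(item)

    train, test = [], []
    for key in sorted(groups, key=repr):  # fixed group order for reproducibility
        members = groups[key][:]
        rng.shuffle(members)
        split_idx = int(len(members) * (1 - test_size))
        train.extend({**m, 'split': 'train'} for m in members[:split_idx])
        test.extend({**m, 'split': 'test'} for m in members[split_idx:])
    return train + test


data = [
    {'query': f'q{i}', 'query_type': 'factual' if i % 2 else 'keyword'}
    for i in range(20)
]
result = stratified_split(data, 'query_type', test_size=0.2, seed=123)
test_rows = [r for r in result if r['split'] == 'test']
print(len(result), len(test_rows))  # 20 4
```

With ten records per group and `test_size=0.2`, each group contributes exactly two test records, which a plain shuffle-and-cut does not guarantee.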

Knowledge Base Cleaning

lazyllm.tools.data.operators.knowledge_cleaning.file_or_url_to_markdown_converter_api

FileOrURLNormalizer

Bases: kbc

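Operator that normalizes a file or URL input.

This operator detects the format of the input automatically, based on whether it is a file or a URL, and normalizes it. It supports PDF, HTML/XML, and TXT/MD files as well as web page URLs. A PDF at a URL is first downloaded to local storage. Local image files (.png, .jpg, .jpeg, .webp, .gif) are also given the 'pdf' type.

Parameters:

  • intermediate_dir (str, default: 'intermediate' ) –

    Directory for intermediate files. Defaults to 'intermediate'.

  • **kwargs (dict, default: {} ) –

    Additional optional arguments passed to the parent class.

Returns:

  • dict

    The normalized data, containing the following fields:

  • _type

    File type ('pdf', 'html', 'text', 'invalid', 'unsupported')

  • _raw_path

    Local file path, if any

  • _url

    URL, if the source is a web page

  • _output_path

    Expected Markdown output path

  • _error

    Error message, if any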

Examples:

from lazyllm.tools.data import kbc

normalizer = kbc.FileOrURLNormalizer(intermediate_dir='./temp')

# For file input
data = {'source': '/path/to/document.pdf'}
result = normalizer(data)
# Returns: {'source': '/path/to/document.pdf', '_type': 'pdf', '_raw_path': '/path/to/document.pdf', '_output_path': './temp/document.md'}

# For URL input
data = {'source': 'https://example.com/page.html'}
result = normalizer(data)
# Returns: {'source': 'https://example.com/page.html', '_type': 'html', '_url': 'https://example.com/page.html', '_output_path': './temp/url_xxx.md'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py
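class FileOrURLNormalizer(kbc):
    """Operator that normalizes a file or URL input.

This operator detects the format of the input automatically, based on whether it is a file or a URL, and normalizes it.
It supports PDF, HTML/XML, and TXT/MD files as well as web page URLs. A PDF at a URL is first downloaded to local storage. Local image files (.png, .jpg, .jpeg, .webp, .gif) are also given the 'pdf' type.

Args:
    intermediate_dir (str): Directory for intermediate files. Defaults to 'intermediate'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: The normalized data, containing the following fields:
    _type: File type ('pdf', 'html', 'text', 'invalid', 'unsupported')
    _raw_path: Local file path, if any
    _url: URL, if the source is a web page
    _output_path: Expected Markdown output path
    _error: Error message, if any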


Examples:
    ```python
    from lazyllm.tools.data import kbc

    normalizer = kbc.FileOrURLNormalizer(intermediate_dir='./temp')

    # For file input
    data = {'source': '/path/to/document.pdf'}
    result = normalizer(data)
    # Returns: {'source': '/path/to/document.pdf', '_type': 'pdf', '_raw_path': '/path/to/document.pdf', '_output_path': './temp/document.md'}

    # For URL input
    data = {'source': 'https://example.com/page.html'}
    result = normalizer(data)
    # Returns: {'source': 'https://example.com/page.html', '_type': 'html', '_url': 'https://example.com/page.html', '_output_path': './temp/url_xxx.md'}
    ```
    """
    def __init__(self, intermediate_dir: str = 'intermediate', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.intermediate_dir = intermediate_dir
        os.makedirs(self.intermediate_dir, exist_ok=True)

    def forward(
        self,
        data: dict,
        input_key: str = 'source',
        **kwargs,
    ) -> dict:
        src = data.get(input_key, '')
        if not src:
            return {**data, '_type': 'invalid', '_error': 'Empty source'}

        result = data.copy()

        if _is_url(src):
            if _is_pdf_url(src):
                pdf_path = os.path.join(
                    self.intermediate_dir,
                    f'crawled_{id(data)}.pdf',
                )
                downloaded_path = _download_pdf(src, pdf_path)

                if downloaded_path:
                    result['_type'] = 'pdf'
                    result['_raw_path'] = downloaded_path
                else:
                    result['_type'] = 'invalid'
                    result['_error'] = 'Failed to download PDF from URL'
            else:
                result['_type'] = 'html'
                result['_url'] = src

        else:
            if not os.path.exists(src):
                result['_type'] = 'invalid'
                result['_error'] = f'File not found: {src}'
            else:
                ext = Path(src).suffix.lower()

                if ext in [
                    '.pdf',
                    '.png',
                    '.jpg',
                    '.jpeg',
                    '.webp',
                    '.gif',
                ]:
                    result['_type'] = 'pdf'
                    result['_raw_path'] = src

                elif ext in ['.html', '.xml']:
                    result['_type'] = 'html'
                    result['_raw_path'] = src

                elif ext in ['.txt', '.md']:
                    result['_type'] = 'text'
                    result['_raw_path'] = src

                else:
                    result['_type'] = 'unsupported'
                    result['_error'] = f'Unsupported file type: {ext}'

        if '_raw_path' in result:
            name = Path(result['_raw_path']).stem
            result['_output_path'] = os.path.join(
                self.intermediate_dir,
                f'{name}.md',
            )

        elif '_url' in result:
            result['_output_path'] = os.path.join(
                self.intermediate_dir,
                f'url_{id(data)}.md',
            )

        return result
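
The type decision in `forward` can be condensed into a small standalone function. This sketch leaves out the download step and the file-existence check, and it makes one assumption of its own: a URL counts as a PDF when its path ends in `.pdf`. The real `_is_url` and `_is_pdf_url` helpers are not shown on this page and may behave differently; `classify_source` is a hypothetical name.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extension groups mirroring the branches in the listing above.
PDF_LIKE = {'.pdf', '.png', '.jpg', '.jpeg', '.webp', '.gif'}
HTML_LIKE = {'.html', '.xml'}
TEXT_LIKE = {'.txt', '.md'}


def classify_source(src):
    """Return the '_type' label for a source string, ignoring whether it exists."""
    if not src:
        return 'invalid'
    parsed = urlparse(src)
    if parsed.scheme in ('http', 'https'):
        # Assumption: a URL path ending in .pdf means a PDF document.
        return 'pdf' if parsed.path.lower().endswith('.pdf') else 'html'
    ext = Path(src).suffix.lower()
    if ext in PDF_LIKE:
        return 'pdf'
    if ext in HTML_LIKE:
        return 'html'
    if ext in TEXT_LIKE:
        return 'text'
    return 'unsupported'


for source in ['https://example.com/paper.pdf', 'https://example.com/page.html',
               '/path/to/scan.JPG', 'notes.md', 'data.csv', '']:
    print(repr(source), '->', classify_source(source))
```

Images share the `'pdf'` label because both are later routed to the same converter, so the label describes the processing path rather than the literal file format.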

HTMLToMarkdownConverter

Bases: kbc

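Operator that converts HTML to Markdown.

This operator uses the trafilatura library to extract content from HTML or XML files and convert it to Markdown. It supports local HTML files and web URLs, and handles page metadata automatically.

Parameters:

  • **kwargs (dict, default: {} ) –

    Additional optional arguments passed to the parent class.

Returns:

  • dict

    The converted data, containing the following field:

  • _markdown_path

    Path of the generated Markdown file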

Examples:

from lazyllm.tools.data import kbc

converter = kbc.HTMLToMarkdownConverter()

# After normalization
data = {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md'}
result = converter(data)
# Returns: {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py
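class HTMLToMarkdownConverter(kbc):
    """Operator that converts HTML to Markdown.

This operator uses the trafilatura library to extract content from HTML or XML files and convert it to Markdown.
It supports local HTML files and web URLs, and handles page metadata automatically.

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: The converted data, containing the following field:
    _markdown_path: Path of the generated Markdown file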


Examples:
    ```python
    from lazyllm.tools.data import kbc

    converter = kbc.HTMLToMarkdownConverter()

    # After normalization
    data = {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md'}
    result = converter(data)
    # Returns: {'_type': 'html', '_url': 'https://example.com/article', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(self, data: dict, **kwargs) -> dict:
        if data.get('_type', '') != 'html':
            return data

        url = data.get('_url')
        raw_path = data.get('_raw_path')
        output_path = data.get('_output_path', '')

        try:
            if url:
                downloaded = trafilatura.fetch_url(url)
                if not downloaded:
                    error_msg = (
                        'fail to fetch this url. '
                        'Please check your Internet Connection or URL correctness'
                    )
                    with open(output_path, 'w', encoding='utf-8') as f:
                        f.write(error_msg)
                    return {**data, '_markdown_path': output_path}

            elif raw_path:
                with open(raw_path, 'r', encoding='utf-8') as f:
                    downloaded = f.read()
            else:
                return {**data, '_markdown_path': ''}

            result = trafilatura.extract(
                downloaded,
                output_format='markdown',
                with_metadata=True,
            )

            if result:
                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(result)

                LOG.info(f'Extracted content written to {output_path}')
                return {**data, '_markdown_path': output_path}

            return {**data, '_markdown_path': ''}

        except Exception as e:
            LOG.error(f'Error extracting HTML/XML: {e}')
            return {**data, '_markdown_path': ''}

PDFToMarkdownConverterAPI

Bases: kbc

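Operator that converts PDF to Markdown through an API.

This operator uses the MinerU service to convert PDF files, including scans and images, to Markdown. It parses PDFs by calling MinerU through its API, with a configurable backend engine and upload mode.

Parameters:

  • mineru_url (str, default: None ) –

    URL of the MinerU service. If it is left as None, the operator logs an error and returns an empty '_markdown_path'.

  • mineru_backend (str, default: 'vlm-vllm-async-engine' ) –

    MinerU backend engine type. Defaults to 'vlm-vllm-async-engine'.

  • upload_mode (bool, default: True ) –

    Whether to use upload mode. Defaults to True.

  • **kwargs (dict, default: {} ) –

    Additional optional arguments passed to the parent class.

Returns:

  • dict

    The converted data, containing the following field:

  • _markdown_path

    Path of the generated Markdown file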

Examples:

from lazyllm.tools.data import kbc

converter = kbc.PDFToMarkdownConverterAPI(
    mineru_url='your_mineru_url',
    mineru_backend='vlm-vllm-async-engine',
    upload_mode=True
)

# After normalization
data = {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md'}
result = converter(data)
# Returns: {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/file_or_url_to_markdown_converter_api.py
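class PDFToMarkdownConverterAPI(kbc):
    """Operator that converts PDF to Markdown through an API.

This operator uses the MinerU service to convert PDF files, including scans and images, to Markdown.
It parses PDFs by calling MinerU through its API, with a configurable backend engine and upload mode.

Args:
    mineru_url (str): URL of the MinerU service. If it is left as None, the operator logs an error and returns an empty '_markdown_path'.
    mineru_backend (str): MinerU backend engine type. Defaults to 'vlm-vllm-async-engine'.
    upload_mode (bool): Whether to use upload mode. Defaults to True.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: The converted data, containing the following field:
    _markdown_path: Path of the generated Markdown file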


Examples:
    ```python
    from lazyllm.tools.data import kbc

    converter = kbc.PDFToMarkdownConverterAPI(
        mineru_url='your_mineru_url',
        mineru_backend='vlm-vllm-async-engine',
        upload_mode=True
    )

    # After normalization
    data = {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md'}
    result = converter(data)
    # Returns: {'_type': 'pdf', '_raw_path': '/path/to/doc.pdf', '_output_path': './temp/output.md', '_markdown_path': './temp/output.md'}
    ```
    """
    def __init__(
        self,
        mineru_url: str = None,
        mineru_backend: str = 'vlm-vllm-async-engine',
        upload_mode: bool = True,
        **kwargs,
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.mineru_url = mineru_url
        self.mineru_backend = mineru_backend
        self.upload_mode = upload_mode

    def forward(self, data: dict, **kwargs) -> dict:
        if data.get('_type', '') != 'pdf':
            return data

        if self.mineru_url is None:
            LOG.error('mineru_url is required for PDF processing')
            return {**data, '_markdown_path': ''}

        try:
            from lazyllm.tools.rag import MineruPDFReader
        except ImportError:
            LOG.error('MineruPDFReader not available')
            return {**data, '_markdown_path': ''}

        raw_path = data.get('_raw_path')
        output_path = data.get('_output_path', '')

        if not raw_path:
            return {**data, '_markdown_path': ''}

        try:
            reader = MineruPDFReader(
                url=self.mineru_url,
                backend=self.mineru_backend,
                upload_mode=self.upload_mode,
                split_doc=False,
            )

            docs = reader(file=raw_path, use_cache=False)

            if not docs:
                LOG.warning(f'MinerU returned no documents for: {raw_path}')
                return {**data, '_markdown_path': ''}

            md_content = '\n'.join(
                doc.text for doc in docs if doc.text
            )

            if not md_content.strip():
                LOG.warning(
                    f'MinerU returned empty content for: {raw_path}',
                )
                return {**data, '_markdown_path': ''}

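            # Create the parent directory only when output_path has one;
            # os.makedirs('') raises FileNotFoundError for a bare filename.
            output_dir = os.path.dirname(output_path)
            if output_dir:
                os.makedirs(output_dir, exist_ok=True)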

            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(md_content)

            LOG.info(f'MinerU parsed: {raw_path} -> {output_path}')
            return {**data, '_markdown_path': output_path}

        except Exception as e:
            LOG.error(f'MinerU API failed for {raw_path}: {e}')
            return {**data, '_markdown_path': ''}
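
Both converters return a record unchanged when its `_type` does not match, which is what lets them be applied one after another to a mixed batch. The sketch below illustrates that pass-through pattern with stand-in functions; `html_stage`, `pdf_stage`, and `run_stages` are hypothetical names, and the stand-ins only copy `_output_path` instead of performing any real conversion.

```python
def html_stage(record):
    """Stand-in for an HTML converter: only touches records typed 'html'."""
    if record.get('_type') != 'html':
        return record
    return {**record, '_markdown_path': record.get('_output_path', '')}


def pdf_stage(record):
    """Stand-in for a PDF converter: only touches records typed 'pdf'."""
    if record.get('_type') != 'pdf':
        return record
    return {**record, '_markdown_path': record.get('_output_path', '')}


def run_stages(records, stages):
    """Apply every stage to every record, in order."""
    for stage in stages:
        records = [stage(r) for r in records]
    return records


batch = [
    {'_type': 'html', '_output_path': './temp/page.md'},
    {'_type': 'pdf', '_output_path': './temp/doc.md'},
    {'_type': 'text', '_raw_path': './notes.txt'},
]
converted = run_stages(batch, [html_stage, pdf_stage])
for record in converted:
    print(record.get('_type'), record.get('_markdown_path'))
```

Because each stage filters on `_type` itself, the order of the stages does not matter and no routing logic is needed in the caller.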

lazyllm.tools.data.operators.knowledge_cleaning.kbc_chunk_generator_batch

KBCChunkText

Bases: kbc

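Operator that splits text into chunks.

This operator splits long text into smaller chunks and supports several chunking strategies:

  • token: chunk by token count
  • sentence: chunk at sentence boundaries
  • semantic: chunk by semantic similarity
  • recursive: recursive chunking

Parameters:

  • chunk_size (int, default: 512 ) –

    Maximum size of each chunk. Defaults to 512.

  • chunk_overlap (int, default: 50 ) –

    Overlap between adjacent chunks. Defaults to 50.

  • split_method (str, default: 'token' ) –

    Chunking method, one of 'token', 'sentence', 'semantic', 'recursive'. Defaults to 'token'.

  • tokenizer_name (str, default: 'bert-base-uncased' ) –

    Name of the tokenizer to use. Defaults to 'bert-base-uncased'.

  • **kwargs (dict, default: {} ) –

    Additional optional arguments passed to the parent class.

Returns:

  • dict

    Data containing the chunking result:

  • _chunks

    List of chunk texts

  • _chunk_error

    Chunking error message, if any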

Examples:

from lazyllm.tools.data import kbc

chunker = kbc.KBCChunkText(chunk_size=512, chunk_overlap=50, split_method='token')

data = {'_text_content': 'Long text content that needs to be chunked...'}
result = chunker(data)
# Returns: {'_text_content': 'Long text content...', '_chunks': ['chunk1', 'chunk2', ...]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py
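class KBCChunkText(kbc):
    """Operator that splits text into chunks.

This operator splits long text into smaller chunks and supports several chunking strategies:
- token: chunk by token count
- sentence: chunk at sentence boundaries
- semantic: chunk by semantic similarity
- recursive: recursive chunking

Args:
    chunk_size (int): Maximum size of each chunk. Defaults to 512.
    chunk_overlap (int): Overlap between adjacent chunks. Defaults to 50.
    split_method (str): Chunking method, one of 'token', 'sentence', 'semantic', 'recursive'. Defaults to 'token'.
    tokenizer_name (str): Name of the tokenizer to use. Defaults to 'bert-base-uncased'.
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing the chunking result:
    _chunks: List of chunk texts
    _chunk_error: Chunking error message, if any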


Examples:
    ```python
    from lazyllm.tools.data import kbc

    chunker = kbc.KBCChunkText(chunk_size=512, chunk_overlap=50, split_method='token')

    data = {'_text_content': 'Long text content that needs to be chunked...'}
    result = chunker(data)
    # Returns: {'_text_content': 'Long text content...', '_chunks': ['chunk1', 'chunk2', ...]}
    ```
    """
    def __init__(
        self,
        chunk_size: int = 512,
        chunk_overlap: int = 50,
        split_method: str = 'token',
        tokenizer_name: str = 'bert-base-uncased',
        **kwargs,
    ):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.split_method = split_method
        self.tokenizer_name = tokenizer_name
        self._chunker = None
        self._tokenizer = None

    def _ensure_initialized(self):
        if self._tokenizer is None:
            self._tokenizer = transformers.AutoTokenizer.from_pretrained(self.tokenizer_name)
            self._chunker = self._initialize_chunker()

    def _initialize_chunker(self):
        if self.split_method == 'token':
            return chonkie.TokenChunker(
                tokenizer=self._tokenizer,
                chunk_size=self.chunk_size,
                chunk_overlap=self.chunk_overlap,
            )

        if self.split_method == 'sentence':
            return chonkie.SentenceChunker(
                chunk_size=self.chunk_size,
                chunk_overlap=self.chunk_overlap,
            )

        if self.split_method == 'semantic':
            return chonkie.SemanticChunker(
                chunk_size=self.chunk_size,
            )

        if self.split_method == 'recursive':
            return chonkie.RecursiveChunker(
                chunk_size=self.chunk_size,
                chunk_overlap=self.chunk_overlap,
            )

        raise ValueError(f'Unsupported split method: {self.split_method}')

    def forward(
        self,
        data: dict,
        **kwargs,
    ) -> dict:
        text = data.get('_text_content', '')
        if not text:
            return {**data, '_chunks': []}

        self._ensure_initialized()

        try:
            tokens = self._tokenizer.encode(text)
            total_tokens = len(tokens)
            max_tokens = self._tokenizer.model_max_length

            if total_tokens <= max_tokens:
                chunks = self._chunker(text)
            else:
                x = (total_tokens + max_tokens - 1) // max_tokens
                words = text.split()
                words_per_chunk = (len(words) + x - 1) // x

                chunks = []
                for j in range(0, len(words), words_per_chunk):
                    chunk_text = ' '.join(words[j:j + words_per_chunk])
                    chunks.extend(self._chunker(chunk_text))

            chunk_texts = [chunk.text for chunk in chunks]
            LOG.info(f'Split text into {len(chunks)} chunks.')
            return {**data, '_chunks': chunk_texts}

        except Exception as e:
            LOG.error(f'Error chunking text: {e}')
            return {**data, '_chunks': [], '_chunk_error': str(e)}
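
When the text exceeds the tokenizer's `model_max_length`, `forward` first cuts it into word-based segments before chunking each one. The arithmetic can be isolated as below; this is a sketch with the token count passed in as a plain number rather than computed by a tokenizer, and `presplit_by_words` is a hypothetical name.

```python
def presplit_by_words(text, total_tokens, max_tokens):
    """Cut text into the fewest word groups that would each fit in
    max_tokens if tokens were spread evenly across words."""
    if total_tokens <= max_tokens:
        return [text]
    num_segments = (total_tokens + max_tokens - 1) // max_tokens  # ceiling division
    words = text.split()
    words_per_segment = max(1, (len(words) + num_segments - 1) // num_segments)
    return [
        ' '.join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


text = ' '.join(f'w{i}' for i in range(1000))
segments = presplit_by_words(text, total_tokens=1300, max_tokens=512)
print(len(segments), [len(s.split()) for s in segments])  # 3 [334, 334, 332]
```

Here 1300 tokens over a 512-token limit gives three segments, and 1000 words divided three ways gives 334 words per segment. Splitting on whitespace only approximates token counts, so a segment can still exceed the limit when tokens are unevenly distributed, and text without spaces collapses into a single segment.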

KBCLoadText

Bases: kbc

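Operator that loads the content of a text file.

This operator loads text content from the given path and supports several file formats:

  • .txt, .md, .xml: the text is read directly
  • .json, .jsonl: content is extracted from the first of the fields 'text', 'content', 'body' that is present, with list entries joined by newlines

Parameters:

  • **kwargs (dict, default: {} ) –

    Additional optional arguments passed to the parent class.

Returns:

  • dict

    Data containing the load result:

  • _text_content

    Loaded text content

  • _load_error

    Load error message, if any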

Examples:

from lazyllm.tools.data import kbc

loader = kbc.KBCLoadText()

# Load text file
data = {'text_path': '/path/to/document.txt'}
result = loader(data)
# Returns: {'text_path': '/path/to/document.txt', '_text_content': 'file content...'}

# Load JSON file
data = {'text_path': '/path/to/data.json'}
result = loader(data)
# Returns: {'text_path': '/path/to/data.json', '_text_content': 'extracted text...'}
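
The field lookup for JSON and JSONL input can be sketched on in-memory data. This is an illustration under stated assumptions rather than the library code: `extract_text` is a hypothetical name, blank JSONL lines are skipped, and entries missing the chosen field contribute an empty string.

```python
import json

TEXT_FIELDS = ('text', 'content', 'body')  # tried in this order


def extract_text(file_data):
    """Return text from the first matching field, or None if none matches."""
    for field in TEXT_FIELDS:
        if (isinstance(file_data, list) and file_data
                and isinstance(file_data[0], dict) and field in file_data[0]):
            # A list of records: join the field from every entry.
            return '\n'.join(str(item.get(field, '')) for item in file_data)
        if isinstance(file_data, dict) and field in file_data:
            return file_data[field]
    return None


jsonl = '{"content": "first"}\n\n{"content": "second"}\n'
rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
print(repr(extract_text(rows)))              # 'first\nsecond'
print(repr(extract_text({'body': 'solo'})))  # 'solo'
print(extract_text({'title': 'no text'}))    # None
```

For a list, only the first entry decides which field is used, so a file whose records use different field names is read through the first record's field.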
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py
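class KBCLoadText(kbc):
    """Operator that loads the content of a text file.

This operator loads text content from the given path and supports several file formats:
- .txt, .md, .xml: the text is read directly
- .json, .jsonl: content is extracted from the first of the fields 'text', 'content', 'body' that is present, with list entries joined by newlines

Args:
    **kwargs (dict): Additional optional arguments passed to the parent class.

Returns:
    dict: Data containing the load result:
    _text_content: Loaded text content
    _load_error: Load error message, if any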


Examples:
    ```python
    from lazyllm.tools.data import kbc

    loader = kbc.KBCLoadText()

    # Load text file
    data = {'text_path': '/path/to/document.txt'}
    result = loader(data)
    # Returns: {'text_path': '/path/to/document.txt', '_text_content': 'file content...'}

    # Load JSON file
    data = {'text_path': '/path/to/data.json'}
    result = loader(data)
    # Returns: {'text_path': '/path/to/data.json', '_text_content': 'extracted text...'}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(
        self,
        data: dict,
        input_key: str = 'text_path',
        **kwargs,
    ) -> dict:
        text_path = data.get(input_key, '')
        if not text_path:
            return {**data, '_text_content': '', '_load_error': 'Empty text path'}

        if not os.path.exists(text_path):
            LOG.error(f'Input file not found: {text_path}')
            return {
                **data,
                '_text_content': '',
                '_load_error': f'File not found: {text_path}',
            }

        try:
            if text_path.endswith(('.txt', '.md', '.xml')):
                with open(text_path, 'r', encoding='utf-8') as f:
                    text = f.read()
                return {**data, '_text_content': text}

            if text_path.endswith(('.json', '.jsonl')):
                with open(text_path, 'r', encoding='utf-8') as f:
                    if text_path.endswith('.json'):
                        file_data = json.load(f)
                    else:
                        file_data = [json.loads(line) for line in f]

                text_fields = ['text', 'content', 'body']
                for field in text_fields:
                    if isinstance(file_data, list) and file_data and field in file_data[0]:
                        text = '\n'.join(item[field] for item in file_data)
                        return {**data, '_text_content': text}
                    if isinstance(file_data, dict) and field in file_data:
                        text = file_data[field]
                        return {**data, '_text_content': text}

                LOG.error(f'No text field found in {text_path}')
                return {
                    **data,
                    '_text_content': '',
                    '_load_error': 'No text field found',
                }

            LOG.error(f'Unsupported file format for {text_path}')
            return {
                **data,
                '_text_content': '',
                '_load_error': 'Unsupported format',
            }

        except Exception as e:
            LOG.error(f'Error loading {text_path}: {e}')
            return {
                **data,
                '_text_content': '',
                '_load_error': str(e),
            }
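
The field lookup used for .json and .jsonl files can be tried in isolation. The following is a minimal standalone sketch that mirrors the loop in forward above; it does not import lazyllm, and the names TEXT_FIELDS and extract_text are hypothetical helpers introduced only for illustration.

```python
import json

# Same lookup order as the source above: the first matching field wins.
TEXT_FIELDS = ("text", "content", "body")


def extract_text(file_data):
    """Return the text held in parsed JSON data, or None if no known field exists."""
    for field in TEXT_FIELDS:
        # A list of records: the first record decides which field is used.
        if isinstance(file_data, list) and file_data and field in file_data[0]:
            return "\n".join(item[field] for item in file_data)
        # A single object: return the field value as-is.
        if isinstance(file_data, dict) and field in file_data:
            return file_data[field]
    return None


# A .jsonl file is parsed line by line into a list of records.
jsonl_lines = ['{"content": "first"}', '{"content": "second"}']
joined = extract_text([json.loads(line) for line in jsonl_lines])

# A .json file holding a single object yields that object's field directly.
single = extract_text(json.loads('{"body": "only one"}'))

# No text, content, or body field: the operator reports 'No text field found'.
missing = extract_text({"title": "no text field"})
```

Because only the first record is inspected, a list whose later records lack the chosen field would raise a KeyError during the join; in the operator this is caught by the surrounding except block and reported through _load_error.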

KBCSaveChunks

Bases: kbc

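Operator that saves text chunks to disk.

It writes the chunks to a JSON file, one JSON object per chunk (each stored under the raw_chunk key). When an output directory is given, the relative path structure of the original file is preserved beneath it.

Parameters:

  • output_dir (str, default: None ) –

    Output directory path. Defaults to None, in which case the file is saved to an 'extract' subdirectory next to the original file.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input record extended with the save result; the intermediate fields _text_content, _load_error, _chunks, and _chunk_error are removed:

  • chunk_path

    Path of the saved JSON file (an empty string if there are no chunks or saving fails).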

Examples:

from lazyllm.tools.data import kbc

saver = kbc.KBCSaveChunks(output_dir='./output')

data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1', 'chunk2']}
result = saver(data)
# Returns: {'text_path': '/path/to/doc.txt', 'chunk_path': './output/path/to/doc_chunk.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator_batch.py
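class KBCSaveChunks(kbc):
    """Operator that saves text chunks to disk.

It writes the chunks to a JSON file, one JSON object per chunk (each stored under the raw_chunk key).
When an output directory is given, the relative path structure of the original file is preserved beneath it.

Args:
    output_dir (str, optional): Output directory path. Defaults to None, in which case the file is saved to an 'extract' subdirectory next to the original file.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input record extended with the save result; the intermediate fields _text_content, _load_error, _chunks, and _chunk_error are removed:
    chunk_path: Path of the saved JSON file (an empty string if there are no chunks or saving fails).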


Examples:
    ```python
    from lazyllm.tools.data import kbc

    saver = kbc.KBCSaveChunks(output_dir='./output')

    data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1', 'chunk2']}
    result = saver(data)
    # Returns: {'text_path': '/path/to/doc.txt', 'chunk_path': './output/path/to/doc_chunk.json'}
    ```
    """
    def __init__(self, output_dir: Optional[str] = None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_dir = output_dir

    def forward(
        self,
        data: dict,
        input_key: str = 'text_path',
        output_key: str = 'chunk_path',
        **kwargs,
    ) -> dict:
        chunks = data.get('_chunks', [])
        text_path = data.get(input_key, '')

        result = data.copy()

        if not chunks:
            LOG.warning(f'No chunks to save for {text_path}')
            result[output_key] = ''
            for key in ['_text_content', '_load_error', '_chunks', '_chunk_error']:
                result.pop(key, None)
            return result

        try:
            # Determine output directory
            if self.output_dir:
                # Use specified output directory, preserving relative structure
                abs_text_path = os.path.abspath(text_path)
                abs_cwd = os.path.abspath(os.getcwd())

                if abs_text_path.startswith(abs_cwd):
                    rel_path = os.path.relpath(os.path.dirname(abs_text_path), abs_cwd)
                else:
                    rel_path = os.path.dirname(abs_text_path).lstrip('/')

                output_dir = os.path.join(self.output_dir, rel_path)
            else:
                # Default: save to 'extract' subdirectory
                output_dir = os.path.join(os.path.dirname(text_path), 'extract')

            os.makedirs(output_dir, exist_ok=True)

            file_name = os.path.splitext(os.path.basename(text_path))[0] + '_chunk.json'
            output_path = os.path.join(output_dir, file_name)

            json_chunks = [{'raw_chunk': chunk} for chunk in chunks]

            with open(output_path, 'w', encoding='utf-8') as f:
                json.dump(json_chunks, f, ensure_ascii=False, indent=4)

            LOG.info(f'Saved {len(chunks)} chunks to {output_path}')

            result[output_key] = output_path
            for key in ['_text_content', '_load_error', '_chunks', '_chunk_error']:
                result.pop(key, None)

            return result

        except Exception as e:
            LOG.error(f'Error saving chunks: {e}')
            result[output_key] = ''
            for key in ['_text_content', '_load_error', '_chunks', '_chunk_error']:
                result.pop(key, None)
            return result
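
How the output location is derived is easier to see with concrete paths. The sketch below re-implements only the path computation from forward above as a standalone function. It is an approximation under stated assumptions: chunk_output_path is a hypothetical name, posixpath is used instead of os.path, and the working directory is passed in explicitly (the operator calls os.getcwd()) so that the results are reproducible.

```python
import posixpath


def chunk_output_path(text_path, output_dir=None, cwd="/"):
    """Compute where the '<name>_chunk.json' file for text_path would be written."""
    file_name = posixpath.splitext(posixpath.basename(text_path))[0] + "_chunk.json"
    if output_dir:
        # Resolve text_path against the (assumed) working directory.
        abs_text_path = posixpath.normpath(posixpath.join(cwd, text_path))
        abs_dir = posixpath.dirname(abs_text_path)
        if abs_text_path.startswith(cwd):
            # Inside the working directory: keep the path relative to it.
            rel_path = posixpath.relpath(abs_dir, cwd)
        else:
            # Outside: reuse the absolute directory without its leading slash.
            rel_path = abs_dir.lstrip("/")
        target_dir = posixpath.join(output_dir, rel_path)
    else:
        # Default: an 'extract' subdirectory next to the original file.
        target_dir = posixpath.join(posixpath.dirname(text_path), "extract")
    return posixpath.join(target_dir, file_name)


outside = chunk_output_path("/path/to/doc.txt", "./output", cwd="/work")
inside = chunk_output_path("/work/docs/a.md", "./output", cwd="/work")
default = chunk_output_path("/path/to/doc.txt")
```

With these inputs, outside is './output/path/to/doc_chunk.json', inside is './output/docs/a_chunk.json', and default is '/path/to/extract/doc_chunk.json'.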

lazyllm.tools.data.operators.knowledge_cleaning.kbc_chunk_generator

KBCExpandChunks

Bases: kbc

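Operator that expands chunked text into independent records.

It turns one record holding several text chunks into several records, each holding a single chunk. This is useful when each chunk should be treated as a separate sample in later processing.

Parameters:

  • output_key (str, default: 'raw_chunk' ) –

    Name of the output field that stores the chunk text. Defaults to 'raw_chunk'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The expanded list of independent records, one chunk per record (an empty list if the input has no chunks).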

Examples:

from lazyllm.tools.data import kbc

expander = kbc.KBCExpandChunks(output_key='raw_chunk')

data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1 content', 'chunk2 content', 'chunk3 content']}
result = expander(data)
# Returns: [
#   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk1 content'},
#   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk2 content'},
#   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk3 content'}
# ]
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_chunk_generator.py
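class KBCExpandChunks(kbc):
    """Operator that expands chunked text into independent records.

It turns one record holding several text chunks into several records, each holding a single chunk.
This is useful when each chunk should be treated as a separate sample in later processing.

Args:
    output_key (str): Name of the output field that stores the chunk text. Defaults to 'raw_chunk'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The expanded list of independent records, one chunk per record (an empty list if the input has no chunks).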


Examples:
    ```python
    from lazyllm.tools.data import kbc

    expander = kbc.KBCExpandChunks(output_key='raw_chunk')

    data = {'text_path': '/path/to/doc.txt', '_chunks': ['chunk1 content', 'chunk2 content', 'chunk3 content']}
    result = expander(data)
    # Returns: [
    #   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk1 content'},
    #   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk2 content'},
    #   {'text_path': '/path/to/doc.txt', 'raw_chunk': 'chunk3 content'}
    # ]
    ```
    """
    def __init__(self, output_key: str = 'raw_chunk', **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.output_key = output_key

    def forward(
        self,
        data: dict,
        **kwargs,
    ) -> List[dict]:
        chunks = data.get('_chunks', [])

        if not chunks:
            return []

        new_records = []
        for chunk_text in chunks:
            new_row = data.copy()
            new_row[self.output_key] = chunk_text
            new_row.pop('_text_content', None)
            new_row.pop('_chunks', None)
            new_records.append(new_row)

        return new_records
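
The one-to-many expansion can be reproduced without the framework. This is a minimal standalone sketch of the loop in forward above; expand_chunks is a hypothetical function name, and the record contents are made up for illustration.

```python
def expand_chunks(record, output_key="raw_chunk"):
    """Return one copy of record per chunk, with the chunk stored under output_key."""
    rows = []
    for chunk in record.get("_chunks", []):
        row = dict(record)  # shallow copy, so the input record is left untouched
        row[output_key] = chunk
        # Drop the bulky intermediate fields from each expanded row.
        row.pop("_text_content", None)
        row.pop("_chunks", None)
        rows.append(row)
    return rows


record = {"text_path": "doc.txt", "_text_content": "a b", "_chunks": ["a", "b"]}
rows = expand_chunks(record)
empty = expand_chunks({"text_path": "doc.txt"})
```

Each row keeps the remaining fields of the input (here text_path), so downstream operators can still trace a chunk back to its source file.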

lazyllm.tools.data.operators.knowledge_cleaning.kbc_multihop_qa_generator_batch

KBCExtractInfoPairs

Bases: kbc

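Information-pair extraction operator.

It extracts information pairs from preprocessed text for multi-hop QA generation. Sentences are split on a language-specific delimiter ('.' for English, '。' otherwise), and each window of three consecutive sentences yields a premise-intermediate-conclusion triple together with related context sentences.

Parameters:

  • lang (str, default: 'en' ) –

    Language: 'en' for English, 'zh' for Chinese. Defaults to 'en'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input record extended with the information pairs:

  • _info_pairs

    List of information pairs, each containing premise, intermediate, conclusion, related_contexts, and original_data.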

Examples:

from lazyllm.tools.data import kbc

extractor = kbc.KBCExtractInfoPairs(lang='en')

data = {'_processed_chunks': [{'text': 'First sentence. Second sentence. Third sentence.', 'original_data': {}}]}
result = extractor(data)
# Returns: {'_processed_chunks': [...], '_info_pairs': [{'premise': 'First sentence', 'intermediate': 'Second sentence', 'conclusion': 'Third sentence', 'related_contexts': ['Third sentence'], 'original_data': {}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
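class KBCExtractInfoPairs(kbc):
    """Information-pair extraction operator.

It extracts information pairs from preprocessed text for multi-hop QA generation.
Sentences are split on a language-specific delimiter ('.' for English, '。' otherwise),
and each window of three consecutive sentences yields a premise-intermediate-conclusion triple together with related context sentences.

Args:
    lang (str): Language: 'en' for English, 'zh' for Chinese. Defaults to 'en'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input record extended with the information pairs:
    _info_pairs: List of information pairs, each containing premise, intermediate, conclusion, related_contexts, and original_data.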


Examples:
    ```python
    from lazyllm.tools.data import kbc

    extractor = kbc.KBCExtractInfoPairs(lang='en')

    data = {'_processed_chunks': [{'text': 'First sentence. Second sentence. Third sentence.', 'original_data': {}}]}
    result = extractor(data)
    # Returns: {'_processed_chunks': [...], '_info_pairs': [{'premise': 'First sentence', 'intermediate': 'Second sentence', 'conclusion': 'Third sentence', 'related_contexts': ['Third sentence'], 'original_data': {}}]}
    ```
    """
    def __init__(self, lang: str = 'en', **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.lang = lang

    def forward(self, data: dict, **kwargs) -> dict:
        processed_chunks = data.get('_processed_chunks', [])
        if not processed_chunks:
            return {**data, '_info_pairs': []}

        all_info_pairs = []

        for chunk in processed_chunks:
            text = chunk.get('text', '')
            original_data = chunk.get('original_data', {})

            if self.lang == 'en':
                sentences = [s.strip() for s in text.split('.') if s.strip()]
            else:
                sentences = [s.strip() for s in text.split('。') if s.strip()]

            for i in range(len(sentences) - 2):
                if len(sentences[i]) > 10 and len(sentences[i + 1]) > 10:
                    info_pair = {
                        'premise': sentences[i],
                        'intermediate': sentences[i + 1],
                        'conclusion': (
                            sentences[i + 2]
                            if i + 2 < len(sentences)
                            else ''
                        ),
                        'related_contexts': [
                            s
                            for j, s in enumerate(sentences)
                            if j not in (i, i + 1) and len(s) > 10
                        ][:2],
                        'original_data': original_data,
                    }
                    all_info_pairs.append(info_pair)

        return {**data, '_info_pairs': all_info_pairs}
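
The sliding three-sentence window in forward above can be traced on a small input. The following standalone sketch copies that logic into a plain function; extract_info_pairs is a hypothetical name, and it handles a single text rather than a list of processed chunks.

```python
def extract_info_pairs(text, lang="en"):
    """Build premise/intermediate/conclusion triples from consecutive sentences."""
    delimiter = "." if lang == "en" else "。"
    sentences = [s.strip() for s in text.split(delimiter) if s.strip()]
    pairs = []
    # Each window needs three sentences, hence the "- 2".
    for i in range(len(sentences) - 2):
        # Very short premise or intermediate sentences are skipped.
        if len(sentences[i]) > 10 and len(sentences[i + 1]) > 10:
            related = [
                s
                for j, s in enumerate(sentences)
                if j not in (i, i + 1) and len(s) > 10
            ][:2]
            pairs.append(
                {
                    "premise": sentences[i],
                    "intermediate": sentences[i + 1],
                    "conclusion": sentences[i + 2],
                    "related_contexts": related,
                }
            )
    return pairs


pairs = extract_info_pairs("First sentence. Second sentence. Third sentence.")
too_few = extract_info_pairs("Only one sentence here. And a second one.")
```

Only the premise and intermediate sentences are excluded from related_contexts, so the conclusion sentence can appear there as well, as it does for the three-sentence input. A text with fewer than three sentences produces no pairs.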

KBCGenerateMultiHopQA

Bases: kbc

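Multi-hop QA generation operator.

It uses an LLM to generate multi-hop question-answer pairs from the extracted information pairs. Multi-hop questions require several reasoning steps to answer, which makes them suitable for training complex QA models.

Parameters:

  • llm

    LLM service instance used to generate the QA pairs. If it is not provided, calling the operator raises a ValueError.

  • lang (str, default: 'en' ) –

    Language: 'en' for English, 'zh' for Chinese. Defaults to 'en'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input record extended with the generated QA results:

  • _qa_results

    List of QA results, each containing response and info_pair. Information pairs whose LLM call fails are logged and skipped.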

Examples:

from lazyllm.tools.data import kbc

# Assuming llm is an LLM service instance
generator = kbc.KBCGenerateMultiHopQA(llm=llm, lang='en')

data = {'_info_pairs': [{'premise': 'A', 'intermediate': 'B', 'conclusion': 'C', 'original_data': {}}]}
result = generator(data)
# Returns: {'_info_pairs': [...], '_qa_results': [{'response': {...}, 'info_pair': {...}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
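class KBCGenerateMultiHopQA(kbc):
    """Multi-hop QA generation operator.

It uses an LLM to generate multi-hop question-answer pairs from the extracted information pairs.
Multi-hop questions require several reasoning steps to answer, which makes them suitable for training complex QA models.

Args:
    llm: LLM service instance used to generate the QA pairs. If it is not provided, calling the operator raises a ValueError.
    lang (str): Language: 'en' for English, 'zh' for Chinese. Defaults to 'en'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input record extended with the generated QA results:
    _qa_results: List of QA results, each containing response and info_pair. Information pairs whose LLM call fails are logged and skipped.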


Examples:
    ```python
    from lazyllm.tools.data import kbc

    # Assuming llm is an LLM service instance
    generator = kbc.KBCGenerateMultiHopQA(llm=llm, lang='en')

    data = {'_info_pairs': [{'premise': 'A', 'intermediate': 'B', 'conclusion': 'C', 'original_data': {}}]}
    result = generator(data)
    # Returns: {'_info_pairs': [...], '_qa_results': [{'response': {...}, 'info_pair': {...}}]}
    ```
    """
    def __init__(self, llm=None, lang: str = 'en', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.prompt_template = MultiHopQABuilderPrompt(lang=lang)

        if llm is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = (
                llm.share()
                .prompt(system_prompt)
                .formatter(JsonFormatter())
            )
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(self, data: dict, **kwargs) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        info_pairs = data.get('_info_pairs', [])
        if not info_pairs:
            return {**data, '_qa_results': []}

        qa_results = []

        for pair in info_pairs:
            # Build context from info pair
            context = (
                f"{pair['premise']}. "
                f"{pair['intermediate']}. "
                f"{pair['conclusion']}"
            )

            # Build prompt for this info pair
            user_prompt = self.prompt_template.build_prompt(context)

            try:
                response = self._llm_serve(user_prompt)

                qa_results.append(
                    {
                        'response': response,
                        'info_pair': pair,
                    }
                )

            except Exception as e:
                LOG.warning(f'Failed to generate QA: {e}')

        return {**data, '_qa_results': qa_results}
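
The per-pair loop and its error handling can be exercised with a stand-in model. The sketch below is a simplification under stated assumptions: generate_qa and fake_llm are hypothetical names, the stub is an ordinary Python callable rather than a LazyLLM service, and the context string is passed to it directly instead of going through MultiHopQABuilderPrompt.build_prompt.

```python
def generate_qa(info_pairs, llm):
    """Call llm once per information pair and skip the pairs whose call fails."""
    results = []
    for pair in info_pairs:
        # Same context layout as the source above.
        context = f"{pair['premise']}. {pair['intermediate']}. {pair['conclusion']}"
        try:
            response = llm(context)
        except Exception as exc:
            # The operator logs a warning here; the pair is dropped from the output.
            print(f"Failed to generate QA: {exc}")
            continue
        results.append({"response": response, "info_pair": pair})
    return results


def fake_llm(prompt):
    """Hypothetical stand-in for an LLM service that returns parsed JSON."""
    if "boom" in prompt:
        raise RuntimeError("simulated failure")
    return {"question": f"What follows from: {prompt}", "answer": "stub"}


good = {"premise": "A", "intermediate": "B", "conclusion": "C"}
bad = {"premise": "boom", "intermediate": "B", "conclusion": "C"}
results = generate_qa([good, bad], fake_llm)
```

Because failed pairs are skipped rather than recorded, the length of the result list can be smaller than the number of input pairs.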

KBCLoadChunkFile

Bases: kbc

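Chunk-file loading operator.

It loads a chunk file in JSON or JSONL format from the given path, such as the chunk files produced by the knowledge-base cleaning pipeline.

Parameters:

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input record extended with the chunk data:

  • _chunks_data

    List of chunk data (empty if the path is invalid, the format is unsupported, or loading fails).

  • _chunk_path

    Path of the chunk file, set only when loading succeeds.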

Examples:

from lazyllm.tools.data import kbc

loader = kbc.KBCLoadChunkFile()

data = {'chunk_path': '/path/to/chunks.json'}
result = loader(data)
# Returns: {'chunk_path': '/path/to/chunks.json', '_chunks_data': [...], '_chunk_path': '/path/to/chunks.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
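class KBCLoadChunkFile(kbc):
    """Chunk-file loading operator.

It loads a chunk file in JSON or JSONL format from the given path,
such as the chunk files produced by the knowledge-base cleaning pipeline.

Args:
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input record extended with the chunk data:
    _chunks_data: List of chunk data (empty if the path is invalid, the format is unsupported, or loading fails).
    _chunk_path: Path of the chunk file, set only when loading succeeds.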


Examples:
    ```python
    from lazyllm.tools.data import kbc

    loader = kbc.KBCLoadChunkFile()

    data = {'chunk_path': '/path/to/chunks.json'}
    result = loader(data)
    # Returns: {'chunk_path': '/path/to/chunks.json', '_chunks_data': [...], '_chunk_path': '/path/to/chunks.json'}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(
        self,
        data: dict,
        input_key: str = 'chunk_path',
        **kwargs,
    ) -> dict:
        import os

        chunk_path = data.get(input_key, '')

        if not chunk_path or not os.path.exists(chunk_path):
            LOG.warning(f'Invalid chunk path: {chunk_path}')
            return {**data, '_chunks_data': []}

        try:
            if str(chunk_path).endswith('.json'):
                with open(chunk_path, 'r', encoding='utf-8') as f:
                    file_data = json.load(f)
            elif str(chunk_path).endswith('.jsonl'):
                with open(chunk_path, 'r', encoding='utf-8') as f:
                    file_data = [json.loads(line) for line in f]
            else:
                LOG.warning(f'Unsupported file format: {chunk_path}')
                return {**data, '_chunks_data': []}

            return {
                **data,
                '_chunks_data': file_data,
                '_chunk_path': chunk_path,
            }

        except Exception as e:
            LOG.error(f'Error loading chunk file {chunk_path}: {e}')
            return {**data, '_chunks_data': []}
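
The difference between the .json and .jsonl branches can be checked against temporary files. This is a standalone sketch of the loading logic in forward above; load_chunk_file is a hypothetical name, the file contents are invented for the example, and the try/except wrapper of the operator is omitted for brevity.

```python
import json
import os
import tempfile


def load_chunk_file(path):
    """Return the parsed records, or an empty list if the file cannot be used."""
    if not path or not os.path.exists(path):
        return []
    if path.endswith(".json"):
        # One JSON document holding the whole list.
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    if path.endswith(".jsonl"):
        # One JSON document per line.
        with open(path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f]
    return []  # unsupported extension


records = [{"raw_chunk": "a"}, {"raw_chunk": "b"}]
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "chunks.json")
    jsonl_path = os.path.join(tmp, "chunks.jsonl")
    txt_path = os.path.join(tmp, "chunks.txt")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write("\n".join(json.dumps(r) for r in records))
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("plain text")

    from_json = load_chunk_file(json_path)
    from_jsonl = load_chunk_file(jsonl_path)
    from_txt = load_chunk_file(txt_path)
    from_missing = load_chunk_file(os.path.join(tmp, "missing.json"))
```

Both formats yield the same list of records, while a missing file or an unsupported extension yields an empty list.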

KBCPreprocessText

Bases: kbc

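Text preprocessing operator.

It preprocesses the loaded chunks by filtering them on length. Only chunks whose stripped text length lies between min_length and max_length (inclusive) are kept, so that overly short or overly long texts are not processed.

Parameters:

  • min_length (int, default: 100 ) –

    Minimum text length. Defaults to 100.

  • max_length (int, default: 200000 ) –

    Maximum text length. Defaults to 200000.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input record extended with the preprocessing result:

  • _processed_chunks

    List of preprocessed chunks, each containing text and original_data.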

Examples:

from lazyllm.tools.data import kbc

processor = kbc.KBCPreprocessText(min_length=50, max_length=10000)

data = {'_chunks_data': [{'cleaned_chunk': 'Short text.'}, {'cleaned_chunk': 'A much longer text that meets the length requirements and will be processed.'}]}
result = processor(data, text_field='cleaned_chunk')
# Returns: {'_chunks_data': [...], '_processed_chunks': [{'text': 'A much longer text...', 'original_data': {...}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
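class KBCPreprocessText(kbc):
    """Text preprocessing operator.

It preprocesses the loaded chunks by filtering them on length.
Only chunks whose stripped text length lies between min_length and max_length (inclusive) are kept, so that overly short or overly long texts are not processed.

Args:
    min_length (int): Minimum text length. Defaults to 100.
    max_length (int): Maximum text length. Defaults to 200000.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input record extended with the preprocessing result:
    _processed_chunks: List of preprocessed chunks, each containing text and original_data.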


Examples:
    ```python
    from lazyllm.tools.data import kbc

    processor = kbc.KBCPreprocessText(min_length=50, max_length=10000)

    data = {'_chunks_data': [{'cleaned_chunk': 'Short text.'}, {'cleaned_chunk': 'A much longer text that meets the length requirements and will be processed.'}]}
    result = processor(data, text_field='cleaned_chunk')
    # Returns: {'_chunks_data': [...], '_processed_chunks': [{'text': 'A much longer text...', 'original_data': {...}}]}
    ```
    """
    def __init__(self, min_length: int = 100, max_length: int = 200000, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.min_length = min_length
        self.max_length = max_length

    def forward(
        self,
        data: dict,
        text_field: str = 'cleaned_chunk',
        **kwargs,
    ) -> dict:
        chunks_data = data.get('_chunks_data', [])
        if not chunks_data:
            return {**data, '_processed_chunks': []}

        processed = []
        for item in chunks_data:
            text = item.get(text_field, '')
            if not isinstance(text, str):
                continue

            text = text.strip()
            if self.min_length <= len(text) <= self.max_length:
                processed.append(
                    {
                        'text': text,
                        'original_data': item,
                    }
                )

        return {**data, '_processed_chunks': processed}
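
The length filter is simple enough to test on its own. The sketch below mirrors the loop in forward above as a plain function; filter_chunks is a hypothetical name and the sample chunks are invented.

```python
def filter_chunks(chunks_data, text_field="cleaned_chunk", min_length=100, max_length=200000):
    """Keep the chunks whose stripped text length lies within [min_length, max_length]."""
    processed = []
    for item in chunks_data:
        text = item.get(text_field, "")
        if not isinstance(text, str):
            continue  # non-string values (e.g. None) are ignored
        text = text.strip()
        if min_length <= len(text) <= max_length:
            processed.append({"text": text, "original_data": item})
    return processed


chunks = [
    {"cleaned_chunk": "too short"},
    {"cleaned_chunk": "  " + "x" * 120 + "  "},  # whitespace is stripped before measuring
    {"cleaned_chunk": None},
    {"other_field": "y" * 150},  # text_field missing, treated as empty text
]
kept = filter_chunks(chunks)
```

Only the second chunk survives: its stripped length is 120, and the kept entry stores the stripped text alongside the untouched original item.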

KBCSaveEnhanced

Bases: kbc

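Operator that saves enhanced data.

It merges the generated QA pairs with the original chunk data and saves the result as an enhanced chunk file. When an output directory is given, the relative path structure of the original file is preserved.

Parameters:

  • output_dir (str, default: None ) –

    Output directory path. Defaults to None (save to the directory of the original file).

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The record containing the save result:

  • enhanced_chunk_path

    Path of the enhanced chunk file.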

Examples:

from lazyllm.tools.data import kbc

saver = kbc.KBCSaveEnhanced(output_dir='./enhanced_output')

data = {'_chunk_path': '/path/to/chunks.json', '_chunks_data': [{'id': 1, 'text': 'chunk1'}], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'Q1', 'answer': 'A1'}}]}
result = saver(data, output_key='enhanced_chunk_path')
# Returns: {'enhanced_chunk_path': './enhanced_output/path/to/chunks_enhanced.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
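class KBCSaveEnhanced(kbc):
    """Operator that saves enhanced data.

It merges the generated QA pairs with the original chunk data and saves the result as an enhanced chunk file.
When an output directory is given, the relative path structure of the original file is preserved.

Args:
    output_dir (str, optional): Output directory path. Defaults to None (save to the directory of the original file).
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The record containing the save result:
    enhanced_chunk_path: Path of the enhanced chunk file.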


Examples:
    ```python
    from lazyllm.tools.data import kbc

    saver = kbc.KBCSaveEnhanced(output_dir='./enhanced_output')

    data = {'_chunk_path': '/path/to/chunks.json', '_chunks_data': [{'id': 1, 'text': 'chunk1'}], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'Q1', 'answer': 'A1'}}]}
    result = saver(data, output_key='enhanced_chunk_path')
    # Returns: {'enhanced_chunk_path': './enhanced_output/path/to/chunks_enhanced.json'}
    ```
    """
    def __init__(self, output_dir: Optional[str] = None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_dir = output_dir

    def forward(self, data: dict, output_key: str = 'enhanced_chunk_path', **kwargs) -> dict:
        chunk_path = data.get('_chunk_path', '')
        result = data.copy()

        if not chunk_path:
            return _clean_enhanced_result(result, output_key)

        try:
            enhanced_data = _build_enhanced_data(
                data.get('_chunks_data', []),
                data.get('_qa_pairs', [])
            )
            output_path = _get_output_path(chunk_path, self.output_dir)
            _save_enhanced_data(enhanced_data, output_path)
            LOG.info(f'Saved enhanced chunks to {output_path}')
            return _clean_enhanced_result(result, output_key, output_path)
        except Exception as e:
            LOG.error(f'Error saving enhanced chunks: {e}')
            return _clean_enhanced_result(result, output_key)

parse_qa_pairs(data)

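Function that parses QA pairs.

It parses the QA responses generated by the LLM and extracts the valid QA pairs. A dict response is kept if it contains a question key, a list response contributes each dict item that contains a question key, and a string response (one that JsonFormatter could not parse) is skipped with a warning. Each parsed pair is merged with the original data of its information pair.

Parameters:

  • data (dict) –

    The record containing the QA results.

Returns:

  • dict ( dict ) –

    The input record extended with the parsed QA pairs:

  • _qa_pairs ( list ) –

    List of parsed QA pairs.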

Examples:

from lazyllm.tools.data.operators.knowledge_cleaning.kbc_multihop_qa_generator_batch import parse_qa_pairs

data = {'_qa_results': [{'response': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}, 'info_pair': {'original_data': {'id': 1}}}]}
result = parse_qa_pairs(data)
# Returns: {'_qa_results': [...], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_multihop_qa_generator_batch.py
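@data_register('data.kbc', rewrite_func='forward', _concurrency_mode='process')
def parse_qa_pairs(data: dict) -> dict:
    """Function that parses QA pairs.

It parses the QA responses generated by the LLM and extracts the valid QA pairs.
Several response formats are handled (dict, list, string), and each parsed pair is merged with the original data of its information pair.

Args:
    data (dict): The record containing the QA results.

Returns:
    dict: The input record extended with the parsed QA pairs:
    _qa_pairs: List of parsed QA pairs.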


Examples:
    ```python
    from lazyllm.tools.data.operators.knowledge_cleaning.kbc_multihop_qa_generator_batch import parse_qa_pairs

    data = {'_qa_results': [{'response': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}, 'info_pair': {'original_data': {'id': 1}}}]}
    result = parse_qa_pairs(data)
    # Returns: {'_qa_results': [...], '_qa_pairs': [{'id': 1, 'qa_pairs': {'question': 'What is AI?', 'answer': 'Artificial Intelligence'}}]}
    ```
    """
    qa_results = data.get('_qa_results', [])
    if not qa_results:
        return {**data, '_qa_pairs': []}

    all_qa_pairs = []

    for qa_result in qa_results:
        response = qa_result.get('response', '')
        info_pair = qa_result.get('info_pair', {})
        original_data = info_pair.get('original_data', {})

        if isinstance(response, dict):
            if 'question' in response:
                all_qa_pairs.append(
                    {**original_data, 'qa_pairs': response}
                )

        elif isinstance(response, list):
            for item in response:
                if isinstance(item, dict) and 'question' in item:
                    all_qa_pairs.append(
                        {**original_data, 'qa_pairs': item}
                    )

        elif isinstance(response, str):
            LOG.warning(
                f'JsonFormatter failed to parse response, '
                f'skipping: {response[:100]}...'
            )

    return {**data, '_qa_pairs': all_qa_pairs}
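
How the three response shapes are treated can be seen side by side. This standalone sketch condenses the branching of parse_qa_pairs above into one helper; collect_qa_pairs is a hypothetical name, and it takes the list of QA results directly rather than the full record.

```python
def collect_qa_pairs(qa_results):
    """Merge each valid QA response with the original data of its information pair."""
    pairs = []
    for result in qa_results:
        response = result.get("response", "")
        original = result.get("info_pair", {}).get("original_data", {})
        if isinstance(response, dict):
            candidates = [response]
        elif isinstance(response, list):
            candidates = [item for item in response if isinstance(item, dict)]
        else:
            candidates = []  # e.g. a raw string that could not be parsed as JSON
        # Only candidates that carry a 'question' key count as QA pairs.
        pairs.extend({**original, "qa_pairs": c} for c in candidates if "question" in c)
    return pairs


qa_results = [
    {"response": {"question": "Q1", "answer": "A1"}, "info_pair": {"original_data": {"id": 1}}},
    {
        "response": [{"question": "Q2", "answer": "A2"}, {"note": "no question"}],
        "info_pair": {"original_data": {"id": 2}},
    },
    {"response": "not json", "info_pair": {"original_data": {"id": 3}}},
]
qa_pairs = collect_qa_pairs(qa_results)
```

The dict response and the first list item are kept, while the list item without a question key and the unparsed string are dropped.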

lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner_batch

KBCGenerateCleanedText

Bases: kbc

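Operator that generates cleaned text.

It uses an LLM to clean the raw chunk text, removing noise and normalizing the formatting. Multiple languages are supported, and when an LLM call fails the raw text is used as a fallback.

Parameters:

  • llm

    LLM service instance used to clean the text. If it is not provided, calling the operator raises a ValueError.

  • lang (str, default: 'en' ) –

    Language: 'en' for English, 'zh' for Chinese. Defaults to 'en'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input record extended with the cleaning result:

  • _cleaned_results

    List of cleaning results, each containing response, raw_chunk, and original_item.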

Examples:

from lazyllm.tools.data import kbc

# Assuming llm is an LLM service instance
cleaner = kbc.KBCGenerateCleanedText(llm=llm, lang='en')

data = {'_chunks_data': [{'raw_chunk': 'Noisy text with errors...'}]}
result = cleaner(data)
# Returns: {'_chunks_data': [...], '_cleaned_results': [{'response': 'Cleaned text', 'raw_chunk': '...', 'original_item': {...}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py
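class KBCGenerateCleanedText(kbc):
    """Operator that generates cleaned text.

It uses an LLM to clean the raw chunk text, removing noise and normalizing the formatting.
Multiple languages are supported, and when an LLM call fails the raw text is used as a fallback.

Args:
    llm: LLM service instance used to clean the text. If it is not provided, calling the operator raises a ValueError.
    lang (str): Language: 'en' for English, 'zh' for Chinese. Defaults to 'en'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input record extended with the cleaning result:
    _cleaned_results: List of cleaning results, each containing response, raw_chunk, and original_item.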


Examples:
    ```python
    from lazyllm.tools.data import kbc

    # Assuming llm is an LLM service instance
    cleaner = kbc.KBCGenerateCleanedText(llm=llm, lang='en')

    data = {'_chunks_data': [{'raw_chunk': 'Noisy text with errors...'}]}
    result = cleaner(data)
    # Returns: {'_chunks_data': [...], '_cleaned_results': [{'response': 'Cleaned text', 'raw_chunk': '...', 'original_item': {...}}]}
    ```
    """
    def __init__(self, llm=None, lang: str = 'en', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.prompts = DocRefinementPrompt(lang=lang)
        if llm is not None:
            # Note: DocRefinementPrompt may not have system prompt, use empty string
            system_prompt = getattr(self.prompts, 'build_system_prompt', lambda: '')()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(
        self,
        data: dict,
        **kwargs
    ) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        chunks_data = data.get('_chunks_data', [])
        if not chunks_data:
            return {**data, '_cleaned_results': []}

        cleaned_results = []
        for item in chunks_data:
            raw_chunk = item.get('raw_chunk', '')
            if not raw_chunk:
                continue

            # Build prompt for this chunk
            user_prompt = self.prompts.build_prompt(raw_chunk)

            try:
                # Call LLM (system prompt and formatter already set in __init__)
                response = self._llm_serve(user_prompt)

                cleaned_results.append({
                    'response': response,
                    'raw_chunk': raw_chunk,
                    'original_item': item
                })
            except Exception as e:
                LOG.warning(f'Failed to clean text: {e}')
                # Use raw chunk as fallback
                cleaned_results.append({
                    'response': raw_chunk,
                    'raw_chunk': raw_chunk,
                    'original_item': item
                })

        return {**data, '_cleaned_results': cleaned_results}
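
The fallback behaviour is the main difference from the QA generator: a failed call still produces a result entry. The sketch below illustrates this with a stand-in model. It rests on assumptions: clean_chunks and flaky_llm are hypothetical names, the stub is a plain callable rather than a LazyLLM service, and the raw chunk is passed to it directly instead of through DocRefinementPrompt.build_prompt.

```python
def clean_chunks(chunks_data, llm):
    """Clean every non-empty raw chunk, falling back to the raw text on failure."""
    results = []
    for item in chunks_data:
        raw_chunk = item.get("raw_chunk", "")
        if not raw_chunk:
            continue  # items without raw text are skipped entirely
        try:
            response = llm(raw_chunk)
        except Exception:
            response = raw_chunk  # fallback: keep the uncleaned text
        results.append({"response": response, "raw_chunk": raw_chunk, "original_item": item})
    return results


def flaky_llm(text):
    """Hypothetical stand-in for an LLM: collapses whitespace, fails on some inputs."""
    if "fail" in text:
        raise RuntimeError("simulated failure")
    return " ".join(text.split())


chunks = [{"raw_chunk": "noisy   text"}, {"raw_chunk": "please fail"}, {"raw_chunk": ""}]
cleaned = clean_chunks(chunks, flaky_llm)
```

The second entry keeps its raw text as the response, so the output has one entry per non-empty input chunk regardless of LLM failures.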

KBCLoadRAWChunkFile

Bases: kbc

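Operator that loads a raw chunk file.

It loads a JSON or JSONL file containing raw chunks (raw_chunk) from the given path. It is used in the knowledge-base cleaning pipeline to load the raw chunk data that needs cleaning.

Parameters:

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input record extended with the raw chunk data:

  • _chunks_data

    List of raw chunk data (empty if the file cannot be loaded or its first record has no raw_chunk field).

  • _chunk_path

    Path of the chunk file.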

Examples:

from lazyllm.tools.data import kbc

loader = kbc.KBCLoadRAWChunkFile()

data = {'chunk_path': '/path/to/raw_chunks.json'}
result = loader(data)
# Returns: {'chunk_path': '/path/to/raw_chunks.json', '_chunks_data': [{'raw_chunk': '...'}], '_chunk_path': '/path/to/raw_chunks.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py
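class KBCLoadRAWChunkFile(kbc):
    """Operator that loads a raw chunk file.

It loads a JSON or JSONL file containing raw chunks (raw_chunk) from the given path.
It is used in the knowledge-base cleaning pipeline to load the raw chunk data that needs cleaning.

Args:
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input record extended with the raw chunk data:
    _chunks_data: List of raw chunk data (empty if the file cannot be loaded or its first record has no raw_chunk field).
    _chunk_path: Path of the chunk file.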


Examples:
    ```python
    from lazyllm.tools.data import kbc

    loader = kbc.KBCLoadRAWChunkFile()

    data = {'chunk_path': '/path/to/raw_chunks.json'}
    result = loader(data)
    # Returns: {'chunk_path': '/path/to/raw_chunks.json', '_chunks_data': [{'raw_chunk': '...'}], '_chunk_path': '/path/to/raw_chunks.json'}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(
        self,
        data: dict,
        input_key: str = 'chunk_path',
        **kwargs
    ) -> dict:
        chunk_path = data.get(input_key, '')
        if not chunk_path or not os.path.exists(chunk_path):
            LOG.warning(f'Invalid chunk path: {chunk_path}')
            return {**data, '_chunks_data': [], '_chunk_path': chunk_path}

        try:
            if chunk_path.endswith('.json'):
                with open(chunk_path, 'r', encoding='utf-8') as f:
                    file_data = json.load(f)
            elif chunk_path.endswith('.jsonl'):
                with open(chunk_path, 'r', encoding='utf-8') as f:
                    file_data = [json.loads(line) for line in f]
            else:
                LOG.warning(f'Unsupported file format: {chunk_path}')
                return {**data, '_chunks_data': [], '_chunk_path': chunk_path}

            if not file_data or 'raw_chunk' not in file_data[0]:
                LOG.warning(f"'raw_chunk' field not found in: {chunk_path}")
                return {**data, '_chunks_data': [], '_chunk_path': chunk_path}

            return {**data, '_chunks_data': file_data, '_chunk_path': chunk_path}

        except Exception as e:
            LOG.error(f'Error loading chunk file {chunk_path}: {e}')
            return {**data, '_chunks_data': [], '_chunk_path': chunk_path}
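
Unlike KBCLoadChunkFile, this loader also validates the loaded data before returning it. The check itself is a one-liner, sketched here as a standalone helper; has_raw_chunks is a hypothetical name.

```python
def has_raw_chunks(file_data):
    """Mirror the validation above: non-empty data whose first record has 'raw_chunk'."""
    return bool(file_data) and "raw_chunk" in file_data[0]


valid = has_raw_chunks([{"raw_chunk": "a"}, {"raw_chunk": "b"}])
empty = has_raw_chunks([])
wrong_field = has_raw_chunks([{"cleaned_chunk": "a"}])
# Only the first record is inspected, so later records are not validated.
mixed = has_raw_chunks([{"raw_chunk": "a"}, {"other": 1}])
```

Since only the first record is inspected, a file whose later records lack raw_chunk still passes; such records are then skipped by KBCGenerateCleanedText, which ignores items with an empty raw_chunk.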

KBCSaveCleaned

Bases: kbc

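Operator that saves cleaned data.

It saves the cleaned chunk data as a JSON file, keeping the correspondence between each raw chunk and its cleaned version. When an output directory is given, the relative path structure of the original file is preserved.

Parameters:

  • output_dir (str, default: None ) –

    Output directory path. Defaults to None (save to the directory of the original file).

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The record containing the save result:

  • cleaned_chunk_path

    Path of the cleaned chunk file.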

Examples:

from lazyllm.tools.data import kbc

saver = kbc.KBCSaveCleaned(output_dir='./cleaned_output')

data = {'_chunk_path': '/path/to/raw_chunks.json', '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'cleaned'}]}
result = saver(data, output_key='cleaned_chunk_path')
# Returns: {'cleaned_chunk_path': './cleaned_output/path/to/raw_chunks_cleaned.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py
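class KBCSaveCleaned(kbc):
    """Operator that saves cleaned data.

This operator saves cleaned chunk data as a JSON file, preserving the mapping between each raw chunk and its cleaned counterpart.
An output directory can be specified, in which case the relative path structure of the original file is preserved.

Args:
    output_dir (str, optional): Output directory path. Defaults to None (save next to the original file).
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input data with the save result added:
    cleaned_chunk_path: Path of the cleaned chunk file.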


Examples:
    ```python
    from lazyllm.tools.data import kbc

    saver = kbc.KBCSaveCleaned(output_dir='./cleaned_output')

    data = {'_chunk_path': '/path/to/raw_chunks.json', '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'cleaned'}]}
    result = saver(data, output_key='cleaned_chunk_path')
    # Returns: {'cleaned_chunk_path': './cleaned_output/path/to/raw_chunks_cleaned.json'}
    ```
    """
    def __init__(self, output_dir: Optional[str] = None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.output_dir = output_dir

    def forward(self, data: dict, output_key: str = 'cleaned_chunk_path', **kwargs) -> dict:
        cleaned_chunks = data.get('_cleaned_chunks', [])
        chunk_path = data.get('_chunk_path', '')
        result = data.copy()

        if not chunk_path:
            return _clean_save_result(result, output_key)

        if not cleaned_chunks:
            LOG.warning(f'No cleaned chunks to save for {chunk_path}')
            return _clean_save_result(result, output_key, chunk_path)

        try:
            json_items = _build_json_items(cleaned_chunks)
            output_path = _get_save_output_path(chunk_path, self.output_dir)

            with open(output_path, 'w', encoding='utf-8') as f:
                json.dump(json_items, f, ensure_ascii=False, indent=4)

            LOG.info(f'Successfully saved cleaned chunks to {output_path}')
            return _clean_save_result(result, output_key, output_path)

        except Exception as e:
            LOG.error(f'Error saving cleaned chunks: {e}')
            return _clean_save_result(result, output_key)
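
The save step writes with `ensure_ascii=False` and `indent=4`, which keeps non-ASCII text readable in the output file. Below is a minimal standalone illustration of that choice, using `json.dumps` and a made-up record. The `_build_json_items` and `_get_save_output_path` helpers are not shown in this section and are not reproduced here.

```python
import json

# Made-up record with the two fields the operator pairs up.
items = [{'raw_chunk': '原始文本', 'cleaned_chunk': '清洗后文本'}]

escaped = json.dumps(items)                                 # default: \uXXXX escapes
readable = json.dumps(items, ensure_ascii=False, indent=4)  # settings used in the listing

print(readable)
```

Both strings decode to the same data; only the on-disk readability differs.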

extract_cleaned_content(data)

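Function that extracts cleaned content.

This function extracts the cleaned text from LLM cleaning results and handles the different response formats. It extracts the content between the <cleaned_start> and <cleaned_end> tags when both are present.

Parameters:

  • data (dict) –

    Data containing the cleaning results.

Returns:

  • dict ( dict ) –

    The input data with the extracted cleaned content added:

  • _cleaned_chunks ( list ) –

    List of cleaned chunks, each containing raw_chunk, cleaned_chunk and original_item.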

Examples:

from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner_batch import extract_cleaned_content

data = {'_cleaned_results': [{'response': '<cleaned_start>Clean text<cleaned_end>', 'raw_chunk': 'raw', 'original_item': {}}]}
result = extract_cleaned_content(data)
# Returns: {'_cleaned_results': [...], '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'Clean text', 'original_item': {}}]}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner_batch.py
@data_register('data.kbc', rewrite_func='forward', _concurrency_mode='process')
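def extract_cleaned_content(data: dict) -> dict:
    """Function that extracts cleaned content.

This function extracts the cleaned text from LLM cleaning results and handles the different response formats.
It extracts the content between the <cleaned_start> and <cleaned_end> tags when both are present.

Args:
    data (dict): Data containing the cleaning results.

Returns:
    dict: The input data with the extracted cleaned content added:
    _cleaned_chunks: List of cleaned chunks, each containing raw_chunk, cleaned_chunk and original_item.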


Examples:
    ```python
    from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner_batch import extract_cleaned_content

    data = {'_cleaned_results': [{'response': '<cleaned_start>Clean text<cleaned_end>', 'raw_chunk': 'raw', 'original_item': {}}]}
    result = extract_cleaned_content(data)
    # Returns: {'_cleaned_results': [...], '_cleaned_chunks': [{'raw_chunk': 'raw', 'cleaned_chunk': 'Clean text', 'original_item': {}}]}
    ```
    """
    cleaned_results = data.get('_cleaned_results', [])
    if not cleaned_results:
        return {**data, '_cleaned_chunks': []}

    cleaned_chunks = []
    for result in cleaned_results:
        response = result.get('response', '')
        raw_chunk = result.get('raw_chunk', '')
        original_item = result.get('original_item', {})

        # Handle different response types from JsonFormatter
        if isinstance(response, dict):
            # JsonFormatter returned a dict, extract text field or convert to string
            text = response.get('text', '') or response.get('content', '') or str(response)
        elif isinstance(response, list):
            # JsonFormatter returned a list: take the first item
            text = response[0] if response else ''
            if isinstance(text, dict):
                text = text.get('text', '') or text.get('content', '') or str(text)
        elif isinstance(response, str):
            # JsonFormatter failed to parse, use as-is
            text = response
        else:
            text = str(response)

        # Extract content between tags
        if '<cleaned_start>' in text and '<cleaned_end>' in text:
            try:
                cleaned_text = text.split('<cleaned_start>')[1].split('<cleaned_end>')[0].strip()
            except IndexError:
                cleaned_text = text.strip()
        else:
            cleaned_text = text.strip()

        cleaned_chunks.append({
            'raw_chunk': raw_chunk,
            'cleaned_chunk': cleaned_text,
            'original_item': original_item
        })

    return {**data, '_cleaned_chunks': cleaned_chunks}
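
The tag handling in the listing above can be isolated into a small pure function. This is a sketch under the same split-based approach; `extract_between_tags`, `START_TAG` and `END_TAG` are illustrative names, not lazyllm APIs.

```python
START_TAG = '<cleaned_start>'
END_TAG = '<cleaned_end>'


def extract_between_tags(text: str) -> str:
    """Return the text between the first start tag and the following end tag."""
    if START_TAG in text and END_TAG in text:
        after_start = text.split(START_TAG)[1]
        return after_start.split(END_TAG)[0].strip()
    # No complete tag pair: fall back to the whole response.
    return text.strip()


tagged = extract_between_tags('Sure!<cleaned_start> Clean text <cleaned_end>Done.')
untagged = extract_between_tags('  plain answer  ')
reversed_tags = extract_between_tags('<cleaned_end>x<cleaned_start>y')
```

One consequence of splitting rather than matching a pattern: when the end tag appears only before the start tag, everything after the start tag is returned, as the `reversed_tags` case shows.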

lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner

KBCGenerateCleanedTextSingle

Bases: kbc

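Operator that generates cleaned text for a single item.

This operator uses an LLM to clean a single piece of raw text, removing noise and normalizing its formatting. It suits real-time cleaning of individual items; if the LLM call fails, the raw text is used as a fallback.

Parameters:

  • llm

    LLM service instance used to clean the text.

  • lang (str, default: 'en' ) –

    Language: 'en' for English, 'zh' for Chinese. Defaults to 'en'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input data with the cleaning response added:

  • _cleaned_response

    The LLM's cleaning response.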

Examples:

from lazyllm.tools.data import kbc

# Assuming llm is an LLM service instance
cleaner = kbc.KBCGenerateCleanedTextSingle(llm=llm, lang='en')

data = {'raw_chunk': 'Noisy text with errors...'}
result = cleaner(data, input_key='raw_chunk')
# Returns: {'raw_chunk': '...', '_cleaned_response': 'Cleaned text result'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner.py
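class KBCGenerateCleanedTextSingle(kbc):
    """Operator that generates cleaned text for a single item.

This operator uses an LLM to clean a single piece of raw text, removing noise and normalizing its formatting.
It suits real-time cleaning of individual items; if the LLM call fails, the raw text is used as a fallback.

Args:
    llm: LLM service instance used to clean the text.
    lang (str): Language: 'en' for English, 'zh' for Chinese. Defaults to 'en'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input data with the cleaning response added:
    _cleaned_response: The LLM's cleaning response.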


Examples:
    ```python
    from lazyllm.tools.data import kbc

    # Assuming llm is an LLM service instance
    cleaner = kbc.KBCGenerateCleanedTextSingle(llm=llm, lang='en')

    data = {'raw_chunk': 'Noisy text with errors...'}
    result = cleaner(data, input_key='raw_chunk')
    # Returns: {'raw_chunk': '...', '_cleaned_response': 'Cleaned text result'}
    ```
    """

    def __init__(self, llm=None, lang: str = 'en', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        # Initialize prompt template
        self.prompts = DocRefinementPrompt(lang=lang)

        # Initialize LLM serve with system prompt and formatter
        if llm is not None:
            # Note: DocRefinementPrompt may not have system prompt, use empty string
            system_prompt = getattr(self.prompts, 'build_system_prompt', lambda: '')()
            self._llm_serve = llm.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(
        self,
        data: dict,
        input_key: str = 'raw_chunk',
        **kwargs
    ) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM is not configured')

        raw_content = data.get(input_key, '')
        if not raw_content:
            return {**data, '_cleaned_response': raw_content}

        # Build prompt for the raw content
        user_prompt = self.prompts.build_prompt(raw_content)

        try:
            # Call LLM (system prompt and formatter already set in __init__)
            response = self._llm_serve(user_prompt)
            return {**data, '_cleaned_response': response}
        except Exception as e:
            LOG.warning(f'Failed to clean text: {e}')
            # Use raw content as fallback
            return {**data, '_cleaned_response': raw_content}
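
The fallback behaviour of `forward` can be tried without a real model. The sketch below assumes any callable that maps text to text can stand in for the LLM; `clean_with_fallback`, `upper_llm` and `broken_llm` are hypothetical names, and the prompt-building step of the real operator is omitted.

```python
def clean_with_fallback(llm, data: dict, input_key: str = 'raw_chunk') -> dict:
    """Call llm on data[input_key]; keep the raw text if the call fails."""
    raw = data.get(input_key, '')
    if not raw:
        return {**data, '_cleaned_response': raw}
    try:
        return {**data, '_cleaned_response': llm(raw)}
    except Exception as exc:  # broad on purpose, mirroring the listing above
        print(f'Failed to clean text: {exc}')
        return {**data, '_cleaned_response': raw}


def upper_llm(text: str) -> str:
    """Stand-in for a model: upper-cases the text."""
    return text.upper()


def broken_llm(text: str) -> str:
    """Stand-in for a model whose backend is unavailable."""
    raise RuntimeError('service unavailable')


ok = clean_with_fallback(upper_llm, {'raw_chunk': 'noisy text'})
fallback = clean_with_fallback(broken_llm, {'raw_chunk': 'noisy text'})
```

Because the fallback stores the raw text under the same key, downstream steps cannot tell a failed call from a successful one; callers that need the distinction would have to record it separately.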

extract_cleaned_content_single(data, output_key='cleaned_chunk')

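Function that extracts cleaned content for a single item.

This function extracts the cleaned text from a single LLM cleaning response and handles the different response formats. It extracts the content between the <cleaned_start> and <cleaned_end> tags when both are present, and removes intermediate fields.

Parameters:

  • data (dict) –

    Data containing the cleaning response.

  • output_key (str, default: 'cleaned_chunk' ) –

    Output field name. Defaults to 'cleaned_chunk'.

Returns:

  • dict ( dict ) –

    The input data with the extracted cleaned content added under the field named by output_key.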

Examples:

from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner import extract_cleaned_content_single

data = {'_cleaned_response': '<cleaned_start>Clean text<cleaned_end>'}
result = extract_cleaned_content_single(data, output_key='cleaned_chunk')
# Returns: {'cleaned_chunk': 'Clean text'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/kbc_text_cleaner.py
@data_register('data.kbc', rewrite_func='forward', _concurrency_mode='process')
def extract_cleaned_content_single(
    data: dict,
    output_key: str = 'cleaned_chunk',
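) -> dict:
    """Function that extracts cleaned content for a single item.

This function extracts the cleaned text from a single LLM cleaning response and handles the different response formats.
It extracts the content between the <cleaned_start> and <cleaned_end> tags when both are present, and removes intermediate fields.

Args:
    data (dict): Data containing the cleaning response.
    output_key (str): Output field name. Defaults to 'cleaned_chunk'.

Returns:
    dict: The input data with the extracted cleaned content added under the field named by output_key.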


Examples:
    ```python
    from lazyllm.tools.data.operators.knowledge_cleaning.kbc_text_cleaner import extract_cleaned_content_single

    data = {'_cleaned_response': '<cleaned_start>Clean text<cleaned_end>'}
    result = extract_cleaned_content_single(data, output_key='cleaned_chunk')
    # Returns: {'cleaned_chunk': 'Clean text'}
    ```
    """
    response = data.get('_cleaned_response', '')

    # Handle different response types from JsonFormatter
    if isinstance(response, dict):
        # JsonFormatter returned a dict, extract text field or convert to string
        text = response.get('text', '') or response.get('content', '') or str(response)
    elif isinstance(response, list):
        # JsonFormatter returned a list: take the first item
        text = response[0] if response else ''
        if isinstance(text, dict):
            text = text.get('text', '') or text.get('content', '') or str(text)
    elif isinstance(response, str):
        # JsonFormatter failed to parse, use as-is
        text = response
    else:
        text = str(response)

    # Extract content between tags
    if '<cleaned_start>' in text and '<cleaned_end>' in text:
        try:
            cleaned_text = text.split('<cleaned_start>')[1].split('<cleaned_end>')[0].strip()
        except IndexError:
            cleaned_text = text.strip()
    else:
        cleaned_text = text.strip()

    result = data.copy()
    result[output_key] = cleaned_text
    # Clean intermediate fields
    for key in ['_cleaned_response']:
        result.pop(key, None)
    return result
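
Before tag extraction, the function reduces whatever the formatter returned (a dict, a list, or a string) to a single string. The following sketch isolates that normalization; `response_to_text` is a hypothetical name, and coercing non-string list items with `str()` is an assumption added here rather than behaviour taken from the listing.

```python
def response_to_text(response) -> str:
    """Reduce a str, dict, or list response to a single string."""
    if isinstance(response, dict):
        return response.get('text', '') or response.get('content', '') or str(response)
    if isinstance(response, list):
        first = response[0] if response else ''
        if isinstance(first, dict):
            return first.get('text', '') or first.get('content', '') or str(first)
        return str(first)  # assumption: coerce non-string items
    return response if isinstance(response, str) else str(response)


from_dict = response_to_text({'content': 'from content'})
from_list = response_to_text([{'text': 'first'}, {'text': 'second'}])
from_empty = response_to_text([])
from_other = response_to_text({'answer': 42})
```

Note that a dict with neither a non-empty `text` nor `content` field falls through to its string representation, so `from_other` is the printed form of the dict rather than an empty string.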

lazyllm.tools.data.operators.knowledge_cleaning.qa_extract

KBCExtractQAPairs

Bases: kbc

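Operator that extracts QA pairs.

This operator extracts QA pairs from the loaded QA data and converts them to a standard format. The output field names for the instruction, question and answer can be customized.

Parameters:

  • qa_key (str, default: 'QA_pairs' ) –

    Name of the QA data field. Defaults to 'QA_pairs'.

  • instruction (str, default: 'Please answer the following question based on the provided information.' ) –

    Instruction text. Defaults to 'Please answer the following question based on the provided information.'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: List of extracted QA pairs, each containing the instruction, input and output fields.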

Examples:

from lazyllm.tools.data import kbc

extractor = kbc.KBCExtractQAPairs(
    qa_key='QA_pairs',
    instruction='Please answer based on the context.'
)

data = {'_qa_data': {'qa_pairs': [{'question': 'What is AI?', 'answer': 'Artificial Intelligence'}]}}
result = extractor(
    data,
    output_instruction_key='instruction',
    output_question_key='input',
    output_answer_key='output'
)
# Returns: [{'instruction': 'Please answer based on the context.', 'input': 'What is AI?', 'output': 'Artificial Intelligence'}]
Source code in lazyllm/tools/data/operators/knowledge_cleaning/qa_extract.py
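class KBCExtractQAPairs(kbc):
    """Operator that extracts QA pairs.

This operator extracts QA pairs from the loaded QA data and converts them to a standard format.
The output field names for the instruction, question and answer can be customized.

Args:
    qa_key (str): Name of the QA data field. Defaults to 'QA_pairs'.
    instruction (str): Instruction text. Defaults to 'Please answer the following question based on the provided information.'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: List of extracted QA pairs, each containing the instruction, input and output fields.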


Examples:
    ```python
    from lazyllm.tools.data import kbc

    extractor = kbc.KBCExtractQAPairs(
        qa_key='QA_pairs',
        instruction='Please answer based on the context.'
    )

    data = {'_qa_data': {'qa_pairs': [{'question': 'What is AI?', 'answer': 'Artificial Intelligence'}]}}
    result = extractor(
        data,
        output_instruction_key='instruction',
        output_question_key='input',
        output_answer_key='output'
    )
    # Returns: [{'instruction': 'Please answer based on the context.', 'input': 'What is AI?', 'output': 'Artificial Intelligence'}]
    ```
    """
    def __init__(
        self,
        qa_key: str = 'QA_pairs',
        instruction: str = 'Please answer the following question based on the provided information.',
        **kwargs
    ):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.qa_key = qa_key
        self.instruction = instruction

    def forward(
        self,
        data: dict,
        output_instruction_key: str = 'instruction',
        output_question_key: str = 'input',
        output_answer_key: str = 'output',
        **kwargs
    ) -> List[dict]:
        qa_data = data.get('_qa_data')
        if not qa_data:
            return []

        # Extract qa_pairs - handle both dict with 'qa_pairs' key and direct list
        qa_list = qa_data.get('qa_pairs', []) if isinstance(qa_data, dict) else qa_data
        if not isinstance(qa_list, list):
            qa_list = [qa_list] if isinstance(qa_list, dict) else []

        results = []
        for qa in qa_list:
            if not isinstance(qa, dict):
                continue

            question = qa.get('question', '').strip()
            answer = qa.get('answer', '').strip()

            if not question or not answer:
                continue

            item = {
                output_instruction_key: self.instruction,
                output_question_key: question,
                output_answer_key: answer
            }
            results.append(item)

        return results
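
The conversion performed by `forward` is a plain data transformation and can be sketched without lazyllm. `to_instruction_records` is a hypothetical name; the sketch accepts either a `{'qa_pairs': [...]}` dict or a bare list, and for brevity it does not wrap a single dict into a list the way the listing above does.

```python
from typing import List

DEFAULT_INSTRUCTION = 'Please answer the following question based on the provided information.'


def to_instruction_records(qa_data, instruction: str = DEFAULT_INSTRUCTION,
                           question_key: str = 'input',
                           answer_key: str = 'output') -> List[dict]:
    """Turn {'qa_pairs': [...]} or a bare list of QA dicts into training records."""
    qa_list = qa_data.get('qa_pairs', []) if isinstance(qa_data, dict) else qa_data
    if not isinstance(qa_list, list):
        return []
    records = []
    for qa in qa_list:
        if not isinstance(qa, dict):
            continue
        question = qa.get('question', '').strip()
        answer = qa.get('answer', '').strip()
        if question and answer:  # drop incomplete pairs
            records.append({'instruction': instruction,
                            question_key: question,
                            answer_key: answer})
    return records


records = to_instruction_records({'qa_pairs': [
    {'question': ' What is AI? ', 'answer': 'Artificial Intelligence'},
    {'question': 'Unanswered?', 'answer': ''},
]})
```

Pairs with an empty question or answer are dropped silently, so the output can be shorter than the input list.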

KBCLoadQAData

Bases: kbc

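Operator that loads QA data.

This operator loads QA data from the input data or from chunk files. It first checks whether the input already contains QA data; if not, it tries the enhanced chunk file, the cleaned chunk file, and the plain chunk file, in that order.

Parameters:

  • qa_key (str, default: 'QA_pairs' ) –

    Name of the QA data field. Defaults to 'QA_pairs'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input data with the QA data added:

  • _qa_data

    The loaded QA data (None if nothing was found).

  • _source_file

    Path of the file the data came from (only when loaded from a file).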

Examples:

from lazyllm.tools.data import kbc

loader = kbc.KBCLoadQAData(qa_key='QA_pairs')

# From existing data
data = {'QA_pairs': [{'question': 'Q1', 'answer': 'A1'}]}
result = loader(data)
# Returns: {'QA_pairs': [...], '_qa_data': [...]}

# From file
data = {'enhanced_chunk_path': '/path/to/enhanced.json'}
result = loader(data)
# Returns: {'enhanced_chunk_path': '...', '_qa_data': [...], '_source_file': '/path/to/enhanced.json'}
Source code in lazyllm/tools/data/operators/knowledge_cleaning/qa_extract.py
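class KBCLoadQAData(kbc):
    """Operator that loads QA data.

This operator loads QA data from the input data or from chunk files. It first checks whether the input already contains QA data;
if not, it tries the enhanced chunk file, the cleaned chunk file, and the plain chunk file, in that order.

Args:
    qa_key (str): Name of the QA data field. Defaults to 'QA_pairs'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input data with the QA data added:
    _qa_data: The loaded QA data (None if nothing was found).
    _source_file: Path of the file the data came from (only when loaded from a file).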


Examples:
    ```python
    from lazyllm.tools.data import kbc

    loader = kbc.KBCLoadQAData(qa_key='QA_pairs')

    # From existing data
    data = {'QA_pairs': [{'question': 'Q1', 'answer': 'A1'}]}
    result = loader(data)
    # Returns: {'QA_pairs': [...], '_qa_data': [...]}

    # From file
    data = {'enhanced_chunk_path': '/path/to/enhanced.json'}
    result = loader(data)
    # Returns: {'enhanced_chunk_path': '...', '_qa_data': [...], '_source_file': '/path/to/enhanced.json'}
    ```
    """
    def __init__(self, qa_key: str = 'QA_pairs', **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.qa_key = qa_key

    def forward(
        self,
        data: dict,
        **kwargs
    ) -> dict:
        # Check if QA data already exists in the data
        if self.qa_key in data:
            return {**data, '_qa_data': data.get(self.qa_key)}

        # Try to load from chunk files
        path_keys = ['enhanced_chunk_path', 'cleaned_chunk_path', 'chunk_path']

        for path_key in path_keys:
            file_path = data.get(path_key)
            if not file_path or not Path(file_path).exists():
                continue

            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    chunks = json.load(f)
                    chunks = chunks if isinstance(chunks, list) else [chunks]

                    for chunk in chunks:
                        if self.qa_key in chunk:
                            return {
                                **data,
                                '_qa_data': chunk[self.qa_key],
                                '_source_file': file_path
                            }
            except Exception as e:
                LOG.error(f'Failed to load {file_path}: {e}')
                continue

        # No QA data found
        return {**data, '_qa_data': None}
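
The lookup order used by `forward` (inline data first, then the three path keys in priority order) can be sketched as a standalone function. `find_qa_data` and `PATH_KEYS` are illustrative names, and the sketch leaves out the error logging of the real operator.

```python
import json
import os
import tempfile

PATH_KEYS = ('enhanced_chunk_path', 'cleaned_chunk_path', 'chunk_path')


def find_qa_data(data: dict, qa_key: str = 'QA_pairs'):
    """Return (qa_data, source_file); source_file is None for inline data."""
    if qa_key in data:
        return data[qa_key], None
    for path_key in PATH_KEYS:
        path = data.get(path_key)
        if not path or not os.path.exists(path):
            continue
        with open(path, 'r', encoding='utf-8') as f:
            chunks = json.load(f)
        chunks = chunks if isinstance(chunks, list) else [chunks]
        for chunk in chunks:
            if isinstance(chunk, dict) and qa_key in chunk:
                return chunk[qa_key], path
    return None, None


with tempfile.TemporaryDirectory() as tmp:
    cleaned = os.path.join(tmp, 'cleaned.json')
    with open(cleaned, 'w', encoding='utf-8') as f:
        json.dump([{'raw_chunk': 'x'},
                   {'QA_pairs': [{'question': 'Q1', 'answer': 'A1'}]}], f)
    qa, source = find_qa_data({'enhanced_chunk_path': os.path.join(tmp, 'absent.json'),
                               'cleaned_chunk_path': cleaned})
    source_name = os.path.basename(source)
inline_qa, inline_source = find_qa_data({'QA_pairs': ['inline']})
```

Only the first chunk that carries the QA field is returned, so a file with QA data spread over several chunks yields just the first set.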

Chain-of-Thought Generation Operators

lazyllm.tools.data.operators.cot_ops

CoTGenerator

Bases: GenCot

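Uses an LLM to generate a chain-of-thought (CoT) reasoning process for a question, requiring the final answer to be wrapped in \boxed{ANSWER}. The output is written to the specified field.

Parameters:

  • input_key (str, default: 'query' ) –

    Name of the input question field. Defaults to 'query'.

  • output_key (str, default: 'cot_answer' ) –

    Name of the output CoT answer field. Defaults to 'cot_answer'.

  • model

    Optional. A TrainableModule or an object with a compatible interface; when None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional user prompt prefix; when None, the default is used.

  • **kwargs

    Other base-class arguments.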

Examples:

from lazyllm.tools.data import genCot
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = genCot.CoTGenerator(input_key='query', output_key='cot_answer', model=llm)
data = {'query': 'What is 2+2?'}
res = op(data)  # each item gets 'cot_answer' with CoT and \boxed{4}
print(res)
# {'query': 'What is 2+2?', 'cot_answer': '首先,我们需要理解加法的基本概念,即两个或多个数值的总和。在这个问题中,我们需要计算 2 和另一个 2 的和。
#
# 第一步,我们识别出第一个数值是 2。
#
# 第二步,我们识别出第二个数值也是 2。
#
# 第三步,我们将这两个数值相加:2 + 2。
#
# 第四步,我们进行计算:2 + 2 = 4。
#
# 因此,最终答案是 4,使用规定的格式包裹答案。
#
# 最终答案:\boxed{4}'}
Source code in lazyllm/tools/data/operators/cot_ops.py
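class CoTGenerator(GenCot):
    """Uses an LLM to generate a chain-of-thought (CoT) reasoning process for a question, requiring the final answer to be wrapped in \\boxed{ANSWER}. The output is written to the specified field.

Args:
    input_key (str): Name of the input question field. Defaults to 'query'.
    output_key (str): Name of the output CoT answer field. Defaults to 'cot_answer'.
    model: Optional. A TrainableModule or an object with a compatible interface; when None, the default Qwen model is used.
    user_prompt (str|None): Optional user prompt prefix; when None, the default is used.
    **kwargs: Other base-class arguments.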


Examples:
    ```python
    from lazyllm.tools.data import genCot
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = genCot.CoTGenerator(input_key='query', output_key='cot_answer', model=llm)
    data = {'query': 'What is 2+2?'}
    res = op(data)  # each item gets 'cot_answer' with CoT and \\boxed{4}
    print(res)
    # {'query': 'What is 2+2?', 'cot_answer': '首先,我们需要理解加法的基本概念,即两个或多个数值的总和。在这个问题中,我们需要计算 2 和另一个 2 的和。
    #
    # 第一步,我们识别出第一个数值是 2。
    #
    # 第二步,我们识别出第二个数值也是 2。
    #
    # 第三步,我们将这两个数值相加:2 + 2。
    #
    # 第四步,我们进行计算:2 + 2 = 4。
    #
    # 因此,最终答案是 4,使用规定的格式包裹答案。
    #
    # 最终答案:\\boxed{4}'}
    ```
    """
    def __init__(self,
                 input_key='query',
                 output_key='cot_answer',
                 model=None,
                 user_prompt=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求:
        {{
            "{self.output_key}": "包含CoT推理过程和最终boxed答案"
        }}
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):
        question = data.get(self.input_key, '')
        if not question:
            data[self.output_key] = None
            return data

        base_prompt = f'''
        问题:
        {question}

        规则:
        - 输出详细CoT
        - 最终答案必须使用 \\boxed{{ANSWER}} 包裹
        '''

        if self.user_prompt is None:
            user_prompt = '请为这个问题生成带有思维链(Chain-of-Thought, CoT)的输出结果:\n' + base_prompt
        else:
            user_prompt = self.user_prompt + '\n' + f'问题:{question}'

        res = self.model(user_prompt)
        data[self.output_key] = res.get(self.output_key, None)
        return data
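
Prompts that mention `\boxed{...}` are sensitive to Python string escaping: in a regular (non-raw) string literal `\b` is the backspace character, and inside an f-string literal braces must be doubled. A short standalone check of both rules, independent of lazyllm:

```python
answer = 'ANSWER'

unescaped = 'use \boxed{4}'             # '\b' is parsed as a backspace (0x08)
escaped = 'use \\boxed{4}'              # doubled backslash keeps the LaTeX command
raw = r'use \boxed{4}'                  # a raw string does the same
templated = f'use \\boxed{{{answer}}}'  # {{ and }} produce literal braces

print(repr(unescaped))
print(escaped)
print(templated)
```

A prompt built from the unescaped form shows the model `oxed{...}` preceded by a control character, which is also how such text tends to appear once rendered.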

SelfConsistencyCoTGenerator

Bases: GenCot

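Samples several CoT responses for the same question, extracts the answer from \boxed{} in each one, takes a majority vote, and keeps one CoT whose answer matches the majority. The extracted answers are also stored in the 'candidates' field.

Parameters:

  • input_key (str, default: 'query' ) –

    Name of the input question field. Defaults to 'query'.

  • output_key (str, default: 'cot_answer' ) –

    Name of the output CoT answer field. Defaults to 'cot_answer'.

  • num_samples (int, default: 5 ) –

    Number of samples. Defaults to 5.

  • model

    Optional; when None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional user prompt.

  • **kwargs

    Other base-class arguments.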

Examples:

from lazyllm.tools.data import genCot
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = genCot.SelfConsistencyCoTGenerator(
    input_key='query',
    output_key='cot_answer',
    num_samples=3,
    model=llm
)

data = {'query': 'What is 3*4?'}
res = op(data)
print(res)
# {'query': 'What is 3*4?', 'candidates': ['12', '12', '12'], 'cot_answer': '首先,我们需要理解问题的核心,即计算3乘以4的结果。
#
# 1. 确定操作:这是一个乘法问题,我们需要将两个数相乘。
# 2. 识别数字:问题中给出的两个数字是3和4。
# 3. 执行乘法:将3乘以4,计算过程如下:
#    - 3 * 4 = 12
#
# 因此,3乘以4的结果是12。
#
# 最终答案为:\boxed{12}'}
Source code in lazyllm/tools/data/operators/cot_ops.py
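class SelfConsistencyCoTGenerator(GenCot):
    """Samples several CoT responses for the same question, extracts the answer from \\boxed{} in each one, takes a majority vote, and keeps one CoT whose answer matches the majority.

Args:
    input_key (str): Name of the input question field. Defaults to 'query'.
    output_key (str): Name of the output CoT answer field. Defaults to 'cot_answer'.
    num_samples (int): Number of samples. Defaults to 5.
    model: Optional; when None, the default Qwen model is used.
    user_prompt (str|None): Optional user prompt.
    **kwargs: Other base-class arguments.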


Examples:
    ```python
    from lazyllm.tools.data import genCot
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = genCot.SelfConsistencyCoTGenerator(
        input_key='query',
        output_key='cot_answer',
        num_samples=3,
        model=llm
    )

    data = {'query': 'What is 3*4?'}
    res = op(data)
    print(res)
    # {'query': 'What is 3*4?', 'candidates': ['12', '12', '12'], 'cot_answer': '首先,我们需要理解问题的核心,即计算3乘以4的结果。
    #
    # 1. 确定操作:这是一个乘法问题,我们需要将两个数相乘。
    # 2. 识别数字:问题中给出的两个数字是3和4。
    # 3. 执行乘法:将3乘以4,计算过程如下:
    #    - 3 * 4 = 12
    #
    # 因此,3乘以4的结果是12。
    #
    # 最终答案为:\\boxed{12}'}
    ```
    """
    def __init__(self,
                 input_key='query',
                 output_key='cot_answer',
                 num_samples=5,
                 model=None,
                 user_prompt=None,
                 **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.num_samples = num_samples
        self.user_prompt = user_prompt

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()
        self.model.start()

    def _build_prompt(self, question):
        base_prompt = f'''
        问题:
        {question}

        规则:
        - 输出详细CoT
        - 最终答案必须使用 \\boxed{{ANSWER}} 包裹
        '''
        if self.user_prompt is None:
            return '请为这个问题生成带有思维链(Chain-of-Thought, CoT)的输出结果:\n' + base_prompt
        return self.user_prompt + '\n' + f'问题:{question};'

    def forward(self, data):
        question = data.get(self.input_key, '')
        if not question:
            data[self.output_key] = None
            return data

        cot_list = []
        boxed_answers = []

        prompt = self._build_prompt(question)
        candidates = []
        for _ in range(self.num_samples):
            response = self.model(prompt)
            cot = response
            boxed = boxed_res_extractor(response)
            candidates.append(boxed)
            if boxed is not None:
                cot_list.append(cot)
                boxed_answers.append(boxed)

        if not boxed_answers:
            data[self.output_key] = None
            return data

        counter = Counter(boxed_answers)
        majority_answer = counter.most_common(1)[0][0]
        data['candidates'] = candidates
        for cot, ans in zip(cot_list, boxed_answers):
            if ans == majority_answer:
                data[self.output_key] = cot
                return data

        data[self.output_key] = None
        return data
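
The voting step can be exercised on canned responses. In the sketch below `extract_boxed` is a hypothetical regex-based stand-in for `boxed_res_extractor` (whose implementation is not shown in this section), and it handles only `\boxed{...}` without nested braces.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Hypothetical extractor: content of the last \\boxed{...} without nested braces."""
    matches = re.findall(r'\\boxed\{([^{}]*)\}', text)
    return matches[-1].strip() if matches else None


def pick_majority(samples: List[str]) -> Optional[str]:
    """Return the first sample whose boxed answer equals the most common answer."""
    answers = [extract_boxed(s) for s in samples]
    valid = [(s, a) for s, a in zip(samples, answers) if a is not None]
    if not valid:
        return None
    majority, _ = Counter(a for _, a in valid).most_common(1)[0]
    return next(s for s, a in valid if a == majority)


samples = [
    r'3 * 4 = 11, so \boxed{11}',
    r'3 + 3 + 3 + 3 = 12, so \boxed{12}',
    'no final answer given',
    r'4 * 3 = 12, so \boxed{12}',
]
chosen = pick_majority(samples)
```

Answers are compared as strings, so `12` and `12.0` count as different votes; normalizing answers before counting would be a separate design decision.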

answer_verify(data, answer_key='reference', infer_key='llm_extracted', output_key='is_equal')

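Checks whether the reference answer and the answer extracted from the model are mathematically equal. Both are parsed and verified with math_verify, and the result is written to the specified field. Registered as a per-item forward function.

Parameters:

  • data (dict) –

    A single data item (dict).

  • answer_key (str, default: 'reference' ) –

    Name of the reference answer field. Defaults to 'reference'.

  • infer_key (str, default: 'llm_extracted' ) –

    Name of the field holding the answer extracted from the model. Defaults to 'llm_extracted'.

  • output_key (str, default: 'is_equal' ) –

    Name of the field the equality result is written to. Defaults to 'is_equal'.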

Examples:

from lazyllm.tools.data import genCot

data = {'reference': '1/2', 'llm_extracted': '0.5'}
op = genCot.answer_verify(answer_key='reference', infer_key='llm_extracted', output_key='is_equal')
print(op(data))  # Add key/value: 'is_equal': True
# {'reference': '1/2', 'llm_extracted': '0.5', 'is_equal': True}
Source code in lazyllm/tools/data/operators/cot_ops.py
@data_register('data.genCot', rewrite_func='forward')
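def answer_verify(data, answer_key='reference', infer_key='llm_extracted', output_key='is_equal'):
    """Checks whether the reference answer and the answer extracted from the model are mathematically equal. Both are parsed and verified with math_verify, and the result is written to the specified field. Registered as a per-item forward function.

Args:
    data (dict): A single data item (dict).
    answer_key (str): Name of the reference answer field. Defaults to 'reference'.
    infer_key (str): Name of the field holding the answer extracted from the model. Defaults to 'llm_extracted'.
    output_key (str): Name of the field the equality result is written to. Defaults to 'is_equal'.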


Examples:
    ```python
    from lazyllm.tools.data import genCot

    data = {'reference': '1/2', 'llm_extracted': '0.5'}
    op = genCot.answer_verify(answer_key='reference', infer_key='llm_extracted', output_key='is_equal')
    print(op(data))  # Add key/value: 'is_equal': True
    # {'reference': '1/2', 'llm_extracted': '0.5', 'is_equal': True}
    ```
    """
    real_answer = data.get(answer_key, None)
    llm_answer = data.get(infer_key, None)

    if real_answer is None or llm_answer is None:
        data[output_key] = False
        return data

    try:
        parsed_real = math_verify.parse(str(real_answer))
        parsed_llm = math_verify.parse(str(llm_answer))
        data[output_key] = math_verify.verify(parsed_real, parsed_llm)

    except Exception as e:
        LOG.error(f'Error verifying answers: {e}')
        data[output_key] = False

    return data
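
To see why `'1/2'` and `'0.5'` should compare equal, here is a standard-library stand-in restricted to plain rational numbers. It uses `fractions.Fraction` instead of math_verify, so it is only an illustration of the idea; it does not handle LaTeX or symbolic expressions, which is what math_verify is used for in the listing above. `numerically_equal` is a hypothetical name.

```python
from fractions import Fraction


def numerically_equal(reference, inferred) -> bool:
    """Stand-in check: exact equality of two rational-number strings."""
    if reference is None or inferred is None:
        return False
    try:
        return Fraction(str(reference).strip()) == Fraction(str(inferred).strip())
    except (ValueError, ZeroDivisionError):
        return False  # not a plain number, e.g. 'x + 1'


same = numerically_equal('1/2', '0.5')
different = numerically_equal('1/3', '0.33')
unparsable = numerically_equal('x + 1', '1 + x')
```

As in the operator, anything that cannot be parsed is reported as not equal rather than raising.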

Diversity Enhancement Operators

lazyllm.tools.data.operators.enQa_ops

DiversityScorer

Bases: EnQA

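Scores a list of questions for diversity. The output list follows the input order, and each item contains rewritten_query and diversity_score (0: similar, 1: clearly different).

Parameters:

  • input_key (str, default: 'rewrite_querys' ) –

    Name of the field holding the question list. Defaults to 'rewrite_querys'.

  • output_key (str, default: 'diversity_querys' ) –

    Name of the field the scored list is written to. Defaults to 'diversity_querys'.

  • model

    Optional; when None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional user prompt.

  • **kwargs

    Other base-class arguments.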

Examples:

from lazyllm.tools.data import EnQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = EnQA.DiversityScorer(input_key='rewrite_querys', output_key='diversity_querys', model=llm)
data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!']}
res = op(data)
print(data)
# {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
Source code in lazyllm/tools/data/operators/enQa_ops.py
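class DiversityScorer(EnQA):
    """Scores a list of questions for diversity. The output list follows the input order, and each item contains rewritten_query and diversity_score (0: similar, 1: clearly different).

Args:
    input_key (str): Name of the field holding the question list. Defaults to 'rewrite_querys'.
    output_key (str): Name of the field the scored list is written to. Defaults to 'diversity_querys'.
    model: Optional; when None, the default Qwen model is used.
    user_prompt (str|None): Optional user prompt.
    **kwargs: Other base-class arguments.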


Examples:
    ```python
    from lazyllm.tools.data import EnQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = EnQA.DiversityScorer(input_key='rewrite_querys', output_key='diversity_querys', model=llm)
    data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!']}
    res = op(data)
    print(data)
    # {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
    ```
    """

    def __init__(self,
                 input_key='rewrite_querys',
                 output_key='diversity_querys',
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.user_prompt = user_prompt

        output_structure = '''
        输出格式要求:
        {
            "diversity_scores": [0,1]
        }
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):
        querys = data.get(self.input_key)
        if not querys:
            return None

        if data.get(self.output_key) is not None:
            return None

        base_prompt = f'''
        问题列表:
        {querys}

        规则:
        - 表达重复或相似度高:score = 0
        - 表达差异明显:score = 1
        - 输出与输入顺序一致
        '''

        if self.user_prompt is None:
            prompt = '判断下面问题列表的表达多样性。\n' + base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'问题列表:{querys};'

        res = self.model(prompt)

        scores = res.get('diversity_scores', [])

        new_list = []
        for i, q in enumerate(querys):
            score = scores[i] if i < len(scores) else 0
            new_list.append({
                'rewritten_query': q,
                'diversity_score': score
            })

        data[self.output_key] = new_list
        return data
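
The last step of `forward` pairs each query with the score at the same index and falls back to 0 when the model returned fewer scores than queries. A standalone sketch of that pairing, with `attach_scores` as a hypothetical helper name:

```python
from typing import List


def attach_scores(queries: List[str], scores: List[int]) -> List[dict]:
    """Pair each query with its score; queries without a score get 0."""
    return [
        {'rewritten_query': q, 'diversity_score': scores[i] if i < len(scores) else 0}
        for i, q in enumerate(queries)
    ]


scored = attach_scores(['q1', 'q2', 'q3'], [1, 0])
```

Defaulting to 0 treats an unscored query as a duplicate, which is the conservative choice if low-scoring queries are filtered out later.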

QueryRewriter

Bases: EnQA

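Uses an LLM to rewrite the original question into several phrasings that keep the same meaning but differ in wording. The resulting list is written to the specified field.

Parameters:

  • input_key (str, default: 'query' ) –

    Name of the input question field. Defaults to 'query'.

  • output_key (str, default: 'rewrite_querys' ) –

    Name of the field the rewritten question list is written to. Defaults to 'rewrite_querys'.

  • rewrite_num (int, default: 3 ) –

    Number of rewrites to generate. Defaults to 3.

  • model

    Optional; when None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional user prompt.

  • **kwargs

    Other base-class arguments.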

Examples:

from lazyllm.tools.data import EnQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = EnQA.QueryRewriter(input_key='query', output_key='rewrite_querys', rewrite_num=2, model=llm)
data = {'query': 'What is machine learning?'}
res = op(data)  # data gets 'rewrite_querys': [str, str, ...]
print(res)
# [{'query': 'What is machine learning?', 'rewrite_querys': ['Could you explain what machine learning is?', 'What does the term machine learning refer to?']}]
Source code in lazyllm/tools/data/operators/enQa_ops.py
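class QueryRewriter(EnQA):
    """Uses an LLM to rewrite the original question into several phrasings that keep the same meaning but differ in wording. The resulting list is written to the specified field.

Args:
    input_key (str): Name of the input question field. Defaults to 'query'.
    output_key (str): Name of the field the rewritten question list is written to. Defaults to 'rewrite_querys'.
    rewrite_num (int): Number of rewrites to generate. Defaults to 3.
    model: Optional; when None, the default Qwen model is used.
    user_prompt (str|None): Optional user prompt.
    **kwargs: Other base-class arguments.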


Examples:
    ```python
    from lazyllm.tools.data import EnQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = EnQA.QueryRewriter(input_key='query', output_key='rewrite_querys', rewrite_num=2, model=llm)
    data = {'query': 'What is machine learning?'}
    res = op(data)  # data gets 'rewrite_querys': [str, str, ...]
    print(res)
    # [{'query': 'What is machine learning?', 'rewrite_querys': ['Could you explain what machine learning is?', 'What does the term machine learning refer to?']}]
    ```
    """

    def __init__(self,
                 input_key='query',
                 output_key='rewrite_querys',
                 rewrite_num=3,
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.rewrite_num = rewrite_num
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求:
        {{
            "{self.output_key}": ["rewrite1","rewrite2"]
        }}
        '''

        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):

        query = data.get(self.input_key)
        if not query:
            return None

        if data.get(self.output_key) is not None:
            return None

        base_prompt = f'''
        原问题:
        {query}

        规则:
        - 生成 {self.rewrite_num} 个不同表达
        - 保持语义一致
        - 不要解释
        '''

        if self.user_prompt is None:
            prompt = '请重写下面的问题,使其语义一致但表达不同。\n' + base_prompt
        else:
            prompt = self.user_prompt + \
                '\n' + f'原问题:{query} \n 生成 {self.rewrite_num} 个不同表达'

        res = self.model(prompt)

        data[self.output_key] = res.get(self.output_key, [])
        return data

diversity_filter(data, input_key, min_score)

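Filters items by diversity score. If the score in the given field is below min_score, the item is dropped (the function returns []); otherwise it is kept (the function returns None, which leaves the original data unchanged). A missing score field is treated as 0. Registered as a per-item forward function.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str) –

    Name of the field holding the score.

  • min_score

    Minimum score threshold.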

Examples:

from lazyllm.tools.data import EnQA

data = {'query': 'a and b', 'rewritten_query': 'b', 'diversity_score': 0}
op = EnQA.diversity_filter(input_key='diversity_score', min_score=1)
print(op(data))  # score 0 is below min_score 1, so the item is dropped
# []
Source code in lazyllm/tools/data/operators/enQa_ops.py
@data_register('data.enQA', rewrite_func='forward')
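def diversity_filter(data, input_key, min_score):
    """Filters items by diversity score. If the score in the given field is below min_score, the item is dropped (the function returns []); otherwise it is kept (the function returns None, which leaves the original data unchanged). A missing score field is treated as 0. Registered as a per-item forward function.

Args:
    data (dict): A single data record.
    input_key (str): Name of the field holding the score.
    min_score: Minimum score threshold.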


Examples:
    ```python
    from lazyllm.tools.data import EnQA

    data = {'query': 'a and b', 'rewritten_query': 'b', 'diversity_score': 0}
    op = EnQA.diversity_filter(input_key='diversity_score', min_score=1)
    print(op(data))  # score 0 is below min_score 1, so the item is dropped
    # []
    ```
    """
    score = data.get(input_key, 0)
    if score >= min_score:
        return None
    return []
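
The return convention described for `diversity_filter` (None keeps the item, [] drops it) can be illustrated with a small standalone driver. This is only a sketch: `apply_forward` is a hypothetical helper, not the base class implementation, and the handling of dict and non-empty list results is an assumption based on how the other operators on this page describe their return values.

```python
def apply_forward(fn, rows, **kwargs):
    """Apply a per-item function and interpret its return value.

    Assumed convention: None keeps the input row unchanged, [] drops it,
    a dict replaces it, and a non-empty list expands it into several rows.
    """
    out = []
    for row in rows:
        res = fn(row, **kwargs)
        if res is None:
            out.append(row)
        elif isinstance(res, dict):
            out.append(res)
        else:
            out.extend(res)
    return out


def diversity_filter(data, input_key, min_score):
    # Same logic as the function above, copied so the sketch runs standalone.
    score = data.get(input_key, 0)
    return None if score >= min_score else []


rows = [{'q': 'a', 'diversity_score': 1}, {'q': 'b', 'diversity_score': 0}, {'q': 'c'}]
kept = apply_forward(diversity_filter, rows, input_key='diversity_score', min_score=1)
print(kept)  # [{'q': 'a', 'diversity_score': 1}]
```

The third row has no score field, so it defaults to 0 and is dropped along with the second row.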

post_processor(data, input_key)

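Expands the given field (a list of dicts) into multiple rows: each dict is merged with the original data to form one row, and the original list field is removed. Returns the rows as a list, or None when the field is missing or empty. Items in the list that are not dicts are skipped. Registered as a per-item forward function.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str) –

    Name of the list field to expand (each list item is a dict).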

Examples:

from lazyllm.tools.data import EnQA

data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
op = EnQA.post_processor(input_key='diversity_querys')
print(op(data))  
# [{'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]
Source code in lazyllm/tools/data/operators/enQa_ops.py
@data_register('data.enQA', rewrite_func='forward')
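def post_processor(data, input_key):
    """Expands the given field (a list of dicts) into multiple rows: each dict is merged with the original data to form one row, and the original list field is removed. Returns the rows as a list, or None when the field is missing or empty. Items in the list that are not dicts are skipped. Registered as a per-item forward function.

Args:
    data (dict): A single data record.
    input_key (str): Name of the list field to expand (each list item is a dict).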


Examples:
    ```python
    from lazyllm.tools.data import EnQA

    data = {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'diversity_querys': [{'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]}
    op = EnQA.post_processor(input_key='diversity_querys')
    print(op(data))  
    # [{'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天是个好天气', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': '今天天气不错', 'diversity_score': 1}, {'rewrite_querys': ['今天是个好天气', '今天天气不错', 'It is a nice day!'], 'rewritten_query': 'It is a nice day!', 'diversity_score': 1}]
    ```
    """
    items = data.get(input_key)
    if not items:
        return None

    result = []
    for obj in items:

        if not isinstance(obj, dict):
            continue

        new_row = data.copy()
        new_row.pop(input_key, None)
        for k, v in obj.items():
            new_row[k] = v

        result.append(new_row)

    return result
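
The row expansion performed by `post_processor` can be tried without the library. The sketch below mirrors the same logic in a standalone function; `explode` is a hypothetical name chosen for illustration.

```python
def explode(data, input_key):
    """Turn one record with a list-of-dicts field into one record per dict."""
    items = data.get(input_key)
    if not items:
        return None
    # Everything except the list field is shared by all output rows.
    base = {k: v for k, v in data.items() if k != input_key}
    # Non-dict entries are skipped; keys from the item override shared keys.
    return [{**base, **obj} for obj in items if isinstance(obj, dict)]


row = {'query': 'q', 'diversity_querys': [
    {'rewritten_query': 'a', 'diversity_score': 1},
    'not-a-dict',
    {'rewritten_query': 'b', 'diversity_score': 0},
]}
rows = explode(row, 'diversity_querys')
print(rows)
# [{'query': 'q', 'rewritten_query': 'a', 'diversity_score': 1},
#  {'query': 'q', 'rewritten_query': 'b', 'diversity_score': 0}]
```

The copies are shallow, so nested values such as lists are shared between the output rows; mutate them with care.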

Math question operators

lazyllm.tools.data.operators.math_ops

DifficultyEvaluator

Bases: MathQA

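Uses an LLM to judge the difficulty of a math question and outputs Easy, Medium, or Hard (primary school / middle and high school / university and above). Items that already have a difficulty value are skipped.

Parameters:

  • input_key (str, default: 'question' ) –

    Name of the field holding the question. Defaults to 'question'.

  • output_key (str, default: 'difficulty' ) –

    Name of the field that receives the difficulty. Defaults to 'difficulty'.

  • model

    Optional. When None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional custom instruction. When set, it replaces the built-in instruction and difficulty rubric, and the question is appended to it.

  • **kwargs

    Additional arguments passed to the base class.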

Examples:

from lazyllm.tools.data.operators.math_ops import DifficultyEvaluator

from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = MathQA.DifficultyEvaluator(input_key='question', output_key='difficulty', model=llm)
data = {'question': '1+1=?'}
res = op(data)  # each item gets 'difficulty': 'Easy'|'Medium'|'Hard'
print(res)
# [{'question': '1+1=?', 'difficulty': 'Easy'}]
Source code in lazyllm/tools/data/operators/math_ops.py
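class DifficultyEvaluator(MathQA):
    """Uses an LLM to judge the difficulty of a math question and outputs Easy, Medium, or Hard (primary school / middle and high school / university and above). Items that already have a difficulty value are skipped.

Args:
    input_key (str): Name of the field holding the question. Defaults to 'question'.
    output_key (str): Name of the field that receives the difficulty. Defaults to 'difficulty'.
    model: Optional. When None, the default Qwen model is used.
    user_prompt (str|None): Optional custom instruction. When set, it replaces the built-in instruction and difficulty rubric, and the question is appended to it.
    **kwargs: Additional arguments passed to the base class.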


Examples:
    ```python
    from lazyllm.tools.data.operators.math_ops import DifficultyEvaluator

    from lazyllm.tools.data import MathQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = MathQA.DifficultyEvaluator(input_key='question', output_key='difficulty', model=llm)
    data = {'question': '1+1=?'}
    res = op(data)  # each item gets 'difficulty': 'Easy'|'Medium'|'Hard'
    print(res)
    # [{'question': '1+1=?', 'difficulty': 'Easy'}]
    ```
    """
    def __init__(self,
                 input_key='question',
                 output_key='difficulty',
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求:
        {{
            "{self.output_key}": "难度"
        }}
        '''

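        # Check for None first: calling model.share() on None raises AttributeError.
        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()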

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):

        if data.get(self.output_key) is not None:
            return None

        question = data.get(self.input_key)

        base_prompt = f'''
        问题:
        {question}

        难度级别:
        - Easy : 小学
        - Medium : 初中/高中
        - Hard : 大学及以上

        '''

        if self.user_prompt is None:
            prompt = '判断下面数学问题的难度。\n' + base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'问题:{question}'

        res = self.model(prompt)

        data[self.output_key] = res.get(self.output_key)
        return data
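
The parameter list states that `model` may be None, in which case a default model is used. A None-safe way to resolve that choice is sketched below. `FakeModel` and `resolve_model` are hypothetical stand-ins for illustration only; the sketch just shows the order of the check, since an expression of the form `model.share() or default` evaluates `model.share()` first and raises AttributeError when `model` is None.

```python
class FakeModel:
    """Hypothetical stand-in for a chat module that exposes share()."""

    def share(self):
        return FakeModel()


def resolve_model(model, default_factory):
    # Test for None before touching the object, then share the caller's model
    # or build the default one lazily.
    return model.share() if model is not None else default_factory()


default = resolve_model(None, FakeModel)
shared = resolve_model(FakeModel(), FakeModel)
print(type(default).__name__, type(shared).__name__)  # FakeModel FakeModel
```

Passing a factory rather than an instance avoids constructing the default model when the caller supplies one.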

DuplicateAnswerDetector

Bases: MathQA

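Detects repetition in an answer. The item is flagged True when the answer is periodic (one unit repeated end to end), when a sentence of at least 10 characters occurs three or more times, or when the merged question + answer text contains a repeated long substring. Does not call a model.

Parameters:

  • question_key (str, default: 'question' ) –

    Name of the field holding the question. Defaults to 'question'.

  • answer_key (str, default: 'answer' ) –

    Name of the field holding the answer. Defaults to 'answer'.

  • output_key (str, default: 'duplicate' ) –

    Name of the field that receives the repetition flag. Defaults to 'duplicate'.

  • min_repeat_len (int, default: 15 ) –

    Minimum substring length used for the long-repeat check. Defaults to 15.

  • repeat_threshold (int, default: 2 ) –

    Number of occurrences of a substring that counts as a repeat. Defaults to 2.

  • periodic_min_repeat (int, default: 3 ) –

    Minimum number of times the unit must repeat for the answer to count as periodic. Defaults to 3.

  • **kwargs

    Additional arguments passed to the base class.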

Examples:

from lazyllm.tools.data import MathQA

op = MathQA.DuplicateAnswerDetector(question_key='question', answer_key='answer', output_key='duplicate')
data = {'question': 'Q', 'answer': 'A' * 50}
res = op(data)  # data['duplicate'] True
print(res)
# [{'question': 'Q', 'answer': 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 'duplicate': True}]
Source code in lazyllm/tools/data/operators/math_ops.py
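class DuplicateAnswerDetector(MathQA):
    """Detects repetition in an answer. The item is flagged True when the answer is periodic (one unit repeated end to end), when a sentence of at least 10 characters occurs three or more times, or when the merged question + answer text contains a repeated long substring. Does not call a model.

Args:
    question_key (str): Name of the field holding the question. Defaults to 'question'.
    answer_key (str): Name of the field holding the answer. Defaults to 'answer'.
    output_key (str): Name of the field that receives the repetition flag. Defaults to 'duplicate'.
    min_repeat_len (int): Minimum substring length used for the long-repeat check. Defaults to 15.
    repeat_threshold (int): Number of occurrences of a substring that counts as a repeat. Defaults to 2.
    periodic_min_repeat (int): Minimum number of times the unit must repeat for the answer to count as periodic. Defaults to 3.
    **kwargs: Additional arguments passed to the base class.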


Examples:
    ```python
    from lazyllm.tools.data import MathQA

    op = MathQA.DuplicateAnswerDetector(question_key='question', answer_key='answer', output_key='duplicate')
    data = {'question': 'Q', 'answer': 'A' * 50}
    res = op(data)  # data['duplicate'] True
    print(res)
    # [{'question': 'Q', 'answer': 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 'duplicate': True}]
    ```
    """
    def __init__(self,
                 question_key='question',
                 answer_key='answer',
                 output_key='duplicate',
                 min_repeat_len=15,
                 repeat_threshold=2,
                 periodic_min_repeat=3,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.question_key = question_key
        self.answer_key = answer_key
        self.output_key = output_key

        self.min_repeat_len = min_repeat_len
        self.repeat_threshold = repeat_threshold
        self.periodic_min_repeat = periodic_min_repeat

    def _is_periodic(self, text):
        n = len(text)
        if n < 6:
            return False
        for size in range(1, n // 2 + 1):
            if n % size != 0:
                continue

            unit = text[:size]
            if unit * (n // size) == text:
                if (n // size) >= self.periodic_min_repeat:
                    return True

        return False

    def _has_long_repeat(self, merged_text):
        seen = {}
        text_len = len(merged_text)

        for i in range(text_len - self.min_repeat_len + 1):

            substr = merged_text[i:i + self.min_repeat_len]

            if not substr.strip():
                continue

            seen[substr] = seen.get(substr, 0) + 1

            if seen[substr] >= self.repeat_threshold:
                return True

        return False

    def _sentence_repeat(self, answer):
        sentences = re.split(r'[。!?.!?\n]', answer)
        counter = {}
        for s in sentences:
            s = s.strip()
            if len(s) < 10:
                continue
            counter[s] = counter.get(s, 0) + 1
            if counter[s] >= 3:
                return True
        return False

    def forward(self, data):
        assert isinstance(data, dict)
        question = str(data.get(self.question_key, ''))
        answer = str(data.get(self.answer_key, ''))
        data[self.output_key] = False
        if not answer:
            return data

        merged = question + '\n' + answer
        if self._is_periodic(answer):
            data[self.output_key] = True
            return data

        if self._sentence_repeat(answer):
            data[self.output_key] = True
            return data

        if self._has_long_repeat(merged):
            data[self.output_key] = True
            return data

        return data
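
Two of the checks used by `DuplicateAnswerDetector` are easy to try in isolation. The sketch below re-implements the periodic check and the sliding-window long-repeat check as plain functions, with the class defaults turned into keyword arguments; the function names are illustrative, not library API.

```python
def is_periodic(text, min_repeat=3):
    """True if `text` is one unit repeated end to end at least `min_repeat` times."""
    n = len(text)
    if n < 6:
        return False
    for size in range(1, n // 2 + 1):
        if n % size == 0 and text[:size] * (n // size) == text:
            if n // size >= min_repeat:
                return True
    return False


def has_long_repeat(text, min_len=15, threshold=2):
    """True if any window of `min_len` characters occurs `threshold` times."""
    seen = {}
    for i in range(len(text) - min_len + 1):
        window = text[i:i + min_len]
        if not window.strip():
            continue  # ignore windows made only of whitespace
        seen[window] = seen.get(window, 0) + 1
        if seen[window] >= threshold:
            return True
    return False


print(is_periodic('abcabcabc'))                        # True
print(is_periodic('abcabc'))                           # False: the unit repeats only twice
print(has_long_repeat('the quick brown fox. ' * 2))    # True
print(has_long_repeat('no repeated window here'))      # False
```

Windows are allowed to overlap, so a run of 16 identical characters already contains the same 15-character window twice and triggers the long-repeat check.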

MathAnswerGenerator

Bases: MathQA

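Uses an LLM to generate the reasoning and answer for a math question, asking for the final result to be wrapped in \boxed{ANSWER}. Items that already have an answer are skipped unless the regenerate flag is set.

Parameters:

  • input_key (str, default: 'question' ) –

    Name of the field holding the question. Defaults to 'question'.

  • output_key (str, default: 'answer' ) –

    Name of the field that receives the answer. Defaults to 'answer'.

  • regenerate_key (str, default: 'regenerate' ) –

    Name of the flag field that forces regeneration. Defaults to 'regenerate'. The flag is reset to False after generation.

  • model

    Optional. When None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional custom instruction. When set, it replaces the built-in instruction and rules, and the question is appended to it.

  • **kwargs

    Additional arguments passed to the base class.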

Examples:

from lazyllm.tools.data.operators.math_ops import MathAnswerGenerator

from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = MathQA.MathAnswerGenerator(input_key='question', output_key='answer', model=llm)
data = [{'question': 'Solve 10 * 10'}]
res = op(data) 
print(res)
# [{'question': 'Solve 10 * 10', 'answer': '首先,我们需要计算 \(10 \times 10\)。这是一个简单的乘法运算,其中两个乘数都是10。\n\n步骤1:写下乘数10和另一个乘数10。\n步骤2:将两个10相乘。\n\n计算过程如下:\n\[ 10 \times 10 = 100 \]\n\n因此,最终结果是 \(\boxed{100}\)。', 'regenerate': False}]
Source code in lazyllm/tools/data/operators/math_ops.py
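class MathAnswerGenerator(MathQA):
    """Uses an LLM to generate the reasoning and answer for a math question, asking for the final result to be wrapped in \\boxed{ANSWER}. Items that already have an answer are skipped unless the regenerate flag is set.

Args:
    input_key (str): Name of the field holding the question. Defaults to 'question'.
    output_key (str): Name of the field that receives the answer. Defaults to 'answer'.
    regenerate_key (str): Name of the flag field that forces regeneration. Defaults to 'regenerate'. The flag is reset to False after generation.
    model: Optional. When None, the default Qwen model is used.
    user_prompt (str|None): Optional custom instruction. When set, it replaces the built-in instruction and rules, and the question is appended to it.
    **kwargs: Additional arguments passed to the base class.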


Examples:
    ```python
    from lazyllm.tools.data.operators.math_ops import MathAnswerGenerator

    from lazyllm.tools.data import MathQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = MathQA.MathAnswerGenerator(input_key='question', output_key='answer', model=llm)
    data = [{'question': 'Solve 10 * 10'}]
    res = op(data) 
    print(res)
    # [{'question': 'Solve 10 * 10', 'answer': '首先,我们需要计算 \\(10 \\times 10\\)。这是一个简单的乘法运算,其中两个乘数都是10。\\n\\n步骤1:写下乘数10和另一个乘数10。\\n步骤2:将两个10相乘。\\n\\n计算过程如下:\\n\\[ 10 \\times 10 = 100 \\]\\n\\n因此,最终结果是 \\(\\boxed{100}\\)。', 'regenerate': False}]
    ```
    """
    def __init__(self,
                 input_key='question',
                 output_key='answer',
                 regenerate_key='regenerate',
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.regenerate_key = regenerate_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求:
        {{
            "{self.output_key}": "推理结果boxed"
        }}
        '''

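        # Check for None first: calling model.share() on None raises AttributeError.
        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()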

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):

        answer = data.get(self.output_key)
        regenerate = data.get(self.regenerate_key, False)

        if answer is not None and regenerate is False:
            return None

        question = data.get(self.input_key)

        base_prompt = f'''
        问题:
        {question}

        规则:
        - 输出详细的过程
        - 最终结果使用 \\boxed{{ANSWER}} 包裹
        '''

        if self.user_prompt is None:
            prompt = '请为这个数学问题生成推理结果。\n' + base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'问题:{question}'

        res = self.model(prompt)

        data[self.output_key] = res.get(self.output_key)
        data[self.regenerate_key] = False

        return data
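
The skip rule in `forward` above depends on two fields: an item is processed when it has no answer yet, or when its regenerate flag is anything other than the literal False. The standalone sketch below isolates that condition; `should_generate` is a hypothetical helper name used only here.

```python
def should_generate(data, output_key='answer', regenerate_key='regenerate'):
    """Mirror of the skip condition: generate unless an answer exists and
    the regenerate flag is exactly False."""
    answer = data.get(output_key)
    regenerate = data.get(regenerate_key, False)
    return not (answer is not None and regenerate is False)


print(should_generate({'question': 'q'}))                               # True
print(should_generate({'question': 'q', 'answer': 'a'}))                # False
print(should_generate({'answer': 'a', 'regenerate': True}))             # True
print(should_generate({'answer': 'a', 'regenerate': 0}))                # True
```

The last case is worth noting: the check uses identity (`is False`), so a falsy value such as 0 or None in the flag field still triggers regeneration. Store a real boolean in that field to get the expected behavior.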

QualityEvaluator

Bases: MathQA

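Uses an LLM to score the quality of a question-answer pair: 0 means the answer should be regenerated, 1 means it is acceptable. Items that already have a value in output_key are skipped.

Parameters:

  • question_key (str, default: 'question' ) –

    Name of the field holding the question. Defaults to 'question'.

  • answer_key (str, default: 'answer' ) –

    Name of the field holding the answer. Defaults to 'answer'.

  • output_key (str, default: 'score' ) –

    Name of the field that receives the score. Defaults to 'score'.

  • model

    Optional. When None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional custom instruction. When set, it replaces the built-in instruction and rules, and the question and answer are appended to it.

  • **kwargs

    Additional arguments passed to the base class.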

Examples:

from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = MathQA.QualityEvaluator(question_key='question', answer_key='answer', output_key='score', model=llm)
data = {'question': '今天天气如何', 'answer': '大家好~'}
res = op(data)  # low-quality pairs are scored 0
print(res)
# [{'question': '今天天气如何', 'answer': '大家好~', 'score': 0}]
Source code in lazyllm/tools/data/operators/math_ops.py
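class QualityEvaluator(MathQA):
    """Uses an LLM to score the quality of a question-answer pair: 0 means the answer should be regenerated, 1 means it is acceptable. Items that already have a value in output_key are skipped.

Args:
    question_key (str): Name of the field holding the question. Defaults to 'question'.
    answer_key (str): Name of the field holding the answer. Defaults to 'answer'.
    output_key (str): Name of the field that receives the score. Defaults to 'score'.
    model: Optional. When None, the default Qwen model is used.
    user_prompt (str|None): Optional custom instruction. When set, it replaces the built-in instruction and rules, and the question and answer are appended to it.
    **kwargs: Additional arguments passed to the base class.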


Examples:
    ```python
    from lazyllm.tools.data import MathQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = MathQA.QualityEvaluator(question_key='question', answer_key='answer', output_key='score', model=llm)
    data = {'question': '今天天气如何', 'answer': '大家好~'}
    res = op(data)  # low-quality pairs are scored 0
    print(res)
    # [{'question': '今天天气如何', 'answer': '大家好~', 'score': 0}]
    ```
    """
    def __init__(self,
                 question_key='question',
                 answer_key='answer',
                 output_key='score',
                 model=None,
                 user_prompt=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.question_key = question_key
        self.answer_key = answer_key
        self.output_key = output_key
        self.user_prompt = user_prompt

        output_structure = f'''
        输出格式要求:
        {{
            "{self.output_key}": 0
        }}
        '''

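        # Check for None first: calling model.share() on None raises AttributeError.
        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()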

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):

        if data.get(self.output_key) is not None:
            return None

        question = data.get(self.question_key)
        answer = data.get(self.answer_key)

        base_prompt = f'''
        问题:
        {question}

        答案:
        {answer}

        规则:
        - 输出 0 表示需要重新生成
        - 输出 1 表示质量合格
        '''

        if self.user_prompt is None:
            prompt = '请检查问题和答案的质量。\n' + base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'问题:{question}; 答案: {answer}'

        res = self.model(prompt)

        data[self.output_key] = res.get(self.output_key)
        return data

QuestionFusionGenerator

Bases: MathQA

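Uses an LLM to fuse several questions into one new question and to generate its reasoning and \boxed{} answer. Requires at least two questions under list_key; with fewer, a warning is logged and the item is returned unchanged.

Parameters:

  • input_key (str, default: 'question' ) –

    Name of the field that receives the fused question. Defaults to 'question'.

  • output_key (str, default: 'answer' ) –

    Name of the field that receives the reasoning and answer. Defaults to 'answer'.

  • list_key (str, default: 'question_list' ) –

    Name of the field holding the list of questions. Defaults to 'question_list'.

  • model

    Optional. When None, the default Qwen model is used.

  • user_prompt (str | None, default: None ) –

    Optional custom instruction. When set, it replaces the built-in rules, and the fusion request with the question list is appended to it.

  • **kwargs

    Additional arguments passed to the base class.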

Examples:

from lazyllm.tools.data import MathQA
from lazyllm import OnlineChatModule

llm = OnlineChatModule()
op = MathQA.QuestionFusionGenerator(input_key='new_question', list_key='question_list', output_key='new_answer', model=llm)
data = {'question_list': [
    {'question': '1加1等于几?', 'answer': '1+1 = 2'}, 
    {'question': '2的平方等于几?', 'answer': '2*2 = 4'}]}
res = op(data) 
print(res)
# [{'question_list': [{'question': '1加1等于几?', 'answer': '1+1 = 2'}, {'question': '2的平方等于几?', 'answer': '2*2 = 4'}], 
# 'new_question': '如果1加1的结果与2的平方相比较,哪个更大?', 
# 'new_answer': '首先,我们解决第一个问题:1加1等于几?计算得到 1+1 = 2。然后,解决第二个问题:2的平方等于几?计算得到 2*2 = 4。最后,我们比较这两个结果,2和4。显然,4大于2。所以,2的平方更大。'}]
Source code in lazyllm/tools/data/operators/math_ops.py
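class QuestionFusionGenerator(MathQA):
    """Uses an LLM to fuse several questions into one new question and to generate its reasoning and \\boxed{} answer. Requires at least two questions under list_key; with fewer, a warning is logged and the item is returned unchanged.

Args:
    input_key (str): Name of the field that receives the fused question. Defaults to 'question'.
    output_key (str): Name of the field that receives the reasoning and answer. Defaults to 'answer'.
    list_key (str): Name of the field holding the list of questions. Defaults to 'question_list'.
    model: Optional. When None, the default Qwen model is used.
    user_prompt (str|None): Optional custom instruction. When set, it replaces the built-in rules, and the fusion request with the question list is appended to it.
    **kwargs: Additional arguments passed to the base class.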


Examples:
    ```python
    from lazyllm.tools.data import MathQA
    from lazyllm import OnlineChatModule

    llm = OnlineChatModule()
    op = MathQA.QuestionFusionGenerator(input_key='new_question', list_key='question_list', output_key='new_answer', model=llm)
    data = {'question_list': [
        {'question': '1加1等于几?', 'answer': '1+1 = 2'}, 
        {'question': '2的平方等于几?', 'answer': '2*2 = 4'}]}
    res = op(data) 
    print(res)
    # [{'question_list': [{'question': '1加1等于几?', 'answer': '1+1 = 2'}, {'question': '2的平方等于几?', 'answer': '2*2 = 4'}], 
    # 'new_question': '如果1加1的结果与2的平方相比较,哪个更大?', 
    # 'new_answer': '首先,我们解决第一个问题:1加1等于几?计算得到 1+1 = 2。然后,解决第二个问题:2的平方等于几?计算得到 2*2 = 4。最后,我们比较这两个结果,2和4。显然,4大于2。所以,2的平方更大。'}]
    ```
    """
    def __init__(self,
                 input_key='question',
                 output_key='answer',
                 model=None,
                 user_prompt=None,
                 list_key='question_list',
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.output_key = output_key
        self.user_prompt = user_prompt
        self.list_key = list_key

        output_structure = f'''
        输出格式要求:
        {{
            "{self.input_key}": "融合后的问题",
            "{self.output_key}": "推理结果"
        }}
        '''

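        # Check for None first: calling model.share() on None raises AttributeError.
        if model is None:
            self.model = TrainableModule(DEFAULT_MODEL)
        else:
            self.model = model.share()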

        self.model.prompt(output_structure)\
            .formatter(JsonFormatter())\
            .start()

    def forward(self, data):
        questions = data.get(self.list_key, [])
        if len(questions) <= 1:
            LOG.warning(f'QuestionFusionGenerator requires more than one question, but got {len(questions)}. Skipping.')
            return data
        base_prompt = f'''
        问题列表:
        {questions}

        规则:
        - 融合列表中的问题,生成一个更复杂的新问题
        - 输出详细的过程
        '''

        if self.user_prompt is None:
            prompt = base_prompt
        else:
            prompt = self.user_prompt + '\n' + f'融合列表中的问题,生成一个更复杂的新问题:{questions}'

        res = self.model(prompt)
        data[self.input_key] = res.get(self.input_key)
        data[self.output_key] = res.get(self.output_key)

        return data

ReasoningAnswerTokenLengthFilter

Bases: MathQA

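Filters answers by token or character length. When the answer exceeds max_answer_token_length, the field is cleared and the modified data is returned; when it is within the limit, None is returned and the item is kept as is; when the field is empty, [] is returned and the item is dropped. Length can be measured with a tokenizer or by character count.

Parameters:

  • input_key (str, default: 'answer' ) –

    Name of the field holding the answer. Defaults to 'answer'.

  • max_answer_token_length (int, default: 300 ) –

    Maximum allowed length. Defaults to 300.

  • tokenize (bool, default: True ) –

    Whether to count tokens. When True and no tokenizer is given, the default Qwen tokenizer is loaded; if loading fails, the operator falls back to character count.

  • tokenizer

    Optional tokenizer; it is called as tokenizer.encode(text, add_special_tokens=False).

  • **kwargs

    Additional arguments passed to the base class.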

Examples:

from lazyllm.tools.data import MathQA

op = MathQA.ReasoningAnswerTokenLengthFilter(input_key='answer', max_answer_token_length=100, tokenize=False)
data = [{'answer': 'short'}]
print(op(data))  # less than the max_length, keep the original input
# [{'answer': 'short'}]
Source code in lazyllm/tools/data/operators/math_ops.py
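class ReasoningAnswerTokenLengthFilter(MathQA):
    """Filters answers by token or character length. When the answer exceeds max_answer_token_length, the field is cleared and the modified data is returned; when it is within the limit, None is returned and the item is kept as is; when the field is empty, [] is returned and the item is dropped. Length can be measured with a tokenizer or by character count.

Args:
    input_key (str): Name of the field holding the answer. Defaults to 'answer'.
    max_answer_token_length (int): Maximum allowed length. Defaults to 300.
    tokenize (bool): Whether to count tokens. When True and no tokenizer is given, the default Qwen tokenizer is loaded; if loading fails, the operator falls back to character count.
    tokenizer: Optional tokenizer; it is called as tokenizer.encode(text, add_special_tokens=False).
    **kwargs: Additional arguments passed to the base class.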


Examples:
    ```python
    from lazyllm.tools.data import MathQA

    op = MathQA.ReasoningAnswerTokenLengthFilter(input_key='answer', max_answer_token_length=100, tokenize=False)
    data = [{'answer': 'short'}]
    print(op(data))  # less than the max_length, keep the original input
    # [{'answer': 'short'}]
    ```
    """
    def __init__(self,
                 input_key='answer',
                 max_answer_token_length=300,
                 tokenize=True,
                 tokenizer=None,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)

        self.input_key = input_key
        self.max_answer_token_length = max_answer_token_length
        self.tokenizer = tokenizer

        if tokenize and tokenizer is None:
            LOG.warning(
                f'tokenize=True but tokenizer is None, '
                f'loading tokenizer from default model: {DEFAULT_TOKENIZER}'
            )
            try:
                self.tokenizer = transformers.AutoTokenizer.from_pretrained(
                    DEFAULT_TOKENIZER,
                    trust_remote_code=True
                )
                self.tokenize = True
            except Exception as e:
                LOG.warning(
                    f'failed to load tokenizer from {DEFAULT_TOKENIZER}, '
                    f'falling back to char count, error: {e}'
                )
                self.tokenize = False
                self.tokenizer = None
        else:
            self.tokenizer = tokenizer
            self.tokenize = tokenize

        self.empty_count = 0

    def _get_len(self, text: str):
        if text is None or (isinstance(text, str) and text.strip() == ''):
            self.empty_count += 1
            return self.max_answer_token_length + 1

        try:
            if self.tokenize:
                return len(
                    self.tokenizer.encode(
                        text,
                        add_special_tokens=False
                    )
                )
            return len(text)

        except Exception as e:
            LOG.warning(f'token encode failed: {e}')
            self.empty_count += 1
            return self.max_answer_token_length + 1

    def forward(self, data: dict):
        text = data.get(self.input_key, '')
        if not text:
            self.empty_count += 1
            return []

        token_len = self._get_len(text)

        if token_len <= self.max_answer_token_length:
            return None

        # clear the answer that exceeds the length limit
        data[self.input_key] = ''
        return data
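
The three outcomes of `ReasoningAnswerTokenLengthFilter` (drop, keep, clear) can be reproduced with a few lines of plain Python. The sketch below is a standalone approximation: `length_filter` and its `count` parameter are hypothetical, and the whitespace splitter in the last call is only a stand-in for a real tokenizer.

```python
def length_filter(data, input_key='answer', max_len=300, count=len):
    """Return [] to drop, None to keep unchanged, or the modified row."""
    text = data.get(input_key, '')
    if not text:
        return []            # no content: drop the row
    if count(text) <= max_len:
        return None          # within the limit: keep the row unchanged
    data[input_key] = ''     # over the limit: clear the field, keep the row
    return data


print(length_filter({'answer': 'short'}, max_len=10))      # None
print(length_filter({'answer': ''}, max_len=10))           # []
print(length_filter({'answer': 'x' * 11}, max_len=10))     # {'answer': ''}

# Counting "tokens" with a whitespace split instead of characters.
words = lambda t: len(t.split())
print(length_filter({'answer': 'a b c'}, max_len=2, count=words))  # {'answer': ''}
```

Note that an over-length answer is not removed from the dataset: the row survives with an empty answer field, so a later step has to decide whether to regenerate or discard it.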

DifficultyEvaluatorBatch(data, input_key='difficulty')

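Batch operator: counts the distribution of the given field (difficulty) over the input list and returns a single-element list holding the count per difficulty, [{difficulty: count}]. Registered as forward_batch_input.

Parameters:

  • data (list[dict]) –

    List of input records.

  • input_key (str, default: 'difficulty' ) –

    Name of the field holding the difficulty. Defaults to 'difficulty'.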

Examples:

from lazyllm.tools.data import MathQA

op = MathQA.DifficultyEvaluatorBatch(input_key='difficulty')
data = [{'difficulty': 'Easy'}, {'difficulty': 'Hard'}, {'difficulty': 'Easy'}]
print(op(data))  
# [{'Easy': 2, 'Hard': 1}]
Source code in lazyllm/tools/data/operators/math_ops.py
@data_register(
    'data.mathQA',
    rewrite_func='forward_batch_input'
)
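def DifficultyEvaluatorBatch(data, input_key='difficulty'):
    """Batch operator: counts the distribution of the given field (difficulty) over the input list and returns a single-element list holding the count per difficulty, [{difficulty: count}]. Registered as forward_batch_input.

Args:
    data (list[dict]): List of input records.
    input_key (str): Name of the field holding the difficulty. Defaults to 'difficulty'.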


Examples:
    ```python
    from lazyllm.tools.data import MathQA

    op = MathQA.DifficultyEvaluatorBatch(input_key='difficulty')
    data = [{'difficulty': 'Easy'}, {'difficulty': 'Hard'}, {'difficulty': 'Easy'}]
    print(op(data))  
    # [{'Easy': 2, 'Hard': 1}]
    ```
    """
    result = {}
    for entry in data:
        key = entry.get(input_key)
        if key in result:
            result[key] += 1
        else:
            result[key] = 1
    return [result]
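
The counting loop above is equivalent to a `collections.Counter` over the field values. A minimal standalone sketch follows; `difficulty_distribution` is a hypothetical name, and the extra empty record is included to show what happens when the field is missing.

```python
from collections import Counter


def difficulty_distribution(rows, input_key='difficulty'):
    """Count how often each value of `input_key` occurs; missing values count as None."""
    return [dict(Counter(row.get(input_key) for row in rows))]


rows = [{'difficulty': 'Easy'}, {'difficulty': 'Hard'}, {'difficulty': 'Easy'}, {}]
print(difficulty_distribution(rows))  # [{'Easy': 2, 'Hard': 1, None: 1}]
```

Records without the field are counted under the key None, which is also how the loop above behaves, so it can be worth filtering them out first if the result is serialized to JSON.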

math_answer_extractor(data, input_key='answer', output_key='math_answer')

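Extracts the math answer inside \boxed{} from a text and writes it to the specified output field. Registered as a per-item forward function.

Parameters:

  • data (dict) –

    A single data record.

  • input_key (str, default: 'answer' ) –

    Name of the field holding the answer text. Defaults to 'answer'.

  • output_key (str, default: 'math_answer' ) –

    Name of the field that receives the extracted result. Defaults to 'math_answer'.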

Examples:

from lazyllm.tools.data import MathQA

data = {'answer': 'So the answer is \\boxed{{42}}.'}
op = MathQA.math_answer_extractor(input_key='answer', output_key='math_answer')
print(op(data))  # the doubled braces are literal here, so data['math_answer'] == '{42}'
# [{'answer': 'So the answer is \\boxed{{42}}.', 'math_answer': '{42}'}]
Source code in lazyllm/tools/data/operators/math_ops.py
@data_register('data.mathQA', rewrite_func='forward')
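def math_answer_extractor(data, input_key='answer', output_key='math_answer'):
    """Extracts the math answer inside \\boxed{} from a text and writes it to the specified output field. Registered as a per-item forward function.

Args:
    data (dict): A single data record.
    input_key (str): Name of the field holding the answer text. Defaults to 'answer'.
    output_key (str): Name of the field that receives the extracted result. Defaults to 'math_answer'.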


Examples:
    ```python
    from lazyllm.tools.data import MathQA

    data = {'answer': 'So the answer is \\\\boxed{{42}}.'}
    op = MathQA.math_answer_extractor(input_key='answer', output_key='math_answer')
    print(op(data))  # the doubled braces are literal here, so data['math_answer'] == '{42}'
    # [{'answer': 'So the answer is \\\\boxed{{42}}.', 'math_answer': '{42}'}]
    ```
    """
    assert isinstance(data, dict)
    answer = data[input_key]
    math_answer = boxed_extractor(answer)
    data[output_key] = math_answer
    return data
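
The source of `boxed_extractor` is not shown on this page. As a rough illustration of what extracting a \boxed{} answer involves, the sketch below uses brace matching so that nested braces such as \frac{1}{2} survive. `extract_boxed` is a hypothetical stand-in written for this example; choosing the last \boxed occurrence is an assumption, and the real helper may differ.

```python
def extract_boxed(text):
    """Return the content of the last \\boxed{...} in `text`, or None."""
    marker = '\\boxed{'
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1          # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return ''.join(chars)
        chars.append(ch)
        i += 1
    return None        # braces never balanced


print(extract_boxed('So the answer is \\boxed{42}.'))             # 42
print(extract_boxed('Thus \\boxed{\\frac{1}{2}} is the result'))  # \frac{1}{2}
print(extract_boxed('So the answer is \\boxed{{42}}.'))           # {42}
print(extract_boxed('no box here'))                               # None
```

In Python source the backslash must be escaped (or a raw string used): a plain `'\boxed'` starts with a backspace character, and the marker would never match.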

PDF processing operators

lazyllm.tools.data.operators.pdf_ops

Pdf2Md

Bases: Pdf2Qa

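Converts a PDF into a list of Markdown documents. The conversion is delegated to a backend service through MineruPDFReader (reader_url must be configured) and supports caching. If reading fails, a warning is logged and the output field is set to None.

Parameters:

  • input_key (str, default: 'pdf_path' ) –

    Name of the field holding the PDF path. Defaults to 'pdf_path'.

  • output_key (str, default: 'docs' ) –

    Name of the field that receives the converted document list. Defaults to 'docs'.

  • reader_url

    Required. URL of the Mineru reader service; a ValueError is raised when it is missing.

  • backend (str, default: 'vlm-vllm-async-engine' ) –

    Backend type. Defaults to 'vlm-vllm-async-engine'.

  • upload_mode (bool, default: True ) –

    Whether to use upload mode. Defaults to True.

  • use_cache (bool, default: False ) –

    Whether to use the cache. Defaults to False.

  • **kwargs

    Additional arguments passed to the base class.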

Examples:

from lazyllm.tools.data import Pdf2Qa
from lazyllm.tools.data.operators.pdf_ops import Pdf2Md

op = Pdf2Qa.Pdf2Md(input_key='pdf_path', output_key='docs', reader_url='http://...')
data = [{'pdf_path': '/path/to/file.pdf'}]
res = op(data)  # each item gets 'docs' (list of doc content)
Source code in lazyllm/tools/data/operators/pdf_ops.py
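class Pdf2Md(Pdf2Qa):
    """Converts a PDF into a list of Markdown documents. The conversion is delegated to a backend service through MineruPDFReader (reader_url must be configured) and supports caching. If reading fails, a warning is logged and the output field is set to None.

Args:
    input_key (str): Name of the field holding the PDF path. Defaults to 'pdf_path'.
    output_key (str): Name of the field that receives the converted document list. Defaults to 'docs'.
    reader_url: Required. URL of the Mineru reader service; a ValueError is raised when it is missing.
    backend (str): Backend type. Defaults to 'vlm-vllm-async-engine'.
    upload_mode (bool): Whether to use upload mode. Defaults to True.
    use_cache (bool): Whether to use the cache. Defaults to False.
    **kwargs: Additional arguments passed to the base class.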


Examples:
    ```python
    from lazyllm.tools.data import Pdf2Qa
    from lazyllm.tools.data.operators.pdf_ops import Pdf2Md

    op = Pdf2Qa.Pdf2Md(input_key='pdf_path', output_key='docs', reader_url='http://...')
    data = [{'pdf_path': '/path/to/file.pdf'}]
    res = op(data)  # each item gets 'docs' (list of doc content)
    ```"""
    def __init__(self,
                 input_key='pdf_path',
                 output_key='docs',
                 reader_url=None,
                 backend='vlm-vllm-async-engine',
                 upload_mode=True,
                 use_cache=False,
                 **kwargs):

        super().__init__(_concurrency_mode='thread', **kwargs)
        if not reader_url:
            raise ValueError('You must pass in a reader_url.')

        self.input_key = input_key
        self.output_key = output_key
        self.use_cache = use_cache

        self.reader = MineruPDFReader(
            url=reader_url,
            backend=backend,
            upload_mode=upload_mode
        )

    def forward(self, data):
        pdf_path = data.get(self.input_key)
        if not pdf_path:
            return None

        try:
            docs = self.reader(
                file=pdf_path,
                use_cache=self.use_cache
            )
            data[self.output_key] = docs

        except Exception as e:
            LOG.warning(f'PDF read failed: {e}')
            data[self.output_key] = None
        return data

Reranker Data Synthesis

lazyllm.tools.data.operators.reranker_synthesis

RerankerAdjustNegatives

Bases: reranker

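Operator that adjusts the number of negative samples for reranker data.

This operator adjusts the negatives to match a target count. Surplus negatives are truncated; if there are too few, the list is padded by randomly resampling the existing negatives. A deterministic random seed derived from the query content keeps the result reproducible.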
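The truncate-or-pad rule can be sketched in plain Python without LazyLLM. This is a minimal illustration, not the library's code: `adjust_negatives` is a hypothetical helper, and it copies the list before padding so that the caller's data is left untouched.

```python
import random


def adjust_negatives(neg, query, target=7, seed=42):
    """Return a copy of `neg` with exactly `target` items (hypothetical helper)."""
    neg = list(neg)  # work on a copy so the input list is not mutated
    if len(neg) > target:
        return neg[:target]  # too many: keep the first `target` items
    if neg and len(neg) < target:
        # Seed a private generator from the seed and the query text so the
        # same query always receives the same padding.
        rng = random.Random(f'{seed}_{query}')
        while len(neg) < target:
            neg.append(rng.choice(neg))
    return neg


truncated = adjust_negatives(['n1', 'n2', 'n3', 'n4', 'n5', 'n6'], 'ML', target=5)
padded = adjust_negatives(['n1', 'n2'], 'ML', target=5)
print(truncated)    # the first five negatives
print(len(padded))  # 5
```

Because padding draws at random, the added items are repeats of existing negatives in no fixed order. Repeated calls with the same seed and query return the same list, and an empty negative list is left empty.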

Parameters:

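  • adjust_neg_count (int, default: 7 ) –

    Target number of negative samples. Defaults to 7.

  • seed (int, default: 42 ) –

    Random seed used for the random choice when padding. Defaults to 42.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The adjusted data, with the _neg field updated.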

Examples:

from lazyllm.tools.data import reranker

adjuster = reranker.RerankerAdjustNegatives(adjust_neg_count=5, seed=123)

# Too many negatives
data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8']}
result = adjuster(data)
# Returns: {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5']}

# Too few negatives
data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2']}
result = adjuster(data)
# Returns 5 negatives: 'n1' and 'n2' plus three more drawn at random from the existing negatives
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_from_embedding_converter.py
class RerankerAdjustNegatives(reranker):
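    """Operator that adjusts the number of negative samples for reranker data.

This operator adjusts the negatives to match a target count. Surplus negatives are truncated; if there are
too few, the list is padded by randomly resampling the existing negatives.
A deterministic random seed derived from the query content keeps the result reproducible.

Args:
    adjust_neg_count (int): Target number of negative samples. Defaults to 7.
    seed (int): Random seed used for the random choice when padding. Defaults to 42.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The adjusted data, with the _neg field updated.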


Examples:
    ```python
    from lazyllm.tools.data import reranker

    adjuster = reranker.RerankerAdjustNegatives(adjust_neg_count=5, seed=123)

    # Too many negatives
    data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8']}
    result = adjuster(data)
    # Returns: {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2', 'n3', 'n4', 'n5']}

    # Too few negatives
    data = {'_is_valid': True, '_query': 'ML', '_neg': ['n1', 'n2']}
    result = adjuster(data)
    # Returns 5 negatives: 'n1' and 'n2' plus three more drawn at random from the existing negatives
    ```
    """
    def __init__(self, adjust_neg_count: int = 7, seed: int = 42, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.adjust_neg_count = adjust_neg_count
        self.seed = seed

    def forward(self, data: dict, **kwargs) -> dict:
        if not data.get('_is_valid'):
            return data

        neg = data.get('_neg', [])

        if len(neg) > self.adjust_neg_count:
            # Truncate to target count
            neg = neg[:self.adjust_neg_count]
        elif len(neg) < self.adjust_neg_count and neg:
            # Pad with duplicates if needed (when we have some negatives)
            local_random = random.Random(f'{self.seed}_{data["_query"]}')
            while len(neg) < self.adjust_neg_count:
                neg.append(local_random.choice(neg))

        return {**data, '_neg': neg}

RerankerBuildFormat

Bases: reranker

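Operator that builds the reranker format.

This operator converts validated data into the standard reranker training format. The output is a dict with query, pos and neg fields and contains no prompt or instruction fields.

Parameters:

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    Data in reranker format, with query, pos and neg fields. Returns an empty dict if the data is invalid.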

Examples:

from lazyllm.tools.data import reranker

builder = reranker.RerankerBuildFormat()

data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = builder(data)
# Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking']}
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_from_embedding_converter.py
class RerankerBuildFormat(reranker):
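    """Operator that builds the reranker format.

This operator converts validated data into the standard reranker training format. The output is a dict
with query, pos and neg fields and contains no prompt or instruction fields.

Args:
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: Data in reranker format, with query, pos and neg fields. Returns an empty dict if the data is invalid.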


Examples:
    ```python
    from lazyllm.tools.data import reranker

    builder = reranker.RerankerBuildFormat()

    data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
    result = builder(data)
    # Returns: {'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking']}
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(self, data: dict, **kwargs) -> dict:
        if not data.get('_is_valid'):
            return {}

        # Build reranker format (no prompt/instruction)
        reranker_item = {
            'query': data['_query'],
            'pos': data['_pos'],
            'neg': data['_neg'],
        }

        return reranker_item

RerankerFormatCrossEncoder

Bases: reranker

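CrossEncoder format conversion operator.

This operator converts validated data into the CrossEncoder training format. Each query-document pair becomes an independent sample; positives are labeled 1 and negatives are labeled 0.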
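A standalone sketch of this expansion, using a hypothetical `to_cross_encoder` helper rather than the operator itself. One input record with P positives and N negatives yields P + N samples.

```python
def to_cross_encoder(query, pos, neg):
    """Expand one query into pointwise (query, document, label) samples."""
    samples = [{'query': query, 'document': p, 'label': 1} for p in pos]
    samples += [{'query': query, 'document': n, 'label': 0} for n in neg]
    return samples


rows = to_cross_encoder('machine learning', ['ML tutorial'], ['cooking', 'history'])
for row in rows:
    print(row)
```

Positives come first in the output, followed by negatives, so a training pipeline that needs mixed batches would have to shuffle the samples itself.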

Parameters:

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The converted data list; each item contains query, document and label fields.

Examples:

from lazyllm.tools.data import reranker

formatter = reranker.RerankerFormatCrossEncoder()

data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'document': 'ML tutorial', 'label': 1}, {'query': 'machine learning', 'document': 'cooking', 'label': 0}]
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py
class RerankerFormatCrossEncoder(reranker):
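    """CrossEncoder format conversion operator.

This operator converts validated data into the CrossEncoder training format. Each query-document pair
becomes an independent sample; positives are labeled 1 and negatives are labeled 0.

Args:
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The converted data list; each item contains query, document and label fields.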


Examples:
    ```python
    from lazyllm.tools.data import reranker

    formatter = reranker.RerankerFormatCrossEncoder()

    data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
    result = formatter(data)
    # Returns: [{'query': 'machine learning', 'document': 'ML tutorial', 'label': 1}, {'query': 'machine learning', 'document': 'cooking', 'label': 0}]
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(self, data: dict, **kwargs) -> List[dict]:
        if not data.get('_is_valid'):
            return []

        query = data['_query']
        pos = data['_pos']
        neg = data['_neg']

        results = []

        # Positive samples with label 1
        for p in pos:
            results.append({'query': query, 'document': p, 'label': 1})

        # Negative samples with label 0
        for n in neg:
            results.append({'query': query, 'document': n, 'label': 0})

        return results

RerankerFormatFlagReranker

Bases: reranker

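FlagReranker format conversion operator.

This operator converts validated data into the FlagReranker training format. It makes the number of negatives match the training group size: too few negatives are padded by duplication, and too many are truncated.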
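The sizing rule can be sketched as follows. `fit_negatives` is a hypothetical helper written for this illustration; it mirrors the idea that a group of `train_group_size` holds one positive and `train_group_size - 1` negatives.

```python
def fit_negatives(neg, train_group_size=8):
    """Return exactly train_group_size - 1 negatives (hypothetical helper)."""
    needed = train_group_size - 1  # one slot in the group is the positive
    if not neg:
        return []  # nothing to duplicate
    if len(neg) < needed:
        # Repeat the list enough times to cover `needed`, then trim.
        repeats = needed // len(neg) + 1
        return (neg * repeats)[:needed]
    return neg[:needed]


print(fit_negatives(['cooking', 'history'], train_group_size=8))
```

Unlike the random padding in RerankerAdjustNegatives, duplication here cycles through the negatives in their original order, so no seed is involved.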

Parameters:

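  • train_group_size (int, default: 8 ) –

    Training group size (including 1 positive sample). Defaults to 8.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The converted data list; each item contains query, pos and neg fields.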

Examples:

from lazyllm.tools.data import reranker

formatter = reranker.RerankerFormatFlagReranker(train_group_size=8)

data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking', 'history']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking', 'history', ...]}]
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py
class RerankerFormatFlagReranker(reranker):
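    """FlagReranker format conversion operator.

This operator converts validated data into the FlagReranker training format. It makes the number of negatives
match the training group size: too few negatives are padded by duplication, and too many are truncated.

Args:
    train_group_size (int): Training group size (including 1 positive sample). Defaults to 8.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The converted data list; each item contains query, pos and neg fields.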


Examples:
    ```python
    from lazyllm.tools.data import reranker

    formatter = reranker.RerankerFormatFlagReranker(train_group_size=8)

    data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking', 'history']}
    result = formatter(data)
    # Returns: [{'query': 'machine learning', 'pos': ['ML tutorial'], 'neg': ['cooking', 'history', ...]}]
    ```
    """
    def __init__(self, train_group_size: int = 8, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.train_group_size = train_group_size

    def forward(self, data: dict, **kwargs) -> List[dict]:
        if not data.get('_is_valid'):
            return []

        query = data['_query']
        pos = data['_pos']
        neg = data['_neg']

        # Ensure neg has exactly train_group_size - 1 samples
        num_neg_needed = self.train_group_size - 1
        if len(neg) < num_neg_needed:
            # Pad with duplicates if needed
            neg = (neg * (num_neg_needed // len(neg) + 1))[:num_neg_needed] if neg else []
        else:
            neg = neg[:num_neg_needed]

        return [{
            'query': query,
            'pos': pos,
            'neg': neg,
        }]

RerankerFormatPairwise

Bases: reranker

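Pairwise format conversion operator.

This operator converts validated data into the pairwise training format. It pairs every positive with every negative, so that a ranking model can be trained to tell relevant documents from irrelevant ones.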
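The pairing is a Cartesian product of positives and negatives. A minimal sketch, with `to_pairwise` as a hypothetical helper and made-up documents:

```python
from itertools import product


def to_pairwise(query, pos, neg):
    """Pair every positive with every negative for one query."""
    return [{'query': query, 'doc_pos': p, 'doc_neg': n} for p, n in product(pos, neg)]


pairs = to_pairwise('machine learning', ['ML tutorial', 'ML course'], ['cooking', 'history', 'travel'])
print(len(pairs))  # 2 positives x 3 negatives = 6
```

The output grows multiplicatively, so records with many positives and many negatives can produce large training sets. A record with no negatives produces no pairs.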

Parameters:

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The converted data list; each item contains query, doc_pos and doc_neg fields.

Examples:

from lazyllm.tools.data import reranker

formatter = reranker.RerankerFormatPairwise()

data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
result = formatter(data)
# Returns: [{'query': 'machine learning', 'doc_pos': 'ML tutorial', 'doc_neg': 'cooking'}]
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py
class RerankerFormatPairwise(reranker):
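    """Pairwise format conversion operator.

This operator converts validated data into the pairwise training format. It pairs every positive with every
negative, so that a ranking model can be trained to tell relevant documents from irrelevant ones.

Args:
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The converted data list; each item contains query, doc_pos and doc_neg fields.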


Examples:
    ```python
    from lazyllm.tools.data import reranker

    formatter = reranker.RerankerFormatPairwise()

    data = {'_is_valid': True, '_query': 'machine learning', '_pos': ['ML tutorial'], '_neg': ['cooking']}
    result = formatter(data)
    # Returns: [{'query': 'machine learning', 'doc_pos': 'ML tutorial', 'doc_neg': 'cooking'}]
    ```
    """
    def __init__(self, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)

    def forward(self, data: dict, **kwargs) -> List[dict]:
        if not data.get('_is_valid'):
            return []

        query = data['_query']
        pos = data['_pos']
        neg = data['_neg']

        results = []

        # Create pairwise comparisons
        for p in pos:
            for n in neg:
                results.append({'query': query, 'doc_pos': p, 'doc_neg': n})

        return results

RerankerGenerateQueries

Bases: reranker

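Operator that generates multiple retrieval queries from a given text.

This operator builds the prompt with RerankerQueryGeneratorPrompt and calls an LLM to generate queries at different difficulty levels. The output is parsed by JsonFormatter and stored as a JSON string in the '_query_response' field.

If the input passage is empty or generation fails, an empty response field is returned.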
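Because '_query_response' holds a JSON string (or an empty string on failure), downstream code has to decode it defensively. The sketch below assumes the decoded payload is a list of objects with a `query` key. That shape is an assumption made for illustration, since the real structure depends on the prompt and the model, and `parse_query_response` is a hypothetical helper.

```python
import json


def parse_query_response(item):
    """Decode '_query_response' into a list of query strings (hypothetical helper)."""
    raw = item.get('_query_response', '')
    if not raw:
        return []  # empty passage or failed generation
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the stored text is not valid JSON
    if isinstance(payload, dict):
        payload = [payload]  # tolerate a single object
    if not isinstance(payload, list):
        return []
    # Assumed shape: [{'query': ..., 'difficulty': ...}, ...]
    return [p['query'] for p in payload if isinstance(p, dict) and p.get('query')]


ok = {'_query_response': '[{"query": "What are LLMs used for?", "difficulty": "easy"}]'}
print(parse_query_response(ok))
print(parse_query_response({'_query_response': ''}))
```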

Parameters:

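  • llm_serving

    Language model service instance.

  • lang (str, default: 'zh' ) –

    Language for query generation. Defaults to 'zh'.

  • num_queries (int, default: 3 ) –

    Number of queries to generate. Defaults to 3.

  • difficulty_levels (List[str], default: None ) –

    List of query difficulty levels. Defaults to ['easy', 'medium', 'hard'].

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.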

Examples:

op = RerankerGenerateQueries(
    llm_serving=my_llm,
    lang='en',
    num_queries=5,
    difficulty_levels=['easy', 'hard']
)

result = op({'passage': 'Large language models are widely used in NLP.'})
print(result['_query_response'])
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_query_generator.py
class RerankerGenerateQueries(reranker):
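    """Operator that generates multiple retrieval queries from a given text.

This operator builds the prompt with RerankerQueryGeneratorPrompt
and calls an LLM to generate queries at different difficulty levels.
The output is parsed by JsonFormatter
and stored as a JSON string in the '_query_response' field.

If the input passage is empty or generation fails, an empty response field is returned.

Args:
    llm_serving: Language model service instance.
    lang (str): Language for query generation. Defaults to 'zh'.
    num_queries (int): Number of queries to generate. Defaults to 3.
    difficulty_levels (List[str]): List of query difficulty levels. Defaults to ['easy', 'medium', 'hard'].
    **kwargs (dict): Other optional arguments, passed to the parent class.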


Examples:
    ```python
    op = RerankerGenerateQueries(
        llm_serving=my_llm,
        lang='en',
        num_queries=5,
        difficulty_levels=['easy', 'hard']
    )

    result = op({'passage': 'Large language models are widely used in NLP.'})
    print(result['_query_response'])
    ```
    """
    def __init__(
        self,
        llm_serving=None,
        lang: str = 'zh',
        num_queries: int = 3,
        difficulty_levels: Optional[List[str]] = None,
        **kwargs
    ):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_queries = num_queries
        self.difficulty_levels = difficulty_levels or ['easy', 'medium', 'hard']
        self.prompt_template = RerankerQueryGeneratorPrompt(lang=lang)

        # Initialize LLM serve with system prompt and formatter
        if llm_serving is not None:
            system_prompt = self.prompt_template.build_system_prompt()
            self._llm_serve = llm_serving.share().prompt(system_prompt).formatter(JsonFormatter())
            self._llm_serve.start()
        else:
            self._llm_serve = None

    def forward(
        self,
        data: dict,
        input_key: str = 'passage',
        **kwargs
    ) -> dict:
        if self._llm_serve is None:
            raise ValueError('LLM serving is not configured')

        passage = data.get(input_key, '')
        if not passage:
            return {**data, '_query_response': ''}

        # Build user prompt from passage
        user_prompt = self.prompt_template.build_prompt(
            passage=passage,
            num_queries=self.num_queries,
            difficulty_levels=self.difficulty_levels
        )

        try:
            result = self._llm_serve(user_prompt)
            # JsonFormatter already parses JSON, handle both str and parsed result
            if isinstance(result, str):
                response = result
            else:
                response = json.dumps(result, ensure_ascii=False)
            return {**data, '_query_response': response}
        except Exception as e:
            LOG.warning(f'Failed to generate queries: {e}')
            return {**data, '_query_response': ''}

RerankerInitBM25

Bases: reranker

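Operator that initializes a BM25 index.

This operator builds a BM25 index over the corpus for keyword-based negative mining. Chinese and English are supported: Chinese text is segmented with jieba, and English text is stemmed with Stemmer.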
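The operator delegates indexing to the bm25s library. To show what a BM25 index computes, here is a small pure-Python Okapi BM25 scorer. It is a teaching sketch under assumed defaults (k1=1.5, b=0.75, whitespace tokenization, a smoothed IDF) and does not reflect how bm25s is implemented; `TinyBM25` is a hypothetical class.

```python
import math
from collections import Counter


class TinyBM25:
    """Minimal Okapi BM25 over pre-tokenized documents (teaching sketch)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(doc) for doc in docs]      # term frequencies per document
        self.lengths = [len(doc) for doc in docs]
        self.avg_len = sum(self.lengths) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tfs for term in tf)  # document frequencies
        # Smoothed IDF that stays positive for every term.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query, i):
        tf, length = self.tfs[i], self.lengths[i]
        total = 0.0
        for term in query:
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Length normalization: long documents are penalized via b.
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        ranked = sorted(range(len(self.tfs)), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


corpus = ['deep learning for vision', 'cooking pasta at home', 'machine learning basics']
index = TinyBM25([doc.split() for doc in corpus])
best = index.top_k('machine learning'.split(), k=2)
print([corpus[i] for i in best])
```

The document that matches both query terms ranks first, and the one that shares only "learning" ranks second. That second kind of document, lexically close but not the positive, is what keyword-based negative mining looks for.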

Parameters:

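  • language (str, default: 'zh' ) –

    Language: 'zh' for Chinese, 'en' for English. Defaults to 'zh'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The input data list, with the BM25 index and tokenizer configuration added to each item.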

Examples:

from lazyllm.tools.data import reranker

init_bm25 = reranker.RerankerInitBM25(language='zh')

# Build the corpus first
data_with_corpus = reranker.build_reranker_corpus(inputs)
# Then initialize BM25
result = init_bm25(data_with_corpus)
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
class RerankerInitBM25(reranker):
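    """Operator that initializes a BM25 index.

This operator builds a BM25 index over the corpus for keyword-based negative mining.
Chinese and English are supported: Chinese text is segmented with jieba, and English text is stemmed with Stemmer.

Args:
    language (str): Language: 'zh' for Chinese, 'en' for English. Defaults to 'zh'.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The input data list, with the BM25 index and tokenizer configuration added to each item.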


Examples:
    ```python
    from lazyllm.tools.data import reranker

    init_bm25 = reranker.RerankerInitBM25(language='zh')

    # Build the corpus first
    data_with_corpus = reranker.build_reranker_corpus(inputs)
    # Then initialize BM25
    result = init_bm25(data_with_corpus)
    ```
    """
    def __init__(self, language: str = 'zh', **kwargs):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.language = language
        self._setup_tokenizer(language)

    def _setup_tokenizer(self, language: str):
        if language == 'en':
            self._stemmer = Stemmer.Stemmer('english')
            self._stopwords = language
            self._tokenizer = lambda t: t
        elif language == 'zh':
            self._stemmer = None
            self._stopwords = STOPWORDS_CHINESE
            self._tokenizer = lambda t: ' '.join(jieba.lcut(t))
        else:
            self._stemmer = None
            self._stopwords = None
            self._tokenizer = lambda t: t

    def forward_batch_input(self, inputs: List[dict], **kwargs) -> List[dict]:
        if not inputs:
            return inputs

        # Load corpus from file path instead of memory
        corpus_path = inputs[0].get('_corpus', '')
        if not corpus_path:
            LOG.warning('No corpus path found for BM25 initialization.')
            return [{**item, '_bm25': None, '_bm25_corpus': []} for item in inputs]

        corpus = _load_corpus_from_path(corpus_path)
        if not corpus:
            LOG.warning(f'Failed to load corpus from {corpus_path}')
            return [{**item, '_bm25': None, '_bm25_corpus': []} for item in inputs]

        LOG.info(f'Initializing BM25 index for {len(corpus)} documents...')
        corpus_tokens = bm25s.tokenize(
            [self._tokenizer(doc) for doc in corpus],
            stopwords=self._stopwords,
            stemmer=self._stemmer,
        )
        bm25_index = bm25s.BM25()
        bm25_index.index(corpus_tokens)
        LOG.info('BM25 index initialized.')

        return [{
            **item,
            '_bm25': bm25_index,
            '_bm25_corpus': corpus,
            '_bm25_tokenizer': self._tokenizer,
            '_bm25_stopwords': self._stopwords,
            '_bm25_stemmer': self._stemmer
        } for item in inputs]

RerankerInitSemantic

Bases: reranker

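Operator that initializes semantic embeddings.

This operator uses an embedding service to compute vector representations for all documents in the corpus and saves them to a file, for use in later semantic-similarity computation and negative mining.

Parameters:

  • embedding_serving (Callable, default: None ) –

    Callable that invokes the embedding service.

  • embeddings_dir (str, default: None ) –

    Directory in which to save the embeddings file. Defaults to the directory that contains the corpus.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The input data list, with the embeddings file path and corpus information added to each item.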

Examples:

from lazyllm.tools.data import reranker

# Assume embedding_fn is the embedding service
init_semantic = reranker.RerankerInitSemantic(embedding_serving=embedding_fn)

# Build the corpus first
data_with_corpus = reranker.build_reranker_corpus(inputs)
# Then compute the semantic embeddings
result = init_semantic(data_with_corpus)
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
class RerankerInitSemantic(reranker):
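    """Operator that initializes semantic embeddings.

This operator uses an embedding service to compute vector representations for all documents in the corpus
and saves them to a file, for use in later semantic-similarity computation and negative mining.

Args:
    embedding_serving (Callable): Callable that invokes the embedding service.
    embeddings_dir (str, optional): Directory in which to save the embeddings file. Defaults to the directory that contains the corpus.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The input data list, with the embeddings file path and corpus information added to each item.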


Examples:
    ```python
    from lazyllm.tools.data import reranker

    # Assume embedding_fn is the embedding service
    init_semantic = reranker.RerankerInitSemantic(embedding_serving=embedding_fn)

    # Build the corpus first
    data_with_corpus = reranker.build_reranker_corpus(inputs)
    # Then compute the semantic embeddings
    result = init_semantic(data_with_corpus)
    ```
    """
    def __init__(self, embedding_serving: Optional[Callable] = None, embeddings_dir: Optional[str] = None, **kwargs):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.embedding_serving = embedding_serving
        self.embeddings_dir = embeddings_dir

    def forward_batch_input(self, inputs: List[dict], **kwargs) -> List[dict]:
        if not inputs:
            return inputs

        # Load corpus from file path instead of memory
        corpus_path = inputs[0].get('_corpus', '')
        if not corpus_path:
            LOG.warning('No corpus path found for semantic initialization.')
            return [{**item, '_semantic_embeddings_path': '', '_semantic_corpus': []}
                    for item in inputs]

        # Verify all inputs share the same corpus path for consistency
        if not all(item.get('_corpus') == corpus_path for item in inputs):
            LOG.warning('Not all inputs share the same corpus path. Using corpus from first item.')

        corpus = _load_corpus_from_path(corpus_path)
        if not corpus or self.embedding_serving is None:
            LOG.warning('No corpus or embedding_serving for semantic initialization.')
            return [{**item, '_semantic_embeddings_path': '', '_semantic_corpus': corpus or []}
                    for item in inputs]

        LOG.info(f'Computing embeddings for {len(corpus)} documents...')
        embeddings = np.array(self.embedding_serving(corpus))
        LOG.info('Embeddings computed.')

        # Save embeddings to file instead of storing in memory for each item
        if self.embeddings_dir is None:
            embeddings_dir = os.path.dirname(corpus_path)
        else:
            embeddings_dir = self.embeddings_dir
        os.makedirs(embeddings_dir, exist_ok=True)

        embeddings_path = os.path.join(embeddings_dir, f'reranker_embeddings_{id(inputs)}.npy')
        np.save(embeddings_path, embeddings)
        LOG.info(f'Saved embeddings to {embeddings_path}')

        return [{
            **item,
            '_semantic_embeddings_path': embeddings_path,
            '_semantic_corpus': corpus
        } for item in inputs]

RerankerMineBM25Negatives

Bases: reranker

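BM25 negative-mining operator.

Using the BM25 index, this operator retrieves the documents most relevant to the query that are not positives and uses them as negatives. It is suited to mining hard negatives that share vocabulary with the query but differ in meaning.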
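The selection step after retrieval can be sketched independently of any index. Both helpers below are hypothetical; the margin of 10 extra hits follows the `+ 10` used in the operator's source listing, so that enough candidates remain once positives are filtered out.

```python
def pick_negatives(ranked_docs, positives, num_negatives):
    """Take the best-ranked documents that are not positives (hypothetical helper)."""
    pos_set = set(positives)
    negatives = []
    for doc in ranked_docs:  # ranked_docs is ordered from most to least relevant
        if doc in pos_set:
            continue  # a positive must never be used as a negative
        negatives.append(doc)
        if len(negatives) >= num_negatives:
            break
    return negatives


def retrieval_depth(corpus_size, num_negatives, num_positives, margin=10):
    """How many hits to request so that filtering still leaves enough negatives."""
    return min(corpus_size, num_negatives + num_positives + margin)


ranked = ['ML tutorial', 'ML course ad', 'learning to cook', 'history of maths']
print(pick_negatives(ranked, positives=['ML tutorial'], num_negatives=2))
print(retrieval_depth(corpus_size=1000, num_negatives=7, num_positives=1))
```

If the ranked list runs out before `num_negatives` is reached, fewer negatives are returned, which is why a follow-up step such as RerankerAdjustNegatives can be useful.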

Parameters:

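  • num_negatives (int, default: 7 ) –

    Number of negatives to mine. Defaults to 7.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input data, with the list of mined negatives added.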

Examples:

from lazyllm.tools.data import reranker

miner = reranker.RerankerMineBM25Negatives(num_negatives=5)

data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_bm25': bm25_index, '_bm25_corpus': corpus}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': ['bm25_neg1', 'bm25_neg2', ...]}
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
class RerankerMineBM25Negatives(reranker):
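    """BM25 negative-mining operator.

Using the BM25 index, this operator retrieves the documents most relevant to the query that are not positives
and uses them as negatives. It is suited to mining hard negatives that share vocabulary with the query but differ in meaning.

Args:
    num_negatives (int): Number of negatives to mine. Defaults to 7.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input data, with the list of mined negatives added.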


Examples:
    ```python
    from lazyllm.tools.data import reranker

    miner = reranker.RerankerMineBM25Negatives(num_negatives=5)

    data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_bm25': bm25_index, '_bm25_corpus': corpus}
    result = miner(data)
    # Returns: {'query': '...', 'pos': [...], 'neg': ['bm25_neg1', 'bm25_neg2', ...]}
    ```
    """
    def __init__(self, num_negatives: int = 7, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_negatives = num_negatives

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs
    ) -> dict:
        bm25_index = data.get('_bm25')
        corpus = data.get('_bm25_corpus') or []
        tokenizer = data.get('_bm25_tokenizer', lambda t: t)
        stopwords = data.get('_bm25_stopwords')
        stemmer = data.get('_bm25_stemmer')

        if bm25_index is None:
            LOG.warning('BM25 index not initialized.')
            return {**data, output_neg_key: []}

        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)
        tokenized_query = bm25s.tokenize(
            tokenizer(query), stopwords=stopwords, stemmer=stemmer
        )

        k = min(len(corpus) if corpus else 0,
                self.num_negatives + len(pos_set) + 10)
        indices, scores = bm25_index.retrieve(tokenized_query, k=k)

        negatives = []
        if not corpus:
            return {**data, output_neg_key: []}

        for idx in indices[0]:
            doc = corpus[idx]
            if doc not in pos_set:
                negatives.append(doc)
                if len(negatives) >= self.num_negatives:
                    break

        result = {k: v for k, v in data.items() if k not in (
            '_bm25', '_bm25_corpus', '_bm25_tokenizer', '_bm25_stopwords', '_bm25_stemmer'
        )}
        result[output_neg_key] = negatives
        return result

RerankerMineMixedNegatives

Bases: reranker

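Mixed-strategy negative-mining operator.

This operator mines negatives with both BM25 and semantic similarity, using each method in the specified proportion, which yields more diverse hard negatives.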
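How the budget is divided is easy to misjudge for odd counts or extreme ratios. The sketch below reproduces the arithmetic from the operator's source listing in a hypothetical `split_budget` helper.

```python
def split_budget(num_negatives, bm25_ratio):
    """Split the negative budget between BM25 and semantic mining."""
    num_bm25 = max(1, int(num_negatives * bm25_ratio))  # at least one BM25 negative
    num_semantic = num_negatives - num_bm25
    return num_bm25, num_semantic


for n, ratio in [(6, 0.5), (7, 0.5), (7, 0.0), (4, 0.9)]:
    print(n, ratio, split_budget(n, ratio))
```

With the defaults (7 negatives, ratio 0.5) the split is 3 BM25 and 4 semantic negatives, because the BM25 share is rounded down. A ratio of 0 still reserves one slot for BM25.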

Parameters:

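  • embedding_serving (Callable, default: None ) –

    Callable that invokes the embedding service.

  • num_negatives (int, default: 7 ) –

    Number of negatives to mine. Defaults to 7.

  • bm25_ratio (float, default: 0.5 ) –

    Share of negatives mined with BM25; the remainder uses the semantic method. Defaults to 0.5.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input data, with the list of negatives mined by the mixed strategy added.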

Examples:

from lazyllm.tools.data import reranker

# Assume embedding_fn is the embedding service
miner = reranker.RerankerMineMixedNegatives(
    embedding_serving=embedding_fn,
    num_negatives=6,
    bm25_ratio=0.5  # 3 BM25 negatives + 3 semantic negatives
)

data = {
    'query': 'machine learning',
    'pos': ['ML tutorial'],
    '_bm25': bm25_index,
    '_bm25_corpus': corpus,
    '_semantic_embeddings_path': emb_path,
    '_semantic_corpus': corpus
}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': [...]} with 3 BM25 negatives and 3 semantic negatives
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
class RerankerMineMixedNegatives(reranker):
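    """Mixed-strategy negative-mining operator.

This operator mines negatives with both BM25 and semantic similarity, using each method in the specified
proportion, which yields more diverse hard negatives.

Args:
    embedding_serving (Callable): Callable that invokes the embedding service.
    num_negatives (int): Number of negatives to mine. Defaults to 7.
    bm25_ratio (float): Share of negatives mined with BM25; the remainder uses the semantic method. Defaults to 0.5.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input data, with the list of negatives mined by the mixed strategy added.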


Examples:
    ```python
    from lazyllm.tools.data import reranker

    # Assume embedding_fn is the embedding service
    miner = reranker.RerankerMineMixedNegatives(
        embedding_serving=embedding_fn,
        num_negatives=6,
        bm25_ratio=0.5  # 3 BM25 negatives + 3 semantic negatives
    )

    data = {
        'query': 'machine learning',
        'pos': ['ML tutorial'],
        '_bm25': bm25_index,
        '_bm25_corpus': corpus,
        '_semantic_embeddings_path': emb_path,
        '_semantic_corpus': corpus
    }
    result = miner(data)
    # Returns: {'query': '...', 'pos': [...], 'neg': [...]} with 3 BM25 negatives and 3 semantic negatives
    ```
    """
    def __init__(self, embedding_serving: Optional[Callable] = None,
                 num_negatives: int = 7, bm25_ratio: float = 0.5, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_negatives = num_negatives
        self.bm25_ratio = bm25_ratio
        self.embedding_serving = embedding_serving

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs
    ) -> dict:
        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)

        # Calculate number of negatives for each strategy
        num_bm25 = max(1, int(self.num_negatives * self.bm25_ratio))
        num_semantic = self.num_negatives - num_bm25

        # Mine BM25 negatives first
        bm25_negatives = []
        bm25_index = data.get('_bm25')
        corpus_bm25 = data.get('_bm25_corpus') or []

        if bm25_index and corpus_bm25:
            tokenizer = data.get('_bm25_tokenizer', lambda t: t)
            stopwords = data.get('_bm25_stopwords')
            stemmer = data.get('_bm25_stemmer')

            tokenized_query = bm25s.tokenize(
                tokenizer(query), stopwords=stopwords, stemmer=stemmer
            )
            k = min(len(corpus_bm25), num_bm25 + len(pos_set) + 5)
            indices, scores = bm25_index.retrieve(tokenized_query, k=k)

            for idx in indices[0]:
                doc = corpus_bm25[idx]
                if doc not in pos_set:
                    bm25_negatives.append(doc)
                    if len(bm25_negatives) >= num_bm25:
                        break

        # Mine semantic negatives
        semantic_negatives = []
        # Load embeddings from file path
        embeddings_path = data.get('_semantic_embeddings_path', '')
        corpus_embeddings = _load_embeddings_from_path(embeddings_path)
        corpus_semantic = data.get('_semantic_corpus') or []

        if corpus_embeddings is not None and corpus_semantic and self.embedding_serving is not None:
            # Update pos_set to exclude BM25 negatives
            pos_set_extended = pos_set | set(bm25_negatives)

            query_embedding = np.array(self.embedding_serving([query])[0])

            # Compute cosine similarity using shared function
            similarities = _compute_cosine_similarity(query_embedding, corpus_embeddings)

            scored_docs = [(sim, doc) for sim, doc in zip(similarities, corpus_semantic)
                           if doc not in pos_set_extended]
            scored_docs.sort(key=lambda x: x[0], reverse=True)

            semantic_negatives = [doc for _, doc in scored_docs[:num_semantic]]

        negatives = bm25_negatives + semantic_negatives
        result = {k: v for k, v in data.items() if k not in (
            '_bm25', '_bm25_corpus', '_bm25_tokenizer', '_bm25_stopwords', '_bm25_stemmer'
        )}
        result[output_neg_key] = negatives
        return result

RerankerMineRandomNegatives

Bases: reranker

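Random negative-mining operator.

This operator randomly selects documents from the corpus that are not positives and uses them as negatives. It is suited to baseline comparisons or to scenarios that need random negatives.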
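Reproducible sampling per query can be sketched with the standard library alone. `sample_negatives` is a hypothetical helper, and the corpus here is an in-memory list rather than a file path.

```python
import random


def sample_negatives(corpus, positives, query, num_negatives=7, seed=42):
    """Randomly pick non-positive documents, reproducibly per query (sketch)."""
    pos_set = set(positives)
    candidates = [doc for doc in corpus if doc not in pos_set]
    if len(candidates) <= num_negatives:
        return candidates  # not enough to sample from: return them all
    rng = random.Random(f'{seed}_{query}')  # private generator, global state untouched
    return rng.sample(candidates, num_negatives)


corpus = [f'doc{i}' for i in range(20)]
first = sample_negatives(corpus, ['doc3'], 'machine learning', num_negatives=5)
second = sample_negatives(corpus, ['doc3'], 'machine learning', num_negatives=5)
print(first == second)  # True: same seed and query give the same sample
```

Seeding a private `random.Random` per item means results do not depend on the order in which items are processed, which matters when items run concurrently.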

Parameters:

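  • num_negatives (int, default: 7 ) –

    Number of negatives to mine. Defaults to 7.

  • seed (int, default: 42 ) –

    Random seed for reproducible random selection. Defaults to 42.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input data, with the list of mined negatives added.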

Examples:

from lazyllm.tools.data import reranker

miner = reranker.RerankerMineRandomNegatives(num_negatives=5, seed=123)

data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_corpus': corpus_path}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], '_corpus': '...', 'neg': ['random_neg1', 'random_neg2', ...]}
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
class RerankerMineRandomNegatives(reranker):
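    """Random negative-mining operator.

This operator randomly selects documents from the corpus that are not positives and uses them as negatives.
It is suited to baseline comparisons or to scenarios that need random negatives.

Args:
    num_negatives (int): Number of negatives to mine. Defaults to 7.
    seed (int): Random seed for reproducible random selection. Defaults to 42.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input data, with the list of mined negatives added.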


Examples:
    ```python
    from lazyllm.tools.data import reranker

    miner = reranker.RerankerMineRandomNegatives(num_negatives=5, seed=123)

    data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_corpus': corpus_path}
    result = miner(data)
    # Returns: {'query': '...', 'pos': [...], '_corpus': '...', 'neg': ['random_neg1', 'random_neg2', ...]}
    ```
    """
    def __init__(self, num_negatives: int = 7, seed: int = 42, **kwargs):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.num_negatives = num_negatives
        self.seed = seed

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs
    ) -> dict:
        # Load corpus from file path
        corpus_path = data.get('_corpus', '')
        if isinstance(corpus_path, str) and corpus_path:
            corpus = _load_corpus_from_path(corpus_path)
        elif isinstance(corpus_path, list):
            # Backward compatibility: corpus stored directly
            corpus = corpus_path
        else:
            corpus = []

        if not corpus:
            return {**data, output_neg_key: []}

        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)
        candidates = [doc for doc in corpus if doc not in pos_set]

        if len(candidates) <= self.num_negatives:
            negatives = candidates
        else:
            # Use instance seed combined with query content for reproducibility
            local_random = random.Random(f'{self.seed}_{query}')
            negatives = local_random.sample(candidates, self.num_negatives)

        return {**data, output_neg_key: negatives}
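
The sampling step in `forward` can be sketched in isolation as follows. This is a minimal sketch rather than the library code: `sample_random_negatives` is a hypothetical helper, and `set(positives)` stands in for `_normalize_pos_samples`, whose exact behavior is not shown in this section.

```python
import random


def sample_random_negatives(corpus, positives, query, num_negatives=7, seed=42):
    """Pick up to num_negatives documents that are not positives (hypothetical helper)."""
    # Assumption: positives are plain strings; the real operator normalizes them first.
    pos_set = set(positives)
    candidates = [doc for doc in corpus if doc not in pos_set]

    # Not enough candidates: return all of them without sampling.
    if len(candidates) <= num_negatives:
        return candidates

    # A local RNG seeded with the seed and the query keeps results reproducible
    # per query without touching the global random state.
    rng = random.Random(f'{seed}_{query}')
    return rng.sample(candidates, num_negatives)


corpus = [f'doc{i}' for i in range(20)]
negatives = sample_random_negatives(corpus, ['doc0'], 'machine learning', num_negatives=5)
print(len(negatives), 'doc0' in negatives)
```

Combining the seed with the query text means the same query always receives the same negatives, while different queries are likely to draw different ones.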

RerankerMineSemanticNegatives

Bases: reranker

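Semantic-similarity negative mining operator.

This operator uses embedding similarity to find the documents most similar to the query that are not positive samples, and uses them as negatives. It is suited to mining hard negatives that are semantically close but actually irrelevant, and usually works better than the BM25 method.

Parameters:

  • num_negatives (int, default: 7 ) –

    Number of negatives to mine. Defaults to 7.

  • embedding_serving (Callable, default: None ) –

    Embedding service callable used to compute the query vector. If it is None, an empty negatives list is returned.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • dict

    The input data with the list of negatives mined by semantic similarity added.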

Examples:

from lazyllm.tools.data import reranker

# Assume embedding_fn is an embedding service callable; emb_path and corpus are assumed to be prepared beforehand
miner = reranker.RerankerMineSemanticNegatives(num_negatives=5, embedding_serving=embedding_fn)

data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
result = miner(data)
# Returns: {'query': '...', 'pos': [...], 'neg': ['semantic_neg1', 'semantic_neg2', ...]}
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_hard_negative_miner.py
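class RerankerMineSemanticNegatives(reranker):
    """Semantic-similarity negative mining operator.

This operator uses embedding similarity to find the documents most similar to the query that are not positive samples, and uses them as negatives.
It is suited to mining hard negatives that are semantically close but actually irrelevant, and usually works better than the BM25 method.

Args:
    num_negatives (int): Number of negatives to mine. Defaults to 7.
    embedding_serving (Callable): Embedding service callable used to compute the query vector.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    dict: The input data with the list of negatives mined by semantic similarity added.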


Examples:
    ```python
    from lazyllm.tools.data import reranker

    # Assume embedding_fn is an embedding service callable; emb_path and corpus are assumed to be prepared beforehand
    miner = reranker.RerankerMineSemanticNegatives(num_negatives=5, embedding_serving=embedding_fn)

    data = {'query': 'machine learning', 'pos': ['ML tutorial'], '_semantic_embeddings_path': emb_path, '_semantic_corpus': corpus}
    result = miner(data)
    # Returns: {'query': '...', 'pos': [...], 'neg': ['semantic_neg1', 'semantic_neg2', ...]}
    ```
    """
    def __init__(self, num_negatives: int = 7,
                 embedding_serving: Optional[Callable] = None, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        self.num_negatives = num_negatives
        self.embedding_serving = embedding_serving

    def forward(
        self,
        data: dict,
        input_query_key: str = 'query',
        input_pos_key: str = 'pos',
        output_neg_key: str = 'neg',
        **kwargs
    ) -> dict:
        # Load embeddings from file path
        embeddings_path = data.get('_semantic_embeddings_path', '')
        corpus_embeddings = _load_embeddings_from_path(embeddings_path)
        corpus = data.get('_semantic_corpus') or []

        if corpus_embeddings is None:
            LOG.warning('Semantic embeddings not initialized.')
            return {**data, output_neg_key: []}

        query = data.get(input_query_key, '')
        pos_samples = data.get(input_pos_key, [])

        if not query:
            return {**data, output_neg_key: []}

        pos_set = _normalize_pos_samples(pos_samples)

        if self.embedding_serving is None:
            return {**data, output_neg_key: []}

        query_embedding = np.array(self.embedding_serving([query])[0])
        similarities = _compute_cosine_similarity(query_embedding, corpus_embeddings)

        scored_docs = [(sim, doc) for sim, doc in zip(similarities, corpus)
                       if doc not in pos_set]
        scored_docs.sort(key=lambda x: x[0], reverse=True)

        negatives = [doc for _, doc in scored_docs[:self.num_negatives]]
        return {**data, output_neg_key: negatives}
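
The ranking step above (score every corpus document against the query, drop positives, keep the top `num_negatives`) can be illustrated with a small standalone sketch. It uses pure-Python cosine similarity in place of the NumPy-based `_compute_cosine_similarity`, and `mine_semantic_negatives` is a hypothetical name chosen for illustration, not a library function.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_semantic_negatives(query_vec, corpus, corpus_vecs, positives, num_negatives=7):
    """Return the documents most similar to the query that are not positives."""
    pos_set = set(positives)  # assumption: positives are plain strings
    scored = [
        (cosine(query_vec, vec), doc)
        for doc, vec in zip(corpus, corpus_vecs)
        if doc not in pos_set
    ]
    # Highest similarity first, so the hardest negatives come out on top.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_negatives]]


corpus = ['ML tutorial', 'deep learning guide', 'cooking recipes']
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(mine_semantic_negatives([1.0, 0.0], corpus, vectors, ['ML tutorial'], num_negatives=1))
```

The positive document is the closest match to the query, so it has to be filtered out before ranking; otherwise it would be returned as a negative.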

RerankerParseQueries

Bases: reranker

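Parses the LLM-generated query results and expands them into multiple training sample records.

This operator reads the JSON content in the '_query_response' field and parses it into a list of queries (either a list or a {'queries': [...]} structure is supported). Each query produces a new data record containing:

  • query: the query text
  • difficulty: the difficulty level (default 'medium')
  • pos: the list of positive sample texts (the original passage)

Intermediate fields such as '_query_response' are removed as well.

Parameters:

  • input_key (str, default: 'passage' ) –

    Name of the source text field. Defaults to 'passage'.

  • output_query_key (str, default: 'query' ) –

    Name of the output query field. Defaults to 'query'.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.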

Examples:

from lazyllm.tools.data import reranker

op = reranker.RerankerParseQueries(input_key='passage', output_query_key='query')

data = {
    'passage': 'Large language models are widely used in NLP.',
    '_query_response': '[{"query": "What are LLMs used for?", "difficulty": "easy"}]'
}

rows = op(data)
for row in rows:
    print(row['query'], row['difficulty'], row['pos'])
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_query_generator.py
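class RerankerParseQueries(reranker):
    """Parses the LLM-generated query results and expands them into multiple training sample records.

This operator reads the JSON content in the '_query_response' field and
parses it into a list of queries (either a list or a {'queries': [...]} structure is supported).
Each query produces a new data record containing:

- query: the query text
- difficulty: the difficulty level (default 'medium')
- pos: the list of positive sample texts (the original passage)

Intermediate fields such as '_query_response' are removed as well.

Args:
    input_key (str): Name of the source text field. Defaults to 'passage'.
    output_query_key (str): Name of the output query field. Defaults to 'query'.
    **kwargs (dict): Other optional arguments, passed to the parent class.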


Examples:
    ```python
    from lazyllm.tools.data import reranker

    op = reranker.RerankerParseQueries(input_key='passage', output_query_key='query')

    data = {
        'passage': 'Large language models are widely used in NLP.',
        '_query_response': '[{"query": "What are LLMs used for?", "difficulty": "easy"}]'
    }

    rows = op(data)
    for row in rows:
        print(row['query'], row['difficulty'], row['pos'])
    ```
    """
    def __init__(
        self,
        input_key: str = 'passage',
        output_query_key: str = 'query',
        **kwargs
    ):
        super().__init__(_concurrency_mode='process', **kwargs)
        self.input_key = input_key
        self.output_query_key = output_query_key

    def forward(
        self,
        data: dict,
        **kwargs
    ) -> List[dict]:
        response = data.get('_query_response', '')
        if not response:
            return []

        passage = data.get(self.input_key, '')
        expanded_rows = []

        try:
            parsed = json.loads(_clean_json_block(response))
            queries = parsed if isinstance(parsed, list) else parsed.get('queries', [])

            for query_item in queries:
                if isinstance(query_item, dict):
                    query = query_item.get('query', '')
                    difficulty = query_item.get('difficulty', 'medium')
                else:
                    query = str(query_item)
                    difficulty = 'medium'

                if query.strip():
                    new_row = data.copy()
                    new_row[self.output_query_key] = query.strip()
                    new_row['difficulty'] = difficulty
                    new_row['pos'] = [passage]  # Positive sample is the source passage
                    # Clean up intermediate fields
                    new_row.pop('_query_prompt', None)
                    new_row.pop('_query_response', None)
                    expanded_rows.append(new_row)

        except Exception as e:
            LOG.warning(f'Failed to parse LLM response: {e}')
            return []

        return expanded_rows
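
The expansion of one input record into one row per query can be sketched as below. This is an illustrative sketch under assumptions: `strip_code_fence` is a hypothetical stand-in for `_clean_json_block` (assumed here to remove a surrounding Markdown code fence), and `expand_queries` is not a library function.

```python
import json

FENCE = '`' * 3  # Markdown code-fence marker


def strip_code_fence(text):
    """Hypothetical stand-in for _clean_json_block: drop a surrounding code fence."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
    return '\n'.join(lines)


def expand_queries(row, input_key='passage', output_query_key='query'):
    """Turn one record with a JSON '_query_response' into one record per query."""
    response = row.get('_query_response', '')
    if not response:
        return []
    try:
        parsed = json.loads(strip_code_fence(response))
    except json.JSONDecodeError:
        return []

    # Accept either a bare list or a {'queries': [...]} wrapper.
    if isinstance(parsed, list):
        queries = parsed
    elif isinstance(parsed, dict):
        queries = parsed.get('queries', [])
    else:
        queries = []

    rows = []
    for item in queries:
        if isinstance(item, dict):
            query, difficulty = str(item.get('query', '')), item.get('difficulty', 'medium')
        else:
            query, difficulty = str(item), 'medium'
        if not query.strip():
            continue
        # Copy the record without the intermediate fields.
        new_row = {k: v for k, v in row.items() if k not in ('_query_prompt', '_query_response')}
        new_row[output_query_key] = query.strip()
        new_row['difficulty'] = difficulty
        new_row['pos'] = [row.get(input_key, '')]  # the source passage is the positive
        rows.append(new_row)
    return rows


record = {
    'passage': 'Large language models are widely used in NLP.',
    '_query_response': '{"queries": ["What are LLMs?", {"query": " Where are LLMs used? ", "difficulty": "hard"}]}',
}
for out in expand_queries(record):
    print(out['query'], out['difficulty'], out['pos'])
```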

RerankerTrainTestSplitter

Bases: reranker

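Train/test splitting operator for reranker data.

This operator randomly splits the dataset into a training set and a test set, with a configurable split ratio and random seed. Both sets can be saved to the specified files; when saved, the test set is converted for evaluation compatibility (its 'pos' field is written as 'corpus').

Parameters:

  • test_size (float, default: 0.1 ) –

    Fraction of the data used for the test set. Defaults to 0.1 (10%).

  • seed (int, default: 42 ) –

    Random seed for a reproducible split. Defaults to 42.

  • train_output_file (str, default: None ) –

    Output file path for the training set. Defaults to None.

  • test_output_file (str, default: None ) –

    Output file path for the test set. Defaults to None.

  • **kwargs (dict, default: {} ) –

    Other optional arguments, passed to the parent class.

Returns:

  • List[dict]: The split data list; each sample carries a split field marking the set it belongs to ('train' or 'test').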

Examples:

from lazyllm.tools.data import reranker

splitter = reranker.RerankerTrainTestSplitter(
    test_size=0.2,
    seed=123,
    train_output_file='train.jsonl',
    test_output_file='test.jsonl'
)

data = [
    {'query': 'q1', 'pos': ['p1'], 'neg': ['n1']},
    {'query': 'q2', 'pos': ['p2'], 'neg': ['n2']}
]
result = splitter(data)
# Returns, for example (which sample lands in 'test' depends on the shuffle):
# [{'query': 'q1', 'pos': ['p1'], 'neg': ['n1'], 'split': 'train'}, {'query': 'q2', 'pos': ['p2'], 'neg': ['n2'], 'split': 'test'}]
Source code in lazyllm/tools/data/operators/reranker_synthesis/reranker_data_formatter.py
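class RerankerTrainTestSplitter(reranker):
    """Train/test splitting operator for reranker data.

This operator randomly splits the dataset into a training set and a test set, with a configurable split ratio and random seed.
Both sets can be saved to the specified files; when saved, the test set is converted for evaluation compatibility (its 'pos' field is written as 'corpus').

Args:
    test_size (float): Fraction of the data used for the test set. Defaults to 0.1 (10%).
    seed (int): Random seed for a reproducible split. Defaults to 42.
    train_output_file (str, optional): Output file path for the training set. Defaults to None.
    test_output_file (str, optional): Output file path for the test set. Defaults to None.
    **kwargs (dict): Other optional arguments, passed to the parent class.

Returns:
    List[dict]: The split data list; each sample carries a split field marking the set it belongs to ('train' or 'test').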


Examples:
    ```python
    from lazyllm.tools.data import reranker

    splitter = reranker.RerankerTrainTestSplitter(
        test_size=0.2,
        seed=123,
        train_output_file='train.jsonl',
        test_output_file='test.jsonl'
    )

    data = [
        {'query': 'q1', 'pos': ['p1'], 'neg': ['n1']},
        {'query': 'q2', 'pos': ['p2'], 'neg': ['n2']}
    ]
    result = splitter(data)
    # Returns, for example (which sample lands in 'test' depends on the shuffle):
    # [{'query': 'q1', 'pos': ['p1'], 'neg': ['n1'], 'split': 'train'}, {'query': 'q2', 'pos': ['p2'], 'neg': ['n2'], 'split': 'test'}]
    ```
    """
    def __init__(
            self,
            test_size: float = 0.1,
            seed: int = 42,
            train_output_file: Optional[str] = None,
            test_output_file: Optional[str] = None,
            **kwargs
    ):
        super().__init__(rewrite_func='forward_batch_input', **kwargs)
        self.test_size = test_size
        self.seed = seed
        self.train_output_file = train_output_file
        self.test_output_file = test_output_file
        LOG.info(f'Initializing {self.__class__.__name__} with test_size: {test_size}')

    def forward_batch_input(self, data: List[dict]) -> List[dict]:
        assert isinstance(data, list), 'Input data must be a list'
        records = list(data)

        LOG.info(f'Splitting {len(records)} samples with test_size={self.test_size}')

        # Shuffle and split
        random.seed(self.seed)
        shuffled = records.copy()
        random.shuffle(shuffled)

        split_idx = int(len(shuffled) * (1 - self.test_size))
        train_data = shuffled[:split_idx]
        test_data = shuffled[split_idx:]

        # Add split labels
        for item in train_data:
            item['split'] = 'train'
        for item in test_data:
            item['split'] = 'test'

        LOG.info(f'Split completed: {len(train_data)} train, {len(test_data)} test')

        # Save to files if specified
        if self.train_output_file:
            output_path = Path(self.train_output_file)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                for item in train_data:
                    item_copy = {k: v for k, v in item.items() if k != 'split'}
                    f.write(json.dumps(item_copy, ensure_ascii=False) + '\n')
            LOG.info(f'Saved train data to {output_path}')

        if self.test_output_file:
            output_path = Path(self.test_output_file)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, 'w', encoding='utf-8') as f:
                for item in test_data:
                    # For eval data, rename pos to corpus for compatibility
                    item_copy = {
                        'query': item.get('query', ''),
                        'corpus': item.get('pos', []),
                        'neg': item.get('neg', [])
                    }
                    f.write(json.dumps(item_copy, ensure_ascii=False) + '\n')
            LOG.info(f'Saved test data to {output_path}')

        return train_data + test_data
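
The shuffle-and-split arithmetic can be tried out with the standalone sketch below. It is a variation for illustration, not the operator itself: `split_records` is a hypothetical helper that uses a local `random.Random` instead of seeding the global RNG, and it labels copies of the records so that the caller's dicts are left unchanged.

```python
import random


def split_records(records, test_size=0.1, seed=42):
    """Shuffle records reproducibly and label each copy as 'train' or 'test'."""
    rng = random.Random(seed)  # local RNG; the global random state is untouched
    shuffled = list(records)
    rng.shuffle(shuffled)

    # Same boundary rule as the listing above: the first (1 - test_size) share is train.
    split_idx = int(len(shuffled) * (1 - test_size))
    train = [{**item, 'split': 'train'} for item in shuffled[:split_idx]]
    test = [{**item, 'split': 'test'} for item in shuffled[split_idx:]]
    return train + test


data = [{'query': f'q{i}', 'pos': [f'p{i}'], 'neg': [f'n{i}']} for i in range(10)]
result = split_records(data, test_size=0.2, seed=123)
print(sum(r['split'] == 'train' for r in result), sum(r['split'] == 'test' for r in result))
```

Because `int()` truncates, small datasets can end up with a larger test share than requested; two records with `test_size=0.2` yield one training and one test sample.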

LLM JSON Operators

lazyllm.tools.data.operators.llm_base_ops

LLMDataJson

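Base class for LLM-based JSON data processing operators. It provides the basic logic for structured output, including automatic JsonFormatter configuration, a retry mechanism, and a preprocess/verify/postprocess lifecycle.

Constructor parameters:

  • model: A LazyLLM model instance.
  • prompt: Optional. The prompt used to guide the LLM (a ChatPrompter or a string).
  • max_retries: Maximum number of retries. Defaults to 3.
  • **kwargs: Other concurrency or persistence arguments passed to the base class.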
Source code in lazyllm/tools/data/operators/llm_base_ops.py
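class LLMDataJson:
    """Base class for LLM-based JSON data processing operators. It provides the basic logic for structured output, including automatic JsonFormatter configuration, a retry mechanism, and a preprocess/verify/postprocess lifecycle.

Constructor parameters:

- model: A LazyLLM model instance.
- prompt: Optional. The prompt used to guide the LLM (a ChatPrompter or a string).
- max_retries: Maximum number of retries. Defaults to 3.
- **kwargs: Other concurrency or persistence arguments passed to the base class.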
"""
    _default_prompt: Optional[Union[ChatPrompter, str]] = None
    _default_inference_kwargs = {
        'max_new_tokens': 512,
        'temperature': 0.2,
    }

    def __init__(self, model, prompt=None, max_retries=3, **kwargs):
        super().__init__(_concurrency_mode='thread', **kwargs)
        assert prompt is not None or self._default_prompt is not None, 'Prompt must be provided'
        prompt = prompt if prompt is not None else self._default_prompt
        self.model = model.share().prompt(prompt).formatter(JsonFormatter())
        self._max_retries = max_retries

    def preprocess(self, data: dict, **kwargs) -> Tuple[dict, dict]:
        raise NotImplementedError()

    def verify_output(self, output: dict, data: dict) -> bool:
        raise NotImplementedError()

    def postprocess(self, output: dict, data: dict) -> dict:
        raise NotImplementedError()

    def forward(self, data: dict, **kwargs) -> dict:
        prepared_data, infer_kwargs = self.preprocess(data, **kwargs)
        for key, default_val in self._default_inference_kwargs.items():
            infer_kwargs[key] = infer_kwargs.get(key, default_val)
        error_log = []
        for i in range(self._max_retries):
            try:
                res = self.model(prepared_data, **infer_kwargs)
                if self.verify_output(res, data):
                    return self.postprocess(res, data)
            except Exception as e:
                LOG.warning(f'LLM inference failed, try {i+1}/{self._max_retries}, Error: {e}')
                error_log.append(str(e))
                continue
        else:
            raise RuntimeError(f'LLM inference failed after {self._max_retries} retries. Errors: {"; ".join(error_log)}')
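
The retry loop in `forward` follows a call, verify, retry pattern that can be sketched independently of any model. In this sketch `call_with_retries` and the `flaky_model` stub are hypothetical names, and, unlike the listing above, a failed verification is also recorded in the error log so that the final message is never empty.

```python
def call_with_retries(model, payload, verify, max_retries=3):
    """Call model(payload) until verify(output) passes or the attempts run out."""
    errors = []
    for attempt in range(1, max_retries + 1):
        try:
            output = model(payload)
        except Exception as exc:  # a failed call consumes one attempt
            errors.append(f'attempt {attempt}: {exc}')
            continue
        if verify(output):
            return output
        errors.append(f'attempt {attempt}: output failed verification')
    raise RuntimeError(f'failed after {max_retries} attempts: ' + '; '.join(errors))


# Stub model for illustration: fails once, then returns a valid dict.
state = {'calls': 0}


def flaky_model(payload):
    state['calls'] += 1
    if state['calls'] == 1:
        raise ValueError('temporary failure')
    return {'echo': payload}


print(call_with_retries(flaky_model, 'hello', lambda out: isinstance(out, dict)))
```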

lazyllm.tools.data.operators.llm_json_ops

FieldExtractor

Bases: LLMDataJson, LLMJsonBase

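Field extractor. Uses an LLM to extract specific information from the input text according to a provided list of fields.

Parameters:

  • model

    A LazyLLM model instance.

  • prompt

    Optional. A custom extraction prompt.

  • input_keys

    List of exactly three input key names, for the persona, the text, and the field list. Defaults to ['persona', 'text', 'fields'].

  • output_key

    Key under which the result is stored in the data dict. Defaults to 'structured_data'.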

Examples:

from lazyllm import OnlineChatModule
from lazyllm.tools.data.operators.llm_json_ops import FieldExtractor
model = OnlineChatModule(source='sensenova')
op = FieldExtractor(model=model)
inputs = [{
    'text': 'Zhang San, 28 years old, currently in Shanghai',
    'fields': ['name', 'age', 'location']
}]
res = op(inputs)
print(res[0]['structured_data'])  # e.g. {'name': 'Zhang San', 'age': '28', 'location': 'Shanghai'}
Source code in lazyllm/tools/data/operators/llm_json_ops.py
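class FieldExtractor(LLMDataJson, LLMJsonBase):
    """Field extractor. Uses an LLM to extract specific information from the input text according to a provided list of fields.

Args:
    model: A LazyLLM model instance.
    prompt: Optional. A custom extraction prompt.
    input_keys: List of exactly three input key names, for the persona, the text, and the field list. Defaults to ['persona', 'text', 'fields'].
    output_key: Key under which the result is stored in the data dict. Defaults to 'structured_data'.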


Examples:
    ```python
    from lazyllm import OnlineChatModule
    from lazyllm.tools.data.operators.llm_json_ops import FieldExtractor
    model = OnlineChatModule(source='sensenova')
    op = FieldExtractor(model=model)
    inputs = [{
        'text': 'Zhang San, 28 years old, currently in Shanghai',
        'fields': ['name', 'age', 'location']
    }]
    res = op(inputs)
    print(res[0]['structured_data'])  # e.g. {'name': 'Zhang San', 'age': '28', 'location': 'Shanghai'}
    ```
    """
    _default_prompt = DataPrompt('zh')('field_extractor')
    _default_inference_kwargs = {
        'max_new_tokens': 1024,
        'temperature': 0.2,
    }

    def __init__(self, model, prompt=None, input_keys=None, output_key=None, **kwargs):
        super().__init__(model, prompt, **kwargs)
        self.input_keys = input_keys or ['persona', 'text', 'fields']
        assert len(self.input_keys) == 3, 'input_keys must contain exactly three keys.'
        self.output_key = output_key or 'structured_data'

    def preprocess(self, data: dict, **kwargs) -> Tuple[dict, dict]:
        raw_values = [data.get(k) for k in self.input_keys]
        persona, text, fields = ['' if v is None else str(v) for v in raw_values]
        if not text or not fields:
            raise ValueError(
                f'Missing required input keys. Received persona: "{persona}", '
                f'text: "{text}", fields: "{fields}"')
        return {'persona': persona or 'Extractor', 'text': text, 'fields': fields}, kwargs

    def verify_output(self, output: dict, data: dict) -> bool:
        if not isinstance(output, dict):
            return False
        for key in data.get(self.input_keys[2], []):
            if key not in output:
                return False
        return True

    def postprocess(self, output: dict, data: dict) -> dict:
        processed_output = {k: v.strip() if isinstance(v, str) else v for k, v in output.items()}
        data[self.output_key] = processed_output
        return data
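
The verify and postprocess steps can be exercised without an LLM by feeding them a hand-written output dict. The sketch below is a simplified illustration; `has_all_fields` and `strip_string_values` are hypothetical helper names that mirror the checks in `verify_output` and `postprocess`.

```python
def has_all_fields(output, fields):
    """True if output is a dict that contains every requested field name."""
    return isinstance(output, dict) and all(name in output for name in fields)


def strip_string_values(output):
    """Trim surrounding whitespace from string values; leave other values as they are."""
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in output.items()}


fields = ['name', 'age', 'location']
raw = {'name': ' Zhang San ', 'age': 28, 'location': 'Shanghai\n'}

if has_all_fields(raw, fields):
    print(strip_string_values(raw))
```

An output that is missing a requested field fails verification, which in the operator triggers another attempt rather than storing a partial result.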

SchemaExtractor

Bases: LLMDataJson, LLMJsonBase

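Schema extractor. Uses an LLM to extract structured data from text according to a specified schema (a dict or a Pydantic model).

Parameters:

  • model

    A LazyLLM model instance.

  • prompt

    Optional. A custom extraction prompt.

  • input_key

    Key of the input text. Defaults to 'text'.

  • output_key

    Key under which the result is stored in the data dict. Defaults to 'structured_data'.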

Examples:

from lazyllm import OnlineChatModule
from lazyllm.tools.data.operators.llm_json_ops import SchemaExtractor
model = OnlineChatModule(source='sensenova')
op = SchemaExtractor(model=model)
inputs = [{'text': 'Math score is 95', 'schema': {'subject': 'str', 'score': 'int'}}]
res = op(inputs)
print(res[0]['structured_data'])  # e.g. {'subject': 'Math', 'score': 95}
Source code in lazyllm/tools/data/operators/llm_json_ops.py
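class SchemaExtractor(LLMDataJson, LLMJsonBase):
    """Schema extractor. Uses an LLM to extract structured data from text according to a specified schema (a dict or a Pydantic model).

Args:
    model: A LazyLLM model instance.
    prompt: Optional. A custom extraction prompt.
    input_key: Key of the input text. Defaults to 'text'.
    output_key: Key under which the result is stored in the data dict. Defaults to 'structured_data'.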


Examples:
    ```python
    from lazyllm import OnlineChatModule
    from lazyllm.tools.data.operators.llm_json_ops import SchemaExtractor
    model = OnlineChatModule(source='sensenova')
    op = SchemaExtractor(model=model)
    inputs = [{'text': 'Math score is 95', 'schema': {'subject': 'str', 'score': 'int'}}]
    res = op(inputs)
    print(res[0]['structured_data'])  # e.g. {'subject': 'Math', 'score': 95}
    ```
    """
    _default_prompt = DataPrompt('zh')('schema_extractor')
    _default_inference_kwargs = {
        'max_new_tokens': 1024,
        'temperature': 0.2,
    }
    _default_schema = {'subject': 'subject of the event', 'description': 'detailed description of the event'}

    def __init__(self, model, prompt=None, input_key=None, output_key=None, **kwargs):
        super().__init__(model, prompt, **kwargs)
        self.input_key = input_key or 'text'
        self.output_key = output_key or 'structured_data'

    def _get_schema_dict(self, schema: Union[dict, type]) -> dict:
        if isinstance(schema, dict):
            return schema
        elif isinstance(schema, type) and issubclass(schema, BaseModel):
            return schema.model_json_schema()
        else:
            raise ValueError(
                f'Invalid schema format. Expected dict or BaseModel, got {type(schema)}. '
                f'Received schema: "{schema}"'
            )

    def preprocess(self, data: dict, **kwargs) -> Tuple[dict, dict]:
        text = data.get(self.input_key)
        schema = data.get('schema', self._default_schema)
        if not text:
            raise ValueError(f'Missing required input key "{self.input_key}". Received text: "{text}"')
        schema_dict = self._get_schema_dict(schema)
        return {'text': text, 'schema': str(schema_dict)}, kwargs

    def verify_output(self, output: dict, data: dict) -> bool:
        if not isinstance(output, dict):
            return False
        schema = data.get('schema', self._default_schema)
        if isinstance(schema, type) and issubclass(schema, BaseModel):
            try:
                schema(**output)
                return True
            except ValidationError:
                return False
        for key in schema:
            if key not in output:
                return False
        return True

    def postprocess(self, output: dict, data: dict) -> dict:
        processed_output = {k: v.strip() if isinstance(v, str) else v for k, v in output.items()}
        data[self.output_key] = processed_output
        return data

Data Processing Pipelines

Demo Pipeline

lazyllm.tools.data.pipelines.demo_pipelines

build_demo_pipeline(input_key='text')

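Builds a demo data processing pipeline that contains several example operators and shows how to combine them in a pipeline.

Parameters:

  • input_key (str, default: 'text' ) –

    Name of the text field to process. Defaults to 'text'.

Returns:

A callable pipeline object that, when called, runs its registered operators in order.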

Examples:

from lazyllm.tools.data.pipelines.demo_pipelines import build_demo_pipeline

ppl = build_demo_pipeline(input_key='text')
data = [{'text': 'lazyLLM'}]
res = ppl(data)
print(res)  # demonstrates how operators are combined and applied
Source code in lazyllm/tools/data/pipelines/demo_pipelines.py
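def build_demo_pipeline(input_key='text'):
    """Builds a demo data processing pipeline that contains several example operators and shows how to combine them in a pipeline.

Args:
    input_key (str): Name of the text field to process. Defaults to 'text'.

**Returns:**

    A callable pipeline object that, when called, runs its registered operators in order.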


Examples:
    ```python
    from lazyllm.tools.data.pipelines.demo_pipelines import build_demo_pipeline

    ppl = build_demo_pipeline(input_key='text')
    data = [{'text': 'lazyLLM'}]
    res = ppl(data)
    print(res)  # demonstrates how operators are combined and applied
    ```
    """
    with pipeline() as ppl:
        ppl.build_pre_suffix = demo1.build_pre_suffix(input_key=input_key, prefix='Hello, ', suffix='!')
        ppl.process_uppercase = demo1.process_uppercase(input_key=input_key)
        ppl.add_suffix = demo2.AddSuffix(input_key=input_key, suffix='!!!', _max_workers=4)
        ppl.rich_content = demo2.rich_content(input_key=input_key, _concurrency_mode='single')
    return ppl
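
The idea of running operators in order over a list of records can be shown with plain functions. The sketch below does not use lazyllm: `add_pre_suffix`, `uppercase`, and `run_pipeline` are hypothetical stand-ins, and the behavior of the real demo operators (for example `rich_content`) is not shown in this section and may differ.

```python
from functools import partial


def add_pre_suffix(row, key, prefix, suffix):
    """Return a copy of row with prefix and suffix added around row[key]."""
    return {**row, key: f'{prefix}{row[key]}{suffix}'}


def uppercase(row, key):
    """Return a copy of row with row[key] upper-cased."""
    return {**row, key: row[key].upper()}


def run_pipeline(rows, steps):
    """Apply each step to every row, feeding each step's output to the next."""
    for step in steps:
        rows = [step(row) for row in rows]
    return rows


steps = [
    partial(add_pre_suffix, key='text', prefix='Hello, ', suffix='!'),
    partial(uppercase, key='text'),
]
print(run_pipeline([{'text': 'lazyLLM'}], steps))
```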