
Does data augmentation only work on single-field JSON, and not on multi-field JSON? #813

@GGGsk

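For example, this sample works:
{"text":"包含同音字替换测试:今天天气很好,我们去公园玩."}
but this one, which has an extra field, does not:
{"info":"文本", "text":"包含同音字替换测试:今天天气很好,我们去公园玩."}

Running it fails with the following error: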

2025-11-11 05:33:10.943 | ERROR    | data_juicer.core.data.dj_dataset:317 - An error occurred during Op [nlpcda_zh_mapper].
Traceback (most recent call last):
  File "/data-juicer/data_juicer/core/data/dj_dataset.py", line 297, in process
    dataset, resource_util_per_op = Monitor.monitor_func(op.run, args=run_args)
  File "/data-juicer/data_juicer/core/monitor.py", line 225, in monitor_func
    ret = func()
  File "/data-juicer/data_juicer/ops/base_op.py", line 377, in run
    new_dataset = dataset.map(
  File "/data-juicer/data_juicer/core/data/dj_dataset.py", line 401, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3318, in map
    for rank, done, content in Dataset._map_single(**unprocessed_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3689, in _map_single
    writer.write_batch(batch, try_original_type=try_original_type)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 630, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
  File "pyarrow/table.pxi", line 4868, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 4214, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1 named text expected length 648 but got length 36

The fields end up with different row counts.
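
The ArrowInvalid message says that the columns written back for one batch had different lengths. The following is a minimal sketch of that general constraint in plain Python. It is not Data-Juicer's implementation, and `expand_batch` and `fake_augment` are hypothetical names used only for illustration. It assumes a batch is a dict mapping column names to lists, and shows that when one text column grows through augmentation, every other column has to be repeated the same number of times.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List[object]]


def expand_batch(batch: Batch, text_key: str,
                 augment: Callable[[str], List[str]]) -> Batch:
    """Augment one text column and repeat every other column to match."""
    out: Batch = {key: [] for key in batch}
    for row_idx, text in enumerate(batch[text_key]):
        # Keep the original sample and append its augmented variants.
        variants = [text] + augment(text)
        for key in batch:
            if key == text_key:
                out[key].extend(variants)
            else:
                # Repeat the untouched field once per variant so that
                # all columns end up with the same length.
                out[key].extend([batch[key][row_idx]] * len(variants))
    return out


def fake_augment(text: str) -> List[str]:
    # Hypothetical stand-in for a real augmenter: two dummy variants.
    return [text + " (v1)", text + " (v2)"]


batch = {"info": ["a", "b"], "text": ["x", "y"]}
expanded = expand_batch(batch, "text", fake_augment)
lengths = {key: len(values) for key, values in expanded.items()}
```

If only the `text` column were extended and `info` were left at its original length, the two columns would disagree in length, which is the kind of mismatch the traceback reports.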
