
Processing a dataset with Ray appears to hang #810

@cnlinxi

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

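When processing the dataset with Ray, 11,783 tasks were created in total. The first 10,000+ tasks finished quickly, but the remaining tasks are extremely slow. The last few tasks in particular have been running for 10 hours without completing, although tasks do continue to finish, just slowly.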
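To tell a long tail apart from a true hang, one option is to compare the durations of finished tasks against the median. The sketch below is only an illustration: `find_stragglers`, the `factor` threshold, and the sample durations are hypothetical, and it assumes per-task durations have already been collected somehow (for example, copied from the Ray dashboard).

```python
from statistics import median


def find_stragglers(durations_s, factor=10.0):
    """Return (task_id, duration) pairs slower than `factor` times the median.

    `durations_s` maps a task id to its duration in seconds. Both the helper
    and the default factor are illustrative assumptions, not part of Ray or
    Data-Juicer.
    """
    if not durations_s:
        return []
    threshold = median(durations_s.values()) * factor
    slow = [(tid, d) for tid, d in durations_s.items() if d > threshold]
    # Slowest tasks first, so the worst offenders are easy to spot.
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hypothetical durations: three quick tasks and one that ran for 10 hours.
durations = {"task-0": 12.0, "task-1": 15.0, "task-2": 11.0, "task-3": 36000.0}
stragglers = find_stragglers(durations)
print(stragglers)
```

If only a handful of tasks show up as stragglers, the job is probably skewed rather than deadlocked, and the inputs of those tasks are the first thing worth inspecting.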

Configuration file:

project_name: 'ray-demo'
dataset_path: 'my-dataset/'
export_path: 'ray.jsonl'
export_shard_size: 1073741824
temp_dir: '/tmp'
text_keys: 'content'

use_cache: true
cache_compress: 'gzip'

open_tracer: false
trace_num: 0

executor_type: 'ray'
ray_address: 'auto'

op_fusion: true
fusion_strategy: 'probe'

# process schedule
# a list of several process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:
  - maximum_line_length_filter:
      max_len: 1000
  - average_line_length_filter:
      max_len: 100
  - alphanumeric_filter:
      tokenization: False
      min_ratio: 0.25
  - text_length_filter:
      max_len: 96714
  - words_num_filter:
      min_num: 20
      max_num: 6640
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357

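The dataset takes up about 2 TB of storage in Parquet format, and the job runs on 512 nodes. Is this behavior normal, and is there a way to pinpoint where it is stalling?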
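One possible explanation, which is only a guess at this point, is that the input files are unevenly sized, so a few tasks receive far more data than the rest. A minimal sketch for checking this with the standard library follows. `parquet_size_report` is a hypothetical helper, and it assumes the dataset is reachable as a local or mounted directory at the `dataset_path` from the config above.

```python
import os
from statistics import median


def parquet_size_report(root, top_n=5):
    """Return (median_size_bytes, largest_files) for *.parquet files under `root`.

    `largest_files` is a list of (size_bytes, path) tuples, biggest first.
    A missing or empty directory yields (0, []).
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                path = os.path.join(dirpath, name)
                sizes.append((os.path.getsize(path), path))
    if not sizes:
        return 0, []
    sizes.sort(reverse=True)
    return median(size for size, _path in sizes), sizes[:top_n]


# 'my-dataset/' is the dataset_path from the config; adjust as needed.
med, largest = parquet_size_report("my-dataset/")
print(f"median file size: {med} bytes")
for size, path in largest:
    print(f"{size:>15} bytes  {path}")
```

File size is only a proxy. Files of similar size can still contain a few very long `content` values, which the word-level filters would take much longer to process.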

Thanks.

Additional

No response
