Processing a dataset with Ray appears to hang #810

### Before Asking

- [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) carefully.

- [x] I have pulled the latest code from the main branch and run it again, and the problem still exists.


### Search before asking

- [x] I have searched the Data-Juicer [issues](https://github.com/alibaba/data-juicer/issues) and found no similar questions.


### Question

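When processing a dataset with Ray, 11,783 tasks were created in total. The first 10,000+ tasks finished quickly, but the remaining tasks are extremely slow. The last few tasks in particular have been running for 10 hours without completing, although tasks are still finishing, just very slowly.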
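To make the symptom measurable, here is a minimal sketch of how the slow tail could be singled out from per-task timing records. The `TaskRecord` type, its field names, and the `find_stragglers` helper are all hypothetical and not part of Ray or Data-Juicer. I assume such records could be assembled from Ray's task listing (for example the `ray list tasks` CLI), but that wiring is not shown here.

```python
import statistics
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are assumptions, not a Ray schema."""
    task_id: str
    start_s: float                 # start time in seconds
    end_s: Optional[float] = None  # None while the task is still running


def find_stragglers(tasks: List[TaskRecord], now_s: float,
                    factor: float = 10.0) -> List[TaskRecord]:
    """Return running tasks whose elapsed time exceeds factor x the median finished duration."""
    finished = [t.end_s - t.start_s for t in tasks if t.end_s is not None]
    if not finished:
        return []  # no baseline to compare against
    threshold = factor * statistics.median(finished)
    slow = [t for t in tasks
            if t.end_s is None and now_s - t.start_s > threshold]
    # Longest-running first, so the worst offenders come out on top.
    return sorted(slow, key=lambda t: now_s - t.start_s, reverse=True)


if __name__ == "__main__":
    # Made-up numbers for illustration only: five quick tasks, two slow ones.
    demo = [TaskRecord(f"t{i}", 0.0, 60.0) for i in range(5)]
    demo += [TaskRecord("slow-a", 0.0), TaskRecord("slow-b", 100.0)]
    for task in find_stragglers(demo, now_s=36000.0):
        print(task.task_id, 36000.0 - task.start_s)
```

The threshold is relative to the median finished duration rather than a fixed number of seconds, so the same check works whether a typical task takes seconds or minutes.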

Configuration file:

```yaml
project_name: 'ray-demo'
dataset_path: 'my-dataset/'
export_path: 'ray.jsonl'
export_shard_size: 1073741824
temp_dir: '/tmp'
text_keys: 'content'

use_cache: true
cache_compress: 'gzip'

open_tracer: false
trace_num: 0

executor_type: 'ray'
ray_address: 'auto'

op_fusion: true
fusion_strategy: 'probe'

# process schedule
# a list of several process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:
  - maximum_line_length_filter:
      max_len: 1000
  - average_line_length_filter:
      max_len: 100
  - alphanumeric_filter:
      tokenization: False
      min_ratio: 0.25
  - text_length_filter:
      max_len: 96714
  - words_num_filter:
      min_num: 20
      max_num: 6640
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357
```

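The dataset takes up about 2 TB on disk in Parquet format, and the cluster has 512 nodes. Is this behavior normal, and is there a way to locate where it is stuck?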
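One thing that might be worth ruling out, purely as a guess and not a confirmed cause, is whether a few input files are far larger than the rest, since that could plausibly leave a handful of long-running tasks at the end. A minimal standard-library sketch follows, assuming the Parquet files are reachable as local paths under `dataset_path`. The helper names `list_file_sizes` and `largest_outliers` are made up for illustration.

```python
import os
import statistics
from typing import List, Tuple


def list_file_sizes(root: str, suffix: str = ".parquet") -> List[Tuple[str, int]]:
    """Walk `root` and return (path, size_in_bytes) for every file ending in `suffix`."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                sizes.append((path, os.path.getsize(path)))
    return sizes


def largest_outliers(sizes: List[Tuple[str, int]],
                     ratio: float = 5.0) -> List[Tuple[str, int]]:
    """Return files larger than ratio x the median size, biggest first."""
    if not sizes:
        return []
    median = statistics.median(size for _, size in sizes)
    big = [(path, size) for path, size in sizes if size > ratio * median]
    return sorted(big, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # 'my-dataset/' is the dataset_path from the config above.
    for path, size in largest_outliers(list_file_sizes("my-dataset/")):
        print(f"{size / 2**30:.2f} GiB  {path}")
```

If the data lives on remote or distributed storage rather than a local directory, the walk would need to be replaced with that storage system's own listing call.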

Thanks.

### Additional

_No response_
