Support for HDFS or Iceberg data sources #848

@gao-xiao-long

Description

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

  1. When processing data, can HDFS be used as both the input data source and the output path? For example:

dataset_path: hdfs://mnt/dst/the-pile-philpaper-refine-result.jsonl
export_path: hdfs:/mnt/dst/processed_demo/

  2. If the data is stored in Iceberg, how can Data-Juicer be used to clean it?
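
A side note on the two example paths in the first question: they use different URI forms (`hdfs://mnt/...` and `hdfs:/mnt/...`). The sketch below only shows how Python's standard `urllib.parse` splits these two strings under generic URI rules; it makes no claim about how Data-Juicer or any HDFS client resolves them. The helper name `describe` is hypothetical and used purely for illustration.

```python
from urllib.parse import urlparse


def describe(uri: str) -> dict:
    """Split a URI into scheme, authority (host part), and path."""
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # text between '//' and the next '/'
        "path": parts.path,
    }


# The two paths exactly as written in the question.
dataset = describe("hdfs://mnt/dst/the-pile-philpaper-refine-result.jsonl")
export = describe("hdfs:/mnt/dst/processed_demo/")

print(dataset)
print(export)
```

Under generic URI parsing, the double-slash form treats `mnt` as the authority rather than as the first directory, while the single-slash form keeps `/mnt/dst/processed_demo/` entirely in the path, so the two strings may not point into the same directory tree.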

Additional

No response

Labels

  • dj:core: issues/PRs about the core functions of Data-Juicer
  • dj:dataset: issues/PRs about the dj-dataset
  • question: Further information is requested
