
Can I use JuiceFS for storing sparse files? #5675

Open
liyimeng opened this issue Feb 18, 2025 · 6 comments
Labels
kind/question Further information is requested

Comments

liyimeng commented Feb 18, 2025

I am wondering if JuiceFS is a good fit for storing sparse files.
According to #2637, it seems JuiceFS has limited support for sparse-file features. If I put a sparse file into JuiceFS, will JuiceFS upload the file at its logical size to the backend storage, or only its physical size? In other words, does JuiceFS fill the storage with lots of zeros, or skip them for efficiency? If it is the latter case, what happens to the usage accounting? Can JuiceFS correctly calculate the real usage?
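For context, the logical-vs-physical distinction the question hinges on can be seen with a quick sketch on a local filesystem using Python's `os.stat` (whether a JuiceFS mount reports the same split is exactly what is being asked here):

```python
import os
import tempfile

# Create a 10 MiB sparse file: seek past a hole, write a single byte.
fd, path = tempfile.mkstemp()
os.lseek(fd, 10 * 1024 * 1024 - 1, os.SEEK_SET)
os.write(fd, b"x")
os.close(fd)

st = os.stat(path)
logical = st.st_size            # the "virtual" size: 10 MiB
physical = st.st_blocks * 512   # bytes actually allocated on disk
print(logical, physical)

os.remove(path)
```

On a hole-aware local filesystem `physical` is a few KiB while `logical` is the full 10 MiB; `du` vs `du --apparent-size` show the same two numbers.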

@liyimeng liyimeng added the kind/question Further information is requested label Feb 18, 2025
@liyimeng
Author

Sorry, I was in a bit of a rush pushing out the question. The issue I linked mentions a similar problem in GlusterFS, which deals exactly with sparse files.

So my question becomes:

if #3898 is merged, will JuiceFS be able to upload to backend storage, like S3, efficiently? I got the impression from this comment that GNU tools can be affected if we use cp on a file inside a mounted JuiceFS. But what I really care about is the path between JuiceFS and the backend storage, where bandwidth and storage efficiency matter.

Any insights or suggestions?

Thanks a lot in advance!
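For what it's worth, the cp-side behavior discussed here comes down to `lseek` with `SEEK_DATA`/`SEEK_HOLE`. A minimal sketch of how a copier enumerates data extents (Linux-specific `os.SEEK_DATA`/`os.SEEK_HOLE`; on a filesystem without hole support, the kernel's generic fallback reports the whole file as one data extent):

```python
import errno
import os
import tempfile

def data_extents(fd):
    """Yield (offset, length) for the data regions of an open file,
    skipping holes via SEEK_DATA / SEEK_HOLE."""
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # no more data past this offset
                return
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield (start, end - start)
        offset = end

# Demo: a 1 MiB hole followed by 4 KiB of data.
fd, path = tempfile.mkstemp()
os.lseek(fd, 1024 * 1024, os.SEEK_SET)
os.write(fd, b"x" * 4096)
extents = list(data_extents(fd))
os.close(fd)
os.remove(path)
print(extents)
```

A hole-aware cp only reads and writes these extents, which is why a FUSE filesystem that doesn't implement `SEEK_HOLE`/`SEEK_DATA` forces it back to a full scan.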

liyimeng commented Feb 19, 2025

I also observed some differences between the local-fs and S3 backends:

  • When using the du command, the local-fs backend reports a sparse file's physical size properly, while the S3 backend always reports the virtual (logical) size.
  • The first attempt to copy a sparse file into JuiceFS always results in a corrupted file; it must be deleted and copied again. With the S3 backend, verification of the copied file always fails and it is treated as corrupted in my case.

Why do different backends make such a big difference?

@liyimeng
Author

Another finding: JuiceFS seems to handle sparse files dramatically differently depending on whether the file is in qcow2 or raw format. What could cause such a difference?

@jiefenghuang
Contributor

Sorry, I was in a bit of a rush pushing out the question. The issue I linked mentions a similar problem in GlusterFS, which deals exactly with sparse files.

So my question becomes:

if #3898 is merged, will JuiceFS be able to upload to backend storage, like S3, efficiently? I got the impression from this comment that GNU tools can be affected if we use cp on a file inside a mounted JuiceFS. But what I really care about is the path between JuiceFS and the backend storage, where bandwidth and storage efficiency matter.

Any insights or suggestions?

Thanks a lot in advance!

For now, according to #2637, JuiceFS doesn't support seek_hole/seek_data for sparse-file copies (e.g. the cp command).
Implementing a complete lseek (supporting SEEK_HOLE, SEEK_DATA) may not be suitable for general scenarios. However, optimizations could be considered during write operations. For example, even for PLAIN_SCANTYPE scan-based copies, enabling zero-block verification in specific directories could skip the copying of such data blocks, reducing unnecessary write overhead. This way, S3 would not need to store these zero blocks.
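A rough illustration of the zero-block-verification idea described above (the function name and the 4 MiB block size are assumptions for the sketch, not JuiceFS's actual write path):

```python
# Assumed block size for the sketch; JuiceFS splits chunks into blocks
# of up to 4 MiB before uploading them to object storage.
BLOCK_SIZE = 4 * 1024 * 1024

def blocks_to_upload(data: bytes):
    """Split data into fixed-size blocks and skip those that are all
    zeros. Yields (index, block) pairs for blocks the backend must
    store; all-zero blocks are left implicit, so S3 never sees them."""
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        if block.count(0) == len(block):  # entirely zero: skip the PUT
            continue
        yield (i // BLOCK_SIZE, block)

# Demo: a 12 MiB buffer where only the middle block contains data.
buf = bytearray(12 * 1024 * 1024)
buf[5 * 1024 * 1024] = 1
kept = [idx for idx, _ in blocks_to_upload(bytes(buf))]
print(kept)  # → [1]: blocks 0 and 2 are never uploaded
```

The per-block zero check is what makes this "too expensive for general scenarios": every write pays a full scan of the block even when the data is dense.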

@liyimeng
Author

@jiefenghuang So #3898 does have value: even if tools like cp don't fully benefit, depending on the backend storage, communication between FUSE and the backend storage is significantly reduced. Why don't we just get it merged?

@jiefenghuang
Contributor

@jiefenghuang So #3898 does have value: even if tools like cp don't fully benefit, depending on the backend storage, communication between FUSE and the backend storage is significantly reduced. Why don't we just get it merged?

It is too expensive for general scenarios; fyi #3924
