ext4: support bigger blkdev block size #241

Open
wants to merge 3 commits into
base: master


mcimerman
Contributor

@mcimerman mcimerman commented Jan 25, 2025

Allows ext4 to be used with block devices that have block size > EXT4_SUPERBLOCK_SIZE (1K).

Currently, ext4 "supports" only block devices whose block size is smaller than or equal to the superblock size. ("Supports" is in quotes because it gets EPERM when reading or writing past the allocated buffer.)

This currently fails (blkdev bsize is set to 4K with -b to file_bd):

# mkfile -s 100M /tmp/file1

# /srv/bd/file_bd -b 4096 /tmp/file1 disk1

# mkext4 disk1

However, the block size of the block device must still be at most the block size of the filesystem. The current maximum is 4096, which is also the default.

This will be useful when, for example, NVMe support is implemented in the future, or for anything else that uses bigger block sizes.
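The problem described above can be illustrated with a minimal sketch. It is not the code from this PR: `fake_bd_t`, `fake_bd_read()` and `read_superblock()` are hypothetical names, and the device is just an in-memory array. The only facts taken from ext4 are that the superblock starts 1 KiB into the device and is 1 KiB long. The idea is to read every whole device block that overlaps the superblock into a temporary buffer and copy out the 1 KiB that is needed, so the code works for 512-byte and 4 KiB device blocks alike.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SB_OFFSET 1024u /* ext4 superblock starts 1 KiB into the device */
#define SB_SIZE   1024u /* and is 1 KiB long */

/* Hypothetical in-memory stand-in for a block device. */
typedef struct {
	const uint8_t *data;
	size_t bsize;   /* device block size in bytes */
	size_t nblocks; /* number of blocks on the device */
} fake_bd_t;

/* Read cnt whole blocks starting at block ba (the only primitive assumed). */
int fake_bd_read(const fake_bd_t *bd, uint64_t ba, size_t cnt, void *dst)
{
	if (ba > bd->nblocks || cnt > bd->nblocks - ba)
		return -1;
	memcpy(dst, bd->data + ba * bd->bsize, cnt * bd->bsize);
	return 0;
}

/* Copy the superblock into sb, whatever the device block size is. */
int read_superblock(const fake_bd_t *bd, uint8_t sb[SB_SIZE])
{
	assert(bd->bsize > 0);

	/* First and last device block that overlap the superblock. */
	uint64_t first = SB_OFFSET / bd->bsize;
	uint64_t last = (SB_OFFSET + SB_SIZE - 1) / bd->bsize;
	size_t cnt = (size_t)(last - first + 1);

	/* The temporary buffer holds whole device blocks, not just 1 KiB. */
	uint8_t *tmp = malloc(cnt * bd->bsize);
	if (tmp == NULL)
		return -1;

	int rc = fake_bd_read(bd, first, cnt, tmp);
	if (rc == 0)
		memcpy(sb, tmp + (SB_OFFSET - first * bd->bsize), SB_SIZE);

	free(tmp);
	return rc;
}
```

With a 4 KiB device block the superblock sits at offset 1024 inside block 0; with 512-byte blocks it spans blocks 2 and 3. Sizing the buffer from the device block size, rather than from the superblock size, is what avoids running past the allocation.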

@le-jzr
Contributor

le-jzr commented Jan 25, 2025

Nice. Out of curiosity, do you think there's any value in device blocks even being exposed in the interface like that?
IMO it would be simpler and more efficient to present block devices like normal random access files and let the driver take care of the gritty details.

@mcimerman
Contributor Author

mcimerman commented Jan 25, 2025

> do you think there's any value in device blocks even being exposed in the interface like that?

Do you mean /srv/bd/file_bd -b 4096 /tmp/file1 disk1? I think the file backed block device is great for debugging.

Edit: Sorry, you probably mean all of the block devices. Got to think about that :-D

@le-jzr
Contributor

le-jzr commented Jan 25, 2025

I mean the API between a file system and a block device. Like, why should the ext4 driver even care what the physical block size is?

Sorry I was being ambiguous. :)

@mcimerman
Contributor Author

mcimerman commented Jan 25, 2025

Yes, I get you.

That would indeed make a lot of things a lot simpler. And it doesn't actually sound that hard to realize.

Just rewrite libblock (and convert everything) to use the classic file API. Or is there some pitfall I don't see?

Edit: Well, probably the block caching will have to be hidden inside the file API, which might not be very easy to do, and it's quite nice that filesystems can manipulate the blocks on a lower level, set some flags or mark them dirty, etc.

@le-jzr
Contributor

le-jzr commented Jan 25, 2025

> Or is there some pitfall I don't see?

I can't think of anything, but you seem to be more intimately familiar with the code right now, so thought I'd ask you. 😄

Anyway, right now I'm trying to finish implementing a completely new way to do IPC, and if that works out like I want, then we'll have the opportunity to rework some IPC protocols as they are being migrated to the new shiny thing. Good reason to think about how the seams between layers should work and where there are opportunities for improvement.

@le-jzr
Contributor

le-jzr commented Jan 25, 2025

> Edit: Well, probably the block caching will have to be hidden inside the file API, which might not be very easy to do, and it's quite nice that filesystems can manipulate the blocks on a lower level, set some flags or mark them dirty, etc.

The comparison to files was just illustrative. Obviously it would still be its own protocol with its own nuances, but ability to read/write with arbitrary granularity (and size) like files do would be nice.

Though you make a good point. I'm not actually familiar with the extent of operations that can be done on blocks. But now that I think about it, "trim" would be a good example? Since that can only work on whole blocks, and definitely sounds like something we want to support (eventually).

EDIT: So maybe the right interface would have flexible reads/writes, but the intrinsically blocky operations would still have a block-based interface that the file system can exploit if it wants to.
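To make the "trim only works on whole blocks" point concrete, here is a small sketch of how a byte-granular request could be narrowed down to the blocks it fully covers. The names `block_range_t` and `trim_whole_blocks()` are made up for illustration and are not part of libblock or any existing interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: a run of whole device blocks. */
typedef struct {
	uint64_t first; /* index of the first whole block */
	uint64_t count; /* number of whole blocks */
} block_range_t;

/*
 * Shrink the byte range [off, off + len) to the device blocks that lie
 * entirely inside it. Returns false when no whole block is covered, in
 * which case there is nothing that could be trimmed.
 */
bool trim_whole_blocks(uint64_t off, uint64_t len, uint64_t bsize,
    block_range_t *out)
{
	assert(bsize > 0);

	if (len > UINT64_MAX - off)
		return false; /* range would wrap around */

	/* Round the start up and the end down to block boundaries. */
	uint64_t first = off / bsize + (off % bsize != 0 ? 1 : 0);
	uint64_t end = (off + len) / bsize;

	if (end <= first)
		return false;

	out->first = first;
	out->count = end - first;
	return true;
}
```

For example, with 4096-byte blocks a request for bytes 1000 to 10999 fully covers only block 1, so only that block would be eligible. The partially covered blocks at both ends have to be left alone, which is why such an operation stays block-based even if reads and writes become byte-granular.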

@mcimerman
Contributor Author

> seem to be more intimately familiar with the code right now, so thought I'd ask you

I've been working on HelenRAID for a few months now, but there I use direct block reads and writes, just forwarding the final reads or writes that come to the array. I only just started looking at filesystems, because I would like to do some performance evaluation on a "real" workload, with files and whatnot. The filesystems use libblock more extensively, with caching and everything, and I don't fully understand that part yet, nor the filesystems themselves.

> new way to do IPC

Sounds great, also the "non-redundant" copying would be great to have, at least for block devices and big IO. I am interested in the new IPC you are working on, do you have any notes or something I can read to try to understand your new proposal? I am not very experienced in that area, but there is a chance I can help :-D

@le-jzr
Contributor

le-jzr commented Jan 26, 2025

No formal proposal as of yet, since I'm still figuring out how to implement what I want and details keep changing to accommodate. :)

But currently, the broad strokes look like this: Task creates an IPC queue, then it creates a uniquely identified IPC endpoint on that queue (possibly one for each unique resource handled by the server, e.g. each open file). The endpoint makes its way to a client via IPC (similar to what IPC_CONNECT__ calls do now, but with more control and granularity). The client makes calls on the endpoint, providing its own endpoint as a return address.

The main feature is that endpoints represent individual resources managed by servers and can be transferred between tasks arbitrarily. They basically act like references to OOP objects that live in another task.

Anyway, the messages sent to an endpoint are fixed size, so to send a random bucket of data, you explicitly write the data to a kernel buffer (immutable once created), then send a reference to that buffer to the other party, who can then read chunks out of it. The IPC forward call function is replaced simply by being able to pass the received buffer reference to another task without touching it. Possibly more than once for different parts.
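As a rough mental model of the buffer part only, here is a toy user-space sketch. Nothing in it is the real design, which does not exist in final form yet: `kbuf_t`, `kbuf_ref_t` and all the functions are invented names, and a real implementation would live in the kernel with handles rather than pointers. It only shows the three properties mentioned above: the buffer is filled once at creation, a reference is small enough to travel in a fixed-size message, and a reference can be narrowed and handed on without touching the data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical immutable buffer: contents are set once, at creation. */
typedef struct {
	uint8_t *data;
	size_t size;
} kbuf_t;

/* Hypothetical reference to (part of) a buffer; small and fixed size. */
typedef struct {
	const kbuf_t *buf;
	size_t off;
	size_t len;
} kbuf_ref_t;

/* The sender copies its data in; nobody modifies it afterwards. */
kbuf_t *kbuf_create(const void *src, size_t size)
{
	kbuf_t *b = malloc(sizeof(*b));
	if (b == NULL)
		return NULL;
	b->data = malloc(size > 0 ? size : 1);
	if (b->data == NULL) {
		free(b);
		return NULL;
	}
	if (size > 0)
		memcpy(b->data, src, size);
	b->size = size;
	return b;
}

void kbuf_destroy(kbuf_t *b)
{
	if (b != NULL) {
		free(b->data);
		free(b);
	}
}

/* Reference covering the whole buffer. */
kbuf_ref_t kbuf_whole(const kbuf_t *b)
{
	kbuf_ref_t ref = { b, 0, b->size };
	return ref;
}

/* Narrow a reference to a sub-range; models forwarding "different parts". */
bool kbuf_ref_slice(kbuf_ref_t ref, size_t off, size_t len, kbuf_ref_t *out)
{
	if (off > ref.len || len > ref.len - off)
		return false;
	out->buf = ref.buf;
	out->off = ref.off + off;
	out->len = len;
	return true;
}

/* The receiver reads a chunk; returns the number of bytes copied. */
size_t kbuf_ref_read(kbuf_ref_t ref, size_t off, void *dst, size_t len)
{
	if (off >= ref.len)
		return 0;
	if (len > ref.len - off)
		len = ref.len - off;
	memcpy(dst, ref.buf->data + ref.off + off, len);
	return len;
}
```

In this model, forwarding is just copying a `kbuf_ref_t` (or a slice of it) into the next message; the payload itself is never copied again, which is roughly the "pass the received buffer reference without touching it" idea.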

Allows ext4 to be used with block devices that have
block size > EXT4_SUPERBLOCK_SIZE (1K).
@mcimerman mcimerman force-pushed the ext4-big-blkdev-bsize branch from afe8ff5 to cb747b3 Compare January 27, 2025 15:11
@mcimerman
Contributor Author

> Obviously it would still be its own protocol with its own nuances, but ability to read/write with arbitrary granularity (and size) like files do would be nice.

Check block_read_bytes_direct(). Writes could be done the same way, but unaligned writes would then have to be dealt with as in the bug above: the affected block(s) would have to be read first.
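A sketch of what "the block(s) would have to be read first" means for unaligned writes, i.e. a read-modify-write loop. This is not libblock code: `mem_bd_t`, `mem_bd_read_block()`, `mem_bd_write_block()` and `write_bytes()` are hypothetical names over an in-memory device, assuming the device can only transfer whole blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory block device that only transfers whole blocks. */
typedef struct {
	uint8_t *data;
	size_t bsize;
	size_t nblocks;
} mem_bd_t;

int mem_bd_read_block(const mem_bd_t *bd, uint64_t ba, void *dst)
{
	if (ba >= bd->nblocks)
		return -1;
	memcpy(dst, bd->data + ba * bd->bsize, bd->bsize);
	return 0;
}

int mem_bd_write_block(mem_bd_t *bd, uint64_t ba, const void *src)
{
	if (ba >= bd->nblocks)
		return -1;
	memcpy(bd->data + ba * bd->bsize, src, bd->bsize);
	return 0;
}

/* Write len bytes at byte offset off, with no alignment requirement. */
int write_bytes(mem_bd_t *bd, uint64_t off, const void *src, size_t len)
{
	assert(bd->bsize > 0);

	const uint8_t *p = src;
	uint8_t *tmp = malloc(bd->bsize);
	if (tmp == NULL)
		return -1;

	int rc = 0;
	while (len > 0) {
		uint64_t ba = off / bd->bsize;
		size_t in_off = (size_t)(off % bd->bsize);
		size_t chunk = bd->bsize - in_off;
		if (chunk > len)
			chunk = len;

		/* Partial block: fetch the old contents so that the bytes
		 * outside the written range survive the write-back. */
		if (chunk != bd->bsize) {
			rc = mem_bd_read_block(bd, ba, tmp);
			if (rc != 0)
				break;
		}

		memcpy(tmp + in_off, p, chunk);
		rc = mem_bd_write_block(bd, ba, tmp);
		if (rc != 0)
			break;

		off += chunk;
		p += chunk;
		len -= chunk;
	}

	free(tmp);
	return rc;
}
```

Only the first and last block of a request can be partial, so at most two extra reads are needed per write. Fully covered blocks in the middle are written without reading them first.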


2 participants