
Wire O_DIRECT also to Uncached I/O #17218

Open: amotin wants to merge 1 commit into master.

Conversation

@amotin (Member) commented Apr 4, 2025

Before Direct I/O was implemented, I had implemented a lighter version I called Uncached I/O. It uses the normal DMU/ARC data path with some optimizations, but evicts data from the caches as soon as possible and reasonable. Originally I wired it only to the primarycache property; this PR completes the integration all the way up to the VFS.

While Direct I/O has the lowest possible memory bandwidth usage, it also has a significant number of limitations: it requires I/Os to be page aligned, does not allow speculative prefetch, etc. Uncached I/O does not have those limitations, but instead requires an additional memory copy, though still one less than regular cached I/O. As such it should fill the gap between the two. Considering this, I've disabled the annoying EINVAL errors on misaligned requests, adding a tunable for those who want to test their applications.

To pass the information between the layers I had to change a number of APIs. But as a side effect, upper layers can now control not only the caching but also speculative prefetch. I haven't wired that part to the VFS yet, since it requires looking into some OS specifics. But while there, I've implemented speculative prefetch of indirect blocks for Direct I/O, controllable via all the same mechanisms.

Fixes #17027

How Has This Been Tested?

Basic read/write tests with and without O_DIRECT, observing proper cache behavior.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin requested a review from bwatkinson on April 4, 2025 19:41
@amotin (Member Author) commented Apr 4, 2025

The change is quite invasive on the DMU/DBUF APIs. In some places it was not obvious what is better or where to stop, so I am open to comments on whether we'd like to change more while we're already there, or less, to keep some parts compatible, if that is even possible at all with such a big change.

@amotin added the Status: Code Review Needed (Ready for review and testing) label on Apr 4, 2025
@tonyhutter (Contributor) commented:

Would it make sense to activate this when the user does an fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_SHORT)? I believe we could key off the file's inode->i_write_hint value on the kernel side.

RWF_UNCACHED also sounds similar to this concept, but I don't think it's merged into the kernel yet.
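For reference, a minimal userspace sketch of the hint described above (not part of this PR; the file path is hypothetical, and the fallback defines cover older libc headers that lack the Linux 4.13+ constants):

```c
/*
 * Hypothetical illustration: an application marks a file's data as
 * short-lived via F_SET_RW_HINT; the kernel stores the value in
 * inode->i_write_hint, which ZFS could inspect on its side.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

#ifndef F_SET_RW_HINT			/* Linux 4.13+ */
#define	F_SET_RW_HINT		1036
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define	RWH_WRITE_LIFE_SHORT	2
#endif

int
main(void)
{
	int fd = open("/tank/ds/scratch.dat", O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	uint64_t hint = RWH_WRITE_LIFE_SHORT;	/* "this data is short-lived" */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");

	/* ... subsequent writes carry the hint on the inode ... */
	return (0);
}
```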

@amotin (Member Author) commented Apr 4, 2025

@tonyhutter I am not familiar with that fcntl, but it sounds possible as a next step. I was thinking about posix_fadvise() or something like it.

@tonyhutter (Contributor) commented:

Ah yes, posix_fadvise() makes sense too.
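For comparison, a sketch of what an application can already express with posix_fadvise(); neither advice value is wired to Uncached I/O by this PR, and the file path is hypothetical:

```c
/*
 * Hypothetical illustration: advise the kernel that a file's data will
 * not be reused, which a follow-up change could map onto Uncached I/O.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int fd = open("/tank/ds/bigfile.dat", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* "I will read this data once; there is no point caching it." */
	int err = posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	if (err != 0)
		fprintf(stderr, "posix_fadvise(NOREUSE): %s\n", strerror(err));

	/* ... read the file ... */

	/* "Drop whatever was cached for this range." */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (err != 0)
		fprintf(stderr, "posix_fadvise(DONTNEED): %s\n", strerror(err));
	return (0);
}
```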

@amotin force-pushed the direct branch 4 times, most recently from 7579173 to ef0c865 on April 7, 2025 19:20
@amotin (Member Author) commented Apr 7, 2025

F_SET_RW_HINT and inode->i_write_hint seem to be about on-disk lifetime, not saying anything about caching. Maybe they could be used for some allocation policies to reduce pool fragmentation, but not in this PR. RWF_UNCACHED indeed looks very similar.

@bwatkinson (Contributor) commented:

So, I am not particularly in favor of strict checking being disabled by default. I think we already muddy the waters enough when a write I/O is PAGE_SIZE aligned but not recordsize aligned and we just issue it through the ARC with direct=standard. The strict alignment should be observed in general. Users asked for direct I/O with O_DIRECT, and they should get EINVAL returned when they do not meet the only requirement we impose, which is PAGE_SIZE alignment.

My concern is that people expect one thing and transparently get another. This leads to confusion and makes ZFS seem like it is not doing what they strictly asked it to do (do not make any copies of my data).

I also don't like the idea that prefetching can happen if a user explicitly requested O_DIRECT for data blocks. When a user informs us of their intent with O_DIRECT, they are seriously saying (or should be) that they have no desire to have FS caching take place for data. If we disregard this, then prefetching becomes a nightmare for reads.

I think that if we decided to add another dataset property value for direct, such as maybe relaxed, that might make sense with this. However, the default should still be standard in my mind. Truth be told, we only added direct=always for Lustre.

I don't think we need to complicate things more. This should be an opt-in feature, if anything, with maybe a new dataset value for direct.

@amotin (Member Author) commented Apr 11, 2025

> I think we already muddy the waters enough when a write I/O is PAGE_SIZE aligned but not recordsize aligned and we just issue it through the ARC with direct=standard.

My patch actually improves this case, evicting the data from the DBUF and ARC caches as soon as possible, reducing cache effects as the user asked.

> The strict alignment should be observed in general. Users asked for direct I/O with O_DIRECT, and they should get EINVAL returned when they do not meet the only requirement we impose, which is PAGE_SIZE alignment.

I don't think it has to be strict, other than for software testing. The Linux man page says: "The handling of misaligned O_DIRECT I/Os also varies; they can either fail with EINVAL or fall back to buffered I/O," so relaxed behavior is not a violation.

> My concern is that people expect one thing and transparently get another. This leads to confusion and makes ZFS seem like it is not doing what they strictly asked it to do (do not make any copies of my data).

We've already found several examples of software that has no idea about alignment but uses O_DIRECT, including something as general as systemd extensions. Considering they are "broken", I bet most typical Linux file systems do not enforce alignment, and for us being strict just increases suffering. I am not interested in fixing all the software in the world, but I am happy to provide a testing tool in the shape of a tunable.

> I also don't like the idea that prefetching can happen if a user explicitly requested O_DIRECT for data blocks. When a user informs us of their intent with O_DIRECT, they are seriously saying (or should be) that they have no desire to have FS caching take place for data. If we disregard this, then prefetching becomes a nightmare for reads.

The data prefetch will only activate on misaligned I/Os, so obviously the user already does not know what he is doing. It is trivial to actually disable it now, if you insist. But I considered that somebody might use O_DIRECT with Direct I/O disabled via the module parameter, just to reduce cache thrashing on some parts of a workload, and prefetch really gives a performance improvement in many cases, even with NVMe pools, if the application queue depth is insufficient.

> I think that if we decided to add another dataset property value for direct, such as maybe relaxed, that might make sense with this. However, the default should still be standard in my mind.

I was actually thinking about an additional relaxed value for the direct property, but the present code asserts on any unknown property value, so adding any new one would be a pain, with new pool features added, etc. I personally could live with zfs_dio_strict=1 by default, overriding it locally, but I can't agree that it is productive for the mass market.

> Truth be told, we only added direct=always for Lustre.

It is likely the only way to use Direct I/O with SMB and NFS as well, since they have no concept of alignment in the protocol. So it seems the world is somewhat less "perfect" than we'd like. ;)

@adamdmoss (Contributor) commented Apr 21, 2025

Quick question:

> Originally I wired it only to the primarycache property

Do you intend to keep this wired to primarycache!=all in addition to the directIO-is-misaligned paths?

(My reading of the code changes suggests not, but my reading of the discussion suggests yes, so I'm left uncertain!)

@amotin (Member Author) commented Apr 21, 2025

>> Originally I wired it only to the primarycache property

> Do you intend to keep this wired to primarycache != all in addition to the directIO-is-misaligned paths?

Yes. Exactly the same effect is expected from any of: primarycache=metadata, O_DIRECT + zfs_dio_enabled=0, direct=always + zfs_dio_enabled=0, O_DIRECT + misaligned buffers, or direct=always + misaligned buffers.
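To make the "O_DIRECT + misaligned buffers" case concrete, a small hypothetical test (the file path is an assumption): with zfs_dio_strict=1 the read below should fail with EINVAL, while with the relaxed default from this PR it is served through Uncached I/O:

```c
/* Deliberately misaligned O_DIRECT read for testing the fallback. */
#define	_GNU_SOURCE			/* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/tank/ds/testfile";
	int fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open(O_DIRECT)");
		return (1);
	}

	char *raw = malloc(8192 + 1);
	if (raw == NULL)
		return (1);
	char *buf = raw + 1;		/* guaranteed not page-aligned */

	ssize_t n = pread(fd, buf, 4096, 0);
	if (n < 0)
		perror("pread");	/* EINVAL if strict alignment is enforced */
	else
		printf("read %zd bytes via the uncached fallback\n", n);

	free(raw);
	close(fd);
	return (n < 0);
}
```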

@amotin (Member Author) commented Apr 21, 2025

Just fixed the comment typo and rebased.

@robn (Member) commented Apr 26, 2025

I was thinking about this today and then realised I’d misunderstood exactly what direct=standard does (and that is why #17027 surprised me). I thought direct=standard was "try direct, fall back otherwise", not caring about alignment, but I think this is what we have (tell me if I’m wrong):

| direct=  | aligned | unaligned | O_DIRECT+aligned | O_DIRECT+unaligned |
|----------|---------|-----------|------------------|--------------------|
| disabled | ARC     | ARC       | ARC              | ARC                |
| standard | ARC     | ARC       | direct           | EINVAL             |
| always   | direct  | ARC       | direct           | ARC                |

So it seems there are two variables here:

  • what to do if a non-direct request with correct alignment for direct arrives (column 1)
  • what to do if a direct request with incorrect alignment for direct arrives (column 4)

If both were configurable, direct= might have the following values:

| direct=        | aligned | unaligned | O_DIRECT+aligned | O_DIRECT+unaligned |
|----------------|---------|-----------|------------------|--------------------|
| disabled       | ARC     | ARC       | ARC              | ARC                |
| relaxed        | ARC     | ARC       | direct           | ARC                |
| strict         | ARC     | ARC       | direct           | EINVAL             |
| always+relaxed | direct  | ARC       | direct           | ARC                |
| always+strict  | direct  | ARC       | direct           | EINVAL             |

So if I’m understanding this PR, the new zfs_dio_strict tunable effectively toggles the meaning of standard between relaxed and strict, right?


With my sysadmin hat on, I think I prefer relaxed by default, as a kind of principle of least surprise.

> Users asked for direct I/O with O_DIRECT, and they should get EINVAL returned when they do not meet the only requirement we impose, which is PAGE_SIZE alignment.

It feels like there are two things going on here.

One is that the user didn't ask for O_DIRECT, but the application did. If it fails, is it going to be clear to the user what to do next? Can they find out that the problem is that the call is misaligned? Do they even have an option? What if the software isn't configurable?

The other is about "the only requirement we impose". Until we added support for STATX_DIOALIGN, there wasn't even a way for an application to discover this, and even then it's still a relatively recent facility. Without that, we have no way of communicating alignment requirements to applications.
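For completeness, a sketch of the STATX_DIOALIGN query mentioned above (it assumes Linux 6.1+ and reasonably new glibc/kernel headers for the stx_dio_* fields; the path argument is arbitrary):

```c
/* Query a file's Direct I/O alignment requirements via statx(). */
#define	_GNU_SOURCE
#include <fcntl.h>		/* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>		/* statx(), STATX_DIOALIGN */

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_DIOALIGN, &stx) != 0) {
		perror("statx");
		return (1);
	}
	if (stx.stx_dio_mem_align == 0) {
		/* Filesystem reports no O_DIRECT alignment information. */
		puts("no STATX_DIOALIGN data for this file");
	} else {
		printf("O_DIRECT needs %u-byte memory and %u-byte offset alignment\n",
		    stx.stx_dio_mem_align, stx.stx_dio_offset_align);
	}
	return (0);
}
```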

It seems unsurprising that there is O_DIRECT-using software doing it "wrong". By forcing strict behavior, all we're doing is punishing the users of those applications, possibly with no recourse. For the ones doing it "right", none of this affects them.

So it seems like "work, but slower" is a kinder thing to do. Not to mention that that's historically what OpenZFS has always done with O_DIRECT.


I don’t know if we actually want all the property values described above (or whether we could even change the property values we currently have without causing problems, though I have an idea for that). But at least I think switching standard over to “relaxed” is probably the right thing to do.

If we do want more visibility, we could have a kstat counter noting the number of times a direct I/O request was redirected to the ARC for service due to misalignment. It’s not super obvious of course, but combined with a system call trace on the process, it at least gives the operator the ability to see what’s going on.

Labels
Status: Code Review Needed (Ready for review and testing)
Development

Successfully merging this pull request may close these issues.

systemd-sysext fails to install extensions with Direct I/O
5 participants