
Wire O_DIRECT also to Uncached I/O #17218

Open: amotin wants to merge 1 commit into master.

Conversation

@amotin (Member) commented Apr 4, 2025

Before Direct I/O was implemented, I had implemented a lighter version I called Uncached I/O. It uses the normal DMU/ARC data path with some optimizations, but evicts data from the caches as soon as possible and reasonable. Originally I wired it only to the primarycache property; this PR completes the integration all the way up to the VFS.

While Direct I/O has the lowest possible memory bandwidth usage, it also has a significant number of limitations: it requires I/Os to be page aligned, does not allow speculative prefetch, etc. Uncached I/O does not have those limitations, but instead requires an additional memory copy, though still one less than regular cached I/O. As such it should fill the gap between the two. Considering this, I've disabled the annoying EINVAL errors on misaligned requests, adding a tunable for those who want to test their applications.

To pass the information between the layers I had to change a number of APIs. But as a side effect, upper layers can now control not only the caching but also speculative prefetch. I haven't wired that part to the VFS yet, since it requires looking into some OS specifics. But while there, I've implemented speculative prefetch of indirect blocks for Direct I/O, controllable via all the same mechanisms.

Fixes #17027

How Has This Been Tested?

Basic read/write tests with and without O_DIRECT, observing proper cache behavior.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin requested a review from bwatkinson on April 4, 2025 19:41
@amotin (Member Author) commented Apr 4, 2025

The change is quite invasive on the DMU/DBUF APIs. In some places it was not obvious what is better or where to stop, so I am open to comments on whether we'd like to change more while we're already there, or less, to keep some parts compatible, if that is even possible at all with such a big change.

@amotin added the Status: Code Review Needed (Ready for review and testing) label on Apr 4, 2025
@tonyhutter (Contributor) commented:

Would it make sense to activate this when the user does an fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_SHORT)? I believe we could key off the file's inode->i_write_hint value on the kernel side.

RWF_UNCACHED also sounds similar to this concept, but I don't think it's merged into the kernel yet.
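For reference, a minimal userspace sketch of the hint described above (not part of this PR; the file path is hypothetical, and the fallback defines cover older libc headers that lack the Linux 4.13+ constants):

```c
/*
 * Hypothetical illustration: an application marks a file's data as
 * short-lived via F_SET_RW_HINT; the kernel stores the value in
 * inode->i_write_hint, which ZFS could inspect on its side.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

#ifndef F_SET_RW_HINT			/* Linux 4.13+ */
#define	F_SET_RW_HINT		1036
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define	RWH_WRITE_LIFE_SHORT	2
#endif

int
main(void)
{
	int fd = open("/tank/ds/scratch.dat", O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	uint64_t hint = RWH_WRITE_LIFE_SHORT;	/* "this data is short-lived" */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");

	/* ... subsequent writes carry the hint on the inode ... */
	return (0);
}
```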

@amotin (Member Author) commented Apr 4, 2025

@tonyhutter I am not familiar with that fcntl, but it sounds possible as a next step. I was thinking about posix_fadvise() or something like it.

@tonyhutter (Contributor) commented:

Ah yes, posix_fadvise() makes sense too.
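For comparison, a sketch of what an application can already express with posix_fadvise(); neither advice value is wired to Uncached I/O by this PR, and the file path is hypothetical:

```c
/*
 * Hypothetical illustration: advise the kernel that a file's data will
 * not be reused, which a follow-up change could map onto Uncached I/O.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int fd = open("/tank/ds/bigfile.dat", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* "I will read this data once; there is no point caching it." */
	int err = posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	if (err != 0)
		fprintf(stderr, "posix_fadvise(NOREUSE): %s\n", strerror(err));

	/* ... read the file ... */

	/* "Drop whatever was cached for this range." */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (err != 0)
		fprintf(stderr, "posix_fadvise(DONTNEED): %s\n", strerror(err));
	return (0);
}
```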

@amotin force-pushed the direct branch 4 times, most recently from 7579173 to ef0c865 on April 7, 2025 19:20
@amotin (Member Author) commented Apr 7, 2025

F_SET_RW_HINT and inode->i_write_hint seem to be about on-disk lifetime, not saying anything about caching. Maybe they could be used for some allocation policies to reduce pool fragmentation, but not in this PR. RWF_UNCACHED indeed looks very similar.

@bwatkinson (Contributor) commented:

So, I am not particularly in favor of strict checking being disabled by default. I think we already muddy the waters enough when a write I/O is PAGE_SIZE aligned but not recordsize aligned and we just issue it through the ARC with direct=standard. The strict alignment should be observed in general. Users asked for direct I/O with O_DIRECT, and they should get EINVAL returned when they do not meet the only requirement we impose, which is PAGE_SIZE alignment.

My concern is that people expect one thing and transparently get another. This leads to confusion and makes ZFS seem like it is not doing what they strictly asked it to do (do not make any copies of my data).

I also don't like the idea that prefetching can happen if a user explicitly requested O_DIRECT for data blocks. When a user informs us of their intent with O_DIRECT, they are seriously saying (or should be) that they have no desire to have FS caching take place for data. If we disregard this, then prefetching becomes a nightmare for reads.

I think that if we decided to add another dataset property value for direct, such as maybe relaxed, that might make sense with this. However, the default should still be standard in my mind. Truth be told, we only added direct=always for Lustre.

I don't think we need to complicate things more. This should be an opt-in feature, if anything, with maybe a new dataset value for direct.

@amotin (Member Author) commented Apr 11, 2025

> I think we already muddy the waters enough when a write I/O is PAGE_SIZE aligned but not recordsize aligned and we just issue it through the ARC with direct=standard.

My patch actually improves this case, evicting the data from the DBUF and ARC caches as soon as possible, reducing cache effects as the user asked.

> The strict alignment should be observed in general. Users asked for direct I/O with O_DIRECT, and they should get EINVAL returned when they do not meet the only requirement we impose, which is PAGE_SIZE alignment.

I don't think it has to be strict, other than for software testing. The Linux man page says: "The handling of misaligned O_DIRECT I/Os also varies; they can either fail with EINVAL or fall back to buffered I/O," so relaxed behavior is not a violation.

> My concern is that people expect one thing and transparently get another. This leads to confusion and makes ZFS seem like it is not doing what they strictly asked it to do (do not make any copies of my data).

We've already found several examples of software that has no idea about alignment but uses O_DIRECT, including something as general as systemd extensions. Considering they are "broken", I bet most typical Linux file systems do not enforce alignment, and for us being strict just increases suffering. I am not interested in fixing all the software in the world, but I am happy to provide a testing tool in the shape of a tunable.

> I also don't like the idea that prefetching can happen if a user explicitly requested O_DIRECT for data blocks. When a user informs us of their intent with O_DIRECT, they are seriously saying (or should be) that they have no desire to have FS caching take place for data. If we disregard this, then prefetching becomes a nightmare for reads.

The data prefetch will only activate on misaligned I/Os, so obviously the user already does not know what he is doing. It is trivial to actually disable it now, if you insist. But I considered that somebody might use O_DIRECT with Direct I/O disabled via the module parameter, just to reduce cache thrashing on some parts of a workload, and prefetch really gives a performance improvement in many cases, even with NVMe pools, if the application queue depth is insufficient.

> I think that if we decided to add another dataset property value for direct, such as maybe relaxed, that might make sense with this. However, the default should still be standard in my mind.

I was actually thinking about an additional relaxed value for the direct property, but the present code asserts on any unknown property value, so adding any new one would be a pain, with new pool features added, etc. I personally could live with zfs_dio_strict=1 by default, overriding it locally, but I can't agree that it is productive for the mass market.

> Truth be told, we only added direct=always for Lustre.

It is likely the only way to use Direct I/O with SMB and NFS as well, since they have no concept of alignment in the protocol. So it seems the world is somewhat less "perfect" than we'd like. ;)

@adamdmoss (Contributor) commented Apr 21, 2025

Quick question:

> Originally I wired it only to the primarycache property

Do you intend to keep this wired to primarycache!=all in addition to the directIO-is-misaligned paths?

(My reading of the code changes suggests not, but my reading of the discussion suggests yes, so I'm left uncertain!)

@amotin (Member Author) commented Apr 21, 2025

>> Originally I wired it only to the primarycache property

> Do you intend to keep this wired to primarycache != all in addition to the directIO-is-misaligned paths?

Yes. Exactly the same effect is expected from any of: primarycache=metadata, O_DIRECT + zfs_dio_enabled=0, direct=always + zfs_dio_enabled=0, O_DIRECT + misaligned buffers, or direct=always + misaligned buffers.
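To make the "O_DIRECT + misaligned buffers" case concrete, a small hypothetical test (the file path is an assumption): with zfs_dio_strict=1 the read below should fail with EINVAL, while with the relaxed default from this PR it is served through Uncached I/O:

```c
/* Deliberately misaligned O_DIRECT read for testing the fallback. */
#define	_GNU_SOURCE			/* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/tank/ds/testfile";
	int fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open(O_DIRECT)");
		return (1);
	}

	char *raw = malloc(8192 + 1);
	if (raw == NULL)
		return (1);
	char *buf = raw + 1;		/* guaranteed not page-aligned */

	ssize_t n = pread(fd, buf, 4096, 0);
	if (n < 0)
		perror("pread");	/* EINVAL if strict alignment is enforced */
	else
		printf("read %zd bytes via the uncached fallback\n", n);

	free(raw);
	close(fd);
	return (n < 0);
}
```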

@amotin (Member Author) commented Apr 21, 2025

Just fixed the comment typo and rebased.

@robn (Member) commented Apr 26, 2025

I was thinking about this today and then realised I’d misunderstood exactly what direct=standard does (and that is why #17027 surprised me). I thought direct=standard was "try direct, fall back otherwise", not caring about alignment, but I think this is what we have (tell me if I’m wrong):

| direct=  | aligned | unaligned | O_DIRECT+aligned | O_DIRECT+unaligned |
|----------|---------|-----------|------------------|--------------------|
| disabled | ARC     | ARC       | ARC              | ARC                |
| standard | ARC     | ARC       | direct           | EINVAL             |
| always   | direct  | ARC       | direct           | ARC                |

So it seems there are two variables here:

  • what to do if a non-direct request with correct alignment for direct arrives (column 1)
  • what to do if a direct request with incorrect alignment for direct arrives (column 4)

If both were configurable, direct= might have the following values:

| direct=        | aligned | unaligned | O_DIRECT+aligned | O_DIRECT+unaligned |
|----------------|---------|-----------|------------------|--------------------|
| disabled       | ARC     | ARC       | ARC              | ARC                |
| relaxed        | ARC     | ARC       | direct           | ARC                |
| strict         | ARC     | ARC       | direct           | EINVAL             |
| always+relaxed | direct  | ARC       | direct           | ARC                |
| always+strict  | direct  | ARC       | direct           | EINVAL             |

So if I’m understanding this PR, the new zfs_dio_strict tunable effectively toggles the meaning of standard between relaxed and strict, right?


With my sysadmin hat on, I think I prefer relaxed by default, as a kind of principle of least surprise.

> Users asked for direct I/O with O_DIRECT, and they should get EINVAL returned when they do not meet the only requirement we impose, which is PAGE_SIZE alignment.

It feels like there are two things going on here.

One is that the user didn't ask for O_DIRECT, but the application did. If it fails, is it going to be clear to the user what to do next? Can they find out that the problem is that the call is misaligned? Do they even have an option? What if the software isn't configurable?

The other is about "the only requirement we impose". Until we added support for STATX_DIOALIGN, there wasn't even a way for an application to discover this, and even then it's still a relatively recent facility. Without that, we have no way of communicating alignment requirements to applications.
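For completeness, a sketch of the STATX_DIOALIGN query mentioned above (it assumes Linux 6.1+ and reasonably new glibc/kernel headers for the stx_dio_* fields; the path argument is arbitrary):

```c
/* Query a file's Direct I/O alignment requirements via statx(). */
#define	_GNU_SOURCE
#include <fcntl.h>		/* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>		/* statx(), STATX_DIOALIGN */

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_DIOALIGN, &stx) != 0) {
		perror("statx");
		return (1);
	}
	if (stx.stx_dio_mem_align == 0) {
		/* Filesystem reports no O_DIRECT alignment information. */
		puts("no STATX_DIOALIGN data for this file");
	} else {
		printf("O_DIRECT needs %u-byte memory and %u-byte offset alignment\n",
		    stx.stx_dio_mem_align, stx.stx_dio_offset_align);
	}
	return (0);
}
```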

It seems unsurprising that there is O_DIRECT-using software doing it "wrong". By forcing strict behavior, all we're doing is punishing the users of those applications, possibly with no recourse. For the ones doing it "right", none of this affects them.

So it seems like "work, but slower" is a kinder thing to do. Not to mention that that's historically what OpenZFS has always done with O_DIRECT.


I don’t know if we actually want all the property values described above (or whether we could even change the property values we currently have without causing problems, though I have an idea for that). But at least I think switching standard over to “relaxed” is probably the right thing to do.

If we do want more visibility, we could have a kstat counter noting the number of times a direct I/O request was redirected to the ARC for service due to misalignment. It’s not super obvious of course, but combined with a system call trace on the process, it at least gives the operator the ability to see what’s going on.

Labels
Status: Code Review Needed (Ready for review and testing)
Development

Successfully merging this pull request may close these issues.

systemd-sysext fails to install extensions with Direct I/O
5 participants