-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve iteration for multiple documents in one file. #120
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #120 +/- ##
==========================================
+ Coverage 86.07% 87.23% +1.16%
==========================================
Files 15 15
Lines 1616 1669 +53
==========================================
+ Hits 1391 1456 +65
+ Misses 225 213 -12 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A YAML document can be a single node with a scalar or a sequence, so Base.eltype
should not return Dict{Any, Any}
.
Good point! I suppose the safest thing to do is change Is this package now maintained? I'd be happy to keep working on this if there are signs of life! |
@contradict it is very loosely (and poorly) maintained by me. It's kind of an accident - I didn't write the package, have never dug into the internals, and don't use it day-to-day. In general, I'm happy to review PRs, and if they provide tests and don't break anything, I'll usually merge and release, though I may need some reminders to pay attention. Sorry this languished so long. |
Ok, rebased and comment addressed. Unfortunately, this is still broken on Windows. I posted here looking for advice when I made this PR and I haven't made any progress on that part of the problem since. |
src/YAML.jl
Outdated
load_all(input, args...; kwargs...) | ||
end | ||
|
||
load_all(open(filename, "r"), args...; kwargs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly sure this won't close the file, in contrast to the open(...) do
construction. Is that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Since parsing is lazy, using do
here results in the file being closed before anything is read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but it's still problematic to leave a file unclosed. It's entirely possible that's the cause of the Windows failures and even if it seems to work on the other platforms, it will be a file descriptor leak and if you read too many YAML files this way you will eventually fail because you have run out of file descriptors.
There are some options to handle this, with varying advantages and disadvantages.
- Design the API with a manually called close operation.
- Set up a finalizer in
YAMLDocIterator
which closes the file when the iterator is garbage collected. - Read the raw file eagerly and do the lazy parsing from an internal
IOBuffer
. That way the file can be closed immediately after reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking closer at the code I think 3 is the better option. The main drawback is that it requires memory to store the raw file contents but if your file is large enough that that becomes a problem, you should use the load_all(::IO)
method directly instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll try again with that approach. Though a quick test shows that the finalizer attached during open does eventually close the file in the approach used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation that the IOStream
object would have a finalizer. Unfortunately there are no guarantees that it will run before you run out of file descriptors, so better avoid relying on it.
when parsing multi-document files. Since parsing is lazy, just using `do` results in the file being closed before parsing happens. Leaving the file open and wating for the finalizer to close it was deemed undesirable.
The first patch implements some missing functions from the iteration interface to describe the iterator, utilities like
collect
need use some of these.The second patch fixes
load_all_file
to not close the file before it is done reading it.I also added a test for each of these.