NOTE: This issue could be partially addressed by clarification in the documentation/examples. It could also be improved by refactoring so that a more useful/informative error message is raised in this situation. A full fix would add support for line-by-line reading of NTriples files, without reading the entire file in at once as a String.
What happened [1]
I tried to use this gem to parse the NTriples statements in the TGNOut_1Subjects.nt file (locally renamed to 1Subjects.nt) from the TGN explicit.zip I downloaded from http://vocab.getty.edu/.
This file has 26,854,584 lines so I had no intention of reading the whole file into memory and do not need the entire thing as a graph. I thought this was a good way to handle the parsing of the NTriples data so I could selectively do stuff with the statements I'm interested in, one at a time.
I read through the documentation prior to trying this, looking for any warnings about problems with large files, and did not find any information about performance/in-memory requirements or limits aside from info about caching which I categorized as irrelevant to my local-only application.
Given that NTriples is a line-based format, and given the examples showing use of Reader.open and each_statement, I assumed (wrongly [2]) that I could iterate through the statements in an NTriples file one at a time.
My initial code:
    RDF::Reader.open("1Subjects.nt") do |reader|
      reader.each_statement do |statement|
        binding.pry
      end
    end
Running my script immediately gets:
/Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:332:in `read': Invalid argument @ io_fread - 1Subjects.nt (Errno::EINVAL)
from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:332:in `block in open_file'
from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:322:in `open'
from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:322:in `open_file'
from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/reader.rb:221:in `open'
from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/reader.rb:212:in `open'
from tgn.rb:29:in `<main>'
Why it happened
The file at lib/rdf/util/file.rb:332 is the Ruby File object yielded by Kernel.open, and calling .read on the file indeed tries to read the entire file into memory, to be passed as a gigantic string to RemoteDocument.new.
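To illustrate the difference between the two access patterns in plain Ruby (this is only a sketch using a StringIO stand-in with made-up example.org triples, not the gem's code), compare slurping with read against iterating with each_line:

```ruby
require "stringio"

# Stand-in for a large N-Triples file; a real File responds to the same calls.
io = StringIO.new(<<~NT)
  <http://example.org/a> <http://example.org/p> "one" .
  <http://example.org/b> <http://example.org/p> "two" .
NT

# Slurping: the whole input becomes one String, so memory grows with file size.
whole = io.read
slurped_bytes = whole.bytesize

# Streaming: each_line yields one line at a time, so only one line is held at once.
io.rewind
line_count = 0
longest_line = 0
io.each_line do |line|
  line_count += 1
  longest_line = [longest_line, line.bytesize].max
end
```

With read, peak memory scales with the size of the file; with each_line, it scales with the longest line.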
Proposed solutions
Full fix
The NTriples data I work with always has one statement per line of the file (which I thought was a critical feature of the format), so ideally this could be fixed by handling each_statement from an NTriples::Reader by reading the file line-by-line instead of all-at-once (or providing some option to force this -- I looked for one in the code and API docs and didn't find it).
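A rough sketch of the line-by-line approach I have in mind, using only the Ruby standard library. The helper name each_ntriples_line and the example.org triples are hypothetical and not part of the rdf gem; the helper only splits and filters lines and does not parse them into statements:

```ruby
require "stringio"

# Hypothetical helper (not part of the rdf gem): yields one N-Triples line at a
# time, skipping blank lines and whole-line comments, without slurping the input.
def each_ntriples_line(io)
  return enum_for(:each_ntriples_line, io) unless block_given?

  io.each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?("#")
    yield line
  end
end

# Small in-memory stand-in for a file such as 1Subjects.nt.
sample = StringIO.new(<<~NT)
  # a comment
  <http://example.org/s1> <http://example.org/label> "Paris" .

  <http://example.org/s2> <http://example.org/other> "x" .
NT

# Cheap pre-filter: keep only lines that mention the predicate of interest.
wanted = each_ntriples_line(sample).select do |line|
  line.include?("<http://example.org/label>")
end
```

With a real file, the same helper could be called inside File.open("1Subjects.nt") { |f| ... }, and each yielded line could then be handed to an N-Triples parser, so only the statements of interest are ever turned into objects.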
Prevention of issue without full fix
The issue could have been prevented if the documentation examples made clear that the NTriples::Reader is going to (try to) read the whole file into memory as one String.
Had the documentation been clear about that, I wouldn't have tried this and run into this issue.
Mitigation of issue by refactoring to throw a more informative error message
Errno::EINVAL means "Invalid argument. This is used to indicate various kinds of problems with passing the wrong argument to a library function." (src)
Neither my code nor the rdf Gem has passed a wrong argument, so this error is very unclear in this context. The io_fread failing because of a bad argument is somewhere in Ruby's C code and thus pretty obscure and uninformative to the average Ruby user.
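As a sketch of what a more informative error could look like: the wrapper below is hypothetical (read_entire and DocumentTooLargeError are names I made up, not anything in the rdf gem), and it assumes the Errno::EINVAL here is a symptom of trying to read a very large file in one call.

```ruby
# Hypothetical error class and wrapper; neither exists in the rdf gem.
class DocumentTooLargeError < StandardError; end

# Reads the whole IO in one call, but converts the opaque Errno::EINVAL into an
# error that names the file and, when it can be determined, its size.
def read_entire(io, path)
  io.read
rescue Errno::EINVAL => e
  size = File.exist?(path) ? File.size(path) : nil
  detail = size ? "#{size} bytes" : "unknown size"
  raise DocumentTooLargeError,
        "could not read #{path} (#{detail}) into memory in one call: #{e.message}. " \
        "Consider processing the file line by line."
end
```

The original errno message is kept in the new message, so nothing is lost for anyone who does want the low-level detail.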
Footnotes
1. Not providing system details because the issue isn't system-specific (beyond the fact that my system, like most?, falls over trying to read a 26-million-line file into memory as a String, as I expected it would).
2. But not unreasonably, given the general Ruby pattern of using open to create an IO-type object and then an each... method to iterate part by part through the whole thing without having to hold the entire thing in memory.
You're correct that N-Triples (and N-Quads) is line-based, with one statement per line; this is a fundamental feature of the format. It does make it ripe for a streaming line-based reader, although the library has moved away from supporting that over time. The reader does read the whole input into memory, which is not great for long dumps.
The Errno::EINVAL comes from the read call on the opened file. I'd welcome a PR to address the documentation. Changing the behavior of File.open_file could address the issue, but would have some big consequences. It could possibly be done by refactoring File::RemoteDocument to handle a streaming case with an open file handle.
PRs welcome, as I don't have time to address large refactors at present, and not likely for some time.