Errno::EINVAL using each_statement with a LARGE, local NTriples file #440

Open
kspurgin opened this issue Jul 3, 2023 · 1 comment

kspurgin commented Jul 3, 2023

NOTE: This issue could be partially addressed by clarifying the documentation/examples. It could also be mitigated by refactoring so that a more informative error message is raised in this situation. A full fix would add support for reading NTriples files line by line, without reading the entire file into memory at once as a String.

What happened [1]

I tried to use this Gem to parse the NTriples statements in the TGNOut_1Subjects.nt file (locally renamed to 1Subjects.nt) from the TGN explicit.zip I downloaded from http://vocab.getty.edu/

This file has 26,854,584 lines so I had no intention of reading the whole file into memory and do not need the entire thing as a graph. I thought this was a good way to handle the parsing of the NTriples data so I could selectively do stuff with the statements I'm interested in, one at a time.

I read through the documentation prior to trying this, looking for any warnings about problems with large files, and did not find any information about performance/in-memory requirements or limits aside from info about caching which I categorized as irrelevant to my local-only application.

Given that NTriples is a line-based format, and given the examples showing use of Reader.open and each_statement, I assumed (wrongly [2]) that the each_statement pattern meant I could iterate through the statements of an NTriples file one at a time.

My initial code:

require "rdf"
require "pry"

RDF::Reader.open("1Subjects.nt") do |reader|
  reader.each_statement do |statement|
    binding.pry # pause to inspect each statement interactively
  end
end

Running my script immediately gets:

/Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:332:in `read': Invalid argument @ io_fread - 1Subjects.nt (Errno::EINVAL)
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:332:in `block in open_file'
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:322:in `open'
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:322:in `open_file'
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/reader.rb:221:in `open'
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/reader.rb:212:in `open'
	from tgn.rb:29:in `<main>'

Why it happened

The file at lib/rdf/util/file.rb:332 is the Ruby File object yielded by Kernel.open, and calling .read on the file indeed tries to read the entire file into memory, to be passed as a gigantic string to RemoteDocument.new.

Proposed solutions

Full fix

The NTriples data I work with always has one statement per line of the file (which I thought was a critical feature of the format), so ideally this could be fixed by handling each_statement from an NTriples::Reader by reading the file line-by-line instead of all-at-once (or providing some option to force this -- I looked for one in the code and API docs and didn't find it).
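
As a rough illustration of the line-by-line approach described above, here is a minimal sketch that uses only the Ruby standard library. The helper name each_ntriples_line is hypothetical and is not part of the rdf Gem; it only streams the raw lines and leaves the parsing of each line into a statement to the caller.

```ruby
require "tempfile"

# Hypothetical helper (not part of the rdf Gem): yields one N-Triples line
# at a time. File.foreach reads lazily, so only the current line is held
# in memory rather than the whole file.
def each_ntriples_line(path)
  return enum_for(:each_ntriples_line, path) unless block_given?

  File.foreach(path) do |line|
    line = line.strip
    # Skip blank lines and comment lines.
    next if line.empty? || line.start_with?("#")

    yield line
  end
end

# Demo on a tiny temporary file standing in for a large dump.
Tempfile.create(["demo", ".nt"]) do |file|
  file.write("<http://example.org/s> <http://example.org/p> \"o\" .\n")
  file.write("# a comment line\n")
  file.write("<http://example.org/s2> <http://example.org/p> \"o2\" .\n")
  file.flush

  each_ntriples_line(file.path) { |line| puts line }
end
```

Each yielded line could then be handed to whatever single-statement parser is available. Memory use stays proportional to the longest line, not to the size of the file.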

Prevention of issue without full fix

The issue could have been prevented by being clear that the NTriples::Reader is going to (try to) read the whole file into memory as one String in the documentation examples.

If the documentation had been clear about that, I wouldn't have tried this and run into this issue.

Mitigation of issue by refactoring to throw a more informative error message

Errno::EINVAL means "Invalid argument. This is used to indicate various kinds of problems with passing the wrong argument to a library function." (src)

Neither my code nor the rdf Gem has passed a wrong argument, so this error is very unclear in this context. The io_fread failing because of a bad argument is somewhere in Ruby's C code and thus pretty obscure and uninformative to the average Ruby user.
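
The mitigation could look something like the following sketch, which rescues the low-level error around a whole-file read and re-raises it with context. The names read_whole_file and UnreadableFileError are hypothetical and do not exist in the rdf Gem, and the reader: parameter is only there so that the failure can be simulated. The hint in the message about file size is an assumption about the likely cause, not something confirmed in this thread.

```ruby
# Hypothetical error class, not part of the rdf Gem.
class UnreadableFileError < StandardError; end

# Hypothetical wrapper around a whole-file read. The reader: argument
# defaults to File.read and exists only so the failure can be simulated.
def read_whole_file(path, reader: File.method(:read))
  reader.call(path)
rescue Errno::EINVAL => e
  size = File.size?(path) # nil if the file is missing or empty
  size_text = size ? "#{size} bytes" : "size unknown"
  raise UnreadableFileError,
        "Could not read #{path} (#{size_text}) into memory as a single " \
        "String: #{e.message}. The file may be too large to read in one " \
        "call; consider processing it line by line."
end

# Simulate the failure from the stack trace without needing a huge file.
failing_reader = ->(_path) { raise Errno::EINVAL, "io_fread" }
begin
  read_whole_file("1Subjects.nt", reader: failing_reader)
rescue UnreadableFileError => e
  puts e.message
end
```

The original exception is still available through the new exception's cause, so no detail is lost for anyone who needs the underlying error.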

Footnotes

  1. Not providing system details because the issue isn't system specific (beyond the fact that my system (like most?) falls over trying to read a 26 million line file into memory as a String, as I expected it would)

  2. But not unreasonably, given the general Ruby pattern of open to create an IO-type object, and then an each... method to iterate part-by-part through the whole thing without having to hold the entire thing in memory

gkellogg (Member) commented Oct 1, 2023

Sorry, the issue got lost.

You're correct that N-Triples (and N-Quads) is line-based, with one statement per line; this is a fundamental feature of the format. It does make it ripe for a streaming line-based reader, although the library has moved away from supporting that over time. The reader does read into memory, which is not great for long dumps.

The Errno::EINVAL comes from the read call on the opened file (IO#read). I'd welcome a PR to address the documentation. Changing the behavior of File.open_file could address the issue, but would have some big consequences. It could possibly be done by refactoring File::RemoteDocument to handle a streaming case with an open file handle.
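
To make the "open file handle" idea concrete, here is a minimal sketch of a document object that wraps a handle and exposes it lazily. The class StreamingDocument is hypothetical; it is not the library's File::RemoteDocument and does not try to mirror that class's actual interface. It only shows the general shape of keeping metadata next to a handle that is read on demand.

```ruby
require "forwardable"
require "tempfile"

# Hypothetical stand-in for a streaming document; not part of the rdf Gem.
# It keeps metadata (here only base_uri) next to an open IO handle and
# delegates line-oriented reads to that handle instead of holding a String.
class StreamingDocument
  extend Forwardable

  attr_reader :base_uri

  def_delegators :@io, :each_line, :readline, :eof?, :close

  def initialize(io, base_uri: nil)
    @io = io
    @base_uri = base_uri
  end

  # Opens the file, yields the document, and closes the handle afterwards.
  def self.open(path)
    doc = new(File.open(path, "r"), base_uri: path)
    return doc unless block_given?

    begin
      yield doc
    ensure
      doc.close
    end
  end
end

# Demo on a small temporary file.
Tempfile.create(["demo", ".nt"]) do |file|
  file.write("<http://example.org/s> <http://example.org/p> \"o\" .\n")
  file.flush

  StreamingDocument.open(file.path) do |doc|
    doc.each_line { |line| puts line }
  end
end
```

A reader that only needs each_line, readline, and eof? could consume such an object without caring whether the content sits in memory or on disk.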

PRs welcome, as I don't have time to address large refactors at present, and not likely for some time.
