This repository has been archived by the owner on Dec 21, 2017. It is now read-only.

Parsing very slow on larger files #3

Open
ijdickinson opened this issue Jul 1, 2010 · 1 comment

Comments

@ijdickinson

I'm reading in a batch of RDF files, each into its own RdfContext::Graph. The log below shows the timings I'm getting. Small files load just fine, but larger files take disproportionately long: one file takes 8.5 minutes (513.3s) to load about 38k triples. I'm running on a quad-core 64-bit Ubuntu system with 8 GB of memory and Ruby 1.9.1, so I don't think the raw performance of the machine is the issue.

log file output:

loading concept definitions...
Initializing coins_concept with target/def/sector.nt
... parsing complete in 0.1s producing 39 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/sector
Initializing coins_concept with target/def/data-type.nt
... parsing complete in 1.6s producing 487 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/data-type
Initializing coins_concept with target/def/programme-admin.nt
... parsing complete in 0.2s producing 47 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-admin
Initializing coins_concept with target/def/cga-body-type.nt
... parsing complete in 0.2s producing 47 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/cga-body-type
Initializing coins_concept with target/def/resource-capital.nt
... parsing complete in 0.1s producing 39 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/resource-capital
Initializing coins_concept with target/def/pesa-transfer.nt
... parsing complete in 0.3s producing 87 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-transfer
Initializing coins_concept with target/def/account-code.nt
... parsing complete in 20.2s producing 4711 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/account-code
Initializing coins_concept with target/def/estimate-number.nt
... parsing complete in 2.5s producing 503 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number
Initializing coins_concept with target/def/cofog.nt
... parsing complete in 4.5s producing 1271 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/cofog
Initializing coins_concept with target/def/department-code.nt
... parsing complete in 3.3s producing 847 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/department-code
Initializing coins_concept with target/def/budget-capital-current.nt
... parsing complete in 0.3s producing 47 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/budget-capital-current
Initializing coins_concept with target/def/request-for-resources-next-year.nt
... parsing complete in 0.2s producing 63 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources-next-year
Initializing coins_concept with target/def/counterparty-code.nt
... parsing complete in 1.7s producing 431 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/counterparty-code
Initializing coins_concept with target/def/pesa-delivery.nt
... parsing complete in 0.1s producing 31 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-delivery
Initializing coins_concept with target/def/income-category.nt
... parsing complete in 0.5s producing 111 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/income-category
Initializing coins_concept with target/def/estimate-line.nt
... parsing complete in 2.1s producing 615 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line
Initializing coins_concept with target/def/programme-object-group-code.nt
... parsing complete in 125.7s producing 15895 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-object-group-code
Initializing coins_concept with target/def/estimates-aina.nt
... parsing complete in 0.1s producing 39 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/estimates-aina
Initializing coins_concept with target/def/estimates-capital-current.nt
... parsing complete in 2.1s producing 63 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/estimates-capital-current
Initializing coins_concept with target/def/activity-code.nt
... parsing complete in 6.0s producing 1375 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/activity-code
Initializing coins_concept with target/def/estimate-number-next-year.nt
... parsing complete in 2.4s producing 503 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number-next-year
Initializing coins_concept with target/def/accounting-authority.nt
... parsing complete in 0.9s producing 159 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/accounting-authority
Initializing coins_concept with target/def/pesa-current-grants.nt
... parsing complete in 1.0s producing 215 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-current-grants
Initializing coins_concept with target/def/estimate-line-next-year.nt
... parsing complete in 2.8s producing 615 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line-next-year
Initializing coins_concept with target/def/request-for-resources.nt
... parsing complete in 0.2s producing 63 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources
Initializing coins_concept with target/def/pesa-services.nt
... parsing complete in 0.4s producing 39 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-services
Initializing coins_concept with target/def/estimate-line-last-year.nt
... parsing complete in 2.6s producing 575 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line-last-year
Initializing coins_concept with target/def/nac.nt
... parsing complete in 4.0s producing 951 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/nac
Initializing coins_concept with target/def/estimate-number-last-year.nt
... parsing complete in 2.5s producing 495 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number-last-year
Initializing coins_concept with target/def/budget-boundary.nt
... parsing complete in 0.1s producing 39 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/budget-boundary
Initializing coins_concept with target/def/pesa-1.1.nt
... parsing complete in 0.1s producing 31 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-1.1
Initializing coins_concept with target/def/esa.nt
... parsing complete in 2.6s producing 543 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/esa
Initializing coins_concept with target/def/territory.nt
... parsing complete in 0.2s producing 71 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/territory
Initializing coins_concept with target/def/data-subtype.nt
... parsing complete in 2.3s producing 471 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/data-subtype
Initializing coins_concept with target/def/department-group.nt
... parsing complete in 2.1s producing 439 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/department-group
Initializing coins_concept with target/def/signage.nt
... parsing complete in 0.1s producing 31 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/signage
Initializing coins_concept with target/def/request-for-resources-last-year.nt
... parsing complete in 0.4s producing 63 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources-last-year
Initializing coins_concept with target/def/programme-object-code.nt
... parsing complete in 513.3s producing 38855 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-object-code
Initializing coins_concept with target/def/sbi.nt
... parsing complete in 8.1s producing 455 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/sbi
Initializing coins_concept with target/def/time.nt
... parsing complete in 0.8s producing 119 triples
... indexed as http://finance.data/gov.uk/def/statistical-concept/time
Total time taken 720.9s

The files are in N-Triples format. I also tried Turtle input but gave up after waiting too long. I've tried both :list_store and :memory_store, and it doesn't make much difference.

My guess is that something in the parser loop is not scaling linearly with the size of the input file, but that's just a guess. I don't think there's anything special about the input files themselves, but am happy to provide copies if that helps with debugging.
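A quick way to check that guess against the log above is to compare throughput across file sizes. The sketch below uses a handful of (seconds, triples) pairs copied from the log; the helper names `throughput` and `scaling_exponent` are made up for illustration. With these figures, throughput drops from roughly 390 triples/s on the smallest file to roughly 76 triples/s on the largest, and the log-log slope between account-code.nt and programme-object-code.nt comes out at about 1.5, where 1.0 would be linear. This is only a rough estimate from a few data points, not a profile of the parser.

```ruby
# Timings copied from the log above: [file, seconds, triples].
SAMPLES = [
  ["sector.nt",                        0.1,    39],
  ["data-type.nt",                     1.6,   487],
  ["account-code.nt",                 20.2,  4711],
  ["programme-object-group-code.nt", 125.7, 15895],
  ["programme-object-code.nt",       513.3, 38855],
]

# Throughput in triples per second.
def throughput(seconds, triples)
  triples / seconds
end

# Slope of log(time) against log(size) between two samples.
# A value near 1.0 means linear scaling; noticeably higher means
# the time grows faster than the input.
def scaling_exponent(a, b)
  Math.log(b[1] / a[1]) / Math.log(b[2].to_f / a[2])
end

SAMPLES.each do |name, seconds, triples|
  printf("%-32s %6d triples %7.1fs %7.1f triples/s\n",
         name, triples, seconds, throughput(seconds, triples))
end

exponent = scaling_exponent(SAMPLES[2], SAMPLES[4])
printf("estimated scaling exponent: %.2f\n", exponent)
```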

Ian

@gkellogg
Owner

gkellogg commented Jul 1, 2010

The SQLite3 store will provide persistent storage, and may scale better for even larger graphs, but it is slower for smaller graphs. That would be :store => SQLite3.new(:path => "store.db"). You may also have found a memory leak within the Parser. The N-Triples parser is the same one used for Turtle/N3, so that could be the issue. Do you have the same problem parsing large files in other serializations?

If you have a script to run through these, I'll check it out.
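For reference, a minimal sketch of such a timing script, using only the Ruby standard library. The names `time_parses`, `count_statements` and `report` are hypothetical, and `count_statements` is a placeholder that just counts non-blank, non-comment lines; it is an assumption standing in for the real RdfContext parse call, which would be swapped in inside the block.

```ruby
require "benchmark"

# Placeholder for the real parser call (an assumption, not RdfContext API):
# counts non-blank, non-comment lines, which equals the statement count
# of a well-formed N-Triples file.
def count_statements(path)
  File.foreach(path).count { |line| line !~ /\A\s*(#|\z)/ }
end

# Times the given block once per file. The block must return the number
# of triples it produced. Returns one result hash per file.
def time_parses(paths)
  paths.map do |path|
    triples = nil
    seconds = Benchmark.realtime { triples = yield(path) }
    { :path => path, :seconds => seconds, :triples => triples }
  end
end

# Prints one line per file plus the total elapsed time.
def report(rows)
  rows.each do |row|
    printf("%-50s %8.1fs %8d triples\n", row[:path], row[:seconds], row[:triples])
  end
  total = rows.inject(0.0) { |sum, row| sum + row[:seconds] }
  printf("Total time taken %.1fs\n", total)
end

if __FILE__ == $PROGRAM_NAME
  # The directory matches the paths in the log above; adjust as needed.
  rows = time_parses(Dir.glob("target/def/*.nt").sort) do |path|
    count_statements(path) # replace with the real parse call
  end
  report(rows)
end
```

Running the same files through the placeholder and through the real parser separates file I/O time from parsing time, which helps narrow down where the slowdown comes from.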

Also, note that the same parsers and serializers in RdfContext are also available through RDF.rb as rdf-rdfa, rdf-n3 and rdf-rdfxml. RDF.rb has a richer infrastructure for graph storage than RdfContext. I've also noticed that RDF/XML parsing is substantially faster, due to some underlying optimizations in that implementation.
