OGC CSW 2.0.2 Harvesting / Performance / Configuration of getRecords-Value #7995

Open
rime1014 opened this issue Apr 29, 2024 · 1 comment · May be fixed by #8070

Comments

@rime1014

Is your feature request related to a problem? Please describe.
This is a suggestion for improving harvesting performance by configuring the maxRecords value for the getRecords request per harvester.

We noticed the impact of the maxRecords value on performance through the following observation in one harvester.

Warning

When the response of a CSW harvester was reduced to 10 records (instead of 20), the harvesting time doubled from 13 hours to 26 hours.

We used the profiling tool VisualVM to analyze which methods require the most time during the harvesting process:

The analysis showed that the align method, which initializes the UUIDMapper class, was called twice as often.
As a result, a DB query on the metadata table, filtered by harvester, is executed for every 10 records (instead of every 20). With 259,188 metadata records, this corresponds to 25,918 DB queries, which matches the number of GeoNetwork warnings

Declared number of returned records (10) does not match requested record count (20)

in the harvester log file.

Before the switch to 10 records, only 12,959 DB queries would have been necessary.
In addition, the local metadata is matched against the remote metadata for every 10 records: the 10 metadata records of each CSW response are compared to all 259,188 metadata records of the harvester stored in the DB. This matching is repeated 25,918 times (instead of 12,959 times with 20 metadata records per CSW response). In total, about 3.3 billion metadata records were compared during one harvesting run of 259,188 metadata records.

The database queries and the matching are a bottleneck because some of the methods involved are time-consuming (setDateAndTime).
In addition, more getRecords queries against the CSW interface are necessary to retrieve all data.
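The relationship between maxRecords and the number of GetRecords pages can be checked with a few lines of arithmetic. The following is a minimal sketch, not GeoNetwork code: `pages_needed` is a hypothetical helper, and it assumes one align call (and therefore one harvester-filtered DB query) per page, as described above. It rounds up so that the last partial page is counted, which is why its results are one higher than the rounded-down figures quoted in this issue.

```python
TOTAL_RECORDS = 259_188  # number of metadata records reported in this issue


def pages_needed(total_records: int, max_records: int) -> int:
    """Number of GetRecords pages needed to fetch all records (hypothetical helper)."""
    if max_records <= 0:
        raise ValueError("max_records must be positive")
    # Integer ceiling division: a final partial page still costs one request.
    return -(-total_records // max_records)


if __name__ == "__main__":
    for max_records in (10, 20, 100, 200):
        pages = pages_needed(TOTAL_RECORDS, max_records)
        print(f"maxRecords={max_records}: {pages} pages, one align call each")
```

Under this assumption, halving the page size doubles the number of align calls, which is consistent with the observed increase from 13 to 26 hours.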

Describe the solution you'd like
CSW interfaces might support a higher response value than 20 for maxRecords.

For each response to a getRecords query, the align method is called, which creates a new instance of the UUIDMapper. When the UUIDMapper is instantiated, the findAllSimple method is called, which runs a DB query to determine all metadata records already available in GeoNetwork for the given harvester.

With fewer getRecords queries due to a higher maxRecords value, the align method is called less often and therefore fewer DB queries are required.
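The per-page cost described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the actual harvester: `FakeMetadataStore`, `align`, and `harvest` are hypothetical stand-ins, and the only behaviour modelled is that every page triggers one query for all local records of the harvester.

```python
class FakeMetadataStore:
    """Hypothetical stand-in for the metadata table; counts harvester queries."""

    def __init__(self):
        self.uuids = set()
        self.query_count = 0

    def find_all_for_harvester(self):
        # Models the per-page DB query described for the UUIDMapper.
        self.query_count += 1
        return set(self.uuids)


def align(store, page):
    """Match one page of remote UUIDs against all local records."""
    local = store.find_all_for_harvester()  # one query per page
    for uuid in page:
        if uuid not in local:
            store.uuids.add(uuid)


def harvest(remote_uuids, max_records):
    """Page through the remote records and align each page."""
    store = FakeMetadataStore()
    for start in range(0, len(remote_uuids), max_records):
        align(store, remote_uuids[start:start + max_records])
    return store


if __name__ == "__main__":
    remote = [f"uuid-{i}" for i in range(1000)]
    for max_records in (10, 20, 100):
        store = harvest(remote, max_records)
        print(f"maxRecords={max_records}: {store.query_count} queries")
```

In this model the harvested result is identical for every page size; only the number of queries changes, in inverse proportion to maxRecords.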

An additional setting in the harvester settings to set this value per harvester might significantly improve harvesting performance.
Default value: 20

Additional context
[Image: result of the VisualVM analysis]

@fxprunayre
Member

Maybe we could even default to a higher number (e.g. 200) to also reduce HTTP calls. 200 was used in the INSPIRE monitoring exercise in the past and worked fine. Also, to improve performance, we could maybe use the GetRecords operation alone, with full results, instead of requesting each record with GetRecordById.
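To make the suggestion concrete, here is a minimal sketch of paged CSW 2.0.2 GetRecords requests in KVP form that ask for full records, so that no follow-up GetRecordById call is needed. The endpoint URL and the helper names `build_getrecords_url` and `page_starts` are assumptions for illustration; the GeoNetwork harvester may build its requests differently (for example as XML over POST).

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used for illustration only.
ENDPOINT = "https://example.org/csw"


def build_getrecords_url(endpoint, start_position, max_records=200):
    """Build a CSW 2.0.2 GetRecords KVP request for one page of full records."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",         # return records, not only a hit count
        "ElementSetName": "full",        # full records, no GetRecordById needed
        "startPosition": start_position, # CSW positions are 1-based
        "maxRecords": max_records,
    }
    return f"{endpoint}?{urlencode(params)}"


def page_starts(total_matched, max_records):
    """1-based startPosition values needed to cover all matched records."""
    return range(1, total_matched + 1, max_records)


if __name__ == "__main__":
    for start in page_starts(450, 200):
        print(build_getrecords_url(ENDPOINT, start))
```

A real client would normally take the next start position from the nextRecord attribute of the csw:SearchResults element rather than computing it up front, since a server may return fewer records than requested.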

@rime1014 rime1014 changed the title OGC CSW 2.0.2 Harvesting / Performance / Konfiguration of getRecords-Value OGC CSW 2.0.2 Harvesting / Performance / Configuration of getRecords-Value Apr 29, 2024
josegar74 added a commit to GeoCat/core-geonetwork that referenced this issue May 19, 2024
- Increase GetRecords max records parameter to 100
- Use GetRecords with ElementSetName FULL to retrieve the full xml and avoid individual GetRecordById requests

Includes Sonarlint improvements.

Fixes geonetwork#7995
@josegar74 josegar74 linked a pull request May 19, 2024 that will close this issue
10 tasks