OGC CSW 2.0.2 Harvesting / Performance improvement for (UTC) datetime conversion #8007

Open
rime1014 opened this issue May 2, 2024 · 0 comments

rime1014 commented May 2, 2024

Is your feature request related to a problem? Please describe.
We used the profiling tool VisualVM to analyze which methods require the most time during the harvesting process and found that the setDateAndTimeUtc method is a bottleneck during harvesting.
However, this method is not called for every harvester.

Example: Harvesting of metadata (Geonetwork 4.2.2)
  added: 1545
  removed: 21
  total: 257363
  unchanged: 255773
  updated: 45

When harvesting the metadata shown above, the following harvesting times were determined using the VisualVM profiling tool:

  • DraftMetadataUtils.findAllSimple() requires **~ 73%** of the total harvesting time.
  • ISODate.setDateAndTimeUtc() requires **~ 29.6%** of the total harvesting time.

The analysis shows that the UUIDMapper class, through its call to findAllSimple, accounts for most of the total CPU time. When UUIDMapper is instantiated, it calls findAllSimple, which runs a DB query for all metadata records already present in GeoNetwork for the given harvester and stores information about them in HashMaps. One method stands out: setDateAndTimeUtc alone accounts for 10 - 30% of the total time in various tests.

This bottleneck does not occur with all harvesters.

Describe the solution you'd like
Analysis proposal.
The following questions arise:

  1. The date is processed using string operations. Can this be implemented with better performance?
  2. What causes the setDateAndTimeUtc method to be called?
  3. Can the number of calls to setDateAndTimeUtc be reduced by configuration (time zone settings, notation of the date in the database)?
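
Regarding question 1, one possible direction is a fast path for the short notation (YYYY-MM-DD) that avoids both String.split and DateTimeFormatter.parseBest. The following is only a sketch under assumptions: FastIsoDateSketch and its methods are hypothetical names and not part of GeoNetwork, and treating a date-time without an offset as UTC is an assumption about the desired behaviour, not a description of what ISODate does today.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

// Hypothetical helper for illustration; not GeoNetwork code.
class FastIsoDateSketch {

    // Reads the digits in s[from, to) as a non-negative int, or returns -1 on a non-digit.
    static int digits(String s, int from, int to) {
        int value = 0;
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return -1;
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }

    // Parses exactly "YYYY-MM-DD" without split() or a formatter.
    // Returns null when the input does not have that shape.
    static LocalDate parseShortDate(String s) {
        if (s == null || s.length() != 10 || s.charAt(4) != '-' || s.charAt(7) != '-') {
            return null;
        }
        int year = digits(s, 0, 4);
        int month = digits(s, 5, 7);
        int day = digits(s, 8, 10);
        if (year < 0 || month < 0 || day < 0) {
            return null;
        }
        // LocalDate.of throws DateTimeException for impossible dates such as month 13.
        return LocalDate.of(year, month, day);
    }

    // Fast path for the short notation, general ISO parser for everything else.
    static ZonedDateTime toUtc(String s) {
        LocalDate date = parseShortDate(s);
        if (date != null) {
            return date.atStartOfDay(ZoneOffset.UTC);
        }
        TemporalAccessor parsed = DateTimeFormatter.ISO_DATE_TIME
                .parseBest(s, ZonedDateTime::from, LocalDateTime::from);
        if (parsed instanceof ZonedDateTime) {
            return ((ZonedDateTime) parsed).withZoneSameInstant(ZoneOffset.UTC);
        }
        // Assumption: a date-time without offset is interpreted as UTC.
        return ((LocalDateTime) parsed).atZone(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(toUtc("2021-07-18"));
        System.out.println(toUtc("2023-02-10T00:00:00.000Z"));
        System.out.println(toUtc("2023-04-17T12:36:29"));
    }
}
```

Whether this actually helps would have to be measured with the same VisualVM setup, since the profile above does not show how the time splits between String.split and parseBest.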

Describe alternatives you've considered
During further tests, no clear correlation could be found between the date format in the CSW (Metadata XML) and the total time required for the setDateAndTimeUtc method. For some harvesters, the method is not called at all. For two harvesters, it takes a conspicuously long time.

The metadata of these two harvesters was not yet saved in the full date format (e.g. 2023-04-17T12:36:29) in the metadata table after the database migration (to version 4.2.2). The affected metadata did not change and was not updated during harvesting. One assumption was that, to synchronize the data during harvesting, the changedate had to be converted to the full notation for each metadata record stored in the short notation (YYYY-MM-DD).

This assumption was not confirmed. We deleted the metadata and re-harvested the data completely with GN 4.2, but setDateAndTimeUtc still accounts for a similar percentage of the total time, and the harvesting time has not improved.

Date format in CSW Metadata XML:

<gmd:dateStamp>
    <gco:Date>2021-07-18</gco:Date>
</gmd:dateStamp>

<gmd:date>
    <gmd:CI_Date>
        <gmd:date>
            <gco:Date>2020-12-31</gco:Date>
        </gmd:date>
        <gmd:dateType>
            <gmd:CI_DateTypeCode codeSpace="ISOTC211/19115" codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_DateTypeCode" codeListValue="creation">creation</gmd:CI_DateTypeCode>
        </gmd:dateType>
    </gmd:CI_Date>
</gmd:date>

For this harvester, the metadata table contains 238,069 entries whose changedate and createdate use the short notation, e.g. 2023-02-10.
A further 19,294 metadata records use the full notation, e.g. 2023-02-10T00:00:00.000Z.
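
If many of these short-notation values repeat, caching the conversion result per distinct input string might reduce the cost. That is an assumption: we have not measured how many distinct dates occur. The sketch below only illustrates the idea. DateConversionCacheSketch and slowConvert are hypothetical names, slowConvert is a stand-in for the real conversion and not the GeoNetwork implementation, and the HashMap-based cache is not thread-safe.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical memoizing wrapper for illustration; not GeoNetwork code.
class DateConversionCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> converter;
    private int conversions;

    DateConversionCacheSketch(Function<String, String> converter) {
        this.converter = converter;
    }

    // Runs the converter once per distinct input string and reuses the result afterwards.
    String convert(String raw) {
        String result = cache.get(raw);
        if (result == null) {
            result = converter.apply(raw);
            conversions++;
            cache.put(raw, result);
        }
        return result;
    }

    // Number of times the underlying converter was actually invoked.
    int conversions() {
        return conversions;
    }

    // Stand-in for the expensive conversion: short date to a UTC instant string.
    static String slowConvert(String raw) {
        return LocalDate.parse(raw).atStartOfDay(ZoneOffset.UTC).toInstant().toString();
    }

    public static void main(String[] args) {
        DateConversionCacheSketch cache =
                new DateConversionCacheSketch(DateConversionCacheSketch::slowConvert);
        String[] rows = {"2023-02-10", "2023-02-10", "2023-02-11", "2023-02-10"};
        for (String row : rows) {
            cache.convert(row);
        }
        System.out.println(cache.conversions() + " conversions for " + rows.length + " rows");
    }
}
```

A cache like this only pays off when the same date strings recur within one findAllSimple call. With mostly unique timestamps it would add memory overhead without saving time.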

Additional context
Harvesting process (chronology) highlighting bottlenecks.

  • harvest.harvester.csw.Harvester.harvest()
  • harvest.harvester.csw.Harvester.searchAndAlign()
  • harvest.harvester.csw.Harvester.align()
  • harvest.harvester.UUIDMapper.<init> ()
  • datamanager.draft.DraftMetadataUtils.findAllSimple () (~ 73 % of total harvesting time)
  • datamanager.base.BaseMetadataUtils.findAllSimple ()
  • com.sun.proxy.$Proxy202.findSimple ()
  • ... (org.springframework)
  • repository.MetadataRepositoryCustomImpl.findSimple ()
  • com.sun.proxy.$Proxy295.getResultList ()
  • ... (org.hibernate)
  • domain.ISODate.setDateAndTimeUtc () (not called for every harvester, but when it is: ~ 10 - 30% of total harvesting time)
    • domain.ISODate.setDateAndTime ()
      • domain.ISODate.parseDate ()
        • java.lang.String.split ()
      • utils.DateUtil.convertToISOZuluDateTime ()
        • utils.DateUtil.parseISODateTimes ()
          • utils.DateUtil.parseBasicOrFullDateTime ()
            • java.time.format.DateTimeFormatter.parseBest ()