Is your feature request related to a problem? Please describe.
We used the profiling tool VisualVM to analyze which methods require the most time during the harvesting process and found that the setDateAndTimeUtc method is a bottleneck during harvesting.
However, this method is not called for every harvester.
Example: Harvesting of metadata (Geonetwork 4.2.2)
When harvesting the metadata shown above, the following harvesting times were determined using the VisualVM profiling tool:
DraftMetadataUtils.findAllSimple() requires **~ 73%** of the total harvesting time.
ISODate.setDateAndTimeUtc() requires **~ 29.6%** of the total harvesting time.
The analysis shows that the UUIDMapper class, through the findAllSimple method, accounts for the most total CPU time. When a UUIDMapper is instantiated, it calls findAllSimple, which queries all metadata records already stored in GN for the given harvester (a DB query) and stores information about them in HashMaps. The setDateAndTimeUtc method stands out: on its own it accounts for 10 - 30% of the total time in various tests.
This bottleneck does not occur with all harvesters.
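To make the pattern described above concrete, here is a minimal sketch of an index that parses the change date of every record while it is being built. It is not GeoNetwork code: the class UuidIndexSketch, the Row type and the parse logic are hypothetical and only illustrate why the cost of date parsing grows with the number of records already stored for a harvester.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not GeoNetwork code: an index built from all stored
// records of a harvester, parsing each change date eagerly.
final class UuidIndexSketch {

    // Hypothetical stand-in for one row of the metadata table.
    static final class Row {
        final String uuid;
        final String changeDate;

        Row(String uuid, String changeDate) {
            this.uuid = uuid;
            this.changeDate = changeDate;
        }
    }

    private final Map<String, LocalDateTime> changeDateByUuid = new HashMap<>();
    private int parseCalls = 0;

    // One parse per row: the work is proportional to the number of records.
    UuidIndexSketch(List<Row> rows) {
        for (Row row : rows) {
            changeDateByUuid.put(row.uuid, parse(row.changeDate));
        }
    }

    private LocalDateTime parse(String value) {
        parseCalls++;
        // Short notation (YYYY-MM-DD) is padded to midnight (assumption).
        if (value.length() == 10) {
            return LocalDate.parse(value).atStartOfDay();
        }
        return LocalDateTime.parse(value);
    }

    LocalDateTime changeDate(String uuid) {
        return changeDateByUuid.get(uuid);
    }

    int parseCalls() {
        return parseCalls;
    }
}
```

With an index like this, a harvester with a few hundred thousand stored records triggers the same number of date parses on every run, even if no record has changed.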
Describe the solution you'd like
Analysis proposal.
The following questions arise:
The date is processed using string operations. Can this be implemented with better performance?
What triggers the call to the setDateAndTimeUtc method?
Can calls to the setDateAndTimeUtc method be reduced through configuration (time zone settings, notation of the date in the database)?
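As a starting point for the first question, the sketch below puts a split-based parser next to one based on java.time, so that the two could be benchmarked against each other. This is not the ISODate implementation: DateParseSketch and both methods are hypothetical, only the notations mentioned in this issue are handled, and values without a zone are assumed to be UTC. Whether the java.time variant is actually faster would have to be measured.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical sketch, not the ISODate implementation.
final class DateParseSketch {

    // String-operation style: split the value and convert each field by hand.
    // Handles only YYYY-MM-DD and YYYY-MM-DDThh:mm:ss; assumes UTC.
    static Instant parseWithSplit(String value) {
        String[] dateAndTime = value.split("T");
        String[] ymd = dateAndTime[0].split("-");
        int hour = 0;
        int minute = 0;
        int second = 0;
        if (dateAndTime.length == 2) {
            String[] hms = dateAndTime[1].split(":");
            hour = Integer.parseInt(hms[0]);
            minute = Integer.parseInt(hms[1]);
            second = Integer.parseInt(hms[2]);
        }
        return LocalDateTime.of(
                Integer.parseInt(ymd[0]), Integer.parseInt(ymd[1]), Integer.parseInt(ymd[2]),
                hour, minute, second)
            .toInstant(ZoneOffset.UTC);
    }

    // java.time style: pick a parser by the shape of the value.
    static Instant parseWithJavaTime(String value) {
        if (value.length() == 10) {
            // Short notation, e.g. 2023-02-10 (midnight UTC is an assumption).
            return LocalDate.parse(value).atStartOfDay(ZoneOffset.UTC).toInstant();
        }
        if (value.endsWith("Z")) {
            // Full notation with UTC marker, e.g. 2023-02-10T00:00:00.000Z.
            return Instant.parse(value);
        }
        // Full notation without zone, e.g. 2023-04-17T12:36:29 (UTC assumed).
        return LocalDateTime.parse(value).toInstant(ZoneOffset.UTC);
    }
}
```

Both methods return the same Instant for the notations they share, which makes it possible to swap one for the other in a micro-benchmark without changing the result.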
Describe alternatives you've considered
During further tests, no clear correlation could be found between the date format in the CSW (Metadata XML) and the total time required by the setDateAndTimeUtc method. For some harvesters, the method is not called at all. For two harvesters, it takes a conspicuously long time. The metadata of these two harvesters was not yet stored in the full date format (e.g. 2023-04-17T12:36:29) in the metadata table after the database migration (to version 4.2.2). The affected metadata did not change and was not updated during harvesting.
One assumption was that, to synchronize the data during harvesting, the changedate had to be converted to the full notation for every metadata record stored in the short notation (YYYY-MM-DD). This assumption was not confirmed: after deleting the metadata and re-harvesting it completely with GN 4.2, the setDateAndTimeUtc method still accounts for a similar percentage of the total time, and the harvesting time has not improved.
For this harvester, in the metadata table, 238,069 entries have a changedate and createdate in the following notation: 2023-02-10
19,294 metadata records have a date with the notation 2023-02-10T00:00:00.000Z.
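Counts like the ones above could be reproduced for other harvesters with a small helper that classifies date strings by notation. The sketch below is only an illustration: DateNotationCounter and its labels are hypothetical, and it works on a list of strings that would first have to be read from the changedate or createdate column.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical helper: counts how many date strings use each notation.
final class DateNotationCounter {

    private static final Pattern SHORT =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
    private static final Pattern FULL =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}");
    private static final Pattern FULL_UTC_MILLIS =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d{3}Z");

    // Returns a label for the notation of a single value.
    static String classify(String value) {
        if (SHORT.matcher(value).matches()) {
            return "short";            // e.g. 2023-02-10
        }
        if (FULL.matcher(value).matches()) {
            return "full";             // e.g. 2023-04-17T12:36:29
        }
        if (FULL_UTC_MILLIS.matcher(value).matches()) {
            return "full-utc-millis";  // e.g. 2023-02-10T00:00:00.000Z
        }
        return "other";
    }

    // Counts values per notation label, sorted by label.
    static Map<String, Integer> count(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) {
            counts.merge(classify(value), 1, Integer::sum);
        }
        return counts;
    }
}
```

Comparing these counts between harvesters that do and do not show the bottleneck might help narrow down whether the stored notation is related to the setDateAndTimeUtc calls.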
Additional context
Harvesting process (chronology) highlighting bottlenecks.
harvest.harvester.csw.Harvester.harvest()
harvest.harvester.csw.Harvester.searchAndAlign()
harvest.harvester.csw.Harvester.align()
harvest.harvester.UUIDMapper.<init> ()
datamanager.draft.DraftMetadataUtils.findAllSimple () (~ 73 % of total harvesting time)
findAllSimple ()
findSimple ()
findSimple ()
getResultList ()
setDateAndTimeUtc () (not called for every harvester, but when it is: ~ 10 - 30 % of total harvesting time)
setDateAndTime ()
parseDate ()
split ()
convertToISOZuluDateTime ()
parseISODateTimes ()
parseBasicOrFullDateTime ()
parseBest ()