
Incremental harvesting? #22

Open
jgonggrijp opened this issue Oct 31, 2018 · 5 comments

Comments

@jgonggrijp
Contributor

Dear @bloomonkey, I'd like to harvest a big set incrementally. My impression is that oai-harvest does not support this scenario. If I run oai-harvest -s SET -l 3 provider twice, it will simply download the same records twice.

Is there a simple way in which I could modify oai-harvest in order to support this? I'd be happy to submit a pull request. Alternatively, are you aware of a tool that already supports this functionality? Thanks in advance.

@bloomonkey
Owner

Hi

TL;DR - internally it does harvest incrementally in batches (or pages), but unfortunately I don't think that there's a simple way to control batch sizes or starting points.

I don't think that there is a simple way to support this, as it is not actually supported in the OAI-PMH protocol itself. Selective harvesting in OAI-PMH only supports retrieving records based on when they were added/updated/deleted or by requesting a specific set, or a combination of these.

Batching, or paging, of a ListRecords request in OAI-PMH is controlled by the server. The server decides the number of records it will deliver in each page, and provides a resumptionToken to allow the request to continue from the next page. The traversal of pages is handled by the pyoai library. The -l, --limit option to oai-harvest is potentially inefficient: if you run oai-harvest -s SET -l 3 provider it will stop fetching pages once it has 3 records, but if the server decided that each page should contain 10,000 records, there's no way to stop all of those records being returned in the response; they will just not be output to files.

Also, my interpretation of the OAI-PMH standard's section on flow control is that there's no requirement for the server to return records in a consistent order, except through the use of a resumptionToken. That being the case, any home-brewed solution for incremental harvesting risks missing some records, or harvesting them more than once.
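For reference, the flow-control mechanism described above boils down to a simple loop: issue a ListRecords request, then keep re-issuing it with only the resumptionToken until the server stops returning one. A minimal sketch of that paging logic, where `fetch` is a hypothetical stand-in for the HTTP-request-and-XML-parsing step (not part of oai-harvest or pyoai):

```python
def list_records(fetch):
    """Traverse all pages of a ListRecords response.

    ``fetch`` is a callable that takes a resumptionToken (or None for the
    initial request) and returns ``(records, next_token)``.  Against a real
    endpoint it would perform the HTTP request and parse the XML response;
    it is kept abstract here so only the paging logic is visible.
    """
    token = None
    while True:
        records, token = fetch(token)
        for record in records:
            yield record
        if token is None:  # no resumptionToken: the list is complete
            break

# A fake "server" serving three pages of two records each.
pages = {None: ([1, 2], "t1"), "t1": ([3, 4], "t2"), "t2": ([5, 6], None)}

print(list(list_records(pages.__getitem__)))  # [1, 2, 3, 4, 5, 6]
```

The point being: the client never chooses where a page starts; the only resumable position the protocol gives you is the server-issued token, which is also why an interrupted harvest cannot simply be restarted at an arbitrary offset.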

@jgonggrijp
Contributor Author

Thank you for your quick response. As a workaround, I tried simply suspending the oai-harvest process by sending SIGTSTP and resuming by sending SIGCONT, but this results in a connection reset.

Maybe I can modify oai-harvest to postpone fetching the next page until some interval has passed, when some minimum number of records has been fetched (but only after all records that the server wants to include in the page have been processed). The interval and the minimum number of records could be passed on the command line. For my use case, this would be good enough (I don't need a strict limit on the number of records, I just need to ensure that harvesting only happens during the night). Does this sound workable?

@jgonggrijp
Contributor Author

I looked at the code and I think I found a way to add this functionality. I'd add some optional arguments to the argparser in harvest.py. If set, these would get passed to harvester.harvest() through the kwargs over here:

completed = harvester.harvest(baseUrl,
                              args.metadataPrefix,
                              **kwargs
                              )

which in turn gets passed to OAIHarvester._listRecords(). Depending on these settings, I'd then increment a counter and compare it to the passed limit setting, or compare the current time to the passed limit time, at the end of this loop body:

for record in client.listRecords(**kwargs):
    # Unit test hotfix
    header, metadata, about = record
    # Fix pyoai returning a "b'...'" string for py3k
    if isinstance(metadata, str) and metadata.startswith("b'"):
        metadata = ast.literal_eval(metadata).decode("utf-8")
    yield (header, metadata, about)

When the limit is reached, I reset the counter if applicable and then simply let the process sleep for the set amount of time.

This will likely cause the process to sleep in the middle of a page, but looking through the pyoai code, it appears this shouldn't be a problem because each resumption request is fully handled before client.listRecords yields the next record.
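A rough sketch of what that throttling could look like as a standalone wrapper; this is a hypothetical illustration, not the actual patch. The names `throttled`, `min_records`, and `interval` are made up here, and `records` stands in for the generator returned by `client.listRecords(...)`:

```python
import time

def throttled(records, min_records, interval):
    """Yield records, pausing for ``interval`` seconds each time
    ``min_records`` records have been yielded since the last pause.

    Because the pause happens between ``yield``s, any page the server has
    already delivered is still processed to completion before sleeping.
    """
    count = 0
    for record in records:
        yield record
        count += 1
        if count >= min_records:
            count = 0
            time.sleep(interval)

# With interval=0 the wrapper degenerates to a plain pass-through:
print(list(throttled(range(5), min_records=2, interval=0)))  # [0, 1, 2, 3, 4]
```

Wrapping the generator keeps the counting and sleeping out of the record-processing code itself, which is roughly the shape the in-loop counter described above would take.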

@bloomonkey
Owner

So if I understand, you're trying to harvest a set so big that it takes more than about 8 hours, but you need to harvest only at night?

To be honest, it might be easiest to write a subclass of DirectoryOAIHarvester and use this in a simple script, rather than wrangle the command line arguments. Something like:

import codecs
import logging
import os
import time
from datetime import datetime

from oaiharvest.harvest import DirectoryOAIHarvester
from oaiharvest.metadata import DefaultingMetadataRegistry, XMLMetadataReader

class NightTimeDirectoryOAIHarvester(DirectoryOAIHarvester):
    def harvest(self, baseUrl, metadataPrefix, **kwargs):
        logger = logging.getLogger(__name__).getChild(self.__class__.__name__)
        for header, metadata, about in self._listRecords(
                 baseUrl,
                 metadataPrefix=metadataPrefix,
                 **kwargs):

            # This is quick and dirty, you could calculate the exact time to sleep for instead
            hour = datetime.now().hour
            while 6 <= hour < 22:
                logger.debug("not night-time, sleeping for 10 mins...")
                time.sleep(600)  # Sleep for 10 minutes
                hour = datetime.now().hour

            fp = self._get_output_filepath(header, metadataPrefix)
            self._ensure_dir_exists(fp)

            if not header.isDeleted():
                logger.debug('Writing to file {0}'.format(fp))
                with codecs.open(fp, "w", encoding="utf-8") as fh:
                    fh.write(metadata)

            else:
                if self.respectDeletions:
                    logger.debug("Respecting server request to delete file "
                                 "{0}".format(fp))
                    try:
                        os.remove(fp)
                    except OSError:
                    # File probably doesn't exist in destination directory
                        # No further action needed
                        pass
                else:
                    logger.debug("Ignoring server request to delete file "
                                 "{0}".format(fp))
        else:
            # Harvesting completed, all available records stored
            return True

        # Loop must have been stopped with ``break``, e.g. due to
        # arbitrary limit
        return False

# Set up metadata registry
xmlReader = XMLMetadataReader()
metadata_registry = DefaultingMetadataRegistry(defaultReader=xmlReader)
harvester = NightTimeDirectoryOAIHarvester(metadata_registry, "/path/to/store/files")
harvester.harvest(
    "http://oaipmh.example.com/oaipmh/2.0/",
    metadataPrefix="oai_dc",
    set="SET")

If you do want to change the command line arguments, updating the whole CLI to use Click would be a nice improvement.

@jgonggrijp
Contributor Author

So if I understand, you're trying to harvest a set so big that it takes more than about 8 hours, but you need to harvest only at night?

Yes. Though I should add that I'm not only harvesting the metadata, but also additional data linked from within. I'll do the latter with a separate script.

To be honest, it might be easiest to write a subclass of DirectoryOAIHarvester and use this in a simple script, rather than wrangle the command line arguments. Something like: [big chunk of code]

That looks like more code than I was planning to write. Also, I'm being paid from public money, so I'd prefer to make it reusable for the rest of the world.

If you do want to change the command line arguments, updating the whole CLI to use Click would be a nice improvement.

Maybe, but I'm not being paid to do that. You are already using argparse; it looks maintainable enough to me for now.

I'm now going to work on what I last proposed. I'll test it tonight and probably send you a pull request tomorrow or Thursday.
