
Incremental harvesting? #22

Open
jgonggrijp opened this issue Oct 31, 2018 · 5 comments

Comments

@jgonggrijp
Contributor

Dear @bloomonkey, I'd like to harvest a big set incrementally. My impression is that oai-harvest does not support this scenario. If I run oai-harvest -s SET -l 3 provider twice, it will simply download the same records twice.

Is there a simple way in which I could modify oai-harvest in order to support this? I'd be happy to submit a pull request. Alternatively, are you aware of a tool that already supports this functionality? Thanks in advance.

@bloomonkey
Owner

Hi

TL;DR - internally it does harvest incrementally in batches (or pages), but unfortunately I don't think that there's a simple way to control batch sizes or starting points.

I don't think that there is a simple way to support this, as it is not actually supported in the OAI-PMH protocol itself. Selective harvesting in OAI-PMH only supports retrieving records based on when they were added/updated/deleted or by requesting a specific set, or a combination of these.

Batching, or paging, of a ListRecords request in OAI-PMH is controlled by the server. The server decides the number of records it will deliver in each page, and provides a resumptionToken to allow the request to continue from the next page. The traversal of pages is handled by the pyoai library. The -l, --limit option to oai-harvest is potentially inefficient: if you run oai-harvest -s SET -l 3 provider it will stop fetching pages once it has 3 records, but if the server decided that each page should contain 10,000 records, there's no way to stop all of those records being returned in the response; they will just not be output to files.

Also, my interpretation of the OAI-PMH standard's section on flow control is that there's no requirement for the server to return records in a consistent order, except through the use of a resumptionToken. That being the case, any home-brewed solution for incremental harvesting risks missing some records, or harvesting them more than once.
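For reference, the flow-control mechanism described above boils down to a simple loop: issue a ListRecords request, then keep re-issuing it with only the resumptionToken until the server stops returning one. A minimal sketch of that paging logic, where `fetch` is a hypothetical stand-in for the HTTP-request-and-XML-parsing step (not part of oai-harvest or pyoai):

```python
def list_records(fetch):
    """Traverse all pages of a ListRecords response.

    ``fetch`` is a callable that takes a resumptionToken (or None for the
    initial request) and returns ``(records, next_token)``.  Against a real
    endpoint it would perform the HTTP request and parse the XML response;
    it is kept abstract here so only the paging logic is visible.
    """
    token = None
    while True:
        records, token = fetch(token)
        for record in records:
            yield record
        if token is None:  # no resumptionToken: the list is complete
            break

# A fake "server" serving three pages of two records each.
pages = {None: ([1, 2], "t1"), "t1": ([3, 4], "t2"), "t2": ([5, 6], None)}

print(list(list_records(pages.__getitem__)))  # [1, 2, 3, 4, 5, 6]
```

The point being: the client never chooses where a page starts; the only resumable position the protocol gives you is the server-issued token, which is also why an interrupted harvest cannot simply be restarted at an arbitrary offset.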

@jgonggrijp
Contributor Author

Thank you for your quick response. As a workaround, I tried simply suspending the oai-harvest process by sending SIGTSTP and resuming by sending SIGCONT, but this results in a connection reset.

Maybe I can modify oai-harvest to postpone fetching the next page until some interval has passed, when some minimum number of records has been fetched (but only after all records that the server wants to include in the page have been processed). The interval and the minimum number of records could be passed on the command line. For my use case, this would be good enough (I don't need a strict limit on the number of records, I just need to ensure that harvesting only happens during the night). Does this sound workable?

@jgonggrijp
Contributor Author

I looked at the code and I think I found a way to add this functionality. I'd add some optional arguments to the argparser in harvest.py. If set, these would get passed to harvester.harvest() through the kwargs over here:

completed = harvester.harvest(baseUrl,
                              args.metadataPrefix,
                              **kwargs
                              )

which in turn gets passed to OAIHarvester._listRecords(). Depending on these settings, I'd then increment a counter and compare it to the passed limit setting, or compare the current time to the passed limit time, at the end of this loop body:

for record in client.listRecords(**kwargs):
    # Unit test hotfix
    header, metadata, about = record
    # Fix pyoai returning a "b'...'" string for py3k
    if isinstance(metadata, str) and metadata.startswith("b'"):
        metadata = ast.literal_eval(metadata).decode("utf-8")
    yield (header, metadata, about)

When the limit is reached, I reset the counter if applicable and then simply let the process sleep for the set amount of time.

This will likely cause the process to sleep in the middle of a page, but looking through the pyoai code, it appears this shouldn't be a problem because each resumption request is fully handled before client.listRecords yields the next record.
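A rough sketch of what that throttling could look like as a standalone wrapper; this is a hypothetical illustration, not the actual patch. The names `throttled`, `min_records`, and `interval` are made up here, and `records` stands in for the generator returned by `client.listRecords(...)`:

```python
import time

def throttled(records, min_records, interval):
    """Yield records, pausing for ``interval`` seconds each time
    ``min_records`` records have been yielded since the last pause.

    Because the pause happens between ``yield``s, any page the server has
    already delivered is still processed to completion before sleeping.
    """
    count = 0
    for record in records:
        yield record
        count += 1
        if count >= min_records:
            count = 0
            time.sleep(interval)

# With interval=0 the wrapper degenerates to a plain pass-through:
print(list(throttled(range(5), min_records=2, interval=0)))  # [0, 1, 2, 3, 4]
```

Wrapping the generator keeps the counting and sleeping out of the record-processing code itself, which is roughly the shape the in-loop counter described above would take.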

@bloomonkey
Owner

So if I understand, you're trying to harvest a set so big that it takes more than about 8 hours, but you need to harvest only at night?

To be honest, it might be easiest to write a subclass of DirectoryOAIHarvester and use this in a simple script, rather than wrangle the command line arguments. Something like:

import codecs
import logging
import os
import time
from datetime import datetime

from oaiharvest.harvest import DirectoryOAIHarvester
from oaiharvest.metadata import DefaultingMetadataRegistry, XMLMetadataReader

class NightTimeDirectoryOAIHarvester(DirectoryOAIHarvester):
    def harvest(self, baseUrl, metadataPrefix, **kwargs):
        logger = logging.getLogger(__name__).getChild(self.__class__.__name__)
        for header, metadata, about in self._listRecords(
                 baseUrl,
                 metadataPrefix=metadataPrefix,
                 **kwargs):

            # This is quick and dirty, you could calculate the exact time to sleep for instead
            hour = datetime.now().hour
            while 6 <= hour < 22:
                logger.debug("not night-time, sleeping for 10 mins...")
                time.sleep(600)  # Sleep for 10 minutes
                hour = datetime.now().hour

            fp = self._get_output_filepath(header, metadataPrefix)
            self._ensure_dir_exists(fp)

            if not header.isDeleted():
                logger.debug('Writing to file {0}'.format(fp))
                with codecs.open(fp, "w", encoding="utf-8") as fh:
                    fh.write(metadata)

            else:
                if self.respectDeletions:
                    logger.debug("Respecting server request to delete file "
                                 "{0}".format(fp))
                    try:
                        os.remove(fp)
                    except OSError:
                    # File probably doesn't exist in destination directory
                        # No further action needed
                        pass
                else:
                    logger.debug("Ignoring server request to delete file "
                                 "{0}".format(fp))
        else:
            # Harvesting completed, all available records stored
            return True

        # Loop must have been stopped with ``break``, e.g. due to
        # arbitrary limit
        return False

# Set up metadata registry
xmlReader = XMLMetadataReader()
metadata_registry = DefaultingMetadataRegistry(defaultReader=xmlReader)
harvester = NightTimeDirectoryOAIHarvester(metadata_registry, "/path/to/store/files")
harvester.harvest(
    "http://oaipmh.example.com/oaipmh/2.0/",
    metadataPrefix="oai_dc",
    set="SET")

If you do want to change the command line arguments, updating the whole CLI to use Click would be a nice improvement.

@jgonggrijp
Contributor Author

So if I understand, you're trying to harvest a set so big that it takes more than about 8 hours, but you need to harvest only at night?

Yes. Though I should add that I'm not only harvesting the metadata, but also additional data linked from within. I'll do the latter with a separate script.

To be honest, it might be easiest to write a subclass of DirectoryOAIHarvester and use this in a simple script, rather than wrangle the command line arguments. Something like: [big chunk of code]

That looks like more code than I was planning to write. Also, I'm being paid from public money, so I'd prefer to make it reusable for the rest of the world.

If you do want to change the command line arguments, updating the whole CLI to use Click would be a nice improvement.

Maybe, but I'm not being paid to do that. You are already using argparse; it looks maintainable enough to me for now.

I'm now going to work on what I last proposed. I'll test it tonight and probably send you a pull request tomorrow or Thursday.
