Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add XMLParser( recover=True ) option to pyoai/client; add --recover option to harvest. #31

Open
wants to merge 15 commits into
base: develop
Choose a base branch
from

Conversation

sdm7g
Copy link

@sdm7g sdm7g commented Apr 19, 2019

This adds XMLParser( recover=True ) option to pyoai/client and includes that module locally. #25
( I've had an outstanding pull request to fix code in that project. )
Adds an optional --recover / --no-recover option to command line args.
( default is --no-recover, to be conservative and keep the same behavior )
Adds code to log errors as warnings when using recover option, and to log which identifiers had parser errors, so that upstream feeds can be notified.
And fixed some issues with logger config that cropped up when testing this code:
harvest.py logging now goes to harvest.log, registry.py goes to registry.log #26 ;
and added some missing whitespace in help strings.

Copy link
Owner

@bloomonkey bloomonkey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi

First, thanks for taking the time and trouble to code this up and submit a PR. This looks like a very useful addition.

That said, I'm uneasy about adding this entire module into oaiharvest as this brings with it the burden of maintaining this code (which I didn't write and don't understand). I'd much prefer it if pyoai could approve your pull request and mint a new release... (I'll go and +1 your PR there shortly)

There a few other changes I've suggested to remove code you're replacing rather than commenting it out. The git repo will keep history for us, so there's no need for it to be in the latest version.

Thanks again, and good luck getting the pyoai change merged... 🤞

John

@@ -129,6 +129,9 @@ def _listRecords(self, baseUrl, metadataPrefix="oai_dc", **kwargs):
if isinstance(metadata, str) and metadata.startswith("b'"):
metadata = ast.literal_eval(metadata).decode("utf-8")
yield (header, metadata, about)
if client.XMLParser.error_log and len(client.XMLParser.error_log) > 0 :
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove space before : (PEP 8)


ch = logging.StreamHandler()
format='%(levelname)-8s %(message)s',
# format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s',
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

datefmt='[%Y-%m-%d %H:%M:%S]',
filename=os.path.join(appdir, 'registry.log')
format='%(levelname)-8s %(message)s'
# format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s',
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

)
logger = logging.getLogger(__name__)
ch = logging.StreamHandler()
# ch = logging.StreamHandler()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(levelname)-8s %(message)s')
# formatter = logging.Formatter('%(levelname)-8s %(message)s')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

formatter = logging.Formatter(
'%(asctime)s %(name)-16s %(levelname)-8s %(message)s',
'[%Y-%m-%d %H:%M:%S]')
#formatter = logging.Formatter('%(levelname)-8s %(message)s')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK: commented code removed. ( I had left them in originally, as I was unsure of the intent of the original code, and wanted to make clear what exactly what behavior I was changing. )
I also added the --recover option to the doc-string at the top, which I missed earlier.

@sdm7g
Copy link
Author

sdm7g commented Apr 30, 2019

On further use and testing, I have found a couple of cases where it does recover from parsing Mal-formed payload, but where the issue is unmatching tags, the errors cascade outward to the OAI container and cause the resumption token to not be found and parsed correctly, thus the harvest halts. ( At least that my initial diagnosis. ) It still gives plenty of parser warning messages before halting, however, I'ld like to figure out a better way to unambiguously identify this case.
I still consider this behavior better than before, but perhaps it should be documented that --recover could have this side effect.

Steve Majewski added 5 commits May 2, 2019 15:17
sometimes mismatched tag problems can propagate outwards to container and resumptionToken not found in exact path.
…ntities

use unicode, not utf-8, so that it works under python 2.x as well as 3
…ATE_VERIFY_FAILED

I don't know what needs to be fixed on the server ends (and I have no control over that end)
but I found this fix googling the problem (several fixes suggested)
https://www.howtouselinux.com/post/ssl-certificate_verify_failed-in-python
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants