Parsers heterogeneity #111

Open
marblestation opened this issue Oct 23, 2020 · 0 comments

Most parsers have a parse(...) method that accepts a raw string to be parsed, for example:

def parse(self, fp, **kwargs):

But others act very differently. For instance:

  • For ATels, a URL is passed and the raw data is fetched from there:

def parse(self, url, **kwargs):
    atel_recs = [{}]
    headers = {
        'Content-type': 'text/xml',
        'Accept': 'text/html,application/xhtml+xml,application/xml',
        'User-agent': 'Mozilla/5.0'}
    data = self.get_records(url, headers=headers, **kwargs)

It would be convenient to separate the download action from the parse action.
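A minimal sketch of what that separation could look like, assuming a standalone download helper and a parser that only ever sees the raw text. The names `fetch` and `ATelParserSketch` are hypothetical and not part of the package, and the parsing body is a placeholder rather than the real ATel logic:

```python
import urllib.request


def fetch(url, headers=None, timeout=30):
    """Download step, kept outside the parser (hypothetical helper)."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class ATelParserSketch:
    """Hypothetical parser that never touches the network."""

    def parse(self, raw, **kwargs):
        # Placeholder parsing step: one record per non-empty line of input.
        return [{"raw": line} for line in raw.splitlines() if line.strip()]
```

With this split, the caller would chain the two steps, e.g. `ATelParserSketch().parse(fetch(url, headers=headers))`, and the parser could be tested against a stored string without any network access.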

  • For proquest, parse takes no arguments, so it is unclear what it is parsing unless we read the code and see that the input is a file name (not a raw string) passed to the constructor rather than to the parse method:

def __init__(self, filename):
    marc_input_file = config.PROQUEST_BASE_PATH + filename
    oa_input_file = marc_input_file.replace('.UNX', '_OpenAccessTitles.csv')
    self.records = open(marc_input_file).read().strip().split('\n')
    oa_input_data = open(oa_input_file).read().strip().split('\n')

  • For Affiliations and GCNCirc, parse again takes no arguments. The input is once more passed to the constructor, but this time it is a raw string rather than a file name:

def __init__(self, input_string):
    self.original_string = input_string

def __init__(self, data):
    # econv = EntityConverter()
    # econv.input_text = data
    # econv.convert()
    # self.raw = econv.output_text
    self.raw = data
    self.data_dict = dict()

The package would benefit from some homogenization so that all parsers share common methods (e.g., parse) with similar arguments and requirements, behavior and expectations, and return values. For instance, sometimes we return the parsed result:

Sometimes we return nothing and store the result in the object instead:

self.results.append(output_metadata)
return
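One possible shape for such a common interface, sketched under the assumption that every parser receives its input through parse and returns the parsed records instead of storing them. `BaseParserSketch`, `parse_file`, and `LineParserSketch` are hypothetical names used only for illustration; nothing like them is confirmed to exist in the package:

```python
from abc import ABC, abstractmethod


class BaseParserSketch(ABC):
    """Hypothetical common interface: input goes to parse(), output is returned."""

    @abstractmethod
    def parse(self, raw, **kwargs):
        """Take the raw input string and return a list of record dicts."""

    def parse_file(self, path, encoding="utf-8", **kwargs):
        # Convenience wrapper for parsers whose input lives in a file.
        with open(path, encoding=encoding) as handle:
            return self.parse(handle.read(), **kwargs)


class LineParserSketch(BaseParserSketch):
    """Toy subclass used only to exercise the interface."""

    def parse(self, raw, **kwargs):
        # Return the result instead of appending it to self.results.
        return [{"line": line} for line in raw.strip().split("\n") if line]
```

Under this design the constructor carries configuration only, so a file-based parser such as the proquest one and a string-based parser such as the Affiliations one would be called the same way, and callers would never need to inspect the object to find the output.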

marblestation changed the title from "Parser heterogeneity" to "Parsers heterogeneity" on Oct 23, 2020
seasidesparrow self-assigned this on Nov 3, 2020