Elsevier parser is mishandling author groups embedded in author groups #102

Closed
seasidesparrow opened this issue Apr 26, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@seasidesparrow
Member

Describe the bug
A recent PhLB paper from the CMS collaboration puts the author-affiliation metadata for each institution in its own author-group, and nests all of those groups inside a master author-group for the collaboration itself: <ce:author-group>...CMS Collab...<ce:author-group><ce:author>authors at Institution 1</ce:author><sa:affiliation>Institution 1</sa:affiliation></ce:author-group>....etc...</ce:author-group>

To Reproduce
Parse the test case file els_phlb_compound_affil.xml with release v0.9.17 of ADSIngest Parser. Each author is assigned every affiliation present in the file, rather than only the affiliations in the nested author group that author belongs to.

Additional context
A recursive find-and-extract in BeautifulSoup may be able to handle this: keep extracting author-group tags for as long as soup.find('author-group') is not None, and recursively parse each extracted author-group to check whether it contains further author-groups of its own.
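
To illustrate the intended result (each author paired only with the affiliations of the group that directly contains them), here is a minimal sketch. It deliberately uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, and it is not the ElsevierParser code. The sample XML, the author names, the namespace URIs and the collect_groups helper are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, invented for this sketch (not the real Elsevier ones).
NS = {"ce": "urn:example:ce", "sa": "urn:example:sa"}

# Made-up sample mirroring the nesting described in this issue.
SAMPLE = """
<ce:author-group xmlns:ce="urn:example:ce" xmlns:sa="urn:example:sa">
  <ce:collaboration>CMS Collaboration</ce:collaboration>
  <ce:author-group>
    <ce:author>A. Author</ce:author>
    <ce:author>B. Author</ce:author>
    <sa:affiliation>Institution 1</sa:affiliation>
  </ce:author-group>
  <ce:author-group>
    <ce:author>C. Author</ce:author>
    <sa:affiliation>Institution 2</sa:affiliation>
  </ce:author-group>
</ce:author-group>
"""


def collect_groups(group):
    """Return (authors, affiliations) for this group, then for each nested group."""
    # A path without a leading ".//" matches direct children only, so the
    # members of nested groups are not attributed to the enclosing group.
    authors = [a.text for a in group.findall("ce:author", NS)]
    affils = [a.text for a in group.findall("sa:affiliation", NS)]
    result = [(authors, affils)]
    for child in group.findall("ce:author-group", NS):
        result.extend(collect_groups(child))
    return result


groups = collect_groups(ET.fromstring(SAMPLE.strip()))
for authors, affils in groups:
    print(authors, affils)
```

The outer collaboration group comes first with no authors of its own, followed by one entry per institution group, so no author picks up an affiliation from a sibling group.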

@seasidesparrow added the bug label Apr 26, 2024
@seasidesparrow self-assigned this Apr 26, 2024
@seasidesparrow
Member Author

The following seems to work in isolation (using BeautifulSoup alone, not ElsevierParser):

from bs4 import BeautifulSoup


def get_groups(soup):
    group_list = []
    ag = soup.find('ce:author-group').extract()
    while ag.find('ce:author-group'):
        group_list.append(get_groups(ag))
    group_list.append(ag)
    return group_list





def main():
    with open("cms_omg.xml", "rb") as fc:
        data = fc.read()

    soup = BeautifulSoup(data, "lxml-xml")

    auth_blocks = get_groups(soup)
    for a in auth_blocks:
        print(a)
        print("\n\n")


if __name__ == '__main__':
    main()

@seasidesparrow
Member Author

The solution above has two issues. First, the outermost enclosing author-group is appended to the end of group_list, so the pieces come back out of order. Second, the outermost author group is stored as a single element, while every nested group is stored as a list of length 1, so the entries of group_list are not of a uniform type.

One possible solution:

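def get_groups(soup):
    group_list = []
    # Detach the first author-group so nested groups can be pulled out of it.
    ag = soup.find('ce:author-group').extract()
    # Each recursive call extracts one nested group from ag, so the loop ends.
    while ag.find('ce:author-group'):
        group_list.append(get_groups(ag))
    # Put the enclosing group first, followed by its nested groups.
    g2 = [ag]
    g2.extend(group_list)
    group_list = []
    for g in g2:
        if isinstance(g, list):
            # extend (rather than g[0]) keeps groups nested more than one level deep
            group_list.extend(g)
        else:
            group_list.append(g)
    return group_list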
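
A generic way to turn a nested result into a flat, ordered list, whatever the nesting depth, is a small recursive flattener. The sketch below is an illustration only: plain strings stand in for the extracted tags, and the flatten helper and sample data are hypothetical names, not part of ElsevierParser or BeautifulSoup.

```python
def flatten(items):
    """Yield the non-list elements of a nested list in their original order."""
    for item in items:
        if isinstance(item, list):
            # Descend into sub-lists of any depth.
            yield from flatten(item)
        else:
            yield item


# Strings stand in for the extracted author-group tags.
nested = ["outer", ["inst1"], ["inst2", ["inst2a"]]]
flat = list(flatten(nested))
print(flat)
```

Every element of the result is then of the same kind, and the enclosing group stays ahead of the groups nested inside it.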

@seasidesparrow
Member Author

Fixed by #103
