Elsevier parser is mishandling author groups embedded in author groups #102

seasidesparrow · 2024-04-26T13:26:33Z

Describe the bug
A recent PhLB paper from the CMS collaboration places the author-affiliation metadata for each institution in its own author-group, and all of those groups are embedded within a master author-group for the collaboration itself: <ce:author-group>...CMS Collab...<ce:author-group><ce:author>authors at Institution 1</ce:author><sa:affiliation>Institution 1</sa:affiliation></ce:author-group>...etc...</ce:author-group>

To Reproduce
Parse the test case file els_phlb_compound_affil.xml with release v0.9.17 of ADSIngest Parser. Each author is assigned every affiliation present, rather than only those within the nested author-group they belong to.
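The symptom can be illustrated on a toy document. The sketch below is not the ElsevierParser code. It only assumes that affiliations are collected with a search scoped to the outermost author-group, which is one way the reported behaviour could arise. The sample XML, the placeholder namespace URIs, and the function names naive_affiliations and scoped_affiliations are all hypothetical.

```python
from bs4 import BeautifulSoup

# Toy stand-in for the nested structure described above.
# The namespace URIs are placeholders so that the ce:/sa: prefixes resolve.
SAMPLE_XML = """<doc xmlns:ce="http://example.org/ce" xmlns:sa="http://example.org/sa">
<ce:author-group>
  <ce:author-group>
    <ce:author>Author A</ce:author>
    <sa:affiliation>Institution 1</sa:affiliation>
  </ce:author-group>
  <ce:author-group>
    <ce:author>Author B</ce:author>
    <sa:affiliation>Institution 2</sa:affiliation>
  </ce:author-group>
</ce:author-group>
</doc>"""


def naive_affiliations(xml):
    """Pair every author with every affiliation under the outermost group."""
    soup = BeautifulSoup(xml, "lxml-xml")
    outer = soup.find("ce:author-group")
    # find_all() searches all descendants, so nested groups leak in.
    affils = [a.get_text(strip=True) for a in outer.find_all("sa:affiliation")]
    return {au.get_text(strip=True): affils for au in outer.find_all("ce:author")}


def scoped_affiliations(xml):
    """Pair each author with the affiliations of its own group only."""
    soup = BeautifulSoup(xml, "lxml-xml")
    result = {}
    for au in soup.find_all("ce:author"):
        # Nearest enclosing author-group of this author.
        group = au.find_parent("ce:author-group")
        # recursive=False restricts the search to direct children.
        affils = [
            a.get_text(strip=True)
            for a in group.find_all("sa:affiliation", recursive=False)
        ]
        result[au.get_text(strip=True)] = affils
    return result


if __name__ == "__main__":
    print(naive_affiliations(SAMPLE_XML))
    print(scoped_affiliations(SAMPLE_XML))
```

In the naive version both authors receive both institutions, while the scoped version keeps each author with the affiliation of its own nested group.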

Additional context
A recursive find-extract in BeautifulSoup may be able to handle this: keep extracting author-group tags for as long as soup.find('author-group') is not None, and recursively parse each extracted author-group to check whether it contains further author-groups of its own.


seasidesparrow · 2024-04-26T14:11:12Z

The following seems to work in isolation (using BeautifulSoup alone, not ElsevierParser):

from bs4 import BeautifulSoup


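def get_groups(soup):
    # Collect the first author-group in soup plus any groups nested inside it.
    group_list = []
    # extract() removes the tag from the tree and returns it.
    ag = soup.find('ce:author-group').extract()
    # Each recursive call extracts one nested group from ag,
    # so the loop ends once no nested groups remain.
    while ag.find('ce:author-group'):
        group_list.append(get_groups(ag))
    group_list.append(ag)
    return group_list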


def main():
    with open("cms_omg.xml", "rb") as fc:
        data = fc.read()

    soup = BeautifulSoup(data, "lxml-xml")

    auth_blocks = get_groups(soup)
    for a in auth_blocks:
        print(a)
        print("\n\n")


if __name__ == '__main__':
    main()

seasidesparrow · 2024-04-26T14:27:55Z

The solution above has two issues. One, the first enclosing author-group tag is appended to the end of the group_list object, so the pieces come out of order. Two, the top enclosing author group is a bare element (a bs4 Tag, which prints as a string), while every nested group is wrapped in a list of length 1.

One possible solution:

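def get_groups(soup):
    group_list = []
    ag = soup.find('ce:author-group').extract()
    while ag.find('ce:author-group'):
        group_list.append(get_groups(ag))
    # Put the enclosing group first, followed by the nested results.
    g2 = [ag]
    g2.extend(group_list)
    group_list = []
    for g in g2:
        # Each recursive result is a list; keep only its first element,
        # the nested group itself. Anything nested more deeply is dropped.
        if isinstance(g, list):
            group_list.append(g[0])
        else:
            group_list.append(g)
    return group_list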
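If groups can be nested more than one level deep, keeping only g[0] of each recursive result would lose the deeper groups. A minimal sketch of a variant that flattens to any depth follows. The name get_groups_flat, the sample XML, its id attributes, and the placeholder namespace URIs are hypothetical, and this is not necessarily what #103 implements.

```python
from bs4 import BeautifulSoup


def get_groups_flat(node):
    """Return the first author-group under node, then all groups nested in it.

    Hypothetical variant of get_groups: outer group first, flat list, any depth.
    """
    found = node.find("ce:author-group")
    if found is None:
        return []
    ag = found.extract()  # detach the group from its parent
    nested = []
    # Each recursive call extracts one nested group (with its own
    # descendants) from ag, so the loop ends when none remain.
    while ag.find("ce:author-group"):
        nested.extend(get_groups_flat(ag))
    return [ag] + nested


# Toy input: A contains B (which contains C) and D.
# The namespace URI is a placeholder so that the ce: prefix resolves.
SAMPLE = """<doc xmlns:ce="http://example.org/ce">
<ce:author-group id="A">
  <ce:author-group id="B">
    <ce:author>b</ce:author>
    <ce:author-group id="C"><ce:author>c</ce:author></ce:author-group>
  </ce:author-group>
  <ce:author-group id="D"><ce:author>d</ce:author></ce:author-group>
</ce:author-group>
</doc>"""

soup = BeautifulSoup(SAMPLE, "lxml-xml")
groups = get_groups_flat(soup)
print([g.get("id") for g in groups])
```

Like the versions above, this handles only the first top-level author-group in the document. Every returned group has had its nested groups removed, so a later search for authors or affiliations inside one group sees only that group's own entries.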

seasidesparrow · 2024-05-03T18:28:10Z

Fixed by #103

seasidesparrow added the bug label Apr 26, 2024

seasidesparrow self-assigned this Apr 26, 2024

seasidesparrow closed this as completed May 3, 2024
