Describe the bug
A recent PhLB paper from the CMS collaboration has author-affiliation metadata for each institution author-group embedded within a master author-group for the collaboration itself:
<ce:author-group>...CMS Collab...<ce:author-group><ce:author>authors at Institution 1</ce:author><sa:affiliation>Institution 1</sa:affiliation></ce:author-group>....etc...</ce:author-group>
To Reproduce
Parse the test case file els_phlb_compound_affil.xml with release v0.9.17 of ADSIngest Parser. Each author will be assigned all affiliations present, rather than those within the nested author group they belong to.
Additional context
A recursive find-and-extract in BeautifulSoup that keeps extracting author-group tags while soup.find('author-group') is not None may be able to do this, but each extracted author-group must itself be parsed recursively to check whether it contains further author-groups.
The following seems to work in isolation (using BeautifulSoup alone, not ElsevierParser):
from bs4 import BeautifulSoup


def get_groups(soup):
    group_list = []
    ag = soup.find('ce:author-group').extract()
    while ag.find('ce:author-group'):
        group_list.append(get_groups(ag))
    group_list.append(ag)
    return group_list


def main():
    with open("cms_omg.xml", "rb") as fc:
        data = fc.read()
    soup = BeautifulSoup(data, "lxml-xml")
    auth_blocks = get_groups(soup)
    for a in auth_blocks:
        print(a)
        print("\n\n")


if __name__ == '__main__':
    main()
The solution above has two issues. One, the text content of the first enclosing author-group tag is appended to the end of the group_list object, leading to the pieces being out of order. Two, while the content of the top enclosing author group is of type str, everything else is of type list with length 1.
One possible solution:
def get_groups(soup):
    # Detach the first author-group so later searches do not find it again.
    ag = soup.find('ce:author-group').extract()
    # The enclosing group goes first so the pieces stay in document order.
    group_list = [ag]
    while ag.find('ce:author-group'):
        # Each recursive call returns a flat list, so extend() keeps the
        # result flat even when groups are nested more than one level deep.
        group_list.extend(get_groups(ag))
    return group_list
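For comparison, here is a minimal, self-contained sketch of the same idea using only the standard library's xml.etree.ElementTree instead of BeautifulSoup. It is not the ADSIngest parser code. The sample XML is a simplified stand-in for the structure described in this issue, the namespace URIs are placeholders, and the names iter_groups and authors_with_affiliations are hypothetical. It walks the nested author-groups outermost first and pairs each author only with the affiliations that are direct children of that author's own group.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption): they only make the sample well-formed.
NS = {"ce": "http://example.org/ce", "sa": "http://example.org/sa"}

# Simplified stand-in for the nested structure described in this issue.
SAMPLE = """\
<ce:author-group xmlns:ce="http://example.org/ce" xmlns:sa="http://example.org/sa">
  <!-- collaboration-level content omitted -->
  <ce:author-group>
    <ce:author>A. Author</ce:author>
    <ce:author>B. Author</ce:author>
    <sa:affiliation>Institution 1</sa:affiliation>
  </ce:author-group>
  <ce:author-group>
    <ce:author>C. Author</ce:author>
    <sa:affiliation>Institution 2</sa:affiliation>
  </ce:author-group>
</ce:author-group>
"""


def iter_groups(group):
    """Yield an author-group and every nested author-group, outermost first."""
    yield group
    for child in group.findall("ce:author-group", NS):
        yield from iter_groups(child)


def authors_with_affiliations(root):
    """Pair each author with the affiliations of its own author-group only."""
    pairs = []
    for group in iter_groups(root):
        # A bare child path in findall() matches direct children only, so
        # affiliations from sibling or nested groups are not picked up.
        affils = [a.text.strip() for a in group.findall("sa:affiliation", NS)]
        for author in group.findall("ce:author", NS):
            pairs.append((author.text.strip(), affils))
    return pairs


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for name, affils in authors_with_affiliations(root):
        print(name, "->", affils)
```

Because nothing is extracted from the tree, the groups come back in document order and as one flat sequence, which avoids both issues noted above. Real Elsevier records may link authors to affiliations differently, so treat this as an illustration of the traversal only.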