identify a use-case for Cross Domain Interoperability Framework integration to help increase searchability of GloBI indexed datasets #967

Open
jhpoelen opened this issue Mar 12, 2024 · 5 comments

Comments

@jhpoelen
Member

fyi @deboradrucker @zedomel @Filipi-Soares

https://docs.google.com/document/d/1QZjtZeYOT-Mn5IwdeGfkVmKzOy39rrc1ebnBXeVWtiA/edit

related to work package WorldFAIR D10.

Comments on Cross Domain Interoperability Framework (CDIF) model - discovery model draft - Cross-Domain Interoperability Framework (CDIF) Working Group, Richard, S., Gregory, A., Hodson, S., Fils, D., Kanjala, C., Bell, D., Winstanley, P., Edwards, M., Heus, P., Brickley, D., Rizzolo, F., Maxwell, L., Luis, G., Buttigieg, P. L., & Le Franc, Y. (2023). Cross Domain Interoperability Framework (CDIF): Discovery Module (v01 draft for public consultation) (Version 01). Zenodo. https://doi.org/10.5281/zenodo.10252564

@jhpoelen
Member Author

Cross-Domain Interoperability Framework (CDIF) appears to be using the same strategy that Google uses to integrate data into the Google Knowledge Graph.

@jhammock @KatjaSchulz - What has been your experience in working with Google folks and their services to help better re-use Encyclopedia of Life data products? How (if at all) do you imagine GloBI re-using this approach?
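
For concreteness, here is a minimal sketch of what that strategy might look like for a single dataset, assuming the shared approach is schema.org metadata serialized as JSON-LD and embedded in a web page (which is how Google's dataset indexing is documented to work). The function names, dataset name, description, and URL below are hypothetical placeholders, not actual GloBI records or a CDIF-endorsed profile.

```python
import json


def dataset_jsonld(name, description, url, keywords):
    """Build a minimal schema.org Dataset description as a plain dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
    }


def as_script_tag(doc):
    """Wrap a JSON-LD document in the script tag that crawlers look for in HTML."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical example values, for illustration only.
example = dataset_jsonld(
    name="Example species interaction dataset",
    description="Placeholder description of an indexed interaction dataset.",
    url="https://example.org/datasets/example",
    keywords=["species interactions", "biodiversity"],
)

if __name__ == "__main__":
    print(as_script_tag(example))
```

A real record would likely need more properties (license, creator, identifier, distribution) to be useful for discovery; which ones CDIF requires is something to check against the discovery module draft.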

@jhpoelen
Member Author

@jhammock
Collaborator

They did take a little chunk of our structured data, back toward the beginning. I remember hearing that they contacted us. I think it was a special project on their end. Hand-picked slices of things like mammal size, processed almost manually and connections severed, as far as I could tell. A couple of recent Google samples suggest they're not using much of it any more. This was about eight years ago; one hopes it was an early experiment and their methods are very different now.

I don't know if this is relevant, but we definitely had to provide them with an updated sitemap following a major taxonomic update about six years ago. We preserved the taxon IDs as best we could at the time, but we weren't able to keep them all. There was some talk that perhaps Google would re-crawl and update, but after a few months they clearly hadn't (there were Google search results pointing to nonexistent EOL pages), and we concluded we'd best do it ourselves. I wasn't in on the sitemap design, but I think we provided the usual/basic/default stuff. It's organized in (I think) a sitemap index; anyway, it lists a bunch of sitemap files, which appear to provide all the URLs in the EOL website, with little else that I can read apart from language codes. It's at https://eol.org/data/sitemap/sitemap.xml.gz . I doubt that's directly relevant to GloBI, but what do I know? Maybe there's an analogous approach for data organized in something other than HTML pages.
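
To illustrate the sitemap index layout described above, here is a small sketch that reads a sitemap index and one of the sitemap files it lists, using only the Python standard library. The XML samples and example.org URLs are made up for illustration and follow the public sitemaps.org format; they are not copied from the EOL sitemap, whose exact contents are an assumption here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap index: a list of child sitemap files.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/sitemap/sitemap1.xml.gz</loc></sitemap>
  <sitemap><loc>https://example.org/sitemap/sitemap2.xml.gz</loc></sitemap>
</sitemapindex>"""

# Hypothetical child sitemap: a list of page URLs.
SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/pages/1</loc></url>
  <url><loc>https://example.org/pages/2</loc></url>
</urlset>"""


def locations(xml_text, entry_tag):
    """Return the <loc> values of every <entry_tag> element under the root."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(f"sm:{entry_tag}/sm:loc", NS)]


if __name__ == "__main__":
    for sitemap_url in locations(SAMPLE_INDEX, "sitemap"):
        print("sitemap file:", sitemap_url)
    for page_url in locations(SAMPLE_URLSET, "url"):
        print("page:", page_url)
```

For data that isn't organized as HTML pages, the same two-level structure could in principle list dataset landing pages instead; whether crawlers would make good use of that is an open question.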

@jhammock
Collaborator

Oh, I almost forgot. I only noticed this a few weeks ago. I have only guesswork, and maybe you've found these phenomena already, but it is at least current.

https://www.google.com/search?q=diaptomus+castaneti&oq=Diaptomus+castaneti

(try an incognito window if you want to see exactly what I saw, I guess)

I don't know what all is happening. EOL has done nothing deliberately to facilitate. Different species give different results and I haven't experimented much, but I'll bet if you try some obscure/neglected species you'll see a variety of databases represented in various ways.

The first EOL entry is straightforward enough. We generated that text from our structured data, but it was probably scraped from the html; I checked on it a few times and it included some accidental source text for a few days. Why it's a "featured snippet", I don't know. EOL text entries aren't always, even when they appear at the top of the results. You'll notice WoRMS is included twice, with the same taxon ID but, I guess, different search settings. GBIF appears twice with different taxon IDs. iNaturalist appears at least twice, same ID, different languages.

The second EOL entry is quite interesting. I had to burrow through source code to find the text they selected. If you want to see it, go to https://eol.org/pages/4293114/data (same taxon ID as the first result, different tab), and select "drag powered swimming" from the third record down.

Overall, I imagine... natural language processing, cruising through our nicely structured databases looking for friendly narrative text?

@jhpoelen
Member Author

@jhammock thanks for sharing your experience. I've attached a screenshot of the search results I got served by our Google friends. That is one cute looking copepod. I wish I was transparent - much easier to debug any internal issues my body may have ; )

[image: screenshot of the Google search results]
