SPARQL endpoint #963

Open
lisestork opened this issue Feb 20, 2024 · 11 comments

Comments

@lisestork

The SPARQL endpoint does not seem to contain all interactions available through the API. For example, the API has information about the garden tomato (https://api.globalbioticinteractions.org/findExternalUrlForTaxon/Solanum%20lycopersicum), but no interactions for the garden tomato are available via the SPARQL endpoint. Am I querying incorrectly, or does the SPARQL endpoint contain only a subset of the interactions? Moreover, the SPARQL endpoint has no human-readable labels, which hampers querying (since several different taxon IDs are used).
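
For anyone who wants to repeat the API lookup for other taxa, here is a minimal sketch that only builds the request URL quoted above. The helper name `external_url_lookup` is hypothetical, and nothing is assumed about the response format.

```python
from urllib.parse import quote

# Host and path are taken from the URL quoted in this comment.
API_BASE = "https://api.globalbioticinteractions.org"


def external_url_lookup(taxon_name: str) -> str:
    """Build the findExternalUrlForTaxon URL for a taxon name (hypothetical helper)."""
    # Percent-encode the name so that spaces become %20, as in the quoted URL.
    return f"{API_BASE}/findExternalUrlForTaxon/{quote(taxon_name, safe='')}"


if __name__ == "__main__":
    print(external_url_lookup("Solanum lycopersicum"))
```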

@jhpoelen
Member

jhpoelen commented Feb 20, 2024

Hi @lisestork

Thanks for asking about the GloBI sparql endpoint and associated data. I think you are one of the first to ask about this in the decade it's been available . . . ; )

Before I dig into this, can you tell me a little more about how you are planning to use the sparql endpoint?

@aahmeti

aahmeti commented Mar 21, 2024

I have been experimenting with the sparql endpoint for the last few days and have had a similar experience. The data seems to be incomplete, and the data model looks as if it was dumped from another data model, perhaps property graphs?

In any case, can you provide a sparql query that returns all the interactions for a particular species, just as I see them in the browser? No examples are provided.

I tried different queries to get all the interactions for the brown bear, for example. The results are incomplete, and some are incompatible with the interactions shown in the browser feature:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
select distinct ?species1 ?p1 ?species2 ?p2
where { 
    
    SERVICE <https://lod.globalbioticinteractions.org/globi/sparql> {
        
        
        {
        VALUES ?ursusArctos {  <https://www.wikidata.org/wiki/Q243359>  <https://www.wikidata.org/wiki/Q44847189> <https://www.wikidata.org/wiki/Q36341> }
            
        ?s <http://purl.obolibrary.org/obo/RO_0002350> ?species2 .  # member of
        ?species2 owl:sameAs ?ursusArctos . 
        ?q ?r ?s .
        ?q <http://purl.obolibrary.org/obo/RO_0000057> ?a1 . # has participant
        ?q <http://purl.obolibrary.org/obo/RO_0000057> ?a2 . # has participant
        filter (?a1 != ?a2)
        ?a1 <http://purl.obolibrary.org/obo/RO_0002350> ?species1 . 
        ?a2 <http://purl.obolibrary.org/obo/RO_0002350> ?species2 . 
        ?a1 ?p1 ?o1 . FILTER (!regex(str(?p1), "type")) FILTER (!regex(str(?p1), "RO_0002350"))
        ?a2 ?p2 ?o2 . FILTER (!regex(str(?p2), "type")) FILTER (!regex(str(?p2), "RO_0002350"))
        #FILTER (str(?species1) < str(?species2))
        }
       
    }
} limit 100

or

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select distinct ?wikidata ?ursusArctos
where { 
    
    SERVICE <https://lod.globalbioticinteractions.org/globi/sparql> {
        
        {
            ?prey <http://purl.obolibrary.org/obo/RO_0002350> ?member1 .
            ?member1 (owl:sameAs)* ?wikidata . FILTER regex(str(?wikidata), "https://www.wikidata.org/wiki/")
            
            VALUES ?ursusArctos { <https://www.wikidata.org/wiki/Q243359> <https://www.wikidata.org/wiki/Q44847189> } 

            ?prey <http://purl.obolibrary.org/obo/RO_0002471> ?predator . # eaten by 
            ?predator <http://purl.obolibrary.org/obo/RO_0002350> ?member2 . 
            ?member2 (owl:sameAs|^owl:sameAs)* ?ursusArctos . 
        }
    }
} limit 100

Also, the endpoint times out after 60 seconds, which is especially problematic with (owl:sameAs|^owl:sameAs)* reasoning.
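
To try such queries directly against the endpoint (without a federated SERVICE clause) and with an explicit client-side timeout, something like the following sketch could be used. It assumes the endpoint accepts SPARQL 1.1 Protocol POST requests with a URL-encoded `query` parameter and can return JSON results; neither has been confirmed for this endpoint. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint URL taken from the SERVICE clauses in the queries above.
ENDPOINT = "https://lod.globalbioticinteractions.org/globi/sparql"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Wrap a SPARQL query in a form-encoded POST request (assumed to be supported)."""
    body = urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json",  # assumed result format
    }
    return Request(endpoint, data=body, headers=headers, method="POST")


def run_query(query: str, timeout_s: float = 60.0) -> bytes:
    """Send the query and return the raw response body.

    An exception is raised if the request fails or exceeds the client-side timeout.
    """
    with urlopen(build_sparql_request(query), timeout=timeout_s) as response:
        return response.read()
```

The client-side timeout only controls how long the caller waits; it does not change the 60-second limit on the server.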

@jhpoelen
Member

@lisestork @aahmeti thanks again for your interest. As I mentioned before, you are among the first to ask about the sparql endpoint that GloBI has had for about a decade.

It'd be helpful if you can provide some context on how you are planning to use the sparql endpoint. Also, please let me know if you are willing to contribute to possible modeling tweaks to the rdf/nquad versions of the GloBI interaction data. The model implemented today hasn't been touched for quite a while and could probably use some TLC.

Thanks for being patient and for sharing your concern.

@aahmeti

aahmeti commented Mar 22, 2024

@jhpoelen thanks for the quick reply. From my side, as I also tried to explain with my queries, the endpoint needs to be able to answer the simple question "give me all the interactions for a particular species" and return the answers we see in the "browser" mode. Thanks!

Edit: I'd be curious whether you have any SPARQL queries that you can share with us.

@jhpoelen
Member

@aahmeti thanks for sharing your desires and questions.

First, see https://github.com/globalbioticinteractions/globalbioticinteractions/wiki#accessing-species-interaction-data for some documentation about various access methods.

Also, you can find the triples loaded into the triple store in the resource interactions.nq.gz at https://www.globalbioticinteractions.org/data or https://zenodo.org/record/8284068/files/interactions.nq.gz . The data is generated in https://github.com/globalbioticinteractions/globalbioticinteractions/blob/837093955f9a543e6a903d27c0dadceb249bf6b8/eol-globi-neo4j-index-export/src/main/java/org/eol/globi/export/ExporterRDF.java and uses a model like:

organism A classified as taxon X
organism B classified as taxon Y
organism A interacts with organism B

where "interacts with" can be OBO Relation Ontology terms.
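
As a rough illustration of how this three-part model maps onto a query, here is a sketch that builds a SPARQL query string from a taxon IRI and an interaction term. It assumes that "classified as" is expressed with RO_0002350, as in the queries posted earlier in this thread; the function and variable names are hypothetical, and the exporter itself has not been checked.

```python
# Assumption: organisms are linked to their taxon via RO_0002350,
# the predicate used for that purpose in the queries earlier in this thread.
MEMBER_OF = "http://purl.obolibrary.org/obo/RO_0002350"
# Labelled "eaten by" in one of the earlier queries; used here only as an example.
EATEN_BY = "http://purl.obolibrary.org/obo/RO_0002471"

_FORBIDDEN = set('<>"{}|^`\\')


def _check_iri(iri: str) -> str:
    """Reject strings that cannot be written safely between < and > in SPARQL."""
    if not iri or any(ch in _FORBIDDEN or ord(ch) <= 0x20 for ch in iri):
        raise ValueError(f"not usable as an IRI in a query: {iri!r}")
    return iri


def interactions_query(taxon_iri: str, interaction_iri: str, limit: int = 100) -> str:
    """Build a query that follows the organism/taxon/interaction model."""
    taxon = _check_iri(taxon_iri)
    interaction = _check_iri(interaction_iri)
    return (
        "SELECT DISTINCT ?organismA ?organismB ?taxonB WHERE {\n"
        f"  ?organismA <{MEMBER_OF}> <{taxon}> .  # organism A classified as taxon X\n"
        f"  ?organismB <{MEMBER_OF}> ?taxonB .  # organism B classified as taxon Y\n"
        f"  ?organismA <{interaction}> ?organismB .  # organism A interacts with organism B\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # http://example.org/taxon/X is a placeholder, not a real GloBI taxon IRI.
    print(interactions_query("http://example.org/taxon/X", EATEN_BY))
```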

Just curious -

Why not just use the GloBI REST API instead? This is used by the "browser" mode.

What is your particular reason for using SPARQL?

Thanks for being patient with me as I am trying to understand your data access constraints.

Curious to hear your thoughts
-jorrit

@aahmeti

aahmeti commented Mar 22, 2024

Thanks for providing the links. I had loaded the file locally into my GraphDB triple store, but that did not give me more complete interactions. It looks like the only viable solution is the API, which is the route I am going to take. The reason I use the SPARQL endpoint is that all my data is stored as RDF, and that is what I use for data integration; with other formats (CSV/JSON) I need to wrangle the data and transform it to RDF, which is an extra step.

@jhpoelen
Member

jhpoelen commented Mar 22, 2024

@aahmeti thanks for sharing your data integration methods. I can see how it'd be easier to integrate via rdf if you are already using a graphdb triple store.

Please note that individual datasets have "nanopub" endpoints. These contain rdf snippets in the form of nanopubs.

You can find these nanopub archives at https://globalbioticinteractions.org/datasets via the "nanopub" badge.

The individual nanopub archives can also be reached directly via:

https://depot.globalbioticinteractions.org/reviews/AgentschapPlantentuinMeise/ashForestInteractions/nanopub.trig.gz

where AgentschapPlantentuinMeise/ashForestInteractions is the namespace of the dataset.

You can find a list of all dataset namespaces via:

https://depot.globalbioticinteractions.org/snapshot/target/data/tsv/datasets.tsv

or

https://depot.globalbioticinteractions.org/snapshot/target/data/csv/datasets.csv

E.g.,

curl --silent "https://depot.globalbioticinteractions.org/snapshot/target/data/csv/datasets.csv"\
 | head
namespace
AgentschapPlantentuinMeise/ashForestInteractions
BDMYRepository/Echino-Interactions
BDMYRepository/Paguroidea-Mollusca-Interactions
BDMYRepository/Sponge_Interactions
Big-Bee-Network/bee-interaction-database
BugGuy/Megalomyrmex-interactions
CALeDNA/Klamath-mountains
CEIDatUGA/Fournie-etal_2015-interactions
EMTuckerLab/ummzi
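
The same steps can be sketched in Python: read the namespace listing and derive the nanopub archive URL for each entry. This assumes the CSV has a header column named `namespace`, as the output above suggests, and that every namespace follows the same URL pattern as the example archive; whether every dataset actually has an archive is not confirmed here. The function names are hypothetical.

```python
import csv
import io
from typing import List

# Host taken from the URLs above.
DEPOT = "https://depot.globalbioticinteractions.org"


def read_namespaces(csv_text: str) -> List[str]:
    """Return the values of the 'namespace' column from datasets.csv content."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["namespace"] for row in reader if row.get("namespace")]


def nanopub_url(namespace: str) -> str:
    """Build the nanopub archive URL, following the pattern of the example above."""
    return f"{DEPOT}/reviews/{namespace}/nanopub.trig.gz"


if __name__ == "__main__":
    # First lines of the listing shown above, used as sample input.
    sample = (
        "namespace\n"
        "AgentschapPlantentuinMeise/ashForestInteractions\n"
        "BDMYRepository/Echino-Interactions\n"
    )
    for name in read_namespaces(sample):
        print(nanopub_url(name))
```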

Happy to work with you to come up with an rdf shape that would work for you. If you provide an example of the shape you'd enjoy working with, perhaps we can incorporate that in existing GloBI data products or create a new one.

Again, apologies for having to deal with all the "dust" collected on the largely unused RDF perspective onto GloBI. Maybe this is an opportunity to blow some life into that aspect of GloBI.

Thanks for being patient.

@aahmeti

aahmeti commented Mar 22, 2024

Wow, this looks promising! Thank you, Sir! 😎

I downloaded one of those .trig files related to "grizzly", and with the following SPARQL query I already got 255 interactions, many more than I had before!

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s1 ?p ?s2 
WHERE
{
    {
        filter (?p = <http://purl.obolibrary.org/obo/RO_0002470>) # eats
        { ?species1 ?p ?species2 .  } UNION { ?species2 ?p ?species1  }
        ?species1 rdfs:label "Ursus arctos" . 
        ?species2 rdfs:label ?s2 .    
        BIND ("Ursus arctos" as ?s1)
    }

}

So this means that if I import all those nanopub archives, I will arrive at the complete numbers, matching the ones shown in the "browser", correct? Do I still have to go through species nomenclature mapping, or am I good if I just use the Latin name "Ursus arctos" and get all the aggregated data at this point?

I think this is the way to go: I can now run a set of INSERT queries and change the data model the way I see fit. If you want, I can write a guide that, after you verify and proofread it, you can put in the list of guides. What do you say?

@jhpoelen
Member

So this means that if I import all those nanopub archives, I will arrive at the complete numbers, matching the ones shown in the "browser", correct? Do I still have to go through species nomenclature mapping, or am I good if I just use the Latin name "Ursus arctos" and get all the aggregated data at this point?

The nanopubs use the names as provided by the data source, so they do not include name alignment, at least not yet . . . If a data source provides some kind of taxon id (e.g., NCBI:9606 for Homo sapiens), the nanopubs may include it; otherwise, only the provided names are included.

Note that, if you'd like, you can use a tool like Nomer, an associated tool https://github.com/globalbioticinteractions/name-alignment-template, or your own methods to add the linkages.

Also note that the "Browse" results are generated after GloBI's name alignment process, so the results may vary a little, depending on the effect of name alignment processes. For more information see https://globalbioticinteractions.org/process .

I think this is the way to go: I can now run a set of INSERT queries and change the data model the way I see fit. If you want, I can write a guide that, after you verify and proofread it, you can put in the list of guides. What do you say?

I very much like your idea, and I'd be happy to review a pull request on the "how-to" page https://github.com/globalbioticinteractions/globalbioticinteractions.github.io/blob/main/how-to.md created by @EMTuckerLab .

Curious to hear what you come up with, and open to suggestions / comments / questions that you may have.

@jhpoelen
Member

hey @aahmeti @lisestork - I've upgraded GloBI to have more capacity, and am hoping to try to host more of the GloBI data through a sparql endpoint. If you are willing to host a full copy of the GloBI triples in one of your triple stores, please do let me know!

Thanks for being patient.

@jhpoelen
Member

jhpoelen commented May 3, 2024

A full copy (6.6GB compressed) of the interactions.nq.gz is now available via https://depot.globalbioticinteractions.org/snapshot/target/data/interactions.nq.gz

I am open to suggestions on how to host these 2.2 billion triples in a triple store of sorts. Perhaps an option would be to generate the data in a modular way, one per indexed dataset perhaps.
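
One simple way to handle a file of this size on the loading side is to stream it in fixed-size batches, so that neither the compressed file nor the statements need to fit in memory at once. The sketch below does only that; it is a plain batching approach, not the per-dataset partitioning suggested above, and the function names are hypothetical. It relies on N-Quads being line-based (one statement per line).

```python
import gzip
from itertools import islice
from typing import Iterator, List


def iter_statements(path: str) -> Iterator[str]:
    """Yield non-empty, non-comment lines from a gzip-compressed N-Quads file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                yield stripped


def iter_batches(path: str, batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size statements, reading the file lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    statements = iter_statements(path)
    while True:
        batch = list(islice(statements, batch_size))
        if not batch:
            return
        yield batch
```

Each batch could then be written to its own file or sent to a triple store's bulk-load interface, depending on what the store supports.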
