Renaming SubgraphMatrix #273

Mec-iS · 2022-09-07T11:23:33Z

SubGraphMatrix (and, looking ahead, SubGraphTensor) is the reference class for graph algebra and network analysis. Would it be better to rename the subg.py module and the classes in it to reflect a more general approach?

For example:

- subg.py -> ?
- SubGraphMatrix: keeping in mind that a SPARQL query is needed (hence the "subgraph" naming), it would be better from a data scientist's point of view for this class to follow a more popular convention, for example GraphFrame, DataGraph, or NetFrame (just roll a die over the naming permutations; I thought about these names: graph, frame, net, datagram, data, table, ...)
- SubGraphTensor -> as above, for future applications

cc: @ceteri @tomaarsen


ceteri · 2022-09-16T23:10:18Z

This is a good question.

One caveat is that the name subgraph may be conflating two important features for our library:

- providing transform() and inverse_transform() methods to convert between a graph and some target algebraic object
- projecting subsets of a graph into some special usage

For the algebraic objects, there are three possible transforms, as shown in https://derwen.ai/s/kcgh#37

- vector
- matrix
- tensor

Different library integrations and applications will need to mix & match different cases of these.
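As a rough illustration of the first feature, here is a minimal sketch of a transform() / inverse_transform() pair that maps graph nodes to integer indices and back, which is the "vector" case. The class name `NodeIndexer` and the sample node names are hypothetical and are not taken from kglab's API.

```python
from typing import Dict, Hashable, List


class NodeIndexer:
    """Hypothetical helper: maps graph nodes to integer ids and back."""

    def __init__(self) -> None:
        self._ids: Dict[Hashable, int] = {}
        self._nodes: List[Hashable] = []

    def transform(self, node: Hashable) -> int:
        """Return the integer id for a node, assigning a new id if unseen."""
        if node not in self._ids:
            self._ids[node] = len(self._nodes)
            self._nodes.append(node)
        return self._ids[node]

    def inverse_transform(self, node_id: int) -> Hashable:
        """Return the node that was assigned the given integer id."""
        return self._nodes[node_id]


indexer = NodeIndexer()
# the sample node names are made up for illustration
vector = [indexer.transform(n) for n in ["ex:alice", "ex:bob", "ex:alice"]]
round_trip = [indexer.inverse_transform(i) for i in vector]
```

The same id assignment could back the matrix and tensor cases, since each only needs a stable mapping from nodes to positions along an axis.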

The term subgraph has a meaning in W3C standards, which use labels to denote subsets of triples: https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-mt/index.html#notation Labeled property graphs have related features, ergo the "label" notation. In either case, this definition has become rather archaic: it's too explicit, often seriously constrained in practice (both SPARQL and Cypher had odd limitations), and not quite what's needed in a world where ML applications are widespread.

A more contemporary definition – and what's intended here – is that some repeatable process can be used to identify regions of interest within a relatively larger overall graph. It's important to note that a subgraph could be produced by several competing or even conflicting means other than explicit labels or other annotations: declarative (queries), empirical (Graph ML), algorithmic (connected components), probabilistic (PSL), topological, counterfactuals, etc.

In our SubGraphMatrix example we use a SPARQL query, then construct the subgraph from the query result set. The results from applying a SHACL rule set or from PSL analysis could be other forms of subgraphs. Motifs from GNNs are another closely related notion.
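The step from a query result set to a matrix could look roughly like the following. This is a sketch under assumptions: the rows are hard-coded stand-ins for what a SPARQL query might return (subject and object pairs), and `rows_to_adjacency` is a hypothetical helper name rather than anything in kglab.

```python
from typing import Dict, List, Sequence, Tuple


def rows_to_adjacency(
    rows: Sequence[Tuple[str, str]],
) -> Tuple[List[str], List[List[int]]]:
    """Build a dense adjacency matrix from (subject, object) pairs."""
    index: Dict[str, int] = {}
    for subj, obj in rows:
        for node in (subj, obj):
            # assign ids in order of first appearance
            index.setdefault(node, len(index))

    size = len(index)
    matrix = [[0] * size for _ in range(size)]
    for subj, obj in rows:
        matrix[index[subj]][index[obj]] = 1

    nodes = sorted(index, key=index.get)
    return nodes, matrix


# stand-in for a query result set; real rows would come from a SPARQL engine
rows = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:a", "ex:c")]
nodes, matrix = rows_to_adjacency(rows)
```

Rows produced by a SHACL validation or a PSL analysis could feed the same function, as long as they can be reduced to pairs of nodes.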

So one might think of industry use cases for KGs, where there's some very large graph, but then particular data objects which are subgraphs that have been constructed from some common set of definitions. These data objects might be repeated many times, for example with a Bill of Materials within customer data.

We've had some lively discourse among researchers who are actively pursuing research in this area, and applications that would fit. FWIW, I started out with a reinforcement learning demo for topological categories.

I'd definitely loop in: @maparent @jmueller5 @mbesta @jelisf @paoespinozarias @neobernad

Thinking about subgraphs has certainly evolved much since these library components were named in late 2020, with many thanks to @jmueller5 as the prime force of nature for pragmatic ideas about leveraging subgraphs!

Here's a summary of different possible subgraph construction approaches we've encountered, so far:

Elements for formal descriptions of subgraphs

explicit
- enumerations of { nodes } ^ { edges } ^ { props }
- labels (RDF or Cypher)
- mask (NVIDIA)
algorithmic
- global indices, e.g., identified by starting nodes
- connected components
- boundaries
declarative
- shape constraint rule set (SHACL)
- query result set (SPARQL, Cypher)
algebraic
- category theory (magmas e.g., as in algebird)
- using some approach based on approximation algorithms?
empirical
- motifs classified by ML models
- annotation, potentially from some HITL process such as weak supervision
topological
- persistent homology ("top down")
- emergent patterns from a census (see RL example in the code)
- other computational TDA approaches?
probabilistic
- results from a PSL rule set => uncertainty measures
- set of counterfactuals
- causality models?
misc
- other means of describing patterns that apply to graphs?

The core idea is we must be able to blend any of the above.
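One way to picture that blending is to treat each construction approach as a function from a graph to a set of triples, then combine the results. The sketch below is only an illustration: the names `by_predicate`, `connected_to`, and `blend` are hypothetical, the graph is a plain set of string triples, and intersection is just one possible way to blend (union or difference would work the same way).

```python
from typing import Callable, Dict, Set, Tuple

Triple = Tuple[str, str, str]
Selector = Callable[[Set[Triple]], Set[Triple]]


def by_predicate(predicate: str) -> Selector:
    """Declarative-style selector: keep triples with the given predicate."""
    return lambda graph: {t for t in graph if t[1] == predicate}


def connected_to(start: str) -> Selector:
    """Algorithmic selector: keep triples in the component containing start."""

    def select(graph: Set[Triple]) -> Set[Triple]:
        neighbors: Dict[str, Set[str]] = {}
        for subj, _, obj in graph:
            neighbors.setdefault(subj, set()).add(obj)
            neighbors.setdefault(obj, set()).add(subj)
        seen, stack = {start}, [start]
        while stack:
            for nxt in neighbors.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return {t for t in graph if t[0] in seen}

    return select


def blend(*selectors: Selector) -> Selector:
    """Combine selectors by intersecting their results."""

    def select(graph: Set[Triple]) -> Set[Triple]:
        result = set(graph)
        for selector in selectors:
            result &= selector(graph)
        return result

    return select


graph = {("a", "knows", "b"), ("b", "likes", "c"), ("x", "knows", "y")}
region = blend(connected_to("a"), by_predicate("knows"))(graph)
```

Here `region` holds only the "knows" triples that sit in the same connected component as node "a".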

On the one hand I want to be careful not to introduce misnomers (e.g., my conflation of "transform" vs. "subset" operations).

On the other hand, we should not optimize this area to be too specific to a given instance (e.g., SPARQL => matrix).

And (speaking as a person with formal math background who loves functional programming) we should not let the Linear Algebra camp dictate definitions ;) Decades of that got us into the current mess! I would much rather follow the brilliant lessons from projects such as algebird

Mec-iS · 2022-09-19T15:47:01Z

Thanks for summing up the scope so well, this will keep the discussion on the right footing.

> In our SubGraphMatrix example we use a SPARQL query, then construct the subgraph from the query result set. The results from applying a SHACL rule set or from PSL analysis could be other forms of subgraphs. Motifs from GNNs are another closely related notion.

Ok so it would be better to have pluggable classes that inherit somehow from SubgraphMatrix to add some polymorphism in the argument taken, like for example:

kglab.subg.SPARQLmatrix or maybe better kglab.subg.from_SPARQL
kglab.subg.SHACLmatrix
kglab.subg.MLmatrix
...
The "subgraph" part of the name is so generally applicable that it could simply be implied, and documented in the parent class with a reference to the W3C definition.
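A minimal sketch of that kind of polymorphism, assuming a shared parent that owns the matrix construction while each subclass only decides how edges are selected. The class names echo the ones floated above, but everything here is hypothetical: `SubgraphMatrixBase` is a stand-in and does not reflect the actual kglab SubgraphMatrix, and the inputs are plain Python data instead of real SPARQL or ML results.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class SubgraphMatrixBase(ABC):
    """Hypothetical parent class; the real SubgraphMatrix may differ."""

    @abstractmethod
    def select_pairs(self) -> List[Tuple[str, str]]:
        """Return the (source, target) pairs that define the subgraph."""

    def to_matrix(self) -> List[List[int]]:
        """Build an adjacency matrix over the sorted node names."""
        pairs = self.select_pairs()
        nodes = sorted({node for pair in pairs for node in pair})
        pos = {node: i for i, node in enumerate(nodes)}
        matrix = [[0] * len(nodes) for _ in nodes]
        for source, target in pairs:
            matrix[pos[source]][pos[target]] = 1
        return matrix


class SPARQLMatrix(SubgraphMatrixBase):
    """Stand-in: takes rows as if they were a SPARQL result set."""

    def __init__(self, rows: Sequence[Tuple[str, str]]) -> None:
        self.rows = rows

    def select_pairs(self) -> List[Tuple[str, str]]:
        return list(self.rows)


class MLMatrix(SubgraphMatrixBase):
    """Stand-in: keeps edges whose model score passes a threshold."""

    def __init__(
        self, scored: Sequence[Tuple[str, str, float]], threshold: float = 0.5
    ) -> None:
        self.scored = scored
        self.threshold = threshold

    def select_pairs(self) -> List[Tuple[str, str]]:
        return [(s, t) for s, t, score in self.scored if score >= self.threshold]


sparql_matrix = SPARQLMatrix([("a", "b")]).to_matrix()
ml_matrix = MLMatrix([("a", "b", 0.9), ("b", "a", 0.1)]).to_matrix()
```

A factory function in the style of `from_SPARQL` would be an alternative to subclassing; the trade-off is mostly about where the dispatch logic lives.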

> And (speaking as a person with formal math background who loves functional programming) we should not let the Linear Algebra camp dictate definitions ;) Decades of that got us into the current mess! I would much rather follow the brilliant lessons from projects such as algebird

I support this approach ~~but then we should drop the Matrix for a generic Subgraph that would be an alias for KnowledgeGraph so to have a "recursive" representation.~~

My main question was about having a clear entry point to all these functionalities via the instantiation of a generic class (like DataFrame in pandas; for example GFrame, SubGraphFrame, or anything that keeps the semantic relevance), so as to have:

# kglab.<Frame>

class <Frame>:
     # attributes are underscored so they do not clash with the accessor methods
     _graph: KnowledgeGraph
     _subgraph: KnowledgeGraph    # or a more relevant alias like `SubGraph`
     _subgraphmtx: SubgraphMatrix

     def __init__(...):
         # creates the knowledge graph
         self._graph = KnowledgeGraph(...)
         ...

     def graph(self) -> KnowledgeGraph:
         return self._graph

     def subgraph(self) -> KnowledgeGraph:
         return self._subgraph

     ...

     def _get_subg_linear(self, query, ...) -> SubgraphMatrix:
         matrix = None
         if is_sparql(query):
             matrix = SPARQLmatrix(...)  # or `subg.from_SPARQL`
             # keep a reference to the matrix on the frame
             self._subgraphmtx = matrix
         elif is_shacl(query):
             ...
         return matrix

This would keep the current workflow based on KnowledgeGraph, while also providing a more consistent experience for users who want to work with subgraphs without caring too much about the reference graph: a more "operational" entry point based on the access patterns of other well-established Python libraries.
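To make the access pattern concrete, here is a small runnable sketch of such an entry point. It is an assumption-laden toy: `Frame` is just one of the names floated in this thread, the graph is a plain set of string triples instead of a KnowledgeGraph, and the two selector kinds are stand-ins for the SPARQL and SHACL branches.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Frame:
    """Hypothetical entry point that owns a graph and a current subgraph."""

    graph: Set[Triple]
    subgraph: Set[Triple] = field(default_factory=set)

    def select(self, kind: str, value: str) -> "Frame":
        """Dispatch on the selector kind and store the resulting subgraph."""
        if kind == "predicate":  # stand-in for the SPARQL branch
            self.subgraph = {t for t in self.graph if t[1] == value}
        elif kind == "subject":  # stand-in for another branch, e.g. SHACL
            self.subgraph = {t for t in self.graph if t[0] == value}
        else:
            raise ValueError(f"unsupported selector kind: {kind}")
        return self

    def nodes(self) -> List[str]:
        """Return the sorted node names of the current subgraph."""
        return sorted({n for s, _, o in self.subgraph for n in (s, o)})


frame = Frame({("a", "knows", "b"), ("b", "likes", "c")})
knows_nodes = frame.select("predicate", "knows").nodes()
```

Returning `self` from `select` allows chained calls, similar in spirit to how method chaining reads in pandas.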

Anyway, if subg.py has relevance as you pointed out (taking from the W3C definition), it would be better to rename it to subgraph.py

EDITED

ceteri · 2022-09-19T17:29:03Z

Excellent plan!

Having a Frame class works well. Connotations of the word "Frame" (more general than "DataFrame" which is a table) fit well here.

And I really agree with what you pointed out about the name "matrix", which does lead to confusion when people don't have exposure to algebraic graph theory.

How about if we used naming conventions similar to NumPy?

instead of "vector" => "1D"
instead of "matrix" => "2D"
instead of "tensor" => "ND"

tomaarsen · 2022-09-19T17:33:48Z

Both naming conventions are clear, but vector/matrix/tensor are generally understood concepts within all of computer science, while "ND" without any context may be somewhat confusing. That said, the latter is shorter to write... haha

mbesta · 2022-09-21T17:28:02Z

Hello, thanks for sharing. I'm working on closely related stuff these days (will send a link once it's out). Best, Maciej


Maciej Besta, https://people.inf.ethz.ch/bestam, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, Zurich-8092, Switzerland


Mec-iS added the help wanted Extra attention is needed label Sep 7, 2022

ceteri added the question Further information is requested label Sep 16, 2022

Mec-iS mentioned this issue Oct 6, 2022

implement Frame in place of Subgraph #277

Open
