Renaming SubgraphMatrix #273

Open
Mec-iS opened this issue Sep 7, 2022 · 5 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@Mec-iS
Contributor

Mec-iS commented Sep 7, 2022

SubgraphMatrix (and, prospectively, SubgraphTensor) is the reference class for graph algebra and network analysis. Would it be better to rename the subg.py module and the classes in it to reflect a more general approach?

For example:

  • subg.py -> ?
  • SubgraphMatrix: keeping the fact that a SPARQL query is needed (hence the subgraph naming), it would be better from a data scientist's point of view to have this class follow a more popular convention, for example GraphFrame, DataGraph, or NetFrame (just throw a die with the right naming permutations; I thought about these names: graph, frame, net, datagram, data, table, ...)
  • SubgraphTensor -> as above, for future applications

cc: @ceteri @tomaarsen

@Mec-iS Mec-iS added the help wanted Extra attention is needed label Sep 7, 2022
@ceteri ceteri added the question Further information is requested label Sep 16, 2022
@ceteri
Collaborator

ceteri commented Sep 16, 2022

This is a good question.

One caveat is that the name subgraph may be conflating two important features for our library:

  1. providing transform() and inverse_transform() methods to convert between a graph and some target algebraic object.
  2. projecting subsets of a graph into some special usage
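The first feature can be illustrated with a minimal sketch. The `NodeIndex` class below is hypothetical and is not the kglab implementation; it only assumes that `transform()` maps a node to a stable integer id and that `inverse_transform()` maps the id back.

```python
from typing import Dict, Hashable, List


class NodeIndex:
    """Hypothetical helper: maps graph nodes to integer ids and back."""

    def __init__(self) -> None:
        self._ids: Dict[Hashable, int] = {}
        self._nodes: List[Hashable] = []

    def transform(self, node: Hashable) -> int:
        """Return the integer id for a node, assigning a new one if unseen."""
        if node not in self._ids:
            self._ids[node] = len(self._nodes)
            self._nodes.append(node)
        return self._ids[node]

    def inverse_transform(self, node_id: int) -> Hashable:
        """Return the node that was assigned the given integer id."""
        return self._nodes[node_id]


index = NodeIndex()
alice = index.transform("ex:alice")  # made-up identifiers
bob = index.transform("ex:bob")
```

The integer ids are what a vector, matrix, or tensor would be indexed by, which is why this step is separate from the question of which subset of the graph gets projected.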

For the algebraic objects, there are three possible transforms, as shown in https://derwen.ai/s/kcgh#37

  • vector
  • matrix
  • tensor

Different library integrations and applications will need to mix & match different cases of these.


The term subgraph has a meaning in W3C, using labels to denote subsets of triples: https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-mt/index.html#notation Labeled property graphs have related features, ergo the "label" notation. In either case, this definition has become rather archaic: it's too explicit, often seriously constrained in practice (either SPARQL or Cypher had odd limitations), and not really quite what's needed in a world where ML applications are widespread.

A more contemporary definition – and what's intended here – is that some repeatable process can be used to identify regions of interest within a relatively larger overall graph. It's important to note that a subgraph could be produced by several competing or even conflicting means other than explicit labels or other annotations: declarative (queries), empirical (Graph ML), algorithmic (connected components), probabilistic (PSL), topological, counterfactuals, etc.

In our SubgraphMatrix example we use a SPARQL query, then construct the subgraph from the query result set. The results from applying a SHACL rule set or from PSL analysis could be other forms of subgraphs. Motifs from GNNs are another closely related notion.
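A rough sketch of the "query result set, then subgraph, then matrix" flow, using only the standard library. Everything here is an assumption for illustration: `select()` stands in for running a SPARQL query, the triples are made up, and none of this is the kglab API.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Toy graph; the identifiers are made up for illustration.
GRAPH: Set[Triple] = {
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:a", "ex:name", "Alice"),
}


def select(graph: Set[Triple], pattern: Callable[[Triple], bool]) -> List[Triple]:
    """Stand-in for running a query: return the matching triples, sorted."""
    return sorted(t for t in graph if pattern(t))


def to_adjacency(rows: List[Triple]) -> Tuple[List[str], List[List[int]]]:
    """Build a dense adjacency matrix from (subject, predicate, object) rows."""
    nodes = sorted({s for s, _, _ in rows} | {o for _, _, o in rows})
    pos = {n: i for i, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for s, _, o in rows:
        matrix[pos[s]][pos[o]] = 1
    return nodes, matrix


rows = select(GRAPH, lambda t: t[1] == "ex:knows")
nodes, adjacency = to_adjacency(rows)
```

Swapping `select()` for a SHACL or PSL step would leave `to_adjacency()` unchanged, which is the separation between "subset" and "transform" discussed above.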

So one might think of industry use cases for KGs, where there's some very large graph, but then particular data objects which are subgraphs that have been constructed from some common set of definitions. These data objects might be repeated many times, for example with Bill of Materials within customer data.

We've had some lively discourse among researchers who are actively pursuing research in this area, and applications that would fit. FWIW, I started out with a reinforcement learning demo for topological categories.

I'd definitely loop in: @maparent @jmueller5 @mbesta @jelisf @paoespinozarias @neobernad

Thinking about subgraphs has certainly evolved much since these library components were named in late 2020, with many thanks to @jmueller5 as the prime force of nature for pragmatic ideas about leveraging subgraphs!

Here's a summary of different possible subgraph construction approaches we've encountered, so far:

Elements for formal descriptions of subgraphs

  • explicit

    • enumerations of { nodes } ^ { edges } ^ { props }
    • labels (RDF or Cypher)
    • mask (NVIDIA)
  • algorithmic

    • global indices, e.g., identified by starting nodes
    • connected components
    • boundaries
  • declarative

    • shape constraint rule set (SHACL)
    • query result set (SPARQL, Cypher)
  • algebraic

    • category theory (magmas e.g., as in algebird)
    • using some approach based on approximation algorithms?
  • empirical

    • motifs classified by ML models
    • annotation, potentially from some HITL process such as weak supervision
  • topological

    • persistent homology ("top down")
    • emergent patterns from a census (see RL example in the code)
    • other computational TDA approaches?
  • probabilistic

    • results from a PSL rule set => uncertainty measures
    • set of counterfactuals
    • causality models?
  • misc

    • other means of describing patterns that apply to graphs?

The core idea is we must be able to blend any of the above.
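One way to picture "blending" is to treat every construction approach as a function from a graph to a subset of its triples, so that approaches compose with set operations. This is only a sketch under that assumption; the names `Selector`, `by_predicate`, `reachable_from`, `both`, and `either` are hypothetical.

```python
from typing import Callable, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
Selector = Callable[[Graph], Graph]


def by_predicate(predicate: str) -> Selector:
    """Declarative-style selector: keep triples with the given predicate."""
    return lambda g: frozenset(t for t in g if t[1] == predicate)


def reachable_from(start: str) -> Selector:
    """Algorithmic-style selector: keep triples whose subject is reachable from start."""
    def run(g: Graph) -> Graph:
        seen, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            for s, _, o in g:
                if s == node and o not in seen:
                    seen.add(o)
                    frontier.append(o)
        return frozenset(t for t in g if t[0] in seen)
    return run


def both(a: Selector, b: Selector) -> Selector:
    """Blend two selectors by intersecting their results."""
    return lambda g: a(g) & b(g)


def either(a: Selector, b: Selector) -> Selector:
    """Blend two selectors by taking the union of their results."""
    return lambda g: a(g) | b(g)


graph: Graph = frozenset({
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:x", "ex:knows", "ex:y"),
    ("ex:a", "ex:name", "Alice"),
})
blended = both(by_predicate("ex:knows"), reachable_from("ex:a"))(graph)
```

Probabilistic or empirical approaches would need more than plain set membership (for example, a score per triple), so this shape covers only the crisp cases.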


On the one hand I want to be careful not to introduce misnomers (e.g., my conflation of "transform" vs. "subset" operations).

On the other hand, we should not optimize this area to be too specific to a given instance (e.g., SPARQL => matrix).

And (speaking as a person with formal math background who loves functional programming) we should not let the Linear Algebra camp dictate definitions ;) Decades of that got us into the current mess! I would much rather follow the brilliant lessons from projects such as algebird

@Mec-iS
Contributor Author

Mec-iS commented Sep 19, 2022

Thanks for summing up the scope so well, this will keep the discussion on the right footing.

In our SubgraphMatrix example we use a SPARQL query, then construct the subgraph from the query result set. The results from applying a SHACL rule set or from PSL analysis could be other forms of subgraphs. Motifs from GNNs are another closely related notion.

Ok so it would be better to have pluggable classes that inherit somehow from SubgraphMatrix to add some polymorphism in the argument taken, like for example:

  • kglab.subg.SPARQLmatrix or maybe better kglab.subg.from_SPARQL
  • kglab.subg.SHACLmatrix
  • kglab.subg.MLmatrix
  • ...
    the naming part of "subgraph" being so generally applicable to be implied and referenced to the W3C definition in the parent class.
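The pluggable idea might look roughly like the sketch below. All class names are hypothetical placeholders rather than existing kglab classes, and the "query" and "ML" cases are reduced to plain Python callables and lists so the example runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]


class SubgraphSource(ABC):
    """Hypothetical parent class: anything that can yield subgraph triples."""

    @abstractmethod
    def triples(self) -> List[Triple]:
        ...

    def node_count(self) -> int:
        """Shared behaviour that works for every construction method."""
        rows = self.triples()
        return len({s for s, _, _ in rows} | {o for _, _, o in rows})


class FromQuery(SubgraphSource):
    """Stands in for a query-backed subgraph (e.g., the SPARQL case)."""

    def __init__(self, graph: Iterable[Triple], match: Callable[[Triple], bool]) -> None:
        self._rows = [t for t in graph if match(t)]

    def triples(self) -> List[Triple]:
        return self._rows


class FromLabels(SubgraphSource):
    """Stands in for an ML-backed subgraph: nodes already labelled by a model."""

    def __init__(self, graph: Iterable[Triple], labelled_nodes: Iterable[str]) -> None:
        keep = set(labelled_nodes)
        self._rows = [t for t in graph if t[0] in keep and t[2] in keep]

    def triples(self) -> List[Triple]:
        return self._rows


GRAPH = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:a", "ex:name", "Alice"),
]
sources = [
    FromQuery(GRAPH, lambda t: t[1] == "ex:knows"),
    FromLabels(GRAPH, ["ex:a", "ex:b"]),
]
counts = [src.node_count() for src in sources]
```

Callers only depend on the parent interface, so adding a SHACL-backed variant would not change any code that consumes `triples()`.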

And (speaking as a person with formal math background who loves functional programming) we should not let the Linear Algebra camp dictate definitions ;) Decades of that got us into the current mess! I would much rather follow the brilliant lessons from projects such as algebird

I support this approach, but then we should drop the Matrix in favor of a generic Subgraph that would be an alias for KnowledgeGraph, so as to have a "recursive" representation.

My main question was about having a clear entrypoint to all these functionalities via an instantiation of a generic class (like DataFrame in pandas; for example GFrame, SubGraphFrame, or anything that keeps the semantic relevance), so as to have:

# kglab.<Frame>

class <Frame>:
    # attributes are underscored so they do not shadow the accessor methods below
    _graph: KnowledgeGraph = ...
    _subgraph: KnowledgeGraph = ...    # or a more relevant alias like `SubGraph`
    _subgraphmtx: SubgraphMatrix = ...

    def __init__(...):
        # creates the knowledge graph
        self._graph = KnowledgeGraph(...)
        ...

    def graph(self) -> KnowledgeGraph:
        return self._graph

    def subgraph(self) -> KnowledgeGraph:
        return self._subgraph
    ...

    def _get_subg_linear(self, query, ...) -> SubgraphMatrix:
        # dispatch on the kind of query to pick the matching matrix class
        matrix = None
        if is_sparql(query):
            matrix = SPARQLmatrix(...)  # or `subg.from_SPARQL`
            self._subgraphmtx = matrix
        elif is_shacl(query):
            ...
        return matrix

This will allow us to keep the current workflow using KnowledgeGraph but also provide a more consistent experience for users who want to work with subgraphs without caring too much about the reference graph. It gives a more "operational" entrypoint based on the access patterns of other well-established Python libraries.

Anyway, if subg.py has relevance as you pointed out (taking from the W3C definition), it would be better to rename it to subgraph.py

EDITED

@ceteri
Collaborator

ceteri commented Sep 19, 2022

Excellent plan!

Having a Frame class works well. Connotations of the word "Frame" (more general than "DataFrame" which is a table) fit well here.

And I really agree with what you pointed out about the name "matrix": it does lead to confusion when people don't have exposure to algebraic graph theory.

How about if we used naming conventions similar to NumPy?

  • instead of "vector" => "1D"
  • instead of "matrix" => "2D"
  • instead of "tensor" => "ND"

@tomaarsen
Collaborator

tomaarsen commented Sep 19, 2022

Both naming conventions are clear, but vector/matrix/tensor are generally understood concepts within all of computer science, while "ND" without any context may be somewhat confusing. That said, the latter is shorter to write... haha

@mbesta

mbesta commented Sep 21, 2022 via email


4 participants