Binary/array-based storage for network data structure #7154

dschult · 2023-12-12T15:45:34Z

This is an issue to try to keep track of ideas in discussions we've had about a data structure for network analysis that could be read directly by other platforms -- similar to what numpy arrays or pandas dataframes make available.

A good place to start here is that a graph consists of two arrays: nodes along with node-data, and edges (specified by a pair of nodes, with perhaps an edge ID column as well (for multigraph and unique within a node pair -- or some libraries require it to be unique across the entire graph which makes sense too for some uses). One could also provide a table with data about the graph itself. Thus 3 tables provide a universal way to store the network data: one with a column "node" and with data columns. Another table with two columns "source node" and "target node" and then data columns (and possibly an "edge_id" column). And a third with data columns describing the graph as a whole -- I'm not sure what the rows of this table would be, so maybe just a key-value storage system -- or a vector of attribute values. This system stores all the data. But it doesn't make it accessible in a way that many/most graph algorithms need it to be accessible. That is, it provides "edgelist" functionality but not "adjacency" functionality. You can easily process all the edges. But you can't easily find the neighbors of a node.

One reasonable approach to providing the adjacency info is to remove data from that data structure. You could store the previously mentioned tables along with an adjacency structure. The adjacency structure can provide a pointer to the row of the edge-data table that corresponds to that edge (or to an array of rows in the case of muligraphs). But the adjacency is not typically a rectangular array-like structure. The number of neighbors is not the same for every node. And storing it for memory efficiency (say, as a CSR format) makes it difficult to change. So if you add a new edge, the entire structure must be rewritten. Thus people often use linked-lists (list of lists), or hash-of-hash (dict-of-dict) structures. Those allow fast manipulation of the network, but typically don't allow the nice strided-memory (is that even the right phrase?) access that an array would provide.

If we want to have a adjacency data structure that isn't mutable, probably a compressed-storage CSR-based set of arrays is the most well-recognized structure for encoding adjacency. Three 1d arrays provide both node and edge pointers: the first array provides a "compressed" list of pointers to the position in the 2nd and 3rd arrays which starts describing each node. This array has the length of the number of nodes plus 1 (the plus 1 is to store a pointer to the end of the last node's neighbors). Thus each consecutive pair of entries in this array provide bounds for an array of neighbors (in the 2nd array) or edges (in the 3rd array). The 2nd array then stores the neighbors via node-ids (or pointers to nodes in the node data table?). These are ordered by node so the first nodes neighbors are all listed and then the second nodes neighbors, etc. Using the first array, we can pick out just the neighbors of any specific node. The 3rd array is similar to the 2nd, only it holds the edge_id or pointer to an edge or edges in the edge data table. I believe we could reduce the adjacency structure to just two arrays by removing the 2nd pointer to neighbor nodes -- that info is available from the edge-data table.

Another version of the data-structure could ensure that the edge-table is sorted by "source" node. Then all the neighbors of each node can be found in consecutive rows of that table. Then the adjacency structure would be a single array of length (n_nodes + 1) holding the pointers in compressed storage.

These array-based data structures are good for static networks that don't change nodes or edges during an algorithm. The more adjustable linked-list or has-based data structures are better when the network can change during the algorithm.

Would defining a compressed storage data-structure for static network data be helpful across the OSS network analysis ecosystem? I suspect the answer is yes. Thoughts? Has it already been done?

eriknw · 2023-12-13T19:59:16Z

Thanks @dschult, it would be great to see progress on this front. I would find examples (and more organization) helpful.

CSR (and CSC) can be "not great" at times if there are many empty rows (or columns) in the adjacency matrix. DCSR (or DCSC) could be better.

Do you envision a single format or a family of formats? For example, COO or CSR or DCSC etc? Combo formats are also "easy" and theoretically nice: CSR+COO (like sorted COO with extra pointers array), DCSR+COO, etc.

Perhaps of interest is the "binsparse" effort to formalize storing sparse arrays (and tensors) in a binary format: https://github.com/GraphBLAS/binsparse-specification/

dschult · 2023-12-15T17:42:15Z

To avoid the whole DCSR vs CSR, it'd be good to have the adjacency only describe nodes that are connected (and for directed graphs, only the nodes with successors in the successor adjacency, and only nodes with preds in the pred adj if it is decided to store preds). Then we only have CS storage.

The data for isolated, leaf(no succ) and root(no pred) nodes can be stored separately from the other nodes, either as a separate node-table, or just sorted within the node-table along with a number of nodes that have a connection.

Other thoughts on network storage:

the simplistic setting I describe in the OP above is really best for directed graphs where only successors are needed (the most basic case). More likely we would want to have undirected graphs or both successors and predecessors.
For either undirected edges or storing both succ and pred info, we'd want to separate the edge table from the adjacency structure. That's because we want to look at the same edge from either direction. There are two entries in the adjacency info that point to the same edge data. Whether you store those adjacencies as two objects (like NX does with directed graphs) or as a single object with two entries to the edge (like NX does with undirected graphs), you still want two lookups to reach the same edge data.

proposal for info/tables needed:

Scalars -- the number of edges, number of nodes in each category:
- n_edges, n_iso_nodes, n_leaf_nodes, n_root_nodes, n_nodes
Data Tables node-table and edge-table
- node-table has n_nodes unsorted rows, with the data columns.
- edge-table has n_edges unsorted rows. Data columns and columns for source_id and target_id.

Undirected network:

CS storage for adjacency with two 1d-arrays:
- 1st array has length n_nodes - n_iso_nodes + 1. Values are each node's starting position in the 2nd array.
- 2nd array has length 2*n_edges, is sorted by source node_id, with value being the edge_id

Directed network:

2 pairs of CS storage for adjacency. One for succ and one for pred.
- 1st succ array has length n_nodes -n_iso_nodes - n_leaf_nodes + 1, value is starting pos in 2nd array.
- 2nd succ array has length 2*n_edges, is sorted by source node_id, with value being the edge_id
- 1st pred array same but for preds.
- 2nd pred array same but for preds.

And of course you could do a directed network with only successor info to save space if you didn't need preds.

eriknw · 2023-12-15T19:42:48Z

To avoid the whole DCSR vs CSR, it'd be good to have the adjacency only describe nodes that are connected (and for directed graphs, only the nodes with successors in the successor adjacency, and only nodes with preds in the pred adj if it is decided to store preds). Then we only have CS storage.

I don't like this idea. It adds another layer of indirection and (imho) overcomplicates things (for example, otherwise simple queries involving CSR are now more complicated). I would much rather simply have the option of having COO, CSR, CSC, DCSR, DCSC, their equivalents for multigraphs, or some combination of these. I think this has the benefit of more easily being "read directly by other platforms", which I believe is a primary goal of this effort.

If you want to ditch DCSR, then I would also want to ditch CSR and only keep COO :-P . COO really is the simplest to start with, the most universal, and is more-or-less memory-efficient. The goal is sharing graph data, right? It's not to query or manipulate graph data is it? It's pretty easy (and fast) to convert COO to any of the other formats I listed above--especially if the COO is known to be sorted in some way (perhaps sortedness--if known--ought to be included).

For undirected, some platforms prefer to store only the upper triangular or lower triangular of the adjacency matrix.

Should we focus more on purpose/goals/scope?

dschult · 2023-12-17T21:25:07Z

If the goal is to share data, then COO is definitely the way to go. A node table and an edge-list table (with source and target nodes as the lookup columns). Probably also a graph meta-data section with traits like multiedges, directed, self-loops.

I thought you were aiming for a way to access the data directly from the libraries rather than having to read it in. If we make them read it in, then they can set up the adjacency data-structure they need as the read it in.

rossbar added the Discussion Issues for discussion and decision making label Dec 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Binary/array-based storage for network data structure #7154

Binary/array-based storage for network data structure #7154

dschult commented Dec 12, 2023

eriknw commented Dec 13, 2023

dschult commented Dec 15, 2023

eriknw commented Dec 15, 2023

dschult commented Dec 17, 2023

Binary/array-based storage for network data structure #7154

Binary/array-based storage for network data structure #7154

Comments

dschult commented Dec 12, 2023

eriknw commented Dec 13, 2023

dschult commented Dec 15, 2023

eriknw commented Dec 15, 2023

dschult commented Dec 17, 2023