-
-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binary/array-based storage for network data structure #7154
Comments
Thanks @dschult, it would be great to see progress on this front. I would find examples (and more organization) helpful. CSR (and CSC) can be "not great" at times if there are many empty rows (or columns) in the adjacency matrix. DCSR (or DCSC) could be better. Do you envision a single format or a family of formats? For example, COO or CSR or DCSC etc? Combo formats are also "easy" and theoretically nice: CSR+COO (like sorted COO with extra pointers array), DCSR+COO, etc. Perhaps of interest is the "binsparse" effort to formalize storing sparse arrays (and tensors) in a binary format: https://github.com/GraphBLAS/binsparse-specification/ |
To avoid the whole DCSR vs CSR, it'd be good to have the adjacency only describe nodes that are connected (and for directed graphs, only the nodes with successors in the successor adjacency, and only nodes with preds in the pred adj if it is decided to store preds). Then we only have CS storage. The data for isolated, leaf(no succ) and root(no pred) nodes can be stored separately from the other nodes, either as a separate node-table, or just sorted within the node-table along with a number of nodes that have a connection. Other thoughts on network storage:
proposal for info/tables needed:
Undirected network:
Directed network:
And of course you could do a directed network with only successor info to save space if you didn't need preds. |
I don't like this idea. It adds another layer of indirection and (imho) overcomplicates things (for example, otherwise simple queries involving CSR are now more complicated). I would much rather simply have the option of having COO, CSR, CSC, DCSR, DCSC, their equivalents for multigraphs, or some combination of these. I think this has the benefit of more easily being "read directly by other platforms", which I believe is a primary goal of this effort. If you want to ditch DCSR, then I would also want to ditch CSR and only keep COO :-P . COO really is the simplest to start with, the most universal, and is more-or-less memory-efficient. The goal is sharing graph data, right? It's not to query or manipulate graph data is it? It's pretty easy (and fast) to convert COO to any of the other formats I listed above--especially if the COO is known to be sorted in some way (perhaps sortedness--if known--ought to be included). For undirected, some platforms prefer to store only the upper triangular or lower triangular of the adjacency matrix. Should we focus more on purpose/goals/scope? |
If the goal is to share data, then COO is definitely the way to go. A node table and an edge-list table (with source and target nodes as the lookup columns). Probably also a graph meta-data section with traits like multiedges, directed, self-loops. I thought you were aiming for a way to access the data directly from the libraries rather than having to read it in. If we make them read it in, then they can set up the adjacency data-structure they need as the read it in. |
This is an issue to try to keep track of ideas in discussions we've had about a data structure for network analysis that could be read directly by other platforms -- similar to what numpy arrays or pandas dataframes make available.
A good place to start here is that a graph consists of two arrays: nodes along with node-data, and edges (specified by a pair of nodes, with perhaps an edge ID column as well (for multigraph and unique within a node pair -- or some libraries require it to be unique across the entire graph which makes sense too for some uses). One could also provide a table with data about the graph itself. Thus 3 tables provide a universal way to store the network data: one with a column "node" and with data columns. Another table with two columns "source node" and "target node" and then data columns (and possibly an "edge_id" column). And a third with data columns describing the graph as a whole -- I'm not sure what the rows of this table would be, so maybe just a key-value storage system -- or a vector of attribute values. This system stores all the data. But it doesn't make it accessible in a way that many/most graph algorithms need it to be accessible. That is, it provides "edgelist" functionality but not "adjacency" functionality. You can easily process all the edges. But you can't easily find the neighbors of a node.
One reasonable approach to providing the adjacency info is to remove data from that data structure. You could store the previously mentioned tables along with an adjacency structure. The adjacency structure can provide a pointer to the row of the edge-data table that corresponds to that edge (or to an array of rows in the case of muligraphs). But the adjacency is not typically a rectangular array-like structure. The number of neighbors is not the same for every node. And storing it for memory efficiency (say, as a CSR format) makes it difficult to change. So if you add a new edge, the entire structure must be rewritten. Thus people often use linked-lists (list of lists), or hash-of-hash (dict-of-dict) structures. Those allow fast manipulation of the network, but typically don't allow the nice strided-memory (is that even the right phrase?) access that an array would provide.
If we want to have a adjacency data structure that isn't mutable, probably a compressed-storage CSR-based set of arrays is the most well-recognized structure for encoding adjacency. Three 1d arrays provide both node and edge pointers: the first array provides a "compressed" list of pointers to the position in the 2nd and 3rd arrays which starts describing each node. This array has the length of the number of nodes plus 1 (the plus 1 is to store a pointer to the end of the last node's neighbors). Thus each consecutive pair of entries in this array provide bounds for an array of neighbors (in the 2nd array) or edges (in the 3rd array). The 2nd array then stores the neighbors via node-ids (or pointers to nodes in the node data table?). These are ordered by node so the first nodes neighbors are all listed and then the second nodes neighbors, etc. Using the first array, we can pick out just the neighbors of any specific node. The 3rd array is similar to the 2nd, only it holds the edge_id or pointer to an edge or edges in the edge data table. I believe we could reduce the adjacency structure to just two arrays by removing the 2nd pointer to neighbor nodes -- that info is available from the edge-data table.
Another version of the data-structure could ensure that the edge-table is sorted by "source" node. Then all the neighbors of each node can be found in consecutive rows of that table. Then the adjacency structure would be a single array of length (n_nodes + 1) holding the pointers in compressed storage.
These array-based data structures are good for static networks that don't change nodes or edges during an algorithm. The more adjustable linked-list or has-based data structures are better when the network can change during the algorithm.
Would defining a compressed storage data-structure for static network data be helpful across the OSS network analysis ecosystem? I suspect the answer is yes. Thoughts? Has it already been done?
The text was updated successfully, but these errors were encountered: