Closeness centrality for a huge graph #28

Open
Ishitori opened this issue Dec 1, 2021 · 1 comment

Comments

@Ishitori

Ishitori commented Dec 1, 2021

Hi there,

I am using the 2.12:66565565-SNAPSHOT version of SparklingGraph, which is compatible with Scala 2.12.

I have a single CSV file of nodes and ~265M edges (4.5 GB), and I am trying to load it into SparklingGraph to calculate closeness centrality. The data is already in numeric format. I ran into several things I don't understand, and I would like to know what I am doing wrong:

  1. I had to declare the graph data type as [Integer, Double] to run closeness centrality.

To begin with, I have the following code, and I experiment with loading a small graph (4 edges):

// Imports assumed for a spark-shell session: Spark SQL schema types plus the
// SparklingGraph CSV loader API (package paths as in the SparklingGraph docs).
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import ml.sparkling.graph.api.loaders.GraphLoading.LoadGraph
import ml.sparkling.graph.loaders.csv.GraphFromCsv.CSV
import ml.sparkling.graph.loaders.csv.GraphFromCsv.LoaderParameters.{NoHeader, Schema}

val filePath = "s3_path"
val schema = StructType(
    StructField("vertex1", IntegerType, false) ::
    StructField("vertex2", IntegerType, false) :: Nil)
val graph = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()
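
For reference, the toy input is just a headerless, comma-separated edge list; a hypothetical 4-edge file matching the schema would look like:

1,2
2,3
3,4
4,1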

Once this executes, I want to make sure the data is loaded by counting vertices with graph.vertices.count, and that seems to work.
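
A quick sanity check on the toy graph (the expected edge count assumes the hypothetical 4-edge file above):

graph.vertices.count   // number of distinct vertex ids in the file
graph.edges.count      // 4 for the toy file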

Once I call graph.closenessCentrality(VertexMeasureConfiguration(treatAsUndirected=true)), I get:

<console>:58: error: value closenessCentrality is not a member of org.apache.spark.graphx.Graph[Nothing,Nothing]

I figured out that I need to specify the graph type, so I changed the line to:

val graph : Graph[Integer, Integer] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

That fixed the first error, but the message changed to:

<console>:61: error: could not find implicit value for parameter num: Numeric[Integer]

The only configuration that allowed me to run closeness centrality was using Double as the edge type:

val schema = StructType(
    StructField("vertex1", IntegerType, false) ::
    StructField("vertex2", IntegerType, false) :: Nil)
    
val graph : Graph[Integer, Double] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

But this seems odd: both vertex columns are integers, so why should I have to convert the graph to Graph[Integer, Double]?
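
My best guess from the error is that closenessCentrality requires an implicit Numeric instance for the edge type, and Scala provides one for scala.Int and scala.Double but not for the boxed java.lang.Integer (which is what an unqualified Integer means in Scala):

// REPL check of which Numeric instances exist
implicitly[Numeric[Int]]      // compiles
implicitly[Numeric[Double]]   // compiles
// implicitly[Numeric[java.lang.Integer]]
//   error: could not find implicit value for parameter e: Numeric[Integer]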

Once I start calculating closeness centrality on the full graph, my Spark job fails with "maximum waiting time is reached".

My questions are:

  1. Is Sparkling-Graph supposed to handle graphs of this size?
  2. If yes, how big should the Spark cluster be to compute closeness centrality (number of executor cores and amount of executor memory)?
  3. Any hints on how I can make it work at all, or work faster?
@riomus
Member

riomus commented Dec 16, 2021

Closeness centrality is quite a complex algorithm in terms of the amount of computation and the memory it requires. Double needs to be used for edges because of the way closeness is currently implemented. As for cluster size, it all depends on the structure of the graph; please try starting with a small sample of the graph (the first n rows of the file) and increasing from there. Do you see on the Spark UI that the job is running?
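
A minimal sketch of that "start small" approach, assuming the input is the headerless two-column CSV from above (the sample path and n are hypothetical; the imports from the first snippet plus org.apache.spark.graphx.Graph are assumed):

// Materialize n edges of the big file as a smaller CSV, then load it with
// the same schema and try closeness centrality on that before scaling up.
val n = 100000
spark.read.schema(schema).csv("s3_path").limit(n).write.csv("s3_path_sample")

val smallGraph: Graph[Integer, Double] =
  LoadGraph.from(CSV("s3_path_sample")).using(Schema(schema)).using(NoHeader).load()
smallGraph.closenessCentrality(VertexMeasureConfiguration(treatAsUndirected = true))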
