Closeness centrality for a huge graph #28

Open
Ishitori opened this issue Dec 1, 2021 · 1 comment

Comments

@Ishitori

Ishitori commented Dec 1, 2021

Hi there,

I am using the 2.12:66565565-SNAPSHOT version of SparklingGraph, which is compatible with Scala 2.12.

I have a single CSV file of nodes and ~265M edges (4.5 GB), and I am trying to load it into SparklingGraph to calculate closeness centrality. The data is already in numeric format. I ran into several things I don't understand, and I would like to know what I am doing wrong:

  1. I had to declare the graph data type as [Integer, Double] to run closeness centrality.

To begin with, I have the following code, and I experiment with loading a small graph (4 edges):

// Imports assumed for a spark-shell session: Spark SQL schema types plus the
// SparklingGraph CSV loader API (package paths as in the SparklingGraph docs).
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import ml.sparkling.graph.api.loaders.GraphLoading.LoadGraph
import ml.sparkling.graph.loaders.csv.GraphFromCsv.CSV
import ml.sparkling.graph.loaders.csv.GraphFromCsv.LoaderParameters.{NoHeader, Schema}

val filePath = "s3_path"
val schema = StructType(
    StructField("vertex1", IntegerType, false) ::
    StructField("vertex2", IntegerType, false) :: Nil)
val graph = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()
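
For reference, the toy input is just a headerless, comma-separated edge list; a hypothetical 4-edge file matching the schema would look like:

1,2
2,3
3,4
4,1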

Once this executes, I want to make sure the data is loaded by counting vertices with graph.vertices.count, and that seems to work.
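
A quick sanity check on the toy graph (the expected edge count assumes the hypothetical 4-edge file above):

graph.vertices.count   // number of distinct vertex ids in the file
graph.edges.count      // 4 for the toy file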

Once I call graph.closenessCentrality(VertexMeasureConfiguration(treatAsUndirected=true)), I get:

<console>:58: error: value closenessCentrality is not a member of org.apache.spark.graphx.Graph[Nothing,Nothing]

I figured out that I need to specify the graph type, so I changed the line to:

val graph : Graph[Integer, Integer] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

That fixed the first error, but the message changed to:

<console>:61: error: could not find implicit value for parameter num: Numeric[Integer]

The only configuration that allowed me to run closeness centrality was using Double as the edge type:

val schema = StructType(
    StructField("vertex1", IntegerType, false) ::
    StructField("vertex2", IntegerType, false) :: Nil)
    
val graph : Graph[Integer, Double] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

But this seems odd: both vertex columns are integers, so why should I have to convert the graph to Graph[Integer, Double]?
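
My best guess from the error is that closenessCentrality requires an implicit Numeric instance for the edge type, and Scala provides one for scala.Int and scala.Double but not for the boxed java.lang.Integer (which is what an unqualified Integer means in Scala):

// REPL check of which Numeric instances exist
implicitly[Numeric[Int]]      // compiles
implicitly[Numeric[Double]]   // compiles
// implicitly[Numeric[java.lang.Integer]]
//   error: could not find implicit value for parameter e: Numeric[Integer]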

Once I start calculating closeness centrality on the full graph, my Spark job fails with "maximum waiting time is reached".

My questions are:

  1. Is Sparkling-Graph supposed to handle graphs of this size?
  2. If yes, how big should the Spark cluster be to compute closeness centrality (number of executor cores and amount of executor memory)?
  3. Any hints on how I can make it work at all, or work faster?
@riomus
Member

riomus commented Dec 16, 2021

Closeness centrality is quite a complex algorithm in terms of the amount of computation and the memory it requires. Double needs to be used for edges because of the way closeness is currently implemented. As for cluster size, it all depends on the structure of the graph; please try starting with a small sample of the graph (the first n rows of the file) and increasing from there. Do you see on the Spark UI that the job is running?
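
A minimal sketch of that "start small" approach, assuming the input is the headerless two-column CSV from above (the sample path and n are hypothetical; the imports from the first snippet plus org.apache.spark.graphx.Graph are assumed):

// Materialize n edges of the big file as a smaller CSV, then load it with
// the same schema and try closeness centrality on that before scaling up.
val n = 100000
spark.read.schema(schema).csv("s3_path").limit(n).write.csv("s3_path_sample")

val smallGraph: Graph[Integer, Double] =
  LoadGraph.from(CSV("s3_path_sample")).using(Schema(schema)).using(NoHeader).load()
smallGraph.closenessCentrality(VertexMeasureConfiguration(treatAsUndirected = true))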
