I am using version 2.12:66565565-SNAPSHOT of Sparkling-Graph, which is compatible with Scala 2.12.
I have a single CSV file with the nodes and ~265M edges (4.5 GB), and I am trying to load it into Sparkling-Graph to calculate closeness centrality. The data is already in numeric format. I ran into several things I don't understand, and I would like to know what I am doing wrong:
Had to provide graph data type [Integer, Double] to run Closeness Centrality.
To start with, I have the following code and experiment with loading a small graph (4 edges):
val filePath = "s3_path"
val schema = StructType(
  StructField("vertex1", IntegerType, false) ::
  StructField("vertex2", IntegerType, false) :: Nil)
val graph = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()
Once this executes, I check that the data is loaded by counting the vertices with graph.vertices.count, and that seems to work.
Once I call graph.closenessCentrality(VertexMeasureConfiguration(treatAsUndirected=true)) I get:
<console>:58: error: value closenessCentrality is not a member of org.apache.spark.graphx.Graph[Nothing,Nothing]
I figured out that I need to specify the graph type, and changed the line to:
val graph : Graph[Integer, Integer] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()
That fixed the first error, but the message changed to:
<console>:61: error: could not find implicit value for parameter num: Numeric[Integer]
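Digging into this error a bit: it appears to come from the Scala `Numeric` typeclass. `scala.Int` and `Double` have `Numeric` instances in implicit scope, but the boxed `java.lang.Integer` (which is what `Integer` resolves to) does not. A minimal sketch in plain Scala (no Spark needed) illustrating the difference:

```scala
object NumericDemo extends App {
  // scala.Int and Double both have Numeric instances in implicit scope...
  println(implicitly[Numeric[Int]].toDouble(3))       // prints 3.0
  println(implicitly[Numeric[Double]].plus(1.5, 2.5)) // prints 4.0

  // ...but the boxed java.lang.Integer does not, so this line would not compile:
  // implicitly[Numeric[java.lang.Integer]]
  // => "could not find implicit value for parameter num: Numeric[Integer]"
}
```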
The only configuration that allowed me to run closeness centrality was using Double as the edge type:
val schema = StructType(
  StructField("vertex1", IntegerType, false) ::
  StructField("vertex2", IntegerType, false) :: Nil)
val graph: Graph[Integer, Double] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()
But this seems odd, because both vertex columns are integers, so why should I have to convert the graph to Graph[Integer, Double]?
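One plausible reason (my own assumption, not something confirmed in the library's docs): the closeness value 1 / sum-of-distances is fractional even on an unweighted integer graph, so the computation has to happen in Double regardless of the edge type. A tiny hand-rolled sketch in plain Scala (no Spark, a hypothetical 4-node path graph) showing integer inputs producing Double closeness scores:

```scala
object ClosenessSketch {
  // Undirected path graph 1 - 2 - 3 - 4: integer vertex ids, unweighted edges.
  val edges = Seq((1, 2), (2, 3), (3, 4))
  val adj: Map[Int, Seq[Int]] =
    (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }

  // BFS hop distances from `src` to every reachable vertex.
  def distances(src: Int): Map[Int, Int] = {
    @annotation.tailrec
    def loop(frontier: Set[Int], dist: Map[Int, Int], d: Int): Map[Int, Int] = {
      val next = frontier.flatMap(adj.getOrElse(_, Nil)).diff(dist.keySet)
      if (next.isEmpty) dist else loop(next, dist ++ next.map(_ -> d), d + 1)
    }
    loop(Set(src), Map(src -> 0), 1)
  }

  // Closeness = 1 / sum of distances: a Double even though every input is an Int.
  def closeness(v: Int): Double = 1.0 / distances(v).values.sum

  def main(args: Array[String]): Unit =
    adj.keys.toSeq.sorted.foreach(v => println(s"$v -> ${closeness(v)}"))
}
```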
Once I start calculating closeness centrality on the full graph, my Spark job fails with "maximum waiting time is reached".
My questions are:
Is Sparkling-Graph supposed to deal with graphs of that size?
If yes, how big should the Spark cluster be to compute closeness centrality (number of executor cores and executor memory)?
Any hints on how I can make it work at all / work faster?
Closeness centrality is quite a complex algorithm in terms of the amount of computation and memory it requires. Double needs to be used for edges because of the way closeness is currently implemented. As for cluster size, it all depends on the structure of the graph; please try starting with a small sample of the graph (the first n rows of the file) and increasing from there. Do you see in the Spark UI that the job is running?
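To act on the sampling suggestion, one simple way to take the first n edge rows of a headerless CSV before pointing the loader at it (filenames here are hypothetical; `edges.csv` stands in for the full 4.5 GB file):

```shell
# Stand-in for the full edge file, just so this snippet is self-contained:
printf '1,2\n2,3\n3,4\n4,1\n5,1\n' > edges.csv

# Keep only the first N rows; increase N gradually to find what the cluster handles.
head -n 3 edges.csv > edges_sample.csv
wc -l < edges_sample.csv   # 3
```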