PSCAN cannot find communities #21

Open

dawnranger opened this issue Aug 21, 2018 · 4 comments
dawnranger commented Aug 21, 2018

Hello, I am trying to find communities using the PSCAN algorithm of sparkling. Referring to the PSCAN docs, I wrote the following code:

val conf = new SparkConf().setAppName("pscan-test").setMaster("local")
implicit val ctx:SparkContext = new SparkContext(conf)

val filePath = "path_to_edgelist_file"
val graph:Graph[String, Int] = LoadGraph.from(CSV(filePath))
        .using(NoHeader).using(Delimiter(","))
        .load[String, Int]()
val components:Graph[ComponentID, Int] = graph.PSCAN(epsilon = 0.1)
println("num communities: " + components.vertices.map{case (vId,cId)=>cId}.distinct.count)
components.vertices.take(10).foreach(println)

The docs say:

val components: Graph[ComponentID, Int] = graph.PSCAN(epsilon=0.5)
// Graph where each vertex is associated with its component identifier

But when I run the above code, I find that, no matter how I tune the value of epsilon, the number of communities is always equal to the number of vertices, and the component identifier of every vertex is always the same as its vertex id.

I'm wondering whether I misunderstand the docs or there is something wrong with PSCAN in sparkling. Can anybody offer some help? Thanks in advance.


Here is my edge list file (karate club graph):

0,1
0,2
0,3
0,4
0,5
0,6
0,7
0,8
0,10
0,11
0,12
0,13
0,17
0,19
0,21
0,31
1,2
1,3
1,7
1,13
1,17
1,19
1,21
1,30
2,3
2,7
2,8
2,9
2,13
2,27
2,28
2,32
3,7
3,12
3,13
4,6
4,10
5,6
5,10
5,16
6,16
8,30
8,32
8,33
9,33
13,33
14,32
14,33
15,32
15,33
18,32
18,33
19,33
20,32
20,33
22,32
22,33
23,25
23,27
23,29
23,32
23,33
24,25
24,27
24,31
25,31
26,29
26,33
27,33
28,31
28,33
29,32
29,33
30,32
30,33
31,32
31,33
32,33
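
A quick way to narrow this down might be to inspect what the loader actually produced before calling PSCAN. A minimal sketch using only standard GraphX calls, assuming the graph value from the snippet above (the karate club edge list here has 34 vertices and 78 edges):

// Sanity-check the loaded graph before running PSCAN
println("num vertices: " + graph.vertices.count)
println("num edges: " + graph.edges.count)
// Look at a few triplets to see which vertex ids and properties the loader assigned
graph.triplets.take(5).foreach(t => println(s"${t.srcId} (${t.srcAttr}) -> ${t.dstId} (${t.dstAttr})"))

If the vertex or edge counts differ from the edge list, the problem is in loading rather than in PSCAN itself.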

dawnranger (Author)

It works after I changed the code to:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader}

val conf = new SparkConf().setAppName("pscan-test").setMaster("local")
val sc = new SparkContext(conf)

val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, "path_to_edgelist_file")
val components: Graph[ComponentID, Int] = graph.PSCAN(epsilon = 0.1)

// print each community id together with the vertices it contains
components.vertices.map { case (vId, cId) => (cId, vId) }.groupByKey()
  .collect().foreach { case (cId, vIds) => println(s"$cId: ${vIds.mkString(" ")}") }

I guess there is something wrong with the graph loading API of sparkling.
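
When comparing the two loading paths, it may also help to summarize community sizes rather than printing full membership lists. A minimal sketch, assuming the components graph from the snippet above:

// Count how many vertices fall into each community and show the largest ones
val sizes = components.vertices.map { case (_, cId) => (cId, 1L) }.reduceByKey(_ + _)
println("num communities: " + sizes.count)
sizes.sortBy(-_._2).take(10).foreach { case (cId, n) => println(s"community $cId: $n vertices") }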

riomus (Member) commented Aug 22, 2018

Hi,

I will have a look at that. Please be aware that you are using String as the vertex property in the first case. By default, sparkling does not index vertices; in combination with strings as vertex properties, that can cause issues.
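
If the String vertex property is indeed the culprit, one possible workaround (a minimal sketch, not the library's intended fix, assuming the loader's vertex ids and edges are otherwise correct) is to replace the String attribute with a numeric one before calling PSCAN:

// Workaround sketch: drop the String vertex property before running PSCAN
val intGraph: Graph[Int, Int] = graph.mapVertices((_, _) => 0)
val components: Graph[ComponentID, Int] = intGraph.PSCAN(epsilon = 0.1)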

dawnranger (Author) commented Aug 22, 2018

It fails at runtime with a ClassCastException if I change the vertex property to Long or Int. This is the code:

type VD = Int
type ED = Double

val graph:Graph[VD, ED] = LoadGraph.from(CSV(filePath))
    .using(NoHeader).using(Delimiter(" "))
    .load[VD, ED]()

println("num nodes: " + graph.vertices.count)  // error

And this is the ERROR:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92)
at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap.setMerge(GraphXPrimitiveKeyOpenHashMap.scala:87)
at org.apache.spark.graphx.impl.ShippableVertexPartition$$anonfun$apply$5.apply(ShippableVertexPartition.scala:60)
at org.apache.spark.graphx.impl.ShippableVertexPartition$$anonfun$apply$5.apply(ShippableVertexPartition.scala:59)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.graphx.impl.ShippableVertexPartition$.apply(ShippableVertexPartition.scala:59)
at org.apache.spark.graphx.VertexRDD$$anonfun$2.apply(VertexRDD.scala:326)
at org.apache.spark.graphx.VertexRDD$$anonfun$2.apply(VertexRDD.scala:323)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1092)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
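
The trace suggests the CSV loader keeps the raw column value (a String) as the vertex attribute regardless of the requested type parameter, so the mismatch only surfaces when GraphX unboxes the attribute. A hypothetical illustration of that failure mode in plain Scala (not the loader's actual code):

// An Any-typed slot that actually holds a String only fails when it is
// finally unboxed to Int, as in BoxesRunTime.unboxToInt above
val attr: Any = "0"                  // what the loader appears to have stored
val asInt = attr.asInstanceOf[Int]   // throws java.lang.ClassCastException at runtime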

riomus (Member) commented Aug 22, 2018

OK, I will have a look at that. There is some inconsistency in the loading API.
