PSCAN cannot find communities #21

Open

dawnranger opened this issue Aug 21, 2018 · 4 comments
dawnranger commented Aug 21, 2018

Hello, I am trying to find communities using the PSCAN algorithm of sparkling. Referring to the PSCAN docs, I wrote the following code:

val conf = new SparkConf().setAppName("pscan-test").setMaster("local")
implicit val ctx:SparkContext = new SparkContext(conf)

val filePath = "path_to_edgelist_file"
val graph:Graph[String, Int] = LoadGraph.from(CSV(filePath))
        .using(NoHeader).using(Delimiter(","))
        .load[String, Int]()
val components:Graph[ComponentID, Int] = graph.PSCAN(epsilon = 0.1)
println("num communities: " + components.vertices.map{case (vId,cId)=>cId}.distinct.count)
components.vertices.take(10).foreach(println)

The docs say:

val components: Graph[ComponentID, Int] = graph.PSCAN(epsilon=0.5)
// Graph where each vertex is associated with its component identifier

But when I run the above code, I find that, no matter how I tune the value of epsilon, the number of communities is always equal to the number of vertices, and the component identifier of every vertex is always the same as its vertex id.

I'm wondering whether I misunderstand the docs or there is something wrong with PSCAN in sparkling. Can anybody offer some help? Thanks in advance.


Here is my edge list file (karate club graph):

0,1
0,2
0,3
0,4
0,5
0,6
0,7
0,8
0,10
0,11
0,12
0,13
0,17
0,19
0,21
0,31
1,2
1,3
1,7
1,13
1,17
1,19
1,21
1,30
2,3
2,7
2,8
2,9
2,13
2,27
2,28
2,32
3,7
3,12
3,13
4,6
4,10
5,6
5,10
5,16
6,16
8,30
8,32
8,33
9,33
13,33
14,32
14,33
15,32
15,33
18,32
18,33
19,33
20,32
20,33
22,32
22,33
23,25
23,27
23,29
23,32
23,33
24,25
24,27
24,31
25,31
26,29
26,33
27,33
28,31
28,33
29,32
29,33
30,32
30,33
31,32
31,33
32,33
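
A quick way to narrow this down might be to inspect what the loader actually produced before calling PSCAN. A minimal sketch using only standard GraphX calls, assuming the graph value from the snippet above (the karate club edge list here has 34 vertices and 78 edges):

// Sanity-check the loaded graph before running PSCAN
println("num vertices: " + graph.vertices.count)
println("num edges: " + graph.edges.count)
// Look at a few triplets to see which vertex ids and properties the loader assigned
graph.triplets.take(5).foreach(t => println(s"${t.srcId} (${t.srcAttr}) -> ${t.dstId} (${t.dstAttr})"))

If the vertex or edge counts differ from the edge list, the problem is in loading rather than in PSCAN itself.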

dawnranger (Author)

It works after I changed the code to:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader}

val conf = new SparkConf().setAppName("pscan-test").setMaster("local")
val sc = new SparkContext(conf)

val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, "path_to_edgelist_file")
val components: Graph[ComponentID, Int] = graph.PSCAN(epsilon = 0.1)

// print each community id together with the vertices it contains
components.vertices.map { case (vId, cId) => (cId, vId) }.groupByKey()
  .collect().foreach { case (cId, vIds) => println(s"$cId: ${vIds.mkString(" ")}") }

I guess there is something wrong with the graph loading API of sparkling.
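
When comparing the two loading paths, it may also help to summarize community sizes rather than printing full membership lists. A minimal sketch, assuming the components graph from the snippet above:

// Count how many vertices fall into each community and show the largest ones
val sizes = components.vertices.map { case (_, cId) => (cId, 1L) }.reduceByKey(_ + _)
println("num communities: " + sizes.count)
sizes.sortBy(-_._2).take(10).foreach { case (cId, n) => println(s"community $cId: $n vertices") }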

riomus (Member) commented Aug 22, 2018

Hi,

I will have a look at that. Please be aware that you are using String as the vertex property in the first case. By default, sparkling does not index vertices; in combination with strings as vertex properties, that can cause issues.
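
If the String vertex property is indeed the culprit, one possible workaround (a minimal sketch, not the library's intended fix, assuming the loader's vertex ids and edges are otherwise correct) is to replace the String attribute with a numeric one before calling PSCAN:

// Workaround sketch: drop the String vertex property before running PSCAN
val intGraph: Graph[Int, Int] = graph.mapVertices((_, _) => 0)
val components: Graph[ComponentID, Int] = intGraph.PSCAN(epsilon = 0.1)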

dawnranger (Author) commented Aug 22, 2018

It fails at runtime with a ClassCastException if I change the vertex property to Long or Int. This is the code:

type VD = Int
type ED = Double

val graph:Graph[VD, ED] = LoadGraph.from(CSV(filePath))
    .using(NoHeader).using(Delimiter(" "))
    .load[VD, ED]()

println("num nodes: " + graph.vertices.count)  // error

And this is the ERROR:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92)
at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap.setMerge(GraphXPrimitiveKeyOpenHashMap.scala:87)
at org.apache.spark.graphx.impl.ShippableVertexPartition$$anonfun$apply$5.apply(ShippableVertexPartition.scala:60)
at org.apache.spark.graphx.impl.ShippableVertexPartition$$anonfun$apply$5.apply(ShippableVertexPartition.scala:59)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.graphx.impl.ShippableVertexPartition$.apply(ShippableVertexPartition.scala:59)
at org.apache.spark.graphx.VertexRDD$$anonfun$2.apply(VertexRDD.scala:326)
at org.apache.spark.graphx.VertexRDD$$anonfun$2.apply(VertexRDD.scala:323)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1092)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
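
The trace suggests the CSV loader keeps the raw column value (a String) as the vertex attribute regardless of the requested type parameter, so the mismatch only surfaces when GraphX unboxes the attribute. A hypothetical illustration of that failure mode in plain Scala (not the loader's actual code):

// An Any-typed slot that actually holds a String only fails when it is
// finally unboxed to Int, as in BoxesRunTime.unboxToInt above
val attr: Any = "0"                  // what the loader appears to have stored
val asInt = attr.asInstanceOf[Int]   // throws java.lang.ClassCastException at runtime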

riomus (Member) commented Aug 22, 2018

OK, I will have a look at that. There is some inconsistency in the loading API.
