Recently the graph of Panama Papers entities was released: https://offshoreleaks.icij.org/pages/database. It would be interesting to explore the data with some of the data mining algorithms included in Spark GraphX. First we need to load the graph. Let's use the edge information only:
import org.apache.spark.graphx._
// TODO: use csv loader https://github.com/databricks/spark-csv
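// Drop the header row (first line of the first partition), then build Edge(srcId, dstId, attr)
// from the first column, the last column and the second column of each line
val edges = sc.textFile("/data/users/ulanov/panama/all_edges.csv").mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }.map { line => val tokens = line.split(","); Edge(tokens.head.toLong, tokens.last.toLong, tokens(1)) }
// The vertex set is every id that appears as an edge endpoint; vertices carry no attributes
val vertices = edges.flatMap(edge => Seq(edge.srcId, edge.dstId)).distinct().map(x => (x, ()))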
val graph = Graph(vertices, edges)
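Splitting on "," and calling toLong directly will throw on any malformed line. Until the CSV loader from the TODO is wired in, a more defensive parser could look like the sketch below. It is plain Scala with no Spark dependency. The names EdgeLineSketch and parseEdgeLine are hypothetical, and I am assuming each line has the shape "sourceId,relationType,destinationId", which is what the loading code above implies.

```scala
import scala.util.Try

object EdgeLineSketch {
  // Hypothetical helper: parse one CSV line into (srcId, relationType, dstId).
  // Returns None for lines with fewer than three fields or non-numeric ids,
  // which also covers a header row whose id columns are not numbers.
  def parseEdgeLine(line: String): Option[(Long, String, Long)] = {
    val tokens = line.split(",", -1).map(_.trim)
    if (tokens.length < 3) None
    else Try((tokens.head.toLong, tokens(1), tokens.last.toLong)).toOption
  }
}
```

With this, the header no longer needs special handling: something like sc.textFile(path).flatMap(l => EdgeLineSketch.parseEdgeLine(l)) keeps only the lines that parse. It still does not handle quoted fields that contain commas, so a real CSV loader remains the better option.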
Run PageRank to find the most important vertices:
import org.apache.spark.graphx.lib._
val pageRank = PageRank.run(graph, 50)
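To get a feeling for the numbers that come out, here is a toy version of the iteration in plain Scala. As far as I understand the GraphX documentation, the static PageRank uses the unnormalized update rank(v) = resetProb + (1 - resetProb) * sum(rank(u) / outDegree(u)) over the in-neighbors u of v, with resetProb = 0.15 by default, so the ranks do not sum to 1 and a heavily linked vertex can score in the thousands. The object PageRankSketch and the initial rank of 1.0 are my own assumptions for illustration; this is not the GraphX implementation.

```scala
object PageRankSketch {
  // Toy, single-machine PageRank over (src, dst) pairs. Illustration only.
  // Update rule: rank(v) = resetProb + (1 - resetProb) * sum of rank(u) / outDegree(u)
  // over all edges u -> v.
  def pageRank(edges: Seq[(Long, Long)], numIter: Int, resetProb: Double = 0.15): Map[Long, Double] = {
    val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (src, out) => src -> out.size }
    // Assumption: every vertex starts with rank 1.0
    var ranks: Map[Long, Double] = vertices.map(v => v -> 1.0).toMap
    for (_ <- 1 to numIter) {
      // Each edge sends rank(src) / outDegree(src) to its destination
      val incoming = edges
        .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2).sum }
      // Vertices without in-edges fall back to resetProb
      ranks = vertices.map(v => v -> (resetProb + (1 - resetProb) * incoming.getOrElse(v, 0.0))).toMap
    }
    ranks
  }
}
```

On a small star graph, PageRankSketch.pageRank(Seq((2L, 1L), (3L, 1L), (4L, 1L)), 10) gives the hub (vertex 1) a clearly higher rank than the three leaves. The same effect, at a much larger scale, is what pushes a service provider with tens of thousands of incoming edges to the top.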
Take the top 10 vertices by rank:
pageRank.vertices.sortBy(x => x._2, false).take(10)
res28: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((236724,8831.740202475814), (288469,903.9413886137522), (264051,667.7063735145971), (285729,562.2529301436048), (237076,499.39065213060286), (237583,405.1697624077441), (279944,380.5466754547696), (271169,269.19387690203007), (236832,255.08207719736498), (237148,224.94910000303614))
The vertices with the highest PageRank are the IDs of companies that provide services to offshore entities. The top one, id 236724, is called "Portcullis TrustNet Chambers", and its PageRank is nearly ten times that of the runner-up. It is connected to about 36K entities: https://offshoreleaks.icij.org/nodes/54662. Stay tuned for more analysis.