Wednesday, May 11, 2016

The Panama papers graph analysis with Spark pt1

Recently the graph of Panama Papers entities was released: https://offshoreleaks.icij.org/pages/database. It would be interesting to explore the data with some of the data mining algorithms included in Spark GraphX. First we need to load the graph. Let's use the edge information only:
import org.apache.spark.graphx._
// TODO: use csv loader https://github.com/databricks/spark-csv
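val lines = sc.textFile("/data/users/ulanov/panama/all_edges.csv")
// Skip the CSV header (first line of the first partition)
val rows = lines.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
// Source id is the first column, relationship type the second, destination id the last
val edges = rows.map { line =>
  val tokens = line.split(",")
  Edge(tokens.head.toLong, tokens.last.toLong, tokens(1))
}
// Vertices carry no attributes, so every id found in an edge maps to Unit
val vertices = edges.flatMap(edge => Seq(edge.srcId, edge.dstId)).distinct().map(x => (x, ()))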
val graph = Graph(vertices, edges)
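The loading code above assumes every row has a numeric id in the first and last columns, so a single malformed row would fail the job with a NumberFormatException. A more defensive parser could look like the sketch below. It is plain Scala without Spark so it can be tried in any REPL; `ParsedEdge` and `parseEdge` are hypothetical names, and the sample rows in the usage note are made up rather than taken from the dataset.

```scala
import scala.util.Try

object EdgeParsingSketch {
  // Hypothetical container with the same three fields as a GraphX Edge[String].
  final case class ParsedEdge(srcId: Long, dstId: Long, relation: String)

  /** Parses a "src,relation,dst" row; returns None when the row does not fit. */
  def parseEdge(line: String): Option[ParsedEdge] = {
    // The -1 limit keeps trailing empty fields so short rows are detected.
    val tokens = line.split(",", -1).map(_.trim)
    if (tokens.length < 3) None
    else Try(ParsedEdge(tokens.head.toLong, tokens.last.toLong, tokens(1))).toOption
  }
}
```

For example, `EdgeParsingSketch.parseEdge("101,intermediary_of,202")` yields a `ParsedEdge`, while a header-like row such as `"node_1,rel_type,node_2"` yields `None`. Like the original one-liner, this does not handle quoted fields that contain commas, which is what the CSV loader mentioned in the TODO would be for.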
Run PageRank to find the most important vertices:
import org.apache.spark.graphx.lib._
val pageRank = PageRank.run(graph, 50)
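To get a feel for what the 50 iterations compute, here is a plain-Scala sketch of an un-normalized PageRank update with a reset probability of 0.15. This is my understanding of the formula behind GraphX's static PageRank, not its actual implementation: details such as the initial rank and the handling of vertices without outgoing edges vary between Spark versions. `PageRankSketch` and its parameter names are made up for illustration.

```scala
object PageRankSketch {
  /** Un-normalized PageRank computed with a fixed number of iterations.
    * edges are directed (src, dst) pairs; resetProb is the teleport probability. */
  def pageRank(edges: Seq[(Long, Long)], numIter: Int, resetProb: Double = 0.15): Map[Long, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree: Map[Long, Int] = edges.groupBy(_._1).map { case (s, es) => (s, es.size) }
    // Assumption: every vertex starts with rank 1.0.
    var ranks: Map[Long, Double] = vertices.map(v => (v, 1.0)).toMap
    for (_ <- 1 to numIter) {
      // Each edge passes rank(src) / outDegree(src) on to its destination.
      val incoming: Map[Long, Double] = edges
        .map { case (s, d) => (d, ranks(s) / outDegree(s)) }
        .groupBy(_._1)
        .map { case (d, contribs) => (d, contribs.map(_._2).sum) }
      ranks = vertices.map { v =>
        (v, resetProb + (1.0 - resetProb) * incoming.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }
}
```

On a toy star graph, `PageRankSketch.pageRank((1L to 10L).map(leaf => (leaf, 0L)), numIter = 50)`, the hub 0 ends up with a rank above 1 while each leaf stays at the reset value. Scores computed this way are not probabilities and do not sum to 1, so a vertex with many incoming edges can reach a very large rank.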
Take the top 10 vertices by rank:
pageRank.vertices.sortBy(x => x._2, false).take(10)
res28: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((236724,8831.740202475814), (288469,903.9413886137522), (264051,667.7063735145971), (285729,562.2529301436048), (237076,499.39065213060286), (237583,405.1697624077441), (279944,380.5466754547696), (271169,269.19387690203007), (236832,255.08207719736498), (237148,224.94910000303614))
The vertices with the highest PageRank are the ids of companies that provide services for offshore entities. The top one, id 236724, is called "Portcullis TrustNet Chambers"; its PageRank of about 8832 is nearly ten times that of the runner-up (about 904). It is connected to about 36K entities: https://offshoreleaks.icij.org/nodes/54662. Stay tuned for more analysis.