Technical notes: The Panama papers graph analysis with Spark pt1

Recently the graph of Panama Papers entities was released: https://offshoreleaks.icij.org/pages/database. It would be interesting to explore the data with some of the data mining algorithms included in Spark GraphX. First we need to load the graph. Let's use the edge information only:
import org.apache.spark.graphx._
// TODO: use csv loader https://github.com/databricks/spark-csv
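// Skip the header row (it sits at the start of the first partition), then build one Edge per line:
// first field is the source id, last field is the destination id, second field is the relation label.
val edges = sc.textFile("/data/users/ulanov/panama/all_edges.csv").
  mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }.
  map { line =>
    val tokens = line.split(",")
    Edge(tokens.head.toLong, tokens.last.toLong, tokens(1))
  }
// Every id that appears on either end of an edge becomes a vertex; vertices carry no attribute.
val vertices = edges.flatMap(edge => Seq(edge.srcId, edge.dstId)).distinct().map(x => (x, ()))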
val graph = Graph(vertices, edges)
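Splitting on "," and dropping the first line works as long as every row is well formed. As a rough sketch of a more defensive variant, the parsing step can be pulled into a plain Scala function that returns None for the header or for any row whose ids are not numeric. The names EdgeParsing, EdgeRow and parse are made up for this illustration, and the sample rows in the usage note are invented, not taken from the dataset.

```scala
import scala.util.Try

object EdgeParsing {
  // Hypothetical holder for one row of the edge list: source id, relation label, destination id.
  final case class EdgeRow(src: Long, rel: String, dst: Long)

  // Returns None for the header, for short rows, and for rows whose first or last field is not a number.
  def parse(line: String): Option[EdgeRow] = {
    val tokens = line.split(",", -1).map(_.trim)
    if (tokens.length < 3) None
    else Try(EdgeRow(tokens.head.toLong, tokens(1), tokens.last.toLong)).toOption
  }
}
```

For instance, EdgeParsing.parse("1,officer of,2") gives Some(EdgeRow(1, "officer of", 2)), while a header-like row such as "node_1,rel_type,node_2" gives None. In the Spark job something like flatMap(EdgeParsing.parse) could then stand in for the header-dropping step, assuming the relation label itself never contains a comma.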
Run PageRank to find the most important vertices:
import org.apache.spark.graphx.lib._
val pageRank = PageRank.run(graph, 50)
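To get a feel for what the 50 iterations compute, here is a small plain Scala sketch of the PageRank update on a toy graph. It is an illustration only, not the GraphX implementation: it uses the same style of update (the reset probability plus the damped sum of incoming contributions, with the default reset probability of 0.15), but the toy edges and the PageRankSketch object are invented for the example, and details such as the handling of vertices without outgoing edges may differ from GraphX. Note that ranks of this kind are not normalized to sum to one, which is why the scores on a large graph can be far greater than 1.

```scala
object PageRankSketch {
  // Invented toy graph: vertices 1, 2 and 3 all point at vertex 0, and 1 also points at 2.
  val toyEdges: Seq[(Long, Long)] = Seq((1L, 0L), (2L, 0L), (3L, 0L), (1L, 2L))

  def pageRank(edges: Seq[(Long, Long)], numIter: Int, resetProb: Double = 0.15): Map[Long, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    // Every vertex starts with rank 1.0.
    var ranks: Map[Long, Double] = vertices.map(v => v -> 1.0).toMap
    for (_ <- 1 to numIter) {
      // Each source splits its current rank equally among its outgoing edges.
      val contribs = edges
        .map { case (s, d) => d -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (v, cs) => v -> cs.map(_._2).sum }
      // New rank: reset probability plus the damped sum of incoming contributions.
      ranks = vertices.map(v => v -> (resetProb + (1 - resetProb) * contribs.getOrElse(v, 0.0))).toMap
    }
    ranks
  }
}
```

Calling PageRankSketch.pageRank(PageRankSketch.toyEdges, 10) ranks vertex 0 highest, since it collects contributions from all three other vertices.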
Take the top 10 vertices by rank:
pageRank.vertices.sortBy(x => x._2, false).take(10)
res28: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((236724,8831.740202475814), (288469,903.9413886137522), (264051,667.7063735145971), (285729,562.2529301436048), (237076,499.39065213060286), (237583,405.1697624077441), (279944,380.5466754547696), (271169,269.19387690203007), (236832,255.08207719736498), (237148,224.94910000303614))
The vertices with the top PageRank are the ids of companies that provide services for offshore entities. The top one, id 236724, is called "Portcullis TrustNet Chambers". Its PageRank of about 8832 is nearly ten times that of the second-ranked vertex. It is connected to about 36K entities: https://offshoreleaks.icij.org/nodes/54662. Stay tuned for more analysis.
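To put a number on how far ahead the top vertex is, the first few (id, rank) pairs from the result above can be compared directly. This is just a quick arithmetic check in plain Scala; the TopRanks object is a made-up name, and the ranks are the printed values rounded to two decimals.

```scala
object TopRanks {
  // First three (vertexId, rank) pairs copied from the PageRank result, ranks rounded to two decimals.
  val top: Seq[(Long, Double)] = Seq((236724L, 8831.74), (288469L, 903.94), (264051L, 667.71))

  // How many times larger the top rank is than the runner-up.
  val ratio: Double = top(0)._2 / top(1)._2
}
```

TopRanks.ratio comes out a little under 10, so the top vertex outranks the runner-up by almost an order of magnitude.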

Wednesday, May 11, 2016