Wednesday, May 11, 2016

The Panama papers graph analysis with Spark pt1

Recently the graph of Panama papers entities was released: https://offshoreleaks.icij.org/pages/database. It would be interesting explore the data with some data mining algorithms included in Spark GraphX. First we need to load the graph. Lets use edge information only:
import org.apache.spark.graphx._
// TODO: use csv loader https://github.com/databricks/spark-csv
val edges = sc.textFile("/data/users/ulanov/panama/all_edges.csv").mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }.map { line => val tokens = line.split(","); Edge(tokens.head.toLong, tokens.last.toLong, tokens(1)) }
val vertices = edges.flatMap (edge => Seq(edge.srcId, edge.dstId)).distinct().map(x => (x, ()))
val graph = Graph(vertices, edges)
Run PageRank to find the most important vertices:
import org.apache.spark.graphx.lib._
val pageRank = PageRank.run(graph, 50)
Take top 10 vertices:
pageRank.vertices.sortBy(x => x._2, false).take(10)
res28: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((236724,8831.740202475814), (288469,903.9413886137522), (264051,667.7063735145971), (285729,562.2529301436048), (237076,499.39065213060286), (237583,405.1697624077441), (279944,380.5466754547696), (271169,269.19387690203007), (236832,255.08207719736498), (237148,224.94910000303614))
The vertices with the top PageRank are the Ids of the companies that provide service for offshores. The top one has really decent PageRank and is called "Portcullis TrustNet Chambers", id 236724. It is connected to about 36K entities: https://offshoreleaks.icij.org/nodes/54662. Stay tuned for more analysis.


Tuesday, April 26, 2016

Log scale computations

Sometimes there is the need to multiply a lot of small positive numbers. This leads to underflow. Log scale is used to overcome (aka postpone) it:

  • log(a * b) = log(a) + log(b)

scala> val x = Double.MinPositiveValue
x: Double = 4.9E-324
scala> val xx = x * x
xx: Double = 0.0 // underflow
scala> val logx = math.log(x)
logx: Double = -744.4400719213812
scala> val logxx = logx + logx
logxx: Double = -1488.8801438427624

It enables to do log product of about E305 minimum Doubles and E307 of 0.1's:
scala> val logProdMin = Double.MinValue / logx
logProdMin: Double = 2.414825857268154E305
scala> val logProd01 = Double.MinValue / math.log(0.1)
logProd01: Double = 7.807282086260621E307

Interestingly, there is not a big difference in capacity for log product of 0.1 and 4.9E-324. The capacity looks good unless you are doing some sort of iterative computations, so that each step you need to multiply the products from the previous step. Apparently, after about a thousand of steps one will reach overflow:
scala> val iterProd = math.log(logProd01) / math.log(2)
iterProd: Double = 1022.7967455273003

That number is smaller for single precision: ~126.8.

Sometime there is the need to perform normalization of small numbers, i.e. division of each element by their sum. This is typical when your small numbers are in vector (a1, a2):
(a1, a2) = (a1 + a2) * (a1 / ( a1 + a2), a2 / (a1 + a2))
Now we could apply log, but what is log of sum?

  • log(a + b) = log[exp(log(a)) + exp(log(b))]=log[exp[(log(a))*(1 + exp(log(b) - log(a))]]=log(a) + log[1+exp(log(b) - log(a))]
  • similarly, log(a - b) = log(a) + log[1 - exp(log(b) - log(a))]

Note that we assumed that a > b, otherwise the difference will be (a - b). For a vector of bigger length:

  • log(a1 + ... + aN) = log(max(a)) + log[1 + exp(log(a1 +... aN - max(a)))]

Implementation of log(x) is not a very good approximation of log(1 + x), so there is a separate function for that purpose math.log1p.

However, there is exponent in the formula. If log(b) - log(a) is less than a log(Double.MinPositiveValue) then exponent is zero, log1p is zero and log(a+b)=log(a).

Relevant link: http://colorfulengineering.org/logmath-notes.pdf

Saturday, March 12, 2016

Code review

My main coding issues are "shotgun surgery" and "programming by permutation" according to the following:https://en.wikipedia.org/wiki/Design_smellhttps://en.wikipedia.org/wiki/Anti-patternhttps://en.wikipedia.org/wiki/Code_smell. I think those can be used for code review. Strangely, I have not found that they are included in any of code review tutorials.

Thursday, January 21, 2016

FLOPS estimation

The following formula can be used for estimating FLOPS for a CPU:
FLOPS = FLOPS_per_cycle * cores * frequency
Flops per cycle might be hard to find, they should be in the technical documentation of the particular CPU series, e.g. Haswell. Some of it can be found here: http://stackoverflow.com/questions/15655835/flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2. Sometimes there is a doc from producer with FLOPS: http://download.intel.com/support/processors/xeon/sb/xeon_5600.pdf

Wednesday, September 16, 2015

Intellij Idea Scala compilation weird error

I've generated a maven Scala project with maven, opened it in Idea and it did not compile:
Error:scalac: Error: org.jetbrains.jps.incremental.scala.remote.ServerException
Error compiling sbt component 'compiler-interface-2.10.0-52.0'
at sbt.compiler.AnalyzingCompiler$$anonfun$compileSources$1$$anonfun$apply$2.apply(AnalyzingCompiler.scala:145)
at sbt.compiler.AnalyzingCompiler$$anonfun$compileSources$1$$anonfun$apply$2.apply(AnalyzingCompiler.scala:142)
at sbt.IO$.withTemporaryDirectory(IO.scala:285)......
Full log can be found here. By the way, the answer on that site is not precise.
The problem is that Idea have found some scala-compiler files in my local maven repo. It tried to use them instead of Scala SDK: File->Project Structure->Modules->Dependecies. The solution is to remove scala-compiler from the list, then Idea will propose to add Scala SDK.

Friday, September 11, 2015

Parameter servers


Links to the open source servers:
1)Parameter server (CMU, Mu Li & Alex Smola) https://github.com/dmlc/parameter_server
2)Petuum (CMU, Eric Xing) http://petuum.github.io/
4)PS on Spark (Alibaba, Qiping Li) https://issues.apache.org/jira/browse/SPARK-6932

Links to the other servers described in research papers:

3)Factorbird (Stanford, Reza Zadeh) http://stanford.edu/~rezab/papers/factorbird.pdf