Technical notes: 2016

Friday, September 16, 2016

Integer overflow

Integer overflow in JVM can lead to negative, zero and positive values, i.e. can everything:
scala> val x = 1 + (Int.MaxValue - 1) / 2
x: Int = 1073741824
scala> x * x
res16: Int = 0
scala> val x = 1 + (Int.MaxValue - 5) / 2
x: Int = 1073741822
scala> x * x
res18: Int = 4
scala> val x = 1 + (Int.MaxValue - 6) / 2
x: Int = 1073741821
scala> x * x
res19: Int = -2147483639
The check for overflow due to multiplication divide back and check if get the same result. Summation overflow can be checked by the sign of result. The other option is to use double and long.

Thursday, August 11, 2016

Graph based user behavior modeling

http://www.cs.cmu.edu/~abeutel/kdd2015_tutorial/

Thursday, August 4, 2016

Check scalastyle

Maven - checks only main sources:
mvn scalastyle:check
Sbt check main:
sbt scalastyle
Sbt check test:
sbt test:scalastyle

Tuesday, July 5, 2016

Conference reference

Industrial conferences: http://www.kdnuggets.com/meetings/
Scientific conferences: http://www.wikicfp.com/
Ranking of scientific conferences by Microsoft: http://academic.research.microsoft.com/RankList?entitytype=3&topDomainID=2
Ranking by Google: https://scholar.google.ru/citations?view_op=top_venues&hl=en&vq=eng
Good old rankings page: http://webdocs.cs.ualberta.ca/~zaiane/htmldocs/ConfRanking.html

Friday, June 24, 2016

Java and Hadoop URI differences

Hadoop Path.toUri adds only / to the beginning of local path but does not add file:. Java File toURI adds file:/ to the beginning and / to the end of any path. In particular, JavaFile.toURI.getPath equals to Hadoop Path.toUri.toString.

Saturday, June 11, 2016

Most frequently used git commands

init local repo and push it to the remote repo (it has to be already created on github)
git init
git add .
git reset [file to exclude]
git commit -am "First commit"
git remote add origin [github address]
git push origin master
add remote repo, give it a name [name] and fetch it locally
git remote add [name] [github address]
git fetch [name]
create a new branch based on some branch
git checkout -b [new_branch] [some_branch]

create a new branch on top of uncommited changes

git checkout -b [new_branch]
edit/merge history of N last commits (replace pick with squash to merge commit)
git rebase -i HEAD~N
get file from the other branch
git checkout [branch] -- [file]
rebase on top of some branch
git rebase [branch]
if merge conflict - resolve by hand then
git add [resolved file]
git rebase --continue
if merge conflict - use either ours or theirs file
git checkout --ours|--theirs [file]
git add [resolved file]
git rebase --continue
cherry-pick a commit
git cherry-pick [commit 6-number hash]
patch
   add '.patch' to the pull request URL and press 'Enter' in the browser
   curl -L [url] > /tmp/my.patch
   git apply (--check) /tmp/my.patch
git cherry-pick [commit 6-number hash]
delete a branch
git -d [branch]
revert local changes to the last commit
git reset --hard HEAD
git push from some local branch to some remote
git push <REMOTENAME> <LOCALBRANCHNAME>:<REMOTEBRANCHNAME>
Pull/fetch/merge/rebase reference
https://www.derekgourlay.com/blog/git-when-to-merge-vs-when-to-rebase/

Wednesday, May 11, 2016

The Panama papers graph analysis with Spark pt1

Recently the graph of Panama papers entities was released: https://offshoreleaks.icij.org/pages/database. It would be interesting explore the data with some data mining algorithms included in Spark GraphX. First we need to load the graph. Lets use edge information only:
import org.apache.spark.graphx._
// TODO: use csv loader https://github.com/databricks/spark-csv
val edges = sc.textFile("/data/users/ulanov/panama/all_edges.csv").mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }.map { line => val tokens = line.split(","); Edge(tokens.head.toLong, tokens.last.toLong, tokens(1)) }
val vertices = edges.flatMap (edge => Seq(edge.srcId, edge.dstId)).distinct().map(x => (x, ()))
val graph = Graph(vertices, edges)
Run PageRank to find the most important vertices:
import org.apache.spark.graphx.lib._
val pageRank = PageRank.run(graph, 50)
Take top 10 vertices:
pageRank.vertices.sortBy(x => x._2, false).take(10)
res28: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((236724,8831.740202475814), (288469,903.9413886137522), (264051,667.7063735145971), (285729,562.2529301436048), (237076,499.39065213060286), (237583,405.1697624077441), (279944,380.5466754547696), (271169,269.19387690203007), (236832,255.08207719736498), (237148,224.94910000303614))
The vertices with the top PageRank are the Ids of the companies that provide service for offshores. The top one has really decent PageRank and is called "Portcullis TrustNet Chambers", id 236724. It is connected to about 36K entities: https://offshoreleaks.icij.org/nodes/54662. Stay tuned for more analysis.

Tuesday, April 26, 2016

Log scale computations

Sometimes there is the need to multiply a lot of small positive numbers. This leads to underflow. Log scale is used to overcome (aka postpone) it:

log(a * b) = log(a) + log(b)

scala> val x = Double.MinPositiveValue
x: Double = 4.9E-324
scala> val xx = x * x
xx: Double = 0.0 // underflow
scala> val logx = math.log(x)
logx: Double = -744.4400719213812
scala> val logxx = logx + logx
logxx: Double = -1488.8801438427624

It enables to do log product of about E305 minimum Doubles and E307 of 0.1's:
scala> val logProdMin = Double.MinValue / logx
logProdMin: Double = 2.414825857268154E305
scala> val logProd01 = Double.MinValue / math.log(0.1)
logProd01: Double = 7.807282086260621E307

Interestingly, there is not a big difference in capacity for log product of 0.1 and 4.9E-324. The capacity looks good unless you are doing some sort of iterative computations, so that each step you need to multiply the products from the previous step. Apparently, after about a thousand of steps one will reach overflow:
scala> val iterProd = math.log(logProd01) / math.log(2)
iterProd: Double = 1022.7967455273003

That number is smaller for single precision: ~126.8.

Sometime there is the need to perform normalization of small numbers, i.e. division of each element by their sum. This is typical when your small numbers are in vector (a1, a2):
(a1, a2) = (a1 + a2) * (a1 / ( a1 + a2), a2 / (a1 + a2))
Now we could apply log, but what is log of sum?

log(a + b) = log[exp(log(a)) + exp(log(b))]=log[exp[(log(a))*(1 + exp(log(b) - log(a))]]=log(a) + log[1+exp(log(b) - log(a))]
similarly, log(a - b) = log(a) + log[1 - exp(log(b) - log(a))]

Note that we assumed that a > b, otherwise the difference will be (a - b). For a vector of bigger length:

log(a1 + ... + aN) = log(max(a)) + log[1 + exp(log(a1 +... aN - max(a)))]

Implementation of log(x) is not a very good approximation of log(1 + x), so there is a separate function for that purpose math.log1p.

However, there is exponent in the formula. If log(b) - log(a) is less than a log(Double.MinPositiveValue) then exponent is zero, log1p is zero and log(a+b)=log(a).

Relevant link: http://colorfulengineering.org/logmath-notes.pdf

Monday, April 25, 2016

Typical memory usage summary

Saturday, March 12, 2016

Code review

My main coding issues are "shotgun surgery" and "programming by permutation" according to the following:https://en.wikipedia.org/wiki/Design_smell, https://en.wikipedia.org/wiki/Anti-pattern, https://en.wikipedia.org/wiki/Code_smell. I think those can be used for code review. Strangely, I have not found that they are included in any of code review tutorials.

Thursday, January 21, 2016

FLOPS estimation

The following formula can be used for estimating FLOPS for a CPU:

FLOPS = FLOPS_per_cycle * cores * frequency

Flops per cycle might be hard to find, they should be in the technical documentation of the particular CPU series, e.g. Haswell. Some of it can be found here: http://stackoverflow.com/questions/15655835/flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2. Sometimes there is a doc from producer with FLOPS: http://download.intel.com/support/processors/xeon/sb/xeon_5600.pdf