Wednesday, September 16, 2015

IntelliJ IDEA Scala compilation weird error

I generated a Scala project with Maven, opened it in Idea, and it did not compile:
Error:scalac: Error: org.jetbrains.jps.incremental.scala.remote.ServerException
Error compiling sbt component 'compiler-interface-2.10.0-52.0'
at sbt.compiler.AnalyzingCompiler$$anonfun$compileSources$1$$anonfun$apply$2.apply(AnalyzingCompiler.scala:145)
at sbt.compiler.AnalyzingCompiler$$anonfun$compileSources$1$$anonfun$apply$2.apply(AnalyzingCompiler.scala:142)
at sbt.IO$.withTemporaryDirectory(IO.scala:285)......
The full log can be found here. By the way, the answer on that site is not precise.
The problem is that Idea found some scala-compiler jars in my local Maven repo and tried to use them instead of the Scala SDK: File->Project Structure->Modules->Dependencies. The solution is to remove scala-compiler from that list; Idea will then propose to add the Scala SDK.

Friday, September 11, 2015

Parameter servers


Links to the open source servers:
1) Parameter Server (CMU, Mu Li & Alex Smola) https://github.com/dmlc/parameter_server
2) Petuum (CMU, Eric Xing) http://petuum.github.io/
3) PS on Spark (Alibaba, Qiping Li) https://issues.apache.org/jira/browse/SPARK-6932

Links to other servers described in research papers:

1) Factorbird (Stanford, Reza Zadeh) http://stanford.edu/~rezab/papers/factorbird.pdf

Monday, August 17, 2015

Generating Spark docs on Windows

The process of generating the Spark docs on Linux is described in https://github.com/apache/spark/blob/master/docs/README.md. It might be a little tricky on Windows. You need to install Ruby 2.0 with two gems, then Python 2.7 with two packages. Nice docs for the first step are at http://jekyll-windows.juthilo.com/.

  • Get Ruby 2.0.0 for your architecture from http://rubyinstaller.org/downloads/. Install it and check "Add Ruby executables to your PATH".
  • Get the Ruby DevKit from the same site. Unzip it to C:\RubyDevKit and install it as follows. Before running the "install" command (the last one), check that "config.yml" contains the path to your Ruby installation; add it if missing.

cd C:\RubyDevKit
ruby dk.rb init
ruby dk.rb install

  • Install "jekyll" as follows specifying your proxy if needed:

gem install --http-proxy http://proxy:port jekyll
gem install --http-proxy http://proxy:port jekyll-redirect-from

  • Get Python 2.7 from https://www.python.org/ and install it. Make sure that the "Python" and "Python\Scripts" folders are added to your PATH.
  • Install "pygments" and "sphinx". To use proxy, you need to have environment variable "http_proxy" with your proxy:port.

pip install pygments
pip install sphinx 
If everything went OK, you will be able to generate the docs from the docs folder. Let's skip the API docs:
cd %SPARK_HOME%\docs
set SKIP_API=1
jekyll build 
It will create a folder "_site" with all the docs generated as HTML.

Thursday, July 9, 2015

In-place update of Spark RDD

RDDs in Spark are immutable. When you need to change an RDD, you produce a new RDD. When you need to change all the data in an RDD frequently, e.g. when running some iterative algorithm, you might not want to spend time and memory on creating a new structure. However, when you cache the RDD, you can get a reference to the data inside and change it in place. You need to make sure that the RDD stays in memory all the time. The following is a hack around RDD immutability and is not recommended. Also, fault tolerance is lost, though you can force check-pointing (see the sketch after the example below).
// class that allows modification of its variable with inc function
class Counter extends Serializable { var i: Int = 0; def inc: Unit = i += 1 }
// Approach 1: create an RDD with 5 instances of this class
val rdd = sc.parallelize(1 to 5, 5).map(x => new Counter())
// trying to apply modification
rdd.foreach(x => x.inc)
// modification did not work: all values are still zeroes
rdd.collect.foreach(x => println(x.i))
// Approach 2: create a cached RDD with 5 instances of the Counter class
val cachedRdd = sc.parallelize(1 to 5, 5).map(x => new Counter()).cache
// trying to apply modification
cachedRdd.foreach(x => x.inc)
// modification worked: all values are ones
cachedRdd.collect.foreach(x => println(x.i))
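If you want some of the fault tolerance back, you can force check-pointing of the cached RDD. A minimal sketch, assuming the Counter class from above; the checkpoint directory is just a placeholder (use an HDFS path on a real cluster):
// placeholder checkpoint directory
sc.setCheckpointDir("/tmp/spark-checkpoints")
// build and cache the RDD, then mark it for check-pointing before running any actions
val checkpointedRdd = sc.parallelize(1 to 5, 5).map(x => new Counter()).cache
checkpointedRdd.checkpoint()
// an action materializes both the cache and the checkpoint
checkpointedRdd.count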

Monday, July 6, 2015

Git https ssh errors

I get the following error on Linux when trying to git push:
error: The requested URL returned error: 403 Forbidden while accessing https://github.com/avulanov/ann-benchmark.git/info/refs
When I set:
git remote set-url origin https://username@github.com/username/reponame.git
I get:
Gtk-WARNING **: cannot open display:
The following command fixes it:
unset SSH_ASKPASS
http://stackoverflow.com/questions/16077971/git-push-produces-gtk-warning

Thursday, July 2, 2015

Installation of the new NVIDIA driver on Red Hat 6.3

In my case, when I download a new driver from http://www.nvidia.com/download/find.aspx and try to run it, I get "ERROR: You appear to be running an X server; please exit X before installing.". I comment out the X startup in /etc/rc.local and /etc/rc.d/rc.local (in my case there are two such lines). After the reboot there is no X server running, but the driver installer now asks to unload the nvidia kernel module. I can't do this with "rmmod -f" because the module is in use. A GPU monitoring utility might be using it, so you might want to kill gmond. In my case the following worked:
sudo yum remove nvidia-kmod 
It uninstalls the kernel module together with the old driver, so now the new driver can be installed.

IntelliJ IDEA Scala: bad compiler option error message

It happens in 14.1.2. Remove the offending option in Options->Scala Compiler->Additional options.
http://stackoverflow.com/questions/26995023/errorscalac-bad-option-p-intellij-idea

Friday, May 15, 2015

GNU screen cheat sheet

Screen allows you to disconnect from and reconnect to a running shell from multiple locations. If the process is already running outside screen, you can steal it into screen using reptyr: reptyr PID. Screen cheat sheet from http://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/:
yum install screen
screen
screen -r #reattach
“Ctrl-a” then “?”: show the screen help page.
“Ctrl-a” “c”: new window.
“Ctrl-a” “n”: next window; “Ctrl-a” “p”: previous window.
“Ctrl-a” “d”: detach from the session.
“Ctrl-a” “H”: create a running log of the session.
“Ctrl-a” “M”: monitor the window for activity.
“Ctrl-a” “x”: lock the session; a password will be required to access it again.
“Ctrl-a” “k”: kill the current window (you will be asked to confirm).

Thursday, May 7, 2015

Cluster configuration and Apache Spark installation, configuration and start

Stand-alone cluster configuration notes

Skip this if you have already configured a Hadoop cluster.
Create users on all nodes:
useradd hduser
passwd hduser 
groupadd hadoop
usermod -a -G hadoop hduser
Log in as the new user:
sudo su - hduser 
Spark nodes interact via ssh, so password-less ssh should be enabled on all nodes:
ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Check that ssh works locally without password:
ssh localhost 
Copy public key from the master node to worker node:
ssh-copy-id -i ~/.ssh/id_rsa.pub user@worker_node 

Spark compilation notes

Spark needs Java (at least 1.7), Scala 2.10 (does not support 2.11) and Maven. Hadoop is optional.
Install Java and Scala using a package manager (yum or apt-get), or download the rpm files from the corresponding sites and install them:
sudo rpm -i java.rpm
sudo rpm -i scala.rpm
Download and unzip Maven to /usr/local/maven. Configure a proxy for Maven, if needed, in maven/conf/settings.xml. Another config file might be in ~/.m2/settings.xml.
Add to your ~/.bashrc:
export M2_HOME=/usr/local/maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
If Java is < 1.8:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
If you need proxy:
export http_proxy=http://my-web-proxy.net:8088
export https_proxy=http://my-web-proxy.net:8088
Sometimes Maven picks up these proxy settings, so keep them in mind if the build below fails because Maven cannot download something.
Clone from git, compile, and change the owner to hduser (the user with password-less ssh between the nodes):
sudo git clone https://github.com/apache/spark.git /usr/local/spark
cd /usr/local/spark
mvn -Dhadoop.version=1.2.1 -Pnetlib-lgpl -DskipTests clean package
sudo chown -R hduser:hadoop /usr/local/spark

Spark installation notes

Assume that the Spark compilation was done on the master node. One needs to copy Spark to /usr/local/spark on all the other nodes in the cluster and change its owner to hduser (as above).
Also, add to hduser's ~/.bashrc on all nodes:
export SPARK_HOME=/usr/local/spark
export _JAVA_OPTIONS=-Djava.io.tmpdir=[Folder with a lot of space]
The latter option sets the Java temporary folder that Spark uses when it writes shuffle data. By default it is /tmp, which is usually small.
Also, if there is a Hadoop installation, it is useful to point Spark at its configuration instead of the default one (e.g. for the replication factor) via the HADOOP_CONF_DIR variable:
export HADOOP_CONF_DIR=$HADOOP_HOME/conf

Spark configuration notes

Some theory:

  • Spark runs one Master and several Workers. It is not recommended to have both the Master and a Worker on the same node. It is worth having only one Worker per node, owning all of the node's RAM and CPU cores, unless the node has many CPUs or the task is better solved by many Workers.
  • When one submits a task, Spark creates a Driver on the Master node and Executors on the Worker nodes.

It would be nice if one only had to configure the Master node and all options were transferred to the Workers. However, that is not the case. Still, there is a minimal configuration in which you don't need to touch each Worker's config: one Worker per node.
spark/conf/spark-defaults.conf:
spark.master    spark://mymaster.com:7077
spark.driver.memory     16g
spark.driver.cores      4
spark.executor.memory   16g #no more than available, otherwise will fail
spark.local.dir /home/hduser/tmp #Shuffle directory, should be fast and big disk
spark.sql.shuffle.partitions    2000 #number of reducers for SQL, default is 200
List all the Worker nodes in spark/conf/slaves:
my-node2.net
my-node3.net 

Spark start

Start all nodes:
$SPARK_HOME/sbin/start-all.sh
You should be able to see the web interface at my-node1.net:8080 (the default port of the standalone Master web UI).
Start Spark shell:
$SPARK_HOME/bin/spark-shell --master spark://my-node1.net:7077
Stop all nodes:
$SPARK_HOME/sbin/stop-all.sh


Monday, May 4, 2015

Hadoop free space and file sizes

It is useful to estimate the size your data will occupy and the free space available before you write something to HDFS. The default block size in HDFS is 64 MB, so one file takes at least one 64 MB block. Also, the default replication factor is 3. With file sizes in MB, the occupied space in blocks will be roughly:
3 * Sum(i)(size[i] / 64 + 1)
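For example, under these defaults, ten files of 100 MB each take about 3 * 10 * (100 / 64 + 1) = 60 blocks (counting 100 / 64 as integer division), i.e. roughly 60 * 64 MB ≈ 3.8 GB of raw HDFS space.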
Check the block size and replication factor:
$HADOOP_HOME/bin/hadoop fsck / 
Check the free space (plain free space, not taking into account replication or block size):
$HADOOP_HOME/bin/hadoop dfsadmin -report
How big a folder is (on disk it is actually the replication factor times bigger):
$HADOOP_HOME/bin/hadoop dfs -dus [/some/folder]

Git rebase (put your history on top of upsteam history)

Git rebase is usually needed when you want to push commits to a branch that is ahead of yours and you don't want to add merge commits to the branch history. It's considered good form. Rebase replays your history on top of the branch. Workflow for git rebase:
git rebase [upstream/master]
If there are conflicts, resolve them by hand, or:
git checkout --ours (or --theirs) [filename]
git add [filename]
Continue:
git rebase --continue #if you edited the file or used --theirs (use --skip if --ours left nothing to apply)

Wednesday, April 29, 2015

Apache Spark Dataframe SQL vs RDD functions

DataFrame is a wrapper for RDD in Spark that can, for example, wrap an RDD of case classes. One can run SQL queries against a DataFrame, which is convenient. Eventually, the SQL is translated into RDD operations. However, there are some differences. Let's compare sorting of 2 billion records, i.e. RDD "sortBy" vs DataFrame SQL "order by":

Create 2B rows of MyRecord in 2000 partitions, so each partition will have 1M rows. (We should not have fewer partitions, because then the number of rows per partition becomes problematic for sorting on a commodity node.)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class MyRecord(time: Double, id: String)
val rdd = sc.parallelize(1 to 2000, 2000).flatMap(x =>
  Seq.fill(1000000)(MyRecord(util.Random.nextDouble, "xxx")))
Let's sort this RDD by time:
val sorted = rdd.sortBy(x => x.time)
sorted.count
It finished in about 8 minutes on my cluster of 8 nodes. Everything is fine. You can also check the completed tasks in the Spark web UI. The number of reducers was equal to the number of partitions, i.e. 2000.

Let's convert the original RDD to a DataFrame and sort again:
val df = sqlContext.createDataFrame(rdd)
df.registerTempTable("data")
val result = sqlContext.sql("select * from data order by time")
result.count
It will run for a while and then crash. If you check the tasks in the Spark web UI, you will see that some of them were cancelled due to lost executors (ExecutorLost) caused by some strange exceptions. It is really hard to trace back which executor was the first to be lost; the others follow it like a house of cards. What's the problem? The number of reducers. For the first job it is equal to the number of partitions, i.e. 2000, but for the second it switched to 200. Why? That is the default value of
spark.sql.shuffle.partitions
So it is better to set an appropriate value before running SQL queries in Spark.
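For example, a minimal sketch of setting it on the same SQLContext before re-running the query (2000 simply matches the number of input partitions above):
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
val result2 = sqlContext.sql("select * from data order by time")
result2.count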

Thursday, April 23, 2015

Cross entropy and squared error and their derivatives

The cost function of target vs output might be expressed as cross entropy or as squared error. The output is generated by some function, e.g. logistic (sigmoid) or softmax. Depending on the combination of error and output function, different derivatives are used for the gradient computation. The following article gives a very good explanation along with the needed formulas:
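For a quick reference, a sketch of the gradients for the two most common combinations (my own notation: $t_k$ is the target, $y_k$ the output and $z_k$ the input of output unit $k$). For a softmax output with cross-entropy error $E = -\sum_k t_k \ln y_k$:
$$\frac{\partial E}{\partial z_k} = y_k - t_k$$
For a logistic (sigmoid) output with squared error $E = \frac{1}{2}\sum_k (y_k - t_k)^2$:
$$\frac{\partial E}{\partial z_k} = (y_k - t_k)\, y_k (1 - y_k)$$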

Wednesday, March 25, 2015

CBLAS compilation as a shared library

The reference CBLAS from Netlib compiles into a static library (http://www.netlib.org/blas/blast-forum/cblas.tgz). I needed a shared library.
First, you need to download and compile BLAS, since CBLAS is just a C interface to the Fortran BLAS:
wget http://www.netlib.org/blas/blas.tgz
tar xzvf blas.tgz
cd BLAS
make
It will produce the static library "blas_LINUX.a".
Next, you need to download and configure CBLAS:
wget http://www.netlib.org/blas/blast-forum/cblas.tgz
tar xzvf cblas.tgz
cd CBLAS
Replace the corresponding variables in "Makefile.in" with:
BLLIB = /path_to_compiled_BLAS/blas_LINUX.a
CBLIB = ../lib/cblas_$(PLAT).so
CFLAGS = -O3 -DADD_ -fPIC
FFLAGS = -O3 -fPIC
ARCH = gcc
ARCHFLAGS = -shared -o
Finally, make CBLAS:
make
It will produce the shared library "cblas_LINUX.so".

Monday, March 9, 2015

Serializing classes or models in Apache Spark

Normal serialization does work, but the deserialized objects cannot be mapped over an RDD, i.e. their functions cannot be applied to an RDD. The hack:
sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
val sameModel = sc.objectFile[YourCLASS]("path").first()

Monday, March 2, 2015

Unzip files from network drive and copy to hadoop

The data is zipped on a shared network drive. The goal is to unzip it and copy some of the files to Hadoop.
Mount the drive:
mkdir data
sudo mount --verbose -t cifs //some.net/shared/data -o username=user,password=**** data
Unzip&copy script:
#!/bin/bash
for ARCHFILE in ~/data/*.7z
do
   echo "$ARCHFILE"
   # list the archive, keep only .txt and .bin entries, take the file name (last column)
   7za l "$ARCHFILE" | grep '\.txt\|\.bin' | awk '{print $NF}' |
   while read -r MYFILE ; do
      7za e "$ARCHFILE" "$MYFILE"
      FF=$(basename "$MYFILE")
      echo "$FF"
      # place here the corresponding Hadoop folder name
      $HADOOP_HOME/bin/hadoop dfs -copyFromLocal "$FF" "/data/$FF"
      rm "$FF"
   done
done

Thursday, February 19, 2015

Funny thing about Apache Spark

To build it you need Java and Scala. To build its docs you need Ruby and Python.

Monday, January 26, 2015

Difference between multinomial logistic regression and artificial neural network (multilayer perceptron) without hidden layer

At first they seem to be the same: they both have sigmoids inside and NxK parameters, where N is the number of inputs and K is the number of outputs. However, the regression uses the so-called softmax function, so all outputs sum to 1. The difference shows up when one runs the optimization routine: logistic regression optimizes cross-entropy, while such an ANN optimizes mean squared error. Taking the partial derivatives of these cost functions results in slightly different formulas for the regression and for back-propagation. See more at http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm and http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression. A tiny illustration of the two output functions is sketched below.
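A minimal sketch (with made-up numbers) of the difference between the two output functions: sigmoid outputs are independent of each other, while softmax outputs always sum to 1:
val z = Array(1.0, 2.0, 3.0) // outputs of the linear layer for 3 classes
val sigmoid = z.map(v => 1.0 / (1.0 + math.exp(-v))) // each value in (0, 1), the sum is unconstrained (~2.56 here)
val expZ = z.map(math.exp)
val softmax = expZ.map(_ / expZ.sum) // non-negative values that sum to 1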

Friday, January 23, 2015

Git rebase

Rebase is needed to put your history on top of the history of a given branch when you don't want to see merge commits in that branch's history. This is important for keeping the main branch history clean.
Rebase process:
git rebase [upstream/master]
Hopefully, everything is ok. If there are conflicts, resolve them by hand or use theirs/ours:
git checkout --theirs (or --ours) filename
Then you need to add the edited file:
git add filename
And continue the rebase if you hand-edited the file or used --theirs, or skip it otherwise:
git rebase --continue (or --skip)