Tuesday, December 23, 2014

HTTPS certificate for Tomcat

Tomcat has its own documentation on how to do this; I just followed it with Tomcat 7:

Generate a personal private key with the Java keytool:
keytool -genkey -alias www.mysite.com -dname "cn=www.mysite.com, o=mysite, o=.com" -keysize 2048 -keyalg RSA
Generate a certificate signing request:
keytool -certreq -alias www.mysite.com -file www.mysite.com.csr
Submit the resulting request (as text) to a certificate authority. I did it with https://www.startssl.com/ for free. It produced a certificate (as text) that I put into a www.mysite.com.signed.crt file.
Before using it, you need to import the authority's root certificate and the Class 1 domain validation certificate into your keystore, otherwise the import of the reply will not find the right certificate chain. In my case it was:
wget http://www.startssl.com/certs/ca.crt
keytool -import -trustcacerts -alias startcom.ca -file ca.crt
wget https://startssl.com/certs/sca.server1.crt
keytool -import -alias startcom.ca.sub -file sca.server1.crt
Finally, import the signed certificate into your keystore:
keytool -import -alias www.mysite.com -file www.mysite.com.signed.crt
Now, you need to configure Tomcat via server.xml:
     <Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
               maxThreads="150" scheme="https" secure="true"
               clientAuth="false" sslProtocol="TLS"
                keystoreFile="/home/ubuntu/.keystore"
                keystorePass="PASSWORD"
        compression="on"
        compressionMinSize="2048"
        noCompressionUserAgents="gozilla, traviata"
        compressableMimeType="text/html,text/xml,text/javascript,text/css,image$
    />
You may also want all HTTP requests redirected to HTTPS:
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               URIEncoding="UTF-8"
               redirectPort="8443"
        compression="on"
        compressionMinSize="2048"
        noCompressionUserAgents="gozilla, traviata"
        compressableMimeType="text/html,text/xml,text/javascript,text/css,image$
/>
Finally, your iptables rules must route both HTTP and HTTPS requests to the Tomcat ports:
#IPTABLES
# forward 80 to 8080 and save tables
# ports 80 and 443 must be opened in the AMAZON AWS UI Security Groups
sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 8080
sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 443 -j REDIRECT --to-port 8443
sudo iptables-save
sudo bash -c 'iptables-save > /etc/iptables/rules.v4'
sudo iptables  -nvL -t nat

Thursday, December 11, 2014

Apache Spark workers configuration

How can one guess the number of Workers per cluster node that provides the best performance? I ran a few tests on a 6-node cluster of 3.3GHz, 4-core, 16GB machines, using the mnist8m classification task as a benchmark. It turned out that the number of Workers should be smaller than the number of cores. Moreover, one Worker per node ran in the same time as several, which means that a single Worker already takes advantage of multiple cores. Also, the number of data partitions should be greater than or equal to the number of Workers, otherwise the data will be processed by fewer Workers than are available.
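As an illustration of the last point (assuming sc is an existing SparkContext; the input path and Worker count below are made up), the partition count of an RDD can be checked and, if needed, increased so that every Worker gets a share of the data:

    // Sketch: make sure the number of partitions is at least the number of Workers
    val numWorkers = 6                               // e.g. one Worker per node on a 6-node cluster
    val data = sc.textFile("/data/mnist8m.txt")      // hypothetical input
    val partitioned =
      if (data.partitions.length < numWorkers) data.repartition(numWorkers) else data
    println("partitions: " + partitioned.partitions.length)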

Rule of thumb for classification

There are quite a few machine learning classifiers. It is usually hard to say which is better until each one is tried on the given data and its performance is measured. However, there are a few rules of thumb:
  • A linear classifier is better used when:
    • The data is sparse (a lot of zeroes in the feature vectors)
    • Feature engineering has been performed, or features come from deep feature learning
    • Datasets up to a large size (the data fits on one machine)
  • A non-linear or kernel-based classifier is better used when:
    • There are only a few features (up to tens)
    • Big data - a lot of training examples
Bonus: how to handle an imbalanced training set (a sketch of negative subsampling follows the list):
  • Evaluation: area under the precision-recall (PR) curve
  • Negative subsampling
  • Weights for the imbalanced classes (also tune the regularization parameter)
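A minimal sketch of negative subsampling on Spark RDDs, assuming training is an RDD[LabeledPoint] with labels 1.0 (positive) and 0.0 (negative); the variable names and seed are made up:

    // Sketch: downsample the majority (negative) class to roughly balance the classes
    val positives = training.filter(_.label == 1.0)
    val negatives = training.filter(_.label == 0.0)
    val fraction  = positives.count().toDouble / negatives.count()  // keep about as many negatives as positives
    val balanced  = positives.union(negatives.sample(false, fraction, 42L))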

Friday, September 5, 2014

How to run netlib-java/breeze in native mode with BLAS/Lapack binaries on Windows 7 x64

Netlib-java enables fast linear algebra in native mode with BLAS/LAPACK libraries. Breeze, a numerical library for Scala, is based on netlib-java as well. The problem is where to get the binaries and how to set up netlib-java. So, which libraries are needed?
  • libquadmath-0.dll //MINGW
  • libwinpthread-1.dll //MINGW
  • libgcc_s_seh-1.dll //MINGW
  • libgfortran-3.dll //MINGW
  • liblapack3.dll //OpenBLAS, a copy of libopenblas.dll
  • libblas3.dll //OpenBLAS, a copy of libopenblas.dll
  • netlib-native_system-win-x86_64.dll //netlib-java
So, a few libs from MINGW, one from OpenBLAS copied twice, and one from netlib-java. It is important that all libraries have the same architecture: x64 (or x86); mixing doesn't work. Here are the links where these libraries exist pre-compiled:


You need to place the required DLLs into the project folder or into some folder that is on the PATH. Your Maven pom must have the following dependency if you use only netlib-java:
  •       <dependency>
              <groupId>com.github.fommil.netlib</groupId>
              <artifactId>all</artifactId>
              <version>1.1.2</version>
          </dependency>
or if you use breeze:
  •     <dependency>
          <groupId>org.scalanlp</groupId>
          <artifactId>breeze_${scala.binary.version}</artifactId>
          <version>0.9</version>
        </dependency>
      <dependency>
              <groupId>com.github.fommil.netlib</groupId>
              <artifactId>all</artifactId>
              <version>1.1.2</version>
      </dependency>
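Once the DLLs are in place, you can check whether the native implementation is actually picked up rather than the pure-Java fallback. A small sanity check (the object name is made up):

    // Print which BLAS/LAPACK implementations netlib-java has loaded.
    // With the native DLLs found, the class names contain "NativeSystem"; otherwise F2j (pure Java) is reported.
    import com.github.fommil.netlib.{BLAS, LAPACK}

    object NetlibCheck extends App {
      println("BLAS:   " + BLAS.getInstance().getClass.getName)
      println("LAPACK: " + LAPACK.getInstance().getClass.getName)
    }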
Another option is to compile OpenBLAS yourself. You will need MinGW-w64, the OpenBLAS sources, plus MSYS: http://sourceforge.net/projects/mingw-w64/files/External%20binary%20packages%20%28Win64%20hosted%29/MSYS%20%2832-bit%29/ (more details at https://code.google.com/p/tonatiuh/wiki/InstallingMinGWForWindows64)

  • unzip the OpenBLAS sources
  • extract MinGW and MSYS into the same folder
  • run msys.bat from the MinGW/MSYS folder
  • cd to the OpenBLAS sources dir and run "make BINARY=64"; after a while you'll get libopenblas.dll

Thursday, August 7, 2014

Text classification with Apache Spark 1.1 (sentiment classification)

A word to vector space model converter was recently implemented in Apache Spark MLlib, so it is now possible to perform text classification. Let's see how it works with sentiment analysis.

  • Download a Pang and Lee sentence polarity dataset v 1.0 from http://www.cs.cornell.edu/people/pabo/movie-review-data/. It contains 5331 positive and 5331 negative processed sentences.
  • Clone and install the latest version of Apache Spark that contains the HashingTF and MulticlassMetrics classes
  • The code snippet below can be executed in the Spark shell or as a separate application that depends on spark-core and spark-mllib:
  •   import org.apache.spark.SparkContext
        import org.apache.spark.mllib.feature.HashingTF
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.classification.NaiveBayes
        import org.apache.spark.mllib.evaluation.MulticlassMetrics
        /* instantiate the Spark context (not needed when running inside the Spark shell) */
        val sc = new SparkContext("local", "test")
        /* word to vector space converter, limit to 10000 words */
        val htf = new HashingTF(10000)
        /* load positive and negative sentences from the dataset */
        /* let 1 - positive class, 0 - negative class */
        /* tokenize sentences and transform them into vector space model */
        val positiveData = sc.textFile("/data/rt-polaritydata/rt-polarity.pos")
          .map { text => new LabeledPoint(1, htf.transform(text.split(" ")))}
        val negativeData = sc.textFile("/data/rt-polaritydata/rt-polarity.neg")
          .map { text => new LabeledPoint(0, htf.transform(text.split(" ")))}
        /* split the data 60% for training, 40% for testing */
        val posSplits = positiveData.randomSplit(Array(0.6, 0.4), seed = 11L)
        val negSplits = negativeData.randomSplit(Array(0.6, 0.4), seed = 11L)
        /* union train data with positive and negative sentences */
        val training = posSplits(0).union(negSplits(0))
        /* union test data with positive and negative sentences */
        val test = posSplits(1).union(negSplits(1))
        /* Multinomial Naive Bayesian classifier */
        val model = NaiveBayes.train(training)
        /* predict */
        val predictionAndLabels = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }
        /* metrics */
        val metrics = new MulticlassMetrics(predictionAndLabels)
        /* output F1-measure for all labels (0 and 1, negative and positive) */
        metrics.labels.foreach( l => println(metrics.fMeasure(l)))
  • I got an F1-measure of around 74% for both classes; similar results can be observed in Weka:
  • 0.7377086668191173
    0.7351650888940199
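As an optional addition to the snippet above, MulticlassMetrics can also print the confusion matrix, which makes it easier to see where the classifier errs:

        /* optional: print the confusion matrix (rows are actual labels, columns are predictions) */
        println(metrics.confusionMatrix)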

Build a specific Maven project

Assume there is a parent project with child projects (modules). You can build a single child from the parent folder with the following:
mvn install -pl NAME -am
NAME is the folder of the child project; -am ("also make") also builds the modules it depends on.
http://stackoverflow.com/questions/1114026/maven-modules-building-a-single-specific-module

Wednesday, August 6, 2014

How to use Apache Spark libraries that were compiled locally in your project

Officially released versions of the Apache Spark libraries are in Maven Central (http://search.maven.org/), so you can always add dependencies on them to your project and Maven will download them. See how to create a Maven project that uses the Spark libraries at avulanov.blogspot.com/2014/07/how-to-create-scala-project-for-apache.html. I want to use the latest build of Spark in my Maven project, moreover, my own custom build of Spark. There are at least two ways of doing this. The first is to build Apache Spark with the `install` goal (or to run `install` for a particular Spark sub-project), which puts all Spark artifacts into your local repository:
  1. mvn -Dhadoop.version=1.2.1 -DskipTests clean install
The second is to package Spark and install only the jar you need:
  1. Compile your local version of Apache Spark with mvn -Dhadoop.version=1.2.1 -DskipTests clean package
  2. Install the jar into your local repository, e.g. for spark-core: mvn install:install-file -Dfile=/spark/core/target/spark-core_2.10-1.1.0-latest.jar -DpomFile=/spark/core/pom.xml -DgroupId=org.apache.spark -Dversion=1.1.0-latest -DartifactId=spark-core_2.10
  3. Reference the new version of this library (1.1.0-latest) in your pom.xml
    • There might be a problem with imports and their versions, so try to run mvn install (I ran it in the IDEA IDE). In my case Maven didn't like that the asm and lz4 dependencies didn't have versions specified; specify them if needed.

Tuesday, August 5, 2014

Hashing trick for word dictionary

One can use a hash function to build a dictionary and convert text documents to the vector space representation. The dictionary size N has to be specified and the documents tokenized into terms; then Hash(term) mod N is the term's index in the VSM. More details at: http://en.wikipedia.org/wiki/Feature_hashing and http://www.shogun-toolbox.org/static/notebook/current/HashedDocDotFeatures.html
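A rough sketch of the idea in Scala (with deliberately naive whitespace tokenization; Spark's HashingTF implements the same trick):

    // Hashing trick: map each term to an index in a fixed-size term-frequency vector
    val N = 10000                                                       // dictionary (vector) size
    def termIndex(term: String): Int = ((term.hashCode % N) + N) % N    // Hash(term) mod N, kept non-negative

    def toVsm(doc: String): Array[Double] = {
      val vec = new Array[Double](N)
      doc.toLowerCase.split("\\s+").foreach(t => vec(termIndex(t)) += 1.0)  // colliding terms share a slot
      vec
    }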

Thursday, July 3, 2014

How to create a Scala project for Apache Spark

    • Install maven
    • Generate maven project with
    • mvn archetype:generate -B \
        -DarchetypeGroupId=net.alchim31.maven -DarchetypeArtifactId=scala-archetype-simple -DarchetypeVersion=1.5 \
        -DgroupId=org.apache.spark -DartifactId=spark-myNewProject -Dversion=0.1-SNAPSHOT -Dpackage=org.apache.spark
    • or with
    • mvn archetype:generate
    • Add a new dependency to your pom.xml (check the latest version at http://search.maven.org/)
    •       <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.10</artifactId>
                <version>1.0.2</version>
            </dependency>
      
    • Install IntelliJ IDEA and the Scala plugin for it (alternatively, Eclipse with the Scala plugin, which didn't work for me)
    • Open the pom file in IntelliJ IDEA
    • Important:
      • Be careful when including libraries in your project, because they may already be present among the Spark libraries with a different version, in which case you will get java.lang.IncompatibleClassChangeError: Implementing class
      • If you want to create a SparkContext and perform RDD operations on Windows, there is a known issue with winutils from Hadoop: you need to get winutils.exe and set HADOOP_HOME to point to it. A minimal test application is sketched after this list.
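For reference, a minimal application that exercises the spark-core dependency above could look like the sketch below (the object name is made up; it runs in local mode):

    // Minimal Spark application to check that the project builds and runs
    import org.apache.spark.SparkContext

    object SparkHello {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "spark-hello")
        val numbers = sc.parallelize(1 to 1000)
        println("sum = " + numbers.reduce(_ + _))    // expected output: sum = 500500
        sc.stop()
      }
    }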