Tuesday, December 23, 2014

HTTPS certificate for Tomcat

Tomcat has its own documentation on how to do this. I simply followed it with Tomcat 7:

Generate a personal private key with Java keytool:
keytool -genkey -alias www.mysite.com -dname "cn=www.mysite.com, o=mysite, o=.com" -keysize 2048 -keyalg RSA
Generate a certificate signing request:
keytool -certreq -alias www.mysite.com -file www.mysite.com.csr
Submit the resulting request (as text) to the certificate authority. I did it with https://www.startssl.com/ for free. It produced a certificate (text) that I put into a www.mysite.com.signed.crt file.
Before using it, you need to import the authority's Root Certificate and Class 1 domain validation certificate into your keystore; otherwise the import of the reply will not be able to build the certificate chain. In my case it was:
wget http://www.startssl.com/certs/ca.crt
keytool -import -trustcacerts -alias startcom.ca -file ca.crt
wget https://startssl.com/certs/sca.server1.crt
keytool -import -alias startcom.ca.sub -file sca.server1.crt
Finally, import the signed certificate into your keystore:
keytool -import -alias www.mysite.com -file www.mysite.com.signed.crt
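To check that the reply was chained to the CA certificates imported earlier, you can look at the entry with keytool -list -v. The sketch below is only an illustration: chain_length is a hypothetical helper (not part of keytool) that pulls the "Certificate chain length" value out of that listing, and it assumes the default keystore used by the commands above.

```shell
# Hypothetical helper: read `keytool -list -v` output on stdin and
# print the value of the first "Certificate chain length" line.
chain_length() {
    awk -F': ' '/Certificate chain length/ { print $2; exit }'
}

# Usage (keytool asks for the keystore password):
#   keytool -list -v -alias www.mysite.com | chain_length
```

A value greater than 1 suggests that the key entry now carries the CA certificates in addition to your own certificate.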
Now configure Tomcat via server.xml:
    <Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
               maxThreads="150" scheme="https" secure="true"
               clientAuth="false" sslProtocol="TLS"
               keystoreFile="/home/ubuntu/.keystore"
               keystorePass="PASSWORD"
               compression="on"
               compressionMinSize="2048"
               noCompressionUserAgents="gozilla, traviata"
               compressableMimeType="text/html,text/xml,text/javascript,text/css,image$
    />
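You may also want HTTP requests redirected to HTTPS. Set redirectPort on the HTTP connector; Tomcat uses it to redirect requests for resources that web.xml marks with a CONFIDENTIAL transport guarantee: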
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               URIEncoding="UTF-8"
               redirectPort="8443"
               compression="on"
               compressionMinSize="2048"
               noCompressionUserAgents="gozilla, traviata"
               compressableMimeType="text/html,text/xml,text/javascript,text/css,image$
    />
Also, iptables has to route both HTTP and HTTPS requests to the corresponding Tomcat ports:
#IPTABLES
# forward 80 to 8080 and 443 to 8443, then save tables
# ports 80 and 443 must be opened in AMAZON AWS UI Security Groups
sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 8080
sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 443 -j REDIRECT --to-port 8443
sudo iptables-save
sudo bash -c 'iptables-save > /etc/iptables/rules.v4'
sudo iptables  -nvL -t nat
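Running the -A commands above a second time appends duplicate rules. A possible way around that is to test for the rule first with iptables -C (check) and append only when it is missing. This is a sketch, not what I ran: redirect_port and the IPTABLES variable are hypothetical names, and eth0 is taken from the commands above.

```shell
# Command used to talk to iptables; override it (e.g. IPTABLES=echo) for a dry run.
IPTABLES="${IPTABLES:-sudo iptables}"

# Hypothetical helper: redirect incoming TCP port $1 to local port $2,
# appending the rule only if an identical one is not already present.
redirect_port() {
    rule="PREROUTING -i eth0 -p tcp --dport $1 -j REDIRECT --to-port $2"
    # -C exits with 0 when the rule exists; $rule is left unquoted on purpose
    # so that it splits into separate arguments.
    $IPTABLES -t nat -C $rule 2>/dev/null || $IPTABLES -t nat -A $rule
}

# Usage:
#   redirect_port 80 8080
#   redirect_port 443 8443
```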

Thursday, December 11, 2014

Apache Spark workers configuration

How do you guess the number of Workers per cluster node that gives the best performance? I ran a few tests on a 6-node cluster of 3.3GHz, 4-core, 16GB machines, using the mnist8m classification task. It turns out that the number of Workers per node should be smaller than the number of cores. Moreover, 1 Worker per node took the same time as more than one, which means that a single Worker takes advantage of multiple cores. Also, the number of data partitions should be >= the number of Workers, otherwise the data will be processed by fewer Workers.
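For a standalone cluster, the result above could translate into a conf/spark-env.sh along these lines. Treat it as a sketch: the 12g memory value and the NODES and MIN_PARTITIONS variables are my assumptions for illustration (Spark does not read the last two), while the SPARK_WORKER_* names are the standalone-mode settings.

```shell
# conf/spark-env.sh on every node (standalone mode)
export SPARK_WORKER_INSTANCES=1   # one Worker per node was as fast as several
export SPARK_WORKER_CORES=4       # let that single Worker use all 4 cores
export SPARK_WORKER_MEMORY=12g    # assumption: leave part of the 16GB to the OS

# Illustration only: lower bound for the number of data partitions,
# so that every Worker in the cluster gets something to process.
NODES=6
MIN_PARTITIONS=$((NODES * SPARK_WORKER_INSTANCES))
echo "use at least $MIN_PARTITIONS partitions"
```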

Rule of thumb for classification

There are quite a few machine learning classifiers. It is usually hard to say which is better until each one is tried on the given data and its performance is measured. However, there are a few rules of thumb:
  • A linear classifier works better when:
    • The data is sparse (a lot of zeroes in the feature vector)
    • Feature engineering or deep feature learning has been performed
    • The dataset is at most large (it still fits on one machine)
  • A non-linear or kernel-based classifier works better when:
    • There are only a few features (up to tens)
    • Big data: a lot of training examples
Bonus: how to manage an imbalanced training set:
  • Evaluation: area under the ROC or PR curve
  • Negative subsampling
  • Weights for imbalanced classes (also: regularization parameter)
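As an illustration of the class weights item, one common heuristic is to weight each class inversely to its frequency: weight = total / (number of classes * class count). This formula is my assumption, not the only option, and class_weights is a hypothetical helper for a two-class problem.

```shell
# Hypothetical helper: print inverse-frequency weights for a two-class problem.
# $1 = number of positive examples, $2 = number of negative examples.
class_weights() {
    awk -v pos="$1" -v neg="$2" 'BEGIN {
        total = pos + neg
        printf "positive weight: %.3f\n", total / (2 * pos)
        printf "negative weight: %.3f\n", total / (2 * neg)
    }'
}

# Usage with made-up counts (100 positives, 900 negatives):
#   class_weights 100 900
```

With these weights the rare class contributes as much to the loss in total as the frequent one.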