Wednesday, April 29, 2015

Apache Spark Dataframe SQL vs RDD functions

Dataframe is a wrapper for RDD in Spark that can wrap RDD of case classes. One can run SQL queries with Dataframe, so it's convenient. Eventually, SQL should be translated into RDD functions. However, there are some differences. Lets sorting of 2 billion records, i.e. RDD "sortBy" vs Dataframe SQL "order by":

Create 2B rows of MyRecord within 2000 partitions, so each partition will have 1M of rows. (We should not have less partitions because then the number of rows per partition will be problematic for sorting on commodity node.)
import sqlContext.implicits._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
case class MyRecord(time: Double, id: String)
val rdd = sc.parallelize(1 to 200, 200).flatMap(x => 
Seq.fill(10000000)(MyRecord(util.Random.nextDouble, "xxx")))
Lets sort this RDD by time:
val sorted = rdd.sortBy(x => x.time)
result.count
It finished in about 8 minutes on my cluster of 8 nodes. Everything's fine. You can also check tasks that were completed in Spark web UI. The number of reducers was equal to the number of partitions, i.e. 2000

Lets convert the original RDD to Dataframe and sort again:
val df = sqlContext.createDataFrame(rdd)
df.registerTempTable("data")
val result = sqlContext.sql("select * from data order by time")
result.count
It will run for a while and then crash. If you check tasks in the Spark Web UI, you will see that some of them were cancelled due to lost executors (ExecutorLost) due to some strange Exceptions. It is really hard to trace back which executor was first to be lost. The other follow it as in house of cards. What's the problem? The number of reducers. For the first task it is equal to the number of partitions, i.e. 2000, but for the second it switched to 200. Why? It is the default value of
spark.sql.shuffle.partitions
So it is better to set an appropriate value before running SQL queries in Spark.

Thursday, April 23, 2015

Cross entropy and squared error and their derivatives

Cost function of target vs output might be expressed with cross entropy or with squared error. Output is generated by some function. It might be logistic or softmax. Depending on the combination of error and output function, there are different derivatives that are used for gradient computation. The following article gives very good explanation along with needed formulas: