Thursday, August 7, 2014

Text classification with Apache Spark 1.1 (sentiment classification)

A word to vector space model converter was recently implemented in Apache Spark MLlib, which makes it possible to perform text classification. Let's see how it works for sentiment analysis.
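
To make the idea of a word to vector space converter concrete, here is a minimal sketch of the hashing trick in plain Scala. It is an illustration only, not Spark's HashingTF code: the names HashingSketch, bucket and termFrequencies are hypothetical, and using String.hashCode as the hash function is an assumption.

```scala
// Minimal sketch of the hashing trick (assumption: String.hashCode as the hash).
object HashingSketch {
  // Map a term to a bucket index in [0, numFeatures).
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many tokens fall into each bucket (a sparse term-frequency vector).
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens.groupBy(bucket(_, numFeatures)).map { case (i, ts) => i -> ts.size.toDouble }
}

// Tokenize a sentence on spaces and hash it into 10000 buckets.
val tf = HashingSketch.termFrequencies("a good , good movie".split(" ").toSeq, 10000)
```

Different words can collide in the same bucket, so a larger number of features trades memory for fewer collisions.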

  • Download the Pang and Lee sentence polarity dataset v1.0 from http://www.cs.cornell.edu/people/pabo/movie-review-data/. It contains 5331 positive and 5331 negative processed sentences.
  • Clone and build a version of Apache Spark recent enough to contain the HashingTF and MulticlassMetrics classes.
  • Run the following code snippet, either in the Spark shell or as a separate application that depends on spark-core and mllib:
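        import org.apache.spark.SparkContext
        import org.apache.spark.mllib.classification.NaiveBayes
        import org.apache.spark.mllib.evaluation.MulticlassMetrics
        import org.apache.spark.mllib.feature.HashingTF
        import org.apache.spark.mllib.regression.LabeledPoint
        /* instantiate Spark context (not needed when running inside the Spark shell) */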
        val sc = new SparkContext("local", "test")
        /* word to vector space converter, hashes words into 10000 features */
        val htf = new HashingTF(10000)
        /* load positive and negative sentences from the dataset */
        /* label 1 = positive class, label 0 = negative class */
        /* tokenize sentences and transform them into vector space model */
        val positiveData = sc.textFile("/data/rt-polaritydata/rt-polarity.pos")
          .map { text => new LabeledPoint(1, htf.transform(text.split(" ")))}
        val negativeData = sc.textFile("/data/rt-polaritydata/rt-polarity.neg")
          .map { text => new LabeledPoint(0, htf.transform(text.split(" ")))}
        /* split the data 60% for training, 40% for testing */
        val posSplits = positiveData.randomSplit(Array(0.6, 0.4), seed = 11L)
        val negSplits = negativeData.randomSplit(Array(0.6, 0.4), seed = 11L)
        /* union train data with positive and negative sentences */
        val training = posSplits(0).union(negSplits(0))
        /* union test data with positive and negative sentences */
        val test = posSplits(1).union(negSplits(1))
        /* Multinomial Naive Bayesian classifier */
        val model = NaiveBayes.train(training)
        /* predict */
        val predictionAndLabels = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }
        /* metrics */
        val metrics = new MulticlassMetrics(predictionAndLabels)
        /* output F1-measure for all labels (0 and 1, negative and positive) */
        metrics.labels.foreach( l => println(metrics.fMeasure(l)))
  • I got an F1-measure of around 74% for both classes, as the output shows. Similar results can be observed in Weka.
        0.7377086668191173
        0.7351650888940199
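
For readers unfamiliar with the per-label F1-measure reported above, the following plain Scala sketch computes it from (prediction, label) pairs using the standard definition (harmonic mean of precision and recall). The function name f1 and the toy data are hypothetical, and it is an assumption that this matches what MulticlassMetrics.fMeasure returns with its default settings.

```scala
// Per-label F1 from (prediction, label) pairs; plain Scala, no Spark required.
def f1(pairs: Seq[(Double, Double)], label: Double): Double = {
  val tp = pairs.count { case (p, l) => p == label && l == label } // true positives
  val fp = pairs.count { case (p, l) => p == label && l != label } // false positives
  val fn = pairs.count { case (p, l) => p != label && l == label } // false negatives
  val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
  val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  if (precision + recall == 0.0) 0.0
  else 2 * precision * recall / (precision + recall)
}

// Made-up toy pairs for illustration, not results from the dataset.
val toy = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0))
val f1Positive = f1(toy, 1.0)
val f1Negative = f1(toy, 0.0)
```

Reporting the measure per label, as the snippet in the post does, shows whether the classifier is noticeably weaker on one class than on the other.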
