A converter from words to a vector space model was recently implemented in Apache Spark MLlib, so it is now possible to perform text classification with it. Let's see how it works on sentiment analysis.
- Download the Pang and Lee sentence polarity dataset v1.0 from http://www.cs.cornell.edu/people/pabo/movie-review-data/. It contains 5331 positive and 5331 negative processed sentences.
- Clone and install the latest version of Apache Spark that contains the HashingTF and MulticlassMetrics classes.
- Run the code snippet below. It can be executed in the Spark shell or as a separate application that depends on spark-core and mllib:
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.evaluation.MulticlassMetrics
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    /* instantiate Spark context (not needed when running inside the Spark shell) */
    val sc = new SparkContext("local", "test")

    /* word to vector space converter, limit to 10000 words */
    val htf = new HashingTF(10000)

    /* load positive and negative sentences from the dataset */
    /* let 1 - positive class, 0 - negative class */
    /* tokenize sentences and transform them into vector space model */
    val positiveData = sc.textFile("/data/rt-polaritydata/rt-polarity.pos")
      .map { text => new LabeledPoint(1, htf.transform(text.split(" "))) }
    val negativeData = sc.textFile("/data/rt-polaritydata/rt-polarity.neg")
      .map { text => new LabeledPoint(0, htf.transform(text.split(" "))) }

    /* split the data 60% for training, 40% for testing */
    val posSplits = positiveData.randomSplit(Array(0.6, 0.4), seed = 11L)
    val negSplits = negativeData.randomSplit(Array(0.6, 0.4), seed = 11L)

    /* union train data with positive and negative sentences */
    val training = posSplits(0).union(negSplits(0))
    /* union test data with positive and negative sentences */
    val test = posSplits(1).union(negSplits(1))

    /* Multinomial Naive Bayesian classifier */
    val model = NaiveBayes.train(training)

    /* predict: pair each predicted label with the true label */
    val predictionAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    /* metrics */
    val metrics = new MulticlassMetrics(predictionAndLabels)
    /* output F1-measure for all labels (0 and 1, negative and positive) */
    metrics.labels.foreach(l => println(metrics.fMeasure(l)))
- I got an F1-measure of around 74% for both classes. Similar results can be observed in Weka.
    0.7377086668191173
    0.7351650888940199
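To give a feel for what a hashing-based word to vector space converter does, here is a minimal sketch of the hashing trick in plain Scala, with no Spark dependency. It is only an illustration: the names `HashingSketch`, `nonNegativeMod` and `termFrequencies` are made up for this example, the use of `String.hashCode` is an assumption, and the actual HashingTF implementation may hash terms differently depending on the Spark version.

```scala
// Simplified illustration of the hashing trick; this is NOT Spark's code.
// All names here are hypothetical and exist only for this example.
object HashingSketch {
  // Map an arbitrary (possibly negative) hash code into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Turn a tokenized sentence into a sparse term-frequency map:
  // feature index -> number of tokens hashed to that index.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens
      .groupBy(token => nonNegativeMod(token.hashCode, numFeatures))
      .map { case (index, group) => index -> group.size.toDouble }
}
```

Calling `HashingSketch.termFrequencies("a good good film".split(" ").toSeq, 10000)` yields a map whose counts add up to the number of tokens. Different words may collide on the same index, which is the price paid for not having to build a dictionary first.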
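For readers who want to check such numbers by hand, the per-label F1-measure can be sketched in plain Scala from a local collection of (prediction, label) pairs. This follows the usual definition (the harmonic mean of precision and recall). The object `F1Sketch` and its method `f1` are hypothetical names for this example, and returning 0.0 when a denominator is zero is my own assumption, not necessarily what MulticlassMetrics does.

```scala
// Per-label F1 from (prediction, label) pairs, using the standard definition.
// F1Sketch and f1 are hypothetical names used only for illustration.
object F1Sketch {
  def f1(pairs: Seq[(Double, Double)], label: Double): Double = {
    // true positives: predicted this label and it was correct
    val tp = pairs.count { case (p, l) => p == label && l == label }
    // false positives: predicted this label but the truth was different
    val fp = pairs.count { case (p, l) => p == label && l != label }
    // false negatives: the truth was this label but something else was predicted
    val fn = pairs.count { case (p, l) => p != label && l == label }

    // Assumption: treat undefined precision/recall as 0.0
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)

    if (precision + recall == 0.0) 0.0
    else 2 * precision * recall / (precision + recall)
  }
}
```

With the Spark code, something like `F1Sketch.f1(predictionAndLabels.collect().toSeq, 1.0)` should be comparable to `metrics.fMeasure(1.0)`, provided the test set is small enough to collect to the driver.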
Thanks a lot, it helped me ...