Thursday, July 3, 2014

How to create a Scala project for Apache Spark

  • Install Maven
  • Generate a Maven project with
  • mvn archetype:generate -B \
      -DarchetypeGroupId=net.alchim31.maven -DarchetypeArtifactId=scala-archetype-simple -DarchetypeVersion=1.5 \
      -DgroupId=org.apache.spark -DartifactId=spark-myNewProject -Dversion=0.1-SNAPSHOT -Dpackage=org.apache.spark
  • or interactively with
  • mvn archetype:generate
  • Add the spark-core dependency to your pom.xml (check the latest version at http://search.maven.org/); a minimal program that uses it is sketched after this list
  •       <dependency>
              <groupId>org.apache.spark</groupId>
              <artifactId>spark-core_2.10</artifactId>
              <version>1.0.2</version>
          </dependency>
    
  • Install IntelliJ IDEA and the Scala plugin for it (alternatively, Eclipse with the Scala plugin, though that didn't work for me)
  • Open the pom.xml file in IntelliJ IDEA
  • Important:
    • Be careful when adding libraries to your project: Spark already ships some of them in different versions, and a version conflict will produce java.lang.IncompatibleClassChangeError: Implementing class
    • If you want to create a SparkContext and perform RDD operations on Windows, there is a known issue with Hadoop's winutils.exe: you need to download it and point HADOOP_HOME at the directory containing it (see the sketches after this list).
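
To check that the project is wired up correctly, you can run a small driver program straight from the IDE. This is only a sketch, not part of the original setup: the object name, the local[2] master and the sample data are illustrative, and it assumes nothing beyond the spark-core_2.10 dependency added above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // implicit pair-RDD functions (needed before Spark 1.3)

    object WordCount {
      def main(args: Array[String]): Unit = {
        // "local[2]" runs Spark inside the IDE with two threads, so no cluster is needed
        val conf = new SparkConf().setAppName("spark-myNewProject").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // a tiny RDD operation, just to prove that spark-core is on the classpath
        val counts = sc.parallelize(Seq("hello spark", "hello scala"))
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.collect().foreach(println)

        sc.stop()
      }
    }

Running it with "Run" in IntelliJ IDEA should print the word counts to the console.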
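
On Windows, the usual workaround for the winutils issue is to set HADOOP_HOME as an environment variable, or equivalently to set the hadoop.home.dir system property from code before the SparkContext is created. The path below is an assumption; point it at the directory whose bin folder contains winutils.exe. For example, added at the top of main in the sketch above:

    // goes before the SparkContext is created
    // assumed layout: C:\hadoop\bin\winutils.exe -- adjust the path to your machine;
    // setting this system property has the same effect as exporting HADOOP_HOME
    System.setProperty("hadoop.home.dir", "C:\\hadoop")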