RDDs in Spark are immutable. Whenever you need to change an RDD, you produce a new one. When you need to change all the data in an RDD frequently, e.g. when running an iterative algorithm, you might not want to spend time and memory on creating a new structure on every iteration. However, when you cache the RDD, you can get a reference to the data inside and modify it in place. You need to make sure that the RDD stays in memory the whole time. The following is a hack around RDD immutability and is not recommended. Also, fault tolerance is lost, though you can force checkpointing.
// class that allows modification of its variable with the inc function
class Counter extends Serializable { var i: Int = 0; def inc: Unit = i += 1 }

// Approach 1: create an RDD with 5 instances of this class
val rdd = sc.parallelize(1 to 5, 5).map(x => new Counter())
// trying to apply modification
rdd.foreach(x => x.inc)
// modification did not work: all values are still zeroes
rdd.collect.foreach(x => println(x.i))

// Approach 2: create a cached RDD with 5 instances of the Counter class
val cachedRdd = sc.parallelize(1 to 5, 5).map(x => new Counter()).cache
// trying to apply modification
cachedRdd.foreach(x => x.inc)
// modification worked: all values are ones
cachedRdd.collect.foreach(x => println(x.i))
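If you do rely on this hack, one way to get some durability back is to force a checkpoint, as mentioned above. The following is a minimal sketch, assuming an existing SparkContext sc and the Counter class defined earlier; the checkpoint directory path is hypothetical. Note that Spark recommends marking an RDD for checkpointing before the first action on it, and any mutations applied after the checkpoint has been written still live only in the cached copy.

// Minimal checkpointing sketch (assumes `sc` and the Counter class above; the directory path is hypothetical)
sc.setCheckpointDir("/tmp/spark-checkpoints")
val checkpointedRdd = sc.parallelize(1 to 5, 5).map(x => new Counter()).cache
checkpointedRdd.checkpoint()           // mark the RDD for checkpointing
checkpointedRdd.count()                // first action caches the data and writes the checkpoint
checkpointedRdd.foreach(x => x.inc)    // in-place mutation of the cached objects, as before
checkpointedRdd.collect.foreach(x => println(x.i))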