@costin We bumped up the executor memory and now it works on a standalone cluster, though it took almost 90 minutes to load all the data.
Now I'm running it on YARN to test and I'm getting a different error message:
...stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 10, mavencode.ca): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)
at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:429)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:617)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
Still pulling my hair out trying to figure out how to fix this.
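
From what I can tell, this exception comes from sun.nio.ch.FileChannelImpl.map, which can't memory-map more than Integer.MAX_VALUE bytes, so it usually means a single cached block/partition has grown past 2 GB. What I'm thinking of trying is repartitioning the RDD into enough partitions that each one stays well under that limit before persisting. Below is a minimal sketch of that idea, assuming the data is read with esRDD from elasticsearch-hadoop; the index name, node address, and partition count are just placeholders from my setup, not anything prescribed:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.elasticsearch.spark._

// Placeholder configuration: app name, es.nodes value and index/type are illustrative only.
val conf = new SparkConf()
  .setAppName("es-load-test")
  .set("es.nodes", "mavencode.ca")
val sc = new SparkContext(conf)

// Read from Elasticsearch, then increase the partition count so that no
// single partition exceeds the ~2 GB block limit when it gets cached.
val raw = sc.esRDD("myindex/mytype")        // placeholder index/type
val repartitioned = raw.repartition(2000)   // pick a count that keeps each partition well under 2 GB
repartitioned.persist(StorageLevel.MEMORY_AND_DISK_SER)

Not sure yet whether 2000 is the right number for our data volume; the point is just to shrink the per-partition size, since the stack trace fails while reading a cached block back from disk.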