ESHadoop - Hadoop vs Spark

(Pat Humphreys) #1

I have a Hadoop job running on AWS EMR that uses ES-Hadoop with Cascading. It does bulk inserts in batches of 10,000 records of about 4k each, roughly 300M records in total.
Would there be any speed benefit to using Spark instead?

(Jhendric98) #2

Without hearing more about your job, I can only relate my general experience. We've found that Spark reduces runtimes compared with traditional MapReduce when indexing into Elasticsearch. I am not an Elasticsearch expert, but it seems data locality may play a part. We previously used a custom jar loader in a YARN job to load data and have since replaced it with the ES-Hadoop Spark library.

(Costin Leau) #3

A big advantage that Spark SQL has over other libraries is that it allows push-down: in Spark SQL the operations being executed can be detected and thus pushed down to the data source by third-party plugins (like ES-Hadoop). This significantly reduces the amount of data that needs to be pulled in from ES.
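
To make the push-down idea concrete, here is a toy sketch in plain Python. It is not ES-Hadoop's actual code: the function names, the sample documents and the tiny filter format are all made up for illustration. The only real thing in it is the shape of the Elasticsearch bool/filter query that the filters are translated into. It compares how many documents would cross the network when filtering happens on the client with how many cross it when the filter is evaluated at the source.

```python
import operator

# Hypothetical mini filter format: (field, operator, value).
PY_OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge,
          "<": operator.lt, "<=": operator.le}
ES_RANGE_OPS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def to_es_query(filters):
    """Translate simple filters into an Elasticsearch bool/filter query body."""
    clauses = []
    for field, op, value in filters:
        if op == "=":
            clauses.append({"term": {field: value}})
        elif op in ES_RANGE_OPS:
            clauses.append({"range": {field: {ES_RANGE_OPS[op]: value}}})
        else:
            raise ValueError(f"cannot push down operator: {op}")
    return {"query": {"bool": {"filter": clauses}}}


def _matches(doc, filters):
    return all(PY_OPS[op](doc[field], value) for field, op, value in filters)


def without_pushdown(docs, filters):
    """Pull every document first, then filter on the client."""
    transferred = list(docs)
    kept = [d for d in transferred if _matches(d, filters)]
    return kept, len(transferred)


def with_pushdown(docs, filters):
    """Stand-in for the source applying the filter: only matches are transferred."""
    kept = [d for d in docs if _matches(d, filters)]
    return kept, len(kept)


# Made-up sample data, purely for illustration.
DOCS = [
    {"age": 25, "city": "Dublin"},
    {"age": 41, "city": "Dublin"},
    {"age": 35, "city": "Cork"},
    {"age": 52, "city": "Dublin"},
]
FILTERS = [("age", ">=", 30), ("city", "=", "Dublin")]

if __name__ == "__main__":
    print(to_es_query(FILTERS))
    print("transferred without push-down:", without_pushdown(DOCS, FILTERS)[1])
    print("transferred with push-down:", with_pushdown(DOCS, FILTERS)[1])
```

Both paths return the same rows; only the amount of data transferred differs. In the real Spark SQL data source, as far as I recall from the ES-Hadoop docs, this translation is controlled by the `pushdown` option and is on by default.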
