Load data into HDFS using ES-Spark

(Lucas Weissert) #1


I am reading data from Elasticsearch using Spark (ES-Spark). After I get the data using sc.esRDD(".../...") I want to store everything in HDFS, so I use the saveAsTextFile method, but it is very slow ...
Am I doing the right thing? It takes 15 minutes to save 11 GB.

Can ES-Hadoop be used to store data in HDFS, or is it only meant for writing into ES, or for reading data from ES and running some queries?

Best regards

(Costin Leau) #2

There might be various reasons why saveAsTextFile takes a long time. Typically it is because the parallelism is low (there's only one task handling the write) or because there's a large number of values (sometimes all of them) under the same key.
What does your RDD look like? Is there any information in Spark during the wait about what it is doing? What's your hardware?
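
To check the parallelism point, something along these lines could be run. It is only a rough sketch: the `es.nodes` address, the `myindex/mytype` resource, the output path, and the target of 32 partitions are all placeholders made up for illustration, not values from this thread. The connector typically derives its input partitions from the index's shards, so an index with few shards may end up with few write tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // brings sc.esRDD into scope

object InspectEsRdd {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val conf = new SparkConf()
      .setAppName("inspect-es-rdd")
      .set("es.nodes", "localhost:9200") // assumption: a local ES node
    val sc = new SparkContext(conf)

    // Hypothetical index/type; substitute the real resource.
    val rdd = sc.esRDD("myindex/mytype")

    // Number of partitions = number of tasks that will write output files.
    val current = rdd.partitions.length
    println(s"partitions before save: $current")

    // Assumption: 32 is an arbitrary target; tune it to the cluster.
    val target = 32
    val out = if (current < target) rdd.repartition(target) else rdd

    // Hypothetical output directory; it must not exist yet.
    out.saveAsTextFile("hdfs:///tmp/es-export")

    sc.stop()
  }
}
```

Note that `repartition` adds a shuffle, so it only pays off when the write really is limited by having too few tasks. The Spark UI shows how many tasks the save stage runs and whether one of them dominates the run time.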

As for es-hadoop, in a nutshell it's a connector between Elasticsearch and Hadoop so it likely fits the latter description.
es-hadoop itself doesn't store any state, rather it helps data move between Elastic and Hadoop.
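
As an illustration of moving data in both directions, here is a minimal sketch that exports documents as JSON lines to HDFS and then indexes them back. The node address, index names, and HDFS path are hypothetical; `esJsonRDD` and `saveJsonToEs` are the connector's JSON-oriented calls, and HDFS is reached through Spark's ordinary `saveAsTextFile` and `textFile`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // esJsonRDD on SparkContext, saveJsonToEs on RDD[String]

object EsHdfsRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-hdfs-round-trip")
      .set("es.nodes", "localhost:9200") // assumption: a local ES node
    val sc = new SparkContext(conf)

    // Hypothetical HDFS directory; it must not exist before the export.
    val dir = "hdfs:///tmp/es-export-json"

    // ES -> HDFS: esJsonRDD yields (document id, raw JSON source) pairs;
    // keep only the JSON so each output line is one document.
    sc.esJsonRDD("myindex/mytype") // hypothetical source index/type
      .map(_._2)
      .saveAsTextFile(dir)

    // HDFS -> ES: each line is already JSON, so index it as-is.
    sc.textFile(dir)
      .saveJsonToEs("myindex-copy/mytype") // hypothetical target index/type

    sc.stop()
  }
}
```

In both directions the files on HDFS are written and read by Spark itself; the connector only handles the Elasticsearch side.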
