I am reading data from Elasticsearch using Spark (ES-Spark). After I get the data using sc.esRDD(".../..."), I want to store everything in HDFS, so I use the saveAsTextFile method, but it is very slow ...
Am I doing this the right way? It takes 15 min to save 11 GB.
Can ES-Hadoop be used to store data in HDFS, or is it only meant for writing to ES, or for reading from ES and running some queries?
There might be various reasons why saveAsTextFile takes a long time. Typically it is because the parallelism is low (there's only one task handling it) or because there's a large number of values (sometimes all of them) under the same key.
What does your RDD look like? Do you have any information on what Spark is doing during the wait? What's your hardware?
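To check the parallelism point, here is a minimal sketch, assuming a local Elasticsearch node and the elasticsearch-spark jar on the classpath. The index name `myindex/mytype`, the output path, the `es.nodes` address and the partition count of 64 are all hypothetical placeholders, not values from your setup. It prints how many partitions (and therefore save tasks) the RDD has, then repartitions before writing. Note that repartitioning triggers a shuffle, so it only helps if the low task count is really the bottleneck.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

object EsToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-to-hdfs")
      .set("es.nodes", "localhost:9200") // assumption: adjust to your cluster

    val sc = new SparkContext(conf)

    // Hypothetical index/type; esRDD yields (documentId, fieldMap) pairs.
    val rdd = sc.esRDD("myindex/mytype")

    // Each partition becomes one task in saveAsTextFile, so a very small
    // number here means very few tasks are doing all of the writing.
    println(s"partitions before: ${rdd.partitions.length}")

    // Assumption: 64 is an arbitrary illustrative value; tune it to the
    // number of cores available across your executors.
    val spread = rdd.repartition(64)
    println(s"partitions after: ${spread.partitions.length}")

    // Write one tab-separated line per document (hypothetical output path).
    spread
      .map { case (id, doc) => s"$id\t$doc" }
      .saveAsTextFile("hdfs:///tmp/es-dump")

    sc.stop()
  }
}
```

If the first number printed is 1 or close to it, the save is effectively running on a single task, which would match the slowness you describe. The Stages tab of the Spark UI shows the same thing while the job is running.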
As for es-hadoop, in a nutshell it's a connector between Elasticsearch and Hadoop so it likely fits the latter description.
es-hadoop itself doesn't store any state; rather, it helps data move between Elasticsearch and Hadoop.