Load data into HDFS using ES-Spark

Lucas_Weissert · May 6, 2015, 3:03pm

Hello,

I am reading data from Elasticsearch using Spark (ES-Spark). After I get the data using sc.esRDD(".../...") I want to store everything in HDFS, so I use the saveAsTextFile method, but it is very slow.
Am I doing the right thing? It takes 15 minutes to save 11 GB.

Can es-hadoop be used to store data in HDFS, or is it only meant for writing to ES, reading from ES, and running queries?

Best regards

costin · May 14, 2015, 6:22am

There might be various reasons why saveAsTextFile takes a long time. Typically it is because the parallelism is low (there's only one task handling it) or because there's a large number of values (sometimes all of them) under the same key.
What does your RDD look like? Is there any information from Spark about what it is doing during the wait? What's your hardware?
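
To check the parallelism point, here is a minimal sketch. It assumes a local Elasticsearch node, and the resource name `my-index/my-type`, the HDFS path, and the partition count of 64 are placeholders for illustration, not values from the original post:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

object EsToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-to-hdfs")
      .set("es.nodes", "localhost:9200") // assumption: ES runs locally
    val sc = new SparkContext(conf)

    // Each element is a (documentId, fields) pair.
    val rdd = sc.esRDD("my-index/my-type") // placeholder index/type

    // A count of 1 here would mean a single task does all the work.
    println(s"partitions: ${rdd.partitions.length}")

    // Spread the write over more tasks; 64 is an arbitrary example value.
    rdd.repartition(64).saveAsTextFile("hdfs:///tmp/es-dump") // placeholder path

    sc.stop()
  }
}
```

Note that `repartition` triggers a shuffle and only affects the write side. The read from Elasticsearch still runs with the original number of partitions, so if the read itself is the bottleneck this will not help.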

As for es-hadoop, in a nutshell it's a connector between Elasticsearch and Hadoop so it likely fits the latter description.
es-hadoop itself doesn't store any state; rather, it helps data move between Elasticsearch and Hadoop.
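
For the opposite direction (HDFS to Elasticsearch), a rough sketch under the same assumptions could look like this. The HDFS path, the `line` field name, and the target `my-index/my-type` are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

object HdfsToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("hdfs-to-es")
      .set("es.nodes", "localhost:9200") // assumption: ES runs locally
    val sc = new SparkContext(conf)

    // Spark reads the files from HDFS; es-hadoop is not involved in this step.
    val lines = sc.textFile("hdfs:///tmp/es-dump") // placeholder path

    // Wrap each line in a one-field document and index it through the connector.
    lines.map(line => Map("line" -> line)).saveToEs("my-index/my-type")

    sc.stop()
  }
}
```

In both directions the HDFS side is handled by Spark itself (`saveAsTextFile`, `textFile`), and the connector only covers the Elasticsearch side.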
