Ingesting data from HDFS to Elasticsearch

mkarthikswamy · January 13, 2017, 10:21am

Hi,
We have a Hadoop cluster in which a Spark job writes about 3 GB of data every hour into hour-partitioned directories in HDFS. This data needs to be moved to Elasticsearch as efficiently as possible. These are the options we are considering:

1. In the existing Spark job that writes the data to HDFS, add another stage that writes the DataFrame to Elasticsearch using the ES-Hadoop connector. This avoids the extra disk I/O of writing to HDFS and then reading the data back to store it in ES. On the other hand, it increases the execution time of the hourly Spark job.
2. Trigger a second Spark job after the first one completes, and have it push the data from HDFS to ES using the ES-Hadoop connector.
3. Use Logstash. An hourly batch ingest with Spark and ES-Hadoop looks best from a performance point of view, but does Logstash have an HDFS input plugin? I didn't see one on the Elastic site. Even if it does, can it scale well enough?
4. To reduce the data transferred over the wire, is it better to co-locate the Elasticsearch cluster on the HDFS/Spark cluster nodes?
Any suggestions or pointers would be very helpful. Thanks a lot!
MK
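
Options 1 and 2 above both come down to a DataFrame write through ES-Hadoop; they differ only in whether the DataFrame is still in memory or is read back from HDFS. A minimal PySpark sketch of option 2 follows. It is not the poster's actual job: the `yyyy/MM/dd/HH` directory layout, the Parquet file format, the helper names, and the batch sizes are all assumptions made for illustration.

```python
from datetime import datetime


def hourly_partition_path(base_dir, hour):
    """Build the HDFS directory for one hour.

    Assumption: partitions are laid out as <base_dir>/yyyy/MM/dd/HH.
    """
    return "{}/{}".format(base_dir.rstrip("/"), hour.strftime("%Y/%m/%d/%H"))


def es_write_options(nodes, resource, port="9200"):
    """Collect ES-Hadoop settings for a bulk write."""
    return {
        "es.nodes": nodes,        # Elasticsearch host(s) to connect to
        "es.port": port,
        "es.resource": resource,  # target index (index/type on older versions)
        # Bulk request sizing per Spark task; illustrative values to tune.
        "es.batch.size.bytes": "5mb",
        "es.batch.size.entries": "5000",
    }


def push_hour_to_es(spark, base_dir, hour, nodes, resource):
    """Read one hour partition from HDFS and append it to Elasticsearch.

    `spark` is an existing SparkSession that has the ES-Hadoop connector
    jar on its classpath.
    """
    path = hourly_partition_path(base_dir, hour)
    df = spark.read.parquet(path)  # assumption: the hourly output is Parquet
    (df.write
       .format("org.elasticsearch.spark.sql")
       .options(**es_write_options(nodes, resource))
       .mode("append")
       .save())
```

For option 1, the same `df.write` chain would be applied to the DataFrame the existing job already holds, skipping the `spark.read` step. Either way, each Spark task sends its own bulk requests, so the number of partitions in the DataFrame determines how many concurrent writers hit the ES cluster.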

mainec · January 18, 2017, 11:39am

I'd suggest taking a look at the spark-elasticsearch connector described here:

https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html

though @costin probably has more information for you.

Hope this helps,
Isabel

mkarthikswamy · January 18, 2017, 3:44pm

@mainec, thanks for your response.
Yes, es-hadoop is the connector we are using in Spark.

system · February 15, 2017, 3:44pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
