Ingesting data from HDFS into Elasticsearch

We have a Hadoop cluster in which a Spark job writes about 3 GB of data every hour into hour-partitioned directories in HDFS. This data needs to be moved into Elasticsearch as efficiently as possible. These are the options we are considering:

  1. In the existing Spark job that writes the data to HDFS, add another stage that writes the DataFrame to Elasticsearch using the ES-Hadoop connector. This avoids an extra round of disk I/O (writing to HDFS and reading it back just to index into ES). On the other hand, it increases the execution time of the hourly Spark job.
  2. Have a second Spark job triggered after the first one completes, which pushes the data from HDFS to ES using the ES-Hadoop connector.
  3. Is there a way to use Logstash here? Hourly batch ingest via Spark and ES-Hadoop looks best from a performance point of view, but does Logstash have an HDFS input plugin? I didn't see one on the Elastic site. Even if one exists, can it scale well enough?
  4. To reduce data transfer over the wire, would it be better to co-locate the Elasticsearch cluster on the same nodes as the HDFS/Spark cluster?
Any suggestions or pointers would be very helpful. Thanks a lot!
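For what it's worth, option 1 amounts to one extra write stage at the end of the existing job. A minimal sketch, assuming the `elasticsearch-spark` artifact is on the classpath — the index name `events/doc`, host `es-node-1`, and the partition path are placeholders, not anything from the actual cluster:

```scala
// Sketch: add an ES write stage to the existing hourly Spark job.
// Requires the ES-Hadoop connector (elasticsearch-spark) on the classpath.
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // brings saveToEs into scope on DataFrame

val spark = SparkSession.builder()
  .appName("hourly-job-with-es-stage")
  .config("es.nodes", "es-node-1:9200")   // placeholder Elasticsearch endpoint
  .config("es.batch.size.bytes", "5mb")   // bulk request size; tune for throughput
  .getOrCreate()

// df is the same DataFrame the job already writes to HDFS
val df = spark.read.parquet("hdfs:///data/events/2016/01/01/00") // hypothetical hourly partition

df.write.parquet("hdfs:///data/events-out/2016/01/01/00") // existing HDFS output (placeholder path)
df.saveToEs("events/doc") // extra stage: bulk-indexes the DataFrame via the ES _bulk API
```

Caching the DataFrame before the two writes would avoid recomputing the lineage twice, at the cost of executor memory.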

I'd suggest taking a look at the spark-elasticsearch connector described here:

though @costin probably has more information for you.

Hope this helps,

@mainec, thanks for your response.
Yes, ES-Hadoop is the connector we are using in Spark.
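In case it helps others reading this thread, option 2 can be wired up as a small standalone job that takes the hour partition as an argument. This is only a sketch: the index name, host, and path layout are assumptions, and the connector settings shown are just the common tuning knobs:

```scala
// Sketch of option 2: a separate Spark job that reads the hour partition
// the first job wrote to HDFS and bulk-indexes it into Elasticsearch.
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

object HdfsToEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-to-es").getOrCreate()

    // e.g. hdfs:///data/events/2016/01/01/00 -- hypothetical partition layout
    val hourPath = args(0)
    val df = spark.read.parquet(hourPath)

    EsSparkSQL.saveToEs(df, "events/doc", Map(
      "es.nodes"              -> "es-node-1:9200", // placeholder ES endpoint
      "es.batch.size.entries" -> "5000"            // docs per bulk request; tune for the cluster
    ))

    spark.stop()
  }
}
```

Triggering it from a workflow scheduler (Oozie, cron, etc.) after the hourly job completes keeps the two concerns separate, at the cost of reading the data back from HDFS once.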

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.