We have a Hadoop cluster in which a Spark job creates about 3 GB of data every hour, written to hour-partitioned directories in HDFS. This data needs to be moved to Elasticsearch as efficiently as possible. These are the options we are considering:
- In the existing Spark job that writes the data to HDFS, add another stage that writes the DataFrame to Elasticsearch using the ES-Hadoop connector. This avoids the extra disk I/O of writing to HDFS and then reading it back just to index into ES. On the other hand, it increases the execution time of the hourly Spark job.
- Have a second Spark job, triggered after the first one completes, push the data from HDFS to ES using the ES-Hadoop connector.
- Is there a way to use Logstash here? Hourly batch ingest via Spark and ES-Hadoop looks optimal from a performance point of view, but does Logstash support an HDFS input plugin? I didn't see one on the Elastic site. Even if it exists, can it scale well enough?
- To reduce data transfer over the wire, would it be better to co-locate the Elasticsearch cluster on the same nodes as the HDFS/Spark cluster?
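For reference, option 1 is roughly what we have in mind. The host and index names are placeholders, and the batch-size value is just something we'd tune, not a recommendation:

```python
# Sketch of option 1: in the existing hourly Spark job, write the same
# DataFrame to Elasticsearch via the ES-Hadoop connector right after the
# HDFS write. Hosts/index below are placeholders for our real values.

def es_write_conf(nodes, index, doc_id_field=None):
    """Build the ES-Hadoop connector options passed to DataFrame.write."""
    conf = {
        "es.nodes": nodes,                # comma-separated list of ES hosts
        "es.resource": index,             # target index
        "es.batch.size.entries": "5000",  # bulk request size, to be tuned
    }
    if doc_id_field:
        conf["es.mapping.id"] = doc_id_field  # use a column as the document _id
    return conf

# Inside the Spark job (not runnable here without a cluster):
#   df.write.format("org.elasticsearch.spark.sql") \
#       .options(**es_write_conf("es1:9200,es2:9200", "events-hourly")) \
#       .mode("append") \
#       .save()
```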
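For option 2, the follow-up job would need to locate the hour partition the first job just finished. A minimal sketch of that logic, assuming a `base/yyyy/MM/dd/HH` directory layout (ours may differ):

```python
from datetime import datetime, timedelta

def previous_hour(ts):
    """The partition completed by the first job is the previous full hour."""
    return ts.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)

def hourly_partition_path(base, ts):
    """HDFS path of the hour partition the loader job should read.
    The base/yyyy/MM/dd/HH layout is an assumption."""
    return f"{base}/{ts:%Y/%m/%d/%H}"

# e.g. a run starting at 2024-01-01 05:10 would read:
# hourly_partition_path("hdfs:///data/events",
#                       previous_hour(datetime(2024, 1, 1, 5, 10)))
# → "hdfs:///data/events/2024/01/01/04"
```

The loader would then read that path into a DataFrame and write it out with the same ES-Hadoop options as in the single-job variant.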
Any suggestions or pointers would be very helpful. Thanks a lot!