I have a lot of batch processes that write large batches of data to elastic at scheduled intervals. Currently we are writing to elastic using the es-hadoop library. From what I understand, when writing to an ECK instance you have to turn on es.nodes.wan.only, which slows the library down, and it is very important to us that elastic ingestion be as fast as possible.
We are considering other ways of writing from spark to elastic, namely logstash.
My question is: is there any feasible way to write large batches from spark to ECK through logstash, or is this not the intended use case? Are there better ways than es-hadoop to ingest large batches of data from spark to elastic, or is it the best solution for this use case?
Hi @krezno. I am not aware of any off-the-shelf way to write from spark to logstash. It might be worth trying out using es-hadoop to write to your ECK cluster (with es.nodes.wan.only set to true) to get a sense for how good or bad the performance really is though.
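If it helps, here is a minimal sketch of what that configuration could look like from Spark; the Kubernetes service name and the batch-size value below are assumptions for illustration, not recommendations:

```python
# Hypothetical es-hadoop options for writing a Spark DataFrame to an ECK
# cluster that is only reachable through a single Kubernetes service URL.
# Assumes pyspark plus the elasticsearch-spark connector jar on the classpath.
es_options = {
    "es.nodes": "quickstart-es-http.default.svc",  # hypothetical ECK service name
    "es.port": "9200",
    "es.nodes.wan.only": "true",       # disable node discovery; route all traffic via the service
    "es.batch.size.entries": "10000",  # bulk request size, worth tuning for throughput
}

# The actual write would then look something like:
# df.write.format("org.elasticsearch.spark.sql") \
#     .options(**es_options) \
#     .mode("append") \
#     .save("my-index")
```

Measuring ingest throughput with and without es.nodes.wan.only (against a directly reachable test cluster) would tell you how much the setting actually costs in your environment.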
I'm not really familiar with ECK and it's been a while since I used Kubernetes, but I assume the problem is that Elasticsearch is exposed as a service at a single URL, and node discovery does you no good since none of the discovered nodes are accessible. Is that right? Or are you running into other problems?
If you do try es-hadoop, keep in mind that it will see your whole Elasticsearch cluster as a single node, so your hadoop or spark jobs will fail if it gets "blacklisted" due to failures writing to it. We have this problem with customers using load balancers as well. You might want to list the same node several times as I described here.
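Concretely, that workaround amounts to repeating the same endpoint in es.nodes so that a single bulk failure doesn't exhaust the connector's node list (the service name here is again a made-up example):

```python
# Hypothetical sketch: behind a load balancer or Kubernetes service,
# es-hadoop sees exactly one "node", so one failure can blacklist it and
# fail the job. Listing the same endpoint several times gives the
# connector extra retry targets that all resolve to the same service.
endpoint = "quickstart-es-http.default.svc:9200"  # hypothetical ECK service
es_nodes = ",".join([endpoint] * 3)
# pass es_nodes as the "es.nodes" option of the Spark write
```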
My team is migrating from a standard elastic cluster to ECK, and as part of the process we are reevaluating how we perform our writes. For example, some other teams I've talked to that are familiar with Logstash and Apache Kafka as part of their stack have achieved nice results by writing from Spark to a Kafka topic, then reading from that topic with Logstash and writing to elastic. My team doesn't have experience with either of these products, and I also find it overcomplicated to add two more points of failure to our pipelines. However, we did historically have some hard-to-troubleshoot problems using es-hadoop, like elastic error messages that the library "swallows up" and makes less descriptive. I am interested in the possible alternatives so that we may improve this part of our pipelines if we have to make changes to it anyway.
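For the record, the Spark → Kafka → Logstash → elastic pattern those teams described would look roughly like this on the Logstash side; the broker, topic, and host names below are made up for illustration:

```
input {
  kafka {
    bootstrap_servers => "kafka:9092"   # hypothetical Kafka broker
    topics => ["spark-output"]          # hypothetical topic Spark writes to
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["https://quickstart-es-http.default.svc:9200"]  # hypothetical ECK service
    index => "my-index"
  }
}
```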
I will probably start with simply trying to use es-hadoop when testing our new environment and will definitely take note of the last paragraph of your reply.