ES and sparkStreaming

JoaquinS · April 4, 2016, 4:24pm

Hi, I have an ES cluster and a Cloudera cluster and I want to stream data (indices) from ES to Cloudera Kudu tables. Right now I'm using Logstash to extract the index data, then Kafka to carry it into Spark Streaming, and finally Spark writes the rows. But Logstash does not support streaming the index data, only batch. So my question is: is it possible to do that with this connector?

costin · April 5, 2016, 3:18pm

Not sure what you mean by this. Having fewer parts is ideal, so having Logstash read the data and send it to ES, or vice versa, would be the best solution in this case.
If you want to stream just the updates you can do so by splitting the data into dedicated indices - one per day/hour/12h and so on.
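To make the index-splitting idea concrete, here is a minimal sketch of how hourly index names could be derived and how a reader could list the hourly indices that are already complete. The `events` prefix, the `prefix-YYYY.MM.DD.HH` naming pattern, and both function names are assumptions for illustration, not something ES or Logstash mandates.

```python
from datetime import datetime, timedelta, timezone


def hourly_index(prefix: str, ts: datetime) -> str:
    """Name of the hourly index a document timestamped `ts` would go to.

    `ts` must be timezone-aware; it is normalised to UTC first.
    The naming pattern is an assumption made for this sketch.
    """
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d.%H}"


def closed_indices(prefix: str, start: datetime, end: datetime) -> list:
    """Hourly indices fully contained in [start, end), oldest first.

    Only whole hours are returned, so an index that may still be
    receiving writes is never handed to the reader.
    """
    cur = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    if cur < start:
        # start falls inside an hour: skip that partial hour
        cur += timedelta(hours=1)
    names = []
    while cur + timedelta(hours=1) <= end:
        names.append(hourly_index(prefix, cur))
        cur += timedelta(hours=1)
    return names
```

A job that remembers the last hour it processed can call `closed_indices` with that time and the current time, and read only the indices it has not seen yet.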
ES-Hadoop does support Spark; in fact it is the richest integration out there. With Spark Streaming one needs to be careful about the number of connections it opens: since Spark Streaming uses micro-batching, it creates many small jobs against ES which, in some cases, keep opening connections that are not closed in time, which in turn eats up all resources.
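As a rough sketch of reading from ES through ES-Hadoop in Spark: the snippet below builds the connector options for one time window and passes them to the Spark SQL data source. The host name, index name, and `@timestamp` field are assumptions; `es.nodes`, `es.port`, `es.resource`, and `es.query` are ES-Hadoop settings, and `org.elasticsearch.spark.sql` is its data source name. Reading one bounded window per job, rather than one query per micro-batch, is only one possible way to limit how many small jobs hit ES.

```python
import json


def es_read_options(nodes, index, ts_field, start_iso, end_iso, port=9200):
    """Build ES-Hadoop options that read one time window from one index.

    The range query limits the read to [start_iso, end_iso) on `ts_field`.
    Older ES versions expect "index/type" in es.resource.
    """
    query = {"query": {"range": {ts_field: {"gte": start_iso, "lt": end_iso}}}}
    return {
        "es.nodes": nodes,
        "es.port": str(port),
        "es.resource": index,
        "es.query": json.dumps(query),
    }


def read_window(spark, options):
    """Load the window as a DataFrame.

    `spark` is an existing SparkSession (or SQLContext) with the
    elasticsearch-spark jar on its classpath; nothing here starts Spark.
    """
    return spark.read.format("org.elasticsearch.spark.sql").options(**options).load()
```

Usage would look like `read_window(spark, es_read_options("es-host", "events-2016.04.05.14", "@timestamp", "2016-04-05T14:00:00Z", "2016-04-05T15:00:00Z"))`, with the resulting DataFrame then written to Kudu by whatever writer is already in place.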

Note that ES-Spark and Logstash both rely on ES in the end, which does not yet support a changelog-like feature (to check just the updates of an index).
