ES and sparkStreaming


(Joaquín Silva) #1

Hi, I have an ES cluster and a Cloudera cluster, and I want to stream data (indices) from ES into Cloudera Kudu tables. Right now I use Logstash to extract the index data, Kafka to carry it into Spark Streaming, and finally Spark writes the rows. But Logstash does not support streaming the index data, only batch reads. So my question is: is it possible to do that with this connector?


(Costin Leau) #2

Not sure what you mean by this. Having fewer parts is ideal, so having Logstash read the data and send it to ES, or vice versa, would be the best solution in this case.
If you want to stream just the updates, you can do so by splitting the data into dedicated indices: one per day, per 12 hours, per hour and so on.
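To make the dedicated-indices idea concrete, here is a minimal sketch of how time-bucketed index names could be derived, so that a periodic job only reads the bucket that has just closed. The `events` prefix, the `prefix-YYYY.MM.DD[.HH]` naming pattern and the helper names are assumptions for illustration, not something prescribed by ES or ES-Hadoop.

```python
from datetime import datetime, timedelta, timezone

# Bucket sizes in hours (assumed choices mirroring "day / 12h / hour").
BUCKET_HOURS = {"day": 24, "12h": 12, "hour": 1}


def bucket_start(ts: datetime, granularity: str) -> datetime:
    """Round a timezone-aware timestamp down to the start of its bucket (UTC)."""
    hours = BUCKET_HOURS[granularity]
    ts = ts.astimezone(timezone.utc)
    floored_hour = (ts.hour // hours) * hours
    return ts.replace(hour=floored_hour, minute=0, second=0, microsecond=0)


def index_name(prefix: str, ts: datetime, granularity: str) -> str:
    """Name of the index that holds documents for the bucket containing ts."""
    start = bucket_start(ts, granularity)
    if granularity == "day":
        return f"{prefix}-{start:%Y.%m.%d}"
    # Sub-daily buckets carry the starting hour as a suffix.
    return f"{prefix}-{start:%Y.%m.%d.%H}"


def last_closed_index(prefix: str, now: datetime, granularity: str) -> str:
    """Name of the most recent bucket that no longer receives writes."""
    previous = bucket_start(now, granularity) - timedelta(hours=BUCKET_HOURS[granularity])
    return index_name(prefix, previous, granularity)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("writing to:", index_name("events", now, "hour"))
    print("safe to export:", last_closed_index("events", now, "hour"))
```

The writer side would index into `index_name(...)`, while the export job reads `last_closed_index(...)` once per bucket, so each document is picked up exactly once without needing a changelog.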
ES-Hadoop does support Spark; in fact, it is the richest integration out there. With Spark Streaming, one needs to be careful about the number of connections it opens. Because Spark Streaming uses micro-batching, it creates many small jobs against ES which, in some cases, keep opening connections that are not closed in time, and these in turn eat up all resources.
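As a rough illustration of reading an index through ES-Hadoop from PySpark, the sketch below does a plain batch read of a single index rather than a micro-batch stream, which sidesteps the connection churn described above. It assumes the elasticsearch-spark connector jar is on the Spark classpath; the host name `es-host`, the index name and the helper names are hypothetical.

```python
def es_read_options(nodes: str, port: int = 9200, scroll_size: int = 1000) -> dict:
    """Build ES-Hadoop connector settings for a read (all values are strings)."""
    return {
        "es.nodes": nodes,                   # comma-separated ES hosts
        "es.port": str(port),                # HTTP port of the ES nodes
        "es.scroll.size": str(scroll_size),  # documents fetched per scroll request
    }


def read_index(spark, index: str, options: dict):
    """Load one ES index as a DataFrame via the ES-Hadoop Spark SQL data source."""
    return (
        spark.read.format("org.elasticsearch.spark.sql")
        .options(**options)
        .load(index)
    )


if __name__ == "__main__":
    # Needs pyspark plus the elasticsearch-spark jar, e.g. supplied via
    # spark-submit --jars; host and index below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-export").getOrCreate()
    df = read_index(spark, "events-2024.01.15", es_read_options("es-host"))
    df.printSchema()
```

From there the DataFrame can be written to the target store with whatever writer that store provides; that part is left out here because it depends on the Kudu setup.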

Note that ES-Spark and Logstash both rely on ES in the end, which does not yet support a changelog-like feature (to check just the updates of an index).

