How to connect Elasticsearch to Apache Spark Streaming or Apache Storm?

Hi folks,

You can find my question on Stack Overflow:

PS: I had some trouble uploading my pictures here, which is why I only posted the Stack Overflow URL.



Any ideas?

Thank you in advance.


Elasticsearch is not really capable of streaming data the way Kafka does. The es-hadoop connector just makes it possible to ship bulk data between Elasticsearch and other Hadoop ecosystem technologies. Because of this, it is unwise to expect that reading from Elasticsearch with Spark Streaming or Storm will behave like reading an event stream from Kafka. The connector uses the scroll API with an optionally provided query. Once it has read all of the data returned from the scroll, the data source is exhausted and the spout will idle.
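
To make the "exhausted" behaviour concrete, here is a minimal Python sketch of scroll-style reading. It does not call Elasticsearch or es-hadoop: `FakeScrollSource`, `page_size`, and the sample documents are hypothetical stand-ins that only mimic how a scroll pages through a fixed result set and then runs dry.

```python
from typing import Dict, Iterator, List


class FakeScrollSource:
    """Hypothetical in-memory stand-in for a scroll over a fixed result set."""

    def __init__(self, docs: List[Dict], page_size: int = 2) -> None:
        # A scroll reads from the result set as it was when the search started.
        self._snapshot = list(docs)
        self._page_size = page_size
        self._offset = 0

    def next_page(self) -> List[Dict]:
        """Return the next page, or an empty list once the scroll is exhausted."""
        page = self._snapshot[self._offset:self._offset + self._page_size]
        self._offset += len(page)
        return page


def read_all(source: FakeScrollSource) -> Iterator[Dict]:
    """Drain the scroll page by page; stop when a page comes back empty."""
    while True:
        page = source.next_page()
        if not page:
            return
        yield from page


index = [{"id": 1}, {"id": 2}, {"id": 3}]
source = FakeScrollSource(index)
seen = list(read_all(source))

# A document indexed after the scroll started is not part of the snapshot,
# so the source stays empty instead of emitting it like a stream would.
index.append({"id": 4})
late = source.next_page()
```

Unlike a Kafka consumer, nothing new ever arrives on this source; to see document 4 you would have to start a fresh read.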

In your case, it seems that you are looking for a kind of lambda style architecture. A possible way to approach this is to stream data out of Kafka into two places: Elasticsearch for serving up the raw data, and into an ML pipeline for creating enriched data. The enriched data could also then be pushed into Elasticsearch for serving to whatever applications or dashboards you may require. In the event that you need to perform a bulk rebuild of the enriched data, or if you must retrain your machine learning model, you have the entire raw data corpus available in Elasticsearch.
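
As a rough illustration of that fan-out, here is a small Python sketch. Everything in it is an assumption made for the example: the two lists stand in for a raw and an enriched Elasticsearch index, `score` is a placeholder for whatever model you actually run, and the event list stands in for a Kafka topic. No real client library is used.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]

raw_index: List[Event] = []       # stand-in for the raw-data index
enriched_index: List[Event] = []  # stand-in for the enriched-data index


def fan_out(events: Iterable[Event], sinks: List[Callable[[Event], None]]) -> int:
    """Send every event to every sink; return the number of events processed."""
    count = 0
    for event in events:
        for sink in sinks:
            sink(dict(event))  # copy so one sink cannot mutate another's input
        count += 1
    return count


def store_raw(event: Event) -> None:
    raw_index.append(event)


def score(event: Event) -> float:
    """Hypothetical model: a placeholder for the real ML step."""
    return float(len(str(event.get("message", ""))))


def enrich_and_store(event: Event) -> None:
    event["score"] = score(event)
    enriched_index.append(event)


def rebuild_enriched() -> None:
    """Bulk rebuild: recompute all enriched data from the raw corpus."""
    enriched_index.clear()
    for event in raw_index:
        enrich_and_store(dict(event))


events = [{"message": "hello"}, {"message": "hi"}]  # stand-in for a Kafka topic
processed = fan_out(events, [store_raw, enrich_and_store])
```

Calling `rebuild_enriched()` after changing `score` mirrors the retraining case: the raw copy is untouched, so the enriched copy can always be regenerated from it.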

Since you mention using streaming tools for machine learning, I assume you are looking to run model-based real-time analysis on your data. Depending on the machine learning approach you want to use, your mileage may vary with this strategy.

Hope this helps. As always with architectural advice, take it with a grain of salt!

P.S. - Cross-posting via SO is probably not the best way to interact with the forums. I would replicate the original question here (even if it's without pictures!) so that the entire conversation is available in one thread. Thanks!


Thank you very much James, your explanation is totally clear and well understood. It was exactly what I was looking for.

I have just one question regarding the processing part of a lambda architecture. Do you think Logstash is suitable for this kind of architecture, or should I pick another framework, for instance Apache Spark Streaming or Apache Storm?

Glad to hear!

Regarding Logstash - it very much depends on your processing needs. I would advise taking some time to look through what Logstash has to offer, or asking some questions on its forum (located here). It might also be worth taking a look at the recently introduced ingest node feature coming in Elasticsearch v5.0.0. That being said, I have a feeling that if you're looking for advanced machine learning capability, there are probably other options better suited to it.
