How to connect Elasticsearch to Apache Spark Streaming or Apache Storm?

sen · July 8, 2016, 4:42pm

Hi folks,

You can find my question on Stack Overflow: http://stackoverflow.com/questions/38271713/how-to-connect-elasticsearch-to-apache-spark-streaming-or-storm

PS: I had some trouble uploading my pictures here, which is why I only posted the Stack Overflow URL.

Regards,

S

sen · July 10, 2016, 2:43pm

Any ideas?

Thank you in advance.

james.baiera · July 11, 2016, 3:12pm

Hello!

Elasticsearch is not really capable of streaming data the way Kafka does. The es-hadoop connector just makes it possible to ship bulk data between Elasticsearch and other Hadoop ecosystem technologies. Because of this, it is unwise to expect that reading from Elasticsearch with Spark Streaming or Storm will behave like reading an event stream from Kafka. The connector uses the scroll API with an optionally provided query. Once it has read all of the data returned by the scroll, the data source is exhausted and the spout will idle.
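To make the "exhausted data source" point concrete, here is a minimal sketch in plain Python. It does not use es-hadoop or any Elasticsearch client: `scroll_read` is a hypothetical stand-in that only mimics how a scroll pages over a snapshot of the index and then stops, and the in-memory `index` list is an assumption for illustration.

```python
from typing import Dict, Iterator, List

Doc = Dict[str, int]


def scroll_read(index: List[Doc], page_size: int = 2) -> Iterator[List[Doc]]:
    """Hypothetical stand-in for a scroll: page over a snapshot, then stop."""
    # A scroll reflects the index as it was when the first request was made.
    snapshot = list(index)
    for start in range(0, len(snapshot), page_size):
        yield snapshot[start:start + page_size]


# In-memory list standing in for an Elasticsearch index (assumption).
index: List[Doc] = [{"id": 1}, {"id": 2}, {"id": 3}]

reader = scroll_read(index)
first_page = next(reader)          # the scroll starts here

index.append({"id": 4})            # a document indexed after the scroll began

remaining = [doc for page in reader for doc in page]
seen_ids = [doc["id"] for doc in first_page + remaining]

# The reader is now exhausted; it will never deliver document 4.
exhausted = next(reader, None) is None
```

Unlike a Kafka consumer, which keeps receiving new events, the reader here finishes after the snapshot is drained, which is roughly why a spout backed by a scroll ends up idle.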

In your case, it seems that you are looking for a kind of lambda style architecture. A possible way to approach this is to stream data out of Kafka into two places: Elasticsearch for serving up the raw data, and into an ML pipeline for creating enriched data. The enriched data could also then be pushed into Elasticsearch for serving to whatever applications or dashboards you may require. In the event that you need to perform a bulk rebuild of the enriched data, or if you must retrain your machine learning model, you have the entire raw data corpus available in Elasticsearch.
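As a rough sketch of that fan-out, the plain Python below routes each event to a raw store and through an enrichment step, then rebuilds the enriched data from the raw store with a different model. Everything here is a stand-in: the lists play the role of Elasticsearch indices, the `stream` list plays the role of a Kafka topic, and `length_model` / `word_model` are hypothetical placeholders for a real ML pipeline.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]
Model = Callable[[Event], Event]


def length_model(event: Event) -> Event:
    # Hypothetical "model": scores an event by message length.
    return {**event, "score": len(str(event["message"]))}


def word_model(event: Event) -> Event:
    # Hypothetical retrained "model": scores an event by word count.
    return {**event, "score": len(str(event["message"]).split())}


def fan_out(stream: Iterable[Event], raw_index: List[Event],
            enriched_index: List[Event], model: Model) -> None:
    """Send every event to the raw store and to the enrichment pipeline."""
    for event in stream:
        raw_index.append(event)               # raw data, served as-is
        enriched_index.append(model(event))   # enriched data for dashboards


def rebuild_enriched(raw_index: List[Event], model: Model) -> List[Event]:
    """Bulk rebuild of the enriched data from the full raw corpus."""
    return [model(event) for event in raw_index]


# Lists standing in for a Kafka topic and two Elasticsearch indices.
stream: List[Event] = [{"message": "login"}, {"message": "checkout"}]
raw_index: List[Event] = []
enriched_index: List[Event] = []

fan_out(stream, raw_index, enriched_index, length_model)

# After retraining, replay the raw corpus through the new model.
rebuilt = rebuild_enriched(raw_index, word_model)
```

The point of keeping the raw copy is visible in the last line: the rebuild needs only the raw store and the new model, not a replay of the original stream.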

Since you mention using streaming tools for machine learning, I assume you are looking to leverage model-based real time analysis on data. Depending on the machine learning approach you're looking to execute, your mileage may vary with this strategy.

Hope this helps. As always with architectural advice, take it with a grain of salt!

P.S. - Cross posting using SO is probably not the best way to interact with the forums. I would replicate the original question here (even if it's without pictures!) so that the entire conversation is available in one thread. Thanks!

sen · July 11, 2016, 3:39pm

Thank you very much James, your explanation is totally clear and well understood. It was exactly what I was looking for.

I have just one question regarding the processing part of a lambda architecture. Do you think that Logstash is suitable for this kind of architecture, or should I pick another framework, for instance Apache Spark Streaming or Apache Storm?

james.baiera · July 11, 2016, 3:50pm

Glad to hear!

Regarding Logstash - it very much depends on your processing needs. I would advise taking some time to look through what Logstash has to offer, or asking some questions on its forum. It might also be worth taking a look at the recently introduced ingest node feature coming in Elasticsearch v5.0.0. That being said, I have a feeling that if you're looking for advanced machine learning capability, there are probably other options better suited to it.
