ES + Hadoop = primary datastore?

Luca_Rondanini · July 13, 2015, 1:34am

Hi all,

Let me start by apologizing. This is a very common question, but I'm not very familiar with Hadoop.

I'm starting work on a project that requires very reliable data storage. I love ES and would choose it over MongoDB for any project, but with such strict requirements I feel forced to use an RDBMS.

Could ES+Hadoop be a viable alternative? My understanding is that HDFS is pretty reliable, close to an RDBMS (but slow for "real-time" applications). Are there security features in this project that guarantee data synchronization across the two systems (ES and Hadoop)?

Thanks

costin · July 14, 2015, 6:07am

HDFS and Hadoop make sense when you have a LOT of data (think many gigabytes or terabytes), typically in raw format.
They can definitely work on a smaller data set as well, but they do bring some operational and logistical overhead.
Whether Elasticsearch and Hadoop make sense for you is for you to decide. The connector makes it easy to index data from Hadoop into Elasticsearch so it can be queried and aggregated in real time.
However, there are two moving parts (Hadoop and Elasticsearch) and they need to be kept in sync (which does work through the connector). Some are perfectly fine with this; others might consider it too heavy for their use case.
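
As a rough illustration of what indexing through the connector can look like, here is a minimal sketch assuming PySpark with the elasticsearch-hadoop jar on the classpath. The helper names, host, index/type and field names are hypothetical; `es.nodes`, `es.resource` and `es.mapping.id` are connector settings. Setting a document ID means re-running the job overwrites documents instead of duplicating them, which helps keep the two sides aligned.

```python
def es_write_conf(nodes, resource, id_field=None):
    """Build the connector settings for writing to Elasticsearch.

    nodes:    Elasticsearch host(s), e.g. "es-host:9200" (hypothetical)
    resource: target "index/type", e.g. "logs/event" (hypothetical)
    id_field: optional document field to use as the Elasticsearch _id
    """
    conf = {"es.nodes": nodes, "es.resource": resource}
    if id_field is not None:
        # With a fixed ID, re-indexing the same record replaces the old copy.
        conf["es.mapping.id"] = id_field
    return conf


def index_rdd(rdd, conf):
    """Write an RDD of (key, dict) pairs to Elasticsearch.

    Assumes `rdd` is a PySpark RDD and the elasticsearch-hadoop jar
    is available to the Spark job; the key of each pair is ignored.
    """
    rdd.saveAsNewAPIHadoopFile(
        path="-",  # unused by the connector, but required by the API
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )


conf = es_write_conf("es-host:9200", "logs/event", id_field="event_id")
```

With a Spark context in hand, something like `index_rdd(sc.parallelize(records).map(lambda d: (None, d)), conf)` would push the records, where `records` is a list of dicts read from HDFS.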

In the end, it's up to you; do a quick POC to validate your requirements and see how it fits your use case.

Luca_Rondanini · July 15, 2015, 10:21am

Thanks a lot Costin,

I have that kind of data and I have resources to deal with operational and logistic overhead.

The only thing I cannot afford is data loss. If HDFS and Hadoop can guarantee RDBMS levels of "security", ES is "eventually consistent", and the two are kept aligned with each other, this is the system for me!

Can you confirm this?

costin · July 15, 2015, 7:58pm

I can only speak for ES; it can easily store your data, however we don't advertise ES as a storage solution, simply because that's not the product's main goal/feature but rather a side effect. There are plenty of folks that use ES as their primary source of data and they're happy with it. Others use ES as a secondary source; so in your case the data can sit in HDFS and be made accessible/analyzed through ES.
In general, to be safe, we advise folks to at least keep a backup of the data (in ES this is easy to do through the snapshot/restore API).
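
To give an idea of the snapshot/restore flow, here is a minimal sketch using only the Python standard library. The repository name, snapshot name, backup location and function names are made up for illustration; the `_snapshot` endpoints are the ones the API exposes. It assumes a shared filesystem repository (type `fs`) whose location the cluster is allowed to write to.

```python
import json
import urllib.request


def build_snapshot_requests(base_url, repo, snapshot, location):
    """Return (method, url, body) tuples for the three basic steps:
    register a repository, take a snapshot, restore that snapshot."""
    repo_url = f"{base_url}/_snapshot/{repo}"
    snap_url = f"{repo_url}/{snapshot}"
    return [
        # 1. Register a shared-filesystem repository.
        ("PUT", repo_url, {"type": "fs", "settings": {"location": location}}),
        # 2. Snapshot all indices and block until it finishes.
        ("PUT", f"{snap_url}?wait_for_completion=true", None),
        # 3. Restore the snapshot (target indices must be closed or absent).
        ("POST", f"{snap_url}/_restore", None),
    ]


def send(method, url, body=None):
    """Send one request to Elasticsearch and return the decoded JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


requests_to_send = build_snapshot_requests(
    "http://localhost:9200", "my_backup", "snapshot_1", "/mnt/es_backups"
)
```

Against a running cluster, calling `send(*requests_to_send[0])` and then `send(*requests_to_send[1])` would register the repository and take the backup; the restore step is only needed when recovering.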

Luca_Rondanini · July 15, 2015, 8:16pm

OK, thanks Costin. That was my assumption about ES. I'll do my homework on Hadoop and decide.

Thanks a lot
