ES + Hadoop = primary datastore?


#1

Hi all,

Let me start by apologizing: this is a very common question, but I'm not very familiar with Hadoop.

I'm starting work on a project that requires very reliable data storage. I love ES and would choose it over MongoDB for any project, but with such strict requirements I feel forced to use an RDBMS.

Could ES+Hadoop be a viable alternative? My understanding is that HDFS is pretty reliable, close to an RDBMS (but slow for "real-time" applications). Are there features in this project that guarantee data synchronization across the two systems (ES and Hadoop)?

Thanks


(Costin Leau) #2

HDFS and Hadoop make sense when you have a LOT of data (think many gigabytes up to terabytes), typically in raw format.
They can definitely work on a smaller data set as well, but they do bring some operational and logistical overhead.
Whether Elasticsearch and Hadoop make sense for you is for you to decide. The connector makes it easy to index data from Hadoop into Elasticsearch so it can be queried and aggregated in real time.
However, there are two moving parts (Hadoop and Elasticsearch) and they need to be kept in sync (which does work through the connector). Some are perfectly fine with this; others might consider it too heavy for their use case.
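
To make the indexing step concrete, here is a minimal sketch of pushing a Spark DataFrame into Elasticsearch through the connector. It assumes PySpark with the elasticsearch-hadoop jar on the classpath; the helper names (`es_write_options`, `write_to_es`), the host, the index name and the id field are all hypothetical and only for illustration. Setting `es.mapping.id` should make a re-run overwrite documents with the same id instead of duplicating them, which is one way to keep the two sides aligned.

```python
from typing import Dict, Optional

# Data source name registered by the elasticsearch-hadoop Spark SQL integration.
ES_SPARK_FORMAT = "org.elasticsearch.spark.sql"


def es_write_options(nodes: str, index: str, port: int = 9200,
                     id_field: Optional[str] = None) -> Dict[str, str]:
    """Build connector settings for a write (helper name is hypothetical)."""
    options = {
        "es.nodes": nodes,       # Elasticsearch host(s) to connect to
        "es.port": str(port),    # HTTP port of the cluster
        "es.resource": index,    # target index
    }
    if id_field is not None:
        # Use a field of each row as the document id, so that writing the
        # same row again replaces the document rather than adding a copy.
        options["es.mapping.id"] = id_field
    return options


def write_to_es(df, options: Dict[str, str]) -> None:
    """Append a Spark DataFrame to Elasticsearch (needs Spark + connector jar)."""
    df.write.format(ES_SPARK_FORMAT).options(**options).mode("append").save()


# Example usage (assumes an existing DataFrame `events_df` read from HDFS):
# write_to_es(events_df, es_write_options("localhost", "events", id_field="event_id"))
```

Whether this is enough for your consistency requirements is exactly what the POC should tell you.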

In the end, it's up to you; do a quick POC to validate your requirements and see how it fits your use case.


#3

Thanks a lot Costin,

I have that kind of data, and I have the resources to deal with the operational and logistical overhead.

The only thing I cannot afford is data loss. If HDFS and Hadoop can guarantee RDBMS levels of "security", ES is "eventually consistent", and the two keep each other aligned, this is the system for me!

Can you confirm this?


(Costin Leau) #4

I can only speak for ES; it can easily store your data, however we don't advertise ES as a storage solution, simply because that's not the product's main goal/feature but rather a side effect. There are plenty of folks who use ES as their primary source of data and are happy with it. Others use ES as a secondary source; so in your case the data can sit in HDFS and be made accessible/analyzed through ES.
In general, to be safe, we advise folks to at least keep a backup of the data (in ES this is easy to do through the snapshot/restore API).
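
As a rough sketch of what the snapshot/restore calls look like over REST, the snippet below builds the three requests involved: registering a shared-filesystem repository, taking a snapshot, and restoring it. The base URL, repository name, snapshot name and path are placeholders, and the helper function names are hypothetical. It only constructs the requests; nothing is sent until you pass one to `urllib.request.urlopen`. Note that an `fs` repository location typically has to be whitelisted via `path.repo` in the node configuration.

```python
import json
from urllib import request


def register_repo_request(base_url: str, repo: str, location: str) -> request.Request:
    """PUT /_snapshot/<repo>: register a shared-filesystem repository."""
    body = json.dumps({"type": "fs", "settings": {"location": location}})
    return request.Request(
        f"{base_url}/_snapshot/{repo}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_snapshot_request(base_url: str, repo: str, snapshot: str) -> request.Request:
    """PUT /_snapshot/<repo>/<snapshot>: snapshot the cluster's indices."""
    # wait_for_completion makes the call return only once the snapshot is done.
    url = f"{base_url}/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    return request.Request(url, method="PUT")


def restore_snapshot_request(base_url: str, repo: str, snapshot: str) -> request.Request:
    """POST /_snapshot/<repo>/<snapshot>/_restore: restore from a snapshot."""
    return request.Request(f"{base_url}/_snapshot/{repo}/{snapshot}/_restore", method="POST")


# Example usage against a local node (placeholder names throughout):
# req = create_snapshot_request("http://localhost:9200", "my_backup", "snapshot_1")
# with request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```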


#5

ok, thanks Costin. That was my assumption about ES. I'll do my homework about Hadoop and decide.

Thanks a lot

