Elasticsearch with Hadoop

The gateway functionality itself is deprecated in favor of snapshot/restore, a new feature that will be part of the
1.0.x line.
In elasticsearch-hadoop, a native HDFS implementation of snapshot/restore is on the roadmap [1] and planned for the
M2 release of 1.3.0.
This will allow HDFS to be used as a long-term backing store for ES.
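
To give a rough idea of what registering such a backing store could look like, here is a minimal sketch. It assumes the HDFS repository is registered through the regular snapshot REST endpoint (PUT /_snapshot/<name>) with type "hdfs" and "uri"/"path" settings, as the HDFS repository plugin later exposed them; the M2 implementation may differ. The host, repository name, namenode address and path are all made up for illustration, and nothing is actually sent.

```python
import json
import urllib.request


def build_hdfs_repository_request(es_url, repo_name, hdfs_uri, hdfs_path):
    """Build (but do not send) a request registering an HDFS snapshot repository.

    Assumption: repository type "hdfs" with "uri" and "path" settings.
    """
    body = {
        "type": "hdfs",
        "settings": {
            "uri": hdfs_uri,    # hypothetical namenode address
            "path": hdfs_path,  # hypothetical directory for snapshot files
        },
    }
    return urllib.request.Request(
        url=f"{es_url}/_snapshot/{repo_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_hdfs_repository_request(
    "http://localhost:9200",
    "hdfs_backup",
    "hdfs://namenode:8020",
    "/elasticsearch/snapshots",
)
# urllib.request.urlopen(request) would send it; it is left out here so the
# sketch runs without a cluster.
print(request.get_method(), request.full_url)
```

The point is that the index keeps living on the local FS; only the snapshots go to HDFS.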

Storing indexes is a slightly different story - one can do so now by simply mapping HDFS as a local file-system (through
NFS or other vendor-specific means). Additionally, we're looking into potentially using HDFS as a store itself.

A local FS will always be faster than HDFS for several reasons - HDFS is a distributed file-system, meaning data written
to it is distributed across the cluster. This means not only that it's remote (as opposed to a local fs) but also spread
across multiple machines.
Furthermore, HDFS itself is not as efficient as a local file-system (think of the OS page cache, etc.) - this can have
huge performance implications, especially when accessing data right after writing it: with a local fs, the OS
kicks in transparently through its page cache; with HDFS it does not.
There is also a redundancy in functionality: HDFS (out of the box) keeps data replicated (to prevent data
loss in case of node failure), but this functionality is already available in ES as well.
In other words, HDFS has a direct impact on performance and thus on the near-real-time (NRT) aspect of Elasticsearch.
A delay of 1-2 seconds might be okay in a batch system, but not in an NRT system.
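
To put a number on the redundancy point, here is a back-of-the-envelope sketch. It assumes the stock defaults of one replica per shard in ES and a block replication factor of 3 in HDFS; both are configurable, and the function name is made up for illustration.

```python
def physical_copies(es_replicas: int, hdfs_replication: int) -> int:
    """Number of on-disk copies of each shard's data across the cluster."""
    es_copies = 1 + es_replicas  # the primary plus its replicas
    # Every ES copy is itself replicated by the underlying file-system.
    return es_copies * hdfs_replication


# Local FS: no replication below ES, so the factor is 1.
local_fs = physical_copies(es_replicas=1, hdfs_replication=1)
# HDFS store with default settings on both sides.
on_hdfs = physical_copies(es_replicas=1, hdfs_replication=3)
print(local_fs, on_hdfs)  # prints: 2 6
```

With these defaults every shard ends up stored six times instead of twice, and each of those extra copies is a remote write.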

That's not to say HDFS cannot be used with Elasticsearch - quite the opposite, and as I've pointed out, we are working on
it. It is important, though, to be aware of the semantics involved and how they relate to the overall system performance.

Cheers,

[1] elastic/elasticsearch-hadoop, Issue #72: HDFS support for ES 1.0 snapshot/restore feature (GitHub)

On 15/08/2013 2:09 AM, Phani wrote:

Hi all,

I see that in the latest releases, it is mentioned that the hadoop gateway is deprecated and will be removed. Does
that mean that there won't be any support for storing the indexes on HDFS?

And, later in the thread it is mentioned that indexes are only snapshotted to HDFS - so indexes are first built on local
FS and then copied to HDFS every 10 seconds or so?

It is also mentioned that using HDFS directly as a data storage for indexes is not good performance-wise. Can someone give
more insight into the performance issues and how much it degrades compared to local FS?

On Wednesday, May 9, 2012 4:18:58 PM UTC-7, Mo wrote:

Is anyone using elasticsearch with Hadoop? Would like to know how it's used and any suggestions that you can provide
would be helpful.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
Costin
