Tiered deployment of elasticsearch

I am brainstorming about implementing a different realtime distribution
architecture based on elasticsearch. With the default deployment of
elasticsearch, mainly due to the disk IO bottleneck, a scalable system with
less than one minute of delay from the data pipeline to the elasticsearch
cluster needs more servers than our cost budget allows. So I would like to
discuss with you the following architecture, borrowed from Druid, which was
developed and open sourced by metamx.

There would be two types of elasticsearch clusters: the first one realtime
and the second one historical. Realtime nodes index with the Memory store
type. Every hour (this can be configured), realtime nodes flush the index
to disk and notify the historical nodes. When the historical nodes receive
the notification, they start to copy the index from the realtime nodes and
add it to their own index. After the historical nodes finish copying the
index, the realtime nodes can close the index and free it from memory.
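To make the realtime side concrete, here is a minimal sketch of creating
such an index with the Memory store type (the hourly index name is just my
example):

# hypothetical hourly index name; store type "memory" keeps the index off disk
curl -XPUT 'localhost:9200/realtime-2012-11-29-20' -d '{
  "settings": {
    "index.store.type": "memory",
    "index.refresh_interval": "1s"
  }
}'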

We need a client module to merge the search results from the realtime
nodes and the historical nodes.
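As an aside, if both tiers lived in a single cluster, a plain multi-index
search would already merge the results for us (the index names and query
are just examples):

# one query spanning both tiers; elasticsearch merges and ranks the hits
curl 'localhost:9200/realtime-2012-11-29-20,historical/_search?q=user:kimchy'

With two separate clusters, the client module would have to query both and
merge the hits itself.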

What do you think about this idea? If I want to implement this
architecture, is there anything that should be added or changed in
elasticsearch? In other words, can I implement this architecture without
touching anything on the elasticsearch core side?

Thank you
Best, Jae

--

Hello Jae,

I think you can achieve the same thing by tweaking the flush options with
the local gateway:
http://www.elasticsearch.org/guide/reference/index-modules/translog.html
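For example, something like this (a sketch: the index name is made up, and
the right values are something you'd have to test):

# hypothetical index name; thresholds control how often the translog is flushed
curl -XPUT 'localhost:9200/realtime/_settings' -d '{
  "index.translog.flush_threshold_ops": 50000,
  "index.translog.flush_threshold_size": "500mb",
  "index.translog.flush_threshold_period": "60m"
}'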

Also, you might benefit from store-level throttling. You can set the type
to "merge" and fill in a value that fits your setup, to make sure that
merges won't suffocate your IO:
http://www.elasticsearch.org/guide/reference/index-modules/store.html
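For example (the value is just a placeholder to tune for your hardware):

# node-level settings; these can also go in elasticsearch.yml
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "5mb"
  }
}'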

On the same page there are some details about store-level compression,
which should help your disk throughput at the expense of CPU.
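For example, when creating the index (the index name is made up):

# compress stored fields and term vectors for this index
curl -XPUT 'localhost:9200/historical' -d '{
  "settings": {
    "index.store.compress.stored": true,
    "index.store.compress.tv": true
  }
}'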

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


--

If I understand correctly, all you want is already there in Elasticsearch.
All nodes are realtime nodes: indexing and searching take place in memory,
for performance reasons. The data is regularly persisted to the gateway
storage; think of it as a historical state of the index. The index is
already flushed each hour or so, even if idle. Note that kimchy compares
the gateway storage to Apple's backup mechanism, Time Machine.

Best regards,

Jörg

--

Thanks a lot for your answer.

Could you explain the translog in more detail? What kind of impact can I
expect from tuning the flush options?


--

If I understand correctly, all you want is already there in Elasticsearch.

Jörg is absolutely right. Unless you have a very specific scenario or a lot
of data / very high load on constrained resources, most of that is already
handled by elasticsearch and the operating system.

In the simplest case, you can use a "realtime" index and a "historical"
index with different refresh intervals ("1s" and "1h"). Then, every hour,
you can scan the data from "realtime" and push it to "historical". But it
would make more sense to segment your data into time-based indices, and
use aliases to provide an opaque API to your application -- see
http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
for inspiration.
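For example, the hourly rollover can then be a single atomic alias swap
(all index and alias names here are made up):

# move the "current" write alias to the new hour and widen the "search" alias
curl -XPOST 'localhost:9200/_aliases' -d '{
  "actions": [
    { "remove": { "index": "logs-2012-12-03-18", "alias": "current" } },
    { "add":    { "index": "logs-2012-12-03-19", "alias": "current" } },
    { "add":    { "index": "logs-2012-12-03-19", "alias": "search" } }
  ]
}'

The application always writes to "current" and searches "search", without
ever knowing the real index names.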

Karel

--

Hello Jae,

On Mon, Dec 3, 2012 at 7:20 PM, Jae metacret@gmail.com wrote:

Thanks a lot for your answer.

Could you explain the translog in more detail?

I'm not sure I can say much more than what's already in the documentation.
The point of the transaction log is to persist data that has just been
indexed, without needing to commit the Lucene index for every document.
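In short: a flush commits what's in memory to the Lucene index and clears
the translog. You can trigger one by hand to watch the effect (the index
name is made up):

# hypothetical index name; forces a Lucene commit plus translog cleanup
curl -XPOST 'localhost:9200/realtime/_flush'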

What kind of impact can I expect from tuning the flush options?

I think it's a matter of testing, like it is with the bulk size when you're
indexing.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene
