Tiered deployment of elasticsearch

Jae · November 29, 2012, 7:01pm

I am brainstorming about implementing a different realtime distribution
architecture based on elasticsearch. With the default deployment of
elasticsearch, mainly due to the disk IO bottleneck, a scalable system with
less than 1 minute of delay from the data pipeline to the elasticsearch
cluster needs more servers than our cost budget allows. So I would like to
discuss with you guys the following architecture, borrowed from Druid,
which was developed and open sourced by metamx.

There are two types of elasticsearch clusters: the first one is realtime
and the second one is historical. Realtime nodes index with the memory
store type. Every hour (this can be configured), the realtime nodes flush
the index to disk and notify the historical nodes. When the historical
nodes receive the notification, they start to copy the index from the
realtime nodes and add it to their own index. After the historical nodes
finish copying the index, the realtime nodes can close the index and free
it from memory.

We need a client module to merge the search results from the realtime
nodes and the historical nodes.
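
As a rough illustration of what such a client module would have to do,
here is a minimal Python sketch. It assumes each cluster has already
returned a list of hits shaped like {"_id": ..., "_score": ...}; the
function name merge_hits and the de-duplication rule are hypothetical, not
part of elasticsearch.

```python
import heapq
from typing import Dict, Iterable, List


def merge_hits(realtime_hits: Iterable[Dict],
               historical_hits: Iterable[Dict],
               size: int = 10) -> List[Dict]:
    """Merge two hit lists into one top-`size` list, best score first.

    If the same _id shows up in both lists (for example while an index is
    still being copied to the historical cluster), the higher-scoring copy
    is kept.
    """
    best: Dict[str, Dict] = {}
    for hit in list(realtime_hits) + list(historical_hits):
        current = best.get(hit["_id"])
        if current is None or hit["_score"] > current["_score"]:
            best[hit["_id"]] = hit
    # nlargest returns the hits sorted by descending score
    return heapq.nlargest(size, best.values(), key=lambda h: h["_score"])


realtime = [{"_id": "a", "_score": 2.0}, {"_id": "b", "_score": 0.5}]
historical = [{"_id": "b", "_score": 1.5}, {"_id": "c", "_score": 1.0}]
print([h["_id"] for h in merge_hits(realtime, historical, size=2)])
```

One caveat: relevance scores computed by two separate clusters are based
on different index statistics, so they are not strictly comparable.
Merging on an explicit sort key such as a timestamp would avoid that.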

What do you think about this idea? If I want to implement this
architecture, is there anything that should be added or changed in
elasticsearch? In other words, can I implement this architecture without
touching anything on the elasticsearch core side?

Thank you
Best, Jae

--

radu_gheorghe · November 30, 2012, 7:03pm

Hello Jae,

I think you can achieve the same thing by tweaking the flush options with
the local gateway:
http://www.elasticsearch.org/guide/reference/index-modules/translog.html

Also, you might benefit from store-level throttling. You can set this to
"merge" and fill in a value that fits your setup to make sure that merges
won't suffocate your IO:
http://www.elasticsearch.org/guide/reference/index-modules/store.html

On the same page there are some details about store-level compression,
which should help your disk throughput at the expense of CPU.
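
For illustration, the throttling suggestion could translate into a
settings body like the one built below. This is only a sketch: the setting
names index.store.throttle.type and index.store.throttle.max_bytes_per_sec
are assumed from the store module documentation of that era, and "20mb" is
a placeholder, not a recommendation. Check both against the version you
run.

```python
import json


def merge_throttle_settings(max_bytes_per_sec: str) -> dict:
    """Build an index settings body that throttles merge IO only.

    Assumption: the setting names below match your elasticsearch version.
    """
    return {
        "index.store.throttle.type": "merge",
        "index.store.throttle.max_bytes_per_sec": max_bytes_per_sec,
    }


# "20mb" is a placeholder; pick a value from your own IO measurements.
print(json.dumps(merge_throttle_settings("20mb"), indent=2))
```

The resulting JSON would be supplied as index settings, for example when
the index is created.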

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

jprante · December 1, 2012, 12:43am

If I understand correctly, all you want is already there in Elasticsearch.
You always have realtime nodes: indexing and searching take place in
memory, for performance reasons. The data is regularly persisted to the
gateway storage; think of it as a historical state of the index. The index
is already flushed each hour or so, even if idle. Note that kimchy compares
the gateway storage with Apple's backup mechanism, Time Machine.

Best regards,

Jörg

--

Jae · December 3, 2012, 5:20pm

Thanks a lot for your answer.

Could you explain the translog in more detail? What kind of impact can I
expect from tuning the flush options?

--

karmi · December 4, 2012, 11:31am

> There are two types of elasticsearch clusters, the first one is realtime
> and the second one is [...]
>
> If I understand correctly, all you want is already there in Elasticsearch.

Jörg is absolutely right. Unless you have a very specific scenario or a lot
of data / very high load on constrained resources, most of that is already
handled by elasticsearch and the operating system.

In the simplest case, you can use a "realtime" index and a "historical"
index with different refresh intervals ("1s" and "1h"). Then, every hour,
you can scan the data from "realtime" and push it to "historical". But it
would make more sense to segment your data into time-based indices, and
use aliases to provide an opaque API to your application -- see
http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
for inspiration.
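
A minimal Python sketch of the time-based variant follows. It only builds
the index names and request bodies. The "events" prefix, the "recent"
alias and the helper names are hypothetical; the bodies follow the shape
of the index settings (index.refresh_interval) and the aliases API
(add/remove actions), which should be verified against your version.

```python
from datetime import datetime, timedelta


def hourly_index_name(prefix: str, when: datetime) -> str:
    """Name of the hourly index that holds documents for `when`."""
    return f"{prefix}-{when:%Y.%m.%d.%H}"


def refresh_settings(interval: str) -> dict:
    """Index settings body with the given refresh interval."""
    return {"index": {"refresh_interval": interval}}


def roll_alias_body(alias: str, new_index: str, expired_index: str) -> dict:
    """Aliases body: point `alias` at the new index, drop the expired one."""
    return {
        "actions": [
            {"add": {"index": new_index, "alias": alias}},
            {"remove": {"index": expired_index, "alias": alias}},
        ]
    }


now = datetime(2012, 12, 4, 11, 31)
current = hourly_index_name("events", now)
expired = hourly_index_name("events", now - timedelta(hours=24))
print(current, refresh_settings("1s"))
print(roll_alias_body("recent", current, expired))
```

The application then only ever queries the alias, so indices can be added
or retired without changing client code.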

Karel

--

radu_gheorghe · December 4, 2012, 12:25pm

Hello Jae,

On Mon, Dec 3, 2012 at 7:20 PM, Jae metacret@gmail.com wrote:

> Thanks a lot for your answer.
>
> Could you explain translog more in detail?

I'm not sure I can say more than what's already in the documentation. The
point of the transaction log is to persist data that has just been
inserted, without needing to commit to the Lucene index for every
document.
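
To make that idea concrete, here is a toy Python model of a transaction
log. It is not how elasticsearch implements the translog; the class
ToyTranslog and its flush_threshold_ops parameter are hypothetical and
only show why logging every write allows commits to be rare.

```python
from typing import Dict, List, Tuple


class ToyTranslog:
    """Every write is appended to a log immediately; the expensive commit
    runs only once per `flush_threshold_ops` writes."""

    def __init__(self, flush_threshold_ops: int) -> None:
        self.flush_threshold_ops = flush_threshold_ops
        self.log: List[Tuple[str, dict]] = []   # stands in for the translog
        self.committed: Dict[str, dict] = {}    # stands in for the index
        self.commits = 0

    def index(self, doc_id: str, doc: dict) -> None:
        self.log.append((doc_id, doc))          # cheap append per document
        if len(self.log) >= self.flush_threshold_ops:
            self.flush()

    def flush(self) -> None:
        """Commit everything logged so far, then start a fresh log."""
        for doc_id, doc in self.log:
            self.committed[doc_id] = doc
        self.log.clear()
        self.commits += 1

    def recover(self) -> Dict[str, dict]:
        """State after a restart: committed docs plus a replay of the log."""
        state = dict(self.committed)
        state.update(dict(self.log))
        return state


tl = ToyTranslog(flush_threshold_ops=3)
for i in range(4):
    tl.index(f"doc{i}", {"n": i})
print(tl.commits, len(tl.log), len(tl.recover()))
```

A higher threshold means fewer commits but a longer log to replay on
recovery, which is the trade-off that tuning the flush options changes.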

> What kind of impact can I expect with tuning flushing option?

I think it's a matter of testing, like it is with the bulk size when you're
indexing.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--
