I am brainstorming about implementing different realtime distribution
architecture based on elasticsearch. With the default deployment of
elasticsearch, mainly due to disk IO bottleneck, for scalable system with
less than 1 minutes delay from data pipeline to elasticsearch cluster, it
needs excessive number of servers than our cost budget. So I need to
discuss with you guys about the following architecture borrowed from Druid
developed and open sourced by metamx.
There are two types of elasticsearch clusters, the first one is realtime
and the second one is historical. Realtime nodes are indexing with Memory
store type. Every hour(this can be configured), realtime nodes flush the
index into the disk and notify historical nodes. When historical nodes
receives the notification, they start to copy the index from realtime nodes
and add to the index. After the historical nodes finishes copying the
index, realtime nodes can close the index and free it from the memory.
We need a client module to merge the search result from realtime nodes and
historical nodes.
What do you think about this idea? If I want to implement this
architecture, is there anything that should be added/changed in
elasticsearch? In other words, can I implement this architecture without
touching anything in elasticsearch core side?
I think you can achieve the same thing by tweaking flush options with local
gateway:
Also, you might benefit from store-level throttling. You can set this to
"merge" and fill in a value that fits your setup to make sure that merges
won't suffocate your IO:
On the same page there are some details about store-level compression.
Which should help your disk throughput, at the expense of CPU.
I am brainstorming about implementing different realtime distribution
architecture based on elasticsearch. With the default deployment of
elasticsearch, mainly due to disk IO bottleneck, for scalable system with
less than 1 minutes delay from data pipeline to elasticsearch cluster, it
needs excessive number of servers than our cost budget. So I need to
discuss with you guys about the following architecture borrowed from Druid
developed and open sourced by metamx.
There are two types of elasticsearch clusters, the first one is realtime
and the second one is historical. Realtime nodes are indexing with Memory
store type. Every hour(this can be configured), realtime nodes flush the
index into the disk and notify historical nodes. When historical nodes
receives the notification, they start to copy the index from realtime nodes
and add to the index. After the historical nodes finishes copying the
index, realtime nodes can close the index and free it from the memory.
We need a client module to merge the search result from realtime nodes and
historical nodes.
What do you think about this idea? If I want to implement this
architecture, is there anything that should be added/changed in
elasticsearch? In other words, can I implement this architecture without
touching anything in elasticsearch core side?
If I understand correctly, all you want is already there in Elasticsearch.
You have always realtime nodes, indexing and searching takes place in
memory, for performance reasons. The data is regularly persisted to the
gateway storage, think of it as a historical state of the index. The index
is already flushed each hour or so, even if idle. Note, kimchy compares the
gateway storage with Apple's backup mechanism, the Time Machine.
On Thu, Nov 29, 2012 at 9:01 PM, Jae <meta...@gmail.com <javascript:>>wrote:
I am brainstorming about implementing different realtime distribution
architecture based on elasticsearch. With the default deployment of
elasticsearch, mainly due to disk IO bottleneck, for scalable system with
less than 1 minutes delay from data pipeline to elasticsearch cluster, it
needs excessive number of servers than our cost budget. So I need to
discuss with you guys about the following architecture borrowed from Druid
developed and open sourced by metamx.
There are two types of elasticsearch clusters, the first one is realtime
and the second one is historical. Realtime nodes are indexing with Memory
store type. Every hour(this can be configured), realtime nodes flush the
index into the disk and notify historical nodes. When historical nodes
receives the notification, they start to copy the index from realtime nodes
and add to the index. After the historical nodes finishes copying the
index, realtime nodes can close the index and free it from the memory.
We need a client module to merge the search result from realtime nodes
and historical nodes.
What do you think about this idea? If I want to implement this
architecture, is there anything that should be added/changed in
elasticsearch? In other words, can I implement this architecture without
touching anything in elasticsearch core side?
There are two types of elasticsearch clusters, the first one is realtime
and the second one is
If I understand correctly, all you want is already there in Elasticsearch.
Jörg is absolutely right. Unless you have a very specific scenario or a lot
of data / very high load on constrained resources, most of that is already
handled by elasticsearch and the operating system.
In the most simple case, you can use the "realtime" index and the
"historical" index, with different refresh intervals ("1s" and "1h"). Then,
every hour, you can scan data from "realtime" and push them to
"historical". But it would make more sense to segment you data into
time-based indices, and use aliases to provide opaque API to your
application -- see http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
for inspiration.
I'm not sure I can say more than what's already in the documentation. The
point of the transaction log is to have persistence of data that has just
been inserted, without needing to commit to the lucene index for every
document.
What kind of impact can I expect with tuning flushing option?
I think it's a matter of testing, like it is with the bulk size when you're
indexing.
I am brainstorming about implementing different realtime distribution
architecture based on elasticsearch. With the default deployment of
elasticsearch, mainly due to disk IO bottleneck, for scalable system with
less than 1 minutes delay from data pipeline to elasticsearch cluster, it
needs excessive number of servers than our cost budget. So I need to
discuss with you guys about the following architecture borrowed from Druid
developed and open sourced by metamx.
There are two types of elasticsearch clusters, the first one is realtime
and the second one is historical. Realtime nodes are indexing with Memory
store type. Every hour(this can be configured), realtime nodes flush the
index into the disk and notify historical nodes. When historical nodes
receives the notification, they start to copy the index from realtime nodes
and add to the index. After the historical nodes finishes copying the
index, realtime nodes can close the index and free it from the memory.
We need a client module to merge the search result from realtime nodes
and historical nodes.
What do you think about this idea? If I want to implement this
architecture, is there anything that should be added/changed in
elasticsearch? In other words, can I implement this architecture without
touching anything in elasticsearch core side?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.