Index Scaling Question


(dallasmahrt) #1

We are evaluating ElasticSearch for our search solution and I had a
question about scaling indexes. Based on my research thus far, once an
index is created the shard count and replication factor are fixed. I
read in another post that if capacity is exceeded that a recommended
approach is to build a new index from the initial sources with higher
shard counts or higher replication factor (based on the type of
capacity being exceeded). The nature of our system makes this
difficult since we do not retain much of the data we are indexing and
cannot re-acquire that data easily.

My question is if there was any supported means of creating new index
from an existing index with new shard counts or replication factors?
If not, is there a plan to add this support? If so, how?

Thanks for any help you can offer.

--
Dallas Mahrt


(Shay Banon) #2

Hi,

Good question. First, you can always create more shards than you expect
to scale to. For example, a 20 shards with 1 replica cluster will scale up
to 40 machines until it reaches its limit... . Bump that to 40 with one
replica, and you can scale up to 80... .

Second, by default, elasticsearch stores the source document as well. So
you can always start a scrolled search and reindex the data. As discussed in
another post, the reindex capability can be provided out of the box if the
source is there.

There is an option to add "resharding", but thats quite a complex task to
accomplish with distributed systems (dynamo model take out of the picture
since it does not really apply to search...). It can be implemented under
certain restrictions, but its not something that I have on my plate for the
near future.

Last, the idea of multi index support in elasticsearch tries to address
that. For example, if you index data based on time, you can create an index
per month (for example). Each index has its own number of shards and number
of replicas settings (though you might as well have it set to a fix value).
This is not complete, of course, without the ability to search across
multiple indices, which elasticsearch provides ;). This gives you the added
benefit of doing faster searches on more recent months and then going
backwards if you don't find anything relevant.

The idea above does not have to be partitioned by month, of course, you
can choose how you partition it. The benefit is that you can scale out
endlessly in this scenario.

Really last point ;), if you are concerned about scaling search, the
ability to dynamically change the replica count is something that I plan to
add (replicas sever search/count/get requests). You can actually do it
currently by shutting down the cluster, changing in the gateway metadata the
number of replicas, and starting it up again, but thats a hack ;).

cheers,
shay.banon

On Fri, May 7, 2010 at 12:19 AM, Dallas Mahrt dallasmahrt@gmail.com wrote:

We are evaluating ElasticSearch for our search solution and I had a
question about scaling indexes. Based on my research thus far, once an
index is created the shard count and replication factor are fixed. I
read in another post that if capacity is exceeded that a recommended
approach is to build a new index from the initial sources with higher
shard counts or higher replication factor (based on the type of
capacity being exceeded). The nature of our system makes this
difficult since we do not retain much of the data we are indexing and
cannot re-acquire that data easily.

My question is if there was any supported means of creating new index
from an existing index with new shard counts or replication factors?
If not, is there a plan to add this support? If so, how?

Thanks for any help you can offer.

--
Dallas Mahrt


(system) #3