Strategy for index over a rolling window of realtime data


(Colin Surprenant) #1

Hi,

I need to keep an index over a rolling window of 4-5 weeks of realtime
data. The granularity could be a week or a day.

My question is, performance-wise, is it a better idea to deal with 4
weekly indices or 30 daily indices? Would my number of current and
planned nodes impact this choice? Which would scale better as data
volume increases and new nodes are added?

I am planning on using an alias which will associate with the past 4
weekly indices or the past 30 daily indices.
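
A minimal sketch of the daily variant of this alias scheme, for
illustration only. The index prefix "events-", the alias name
"events-window" and the date format are hypothetical, and the returned
dict only mirrors the add/remove actions shape of the aliases API; the
sketch builds the request body and does not send it anywhere.

```python
from datetime import date, timedelta


def daily_index_names(today, days=30, prefix="events-"):
    """Names of the daily indices in the window, newest first (hypothetical naming)."""
    return [f"{prefix}{(today - timedelta(days=i)):%Y.%m.%d}" for i in range(days)]


def rollover_actions(today, days=30, alias="events-window", prefix="events-"):
    """Alias actions that add today's index and drop the one that just left the window."""
    newest = f"{prefix}{today:%Y.%m.%d}"
    expired = f"{prefix}{(today - timedelta(days=days)):%Y.%m.%d}"
    return {
        "actions": [
            {"add": {"index": newest, "alias": alias}},
            {"remove": {"index": expired, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    today = date(2010, 9, 2)
    print(daily_index_names(today)[:3])
    print(rollover_actions(today))
```

The weekly variant would work the same way with a 7-day step and 4 or 5
names in the window.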

Thanks,
Colin


(Shay Banon) #2

Hi,

Good question. Let me first go back to how search works. In the end, a
search is executed (in parallel) on index shards, and the results are
reduced and sent back to the client. It's all about index shards, not
really about indices. As an example, a search against a single index with
7 shards is the same as a search against 7 indices with 1 shard each
(ignoring replicas here, as search round-robins between a shard and its
replicas).
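
To make the equivalence concrete, here is a small sketch that only
counts the primary shards a search would fan out to. The index names are
made up, and the function models nothing beyond that count (no replicas,
no routing).

```python
def shards_searched(indices):
    """Total primary shards hit by a search over `indices`.

    `indices` maps an index name to its number of primary shards.
    """
    return sum(indices.values())


# One weekly index with 7 shards (hypothetical name).
one_weekly_index = {"week-35": 7}

# Seven daily indices with 1 shard each (hypothetical names).
seven_daily_indices = {f"day-{d}": 1 for d in range(1, 8)}

if __name__ == "__main__":
    print(shards_searched(one_weekly_index))
    print(shards_searched(seven_daily_indices))
```

Both layouts fan out to the same number of shards, which is the point
made above.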

So, the main question is how many shards you expect to need for one
week. Whether you use a one-day or a seven-day resolution is your
decision. For example, if you think you will need 7 shards for a single
index in an index-per-week scenario, you can just as easily create one
index per day with 1 shard each (replicas are again left out here; how
many to use is an independent decision).

The benefit of time-based indices, at either one-week or one-day
resolution, is the ability to scale based on load. For example, if you
find that there is more load than expected, the next index you create can
hold more shards. In the one-week scenario, you have finer-grained
control over how many shards you add; for example, you can decide to go
from 7 shards to 10 shards. With one-day resolution, going from 1 shard
to 2 shards per index means going from 7 to 14 shards per week.
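
The arithmetic behind that comparison, as a rough sketch. It assumes
every index created within a week gets the same shard count, and the
function names are illustrative only.

```python
def shards_per_week(shards_per_index, indices_per_week):
    """Primary shards created per week for a given layout."""
    return shards_per_index * indices_per_week


def reachable_weekly_totals(indices_per_week, max_shards_per_index):
    """Weekly shard totals reachable by varying shards per index from 1 upward."""
    return [
        shards_per_week(n, indices_per_week)
        for n in range(1, max_shards_per_index + 1)
    ]


if __name__ == "__main__":
    # Weekly indices: any total is reachable, e.g. 7 -> 10.
    print(shards_per_week(7, 1), shards_per_week(10, 1))
    # Daily indices: totals move in steps of 7, e.g. 7 -> 14.
    print(shards_per_week(1, 7), shards_per_week(2, 7))
    print(reachable_weekly_totals(7, 3))
```

With weekly indices the total can grow one shard at a time; with daily
indices the smallest possible increase is a full multiple of 7.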

-shay.banon

On Thu, Sep 2, 2010 at 10:07 PM, Colin Surprenant <
colin.surprenant@gmail.com> wrote:


