Sharding a big index by name


(David Stendardi) #1

Hello!

I am using Elasticsearch intensively to index statistics. Since most of
the requests will concern the last two months of records and the volume will
be quite large (400,000 records/day), I am considering adding a year
suffix to the index name and using the index template API.

Example: curl -XPOST 'localhost:9200/statistic-2011/foo/' -d ...

Does it make sense? I am wondering whether this will optimize anything, or
whether Elasticsearch already does this kind of optimization internally.

Cheers,

David Stendardi


(David Stendardi) #2

Addendum:
It will probably be a year-month suffix rather than only a year (e.g.
statistic-2011-10).
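
For illustration, the year-month naming scheme could be computed on the client side roughly as follows. This is only a sketch: the helper names `monthly_index` and `recent_indices` are hypothetical, and the `statistic` prefix is taken from the example above.

```python
from datetime import date
from typing import List


def monthly_index(day: date, prefix: str = "statistic") -> str:
    """Return the index name for the month containing `day`,
    e.g. statistic-2011-10 (prefix is an assumption from the post)."""
    return f"{prefix}-{day.year:04d}-{day.month:02d}"


def recent_indices(today: date, months: int = 2,
                   prefix: str = "statistic") -> List[str]:
    """Return the names of the last `months` monthly indices,
    oldest first, including the month of `today`."""
    names = []
    year, month = today.year, today.month
    for _ in range(months):
        names.append(f"{prefix}-{year:04d}-{month:02d}")
        month -= 1
        if month == 0:  # roll back over a year boundary
            year, month = year - 1, 12
    return names[::-1]
```

With this, a document dated 2011-10-15 would be written to `monthly_index(date(2011, 10, 15))`, and a "last two months" query would only need the names returned by `recent_indices`.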


(phobos182) #3

That's what we do at my company. We chose an index creation strategy that partitions the data by time. A query then only looks at a much smaller set of data instead of one large index.

We chose a weekly index strategy (2011-42, 2011-43, 2011-44, ...) where the index name is the year + week number.

Each index has 8 shards, a number chosen based on the size of the cluster. So querying two weeks of data hits 16 shards, querying three weeks hits 24 shards, and so on. Since ElasticSearch handles the parallel dispatch of requests, a high shard count is really not an issue if you have the machines to handle it.

When querying the data in ElasticSearch, you can choose which indices to execute the query on with a comma-separated list.

Ex:

curl -XGET 'http://localhost:9200/2011-42,2011-43/_search?q=user:kimchy'
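
As a rough sketch of how a client might build such a request, the snippet below derives the weekly index names covering a date range and joins them into a search URL. It assumes the week number is the ISO week (the thread does not say which week numbering is used). The helper names `weekly_indices` and `search_url` are hypothetical, and the 8 shards per index is the figure quoted above.

```python
from datetime import date, timedelta

SHARDS_PER_INDEX = 8  # figure quoted in the post; adjust to your cluster


def weekly_indices(start, end):
    """Return the year-week index names covering start..end inclusive.
    Assumes ISO week numbering, which the post does not confirm."""
    names = []
    day = start
    while day <= end:
        iso_year, iso_week, _ = day.isocalendar()
        name = f"{iso_year}-{iso_week:02d}"
        if name not in names:
            names.append(name)
        day += timedelta(days=1)
    return names


def search_url(indices, query, host="http://localhost:9200"):
    """Build a multi-index search URL with a comma-separated index list."""
    return f"{host}/{','.join(indices)}/_search?q={query}"


indices = weekly_indices(date(2011, 10, 17), date(2011, 10, 30))
url = search_url(indices, "user:kimchy")
shards_hit = len(indices) * SHARDS_PER_INDEX
```

For the two weeks starting 2011-10-17 this yields the same URL as the curl example, and `shards_hit` works out to 16, matching the arithmetic given earlier.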

