Time-based index and routing

(Jérôme) #1


We have a very large dataset to index. Our idea was to create a single index and use routing based on user_id (about 200 users at the moment, growing every month).
The problem is that some user_ids have many hundreds of millions of lines to index. So we thought of creating one index per year, routing on user_id, and creating an alias that contains all of those indexes.

For example, an alias called "LOG" containing LOG2010, LOG2011, LOG2012, LOG2013 and so on...
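
As a rough illustration of the layout described above, this Python sketch builds a request body for Elasticsearch's `_aliases` endpoint that attaches one index per year to a single alias. The function name `build_alias_actions` and the lowercase names `log` and `log2010` are assumptions made for the example (Elasticsearch requires lowercase index names). Nothing is sent to a cluster here.

```python
import json


def build_alias_actions(alias, prefix, first_year, last_year):
    """Return a body for POST /_aliases that adds one index per year to `alias`.

    `prefix` + year gives the index name, e.g. "log" + 2010 -> "log2010".
    """
    actions = [
        {"add": {"index": f"{prefix}{year}", "alias": alias}}
        for year in range(first_year, last_year + 1)  # last_year is inclusive
    ]
    return {"actions": actions}


# Hypothetical names: alias "log" over log2010 .. log2013.
body = build_alias_actions("log", "log", 2010, 2013)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be posted as JSON to `/_aliases` with whatever HTTP client is already in use.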

But I think that when searching through this alias, all the indexes will be searched (one shard per index, thanks to routing).

This may not be the best solution. I thought that filters on the alias might be a way to select which indexes are queried, but I don't think that is what alias filters are for.

If we do this and have ten years of data, every query will hit 10 indexes to produce its results (which are mainly aggregations).

Is this a good strategy?

How would you tackle this?

We were thinking of defining 20 shards per index. Is that too many or too few?

Thanks for your help

(Mark Walkom) #2

Routing in that manner makes a lot of sense. Generally, though, we suggest that if you go down that path you keep an eye on shard size: try to keep shards under 50GB and 2 billion docs, and reshard or split larger users out into their own indices as needed.
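
A minimal sketch of the kind of check this implies, assuming you already have per-shard sizes and document counts from your monitoring. The input shape (dicts with `index`, `shard`, `bytes`, `docs`) and the helper name `oversized_shards` are hypothetical, and treating 50GB as 50 * 1024^3 bytes is also an assumption.

```python
# Thresholds taken from the guideline above.
MAX_SHARD_BYTES = 50 * 1024**3      # 50GB, interpreted here as GiB (assumption)
MAX_SHARD_DOCS = 2_000_000_000      # 2 billion documents


def oversized_shards(shards):
    """Return the shards that exceed either the size or the doc-count limit.

    `shards` is an iterable of dicts with the hypothetical keys
    'index', 'shard', 'bytes' and 'docs'.
    """
    return [
        s for s in shards
        if s["bytes"] > MAX_SHARD_BYTES or s["docs"] > MAX_SHARD_DOCS
    ]


# Made-up sample data purely for illustration.
sample = [
    {"index": "log2012", "shard": 0, "bytes": 10 * 1024**3, "docs": 300_000_000},
    {"index": "log2013", "shard": 3, "bytes": 72 * 1024**3, "docs": 900_000_000},
]
flagged = oversized_shards(sample)
for s in flagged:
    print(f"{s['index']} shard {s['shard']} is over the suggested limits")
```

Users whose routing value lands on a flagged shard would be the candidates for splitting out.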

Definitely use aliases as well. Doing time series may make sense if your data is also time based.

Ultimately though, this is something only you can really answer with time and testing.

(Jérôme) #3

Thank you very much for your answer...

In fact, my question was more like: "If I have many indexes sharing the same alias, are all of the underlying indexes queried when I query the alias?"

If so, is there a way to limit the query to the indexes that are relevant?

Routing does this between an index and its shards, but is there a way to do the same between an alias and its indexes?

I hope that is clear.

(Mark Walkom) #4

Any index attached to an alias will be queried. The only way to stop this is to query specific indices rather than the alias.
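
One way to do that on the client side, sketched in Python under the assumption of yearly indices named with a lowercase prefix plus the year (`log2012`, `log2013`, ...): work out which years the query's time range touches and put only those index names in the search path. The helper names `indices_for_years` and `search_path` are hypothetical. The comma-separated index list and the `routing` query parameter are standard parts of the search API.

```python
from urllib.parse import quote


def indices_for_years(prefix, start_year, end_year):
    """List the yearly index names covering start_year..end_year (inclusive)."""
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")
    return [f"{prefix}{year}" for year in range(start_year, end_year + 1)]


def search_path(prefix, start_year, end_year, user_id):
    """Build a search path that targets only the relevant yearly indices
    and routes to the shard holding `user_id`."""
    names = ",".join(indices_for_years(prefix, start_year, end_year))
    return f"/{names}/_search?routing={quote(str(user_id), safe='')}"


# Query 2012-2013 for a hypothetical user 42 instead of hitting every year.
path = search_path("log", 2012, 2013, 42)
print(path)  # /log2012,log2013/_search?routing=42
```

With ten years of data, a query restricted to two years then touches two indices (one shard each, given routing) rather than ten.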
