We've got a very big dataset to index. Our idea was to use a single index with routing based on user_id (around 200 users, growing every month).
The thing is that some user_ids have many hundreds of millions of lines to index. So we thought of creating one index per year, routing by user_id, and creating an alias that covers all those indexes.
For example, an alias called "LOG" containing LOG2010, LOG2011, LOG2012, LOG2013 and so on...
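Assuming yearly indexes named as above, such an alias could be set up with the `_aliases` API (index and alias names here are just illustrative):

```
POST /_aliases
{
  "actions": [
    { "add": { "index": "LOG2010", "alias": "LOG" } },
    { "add": { "index": "LOG2011", "alias": "LOG" } },
    { "add": { "index": "LOG2012", "alias": "LOG" } },
    { "add": { "index": "LOG2013", "alias": "LOG" } }
  ]
}
```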
But I think that when searching through this alias, all the indexes will be searched (one shard per index, thanks to routing...).
That may not be the best solution. I thought filters on aliases might be a way to select which indexes get queried, but I don't think that's what alias filters are for.
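For what it's worth, alias filters restrict which documents match, not which indexes are hit; a common pattern is a per-user alias combining a filter with routing (field name and values below are hypothetical):

```
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "LOG2013",
        "alias": "user42-logs",
        "filter": { "term": { "user_id": "42" } },
        "routing": "42"
      }
    }
  ]
}
```

Searches on `user42-logs` would then be routed to a single shard and filtered to that user, but the alias still does not prune indexes by itself.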
With this setup, after ten years of data it means a single query would hit 10 indexes to get its results (which are mainly aggregations).
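One way to avoid hitting every yearly index is to target only the years a query actually needs, combined with routing; a sketch (field names, routing value, and interval are assumptions):

```
GET /LOG2012,LOG2013/_search?routing=42
{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day"
      }
    }
  }
}
```

Here only two indexes (and one shard in each, thanks to routing) participate in the aggregation instead of all ten.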
Is it a good strategy?
How would you tackle this?
We thought of defining 20 shards per index... Is that too many or too few?
Routing in that manner makes a lot of sense. If you go that path, though, we generally suggest keeping an eye on shard size: try to keep each shard under 50GB and 2 billion docs, and reshard/split larger users out into their own indexes as needed.
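A quick way to keep an eye on shard sizes and doc counts is the cat shards API (the `LOG*` pattern is assumed from the naming above):

```
GET /_cat/shards/LOG*?v&h=index,shard,prirep,store,docs
```

Watching the `store` and `docs` columns tells you when a shard is approaching the limits above and a user should be split out.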
Definitely use aliases as well. Time-based indexes may make sense if your data is also time based.
Ultimately though, this is something only you can really answer with time and testing.