In the example of facebook/twitter/logs, newer data is generally "hotter".
And you can put some default time limits in your application so that most
searches will hit just the newest index. And because that index is small,
the majority of your searches will turn out to be faster.
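For example, with monthly indices named something like logs-2013.01 (the index name and the @timestamp field are just illustrations, not something from your setup), the application's default search could POST to localhost:9200/logs-2013.01/_search with a body along these lines:

```json
{
  "query": {
    "filtered": {
      "query": { "match": { "message": "error" } },
      "filter": {
        "range": { "@timestamp": { "gte": "now-7d" } }
      }
    }
  }
}
```

Because only the newest (and smallest) index is named in the URL, the query never touches the older, bigger ones unless the user explicitly widens the search.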
And because the "current" index always changes, you don't need to
overshard. For example, if you have 8 nodes, you can have 8 shards (or 4
shards and 1 replica, etc), and use Shard Allocation to make sure your
index is well distributed in your cluster. Then if you add a 9th node, you
can change the settings/template to make the next "current" index have 9 shards.
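A sketch of such a template (the template name and logs-* pattern are made up for illustration), which you could PUT to localhost:9200/_template/logs_template so every newly created matching index picks up the new shard count:

```json
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 9,
    "number_of_replicas": 1
  }
}
```

Existing indices keep their old shard count; only indices created after the template change get 9 shards.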
Of course, users should be able to widen their searches and hit more/all
indices - and in that case there's indeed a performance penalty like you
said. But in regular use cases for time-based indices, you can assume that
old ones will be more or less unchanged, so you can optimize "old" indices
and searches on them should be pretty fast.
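For instance, once an index like logs-2012.12 (name is just an example) stops receiving writes, you can merge it down to one segment:

```sh
# Merge a read-only "old" index down to a single segment
# so searches on it do less work (pre-2.x optimize API):
curl -XPOST 'localhost:9200/logs-2012.12/_optimize?max_num_segments=1'
```

This is I/O-heavy while it runs, so it's best done once per index, after indexing into it has stopped.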
Then there's indexing performance. Because the "current" index is smaller,
there's going to be less merging while indexing - and that will put less
strain on I/O.
http://sematext.com/ -- ElasticSearch -- Solr -- Lucene
On Wed, Jan 23, 2013 at 9:19 AM, T Vinod Gupta email@example.com wrote:
In a couple of talks on ES performance (including the one by Shay), it was
illustrated that for constantly growing data streams (like facebook/twitter
or log streams) it is better to create indices for each week/month, and
then an aliased index can be created to combine a bunch of them into a
logical index. This helps keep the searchable space under control
and avoids heavy deletions.
But at the same time, it also means that any search would be hitting
multiple shards, and that has a perf penalty. One can use the routing
parameter to direct all traffic for a user to 1 shard, so that searches are
fast. But using multiple indices kind of offsets that..
I'm a little confused/unclear now.
Does anyone have any insights?