Hi,
In a couple of talks on ES performance (including one by Shay), it was
suggested that for constantly growing data streams (like Facebook/Twitter
feeds or log streams) it is better to create an index per week or month,
and then use an alias to combine a bunch of them into one logical index.
This keeps the searchable space under control and avoids heavy deletions.
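If I understand correctly, that would look roughly like this (index names
and host are just placeholders, sketched with Python's requests library
against the REST API):

  # Group three monthly indices under one alias so they can be queried as
  # a single logical index. Names and host are placeholders.
  import requests

  actions = {
      "actions": [
          {"add": {"index": "logs-2012-01", "alias": "logs"}},
          {"add": {"index": "logs-2012-02", "alias": "logs"}},
          {"add": {"index": "logs-2012-03", "alias": "logs"}},
      ]
  }
  requests.post("http://localhost:9200/_aliases", json=actions).raise_for_status()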
At the same time, it also means that any search hits multiple indices (and
therefore multiple shards), which carries a performance penalty. One can use
the routing parameter to direct all of a user's data to one shard so that
searches are fast, but using multiple indices seems to offset that benefit.
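(By routed searches I mean something like the rough sketch below; the
routing value and query are made up:)

  # Route the search by user id so only the shard holding that user's
  # documents is queried. Index name, routing value and query are made up.
  import requests

  query = {"query": {"match": {"message": "error"}}}
  resp = requests.post(
      "http://localhost:9200/logs/_search",
      params={"routing": "user_42"},
      json=query,
  )
  print(resp.json()["hits"]["total"])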
I'm a little confused now. Does anyone have any insights?
In the example of Facebook/Twitter/logs, newer data is generally "hotter".
And you can put some default time limits in your application so that most
searches will hit just the newest index. And because that index is small,
the majority of your searches will turn out to be faster.
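For example (a minimal sketch; the index naming scheme and the query are
assumptions), the application could default to querying only the current
month's index:

  # Default to searching only the current month's index; the user has to
  # ask explicitly for a wider time range. Naming scheme is an assumption.
  import datetime
  import requests

  current_index = "logs-%s" % datetime.date.today().strftime("%Y-%m")
  query = {"query": {"match": {"message": "timeout"}}}
  resp = requests.post(
      "http://localhost:9200/%s/_search" % current_index, json=query
  )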
And because the "current" index always changes, you don't need to
overshard. For example, if you have 8 nodes, you can have 8 shards (or 4
shards and 1 replica, etc), and use Shard Allocation[0] to make sure your
index is well distributed in your cluster. Then if you add a 9th node, you
can change the settings/template to make the next "current" index have 9
shards.
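A rough sketch of such a template (the name, pattern and counts are
assumptions; this uses the legacy _template API):

  # Index-template sketch: every new logs-* index gets 8 primaries, one per
  # node in an 8-node cluster. Bump number_of_shards to 9 when the 9th node
  # joins, and the next "current" index picks it up automatically.
  import requests

  template = {
      "template": "logs-*",
      "settings": {"number_of_shards": 8, "number_of_replicas": 1},
  }
  requests.put(
      "http://localhost:9200/_template/logs", json=template
  ).raise_for_status()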
Of course, users should be able to widen their searches and hit more/all
indices - and in that case there's indeed a performance penalty like you
said. But in typical use cases for time-based indices, you can assume that
old ones will be more or less unchanged, so you can optimize "old" indices
and searches on them should be pretty fast.
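For instance (index name made up; on the ES versions of this era the call
is _optimize, newer releases call it _forcemerge), once a monthly index
stops receiving writes you could merge it down:

  # Merge a no-longer-written monthly index down to a single segment so
  # searches on it stay fast. Index name is made up.
  import requests

  resp = requests.post(
      "http://localhost:9200/logs-2012-01/_optimize",
      params={"max_num_segments": 1},
  )
  resp.raise_for_status()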
Then there's indexing performance. Because the "current" index is smaller,
there's going to be less merging while indexing - and that will put less
strain on I/O.
Thanks Radu for the elaborate response.
I am wondering if someone has run experiments on what a good partition size
for a shard is on an m1.xlarge host on EC2. How many docs / how much data is
reasonable for one shard?
For instance, I currently have about 20M docs in a 5-shard index on an
m1.xlarge, with no routing intelligence, so searches go to all shards, and
my searches often take tens of seconds. But my last month of data is only
2.3M docs. So if I create a monthly index with just 1 shard, will that
suffice? Or should I create 2 shards for the month's data and use routing?
My typical search covers 3 months, so in the month-based scheme I would be
hitting 3 shards instead of the earlier 5, and smaller shards at that.
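For concreteness, a three-month search in the month-per-index scheme would
look something like this (index names and query are placeholders):

  # Query only the three monthly indices that cover the search window, so
  # only their (smaller) shards are hit. Names and query are placeholders.
  import requests

  indices = "logs-2012-01,logs-2012-02,logs-2012-03"
  query = {"query": {"match": {"message": "error"}}}
  resp = requests.post(
      "http://localhost:9200/%s/_search" % indices, json=query
  )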