I find I can get better bulk indexing performance when the index has more shards. Does that make sense?
My own theory is that when I have multiple bulk clients, increasing the number of shards lets the server achieve better concurrency (?)
So if I increase the shards to, say, 30 and get a good indexing run... is it possible to reduce the number of shards afterwards, or does it matter if the number remains at 30?
Sorry, you can't reduce it. I imagine the performance increase you get is because the merge logic is per shard, so it does less work when there are more shards for the same data. You can likely get similar numbers if you set the refresh interval to -1 and play with the merge policy before the bulk load. You'd want to reset them afterwards and then run an optimize. This amounts to the same thing as starting with more shards and merging them. Mostly. I think.
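A minimal sketch of that sequence, assuming the elasticsearch-py client against an ES 1.x-era cluster; the node URL and index name are illustrative, not from the thread:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node
index_name = "logs-bulk"                     # hypothetical index name

# Turn off refresh for the duration of the bulk load.
es.indices.put_settings(index=index_name,
                        body={"index": {"refresh_interval": "-1"}})

# ... run the bulk load here ...

# Restore the default refresh interval, then merge the segments down.
es.indices.put_settings(index=index_name,
                        body={"index": {"refresh_interval": "1s"}})
es.indices.optimize(index=index_name, max_num_segments=1)  # "optimize" on 1.x; later clients call this forcemerge
```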
The default setting is to refresh every second. Refresh works very fast when segments are small. If you have more than one shard and use bulk indexing, the segments stay small enough for fast refreshes for a longer time. So you will observe faster bulk indexing, but only for the first 20 minutes or so (longer runs will also show increasing response times). Disabling refresh at bulk indexing time is strongly recommended.

The default ES settings are selected for more than one shard (the default is 5). To utilize the server resources (CPU, RAM) with a single shard, a little optimization of thread pools and buffer sizes may be required, especially the bulk thread pool and the index buffer size.
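As a rough illustration of that kind of tuning (the setting names and values are assumptions for an ES 1.x-era cluster with elasticsearch-py, not figures from the thread):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

# On 1.x-era clusters the thread pool settings were dynamic, so the bulk queue
# could be enlarged through the cluster settings API; the value is a placeholder.
es.cluster.put_settings(body={
    "transient": {"threadpool.bulk.queue_size": 200}
})

# The index buffer is a node-level setting and normally lives in
# elasticsearch.yml (restart required), for example:
#   indices.memory.index_buffer_size: 30%
```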
Jörg
Thanks guys,
yes, I already have the refresh interval at -1.
What I'm suggesting is that to support multiple client threads, say 50, it seems that 50 shards are a big help.
i.e. more shards equals more concurrency.
Although many shards mean higher concurrency, a single shard also allows very high concurrency. ES concurrency is implemented independently of the number of shards and is very flexible in the configuration of thread pools and modules.

The default settings are for a small ES system which can scale to around 5 nodes, so if you have a powerful machine, you will see that more than one shard gives a better experience simply because the default settings are not designed for being limited to a single node.

50 shards per node is quite high and needs heavy adjustment in other places unless you are going for a total of around 50 nodes in the cluster, so I do not recommend a high number of shards by default.
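To illustrate that client-side concurrency does not have to come from shard count, here is a hedged sketch that feeds one index from several client threads; `parallel_bulk`, the index name, the document generator, and the thread count are illustrative choices, assuming a reasonably recent elasticsearch-py helpers module:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local node
index_name = "logs-bulk"                     # hypothetical index name

def actions():
    # Generate simple documents; real payloads would come from your data source.
    for i in range(100000):
        yield {
            "_index": index_name,
            "_type": "doc",  # needed on 1.x-era clusters; drop on modern versions
            "_source": {"id": i, "msg": "hello"},
        }

# Several client threads send bulk requests to the same index; the server-side
# bulk thread pool, not the shard count, is what bounds per-node concurrency.
for ok, info in helpers.parallel_bulk(es, actions(), thread_count=8, chunk_size=1000):
    if not ok:
        print(info)
```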