I have a 3-node cluster with an index that contains 72 shards. It holds 330M documents amounting to 100GB according to `_cat/indices`. The index has ~1000 fields in its mapping, but each individual document uses no more than 30 of them.
Routing by customer ID is used, but one of the customers in this index has abnormally many documents, so one shard is much larger (30GB); the other shards are around 0.5-1GB.
The problem is that queries to this large shard often take 20-30 seconds, even for a routing ID that matches relatively few documents (e.g. 10k). The performance issues occur when a Lucene merge thread is active for this index. The queries are rather simple: a set of term/bool filters plus sorting by 2-3 fields.
Queries to the small (0.5-1GB) shards are fine and take < 200ms.
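For context, the query shape is roughly the following. This is a minimal sketch using Python's `requests` library; the node address, index name, routing value, and field names are made-up placeholders, not my real mapping:

```python
import requests

ES = "http://localhost:9200"  # assumed node address

# Hypothetical example of the query shape: a routed search with
# term filters inside a bool query, sorted by a few fields.
body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"customer_id": "big-customer"}},
                {"term": {"status": "active"}},
            ]
        }
    },
    "sort": [
        {"created_at": "desc"},
        {"priority": "desc"},
    ],
    "size": 100,
}

resp = requests.get(
    f"{ES}/myindex/_search",
    params={"routing": "big-customer"},  # routes to the single large shard
    json=body,
)
resp.raise_for_status()
print(resp.json()["took"], "ms")
```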
The only other abnormality I found is that this large index has a lot of deleted documents - probably 50%.
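This is how I check the deleted-document ratio per shard, sketched with the same placeholder node address and index name (the `_stats` docs metric with `level=shards` reports live vs. deleted counts for every shard copy):

```python
import requests

ES = "http://localhost:9200"  # assumed node address; "myindex" is a placeholder

# Per-shard doc stats: docs.count (live) vs docs.deleted for each shard copy.
stats = requests.get(f"{ES}/myindex/_stats/docs", params={"level": "shards"}).json()

for shard_id, copies in stats["indices"]["myindex"]["shards"].items():
    for copy in copies:
        docs = copy["docs"]
        total = docs["count"] + docs["deleted"]
        ratio = docs["deleted"] / total if total else 0.0
        print(f"shard {shard_id}: {docs['count']} live, "
              f"{docs['deleted']} deleted ({ratio:.0%})")
```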
It doesn't happen all the time, only during periods of increased indexing throughput or shortly afterwards.
`index.merge.scheduler.max_merge_count` is set to 1 to limit the CPU used for merging.
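For reference, merge-scheduler settings are dynamic index settings, so this was applied live via the index settings API (again with a placeholder node address and index name):

```python
import requests

ES = "http://localhost:9200"  # assumed node address; "myindex" is a placeholder

# Update a dynamic merge-scheduler setting on a live index.
resp = requests.put(
    f"{ES}/myindex/_settings",
    json={"index": {"merge": {"scheduler": {"max_merge_count": 1}}}},
)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true} on success
```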
- What can I do to fix these slow queries? I plan to reduce the number of segments by running a force merge (see the sketch after this list) and then probably move the data from this large shard into a separate index.
- Why is the merge thread affecting search so much? There is still CPU time to spare: the nodes have 4 CPUs (8 threads), there is one merge thread, and CPU usage stays around 25% (0.5 CPU for Lucene merging, the rest for search and some indexing).
- Or maybe there is some other reason and my suspicions are incorrect?
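The force merge I plan to run looks roughly like this (same placeholder node address and index name; `only_expunge_deletes` reclaims space from deleted documents without merging everything down to one segment, and since a force merge is itself I/O- and CPU-heavy I would run it off-peak):

```python
import requests

ES = "http://localhost:9200"  # assumed node address; "myindex" is a placeholder

# Rewrite segments to expunge deleted documents; blocks until done.
resp = requests.post(
    f"{ES}/myindex/_forcemerge",
    params={"only_expunge_deletes": "true"},
)
resp.raise_for_status()
print(resp.json())
```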
ES version: 5.3.2