Very slow searches on a very large shard

I have a 3 node cluster with an index which contains 72 shards. There are 330M documents amounting to 100GB according to _cat/indices. This index has ~1000 fields, but each document has no more than 30.

Routing by customer id is used, but one of the customers in this index has abnormally many documents so one of the shards is much larger - 30GB. Other shards are around 0.5-1GB.

The problem is that queries to this large shard often take 20-30 seconds, even for routing id matching not many documents (e.g. 10k). The performance issues happen when there is lucene merging thread active for this index. The queries are rather simple - a set of term/bool filters + sorting by 2-3 fields.

Queries to small (0.5-1GB) shards are fine and take < 200ms.

The only other abnormality I found is that his large index has a lot of deleted documents - probably 50%.

It's not happening all the time, only during increased indexing throughput, or afterwards.
index.merge.scheduler.max_merge_count is set to 1 to limit cpu used for merging.


  • what to do to fix those slow queries? I plan to reduce num of segments by running optimize and then probably move data from this large shard to a separate index
  • why is merging thread affecting searching so much? There is still CPU time to spare.Those nodes have 4 cpus (8 threads), there is one merging thread and cpu usage stays around 25% (0.5 cpu for lucene merging, the rest for searching and some indexing)
  • maybe there is some other reason and my suspicions are incorrect?

ES version: 5.3.2

1 Like

hey Marcin, good questions! Lemme try to explain some things....

this is no good. We should try to fix this down the road. I think you have one or more hotspot customer that you should either try to split out into a separate index or you use a routing partitioned index that is available since 5.3.0. This will solve your hotspot issue and keep your shard sizes at bay.

this sounds counter-intuitive until you look at what merges do and why. When we get a lot of changes to an index we have to build up a lot of new segments that need to be merged since we are append only on the lower levels. At the same time every doc that is updated caused a delete unless it's the first time we see the doc. Now you end up with a lot of data that needs to be read and written to disk. Unfortunately java can't read / write data without going through the FS cache that means we are subject to cache trashing in the case of heavy merging. This is something we try to improve once java 10 is available since then we might be able to use O_DIRECT when opening streams to write / read from disk without going through the cache. I think it's less of a concurrency issue here...

I't try to use routing partitioning first and see what the impact will be if you do.

Actually we already keep large customers in separate indices or clusters (which do not route by customer id), but this one did something that added a lot of nested documents and we noticed too late (and their shard grew).
But partitioned routing sounds like a good way to mitigate this, thanks for the suggestion.

This makes sense. But this also means that additional merge threads could have negative impact, correct?

What metric would be best to look at to detect that such trashing is happening? Something like major page faults, or maybe merges.current_size_in_bytes from /_nodes/stats?


yeah so I think it would be good to have an eye on page cache stats like:

  • page cache accesses / dirties
  • page cache writes / additions

note I think I recall the is a cost to measure this but last time I checked it was less than 5%

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.