Dealing with deleted documents

Hello,

I have a production system under high indexing churn (mostly updates and inserts, with very infrequent deletes). I have been trying to tune the system to keep the percentage of deleted documents stable. Right now, deleted documents make up about 12% of the index, and that figure has been growing by 2% every week, consistently. I would like to stabilize this value to avoid the pitfalls of having too many deleted documents [a]. I have tried a couple of things, without much success:

  • increased store throttling to 200MB/s (I run on SSDs) [b]
  • gradually increased reclaim_deletes_weight. It's at 7.0 right now. [c]
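
For reference, the two changes above correspond roughly to the settings bodies sketched below. This is only a sketch: the setting names are the 1.x ones as I understand them from [b] and [c] (indices.store.throttle.max_bytes_per_sec at the cluster level, index.merge.policy.reclaim_deletes_weight per index), and the helper function names are made up for illustration. Nothing here talks to a cluster; it only builds the JSON bodies.

```python
import json


def throttle_settings(max_bytes_per_sec="200mb"):
    """Body for PUT /_cluster/settings (store throttling, cluster-wide).

    Assumes the 1.x setting name; check it against [b] for your version.
    """
    return {
        "transient": {
            "indices.store.throttle.max_bytes_per_sec": max_bytes_per_sec
        }
    }


def merge_policy_settings(reclaim_deletes_weight=7.0):
    """Body for PUT /<index>/_settings (per-index merge policy).

    Assumes the 1.x setting name; check it against [c] for your version.
    """
    return {"index.merge.policy.reclaim_deletes_weight": reclaim_deletes_weight}


# Print the bodies so they can be pasted into curl or a REST client.
print(json.dumps(throttle_settings(), indent=2))
print(json.dumps(merge_policy_settings(), indent=2))
```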

I also have been monitoring the ES cluster and didn't notice anything that would slow down merges:

  • peak disk writes seem to stay way below the configured 200MB/s, which suggests IO is not saturated.
  • CPU is between 20 and 30% with some occasional spikes.
  • active merge threads are 2 at peak (max size of 4).
  • the number of segments has been stable (~120 per node).
  • the average segment size is way below the 5G maximum (~700MB and growing).
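
In case it helps anyone compare numbers, here is a minimal sketch of how the deleted-document percentage can be derived from the docs.count and docs.deleted fields of an indices stats (_stats) response. It assumes the percentage is measured as deleted / (live + deleted), which may not be how every monitoring tool reports it. The sample figures are invented to match the 12% mentioned above and are not real cluster output.

```python
def deleted_percentage(stats, scope="primaries"):
    """Return deleted docs as a percentage of all docs (live + deleted).

    `stats` is expected to look like the JSON returned by GET /_stats,
    i.e. stats["_all"][scope]["docs"] holds "count" (live) and "deleted".
    """
    docs = stats["_all"][scope]["docs"]
    live, deleted = docs["count"], docs["deleted"]
    total = live + deleted
    return 0.0 if total == 0 else 100.0 * deleted / total


# Invented sample: 88M live documents and 12M deleted ones.
sample = {
    "_all": {"primaries": {"docs": {"count": 88_000_000, "deleted": 12_000_000}}}
}
print(deleted_percentage(sample))  # 12.0
```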

I have not tried to manually expunge deleted documents (using the optimize API), since as far as I know that would strain the cluster too much.
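
For completeness, this is the kind of call I mean. A small sketch that only builds the request URL, assuming the 1.x _optimize endpoint and its only_expunge_deletes parameter; the host and index name are placeholders. As far as I understand, this variant only rewrites segments whose share of deletes is above index.merge.policy.expunge_deletes_allowed, so it should be lighter than a full optimize, but I have not measured that.

```python
from urllib.parse import urlencode


def expunge_deletes_url(base_url, index):
    """Build the URL for POST /<index>/_optimize?only_expunge_deletes=true.

    Only constructs the string; sending the request is left to curl
    or whichever HTTP client you already use.
    """
    query = urlencode({"only_expunge_deletes": "true"})
    return f"{base_url.rstrip('/')}/{index}/_optimize?{query}"


# Placeholder host and index name.
print(expunge_deletes_url("http://localhost:9200", "my_index"))
```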

Could people share their experience managing deleted documents, and more specifically their thoughts on the following questions:

  1. what's a realistic % of deleted documents?
  2. for environments where there is a lot of churn, what are the knobs to tune to keep #1 under control?
  3. it seems like pushing reclaim_deletes_weight too high may be a bad idea [a]. What would be a good way to tell when to stop increasing it?
  4. do people frequently optimize their indexes in prod, especially large indexes (100M+ documents)? What's the impact on search/indexing performance?

I'm running 1.4 on machines with 8 CPUs and 64G of RAM (30G allocated to ES), and everything is on SSDs.

[a] https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
[b] https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html#store-throttling
[c] https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-merge.html

Not sure where you got that 5G shard limit from?

I'd try running a manual optimise; given you have SSDs, the cluster should cope with it pretty well.

Thanks for the feedback. Is there a way to tune Elasticsearch so that I don't need to run optimize? I'm curious why cranking up reclaim_deletes_weight isn't helping.

I probably misunderstood this (from [a] above):

Looks like it's actually the size post-merge (index.merge.policy.max_merged_segment).

Ah right, a shard is not a segment though, different things!