Deleted document count very high compared to the actual count of documents in the index

ES version: 7.5
Number of shards: 5
Number of replicas: 4

We have a use case where documents receive a lot of updates (location updates).
Every update marks the existing document as deleted and indexes a new one in the background.

At one point, the actual record count was ~70k whereas the deleted count went beyond 40 million, and query times grew roughly linearly with it: from ~100 ms to ~3 seconds.

Another observation: after around 12 hours this count comes down to about 4 million (which is still huge). Is there any configuration setting that would run this deletion/merge of segments more frequently, to keep the deleted-docs count under control?

Thanks in advance

Hey,

this sounds like a high count indeed. A couple of questions: do you have any long-running scroll searches or snapshots going on that require certain files to be held open? You can also check with lsof whether there are files marked as deleted but still held open by the Elasticsearch process. You can use the node stats to check for any open search contexts, and also check the indices part of that output.
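As a sketch, those checks might look like this (`my-index` and `<es-pid>` are placeholders):

```
# Open search contexts per node:
GET /_nodes/stats/indices/search

# Doc counts and search stats for the affected index:
GET /my-index/_stats/docs,search

# On each host: files deleted on disk but still held open
# by the Elasticsearch process (lsof marks them "(deleted)"):
lsof -p <es-pid> | grep '(deleted)'
```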

Have you tweaked any Elasticsearch configuration? GC collection or merging options?
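For reference on the merge side, deleted documents can also be expunged manually with the force merge API, and the expunge threshold is tunable per index. A sketch, assuming the index is named `my-index`:

```
# Rewrite only the segments that contain deleted documents:
POST /my-index/_forcemerge?only_expunge_deletes=true

# Lower the percentage of deleted docs a segment must exceed
# to become eligible for expunging (default is 10):
PUT /my-index/_settings
{
  "index.merge.policy.expunge_deletes_allowed": 5
}
```

Note that force merging an index that is still being actively written to is generally discouraged; it is intended for read-only or rarely-updated indices.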

Also, what version are you running on exactly? About how much data are we talking when you only consider the 70k documents?

--Alex

@spinscale, thanks for responding. Please find answers to your queries below:

Do you have any long running scroll searches/snapshots going on, that require certain files to be held open?
-- We do have scroll queries in place (which will soon be replaced with paginated queries), but they are not long-running. At any point in time we might see around 100 scroll queries running. Also, we have increased the maximum open scroll contexts to 10000.
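For reference, that limit is the dynamic cluster setting `search.max_open_scroll_context`; raising it to 10000 looks roughly like this (a sketch):

```
PUT /_cluster/settings
{
  "persistent": {
    "search.max_open_scroll_context": 10000
  }
}
```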

You can also check using lsof if there are files marked as deleted but still held open by the Elasticsearch process. You can use the node stats to check for any open search contexts. And also check the indices part of that output.
--- Below is the output of the index stats (we have an 11-node cluster):

    "docs": {
        "count": 384139,
        "deleted": 8980155
    },
    "search": {
        "open_contexts": 36719,
        "query_total": 151414443,
        "query_time_in_millis": 585255876,
        "query_current": 10,
        "fetch_total": 132808243,
        "fetch_time_in_millis": 80296130,
        "fetch_current": 4,
        "scroll_total": 87969990,
        "scroll_time_in_millis": 2913045740496,
        "scroll_current": 36719,
        "suggest_total": 0,
        "suggest_time_in_millis": 0,
        "suggest_current": 0
    }

In the above result, the count includes all 5 shards.

Have you tweaked any Elasticsearch configuration? GC collection or merging options?
-- We did not tweak either of those. We did, however, reduce the node query cache from 10% of the heap (16 GB heap) to 5 MB. Our search results change constantly: we do a radial search to get the list of people within a radius, and the people's locations change continuously. A large query cache therefore had a very high miss rate and was not worth keeping.

Also, what version are you running on exactly? About how much data are we talking when you only consider the 70k documents?
-- We are running ES 7.5 (recently moved to this version; previously on 2.4). The data is not huge: currently we have 77,223 documents with a total size of 1.74 GB.
Adding to this, the documents were much smaller in ES 2.4 than in ES 7.5. In ES 2.4 the total size for 75k docs was only ~50 MB, but it is far higher in the new version. Were there any changes that would explain this?

We have around 12 fields in the index: one is a geo_point location, one is a nested object with keyword fields, and the others are text fields.
The nested object looks something like this:

    "outer_field": {
        "type": "nested",
        "properties": {
            "inner_field_1": {
                "type": "keyword"
            },
            "inner_field_2": {
                "type": "keyword"
            }
        }
    }

Now that is interesting: you do have a lot of open scroll contexts. What are you using these searches for, and why can't you use a regular search? Also, can you make sure you are clearing your scrolls?

See https://www.elastic.co/guide/en/elasticsearch/reference/7.6/search-request-body.html#request-body-search-scroll - also how to clear a scroll.
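Clearing a scroll is a single request; a sketch, with `<scroll-id>` as a placeholder for the ID returned by the search:

```
# Clear one scroll context:
DELETE /_search/scroll
{
  "scroll_id": "<scroll-id>"
}

# Or clear all open scroll contexts:
DELETE /_search/scroll/_all
```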

@spinscale , thanks for that.
We are already in the process of moving away from scroll (it is a work in progress), as it does not suit our real-time search requirement.

And when you say clear scrolls, do you mean clearing them after a query completes, or doing a periodic cleanup of all scrolls?

Also, any help regarding the huge size of docs in ES 7.5 compared to ES 2.4?

And, is the high deleted document count because of the usage of scroll?

I meant after finishing your searches, as part of the tool that is doing the search. They should also auto-expire, unless you have specified a high timeout. What is the scroll timeout set to?

I suppose that the high deleted document count stems from unclosed scrolls.

@spinscale
We have the scroll parameter set to 3s.
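For context, that 3s is the keep-alive passed with each request; the scroll context stays open for that long after each call. A sketch, with `my-index` and `<scroll-id>` as placeholders:

```
# Initial search, keeping the scroll context alive for 3 seconds:
POST /my-index/_search?scroll=3s
{
  "size": 100,
  "query": { "match_all": {} }
}

# Each follow-up request renews the keep-alive for another 3 seconds:
POST /_search/scroll
{
  "scroll": "3s",
  "scroll_id": "<scroll-id>"
}
```

With such a short keep-alive, each context should expire about 3 seconds after the last request that touches it.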