How can I clear "keyword" entries that have no documents?

I accidentally indexed a document with a wrong text/keyword field. I later removed the document, but the keyword is still there.

For instance, when I run the query:

GET _search
{
  "size": 0,
  "aggs": {
    "group_by_foo": {
      "terms": {
        "field": "foo.keyword",
        "size": 500,
        "min_doc_count": 0
      }
    }
  }
}

I get results that include the wrong keyword with a doc count of 0.

How can I remove these incorrect entries altogether, such that this query will not return the wrong keyword?

Hey Werner,

Using a min_doc_count of 1 should solve the problem.

Tim

Thanks for the suggestion. I should have clarified that this is not the solution I need. Later, I want to filter the query by a certain time range, and I will have to include results with a doc count of 0 here.

For instance, assume the field is a server hostname, and it sends a heartbeat every 10 minutes. I want to find out whether a server is down by checking whether the sum of heartbeats for each hostname is 0 or not, within a certain time interval. So I need the results where the doc count is 0, and all hostnames that actually exist should be included in the aggregation.
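A rough sketch of the kind of query described above is shown below. The index name my-index, the field names hostname.keyword and @timestamp, and the 30-minute window are placeholders chosen for illustration, not values from the real index.

```
# Hypothetical index and field names; adjust to the actual mapping.
GET my-index/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": { "gte": "now-30m" }
    }
  },
  "aggs": {
    "heartbeats_per_host": {
      "terms": {
        "field": "hostname.keyword",
        "size": 500,
        "min_doc_count": 0
      }
    }
  }
}
```

With min_doc_count set to 0, hostnames without any heartbeat in the window come back with a doc_count of 0. The same setting is what makes a leftover term from a deleted document show up in the results.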

Hey Werner,

That makes sense. This blog post describes how Lucene and Elasticsearch handle document deletions: "Lucene's Handling of Deleted Documents" (Elastic Blog). The phenomenon you are observing is called "ghost terms".

I hope this helps.

Thanks, I see what the underlying issue is, and I am aware that this might have performance impacts. Probably it will be a one-time operation to clean up the index.

What I don't directly see from the linked post is what I can do now, if anything at all. The blog post references a deprecated optimize API for Elasticsearch 1.x that doesn't exist anymore.

Do I understand correctly that the force merge action will achieve what I need? In particular, am I correctly understanding that I need to set only_expunge_deletes to true?
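For reference, the call in question would look roughly like this. The index name my-index is a placeholder.

```
# Hypothetical index name. Asks Elasticsearch to rewrite only segments that contain deleted documents.
POST /my-index/_forcemerge?only_expunge_deletes=true
```

One caveat, as far as the force merge documentation describes it: with only_expunge_deletes, only segments whose share of deleted documents exceeds index.merge.policy.expunge_deletes_allowed (10 percent by default) are rewritten, so a single deleted document in a large segment may not be enough to trigger a rewrite.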

This depends on what you still want to do with your index. Force merging is only recommended for read-only indices (see the warning box on the Force merge page). If you still want to write to or update the index, you could wait for a regular background merge. Otherwise, you could roll over (in case of a data stream) and then force merge the old index.
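If you decide to wait, one way to see whether deleted documents are still sitting in the segments is the cat segments API. This is only a sketch, and my-index is a placeholder for your index name.

```
# Hypothetical index name. Lists document and deleted-document counts per segment.
GET _cat/segments/my-index?v&h=index,shard,segment,docs.count,docs.deleted
```

Once docs.deleted is 0 for all segments, the deleted documents have been merged away, and terms that only belonged to them should no longer be listed.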

I could set it to read-only for as long as it force-merges, and then allow writes again, so I think this should be the solution.
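A minimal sketch of that sequence, assuming the index is called my-index (a placeholder) and that blocking writes through the index.blocks.write setting is acceptable for the duration of the merge:

```
# 1. Block writes on the (hypothetical) index.
PUT /my-index/_settings
{
  "index.blocks.write": true
}

# 2. Force merge to expunge deleted documents. The request blocks until the merge is complete.
POST /my-index/_forcemerge?only_expunge_deletes=true

# 3. Allow writes again.
PUT /my-index/_settings
{
  "index.blocks.write": false
}
```

Any clients that write to the index will get errors while the block is in place, so this would need to run in a window where that is tolerable.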

Interesting. I didn't do anything and the field disappeared. Could it be that Elasticsearch automatically cleans up these keywords? (I have not configured anything special for this index.)
