Problems caused by a large number of deleted docs?

I need to delete docs frequently, but ES only flags them as deleted. If there are a lot of deleted docs, will query speed drop? Are there any other problems?

Depending on the use case you can:

  • do nothing. Elasticsearch will eventually remove the deleted docs on its own.
  • run a force merge
  • index into a new index and remove the old one
  • use an index name pattern, such as time-based indices named index-DDMMYYYY, and remove old data by deleting old indices (you can use Curator for that)
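As a rough illustration of the time-based option, here is a minimal Python sketch that builds index names with a day-month-year suffix and picks the ones older than a retention window. The `members` prefix, the 14-day retention and the helper names are hypothetical; actually deleting the selected indices (for example with Curator) is left out.

```python
from datetime import date, datetime, timedelta


def index_name(prefix: str, day: date) -> str:
    """Build a time-based index name such as members-01012024 (DDMMYYYY)."""
    return f"{prefix}-{day.strftime('%d%m%Y')}"


def expired_indices(names, prefix: str, today: date, keep_days: int):
    """Return the names whose date suffix is older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        # Ignore indices that do not follow the prefix-DDMMYYYY pattern.
        if not name.startswith(prefix + "-"):
            continue
        try:
            day = datetime.strptime(name[len(prefix) + 1:], "%d%m%Y").date()
        except ValueError:
            continue
        if day < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    # Hypothetical index list; "other-index" is skipped because of its name.
    names = [
        index_name("members", date(2024, 1, 1)),
        index_name("members", date(2024, 1, 20)),
        "other-index",
    ]
    print(expired_indices(names, "members", date(2024, 1, 31), keep_days=14))
```

Dropping a whole index removes its data without leaving deleted docs behind, which is why this pattern suits data that expires by age.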

Maybe describe your use case in more detail?

I use ES to store member info (about 100 million docs), and I often need to delete member info (docs).

I find that the number of deleted docs grows fast. When will ES remove the deleted docs? If I never use the force merge API, will I have query performance issues after a while?

It'll pick up and merge away those deleted documents asynchronously, whenever it feels like it has enough work to do. For the most part, deleted documents are just a thing you have to live with if you have an index that you are constantly updating or deleting from. They have very little CPU overhead, which should be fine. They cost disk space and proportional IO bandwidth.

What proportion of deleted documents to live documents do you have now?

Right now I have 5% deleted docs; I think I will end up with 30% - 40% deleted docs.

If they cost proportional IO bandwidth, won't I have query performance issues?

I tried 'force merge', but the number of deleted docs only went from 5 million to 3 million. Can I clear the deleted docs completely?

Note that when you update a document you basically delete and reindex.

Have you seen any slowness? I mean that it's often better to let Elasticsearch do its job.
It will eventually remove deleted docs when needed.

Thank you. I haven't seen any slowness so far; I'm just worried about what might happen in the future.

I used _stats and found 3212174 deleted docs. But after I called the index/_forcemerge?only_expunge_deletes=true API, there are still 3065150 deleted docs.

Before:

"primaries": {
  "docs": {
    "count": 74090335,
    "deleted": 3212174
  }
}

After:

"docs": {
  "count": 74090878,
  "deleted": 3065150
}
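To put those two snapshots in perspective, here is a small Python sketch that computes the share of deleted docs from the `docs` part of a `_stats` response. It assumes the ratio is defined as deleted / (count + deleted); the function name is made up for this example.

```python
def deleted_ratio(docs: dict) -> float:
    """Share of deleted docs, assuming ratio = deleted / (count + deleted)."""
    total = docs["count"] + docs["deleted"]
    return docs["deleted"] / total if total else 0.0


# Values copied from the "docs" sections of the two _stats outputs above.
before = {"count": 74090335, "deleted": 3212174}
after = {"count": 74090878, "deleted": 3065150}

if __name__ == "__main__":
    print(f"before: {deleted_ratio(before):.2%}")
    print(f"after:  {deleted_ratio(after):.2%}")
    print(f"reclaimed: {before['deleted'] - after['deleted']} docs")
```

Both snapshots come out at around 4% deleted docs, so the force merge with only_expunge_deletes=true changed very little here.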

Well, I always say something like "don't try to fix issues you don't have".
Monitoring things is great, but let Elasticsearch do its job.

Unless you want to reindex more than 50% of your docs at once, stick with the default behavior.
It has proven to work well for years, at least on the projects I managed before joining Elastic.

Yes. The "problem" here (which is not a problem) is that you probably have more than one segment. You need to tell Elasticsearch to merge down to a single segment with max_num_segments=1.
But again, it's probably a bad idea to run that often. Elasticsearch is smart enough to know the best number of segments to keep.
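For reference, the two force merge variants mentioned in this thread differ only in their query parameters. The Python sketch below just builds the request path to send with POST and does not contact a cluster. The `members` index name and the helper are hypothetical, and the check that rejects combining both parameters is an assumption based on how recent Elasticsearch documentation describes them.

```python
from typing import Optional
from urllib.parse import urlencode


def forcemerge_path(index: str,
                    only_expunge_deletes: bool = False,
                    max_num_segments: Optional[int] = None) -> str:
    """Build the path for POST /<index>/_forcemerge with optional parameters."""
    # Assumption: the two parameters should not be sent in the same request.
    if only_expunge_deletes and max_num_segments is not None:
        raise ValueError("use only_expunge_deletes or max_num_segments, not both")
    params = {}
    if only_expunge_deletes:
        params["only_expunge_deletes"] = "true"
    if max_num_segments is not None:
        if max_num_segments < 1:
            raise ValueError("max_num_segments must be at least 1")
        params["max_num_segments"] = str(max_num_segments)
    query = urlencode(params)
    return f"/{index}/_forcemerge" + (f"?{query}" if query else "")


if __name__ == "__main__":
    # "members" is a hypothetical index name.
    print(forcemerge_path("members", only_expunge_deletes=True))
    print(forcemerge_path("members", max_num_segments=1))
```

Merging to a single segment rewrites the whole index, so it is best kept for rare, deliberate maintenance rather than a regular schedule.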

Unless it's a logging use case, but that's not the case here.