Documents remain under docs.deleted after using the Delete By Query API


(Eason Lau) #1

Hi guys,

I recently deleted documents with the Delete By Query API. This is the command I used:

curl -XPOST 'http://elasticsearch:9200/myindex/_delete_by_query?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "range" : {
        "createTime" : {
           "lte" : "2018-02-08"
        }
    }
  }
}
'

The delete took effect, but the deleted documents are still counted under docs.deleted, as you can see here:

[user@elk ~]$ curl -XGET "http://elasticsearch:9200/_cat/indices/myindex?v"
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myindex  STGtht6tes2upE4WrSOKTg   5   1   46420892      3271000     29.3gb         14.3gb

In addition, the index still occupies the same store.size.

Does anyone know how to reclaim the space used by the deleted documents? Do I need to restart all cluster nodes?

Regards/Eason


(David Pilato) #2

A _forcemerge should help.

Note that this is the reason why using time-based indices is much better!


(Eason Lau) #3

Hi @dadoonet,

Do you mean that I need to run _forcemerge, and that it will really delete the documents?
If I use time-based indices, I would have to delete a whole index, which is not what I want. That is why I use _delete_by_query.


(David Pilato) #4

Yes. This is what I meant. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
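
For context: _delete_by_query only marks documents as deleted inside their segments, and the disk space is reclaimed when those segments are merged. A minimal sketch of triggering that merge with the force merge API, assuming the host and index name from the original post; the function names are illustrative only. The only_expunge_deletes flag limits the merge to segments that contain deleted documents, and depending on the version and merge policy settings it may not remove every deleted document.

```shell
# Host taken from the original post; override ES_URL for your own cluster.
ES_URL="${ES_URL:-http://elasticsearch:9200}"

# Build the force merge URL for one index.
# only_expunge_deletes=true restricts the merge to segments holding deleted docs.
forcemerge_url() {
  printf '%s/%s/_forcemerge?only_expunge_deletes=true' "$ES_URL" "$1"
}

# Run the force merge. This can be I/O heavy, so off-peak hours are a safer choice.
expunge_deletes() {
  curl -s -XPOST "$(forcemerge_url "$1")"
}

# Print docs.count, docs.deleted and store.size for one index.
show_deleted() {
  curl -s -XGET "$ES_URL/_cat/indices/$1?v&h=index,docs.count,docs.deleted,store.size"
}

# Usage:
#   show_deleted myindex      # before
#   expunge_deletes myindex
#   show_deleted myindex      # after: docs.deleted and store.size should drop
```

No node restart is involved; the merge runs on the live index.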

"If I use time-based indices, I would have to delete a whole index, which is not what I want. That is why I use _delete_by_query."

But your request shows: "lte" : "2018-02-08" here?

Which is basically the same thing as doing something like:

DELETE 2017-*,2018-01-*,2018-02-01,2018-02-02,2018-02-03,2018-02-04,2018-02-05,2018-02-06,2018-02-07,2018-02-08

But my version is much more efficient than yours!
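
With time-based indices, removing old data means deleting whole indices, so no documents need to be marked as deleted and merged away later. A minimal sketch with curl, assuming hypothetical daily indices named like myindex-2018-02-08 (the index in this thread is not laid out that way) and the host from the original post; the helper names are illustrative only.

```shell
# Host taken from the original post; override ES_URL for your own cluster.
ES_URL="${ES_URL:-http://elasticsearch:9200}"

# Join index names or wildcard patterns into the comma-separated
# list that the delete index API accepts in the URL path.
join_patterns() {
  local IFS=','
  printf '%s' "$*"
}

# Delete every index matching the given names or patterns.
delete_indices() {
  curl -s -XDELETE "$ES_URL/$(join_patterns "$@")"
}

# Usage (hypothetical index names):
#   delete_indices 'myindex-2017-*' 'myindex-2018-01-*' 'myindex-2018-02-01'
```

Some Elasticsearch versions reject wildcard deletes unless the action.destructive_requires_name setting is false, in which case the indices have to be listed by their full names.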

So if it's a one-time operation, using Delete By Query could be fine as long as you remove only a small subset of the data. Otherwise, if you mean to keep only about 10% of the data, it's better to use the Reindex API and drop the old index.
If it's a task you are going to run every day, use time-based indices and let Curator automate the daily index removal.
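
The reindex route can be sketched as follows: copy only the documents to keep into a new index, then drop the old one. This assumes the createTime field and host from the original post; the destination name myindex_v2 and the function names are hypothetical. The "gt" range is the complement of the "lte" range used in the original delete request.

```shell
# Host taken from the original post; override ES_URL for your own cluster.
ES_URL="${ES_URL:-http://elasticsearch:9200}"

# Print a _reindex request body that copies only documents
# whose createTime is later than the cutoff date.
reindex_body() {
  local src="$1" dst="$2" cutoff="$3"
  cat <<EOF
{
  "source": {
    "index": "$src",
    "query": { "range": { "createTime": { "gt": "$cutoff" } } }
  },
  "dest": { "index": "$dst" }
}
EOF
}

# Send the body to the Reindex API (curl reads it from stdin via -d @-).
reindex_recent() {
  reindex_body "$@" | curl -s -XPOST "$ES_URL/_reindex?pretty" \
    -H 'Content-Type: application/json' -d @-
}

# Usage:
#   reindex_recent myindex myindex_v2 2018-02-08
#   # After checking the document count in myindex_v2:
#   #   curl -XDELETE "$ES_URL/myindex"
```

The new index starts with no deleted documents, so no force merge is needed afterwards.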


(Eason Lau) #5

Oh, one thing I should mention: I have only one index that stores all the data; it is not split into separate indices with a date suffix in the name. So my only option is to delete documents from that index rather than delete whole indices.


(David Pilato) #6

That's exactly what I meant.
It's wrong IMO and you should revisit your architecture.

