_ttl problems: bulk deletion failures

I have the following problem: I have _ttl set on all my indexes. I migrated from 1.5 to 1.7 a few weeks ago, and _ttl was working perfectly in 1.5.
Since yesterday I have been seeing the following errors in the log:
[2016-01-27 11:30:37,361][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [1680]/[2111] items
The cluster is behaving very badly: both searches and indexing are slow.
I found in https://discuss.elastic.co/t/ttl-purge-not-working-after-upgrade-from-1-5-2-to-1-7-1/26787 that this error can be related to the Shield plugin blocking the deletes, but I do not have that plugin installed.
My question is: can I switch off the _ttl removal so it does not try to delete the expired documents?
I am planning to migrate to day- or week-based indexes and delete those instead of using _ttl, but this migration will not be easy: the indexes are big and in use, so I cannot just re-index everything; I need to sort of grow the new ones alongside.

Hi,

There is an index setting called index.ttl.disable_purge; when set to true, documents won't be deleted by the TTL purge service.
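For example (a quick, untested sketch; myindex is just a placeholder name), the setting is dynamic, so you can apply it through the index settings API:

```
# Disable the TTL purge service for one index (placeholder index name).
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index.ttl.disable_purge": true
}'

# Or apply it to every index at once.
curl -XPUT 'localhost:9200/_all/_settings' -d '{
  "index.ttl.disable_purge": true
}'
```

Note that this only stops the purger; the _ttl values stay on the documents, they just won't be expired anymore.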

I'm curious to see more info and the log messages explaining why the deletions failed.

[2016-01-27 03:26:44,494][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [10000]/[10000] items
[2016-01-27 03:26:44,515][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [10000]/[10000] items
[2016-01-27 03:26:44,590][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [10000]/[10000] items
[2016-01-27 03:26:44,749][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [10000]/[10000] items
[2016-01-27 03:26:44,763][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [10000]/[10000] items
[2016-01-27 03:26:44,785][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [10000]/[10000] items
[2016-01-27 03:26:44,796][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [4382]/[4382] items
[2016-01-27 03:26:44,920][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [613]/[613] items
[2016-01-27 03:26:44,974][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [588]/[588] items
[2016-01-27 03:26:45,085][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [650]/[650] items
[2016-01-27 03:26:45,265][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [605]/[605] items
[2016-01-27 03:26:45,360][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [611]/[611] items
[2016-01-27 03:26:45,918][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [4]/[4] items
[2016-01-27 03:26:45,920][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [4]/[4] items
[2016-01-27 03:26:45,921][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [4]/[4] items
[2016-01-27 03:26:45,925][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [3]/[3] items
[2016-01-27 03:26:45,926][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [4]/[4] items
[2016-01-27 03:26:45,927][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [4]/[4] items
[2016-01-27 03:27:10,659][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [7313]/[7313] items
[2016-01-27 03:27:12,115][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [7249]/[7249] items
[2016-01-27 03:27:13,909][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [7257]/[7257] items
[2016-01-27 03:27:15,514][ERROR][indices.ttl ] [elknode2] bulk deletion failures for [7301]/[7301] items
Here is some more of the log; it does not say why :slight_smile: it is failing. But thanks for the tip about index.ttl.disable_purge, it seems to be working: after setting it the cluster is much more responsive. I might be wrong, but it looks like indexing is twice as fast as yesterday.

I have a cluster with 2 nodes on 1.7.0 and 43 indexes with different _ttl times. These are various event logs and I need to keep 30 days' worth of them. Some of the _ttl purging was still working (I could see the total document count decrease in the kopf plugin), so the purging was working for some indexes but not all of them.
I do not know why it was failing, but I am now motivated to remove _ttl and use dated indexes.

That's definitely the way to go, and you should see better performance: the TTL purge service executes bulk deletes of individual documents, whereas deleting time-based indices is basically just freeing resources and removing files.
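With daily indices, retention becomes a single delete call per expired index (sketch; the logs-YYYY.MM.DD index name is hypothetical):

```
# Drop an entire expired day in one call instead of purging documents by _ttl
# (hypothetical daily index name).
curl -XDELETE 'localhost:9200/logs-2015.12.27'
```

Tools like Elasticsearch Curator can automate this kind of time-based cleanup.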

Do you have read-only indices? Are your documents continuously updated in the background? Bulk deletions often fail because a document has been updated in the meantime.

You can also turn on TRACE logging for indices.ttl to learn more about the errors, but be careful because it will be very verbose... Maybe you can snapshot some indices, restore them on a testing cluster, see if you can reproduce the issue there, and then enable TRACE logging.
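In 1.x the logger levels can be changed at runtime through the cluster settings API, so no restart is needed (sketch, assuming the dynamic logger.* cluster setting):

```
# Temporarily raise the TTL purger's log level (transient, so it resets on a
# full cluster restart). Assumes the 1.x dynamic "logger." cluster settings.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.indices.ttl": "TRACE" }
}'

# Put it back to the default once you have captured the failures.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.indices.ttl": "INFO" }
}'
```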

The documents are not supposed to be updated as such (they are log events), but I sometimes parse the same log file more than once, and that would bump the document version. Thank you for your help, but for me the problem is solved and I will not have more time to dig into the cause.

Ok, cool :slight_smile:

That's a terrific idea. TTL is deprecated in Elasticsearch 2+ anyway.