ES Curator Delete Indices Age Filter Fail

Today, we hit a really strange issue with Elasticsearch Curator 5.5.4 delete_indices.

Here is our action rule:

6:
        action: delete_indices
        options:
          ignore_empty_list: True
          timeout_override:
          continue_if_exception: False
        filters:
        - filtertype: age
          source: name
          direction: older
          timestring: '%Y_%m'
          unit: months
          unit_count: 3

So I had an index named like this:

54b2372cdd4d073b77ed99bd5b93930f5c2c0333_2018.10.10

or even like this:

54b2372cdd4d073b77ed99bd5b93930f5c2c0333_204828402480248

BTW, 54b2372cdd4d073b77ed99bd5b93930f5c2c0333 is just one of our git commit SHA-1 hashes.

To our surprise, the above delete_indices action actually matched and deleted both of the indices above.

On paper, nothing in the names of those indices suggests a match for "more than 3 months older than 2018_10". However, they matched on every run, without exception.

2018-10-10 23:01:22,307 DEBUG          curator.indexlist           __actionable:35   Index 54b2372cdd4d073b77ed99bd5b93930f5c2c0333_204828402480248 is actionable and remains in the list.

This "54b2372cdd4d073b77ed99bd5b93930f5c2c0333" seems to be a magic number. If I change to a different SHA-1, the index no longer matches, for example:

2018-10-10 23:01:22,308 DEBUG          curator.indexlist          filter_by_age:546  Index "8f87138194295772625d4eb6365a3f9cdc233cea_2018.10.10" does not meet provided criteria. Removing from list.

Could it be some kind of internal regex match or numeric handling/overflow problem?

Let us know if more details are needed.

Thanks!

@theuntergeek maybe you have an idea about this?

With an SHA-1 hash as an index name, you should not be using source: name as your age filter. You are correct that it is regular-expression related. I could not have guessed (though perhaps I should have?) that someone would use a hash as an index name, where it would contain a large set of letters and numbers. A timestring value of %Y_%m is merely translated to a regular expression like ^.*(\d{4}_\d{2}).*$. After these values are extracted, they are interpreted as a date. You can perhaps see, then, where this might cause issues. I apologize for the bad user experience you have had. Please understand that this use case is not something I ever anticipated. It is definitely not a good thing with indices named like yours, but this sort of naming is atypical. Most people putting a date in their index names are doing indexname-%Y.%m.%d or something like it. This is what source: name was meant to handle.
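
To make the failure concrete, here is a minimal Python sketch that applies a pattern of that shape to the index names from this thread. It only illustrates the regular expression quoted above; the helper name extract_timestring is hypothetical and this is not Curator's actual code.

```python
import re

# Pattern of the shape quoted above for the timestring '%Y_%m'.
# Assumption: this mirrors the quoted regex, not Curator's internals.
PATTERN = re.compile(r'^.*(\d{4}_\d{2}).*$')


def extract_timestring(index_name):
    """Return the text captured as the date portion, or None if no match."""
    match = PATTERN.match(index_name)
    return match.group(1) if match else None


names = [
    '54b2372cdd4d073b77ed99bd5b93930f5c2c0333_2018.10.10',
    '54b2372cdd4d073b77ed99bd5b93930f5c2c0333_204828402480248',
    '8f87138194295772625d4eb6365a3f9cdc233cea_2018.10.10',
]

for name in names:
    print(name, '->', extract_timestring(name))
```

The first two names yield 0333_20, because the hash happens to end in four digits right before the underscore. The third yields no match, since its hash ends in letters. A captured "year" of 0333 is far more than three months in the past, which would be consistent with those indices being selected for deletion.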

The good thing is that you’re not without recourse. First, don’t use source: name. Try creation_date instead, for example. If that doesn’t match properly because the documents inside don’t match the date of creation, then you can use the field_stats source instead, which can actually calculate the minimum and maximum timestamp values in each index. This is the most precise method, though it takes a few milliseconds longer per index to make those calculations.


Hi, Aaron

Thanks for the update and suggestions.

Yeah, it is because "^.*(\d{4}_\d{2}).*$" was matching 0333_20 in the following index name.

54b2372cdd4d073b77ed99bd5b93930f5c2c0333_2018.10.10

We have since fixed the test by putting a double underscore (__) between the index prefix and the actual date string.

It is all good now.

I wish "^.*(\d{4}_\d{2}).*$" could be fine-tuned to match exactly 4 (not 4 or more) digits at the front and exactly 2 (not 2 or more) digits at the end.

Like: "^.*\D(\d{4}_\d{2})\D.*$"

But I know there is only so much we can predict in a random index name.
And we don't want to create and maintain a super complicated regex.
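
For illustration, a stricter pattern along those lines could be sketched in Python as follows. It uses lookarounds rather than \D, so that a date at the very start or end of the name can still match; that substitution is a variation on the pattern above, not something Curator provides. The names STRICT and find_date are hypothetical, and the second index name is an assumed example of the double-underscore naming.

```python
import re

# Hypothetical stricter pattern: exactly 4 digits, '_', exactly 2 digits,
# with no digit immediately before or after the captured text.
STRICT = re.compile(r'(?<!\d)(\d{4}_\d{2})(?!\d)')


def find_date(index_name):
    """Return the captured date text, or None if nothing qualifies."""
    match = STRICT.search(index_name)
    return match.group(1) if match else None


# Original name: '0333_20' is followed by another digit, so it is rejected.
print(find_date('54b2372cdd4d073b77ed99bd5b93930f5c2c0333_2018.10.10'))

# Assumed renamed index with a double underscore before the date.
print(find_date('54b2372cdd4d073b77ed99bd5b93930f5c2c0333__2018_10'))
```

The first call prints None and the second prints 2018_10. This still cannot protect against every possible index name, but it does reject the 0333_20 false match from this thread.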

Thanks,

Hang Du
