Stuck with lots of empty indices for every beat version I've ever used

Hello,

I run a small 2-node cluster with a kitchen sink of beats and elastic agents on approx. 20 hosts. Nevertheless, I have a ton of indices and I'm hitting the cluster.max_shards_per_node limit.

There is not much data in my indices - in fact, some of them are entirely empty. The issue is that every beat, for every version, creates its own ILM-managed index, and some of those versions are long dead.

I have a Lifecycle Policy with a delete phase, but this doesn't help me as Elasticsearch seems to keep re-creating empty indices for a metricbeat version that I'm no longer using. What's keeping the indices alive? What's the best way to tidy this up? And finally, is there an automatic way to clean this up?

Thanks!

Edit: I have tweaking to do with the number of primaries/replicas, sure, but this general question of housekeeping still remains.
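
For reference, a rough sketch of the shard arithmetic behind that limit. It assumes the default cluster.max_shards_per_node of 1000 and that every open primary and replica shard counts toward the cluster-wide limit (data nodes x per-node setting). The index counts are made up for illustration, and shard_budget is a hypothetical helper name:

```python
def shard_budget(indices, data_nodes, max_shards_per_node=1000):
    """Return (shards_used, cluster_limit).

    indices: iterable of (primaries, replicas) tuples, one per open index.
    Assumption: each index contributes primaries * (1 + replicas) shards.
    """
    used = sum(primaries * (1 + replicas) for primaries, replicas in indices)
    limit = data_nodes * max_shards_per_node
    return used, limit


# Hypothetical cluster: 300 indices, each with 1 primary and 1 replica, on 2 data nodes.
used, limit = shard_budget([(1, 1)] * 300, data_nodes=2)
print(f"{used} of {limit} shards in use")
```

Under this model, dropping replicas to 0 on the small indices halves the shard count, but it does not stop new empty generations from appearing.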

Elasticsearch will create indices when asked. In this case there must be something asking it to create these.

Sounds reasonable. So if there truly was no metricbeat 7.13 anymore, ILM would stop rolling over the index and eventually delete all remaining indices as per the ILM policy?

And if there is still a metricbeat, do I have any chance of catching this happening without diving into audit logs?

The ILM policy can exist without any actual indices being created.

So I would delete all the 7.13 indices and then see if any others get created and go from there (if you haven't done that yet).

Hey, sorry, I have to come back to this. I did some testing - created an ILM-managed index with a 1 minute hot -> 3 minute delete policy.
I added one or two documents and let it rest overnight. Right now I'm at generation 120, so something has made this index roll over multiple times while it was completely empty. It's a test index I specifically created, so I can be sure that no component (beat, elastic agent, logstash...) ever requested anything about this index or ingested a document. The behavior seems erratic - right now the indices are >6 minutes old despite a lifetime of at most 4 minutes (1 min hot + 3 min delete).

I'm wondering if there is a better way to explain or debug this behavior. What prompts ILM to check whether a rollover is necessary? It can't be a strict timer, and it also can't be an incoming document.

GET ilmtest/_ilm/explain

{
  "indices" : {
    "ilmtest-000122" : {
      "index" : "ilmtest-000122",
      "managed" : true,
      "policy" : "testpolicy",
      "lifecycle_date_millis" : 1642101859057,
      "age" : "3.96m",
      "phase" : "hot",
      "phase_time_millis" : 1642101859617,
      "action" : "rollover",
      "action_time_millis" : 1642101860217,
      "step" : "check-rollover-ready",
      "step_time_millis" : 1642101860217,
      "phase_execution" : {
        "policy" : "testpolicy",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "set_priority" : {
              "priority" : 100
            },
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "1m"
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1642028874821
      }
    },
    "ilmtest-000120" : {
      "index" : "ilmtest-000120",
      "managed" : true,
      "policy" : "testpolicy",
      "lifecycle_date_millis" : 1642101259707,
      "age" : "13.95m",
      "phase" : "delete",
      "phase_time_millis" : 1642101856411,
      "action" : "delete",
      "action_time_millis" : 1642101856411,
      "step" : "wait-for-shard-history-leases",
      "step_time_millis" : 1642101856411,
      "phase_execution" : {
        "policy" : "testpolicy",
        "phase_definition" : {
          "min_age" : "3m",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1642028874821
      }
    },
    "ilmtest-000121" : {
      "index" : "ilmtest-000121",
      "managed" : true,
      "policy" : "testpolicy",
      "lifecycle_date_millis" : 1642101859017,
      "age" : "3.97m",
      "phase" : "hot",
      "phase_time_millis" : 1642101260308,
      "action" : "complete",
      "action_time_millis" : 1642101860017,
      "step" : "complete",
      "step_time_millis" : 1642101860017,
      "phase_execution" : {
        "policy" : "testpolicy",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "set_priority" : {
              "priority" : 100
            },
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "1m"
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1642028874821
      }
    }
  }
}

I think this is actually all intended behaviour.

First of all, ILM only runs every 10 minutes in the default configuration.

indices.lifecycle.poll_interval

(Dynamic, time unit value) How often index lifecycle management checks for indices that meet policy criteria. Defaults to 10m.
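
Since the setting is dynamic, it can be changed through the cluster settings API (PUT _cluster/settings). A minimal sketch of building that request with the Python standard library; the base URL and the "1m" value are assumptions for a test cluster, poll_interval_request is a hypothetical helper name, and authentication/TLS are left out:

```python
import json
import urllib.request


def poll_interval_request(base_url, interval):
    """Build (but do not send) a PUT _cluster/settings request
    that sets indices.lifecycle.poll_interval persistently."""
    body = {"persistent": {"indices.lifecycle.poll_interval": interval}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed local test cluster; send with urllib.request.urlopen(req) if desired.
req = poll_interval_request("http://localhost:9200", "1m")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

A short interval like this is only sensible for experiments with minute-scale policies such as the test policy above.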

Let's assume an unused index metricbeat-7.13 that has a size limit as well as a time limit in its ILM policy. Since it's unused, ILM will trigger if and only if the time criterion is met. It checks this every 10 minutes in the default config. For ILM it does not matter that no data has been ingested - when the time limit has been reached, it will roll over the index and create a new generation.
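
This also explains the timing in the test above. A simplified model, assuming ILM only acts on poll ticks, that the delete phase's min_age is counted from the rollover time, and ignoring the few seconds the individual steps take (the function names are made up for this sketch):

```python
def next_poll(t, poll_interval):
    """Earliest poll tick (a multiple of poll_interval, in seconds) at or after t."""
    return -(-t // poll_interval) * poll_interval  # integer ceiling division


def simulate_lifetime(created, max_age, delete_min_age, poll_interval):
    """Return (rollover_time, delete_time) in seconds for an idle index.

    Rollover fires at the first poll after the index is max_age old;
    deletion fires at the first poll after delete_min_age has passed
    since that rollover.
    """
    rolled_over = next_poll(created + max_age, poll_interval)
    deleted = next_poll(rolled_over + delete_min_age, poll_interval)
    return rolled_over, deleted


# Test policy from above: max_age 1m, delete min_age 3m, default 10m poll interval.
print(simulate_lifetime(created=0, max_age=60, delete_min_age=180, poll_interval=600))
```

With these numbers the model gives a rollover after 600 s and a delete after 1200 s, so an index nominally limited to 4 minutes lives about 20. That matches the explain output above, where the lifecycle_date_millis values of ilmtest-000120 and ilmtest-000121 are roughly 600 seconds apart.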

So my takeaway is, ILM will keep a number of "zombie indices" alive and rolling over for all beat versions that have ever been used in a cluster. Or is that wrong?

I'd recommend giving a thumbs up to: [Rollover] Autodelete Empty Indices · Issue #73349 · elastic/elasticsearch · GitHub. You can stop this by deleting the current "writable" index; this will prevent ILM from creating any additional new empty indices, provided you are in fact no longer writing to the index pattern.
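
To find the candidates, something like the sketch below could work on the parsed JSON from GET _cat/indices/<pattern>?format=json and GET _alias/<alias>. It assumes docs.count comes back as a string and that the alias response marks the write index with is_write_index; the index names and counts are invented sample data, and the helper names are hypothetical:

```python
def empty_indices(cat_rows):
    """Names of indices whose docs.count is "0" in _cat/indices JSON output."""
    return sorted(row["index"] for row in cat_rows if row.get("docs.count") == "0")


def write_index(alias_response, alias):
    """Name of the index flagged as write index for the alias, or None."""
    for index, body in alias_response.items():
        if body.get("aliases", {}).get(alias, {}).get("is_write_index"):
            return index
    return None


# Invented sample responses for a retired metricbeat version.
cat_rows = [
    {"index": "metricbeat-7.13.0-2021.06.01-000041", "docs.count": "0"},
    {"index": "metricbeat-7.13.0-2021.06.01-000042", "docs.count": "0"},
    {"index": "metricbeat-7.16.2-2021.12.20-000003", "docs.count": "1523"},
]
alias_response = {
    "metricbeat-7.13.0-2021.06.01-000041": {
        "aliases": {"metricbeat-7.13.0": {"is_write_index": False}}
    },
    "metricbeat-7.13.0-2021.06.01-000042": {
        "aliases": {"metricbeat-7.13.0": {"is_write_index": True}}
    },
}

print(empty_indices(cat_rows))
print(write_index(alias_response, "metricbeat-7.13.0"))
```

The write index is the one to delete if the version is really retired; the older empty generations are then removed by the policy's delete phase or can be deleted by hand.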
