This request follows a previous ticket that was never really answered.
I have created a data stream with an ILM policy whose phases range from Hot to Delete.
The backing indices successfully move into the Frozen phase but never continue on to the Delete phase.
Why are the indices not moving to the delete phase?
Here is the lifecycle policy:
{
"filebeat" : {
"version" : 11,
"modified_date" : "2022-05-11T10:05:49.653Z",
"policy" : {
"phases" : {
"frozen" : {
"min_age" : "70m",
"actions" : {
"searchable_snapshot" :…
ES 8.8.2, free trial locally; paid Enterprise licence in another environment.
Same problem here. I have created a data stream with an ILM policy with the phases Hot, Frozen, and Delete.
The backing indices are successfully moved into the Frozen phase but never continue into the Delete phase.
Why are the indices not moving to the delete phase?
First, add some cluster settings and a lifecycle policy with an AWS S3 searchable snapshot repository.
PUT _cluster/settings
{
"transient": {
"indices.lifecycle.poll_interval": "30s"
}
}
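Side note on the setting above: transient cluster settings do not survive a full cluster restart, so for anything longer than a quick test the persistent variant is safer (same setting; the default poll interval is 10m):

```
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "30s"
  }
}
```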
PUT _ilm/policy/test-policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"set_priority": {
"priority": 100
},
"rollover": {
"max_primary_shard_size": "50mb",
"max_age": "10s"
}
}
},
"frozen": {
"min_age": "5m",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "snapshot_s3_repository"
}
}
},
"delete": {
"min_age": "15m",
"actions": {
"delete": {}
}
}
}
}
}
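One thing worth double-checking in the delete phase: by default the delete action also removes the searchable snapshot backing the index. Making that explicit with the documented `delete_searchable_snapshot` option (shown here only to illustrate the default, not as a fix) would look like:

```
"delete": {
  "min_age": "15m",
  "actions": {
    "delete": {
      "delete_searchable_snapshot": true
    }
  }
}
```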
Then create an index template with the lifecycle policy and data stream enabled, and add some data.
PUT _index_template/test-template
{
"index_patterns": ["test-index*"],
"data_stream": { },
"template": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date",
"format": "date_optional_time||epoch_millis"
}
}
},
"settings": {
"lifecycle": {
"name": "test-policy"
},
"number_of_shards": 1,
"number_of_replicas": 0
}
}
}
POST test-index-1/_doc
{
"field1": "someValue2",
"@timestamp": 1689718060023
}
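After indexing, it's worth confirming that the data stream exists and that ILM is not reporting any errors before digging into logs. These are standard APIs, nothing specific to this repro:

```
GET _data_stream/test-index-1

GET test-index-1/_ilm/explain?only_errors=true
```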
Here is an excerpt of elasticsearch.log after one hour:
[2023-07-19T00:02:51,667][INFO ][o.e.c.s.ClusterSettings ] [AWA-LAPTOP] updating [indices.lifecycle.poll_interval] from [10s] to [30s]
[2023-07-19T00:03:52,957][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [AWA-LAPTOP] adding index lifecycle policy [test-policy]
[2023-07-19T00:06:17,269][INFO ][o.e.c.m.MetadataIndexTemplateService] [AWA-LAPTOP] adding index template [test-template] for index patterns [test-index*]
[2023-07-19T00:06:34,162][INFO ][o.e.c.m.MetadataIndexTemplateService] [AWA-LAPTOP] updating index template [test-template] for index patterns [test-index*]
[2023-07-19T00:08:28,387][INFO ][o.e.c.m.MetadataCreateIndexService] [AWA-LAPTOP] [.ds-test-index-1-2023.07.18-000001] creating index, cause [initialize_data_stream], templates [test-template], shards [1]/[0]
[2023-07-19T00:08:28,391][INFO ][o.e.c.m.MetadataCreateDataStreamService] [AWA-LAPTOP] adding data stream [test-index-1] with write index [.ds-test-index-1-2023.07.18-000001], backing indices [], and aliases []
[2023-07-19T00:08:28,817][INFO ][o.e.x.i.IndexLifecycleTransition] [AWA-LAPTOP] moving index [.ds-test-index-1-2023.07.18-000001] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [test-policy]
[2023-07-19T00:08:28,960][INFO ][o.e.x.i.IndexLifecycleTransition] [AWA-LAPTOP] moving index [.ds-test-index-1-2023.07.18-000001] from [{"phase":"new","action":"complete","name":"complete"}] to [{"phase":"hot","action":"set_priority","name":"set_priority"}] in policy [test-policy]
[2023-07-19T00:08:29,367][INFO ][o.e.c.m.MetadataMappingService] [AWA-LAPTOP] [.ds-test-index-1-2023.07.18-000001/2ff-u2EuQGmSQOyLB9qKKQ] update_mapping [_doc]
[2023-07-19T00:08:29,516][INFO ][o.e.x.i.IndexLifecycleTransition] [AWA-LAPTOP] moving index [.ds-test-index-1-2023.07.18-000001] from [{"phase":"hot","action":"set_priority","name":"set_priority"}] to [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}] in policy [test-policy]
(elasticsearch.log truncated)
ILM Explain
GET test-index-1/_ilm/explain
{
"indices": {
".ds-test-index-1-2023.07.18-000001": {
"index": ".ds-test-index-1-2023.07.18-000001",
"managed": true,
"policy": "test-policy",
"index_creation_date_millis": 1689718108382,
"time_since_index_creation": "33.78m",
"lifecycle_date_millis": 1689718131813,
"age": "33.39m",
"phase": "frozen",
"phase_time_millis": 1689718461739,
"action": "searchable_snapshot",
"action_time_millis": 1689718461739,
"step": "wait-for-index-color",
"step_time_millis": 1689718524472,
"repository_name": "snapshot_s3_repository",
"snapshot_name": "2023.07.18-.ds-test-index-1-2023.07.18-000001-test-policy-dgun9fltszwfctmnc1zj-a",
"step_info": {
"message": "index is not green; not all shards are active"
},
"phase_execution": {
"policy": "test-policy",
"phase_definition": {
"min_age": "5m",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "snapshot_s3_repository",
"force_merge_index": true
}
}
},
"version": 1,
"modified_date_in_millis": 1689717832957
}
},
".ds-test-index-1-2023.07.18-000002": {
"index": ".ds-test-index-1-2023.07.18-000002",
"managed": true,
"policy": "test-policy",
"index_creation_date_millis": 1689718131896,
"time_since_index_creation": "33.39m",
"lifecycle_date_millis": 1689718131896,
"age": "33.39m",
"phase": "hot",
"phase_time_millis": 1689718132431,
"action": "rollover",
"action_time_millis": 1689718133239,
"step": "check-rollover-ready",
"step_time_millis": 1689718133239,
"phase_execution": {
"policy": "test-policy",
"phase_definition": {
"min_age": "0ms",
"actions": {
"set_priority": {
"priority": 100
},
"rollover": {
"max_age": "10s",
"max_primary_shard_size": "50mb"
}
}
},
"version": 1,
"modified_date_in_millis": 1689717832957
}
}
}
}
My index is still green, though:
GET /_cat/indices/.ds-test-index-1-2023.07.18-000001?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .ds-test-index-1-2023.07.18-000001 2ff-u2EuQGmSQOyLB9qKKQ 1 0 1 0 4.3kb 4.3kb
I don't really see why I get "index is not green; not all shards are active" when everything looks green, nor why the transition to delete never happens. Maybe I should open an issue on GitHub?
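One possible explanation (an assumption on my part, since I can't see your node configuration): the `wait-for-index-color` step is waiting on the partially mounted searchable snapshot index, which ILM names with a `partial-` prefix, not on the original backing index that `_cat/indices` reports as green. A partially mounted index can only be allocated to a node with the `data_frozen` role and a configured shared snapshot cache (`xpack.searchable.snapshot.shared_cache.size`); on a default single-node local setup those shards can stay unassigned indefinitely, which would match the "not all shards are active" message. A few diagnostics to narrow it down:

```
GET _cat/shards/partial-*?v

GET _cat/nodes?v&h=name,node.role

GET _cluster/allocation/explain
```

The last call, with no body, explains why the first unassigned shard it finds cannot be allocated.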
I have a feeling this isn't the first report; there seem to be similar issues already:
Issue opened 05:28 PM, 03 Sep 21 UTC; closed 07:19 PM, 04 Mar 22 UTC. Labels: >bug, :Data Management/ILM+SLM, Team:Data Management
**Elasticsearch version** (`bin/elasticsearch --version`): 7.14.1
**Description of the problem including expected versus actual behavior**:
If an ILM policy uses both `searchable_snapshot` and `allocate` in the cold phase, then the `allocate` action won't work right, and in fact will get wedged permanently (see workaround below for how to unstick it).
**Steps to reproduce**:
```
PUT _cluster/settings
{
"transient": {
"indices.lifecycle.poll_interval": "30s"
}
}
PUT _ilm/policy/test-policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0m",
"actions": {
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "2m",
"actions": {
"set_priority": {
"priority": 50
}
}
},
"cold": {
"min_age": "4m",
"actions": {
"set_priority": {
"priority": 100
},
"searchable_snapshot": {
"snapshot_repository": "found-snapshots"
},
"allocate": {
"number_of_replicas": 0
}
}
},
"frozen": {
"min_age": "6m",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "found-snapshots"
}
}
}
}
}
}
PUT _template/test-template
{
"index_patterns": ["test-index*"],
"settings": {
"lifecycle": {
"name": "test-policy"
},
"number_of_shards": 1,
"number_of_replicas": 1
}
}
POST test-index-1/_doc
{
"field1": "someValue2"
}
```
Once the policy gets to the `allocate` action & step, it'll just sit there forever:
```
GET test-index-1/_ilm/explain
{
"indices" : {
"restored-test-index-1" : {
"index" : "restored-test-index-1",
"managed" : true,
"policy" : "test-policy",
"lifecycle_date_millis" : 1630689049418,
"age" : "6.5m",
"phase" : "cold",
"phase_time_millis" : 1630689317926,
"action" : "allocate",
"action_time_millis" : 1630689318037,
"step" : "allocate",
"step_time_millis" : 1630689379467,
"repository_name" : "found-snapshots",
"snapshot_name" : "2021.09.03-test-index-1-test-policy-vx9cjvout8q3ahibxej20g",
"phase_execution" : {
"policy" : "test-policy",
"phase_definition" : {
"min_age" : "4m",
"actions" : {
"allocate" : {
"number_of_replicas" : 0,
"include" : { },
"exclude" : { },
"require" : { }
},
"searchable_snapshot" : {
"snapshot_repository" : "found-snapshots",
"force_merge_index" : true
},
"set_priority" : {
"priority" : 100
}
}
},
"version" : 4,
"modified_date_in_millis" : 1630689027809
}
}
}
}
```
**Workaround**:
For any stuck indices, if you manually move the stuck index to complete, then everything will pick back up.
```
POST /_ilm/move/restored-test-index-1
{
"current_step": {
"phase": "cold",
"action": "allocate",
"name": "allocate"
},
"next_step": {
"phase": "cold",
"action": "complete",
"name": "complete"
}
}
```
Issue opened 03:53 PM, 27 Aug 21 UTC. Labels: >bug, :Data Management/ILM+SLM, Team:Data Management
**Elasticsearch version** : 7.14.0
**Plugins installed**: []
**Description of the problem including expected versus actual behavior**:
**Current behavior**: ILM doesn't delete searchable snapshots when one of the associated indices waiting to go from a hot to a cold phase and then to a delete phase finds itself in a red status. Even if the index later recovers, the ILM policy doesn't return to deleting the searchable snapshots for any of the subsequent indices. This has the consequence that the storage can fill up on a hot or cold node because the snapshot is fully mounted.
**Expected behavior**: ILM should resume to deleting searchable snapshot for the next indices regardless of failures on previous indices.
**Steps to reproduce**:
1. Create ILM policy with moving indices from hot to cold nodes with fully mounted searchable snapshots, with these settings:
<details> <summary> ILM Settings </summary>
```
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_primary_shard_size" : "3mb",
"max_age" : "30s"
},
"set_priority" : {
"priority" : 100
}
}
},
"delete" : {
"min_age" : "90s",
"actions" : {
"delete" : {
"delete_searchable_snapshot" : true
}
}
},
"cold" : {
"min_age" : "1m",
"actions" : {
"allocate" : {
"number_of_replicas" : 0,
"include" : { },
"exclude" : { },
"require" : { }
},
"searchable_snapshot" : {
"snapshot_repository" : "found-snapshots",
"force_merge_index" : true
},
"set_priority" : {
"priority" : 0
}
}
}
}
}
```
</details>
2. Create index template with 3 shards:
<details> <summary> Index Template </summary>
```
PUT _index_template/template1
{
"index_patterns": ["test*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 0,
"index.lifecycle.name": "move_hot_cold_ss",
"index.lifecycle.rollover_alias": "testkibana"
}
}
}
```
</details>
3. Bootstrap the index.
4. Observe ILM process runs smoothly
5. Cause a red index status (for my repro I paused one of my hot nodes to cause this).
6. Wait for ILM process to fail on move to cold phase because index is missing primary shards.
7. Recover cluster to green status (I resumed my hot node).
8. Observe how ILM no longer deletes searchable snapshot indices.
9. Observe how ILM process continues rolling over indices, creating searchable snapshots, and not deleting the searchable snapshots.
10. Observe how the ilm-history index doesn't show any errors or failures (besides the one where the primary shards were gone).
11. ILM history doesn't provide any errors in the running phases.
<details> <summary> screengrab of ilm history </summary>
Correct ILM behavior before triggering issue:
<img width="1450" alt="Screen Shot 2021-08-26 at 19 27 07" src="https://user-images.githubusercontent.com/62263912/131052704-14217ec4-809e-4c9e-92bd-74476e745157.png">
After triggering issue, note no delete step:
<img width="1444" alt="Screen Shot 2021-08-26 at 19 28 28" src="https://user-images.githubusercontent.com/62263912/131052795-b359a54f-e2e3-4fc5-8772-afa76039a0c5.png">
</details>
12. Observe how any GET ilm explains don't tell you any errors, for example:
<details> <summary> get ilm explains for index and searchable snapshot index </summary>
```
{
"indices" : {
"restored-testkibana_sample-000043" : {
"index" : "restored-testkibana_sample-000043",
"managed" : true,
"policy" : "move_hot_cold_ss",
"lifecycle_date_millis" : 1630022898837,
"age" : "3.73m",
"phase" : "cold",
"phase_time_millis" : 1630022960237,
"action" : "allocate",
"action_time_millis" : 1630022963864,
"step" : "allocate",
"step_time_millis" : 1630023086716,
"repository_name" : "found-snapshots",
"snapshot_name" : "2021.08.27-testkibana_sample-000043-move_hot_cold_ss-vyjwezscr2uq3qwuls0gfg",
"phase_execution" : {
"policy" : "move_hot_cold_ss",
"phase_definition" : {
"min_age" : "1m",
"actions" : {
"allocate" : {
"number_of_replicas" : 0,
"include" : { },
"exclude" : { },
"require" : { }
},
"searchable_snapshot" : {
"snapshot_repository" : "found-snapshots",
"force_merge_index" : true
},
"set_priority" : {
"priority" : 0
}
}
},
"version" : 3,
"modified_date_in_millis" : 1630021406861
}
}
}
}
```
</details>
**Things I tried to recover from the issue**
1. Tried stopping, starting ILM
2. Tried deleting snapshots and indices where the issue had occurred.
3. A friend suggested forcing the stuck indices to the ILM next step, but that is not scalable when the issue has been going on for a while and there is a whole lot of indices with the problem.
**Interesting observation**
If you add the same policy to a different index pattern, the same issue happens, however a new ilm policy doesn't get affected (or other existing policies). So it seems this affects only the policy associated to the index with the failures.
Seems the only available workaround is to associate the new indices to a new policy of the same type through the index template.