Index rollover and ILM issue

Hello

I hope my message finds the community & their loved ones safe and healthy.

  1. I have a 3-node cluster.

  2. 2 nodes store data and carry out processing

  3. 1 node is a voting only node.

  4. All the data to Elasticsearch is sent via two Logstash nodes running multiple & mirrored configurations (pipelines). Load balancing is done via the configuration on the hosts sending logs:
    output.logstash: hosts: ["logging.maniarfamily.com:port","logging.maniarfamily.com:port"]

  5. Logstash has the following Elasticsearch configurations:

output {
   if [type] == "cowrie" {
       elasticsearch {
           hosts => ["https://ip1:9200","https://ip2:9200"]
           #data_stream => true  # Causes errors; added after reading https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-data_stream while diagnosing cowrie ingestion causing data duplication.
           index => "cowrie-logstash-%{+yyyy.MM.dd}"
           ssl => true
           user => ''
           password => ''
           cacert => '/etc/logstash/elasticsearch-ca.pem'
           ssl_certificate_verification => false
           ilm_enabled => "auto"
           ilm_rollover_alias => "cowrie-logstash"
       }
       #file {
       #    path => "/tmp/cowrie-logstash.log"
       #    codec => json
       #}
       #stdout {
       #    codec => rubydebug
       #}
   }
}
  6. All indices & lifecycle policies work correctly except one related to the honeypot data I collect.

Data from the honeypots is sent via Filebeat --> Logstash --> Elasticsearch.

  7. The cluster is running the latest release (8.17.2).

  8. ILM details of the index that is not rolling over (the explain request used for this is sketched after this list).

  9. This is the current size of the index :frowning:

  10. All previous indices have rolled over correctly & they have the same configuration.

In 2021, as a desperate measure, I deleted hidden indices because the cluster was over 1,200 indices (or whatever the default limit was). I am not sure if that is what caused this.

  11. The problem persists in all indices post 2021; however, since no index has grown beyond 50 GB, I am not too worried. Still, I am confident I have broken something in Elasticsearch that supports ILM.
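
For reference, the ILM details in the screenshots above come from the explain API; a request along these lines (the index pattern is illustrative) reproduces them:

GET cowrie-logstash-*/_ilm/explain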

I am perfectly happy to share any configuration or diagnostic logs to help resolve this.

I have attempted to resolve this earlier: ILM policy not applied

Thank you very much.

It seems like you don't have an alias defined in the template; it should be something like this.

Go to Management > Index Templates and look for the cowrie-logstash template.

The name might be different; maybe your template will say 30-days-single-phase as the lifecycle.name.

{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "cowrie-logstash",
          "rollover_alias": "cowrie-logstash"
        },
        "number_of_shards": "1",
        "number_of_replicas": "1"
      }
    },
    "aliases": {},
    "mappings": {}
  }
}

What ILM does is use this template to roll over your index.
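
If you prefer the API to the Kibana UI, a quick check along these lines (assuming the template really is named cowrie-logstash; adjust the name if yours differs) should return the same settings:

GET _index_template/cowrie*

The response should show index.lifecycle.name and index.lifecycle.rollover_alias under template.settings.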

I am sorry, but the policy I had put in place was a desperate attempt.

Here is the one set via the Logstash configuration:

{
  "policy": "logstash-policy",
  "phase_definition": {
    "min_age": "0ms",
    "actions": {
      "rollover": {
        "max_age": "30d",
        "min_docs": 1,
        "max_primary_shard_docs": 200000000,
        "max_primary_shard_size": "50gb"
      }
    }
  },
  "version": 4,
  "modified_date_in_millis": 1622398401153
}
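
The full policy as stored in the cluster can also be retrieved via the ILM API; for example:

GET _ilm/policy/logstash-policy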

Here is the output of ILM Explain

{
    "indices": {
        "cowrie-logstash-2021.12.30-000018": {
            "index": "cowrie-logstash-2021.12.30-000018",
            "managed": true,
            "policy": "logstash-policy",
            "index_creation_date_millis": 1640884225151,
            "time_since_index_creation": "1155.83d",
            "lifecycle_date_millis": 1640884225151,
            "age": "1155.83d",
            "phase": "hot",
            "phase_time_millis": 1740748726875,
            "action": "rollover",
            "action_time_millis": 1740748726875,
            "step": "check-rollover-ready",
            "step_time_millis": 1740748726875,
            "phase_execution": {
                "policy": "logstash-policy",
                "phase_definition": {
                    "min_age": "0ms",
                    "actions": {
                        "rollover": {
                            "max_age": "30d",
                            "max_primary_shard_docs": 200000000,
                            "min_docs": 1,
                            "max_primary_shard_size": "50gb"
                        }
                    }
                },
                "version": 4,
                "modified_date_in_millis": 1622398401153
            }
        }
    }
}

Thank you for helping.

Hello @parthmaniar , that is indeed pretty weird behaviour.

However, I see some inconsistencies in some of the screenshots, my guess is that they are the results of some manual actions which makes it hard to figure out what's going on. I will try to do a recap here and you can help by confirming or sharing some of the outputs.

  • I assume ILM is currently running, because you mentioned your other policies are working correctly, but can you confirm it by running again and sharing the status:
https://IP:9200/_ilm/status
  • According to the image at point 8, it looks like the index cowrie-logstash-2021.12.30-000018 has completed the hot phase. This should mean that it had rolled over already unless there was any manual action to change this.
  • In your screenshot at point 10 there is a banner saying that 1 index has lifecycle errors, can you share the error?
  • You mention that the problem persists in all indices post 2021, but you mentioned in the beginning that all other policies and indices are working correctly. Can you elaborate what exactly is not working since 2021?
  • In a later comment of yours, you share the output of ILM explain for the same index, managed by a different policy, and it looks like it is on the check-rollover-ready step, which implies manual action again. Right?

Before we try to resolve this, I would like to collect some information so we can better understand what's going on.

  • Do you see any logs related to this index and ILM or rollover requests?
  • Can you retrieve and share with us the indices this alias is pointing to: GET _alias/cowrie-logstash
  • Can you run:
POST cowrie-logstash/_rollover?dry_run=true
{
  "conditions": {
    "max_age": "30d",
    "max_primary_shard_size": "50gb",
    "min_docs": "1"
  }
}

And share with us the response.

  • Can you execute GET /_cluster/state/metadata/cowrie-logstash-2021.12.30-000018 and share with us the content of the settings, ilm, rollover_info, and aliases fields?
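
(Optional) A filter_path query along these lines should trim the response to just those fields:

GET _cluster/state/metadata/cowrie-logstash-2021.12.30-000018?filter_path=metadata.indices.*.settings,metadata.indices.*.ilm,metadata.indices.*.rollover_info,metadata.indices.*.aliases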

Thank you, I think we will be in a better place to suggest some next steps then.

2 Likes

Hello,

I apologise for the haphazard responses; I'm dealing with personal and professional commitments. :slight_smile:

  1. ILM status does not show any errors. Output for GET /_ilm/status is
{
  "operation_mode": "RUNNING"
}
  2. I am skipping the second bullet point; it is answered subsequently.

  3. The index with the error was sensitive, and hence I did not dwell on it. I am, however, happy and comfortable sharing it.

    A. As you can see, the error appears and disappears without my intervention. Today (02-03-2025, ~1300 IST) there is no error warning :expressionless:

    B. The index it shows the error for is the largest one here:

I understand that the index name relates to a sensitive topic; the purpose was to evaluate electronic warfare and was part of my research at the university.

  4. My statement that only one index failed rollover was incorrect. The correct status is that no index after ~2021 has rolled over. Additionally, some indices, such as packetbeat/auditbeat, are merged into a single index instead of following the configuration in the Logstash pipeline.

  5. Yes, in one of the screenshots, I was attempting to create a new policy to force a rollover (purely out of desperation).

  6. May I please request the process to view ILM error logs? I am running Logstash on Raspberry Pis, and I have removed logging for Logstash to save on SD card I/O. Elasticsearch + Kibana are running on VMs where I should have logging (default state). Let me know where to check, and I will provide the logs.

  7. Output for GET _alias/cowrie-logstash

{
  "cowrie-logstash-2021.06.03-000011": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.05.04-000010": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.11.30-000017": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.12.30-000018": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": true
      }
    }
  },
  "cowrie-logstash-2021.09.01-000014": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2020.12.05-000005": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.07.03-000012": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.10.31-000016": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2020.10.06-000003": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.01.04-000006": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2020.08.09-000001": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.02.03-000007": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.08.02-000013": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2020.09.06-000002": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.03.05-000008": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.10.01-000015": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2020.11.05-000004": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  },
  "cowrie-logstash-2021.04.04-000009": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": false
      }
    }
  }
}
  8. Output for POST cowrie-logstash/_rollover?dry_run=true with the body:
{
  "conditions": {
    "max_age": "30d",
    "max_primary_shard_size": "50gb",
    "min_docs": "1"
  }
}

Response

  "acknowledged": false,
  "shards_acknowledged": false,
  "old_index": "cowrie-logstash-2021.12.30-000018",
  "new_index": "cowrie-logstash-2025.03.02-000019",
  "rolled_over": false,
  "dry_run": true,
  "lazy": false,
  "conditions": {
    "[max_age: 30d]": true,
    "[max_primary_shard_size: 50gb]": true,
    "[min_docs: 1]": true
  }
}
  9. Output for GET /_cluster/state/metadata/cowrie-logstash-2021.12.30-000018

settings:

   "indices": {
      "cowrie-logstash-2021.12.30-000018": {
        "version": 1482,
        "mapping_version": 20,
        "settings_version": 16,
        "aliases_version": 4,
        "routing_num_shards": 1024,
        "state": "open",
        "settings": {
          "index": {
            "lifecycle": {
              "name": "logstash-policy",
              "rollover_alias": "cowrie-logstash",
              "indexing_complete": "true"
            },
            "routing": {
              "allocation": {
                "include": {
                  "_tier_preference": "data_content"
                }
              }
            },
            "refresh_interval": "5s",
            "number_of_shards": "1",
            "provided_name": "<cowrie-logstash-{now/d}-000018>",
            "creation_date": "1640884225151",
            "priority": "100",
            "number_of_replicas": "1",
            "uuid": "g5GyH1aFTpC9n8pEGA0dLg",
            "version": {
              "created": "7160299"
            }
          }
        },

ILM

        "ilm": {
          "phase": "hot",
          "phase_definition": """{"policy":"logstash-policy","phase_definition":{"min_age":"0ms","actions":{"rollover":{"max_age":"30d","max_primary_shard_size":"50gb"}}},"version":4,"modified_date_in_millis":1622398401153}""",
          "action_time": "1740749317958",
          "phase_time": "1740748726875",
          "action": "complete",
          "step": "complete",
          "creation_date": "1643743227492",
          "step_time": "1740749317958"
        },

Alias:

        "aliases": [
          "cowrie-logstash"
        ],

Rollover Info:

        "rollover_info": {
          "cowrie-logstash": {
            "met_conditions": {},
            "time": 1643743227492
          }
        },

I hope this helps, and I am again sorry for leaving the same thread halfway a year ago and putting in conflicting information. It was not intentional, and I will be more prudent.

Thank you.

Can you post another health report from Elasticsearch (GET /_health_report)? When you posted the last one you were very close to the cluster shard limit. If you have not adjusted the shard limit, you are extremely close to it right now ("active_shards": 2382); your previously shared limit was 2400. Lots of funky stuff happens when you hit the shard limit and don't fix it.

Your write index is still set to that 2021 index so all writes will continue to flow into that index:

  "cowrie-logstash-2021.12.30-000018": {
    "aliases": {
      "cowrie-logstash": {
        "is_write_index": true
      }
    }
  },

But ILM has completed and won't be doing a rollover any time soon.

So it's time to perform a couple of steps:

  1. Increase the max shard count in your cluster, or identify what is consuming 2400 shards in your cluster and significantly (and permanently, via automated cleanup) reduce your shard count
  2. It's time to perform a manual rollover (not a dry run)
  3. It's time to ensure that you are writing Elasticsearch error logs to something, even if it's to a tmpfs that doesn't persist past reboot
  4. It's time to upgrade your cluster, 8.13 is end of life
  5. It's time to switch to data streams instead of using indices (see the sketch right after this list)
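
For item 5, a minimal sketch of what that could look like (the template name and pattern here are illustrative and should not overlap your existing cowrie-logstash-* template; with data streams the Logstash output would use data_stream => true instead of index/ilm_rollover_alias, and reusing the existing logstash-policy is an assumption):

PUT _index_template/cowrie-datastream
{
  "index_patterns": ["cowrie-stream-*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.lifecycle.name": "logstash-policy"
    }
  }
}

Writes to a name matching cowrie-stream-* would then create a data stream whose backing indices ILM rolls over without needing a rollover_alias.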

One option for monitoring your Elasticsearch cluster could be to set up an Elasticsearch serverless project. They charge on ingest of data, about $0.50/GB. You could set up a Fleet-managed Elastic Agent to monitor your cluster and push logs to serverless. If you're picky about which data you send (only errors and warnings) and only important metrics at a low interval, you may find it to be an extremely inexpensive option. You won't have access to the stack monitoring UI, but you would be able to use the integration dashboards, read the logs, and set up relevant alerts on the metrics.

1 Like

Hello,

Thank you very much for your assistance.

  1. Request GET /_health_report

Response:

{
  "status": "yellow",
  "cluster_name": "data_analytics_1",
  "indicators": {
    "master_is_stable": {
      "status": "green",
      "symptom": "The cluster has a stable master node",
      "details": {
        "current_master": {
          "node_id": "2uxQoQ9HSkuBNX6_ExPOvw",
          "name": "secondarynode"
        },
        "recent_masters": [
          {
            "node_id": "2uxQoQ9HSkuBNX6_ExPOvw",
            "name": "secondarynode"
          }
        ]
      }
    },
    "repository_integrity": {
      "status": "green",
      "symptom": "All repositories are healthy.",
      "details": {
        "total_repositories": 1
      }
    },
    "disk": {
      "status": "green",
      "symptom": "The cluster has enough available disk space.",
      "details": {
        "indices_with_readonly_block": 0,
        "nodes_with_enough_disk_space": 3,
        "nodes_with_unknown_disk_status": 0,
        "nodes_over_high_watermark": 0,
        "nodes_over_flood_stage_watermark": 0
      }
    },
    "shards_capacity": {
      "status": "green",
      "symptom": "The cluster has enough room to add new shards.",
      "details": {
        "data": {
          "max_shards_in_cluster": 2400
        },
        "frozen": {
          "max_shards_in_cluster": 0
        }
      }
    },
    "shards_availability": {
      "status": "green",
      "symptom": "This cluster has all shards available.",
      "details": {
        "started_replicas": 1031,
        "unassigned_primaries": 0,
        "restarting_replicas": 0,
        "creating_primaries": 0,
        "initializing_replicas": 0,
        "unassigned_replicas": 0,
        "started_primaries": 1031,
        "restarting_primaries": 0,
        "initializing_primaries": 0,
        "creating_replicas": 0
      }
    },
    "data_stream_lifecycle": {
      "status": "green",
      "symptom": "Data streams are executing their lifecycles without issues",
      "details": {
        "stagnating_backing_indices_count": 0,
        "total_backing_indices_in_error": 0
      }
    },
    "slm": {
      "status": "green",
      "symptom": "Snapshot Lifecycle Management is running",
      "details": {
        "slm_status": "RUNNING",
        "policies": 1
      }
    },
    "ilm": {
      "status": "yellow",
      "symptom": "An index has stayed on the same action longer than expected.",
      "details": {
        "stagnating_indices_per_action": {
          "allocate": 0,
          "shrink": 0,
          "searchable_snapshot": 0,
          "rollover": 1,
          "forcemerge": 0,
          "delete": 0,
          "migrate": 0
        },
        "policies": 76,
        "stagnating_indices": 1,
        "ilm_status": "RUNNING"
      },
      "impacts": [
        {
          "id": "elasticsearch:health:ilm:impact:stagnating_index",
          "severity": 3,
          "description": "Automatic index lifecycle and data retention management cannot make progress on one or more indices. The performance and stability of the indices and/or the cluster could be impacted.",
          "impact_areas": [
            "deployment_management"
          ]
        }
      ],
      "diagnosis": [
        {
          "id": "elasticsearch:health:ilm:diagnosis:stagnating_action:rollover",
          "cause": "Some indices have been stagnated on the action [rollover] longer than the expected time.",
          "action": "Check the current status of the Index Lifecycle Management for every affected index using the [GET /<affected_index_name>/_ilm/explain] API. Please replace the <affected_index_name> in the API with the actual index name.",
          "help_url": "https://ela.st/ilm-explain",
          "affected_resources": {
            "ilm_policies": [
              "ml-size-based-ilm-policy"
            ],
            "indices": [
              ".ml-state-000001"
            ]
          }
        }
      ]
    }
  }
}
  2. Do I need to increase the limit (currently 2400)? I'm at ~2062 right now. :slight_smile: Please don't get me wrong; I am happy to do it, but I wanted to confirm.

  3. I am unsure how a manual rollover could be done. Could I please get some guiding principles? This is 5 years of my research and I don't want to screw it up :slight_smile:

  4. I will be gathering Elasticsearch logs, since it runs on an SSD (more durable storage than the SD cards on an RPi). Are there specific logs that would be helpful? I can grep and send or attach them too.

  5. My cluster (Elasticsearch, Kibana, Logstash, and Beats) is all at the latest release: 8.17.2. May I please know where that incorrect information was displayed? Again, sorry if any logs/screenshots I uploaded suggested that. I have a monthly patching cadence that takes care of any upgrades.

  6. I am happy to move from indices to data streams, but I reckon I would first like to fix the error, learn the difference, and then move ahead. :slight_smile:

  7. Given the importance and length of my work, I am happy to explore a paid version. I am no longer a student (I happily graduated with distinction, and a bow and thank-you to the Elastic team for giving me the license to run ML analysis). I will happily consider moving the operational aspects to a paid project, but as a homelab owner, letting someone else "administer" is always a fight. :slight_smile:

This command (without the dry run part) is all you need to initiate a rollover:
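
Concretely, that would be something like the following (no conditions body is needed; without conditions the rollover is performed unconditionally):

POST cowrie-logstash/_rollover

Afterwards, GET _alias/cowrie-logstash should show a new cowrie-logstash-<date>-000019 index as the write index.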

You have 2062 shards. Each time you add an index, that's two shards. So you're at 85% of capacity right now. I would increase the shard count. Or figure out what the 1000 indices in your cluster are doing and whether you need them all.
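
If you go the limit-raising route, the relevant dynamic setting is cluster.max_shards_per_node (the cluster-wide limit is that value multiplied by the number of data nodes); a sketch, with the value purely as an illustration:

PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 1500
  }
}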

I would keep Elasticsearch, Kibana, and Logstash error and warning logs.

My list of recommendations up above (1-5) goes from things you should definitely do now to things you should plan to do :slight_smile:

Please tell me you have some sort of backup.

If not, make a snapshot of your data. Don't think about doing it, do it.
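
For reference, a minimal sketch of taking a manual snapshot (my_nas_repo is a placeholder; GET _snapshot lists the repository you actually have registered):

GET _snapshot

PUT _snapshot/my_nas_repo/pre-rollover-backup?wait_for_completion=false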

3 Likes

Oh yes. One of my favourite subjects was Designing Security (or reliability engineering).

  1. I have 3 VMs running off 3 different SSDs.
  2. One SSD hosts the OS for two of the VMs & one SSD hosts the OS for the third VM (the second data node).
  3. The same cross arrangement is used for data across two different SSDs. These are mounted within the OS and are independent of the OS disk.
  4. Full NAS backup of the indices. The NAS can tolerate up to two HDD failures.
  5. Full VM backup on the NAS for configurations.

Except for CPU failure, everything else has double redundancy, since this is running on a single-socket server. :smiley:

It's the poor man's reliability engineering. :smiley:

1 Like

For point 3, the error has appeared.

Here are the details

Yes, your ukraine-war index will not work, as you have provided a naming convention that is invalid.

The name of the index must match the pattern defined in the index template and end with a number. This number is incremented to generate the name of indices created by the rollover action.

Your index name does not end with a number, so ILM doesn't know what to call the new one.

Logstash generally does this for you via the default value of the ilm_pattern setting, {now/d}-000001: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-ilm_pattern

So perhaps this index was created some other way?

Thank you @parthmaniar for the information and @strawgate for responding about the other index errors.

I can elaborate a bit more on what's going on now with cowrie-logstash-2021.12.30-000018.

  • Index cowrie-logstash-2021.12.30-000018 has the setting indexing_complete set to true. This means that ILM will skip rolling it over because it thinks that has already happened, see docs.* This is confirmed by the rollover_info that you posted: apparently, this index was rolled over on Tuesday, 1 February 2022 19:20:27.492 GMT. I am not sure if this helps in any way, but it looks like something went wrong then. Probably it's too long ago to find out what.
  • In order to get this alias unstuck, the easiest way is to follow the recommendation of @strawgate and perform a manual rollover. A manual rollover does not check this, so it will be performed. Later, if you want, you can also use split to split the index into more shards.
  • Given that the index template and policy are OK, ILM will take over from the next index and this should not happen again.

*Correction: the flag gets removed every time you remove the ILM policy via the API, but the rollover info remains; that is why it doesn't try to roll it over again.
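
For reference only (the manual rollover above remains the simpler path), the flag can be inspected, and reset to its default, with requests along these lines:

GET cowrie-logstash-2021.12.30-000018/_settings?filter_path=*.settings.index.lifecycle

PUT cowrie-logstash-2021.12.30-000018/_settings
{
  "index.lifecycle.indexing_complete": null
}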

1 Like