Can I prevent Elasticsearch monitoring indices from moving to my temporary nodes?

Churchill · February 13, 2024, 6:53am

Good day,

I'm currently maintaining a cluster with 4 permanent nodes and 3 temporary nodes. The temporary nodes are started automatically during work hours and shut down afterwards. The problem is that the .monitoring-es-7* indices are moved to the temporary nodes every time they start. During the shutdown process, I exclude the temporary nodes from the cluster, and the .monitoring shards become UNASSIGNED instead of moving to the permanent nodes. This leaves the cluster in a red state, and I have to manually delete the .monitoring index to bring it back to green. Eventually the monitoring index is created again and placed on the permanent nodes, but it gets moved back to the temporary nodes when they start the next day.

Is there a way to tell Elasticsearch not to place the monitoring indices on our temporary nodes?

DavidTurner · February 13, 2024, 8:34am

Why don't they move to the permanent nodes, exactly? I.e. what does the allocation explain API say about these shards? Excluding nodes from the cluster with an allocation filter won't itself cause any shards to become UNASSIGNED, so I think something else is going on here.
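
For readers following along, the allocation explain API can be queried for one specific shard. The sketch below is only an illustration: the endpoint localhost:9200 and the index name are taken from elsewhere in this thread, while the function names, shard number 0 and primary=true are assumptions.

```shell
# Assumed endpoint; the thread uses localhost:9200.
ES_URL="${ES_URL:-localhost:9200}"

# Build the request body for one shard of one index.
# $1 = index name, $2 = shard number, $3 = true for the primary, false for a replica
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":%s}' "$1" "$2" "$3"
}

# Ask the cluster why that shard is (or is not) assigned.
explain_shard() {
  curl -s -X GET "$ES_URL/_cluster/allocation/explain?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(explain_body "$1" "$2" "$3")"
}

# Usage (hypothetical): explain_shard .monitoring-es-7-2024.02.13 0 true
```

Calling the API with no body at all explains an arbitrary unassigned shard, which is often enough when only the monitoring shards are unassigned.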

Churchill · February 13, 2024, 10:17am

Hi @DavidTurner! I'm not sure if this is part of the issue, but when our temporary nodes are created, we set cluster.routing.rebalance.enable to replicas to ensure that only replicas are relocated to the temporary nodes. We then set cluster.routing.rebalance.enable back to all after the temporary nodes get destroyed. Here's the elasticsearch.yml of one of our temp nodes.

path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
http.port: 9200
search.max_buckets: 65000
cluster.routing.allocation.enable: all
cluster.routing.rebalance.enable: none
cluster.routing.allocation.allow_rebalance: always
cluster.routing.allocation.node_concurrent_recoveries: 200
xpack.security.enabled: false
xpack.monitoring.enabled: true
cluster.name: platform-elasticsearch

node.name: temp-node-01
network.host: ["x.x.x.x","127.0.0.1"]
discovery.seed_hosts: ["x.x.x.x","x.x.x.x","x.x.x.x"]
cluster.initial_master_nodes: ["master-node-01", "master-node-02", "master-node-03"]
node.roles: [data]
network.publish_host: x.x.x.x
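
For reference, cluster.routing.rebalance.enable is a dynamic setting, so the switch between replicas and all described in this post can be made at runtime through the cluster settings API. A rough sketch, assuming the localhost:9200 endpoint used elsewhere in this thread; the function names and the choice of a persistent setting are illustrative assumptions, not something the post states:

```shell
# Assumed endpoint; the thread uses localhost:9200.
ES_URL="${ES_URL:-localhost:9200}"

# Build the cluster settings body.
# $1 = all | primaries | replicas | none
rebalance_body() {
  printf '{"persistent":{"cluster.routing.rebalance.enable":"%s"}}' "$1"
}

# Apply the setting to the running cluster.
set_rebalance() {
  curl -s -X PUT "$ES_URL/_cluster/settings?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(rebalance_body "$1")"
}

# Usage (hypothetical):
#   set_rebalance replicas   # when the temporary nodes come up
#   set_rebalance all        # after they are destroyed
```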

Here's the result when I check the settings of the .monitoring index.

curl -XGET "localhost:9200/.monitoring-es-7*/_settings?pretty"
{
  ".monitoring-es-7-2024.02.09" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.09",
        "format" : "7",
        "creation_date" : "1707486559155",
        "number_of_replicas" : "1",
        "uuid" : "2hYLQ2nGR4e7rM16xX6dLA",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.07" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.07",
        "format" : "7",
        "creation_date" : "1707264000054",
        "number_of_replicas" : "1",
        "uuid" : "_HUw6V3mS9K1Goh7rCzWIw",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.08" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.08",
        "format" : "7",
        "creation_date" : "1707401542253",
        "number_of_replicas" : "1",
        "uuid" : "qidNlIgPQL6C2_ZlqjvMkg",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.11" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.11",
        "format" : "7",
        "creation_date" : "1707609601292",
        "number_of_replicas" : "1",
        "uuid" : "xf2RDXH8QMii1l28nGll-A",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.12" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.12",
        "format" : "7",
        "creation_date" : "1707749582549",
        "number_of_replicas" : "1",
        "uuid" : "L4m1IlhkTOegxwdVzv0KBA",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.13" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.13",
        "format" : "7",
        "creation_date" : "1707782400198",
        "number_of_replicas" : "1",
        "uuid" : "2TZnmirvTd6Stwu9PuqYDw",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.10" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.10",
        "format" : "7",
        "creation_date" : "1707523200335",
        "number_of_replicas" : "1",
        "uuid" : "8lCqa4obTs20U7tZzsOWqA",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  }
}

DavidTurner · February 13, 2024, 10:37am

Nor am I, but allocation explain will clarify.

Churchill · February 13, 2024, 11:05am

This is what I get from the last allocation explain I did:

node_left
cannot allocate because all found copies of the shard are either stale or corrupt.

I'm not sure why it would become stale or corrupt when I just excluded the temp node from the cluster.

DavidTurner · February 13, 2024, 11:07am

Nor am I, but the full allocation explain output will clarify.

Churchill · February 13, 2024, 11:34am

As a workaround, I just added the following to the shutdown script to ensure that all .monitoring indices get transferred to our master nodes.

curl -X PUT "localhost:9200/.monitoring-es-7*/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.include._name": "master-es-nodes-*"
}
'

curl -X PUT "localhost:9200/.monitoring-kibana-7*/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.include._name": "master-es-nodes-*"
}
'

I tested it and it did move all .monitoring indices to the master nodes. I guess I'm good for now. Thanks!

DavidTurner · February 13, 2024, 12:15pm

Hmm that's what I thought you meant when you said you were "excluding the temp node from the cluster". What did you mean if not adjusting the allocation filters?

Also, you're setting index-level filters in those API calls, which is a lot less efficient than cluster-level shard allocation filtering.
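
As an illustration of the cluster-level alternative, a single cluster setting can keep every index off a set of nodes matched by name. This is a sketch under assumptions: the pattern temp-node-* is inferred from the node.name temp-node-01 shown earlier in the thread, and the function names and the use of a persistent setting are hypothetical choices.

```shell
# Assumed endpoint; the thread uses localhost:9200.
ES_URL="${ES_URL:-localhost:9200}"

# Body that excludes nodes by name.
# $1 = comma-separated node names or wildcard patterns
exclude_names_body() {
  printf '{"persistent":{"cluster.routing.allocation.exclude._name":"%s"}}' "$1"
}

# Body that removes the filter again (null resets the setting).
clear_exclude_names_body() {
  printf '{"persistent":{"cluster.routing.allocation.exclude._name":null}}'
}

# Send either body to the cluster settings API.
put_cluster_settings() {
  curl -s -X PUT "$ES_URL/_cluster/settings?pretty" \
    -H 'Content-Type: application/json' \
    -d "$1"
}

# Usage (hypothetical):
#   put_cluster_settings "$(exclude_names_body 'temp-node-*')"
#   put_cluster_settings "$(clear_exclude_names_body)"
```

Note that a cluster-level exclude applies to all indices, not only the monitoring ones, so it suits the shutdown step rather than normal operation.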

Churchill · February 13, 2024, 1:11pm

The way we exclude the temp nodes from the cluster is via:

cluster.routing.allocation.exclude._ip: "x.x.x.x, x.x.x.x, x.x.x.x, x.x.x.x"

For some reason this makes the monitoring indices go UNASSIGNED, and they will not move to other nodes.

Thanks for this! I'll use the cluster-level filtering instead.

DavidTurner · February 13, 2024, 1:31pm

Ok, setting cluster.routing.allocation.exclude._ip itself cannot cause any shards to become UNASSIGNED. Unless you've found a weird bug ofc. We'd really like to see the full allocation explain from that situation to understand it.
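
Independent of what the cause turns out to be here, a shutdown script can check that the excluded nodes no longer hold any shards before stopping them. The following is only a sketch: it assumes the localhost:9200 endpoint from this thread and node names starting with temp-node- (inferred from temp-node-01 above); the function names and the 10-second polling interval are hypothetical.

```shell
# Assumed endpoint; the thread uses localhost:9200.
ES_URL="${ES_URL:-localhost:9200}"

# Count shards whose node column starts with the given prefix.
# Reads `_cat/shards?h=index,shard,prirep,state,node` output on stdin.
# UNASSIGNED shards have an empty node column and are not counted.
count_on_nodes() {
  awk -v p="$1" '$5 != "" && index($5, p) == 1 { n++ } END { print n + 0 }'
}

# Poll until no shard is left on nodes matching the prefix.
# Returns 1 if the cluster cannot be reached.
wait_for_drain() {
  while :; do
    shards=$(curl -sf "$ES_URL/_cat/shards?h=index,shard,prirep,state,node") || return 1
    [ "$(printf '%s\n' "$shards" | count_on_nodes "$1")" -eq 0 ] && return 0
    sleep 10
  done
}

# Usage (hypothetical): wait_for_drain temp-node- && echo "safe to stop temp nodes"
```

A shard that is still relocating away from a temporary node is listed with that node as its source, so it keeps the loop waiting until the move completes.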

Churchill · February 13, 2024, 3:47pm

I'll provide the full allocation explain when I encounter the issue again. Thank you!

system · March 12, 2024, 3:47pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
