Can I prevent Elasticsearch monitoring indices from moving to my temporary nodes?

Good day,

I'm currently maintaining a cluster with 4 permanent nodes and 3 temporary nodes. The temporary nodes are started automatically during work hours and shut down afterwards. The problem is that the .monitoring-es-7* indices are moved to the temporary nodes every time they start. During the shutdown process I exclude the temporary nodes from the cluster, and the shards of the .monitoring indices become UNASSIGNED instead of moving to the permanent nodes. This leaves our cluster in a red state, and I have to manually delete the .monitoring index to bring it back to green. Eventually the monitoring index is generated again and placed on our permanent nodes, but it gets moved back to the temporary nodes when they start the next day.

Is there a way to tell Elasticsearch not to place the monitoring indices on our temporary nodes?

Why is this exactly? I.e. what does the allocation explain API say about these shards? Excluding nodes from the cluster with an allocation filter won't itself cause any shards to become UNASSIGNED so I think something else is going on here.
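For reference, a minimal sketch of the allocation explain request being asked for here. The index name is taken from the settings output later in this thread purely as an illustration, and `"primary": true` is an assumption about which copy is unassigned; substitute whichever shard is actually red.

```shell
# Ask the cluster why a specific shard copy is unassigned.
# Index name and "primary" value are illustrative; adjust to the affected shard.
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": ".monitoring-es-7-2024.02.13",
  "shard": 0,
  "primary": true
}
'

# With no request body, the API explains an arbitrary unassigned shard instead.
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```

The full JSON response, including the per-node decisions, is the useful part to post, not only the summary line.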

Hi @DavidTurner! I'm not sure whether this is part of the issue, but when we create our temporary nodes we set cluster.routing.rebalance.enable to replicas so that only replicas are relocated to the temporary nodes. We then set cluster.routing.rebalance.enable back to all after the temporary nodes are destroyed. Here's the elasticsearch.yml of one of our temp nodes.

path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
http.port: 9200
search.max_buckets: 65000
cluster.routing.allocation.enable: all
cluster.routing.rebalance.enable: none
cluster.routing.allocation.allow_rebalance: always
cluster.routing.allocation.node_concurrent_recoveries: 200
xpack.security.enabled: false
xpack.monitoring.enabled: true
cluster.name: platform-elasticsearch

node.name: temp-node-01
network.host: ["x.x.x.x","127.0.0.1"]
discovery.seed_hosts: ["x.x.x.x","x.x.x.x","x.x.x.x"]
cluster.initial_master_nodes: ["master-node-01", "master-node-02", "master-node-03"]
node.roles: [data]
network.publish_host: x.x.x.x

Here's the result when I check the settings of the .monitoring index.

curl -XGET "localhost:9200/.monitoring-es-7*/_settings?pretty"
{
  ".monitoring-es-7-2024.02.09" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.09",
        "format" : "7",
        "creation_date" : "1707486559155",
        "number_of_replicas" : "1",
        "uuid" : "2hYLQ2nGR4e7rM16xX6dLA",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.07" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.07",
        "format" : "7",
        "creation_date" : "1707264000054",
        "number_of_replicas" : "1",
        "uuid" : "_HUw6V3mS9K1Goh7rCzWIw",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.08" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.08",
        "format" : "7",
        "creation_date" : "1707401542253",
        "number_of_replicas" : "1",
        "uuid" : "qidNlIgPQL6C2_ZlqjvMkg",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.11" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.11",
        "format" : "7",
        "creation_date" : "1707609601292",
        "number_of_replicas" : "1",
        "uuid" : "xf2RDXH8QMii1l28nGll-A",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.12" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.12",
        "format" : "7",
        "creation_date" : "1707749582549",
        "number_of_replicas" : "1",
        "uuid" : "L4m1IlhkTOegxwdVzv0KBA",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.13" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.13",
        "format" : "7",
        "creation_date" : "1707782400198",
        "number_of_replicas" : "1",
        "uuid" : "2TZnmirvTd6Stwu9PuqYDw",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  },
  ".monitoring-es-7-2024.02.10" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".monitoring-es-7-2024.02.10",
        "format" : "7",
        "creation_date" : "1707523200335",
        "number_of_replicas" : "1",
        "uuid" : "8lCqa4obTs20U7tZzsOWqA",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  }
}

Nor am I, but allocation explain will clarify.

This is what I got from the last allocation explain I ran:

node_left
cannot allocate because all found copies of the shard are either stale or corrupt.

I'm not sure why it would become stale or corrupt when I just excluded the temp node from the cluster.

Nor am I, but the full allocation explain output will clarify.

As a workaround, I added the following to the shutdown script to ensure that all .monitoring indices are moved to our master nodes.

curl -X PUT "localhost:9200/.monitoring-es-7*/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.include._name": "master-es-nodes-*"
}
'
curl -X PUT "localhost:9200/.monitoring-kibana-7*/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.include._name": "master-es-nodes-*"
}
'

I tested it and it did move all .monitoring indices to the master nodes. I guess I'm good for now. Thanks! :slight_smile:

Hmm that's what I thought you meant when you said you were "excluding the temp node from the cluster". What did you mean if not adjusting the allocation filters?

Also, you're setting index-level filters in those API calls, which is a lot less efficient than cluster-level shard allocation filtering.
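As a rough sketch of what cluster-level filtering could look like: the pattern `temp-node-*` is an assumption based on the `node.name: temp-node-01` shown in the config above, and the choice of `persistent` rather than `transient` is also an assumption, so adjust both to your setup.

```shell
# Drain the temporary nodes before shutdown: no shard may stay on, or move to,
# any node whose name matches the pattern (assumed naming scheme).
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "temp-node-*"
  }
}
'

# Remove the filter again when the temporary nodes should take shards.
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": null
  }
}
'
```

A single cluster-wide setting applies to every index, including ones created later, so there is no need to update each .monitoring index separately.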

The way we exclude the temp nodes from the cluster is by setting:
cluster.routing.allocation.exclude._ip: "x.x.x.x, x.x.x.x, x.x.x.x, x.x.x.x"
For some reason this makes the monitoring indices go UNASSIGNED, and they do not move to other nodes.

Thanks for this! I'll use the cluster-level filtering instead.

Ok, setting cluster.routing.allocation.exclude._ip itself cannot cause any shards to become UNASSIGNED. Unless you've found a weird bug ofc. We'd really like to see the full allocation explain from that situation to understand it.
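One way to narrow this down, sketched here as a suggestion rather than a confirmed procedure, is to check where the shards actually are before the temporary nodes are stopped. The 60s timeout and the selected columns are arbitrary choices.

```shell
# Wait (up to the timeout) until no shards are still relocating off the excluded nodes.
curl -X GET "localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&timeout=60s&pretty"

# List each monitoring shard with its state, current node and, if unassigned, the reason.
curl -X GET "localhost:9200/_cat/shards/.monitoring-*?v&h=index,shard,prirep,state,node,unassigned.reason"
```

If any shard still shows a temporary node in the node column at this point, stopping that node would remove that copy, which would be worth noting alongside the allocation explain output.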

I'll provide the full allocation explain when I encounter the issue again. Thank you! :slight_smile:
