ES Not Allocating Shards According to Hot/Warm Architecture via Curator

ES v5.4.0
Curator v5.1.1

I have a hot/warm architecture with roughly 10 warm (spinning-disk) nodes and 3 hot (SSD) nodes. I can tag indices to be moved from the hot nodes to the warm nodes, and can verify they get tagged accordingly, but the shards never actually move.

The architecture is set up so that indices are created and held on the SSDs until they are three days old. After three days, Curator tags them with require: box_type="warm", and the tagged indices should then move from the hot nodes to the warm nodes (this part is broken).

My process is summarized as follows:
#### 1. Create new daily index shards on hot nodes
#### 2. Migrate three-day-old indices to warm nodes

## 1. Create new daily index shards on hot nodes
I'll give an example using an Nginx daily index. The index template requires its shards to be allocated to nodes with box_type: "hot" when the index is created.

{
  "nginx-2017.06.27": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "require": {
              "box_type": "hot"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "nginx-2017.06.27",
        "creation_date": "1498521579873",
        "number_of_replicas": "1",
        "uuid": "ZtzfqflfRAi3e4XzANVfKQ",
        "version": {
          "created": "5040099"
        }
      }
    }
  }
}
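For reference, a minimal sketch of the kind of index template that produces this setting is shown below; the template name and the shard/replica counts are illustrative, not copied from my cluster:

PUT _template/nginx_hot
{
  "template": "nginx-*",
  "settings": {
    "index.routing.allocation.require.box_type": "hot",
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}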

ES then places new Nginx shards only on "hot" nodes, e.g.:

      "name": "hot_node",
      "transport_address": "10.191.1.3:9302",
      "host": "10.191.1.3",
      "ip": "10.191.1.3",
      "version": "5.4.0",
      "build_hash": "780f8c4",
      "roles": [
        "data"
      ],
      "attributes": {
        "box_type": "hot",
        "ml.enabled": "true"
      },

This part works perfectly -- new Nginx shards are only created on the three hot nodes, and are not created at all if no hot node is available.
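For completeness, the box_type attribute comes from a custom node attribute set on each data node; assuming the standard way of doing this in ES 5.x, the relevant elasticsearch.yml lines look like:

# elasticsearch.yml on an SSD-backed (hot) node
node.attr.box_type: hot

# elasticsearch.yml on a spinning-disk (warm) node
node.attr.box_type: warm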

## 2. Migrate three-day-old indices to warm nodes
To migrate the three-day-old indices, I run a daily Curator script:

actions:
  1:
    action: allocation
    description: Apply shard allocation routing to 'require' 'box_type=warm' for the hot/warm node setup, for nginx- indices older than 3 days, based on index creation date
    options:
      key: box_type
      value: warm
      allocation_type: require
      wait_for_completion: true
      timeout_override:
      continue_if_exception: false
      disable_action: false
    filters:
    - filtertype: pattern
      kind: prefix
      value: nginx-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 2

Curator runs with the following client configuration file:

---
# Remember, leave a key empty if there is no value.  None will be a string,
# not a Python "NoneType"
client:
  hosts:
    - node_hot
  port: 9202
  url_prefix:
  use_ssl: False
  certificate:
  client_cert:
  client_key:
  ssl_no_validate: False
  http_auth:
  timeout: 30
  master_only: False

logging:
  loglevel: DEBUG
  logfile:
  logformat: default
  blacklist: ['elasticsearch', 'urllib3']
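I invoke it daily from cron, roughly as follows (the paths are placeholders):

curator --config /etc/curator/config.yml /etc/curator/allocate_warm.yml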

At which point I get verification that the indices have been updated with the new box_type requirement:

2017-06-27 14:43:11,778 INFO      curator.actions.allocation              do_action:227  Updating index setting {'index.routing.allocation.require.box_type': 'warm'}
2017-06-27 14:43:14,951 DEBUG              curator.utils              do_action:237  Waiting for shards to complete relocation for indices: nginx-2016.11.15,nginx-2016.12.01,nginx-2016.12.02,nginx-2016.12.04,nginx-2016.12.06,nginx-2016.12.07,nginx-2016.12.08,nginx-2016.12.09,nginx-2016.12.10,nginx-2016.12.11,nginx-2016.12.12,nginx-2016.12.13,nginx-2016.12.14,nginx-2016.12.15,nginx-2016.12.16,nginx-2016.12.18,nginx-2016.12.19,nginx-2016.12.20,nginx-2016.12.21,nginx-2016.12.22,nginx-2016.12.23,nginx-2017.01.06,nginx-2017.01.07,nginx-2017.05.26,nginx-2017.05.28,nginx-2017.05.29,nginx-2017.05.30,nginx-2017.05.31,nginx-2017.06.01,nginx-2017.06.02,nginx-2017.06.05,nginx-2017.06.06,nginx-2017.06.07,nginx-2017.06.08,nginx-2017.06.09,nginx-2017.06.10,nginx-2017.06.11,nginx-2017.06.12,nginx-2017.06.13,nginx-2017.06.14,nginx-2017.06.15,nginx-2017.06.16,nginx-2017.06.17,nginx-2017.06.18,nginx-2017.06.19,nginx-2017.06.20,nginx-2017.06.21,nginx-2017.06.22,nginx-2017.06.23,nginx-2017.06.24,nginx-2017.06.25
2017-06-27 14:43:14,951 DEBUG              curator.utils            wait_for_it:1581 Elapsed time: 0 seconds
2017-06-27 14:43:14,952 DEBUG              curator.utils           health_check:1357 KWARGS= "{'relocating_shards': 0}"
2017-06-27 14:43:14,961 DEBUG              curator.utils           health_check:1377 MATCH: Value for key "0", health check data: 0
2017-06-27 14:43:14,961 INFO               curator.utils           health_check:1380 Health Check for all provided keys passed.
2017-06-27 14:43:14,961 DEBUG              curator.utils            wait_for_it:1584 Response: True
2017-06-27 14:43:14,962 DEBUG              curator.utils            wait_for_it:1589 Action "allocation" finished executing (may or may not have been successful)
2017-06-27 14:43:14,962 DEBUG              curator.utils            wait_for_it:1607 Result: True
2017-06-27 14:43:14,966 INFO                 curator.cli                    cli:203  Action ID: 1, "allocation" completed.
2017-06-27 14:43:14,966 INFO                 curator.cli                    cli:204  Job completed.

This should cause the index to be reallocated to a node with box_type: "warm", e.g.:

      "name": "warm_node",
      "transport_address": "10.191.1.2:9300",
      "host": "10.191.1.2",
      "ip": "10.191.1.2",
      "version": "5.4.0",
      "build_hash": "780f8c4",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "box_type": "warm",
        "ml.enabled": "true"
      },

If I look in X-Pack or query the index settings (e.g. for nginx-2017.06.23), I can see that box_type: "warm" is indeed a requirement:

{
  "nginx-2017.06.23": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "require": {
              "box_type": "warm",
              "tag": "warm"
            }
          }
        },
        "number_of_shards": "5",
        "provided_name": "nginx-2017.06.23",
        "creation_date": "1498175976481",
        "number_of_replicas": "1",
        "uuid": "SZNhxKGFQuuFA4OHsKMdvA",
        "version": {
          "created": "5040099"
        }
      }
    }
  }
}

yet the index still clearly lives on nodes with box_type: "hot" (GET nginx-2017.06.23/_search_shards):

{
  "nodes": {
    "m7Cc-yBdR9efPHwBuliJTw": {
      "name": "node_warm",
      "ephemeral_id": "lqka_rveR-GL0d-yWqWjCg",
      "transport_address": "10.191.4.44:9300",
      "attributes": {
        "box_type": "warm",
        "ml.enabled": "true"
      }
    },
    "CRlGMRe6RNKIp3ojo7wpSA": {
      "name": "node_hot",
      "ephemeral_id": "nshd47ujTCuPD_t_Fng-PA",
      "transport_address": "10.191.1.3:9302",
      "attributes": {
        "box_type": "hot",
        "ml.enabled": "true"
      }
    },
    "x-ghbSl5Rf6ObX50KRY4DQ": {
      "name": "node_hot_1",
      "ephemeral_id": "M4mPFUrJT0WQiduBJ70y0Q",
      "transport_address": "10.191.1.4:9302",
      "attributes": {
        "box_type": "hot",
        "ml.enabled": "true"
      }
    }
  },
  "indices": {
    "nginx-2017.06.23": {}
  },
  "shards": [
    [
      {
        "state": "STARTED",
        "primary": true,
        "node": "m7Cc-yBdR9efPHwBuliJTw",
        "relocating_node": null,
        "shard": 0,
        "index": "gpu-2017.06.23",
        "allocation_id": {
          "id": "AnZ-92o8Rg-P1_M3CM7Bqg"
        }
      },

Is there a way to manually force indices to re-evaluate their allocation requirements against the attributes of the nodes they are currently living on?

The shards would normally be moved as soon as the allocation setting is changed, but only if no other allocation deciders block the move to the warm nodes (e.g. if the warm nodes are above the disk watermark). You can ask Elasticsearch to explain the allocation using the cluster allocation explain API. Try running that for a shard in the nginx-2017.06.23 index, and if you are still having trouble, paste the output in a Gist and link to it here.
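For example, to explain the primary of shard 0 of that index (the shard number here is just an example):

GET _cluster/allocation/explain
{
  "index": "nginx-2017.06.23",
  "shard": 0,
  "primary": true
}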


@colings86 I didn't know that tool existed -- it is super handy. I have already discarded the output, but it explained that it had tried to allocate the index five times before giving up. It also told me to run this command:

POST /_cluster/reroute?retry_failed=true

Which worked!
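For anyone who hits the same thing: the "five times" in the explain output matches the default of the index.allocation.max_retries setting (5); once the underlying cause is fixed, retry_failed=true asks the master to retry those failed allocations. If needed, the retry limit can also be raised per index, e.g. (index name used only as an example):

PUT nginx-2017.06.23/_settings
{
  "index.allocation.max_retries": 10
}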

Great, glad you got it sorted out. The allocation explain API is pretty new; it was added in 5.0.
