Running Curator causes missing shards for indices with reason NO_VALID_SHARD_COPY

Hi,

Since last week, our ELK cluster status has been going RED after running Curator. The Curator job is set to forcemerge indices older than 2 days, shrink indices older than 7 days, and then move them to our ZFS-powered archive node, which uses a SAN as a backend.

Running GET _cluster/allocation/explain produces the following:

{
  "index": "winlogbeat-2018-07-03",
  "shard": 1,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "DANGLING_INDEX_IMPORTED",
    "at": "2018-09-05T23:00:01.238Z",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "045e5WjzQCCjHqj8g_VA2Q",
      "node_name": "sealikreela04-archive",
      "transport_address": "10.229.1.14:9300",
      "node_attributes": {
        "ml.machine_memory": "101352407040",
        "ml.max_open_jobs": "20",
        "datacenter": "KRE",
        "xpack.installed": "true",
        "box_type": "archive",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "8MvsALY9RpSEPqJVXz3qxQ",
      "node_name": "sealijvbela02-masterdata",
      "transport_address": "10.229.1.12:9300",
      "node_attributes": {
        "ml.machine_memory": "67529682944",
        "ml.max_open_jobs": "20",
        "datacenter": "JVB",
        "xpack.installed": "true",
        "box_type": "hot",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "cwrufCSjQpakNlct4SXdAA",
      "node_name": "sealikreela03-masterdata",
      "transport_address": "10.229.1.13:9300",
      "node_attributes": {
        "ml.machine_memory": "67529682944",
        "ml.max_open_jobs": "20",
        "datacenter": "KRE",
        "xpack.installed": "true",
        "box_type": "hot",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}

And here is our Curator action.yml file:

actions:
  1:
    action: forcemerge
    description: >-
      forceMerge logstash- and winlogbeat- prefixed indices older than 2 days
      (based on index creation_date) to 2 segments per shard. Delay 120 seconds
      between each forceMerge operation to allow the cluster to quiesce. Skip
      indices that have already been forcemerged to the target number of
      segments to avoid reprocessing.
    options:
      max_num_segments: 2
      delay: 120
      timeout_override:
      continue_if_exception: False
      disable_action: False
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: regex
      value: '^(logstash-|winlogbeat-).*$'
    - filtertype: pattern
      kind: suffix
      value: -shrink
      exclude: True
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 2
      exclude:
    - filtertype: forcemerged
      max_num_segments: 2
      exclude: True
  2:
    action: replicas
    description: >-
      Set index replicas to 0 since we want to move the index to the archive node
    options:
      disable_action: False
      count: 0
      wait_for_completion: True
    filters:
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 7
    - filtertype: pattern
      kind: regex
      value: '^(logstash-|winlogbeat-).*$'
    - filtertype: pattern
      kind: suffix
      value: -shrink
      exclude: True
  3:
    action: allocation
    description: >-
      Move indices older than 7 days to the archive node sealikreela04
    options:
      key: box_type
      value: archive
      allocation_type: require
      wait_for_completion: True
      disable_action: False
    filters:
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 7
      exclude:
    - filtertype: pattern
      kind: regex
      value: '^(logstash-|winlogbeat-).*$'
    - filtertype: pattern
      kind: suffix
      value: -shrink
      exclude: True
  4:
    action: shrink
    description: >-
      Shrink logstash- and winlogbeat- prefixed indices older than 7 days and
      rotate them to the archive node, sealikreela04. The original indices are
      deleted and new ones are created with the suffix "-shrink".
    options:
      disable_action: False
      ignore_empty_list: True
      shrink_node: sealikreela04-archive
      node_filters:
        permit_masters: True
      number_of_shards: 1
      number_of_replicas: 0
      shrink_prefix:
      shrink_suffix: '-shrink'
      delete_after: True
      post_allocation:
        allocation_type: require
        key: box_type
        value: archive
      wait_for_active_shards: 1
      extra_settings:
        settings:
          index.codec: best_compression
      wait_for_completion: True
      wait_for_rebalance: True
      wait_interval: 9
      max_wait: -1
    filters:
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 7
    - filtertype: pattern
      kind: regex
      value: '^(logstash-|winlogbeat-).*$'
    - filtertype: pattern
      kind: suffix
      value: -shrink
      exclude: True

What could be causing this? And what can be done to prevent this in the future?

Curator is just an index selection wrapper around regular Elasticsearch API calls. In other words, if you made the exact same API calls to Elasticsearch without Curator, your results would be the same, dangling indices and all.
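
For illustration, the actions in your file boil down to roughly this sequence of raw API calls per index (a sketch only, using one of your index names as a placeholder; Curator wraps these with its own checks and waits, so the exact requests it sends may differ):

POST /winlogbeat-2018-07-03/_forcemerge?max_num_segments=2

PUT /winlogbeat-2018-07-03/_settings
{
  "index.number_of_replicas": 0
}

PUT /winlogbeat-2018-07-03/_settings
{
  "index.routing.allocation.require.box_type": "archive"
}

PUT /winlogbeat-2018-07-03/_settings
{
  "index.routing.allocation.require._name": "sealikreela04-archive",
  "index.blocks.write": true
}

POST /winlogbeat-2018-07-03/_shrink/winlogbeat-2018-07-03-shrink
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0,
    "index.codec": "best_compression"
  }
}

PUT /winlogbeat-2018-07-03-shrink/_settings
{
  "index.routing.allocation.require.box_type": "archive"
}

DELETE /winlogbeat-2018-07-03

The shrink API requires a copy of every shard of the source index on one node and a write block on that index, which is why the relocation and index.blocks.write step has to happen before the shrink call itself.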

Without going over a full debug log, plus the associated Elasticsearch logs, I can't guess why you're getting this message. However, I am curious what you mean by a "ZFS-powered node" backed by a SAN. I can tell you that ZFS is not recommended as a data path for Elasticsearch: it is slower for writes, and its ARC cache consumes a considerable amount of system memory, which is highly undesirable on an Elasticsearch node.

Whether that has anything to do with your results, I cannot tell from just this message. I also can't tell whether you mean you are using ZFS as the data path on the node, or that the ZFS data path lives on a SAN (neither of which is recommended), or whether you only mean it is the destination for a snapshot repository, which would be fine.
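
If the snapshot-repository case is what you mean, the supported pattern looks roughly like this (a sketch; the repository name archive_repo and the mount path are made-up placeholders, the location has to be listed under path.repo in elasticsearch.yml, and the mount must be visible to every master and data node):

PUT _snapshot/archive_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/zfs_archive/es_snapshots",
    "compress": true
  }
}

PUT _snapshot/archive_repo/winlogbeat-2018-07-03?wait_for_completion=true
{
  "indices": "winlogbeat-2018-07-03",
  "include_global_state": false
}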
