Hi,
Since last week we've been getting alerts that our ELK cluster status is RED after the Curator run. The Curator job forcemerges indices older than 2 days, shrinks indices older than 7 days, and then moves them to our ZFS-backed archive node, which uses a SAN as its storage backend.
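The job runs nightly at 23:00 via cron (which lines up with the timestamp in the allocation explanation below). The crontab entry is roughly the following; the config paths here are illustrative, not our exact ones:

# Nightly Curator run at 23:00; paths are illustrative
0 23 * * * curator --config /etc/curator/curator.yml /etc/curator/action.yml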
Running GET _cluster/allocation/explain produces the following:
{
  "index": "winlogbeat-2018-07-03",
  "shard": 1,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "DANGLING_INDEX_IMPORTED",
    "at": "2018-09-05T23:00:01.238Z",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "045e5WjzQCCjHqj8g_VA2Q",
      "node_name": "sealikreela04-archive",
      "transport_address": "10.229.1.14:9300",
      "node_attributes": {
        "ml.machine_memory": "101352407040",
        "ml.max_open_jobs": "20",
        "datacenter": "KRE",
        "xpack.installed": "true",
        "box_type": "archive",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "8MvsALY9RpSEPqJVXz3qxQ",
      "node_name": "sealijvbela02-masterdata",
      "transport_address": "10.229.1.12:9300",
      "node_attributes": {
        "ml.machine_memory": "67529682944",
        "ml.max_open_jobs": "20",
        "datacenter": "JVB",
        "xpack.installed": "true",
        "box_type": "hot",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "cwrufCSjQpakNlct4SXdAA",
      "node_name": "sealikreela03-masterdata",
      "transport_address": "10.229.1.13:9300",
      "node_attributes": {
        "ml.machine_memory": "67529682944",
        "ml.max_open_jobs": "20",
        "datacenter": "KRE",
        "xpack.installed": "true",
        "box_type": "hot",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
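We also pulled the shard listing for the affected index to double-check; the command was roughly this, and it just confirms the primary is UNASSIGNED with no copy anywhere:

GET _cat/shards/winlogbeat-2018-07-03?v&h=index,shard,prirep,state,node,unassigned.reason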
And here is our Curator action.yml file:
actions:
  1:
    action: forcemerge
    description: >-
      forceMerge logstash- and winlogbeat- prefixed indices older than 2 days
      (based on index creation_date) to 2 segments per shard. Delay 120 seconds
      between each forceMerge operation to allow the cluster to quiesce. Skip
      indices that have already been forcemerged to the minimum number of
      segments to avoid reprocessing.
    options:
      max_num_segments: 2
      delay: 120
      timeout_override:
      continue_if_exception: False
      disable_action: False
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: regex
      value: '^(logstash-|winlogbeat-).*$'
    - filtertype: pattern
      kind: suffix
      value: -shrink
      exclude: True
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 2
      exclude:
    - filtertype: forcemerged
      max_num_segments: 2
      exclude: True
  2:
    action: replicas
    description: >-
      Set index replicas to 0 since we want to move the indices to the
      archive node.
    options:
      disable_action: False
      count: 0
      wait_for_completion: True
    filters:
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 7
    - filtertype: pattern
      kind: regex
      value: '^(logstash-|winlogbeat-).*$'
    - filtertype: pattern
      kind: suffix
      value: -shrink
      exclude: True
  3:
    action: allocation
    description: >-
      Move indices older than 7 days to the archive node sealikreela04.
    options:
      key: box_type
      value: archive
      allocation_type: require
      wait_for_completion: True
      disable_action: False
    filters:
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 7
      exclude:
    - filtertype: pattern
      kind: regex
      value: '^(logstash-|winlogbeat-).*$'
    - filtertype: pattern
      kind: suffix
      value: -shrink
      exclude: True
  4:
    action: shrink
    description: >-
      Shrink logstash and winlogbeat indices older than 7 days and rotate them
      to the archive node, sealikreela04. This deletes the original indices
      and creates new ones with the suffix "-shrink".
    options:
      disable_action: False
      ignore_empty_list: True
      shrink_node: sealikreela04-archive
      node_filters:
        permit_masters: True
      number_of_shards: 1
      number_of_replicas: 0
      shrink_prefix:
      shrink_suffix: '-shrink'
      delete_after: True
      post_allocation:
        allocation_type: require
        key: box_type
        value: archive
      wait_for_active_shards: 1
      extra_settings:
        settings:
          index.codec: best_compression
      wait_for_completion: True
      wait_for_rebalance: True
      wait_interval: 9
      max_wait: -1
    filters:
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 7
    - filtertype: pattern
      kind: regex
      value: '^(logstash-|winlogbeat-).*$'
    - filtertype: pattern
      kind: suffix
      value: -shrink
      exclude: True
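In case the client side matters, our curator.yml looks roughly like this (host and log path are illustrative):

client:
  hosts:
    - 10.229.1.12
  port: 9200
  url_prefix:
  use_ssl: False
  ssl_no_validate: False
  timeout: 300
  master_only: False

logging:
  loglevel: INFO
  logfile: /var/log/curator/curator.log
  logformat: default
  blacklist: ['elasticsearch', 'urllib3']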
What could be causing this? And what can be done to prevent this in the future?