Unassigned shards and Elasticsearch health turns RED

Hi, I have a single-node cluster: my ELK stack runs on one server (say servA), and Filebeat runs on remote servers.

There are two remote servers:

filebeat-A (fA) on serverB (sB) and filebeat-B (fB) on serverC (sC)

Both Filebeat instances push their logs to servA.

Everything worked fine while only filebeat-A was pushing logs to the ELK stack on servA (it ran fine for 10+ days).

As soon as I started pushing logs from filebeat-B with include_lines: ['regex'] (this setting is not in filebeat-A), every index turned RED.

I then stopped both fA and fB and checked the index and cluster health: everything had become red. I don't know whether that setting caused this or whether it is some other problem.

I tried GET /_cluster/health/mdcp_contact and the cluster allocation explain API (GET /_cluster/allocation/explain).
The allocation explain result is:
    {
      "index": "mdcp_contact",
      "shard": 3,
      "primary": true,
      "current_state": "unassigned",
      "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "at": "2019-11-07T12:30:02.711Z",
        "failed_allocation_attempts": 5,
        "details": "failed shard on node [Bewr4jriQziexcfUXZSfdg]: failed recovery, failure RecoveryFailedException[[mdcp_contact][3]: Recovery failed on {Bewr4jr}{Bewr4jriQziexcfUXZSfdg}{Rqi93PhcTnaC4gegILxBww}{ip}{ip:9300}{ml.machine_memory=67368890368, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: FileSystemException[/opt/jboss/elk/elasticsearch-6.4.2/data/nodes/0/indices/60sjgq9vSZKv7CbTcUi6_Q/3/index/_5wu.nvd: Too many open files]; ",
        "last_allocation_status": "no"
      },
      "can_allocate": "no",
      "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
      "node_allocation_decisions": [
        {
          "node_id": "Bewr4jriQziexcfUXZSfdg",
          "node_name": "Bewr4jr",
          "transport_address": "ip:9300",
          "node_attributes": {
            "ml.machine_memory": "67368890368",
            "xpack.installed": "true",
            "ml.max_open_jobs": "20",
            "ml.enabled": "true"
          },
          "node_decision": "no",
          "store": {
            "in_sync": true,
            "allocation_id": "U4BmrwmxTNKGPPMqCMKEpA"
          },
          "deciders": [
            {
              "decider": "max_retry",
              "decision": "NO",
              "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-11-07T12:30:02.711Z], failed_attempts[5], delayed=false, details[failed shard on node [Bewr4jriQziexcfUXZSfdg]: failed recovery, failure RecoveryFailedException[[mdcp_contact][3]: Recovery failed on {Bewr4jr}{Bewr4jriQziexcfUXZSfdg}{Rqi93PhcTnaC4gegILxBww}{ip}{ip:9300}{ml.machine_memory=67368890368, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: FileSystemException[/opt/jboss/elk/elasticsearch-6.4.2/data/nodes/0/indices/60sjgq9vSZKv7CbTcUi6_Q/3/index/_5wu.nvd: Too many open files]; ], allocation_status[deciders_no]]]"
            }
          ]
        }
      ]
    }

So the shards are unassigned. I tried the reallocation and reroute APIs, but they did not help.

Any suggestions or workarounds?
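The allocation explain output above is long, and the useful part is buried at the end of the `details` string. A minimal Python sketch for pulling out the innermost failure and the deciders that answered NO could look like this. The helper names `innermost_cause` and `blocking_deciders` are made up for illustration, and the sample is a shortened copy of the response above, not the full text.

```python
def innermost_cause(details: str) -> str:
    """Return the last 'nested:' entry of an unassigned_info.details string."""
    parts = [p.strip().rstrip(";").strip() for p in details.split("nested: ")]
    parts = [p for p in parts if p]
    return parts[-1] if parts else ""


def blocking_deciders(explain: dict) -> list:
    """Return (node_name, decider) pairs whose decision is NO."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append((node.get("node_name"), decider.get("decider")))
    return blocked


# Shortened copy of the response posted above (node descriptor trimmed).
sample = {
    "index": "mdcp_contact",
    "shard": 3,
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "details": (
            "failed shard on node [Bewr4jriQziexcfUXZSfdg]: failed recovery, "
            "failure RecoveryFailedException[[mdcp_contact][3]: Recovery failed]; "
            "nested: IndexShardRecoveryException[failed to recover from gateway]; "
            "nested: FileSystemException[/opt/jboss/elk/elasticsearch-6.4.2/data/"
            "nodes/0/indices/60sjgq9vSZKv7CbTcUi6_Q/3/index/_5wu.nvd: "
            "Too many open files]; "
        ),
    },
    "node_allocation_decisions": [
        {
            "node_name": "Bewr4jr",
            "deciders": [{"decider": "max_retry", "decision": "NO"}],
        }
    ],
}

print(innermost_cause(sample["unassigned_info"]["details"]))
print(blocking_deciders(sample))
```

For this response the innermost entry is the FileSystemException ending in "Too many open files", and the only blocking decider is max_retry. The decider's own explanation says to call POST /_cluster/reroute?retry_failed=true manually once the underlying failure is dealt with.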

Your issue is that you disabled shard allocation on those nodes, so Elasticsearch does not know where to allocate these shards.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

This should solve this problem.
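Setting the value to null removes the override and falls back to the default, which is "all". To see which value is actually in effect before and after the change, something like the following sketch may help. It assumes the response of GET _cluster/settings?flat_settings=true has been loaded into a dict; the function name `effective_allocation_enable` is hypothetical, and the sketch ignores any value set in elasticsearch.yml.

```python
SETTING = "cluster.routing.allocation.enable"


def effective_allocation_enable(settings: dict) -> str:
    """Resolve the setting from a flat GET _cluster/settings response.

    Transient settings take precedence over persistent ones; if neither
    is set, the Elasticsearch default "all" applies.
    """
    for scope in ("transient", "persistent"):
        value = settings.get(scope, {}).get(SETTING)
        if value is not None:
            return value
    return "all"


# Nothing overridden: the default applies.
print(effective_allocation_enable({"persistent": {}, "transient": {}}))

# A transient "none" wins over a persistent "all".
print(effective_allocation_enable({
    "persistent": {SETTING: "all"},
    "transient": {SETTING: "none"},
}))
```

If this already resolves to "all", allocation is not disabled and the cause has to be somewhere else.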

I already tried this:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

and also

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}

No improvement.

Are you running any reindex jobs right now? If so, stop that task and call the cluster reroute API.

Do you use `node.attr.rack:` in your elasticsearch.yml?
If so, is it configured correctly?

Also, I don't know if this is the proper way to fix such things, but this worked for me:

PUT yourredindex-*/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}

Wait 5 seconds, then:

PUT yourredindex-*/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}

After this, take a look at your monitoring in Kibana; it should show you the allocation process for the index.
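If you would rather script the two calls than paste them into the console, a rough Python sketch with only the standard library is below. The base URL http://localhost:9200, the function names, and the 5 second wait are assumptions for illustration; adjust them to your setup and add authentication if your cluster needs it.

```python
import json
import time
import urllib.request


def replicas_request(base_url: str, index_pattern: str, replicas: int) -> urllib.request.Request:
    """Build a PUT <index_pattern>/_settings request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def toggle_replicas(base_url: str, index_pattern: str, wait_seconds: float = 5.0) -> None:
    """Drop replicas to 0, wait, then set them back to 1."""
    for replicas in (0, 1):
        request = replicas_request(base_url, index_pattern, replicas)
        with urllib.request.urlopen(request) as response:
            print(replicas, response.status, response.read().decode("utf-8"))
        if replicas == 0:
            time.sleep(wait_seconds)


# Example call (needs a reachable cluster, so it is left commented out):
# toggle_replicas("http://localhost:9200", "yourredindex-*")
```

Keep in mind that this only rebuilds replica copies. On a single-node cluster a replica has no second node to go to, so an index with number_of_replicas set to 1 will stay yellow at best.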

@Moritz_Kiesewetter

I tried these previously and again now, but no breakthrough. I'm not sure whether a reindexing job is running. Can you tell me how to stop one?

Also, my exclude_lines value in filebeat.yml is exclude_lines: ['.*healthcheck.*','.*\/healthcheck\/healthcheck.do.*']

Is the regex correct?

You can check with one of these:

GET _tasks
GET _tasks?nodes=nodeId1,nodeId2
GET _tasks?nodes=nodeId1,nodeId2&actions=cluster:*

Since I'm still experimenting with filebeat.yml myself, I can't really tell you. Either way, this should not prevent your shards from being allocated.

{
  "nodes": {
    "Bewr4jriQziexcfUXZSfdg": {
      "name": "Bewr4jr",
      "transport_address": "ip:9300",
      "host": "ip",
      "ip": "ip:9300",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "attributes": {
        "ml.machine_memory": "67368890368",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20",
        "ml.enabled": "true"
      },
      "tasks": {
        "Bewr4jriQziexcfUXZSfdg:33704751": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704751,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573205987221,
          "running_time_in_nanos": 9589541098,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704754": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704754,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205987222,
          "running_time_in_nanos": 9589412869,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704751",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704752": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704752,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205987222,
          "running_time_in_nanos": 9589524962,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704751",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704753": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704753,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205987222,
          "running_time_in_nanos": 9589436078,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704751",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704790": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704790,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205993570,
          "running_time_in_nanos": 3240869689,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704788",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704791": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704791,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205993570,
          "running_time_in_nanos": 3240855342,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704788",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704788": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704788,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573205993570,
          "running_time_in_nanos": 3241014664,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704789": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704789,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205993570,
          "running_time_in_nanos": 3240956715,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704788",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704570": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704570,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205959857,
          "running_time_in_nanos": 36954140118,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704568",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704763": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704763,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7242091324,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704568": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704568,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573205959857,
          "running_time_in_nanos": 36954288974,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704569": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704569,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205959857,
          "running_time_in_nanos": 36954235064,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704568",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704766": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704766,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7241919716,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704763",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704767": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704767,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7241909491,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704763",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704764": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704764,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7242021501,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704763",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704828": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704828,
          "type": "transport",
          "action": "cluster:monitor/tasks/lists",
          "start_time_in_millis": 1573205996809,
          "running_time_in_nanos": 2126434,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704765": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704765,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7241943603,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704763",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704829": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704829,
          "type": "direct",
          "action": "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis": 1573205996809,
          "running_time_in_nanos": 2017851,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704828",
          "headers": {}
        }
      }
    }
  }
}

GET _tasks?nodes=Bewr4jr

{
  "nodes": {
    "Bewr4jriQziexcfUXZSfdg": {
      "name": "Bewr4jr",
      "transport_address": "ip:9300",
      "host": "ip",
      "ip": "ip:9300",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "attributes": {
        "ml.machine_memory": "67368890368",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20",
        "ml.enabled": "true"
      },
      "tasks": {
        "Bewr4jriQziexcfUXZSfdg:33705962": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33705962,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206173242,
          "running_time_in_nanos": 56816559069,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33705961",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33705963": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33705963,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206173243,
          "running_time_in_nanos": 56816491013,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33705961",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706344": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706344,
          "type": "transport",
          "action": "cluster:monitor/tasks/lists",
          "start_time_in_millis": 1573206230059,
          "running_time_in_nanos": 208557,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706345": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706345,
          "type": "direct",
          "action": "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis": 1573206230059,
          "running_time_in_nanos": 146254,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706344",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33705961": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33705961,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573206173242,
          "running_time_in_nanos": 56816635985,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706030": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706030,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045029571,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706028",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706031": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706031,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045015524,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706028",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706028": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706028,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045174558,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33705964": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33705964,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206173243,
          "running_time_in_nanos": 56816486930,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33705961",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706029": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706029,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045107553,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706028",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706032": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706032,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045010709,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706028",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706331": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706331,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3017143158,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706206": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706206,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206207875,
          "running_time_in_nanos": 22184191326,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706205",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706334": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706334,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3016950197,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706331",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706207": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706207,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206207875,
          "running_time_in_nanos": 22184135515,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706205",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706335": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706335,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3016937345,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706331",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706332": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706332,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3017061437,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706331",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706205": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706205,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573206207875,
          "running_time_in_nanos": 22184243692,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706333": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706333,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3016978170,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706331",
          "headers": {}
        }
      }
    }
  }
}
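Reading through task lists this long by eye is error-prone. A small Python sketch for counting the actions in a _tasks response and looking for reindex tasks is below. The helper names are made up, and the sample is a three-task excerpt of the first output above, not the full response.

```python
from collections import Counter


def action_counts(tasks_response: dict) -> Counter:
    """Count how many running tasks there are per action name."""
    counts = Counter()
    for node in tasks_response.get("nodes", {}).values():
        for task in node.get("tasks", {}).values():
            counts[task["action"]] += 1
    return counts


def tasks_matching(tasks_response: dict, fragment: str) -> list:
    """Return the ids of tasks whose action name contains `fragment`."""
    ids = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if fragment in task["action"]:
                ids.append(task_id)
    return sorted(ids)


# Excerpt of the first _tasks output above.
sample = {
    "nodes": {
        "Bewr4jriQziexcfUXZSfdg": {
            "name": "Bewr4jr",
            "tasks": {
                "Bewr4jriQziexcfUXZSfdg:33704751": {
                    "action": "indices:data/write/bulk",
                    "cancellable": False,
                },
                "Bewr4jriQziexcfUXZSfdg:33704754": {
                    "action": "indices:data/write/bulk[s]",
                    "cancellable": False,
                },
                "Bewr4jriQziexcfUXZSfdg:33704828": {
                    "action": "cluster:monitor/tasks/lists",
                    "cancellable": False,
                },
            },
        }
    }
}

print(action_counts(sample))
print(tasks_matching(sample, "reindex"))
```

Both outputs above contain only bulk writes and the task listing itself, so no reindex job shows up. If one did (its action would be indices:data/write/reindex), it could be stopped with POST _tasks/<task_id>/_cancel.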

Well, that looks good to me. Let me try some stuff in my test cluster. I'll get back to you in an hour.

I have deleted all indices other than mdcp_contact and started indexing again. The new indices are yellow, but the mdcp_contact index is still RED.
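Since the allocation explain output for mdcp_contact ends in "Too many open files", it may be worth comparing the node's open file descriptors with its limit. GET _nodes/stats/process reports both as open_file_descriptors and max_file_descriptors. Below is a sketch of reading them in Python; the function name `fd_usage` is hypothetical and the numbers in the sample are invented, not taken from this cluster.

```python
def fd_usage(node_stats: dict) -> dict:
    """Map node name -> (open, max, open/max) from a _nodes/stats/process response."""
    usage = {}
    for node in node_stats.get("nodes", {}).values():
        process = node.get("process", {})
        open_fds = process.get("open_file_descriptors")
        max_fds = process.get("max_file_descriptors")
        # Skip nodes where the values are missing or not supported.
        if open_fds is None or max_fds is None or max_fds <= 0:
            continue
        usage[node.get("name")] = (open_fds, max_fds, round(open_fds / max_fds, 3))
    return usage


# Invented numbers, only to show the shape of the response.
sample = {
    "nodes": {
        "Bewr4jriQziexcfUXZSfdg": {
            "name": "Bewr4jr",
            "process": {
                "open_file_descriptors": 4000,
                "max_file_descriptors": 4096,
            },
        }
    }
}

for name, (open_fds, max_fds, ratio) in fd_usage(sample).items():
    print(f"{name}: {open_fds}/{max_fds} file descriptors in use ({ratio:.1%})")
```

If the ratio is close to 1, the limit for the user running Elasticsearch (ulimit -n, or nofile in limits.conf) is the likely bottleneck. After raising it and restarting the node, the failed shard still has to be retried with POST /_cluster/reroute?retry_failed=true, because it has already used up its 5 allocation attempts.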