Unassigned shards and Elasticsearch health turns RED

Sundaramoorthy_Anand · November 8, 2019, 7:33am

Hi, I have a single-node cluster where my ELK stack runs on one server (say servA) and Filebeat runs on remote servers.

There are 2 remote servers:

filebeat-A (fA) on serverB (sB) and filebeat-B (fB) on serverC (sC)

These two Filebeat instances push their respective logs to servA.

It was working fine while only filebeatA was pushing logs to ELK on servA (it ran fine for 10+ days).

As soon as I started pushing logs from filebeatB with include_lines: ['regex'] (this setting is not in filebeatA), every index turned RED.

Then I stopped both fA and fB and checked index and cluster health. Everything had become red. I don't know whether that setting caused it or whether there is some other problem.

I ran the allocation explain API (GET /_cluster/allocation/explain) for mdcp_contact.
The result is:
    {
      "index": "mdcp_contact",
      "shard": 3,
      "primary": true,
      "current_state": "unassigned",
      "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "at": "2019-11-07T12:30:02.711Z",
        "failed_allocation_attempts": 5,
        "details": "failed shard on node [Bewr4jriQziexcfUXZSfdg]: failed recovery, failure RecoveryFailedException[[mdcp_contact][3]: Recovery failed on {Bewr4jr}{Bewr4jriQziexcfUXZSfdg}{Rqi93PhcTnaC4gegILxBww}{ip}{ip:9300}{ml.machine_memory=67368890368, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: FileSystemException[/opt/jboss/elk/elasticsearch-6.4.2/data/nodes/0/indices/60sjgq9vSZKv7CbTcUi6_Q/3/index/_5wu.nvd: Too many open files]; ",
        "last_allocation_status": "no"
      },
      "can_allocate": "no",
      "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
      "node_allocation_decisions": [
        {
          "node_id": "Bewr4jriQziexcfUXZSfdg",
          "node_name": "Bewr4jr",
          "transport_address": "ip:9300",
          "node_attributes": {
            "ml.machine_memory": "67368890368",
            "xpack.installed": "true",
            "ml.max_open_jobs": "20",
            "ml.enabled": "true"
          },
          "node_decision": "no",
          "store": {
            "in_sync": true,
            "allocation_id": "U4BmrwmxTNKGPPMqCMKEpA"
          },
          "deciders": [
            {
              "decider": "max_retry",
              "decision": "NO",
              "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-11-07T12:30:02.711Z], failed_attempts[5], delayed=false, details[failed shard on node [Bewr4jriQziexcfUXZSfdg]: failed recovery, failure RecoveryFailedException[[mdcp_contact][3]: Recovery failed on {Bewr4jr}{Bewr4jriQziexcfUXZSfdg}{Rqi93PhcTnaC4gegILxBww}{ip}{ip:9300}{ml.machine_memory=67368890368, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: FileSystemException[/opt/jboss/elk/elasticsearch-6.4.2/data/nodes/0/indices/60sjgq9vSZKv7CbTcUi6_Q/3/index/_5wu.nvd: Too many open files]; ], allocation_status[deciders_no]]]"
            }
          ]
        }
      ]
    }

So the shards are unassigned. I tried the reallocation and reroute APIs, but they did not help.

Any suggestions or workaround?
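
The `details` field in the explain output above already names the underlying failure, but it is buried in a chain of nested exceptions. Below is a minimal Python sketch that pulls the innermost exception out of that string. The function name `innermost_cause` is illustrative and not part of any Elasticsearch client, and the sample string is a shortened copy of the value shown above.

```python
import re

def innermost_cause(details: str) -> str:
    """Return the last 'nested: Type[message];' entry from an allocation failure string."""
    # Each nested exception is rendered as "nested: ExceptionType[message];".
    matches = re.findall(r"nested: (\w+)\[(.*?)\];", details)
    if not matches:
        # Nothing nested: fall back to the whole string.
        return details.strip()
    exc_type, message = matches[-1]
    return f"{exc_type}: {message}"

# Shortened copy of the "details" value from the explain output.
details = (
    "failed shard on node [Bewr4jriQziexcfUXZSfdg]: failed recovery, failure "
    "RecoveryFailedException[[mdcp_contact][3]: Recovery failed on {Bewr4jr}]; "
    "nested: IndexShardRecoveryException[failed to recover from gateway]; "
    "nested: FileSystemException[/opt/jboss/elk/elasticsearch-6.4.2/data/nodes/0/"
    "indices/60sjgq9vSZKv7CbTcUi6_Q/3/index/_5wu.nvd: Too many open files]; "
)

print(innermost_cause(details))
```

Here the innermost cause ends in "Too many open files", which suggests the Elasticsearch process hit its open file descriptor limit rather than an allocation setting. Once that limit has been dealt with, the explain output's own hint, a call to /_cluster/reroute?retry_failed=true, is the way to retry after the five failed attempts.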

Moritz_Kiesewetter · November 8, 2019, 8:20am

Your issue is that you disabled reallocation on those nodes, so Elasticsearch does not know where to allocate these shards.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

This should solve this problem.

Sundaramoorthy_Anand · November 8, 2019, 8:59am

I already tried this:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

also

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}

No improvement

Moritz_Kiesewetter · November 8, 2019, 9:08am

Are you running any reindex jobs right now? If so, stop that task and fire up the cluster reroute API.

Do you work with

node.attr.rack:

in your elasticsearch.yml?
If so, is it implemented correctly?

Also, I don't know if this is the proper way to fix such things, but this worked for me:

PUT yourredindex-*/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}

Wait 5 seconds, then:

PUT yourredindex-*/_settings
{
  "index" : {
    "number_of_replicas" : 1
  }
}

After this, take a look at your monitoring in Kibana; it should show the allocation process for the index.

Sundaramoorthy_Anand · November 8, 2019, 9:23am

@Moritz_Kiesewetter

I tried these before and again just now, but no breakthrough. I'm not sure about any reindexing job. Can you tell me how to stop it?

Sundaramoorthy_Anand · November 8, 2019, 9:25am

Also, my exclude_lines value in filebeat.yml is exclude_lines: ['.*healthcheck.*','.*\/healthcheck\/healthcheck.do.*']

Is the regex correct?
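
One rough way to sanity-check these patterns is to run them against a few sample lines. The sketch below uses Python's `re` module as a stand-in for Filebeat's Go regular expressions. That is an assumption: the two syntaxes agree for these simple patterns, but not in general. The sample log lines are made up for illustration.

```python
import re

# Patterns copied from the exclude_lines setting above.
patterns = [r".*healthcheck.*", r".*\/healthcheck\/healthcheck.do.*"]
compiled = [re.compile(p) for p in patterns]

def is_excluded(line: str) -> bool:
    """A line is dropped if any exclude pattern matches somewhere in it."""
    return any(rx.search(line) for rx in compiled)

# Hypothetical log lines, not taken from the real application.
samples = [
    "GET /healthcheck/healthcheck.do 200",
    "GET /api/contact/42 200",
    "healthcheck ping ok",
]

for line in samples:
    print(is_excluded(line), line)
```

Under these assumptions both patterns compile. Every line matched by the second pattern also contains "healthcheck" and so is already matched by the first, which makes the second entry redundant rather than wrong. Note also that the unescaped `.` in `healthcheck.do` matches any character, not only a literal dot.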

Moritz_Kiesewetter · November 8, 2019, 9:38am

You can check via one of these:

GET _tasks
GET _tasks?nodes=nodeId1,nodeId2
GET _tasks?nodes=nodeId1,nodeId2&actions=cluster:*
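
To look specifically for reindex jobs in a `_tasks` response, the output can be filtered by action name. Below is a minimal Python sketch, assuming the response has been loaded into a dict and that reindex tasks are reported under the action `indices:data/write/reindex`. The helper name `find_tasks` and the trimmed sample response are illustrative.

```python
def find_tasks(response: dict, action_prefix: str) -> list:
    """Collect (task_id, action) pairs whose action starts with action_prefix."""
    found = []
    for node in response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("action", "").startswith(action_prefix):
                found.append((task_id, task["action"]))
    return found

# Trimmed sample in the shape returned by GET _tasks (values are placeholders).
response = {
    "nodes": {
        "nodeId1": {
            "tasks": {
                "nodeId1:1": {"action": "indices:data/write/bulk"},
                "nodeId1:2": {"action": "indices:data/write/bulk[s]"},
                "nodeId1:3": {"action": "cluster:monitor/tasks/lists"},
            }
        }
    }
}

print(find_tasks(response, "indices:data/write/reindex"))
print(len(find_tasks(response, "indices:data/write/bulk")))
```

An empty list for the reindex prefix means no reindex job is running. The same filter can be applied on the server side with the actions parameter, for example GET _tasks?actions=*reindex, and a running task can be cancelled with POST _tasks/<task_id>/_cancel if it is marked as cancellable.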

Moritz_Kiesewetter · November 8, 2019, 9:52am

Since I'm still experimenting with filebeat.yml myself, I can't really tell you. Either way, this should not prevent your shards from being allocated.

Sundaramoorthy_Anand · November 8, 2019, 9:53am

{
  "nodes": {
    "Bewr4jriQziexcfUXZSfdg": {
      "name": "Bewr4jr",
      "transport_address": "ip:9300",
      "host": "ip",
      "ip": "ip:9300",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "attributes": {
        "ml.machine_memory": "67368890368",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20",
        "ml.enabled": "true"
      },
      "tasks": {
        "Bewr4jriQziexcfUXZSfdg:33704751": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704751,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573205987221,
          "running_time_in_nanos": 9589541098,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704754": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704754,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205987222,
          "running_time_in_nanos": 9589412869,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704751",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704752": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704752,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205987222,
          "running_time_in_nanos": 9589524962,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704751",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704753": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704753,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205987222,
          "running_time_in_nanos": 9589436078,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704751",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704790": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704790,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205993570,
          "running_time_in_nanos": 3240869689,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704788",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704791": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704791,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205993570,
          "running_time_in_nanos": 3240855342,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704788",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704788": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704788,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573205993570,
          "running_time_in_nanos": 3241014664,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704789": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704789,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205993570,
          "running_time_in_nanos": 3240956715,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704788",
          "headers": {}
        },

Sundaramoorthy_Anand · November 8, 2019, 9:54am

        "Bewr4jriQziexcfUXZSfdg:33704570": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704570,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205959857,
          "running_time_in_nanos": 36954140118,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704568",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704763": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704763,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7242091324,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704568": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704568,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573205959857,
          "running_time_in_nanos": 36954288974,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704569": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704569,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205959857,
          "running_time_in_nanos": 36954235064,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704568",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704766": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704766,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7241919716,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704763",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704767": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704767,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7241909491,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704763",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704764": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704764,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7242021501,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704763",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704828": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704828,
          "type": "transport",
          "action": "cluster:monitor/tasks/lists",
          "start_time_in_millis": 1573205996809,
          "running_time_in_nanos": 2126434,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704765": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704765,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573205989569,
          "running_time_in_nanos": 7241943603,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704763",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33704829": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33704829,
          "type": "direct",
          "action": "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis": 1573205996809,
          "running_time_in_nanos": 2017851,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33704828",
          "headers": {}
        }
      }
    }
  }
}

GET _tasks?nodes=Bewr4jr

{
  "nodes": {
    "Bewr4jriQziexcfUXZSfdg": {
      "name": "Bewr4jr",
      "transport_address": "ip:9300",
      "host": "ip",
      "ip": "ip:9300",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "attributes": {
        "ml.machine_memory": "67368890368",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20",
        "ml.enabled": "true"
      },
      "tasks": {
        "Bewr4jriQziexcfUXZSfdg:33705962": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33705962,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206173242,
          "running_time_in_nanos": 56816559069,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33705961",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33705963": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33705963,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206173243,
          "running_time_in_nanos": 56816491013,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33705961",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706344": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706344,
          "type": "transport",
          "action": "cluster:monitor/tasks/lists",
          "start_time_in_millis": 1573206230059,
          "running_time_in_nanos": 208557,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706345": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706345,
          "type": "direct",
          "action": "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis": 1573206230059,
          "running_time_in_nanos": 146254,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706344",
          "headers": {}
        },

Sundaramoorthy_Anand · November 8, 2019, 9:54am

        "Bewr4jriQziexcfUXZSfdg:33705961": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33705961,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573206173242,
          "running_time_in_nanos": 56816635985,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706030": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706030,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045029571,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706028",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706031": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706031,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045015524,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706028",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706028": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706028,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045174558,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33705964": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33705964,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206173243,
          "running_time_in_nanos": 56816486930,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33705961",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706029": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706029,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045107553,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706028",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706032": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706032,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206181014,
          "running_time_in_nanos": 49045010709,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706028",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706331": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706331,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3017143158,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706206": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706206,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206207875,
          "running_time_in_nanos": 22184191326,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706205",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706334": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706334,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3016950197,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706331",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706207": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706207,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206207875,
          "running_time_in_nanos": 22184135515,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706205",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706335": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706335,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3016937345,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706331",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706332": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706332,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3017061437,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706331",
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706205": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706205,
          "type": "transport",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1573206207875,
          "running_time_in_nanos": 22184243692,
          "cancellable": false,
          "headers": {}
        },
        "Bewr4jriQziexcfUXZSfdg:33706333": {
          "node": "Bewr4jriQziexcfUXZSfdg",
          "id": 33706333,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1573206227042,
          "running_time_in_nanos": 3016978170,
          "cancellable": false,
          "parent_task_id": "Bewr4jriQziexcfUXZSfdg:33706331",
          "headers": {}
        }
      }
    }
  }
}

Moritz_Kiesewetter · November 8, 2019, 10:11am

Well, that looks good to me. Let me try some things in my test cluster. I'll get back to you in an hour.

Sundaramoorthy_Anand · November 8, 2019, 7:03pm

I have deleted all indices other than mdcp_contact and started indexing again. The new indices become yellow, but the mdcp_contact index is still RED.

system · December 6, 2019, 7:06pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
