Fail to assign shards after restarting ES

diegoorellanaga · June 21, 2017, 8:19pm

Hi,

I'm trying to start my ES cluster, but as soon as it reaches 80% assigned shards it drops back to around 50%, and it keeps doing that indefinitely. It also takes hours to reach 80% of assigned shards.

As the logs told us that we had too many open files, we increased the maximum file descriptor limit, but this didn't solve the issue. What action do you recommend? We would like to save as much information as possible.
Thank you.
The cluster has no replicas and has 6 shards per index, with a total of 16672 shards.

The logs show the following errors:

[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([akainix-smg-2017.03.04][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[4], s[STARTED], a[id=f_aQYmUEQtOQTBfPCY8ncg]), message [master {elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true} marked shard as started, but shard has previous failed. resending shard failure.]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([itau-bluecoat-2017.06.14][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[16], s[STARTED], a[id=haNxy7spSdyQrzagSY6Hqw]), message [master {elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true} marked shard as started, but shard has previous failed. resending shard failure.]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([euroamerica-sep-2017.06.17][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[16], s[STARTED], a[id=D7iq4f8dRpeshb5vtqax7Q]), message [master [{elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true}] marked shard as started, but shard has not been created, mark shard as failed]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.03.04][3], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[3], s[INITIALIZING], a[id=vM4Mwu_0RYGfnVzsz4ZzPQ], unassigned_info[[reason=CLUSTER_RECOVERED], at[2017-06-21T14:30:54.236Z]]), message [failed recovery]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.03.04][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[3], s[INITIALIZING], a[id=fnnETNxfRA2lB3vvY0_F2w], unassigned_info[[reason=CLUSTER_RECOVERED], at[2017-06-21T14:30:54.236Z]]), message [failed recovery]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.06.18][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[19], s[INITIALIZING], a[id=if-xP-xSSLq3ZytLcYAVeQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-21T16:36:18.410Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: FileSystemException[/datos/elasticsearch/reportes/nodes/0/indices/anonimo-monitoreo-2017.06.18/0/_state: Too many open files]; ]]), message [failed recovery]]
[2017-06-21 14:40:52,779][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([itau-monitoreo-2017.03.09][3], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[15], s[INITIALIZING], a[id=Jtzidiw2TmONLM3z8-wneg], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-21T16:36:18.410Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: FileSystemException[/datos/elasticsearch/reportes/nodes/0/indices/itau-monitoreo-2017.03.09/3/_state: Too many open files]; ]]), message [failed recovery]]

My ES configuration is:

{
  "cluster_name": "reportes",
  "nodes": {
    "WE4NoGaPRXW8qMiXTo8iAg": {
      "timestamp": 1498074992865,
      "name": "elastic-3",
      "transport_address": "172.16.87.60:9300",
      "host": "172.16.87.60",
      "ip": [
        "172.16.87.60:9300",
        "NONE"
      ],
      "attributes": {
        "master": "true"
      },
      "process": {
        "timestamp": 1498074992865,
        "open_file_descriptors": 62974,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 22,
          "total_in_millis": 30145070
        },
        "mem": {
          "total_virtual_in_bytes": 131429810176
        }
      }
    },
    "uu1xA_1qTsaChCfyCm11Ig": {
      "timestamp": 1498074994107,
      "name": "elastic-gui",
      "transport_address": "172.16.87.64:9301",
      "host": "172.16.87.64",
      "ip": [
        "172.16.87.64:9301",
        "NONE"
      ],
      "attributes": {
        "data": "false",
        "master": "false"
      },
      "process": {
        "timestamp": 1498074994107,
        "open_file_descriptors": 380,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 4,
          "total_in_millis": 6583920
        },
        "mem": {
          "total_virtual_in_bytes": 10332991488
        }
      }
    },
    "pSWuGFrOQPCvX4ykrzcV0Q": {
      "timestamp": 1498074992869,
      "name": "elastic-2",
      "transport_address": "172.16.87.58:9300",
      "host": "172.16.87.58",
      "ip": [
        "172.16.87.58:9300",
        "NONE"
      ],
      "attributes": {
        "master": "true"
      },
      "process": {
        "timestamp": 1498074992869,
        "open_file_descriptors": 65425,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 27,
          "total_in_millis": 41594260
        },
        "mem": {
          "total_virtual_in_bytes": 135810736128
        }
      }
    },
    "jXLTdv8ISEmEEHMUJWSnWQ": {
      "timestamp": 1498074992869,
      "name": "elastic-1",
      "transport_address": "172.16.87.59:9300",
      "host": "172.16.87.59",
      "ip": [
        "172.16.87.59:9300",
        "NONE"
      ],
      "attributes": {
        "master": "true"
      },
      "process": {
        "timestamp": 1498074992869,
        "open_file_descriptors": 47658,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 22,
          "total_in_millis": 40894600
        },
        "mem": {
          "total_virtual_in_bytes": 112791838720
        }
      }
    }
  }
}

My ES cluster health:

{
  "cluster_name": "reportes",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 4,
  "number_of_data_nodes": 3,
  "active_primary_shards": 8334,
  "active_shards": 8334,
  "relocating_shards": 0,
  "initializing_shards": 12,
  "unassigned_shards": 8326,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 291067,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 341159,
  "active_shards_percent_as_number": 49.9880038387716
}
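
For readers following along, the numbers above can be put side by side. This is only a rough sketch: it assumes that open file descriptors grow roughly in proportion to the number of started shards and that shards end up evenly spread over the three data nodes, neither of which is confirmed in this thread. The helper names are made up for illustration; the figures are copied from the stats and health output above.

```python
# Figures copied from the node stats and cluster health posted above.
MAX_FDS = 65535
OPEN_FDS = {"elastic-1": 47658, "elastic-2": 65425, "elastic-3": 62974}
ACTIVE_SHARDS = 8334
TOTAL_SHARDS = 8334 + 12 + 8326  # active + initializing + unassigned


def fd_usage(open_fds, max_fds):
    """Fraction of the descriptor limit each data node is currently using."""
    return {node: count / max_fds for node, count in open_fds.items()}


def projected_fds_per_node(open_fds, active_shards, total_shards):
    """Linear extrapolation (an assumption, not a measurement):
    average descriptors per started shard, scaled to all shards,
    divided evenly across the data nodes."""
    per_shard = sum(open_fds.values()) / active_shards
    return per_shard * total_shards / len(open_fds)


for node, used in sorted(fd_usage(OPEN_FDS, MAX_FDS).items()):
    print(f"{node}: {used:.1%} of {MAX_FDS} descriptors in use")

estimate = projected_fds_per_node(OPEN_FDS, ACTIVE_SHARDS, TOTAL_SHARDS)
print(f"estimated descriptors per node with all shards started: {estimate:.0f}")
```

Under those assumptions, two of the three data nodes are already within a few percent of the limit with only half of the shards started, and the extrapolated requirement for all 16672 shards is well above 65535 per node. That would be consistent with shards failing with "Too many open files" and the assigned percentage falling back.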

Christian_Dahlqvist · June 21, 2017, 8:29pm

That is a very large number of shards for a cluster that size. I would recommend reconsidering your sharding strategy in order to reduce the number of shards/indices in the cluster significantly.
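
To give a feel for how much a different sharding strategy could change the total, here is a back-of-the-envelope sketch. It uses only the two figures from the original post (16672 shards, 6 shards per index); the alternatives it compares (1 shard per index, and merging 30 daily indices into one) are hypothetical examples, not a recommendation made in this thread, and `estimate_shards` is an illustrative helper.

```python
import math

TOTAL_SHARDS = 16672     # from the original post
SHARDS_PER_INDEX = 6     # from the original post


def estimate_shards(total_shards, shards_per_index,
                    new_shards_per_index=1, indices_merged=1):
    """Rough shard count after changing the shards per index and
    merging `indices_merged` existing indices into one new index."""
    indices = math.ceil(total_shards / shards_per_index)
    new_indices = math.ceil(indices / indices_merged)
    return new_indices * new_shards_per_index


# Same indices, but 1 primary shard each instead of 6.
print(estimate_shards(TOTAL_SHARDS, SHARDS_PER_INDEX))

# Hypothetical: roughly monthly indices (30 daily ones merged), 1 shard each.
print(estimate_shards(TOTAL_SHARDS, SHARDS_PER_INDEX, indices_merged=30))
```

Whether single-shard or larger-period indices are appropriate depends on index sizes, which are not given here.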

diegoorellanaga · June 21, 2017, 8:41pm

Thank you, we will do that. But how can we start deleting indexes directly, without the API, in order to reduce the number of shards? We tried deleting through the API with the following command:

curl -XDELETE 'http://localhost:9200/xxx-xxxx-2016.12.13'
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (delete-index [xxx-xxxx-2016.12.13]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (delete-index [xxx-xxxx-2016.12.13]) within 30s"},"status":503}

Do you recommend deleting the indexes directly from the file system? Is there any other way you think we can stabilize ES?

Thank you.
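
One thing that may be worth trying before touching the file system: the delete index API accepts a `master_timeout` query parameter, and the 30s in the error above looks like its default. The sketch below only builds the request; the `5m` value, the host, and the `build_delete_request` helper are assumptions for illustration. With 291067 pending tasks on the master, a longer timeout may still not be enough.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_delete_request(index, host="http://localhost:9200",
                         master_timeout="5m"):
    """Build (but do not send) a DELETE request for one index,
    asking the master to wait longer than the default before giving up."""
    query = urlencode({"master_timeout": master_timeout})
    url = f"{host}/{quote(index, safe='')}?{query}"
    return Request(url, method="DELETE")


req = build_delete_request("xxx-xxxx-2016.12.13")
print(req.get_method(), req.full_url)

# To actually send it (not done here), something like:
#   from urllib.request import urlopen
#   with urlopen(req, timeout=600) as resp:
#       print(resp.read().decode())
```

The equivalent with curl would be the same command as above with `?master_timeout=5m` appended to the URL.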

system · July 19, 2017, 8:41pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
