Failing to assign shards after restarting ES

Hi,

I'm trying to start my ES cluster, but as soon as it reaches about 80% of assigned shards it drops back to around 50%, and it keeps doing that indefinitely. It also takes hours to reach the 80% mark.

The logs told us that we had too many open files, so we increased the max file descriptor limit, but this didn't solve the issue. What action do you recommend? We would like to save as much information as possible.
Thank you.
The cluster has no replicas and 6 shards per index, for a total of 16,672 shards.
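
For reference, this is roughly how we raised the file descriptor limit mentioned above (the user name and values are placeholders, not our exact settings): we added lines like the following to /etc/security/limits.conf and restarted the nodes,

elasticsearch  soft  nofile  131072
elasticsearch  hard  nofile  131072

and then checked the effective limit in a new shell for that user with:

ulimit -n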

The logs show the following errors:

[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([akainix-smg-2017.03.04][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[4], s[STARTED], a[id=f_aQYmUEQtOQTBfPCY8ncg]), message [master {elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true} marked shard as started, but shard has previous failed. resending shard failure.]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([itau-bluecoat-2017.06.14][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[16], s[STARTED], a[id=haNxy7spSdyQrzagSY6Hqw]), message [master {elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true} marked shard as started, but shard has previous failed. resending shard failure.]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([euroamerica-sep-2017.06.17][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[16], s[STARTED], a[id=D7iq4f8dRpeshb5vtqax7Q]), message [master [{elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true}] marked shard as started, but shard has not been created, mark shard as failed]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.03.04][3], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[3], s[INITIALIZING], a[id=vM4Mwu_0RYGfnVzsz4ZzPQ], unassigned_info[[reason=CLUSTER_RECOVERED], at[2017-06-21T14:30:54.236Z]]), message [failed recovery]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.03.04][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[3], s[INITIALIZING], a[id=fnnETNxfRA2lB3vvY0_F2w], unassigned_info[[reason=CLUSTER_RECOVERED], at[2017-06-21T14:30:54.236Z]]), message [failed recovery]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.06.18][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[19], s[INITIALIZING], a[id=if-xP-xSSLq3ZytLcYAVeQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-21T16:36:18.410Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: FileSystemException[/datos/elasticsearch/reportes/nodes/0/indices/anonimo-monitoreo-2017.06.18/0/_state: Too many open files]; ]]), message [failed recovery]]
[2017-06-21 14:40:52,779][ERROR][cluster.action.shard     ] [elastic-1] unexpected failure during [shard-failed ([itau-monitoreo-2017.03.09][3], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[15], s[INITIALIZING], a[id=Jtzidiw2TmONLM3z8-wneg], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-21T16:36:18.410Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: FileSystemException[/datos/elasticsearch/reportes/nodes/0/indices/itau-monitoreo-2017.03.09/3/_state: Too many open files]; ]]), message [failed recovery]]
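
To double-check whether a node is actually hitting its limit, we can also look at the Elasticsearch process directly (PID is a placeholder for the process id, e.g. from pgrep -f elasticsearch):

cat /proc/PID/limits | grep 'open files'
ls /proc/PID/fd | wc -l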

My ES node stats are:

{
  "cluster_name": "reportes",
  "nodes": {
    "WE4NoGaPRXW8qMiXTo8iAg": {
      "timestamp": 1498074992865,
      "name": "elastic-3",
      "transport_address": "172.16.87.60:9300",
      "host": "172.16.87.60",
      "ip": [
        "172.16.87.60:9300",
        "NONE"
      ],
      "attributes": {
        "master": "true"
      },
      "process": {
        "timestamp": 1498074992865,
        "open_file_descriptors": 62974,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 22,
          "total_in_millis": 30145070
        },
        "mem": {
          "total_virtual_in_bytes": 131429810176
        }
      }
    },
    "uu1xA_1qTsaChCfyCm11Ig": {
      "timestamp": 1498074994107,
      "name": "elastic-gui",
      "transport_address": "172.16.87.64:9301",
      "host": "172.16.87.64",
      "ip": [
        "172.16.87.64:9301",
        "NONE"
      ],
      "attributes": {
        "data": "false",
        "master": "false"
      },
      "process": {
        "timestamp": 1498074994107,
        "open_file_descriptors": 380,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 4,
          "total_in_millis": 6583920
        },
        "mem": {
          "total_virtual_in_bytes": 10332991488
        }
      }
    },
    "pSWuGFrOQPCvX4ykrzcV0Q": {
      "timestamp": 1498074992869,
      "name": "elastic-2",
      "transport_address": "172.16.87.58:9300",
      "host": "172.16.87.58",
      "ip": [
        "172.16.87.58:9300",
        "NONE"
      ],
      "attributes": {
        "master": "true"
      },
      "process": {
        "timestamp": 1498074992869,
        "open_file_descriptors": 65425,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 27,
          "total_in_millis": 41594260
        },
        "mem": {
          "total_virtual_in_bytes": 135810736128
        }
      }
    },
    "jXLTdv8ISEmEEHMUJWSnWQ": {
      "timestamp": 1498074992869,
      "name": "elastic-1",
      "transport_address": "172.16.87.59:9300",
      "host": "172.16.87.59",
      "ip": [
        "172.16.87.59:9300",
        "NONE"
      ],
      "attributes": {
        "master": "true"
      },
      "process": {
        "timestamp": 1498074992869,
        "open_file_descriptors": 47658,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 22,
          "total_in_millis": 40894600
        },
        "mem": {
          "total_virtual_in_bytes": 112791838720
        }
      }
    }
  }
}
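
(The stats above were pulled with the nodes stats API; on a local node with the default HTTP port, that is roughly:)

curl -XGET 'http://localhost:9200/_nodes/stats/process?pretty'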

My ES cluster health:

{
  "cluster_name": "reportes",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 4,
  "number_of_data_nodes": 3,
  "active_primary_shards": 8334,
  "active_shards": 8334,
  "relocating_shards": 0,
  "initializing_shards": 12,
  "unassigned_shards": 8326,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 291067,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 341159,
  "active_shards_percent_as_number": 49.9880038387716
}
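
(Taken from the cluster health API:)

curl -XGET 'http://localhost:9200/_cluster/health?pretty'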

That is a very large number of shards for a cluster that size. I would recommend reconsidering your sharding strategy in order to reduce the number of shards/indices in the cluster significantly.
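
For example, one way to lower the shard count for newly created daily indices is an index template that overrides the default (a sketch only; the template name, pattern, and shard count below are examples to adapt):

curl -XPUT 'http://localhost:9200/_template/fewer_shards' -d '{
  "template": "*",
  "settings": {
    "index.number_of_shards": 1
  }
}'

Existing daily indices could then be consolidated (for example into monthly indices) and the old ones deleted.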

Thank you, we will do that, but how can we start deleting indices directly, without the API, in order to reduce the number of shards? We were trying to delete using the API with the following command:

curl -XDELETE 'http://localhost:9200/xxx-xxxx-2016.12.13'
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (delete-index [xxx-xxxx-2016.12.13]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (delete-index [xxx-xxxx-2016.12.13]) within 30s"},"status":503}

Do you recommend deleting the indices directly from the file system? Is there any other way you think we can stabilize the ES cluster?
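
Would retrying the same delete with a longer timeout help once the pending task queue drains? For example (index name masked as above):

curl -XDELETE 'http://localhost:9200/xxx-xxxx-2016.12.13?master_timeout=10m&timeout=10m'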

Thank you.