Elasticsearch cluster status RED and not responding

Hi friends,

My Elasticsearch cluster has not been responding and its status has been RED for the last few days. It cannot allocate the unassigned shards. The issue started after I restarted the primary instance, and I am also unable to delete the unassigned shards. Can you please help?

Nodes in the cluster:

```
62 69 0.27 d m elastic-search-60586867-3-183395336.major-qa.graylog.dfwqa2.qa.com
46 87 0.00 c - graylog-c4884382-571a-4490-90e5-01a125578189
64 80 1.09 d m elastic-search-60586867-1-183395330.major-qa.graylog.dfwqa2.qa.com
65 89 0.00 c - graylog-ae9e0ce7-07d7-44d5-ad65-2e01d842f314
82 59 1.15 d m elastic-search-60586867-2-183395333.major-qa.graylog.dfwqa2.qa.com
32 86 0.22 c - graylog-82a8d493-012a-4894-b749-3b2ac276b708
40 68 0.62 d m elastic-search-60586867-4-190299637.major-qa.graylog.dfwqa2.qa.com
49 67 1.13 d * elastic-search-60586867-5-190299640.major-qa.graylog.dfwqa2.qa.com
25 86 0.17 c - graylog-e400e5f3-9f72-41c8-8ef3-b57ba5270693
```

and here is the cluster health status I see:

```
{
  "cluster_name" : "elasticsearch-major-qa",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 14718,
  "active_shards" : 29436,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 8220,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 78.17080943275973
}
```

So? 37,656 shards on 5 nodes?
That's around 7,500 shards per node!

That's like running 7500 MySQL instances on 1 machine. Would you really do that?

You must reduce that number or increase the number of nodes.
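For reference, the per-node figure follows directly from the health output quoted above (a quick sanity check; all numbers are copied from this thread):

```shell
# Shard arithmetic straight from the _cluster/health output above
active=29436        # shards currently allocated (primaries + replicas)
unassigned=8220     # shards the cluster cannot place
data_nodes=5

total=$((active + unassigned))
echo "total shards:    $total"                    # 37656
echo "shards per node: $((total / data_nodes))"   # 7531 (integer division)
```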

BTW, you didn't mention the version or the hardware details.

I am using Elasticsearch 2.4.1 on CentOS 6.7, with 130 GB of disk and 16 GB of memory on each node.

Yes, I have 37K shards on 5 nodes. We are using OneOps to provision the environment and configuration.

Can you please point me to how to reduce them? I am OK with losing some or even all of the data. I haven't configured anything to create so many shards, and I want to keep the number to a minimum. I am using Filebeat to ship the logs to Elasticsearch regularly.

Change the number of shards to 1 per index, use a shorter retention period, use the rollover API instead... Many choices. Maybe a combination of all of them?
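For the first option, on Elasticsearch 2.x you could PUT an index template so that newly created indices get a single shard. This is only a sketch: the template name `graylog-custom` and the `graylog_*` pattern are assumptions from the index names in this thread, and Graylog manages its own index template, so its index rotation/retention settings are usually the proper place to change this. It also affects only indices created after the template exists, not the existing ones.

```shell
# Sketch: template matching the graylog_* indices seen above.
# Name and pattern are assumptions; Graylog's own settings may override this.
curl -XPUT 'localhost:9200/_template/graylog-custom' -d '{
  "template": "graylog_*",
  "settings": {
    "number_of_shards": 1
  }
}'
```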

So you have only 8 GB of heap, right? That might not be enough for so many shards.
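For scale: a rule of thumb Elastic published for later versions (so treat it as an order-of-magnitude guide here, not a hard limit for 2.4) is to stay under roughly 20 shards per GB of JVM heap:

```shell
# Capacity estimate, assuming the (later-published) Elastic guideline
# of at most ~20 shards per GB of JVM heap.
data_nodes=5
heap_gb=8                              # per node, half of the 16 GB RAM
echo $((data_nodes * heap_gb * 20))    # ~800 shards comfortable, vs ~37,656 present
```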

BTW, about your specific problem: do you see any errors in the logs?

I don't see any errors in the logs. When I restart the complete cluster, it starts allocating all the 32K shards and then stops allocating at some point, without any errors.

When I tried to delete a few indices using the curl command below, it didn't help:

```
$ curl -XDELETE 'http://localhost:9200/graylog_57529/'
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (delete-index [graylog_57529]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (delete-index [graylog_57529]) within 30s"},"status":503}
```

```
$ curl -XDELETE 'localhost:9200/graylog_57286?pretty'
{
  "acknowledged" : false
}
[app@elastic-search-60586867-5-190299640 ~]$ curl -XDELETE 'localhost:9200/graylog_57287?pretty'
{
  "error" : {
    "root_cause" : [ {
      "type" : "process_cluster_event_timeout_exception",
      "reason" : "failed to process cluster event (delete-index [graylog_57287]) within 30s"
    } ],
    "type" : "process_cluster_event_timeout_exception",
    "reason" : "failed to process cluster event (delete-index [graylog_57287]) within 30s"
  },
  "status" : 503
}
```

And if I delete indices manually from the data directory and restart the cluster, it restores the deleted ones, I guess from the replicas. Can you please suggest how to delete the indices using curl, or manually, so that I can bring the cluster up and then work on Elasticsearch management?
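On the timeouts: the 503 means the delete did not get through the master's pending-task queue within the default 30 s; the delete may still complete later. A sketch of two things that might help, assuming the index naming shown in this thread: give the master more time via the `master_timeout` parameter, and delete many indices in one cluster-state update with a wildcard (wildcards work unless `action.destructive_requires_name` is enabled):

```shell
# Allow the master up to 10 minutes to process the delete
curl -XDELETE 'localhost:9200/graylog_57287?master_timeout=10m&pretty'

# Or remove a whole range of old indices in a single request
curl -XDELETE 'localhost:9200/graylog_57*?master_timeout=10m&pretty'
```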

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style, i.e. wrap your code in triple backticks (```).
Can you try this: https://www.elastic.co/guide/en/elasticsearch/reference/5.2/cluster-allocation-explain.html
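Note that the allocation explain API linked above was only added in Elasticsearch 5.0, so it won't exist on 2.4.1. A rough 2.x equivalent, as far as I recall, is listing the unassigned shards and their reason via `_cat/shards` (the `unassigned.reason` column):

```shell
# On 5.0+: ask the cluster why a shard is unassigned
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'

# On 2.4: list unassigned shards with the reason they are unassigned
curl -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
```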

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.