Elasticsearch unassigned shards

I have set up ES and Kibana to monitor some servers. We have roughly 150 servers that need to be monitored using Winlogbeat. I set it up and tested it with 2 fairly active servers, and it was working fine, so we decided to throw an extra 10 servers into ES using Winlogbeat. At this point, ES and Kibana died and are displaying status Red, as seen here: http://pastebin.com/0DDqi0fQ

The output of curl -XGET "http://192.168.60.90:9200/_cluster/health/?level=indices" is: http://www.pastebin.com/y0jLHDkP

I tried using curl -XGET 192.168.60.90:9200/_cat/recovery?v to see what was going on. There are thousands of entries and they are all yellow except for a few reds. Here is the output: http://pastebin.com/TC3b0A9X

One final command I found online that seems useful (though I'm not sure how to interpret it) is curl -XGET "http://192.168.60.90:9200/_cluster/state/routing_table,routing_node".
I got the following: http://www.pastebin.com/LQXh747y
This is just a snippet of that output, but as you can see there are several unassigned shards.

So, from what I can understand, some of the Winlogbeat indices have shards that have not been assigned. What's the fix here? Thanks.


Hi @brandonmcgrath1,

Please show the output of:

curl -XGET "http://127.0.0.1:9200/_cat/recovery?v&active_only=true"
curl -XGET "http://127.0.0.1:9200/_cat/pending_tasks?v"

Also, having a look at the Elasticsearch log file will be extremely helpful - if you can paste it, great.

I'm assuming you only have 1 elasticsearch node, correct?

Also, check disk space on this Elasticsearch node.
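For example, something like this shows disk use per node as Elasticsearch sees it (using the node address from your post; adjust as needed):

curl -XGET "http://192.168.60.90:9200/_cat/allocation?v"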

I think that is because of your unassigned shards. To make your ES status green, you can either:

  1. Add another node, which will take care of your replicas,
    or
  2. Set the number of replicas to zero:

curl -XPUT 'localhost:9200/winlogbeat*/_settings' -d ' { "index" : { "number_of_replicas" : 0 } }'
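Note that the above only changes existing indices - new daily winlogbeat indices will still be created with 1 replica. If you stay on a single node, an index template along these lines (the template name and pattern here are just examples) should make 0 replicas the default for new indices as well:

curl -XPUT 'localhost:9200/_template/winlogbeat_no_replicas' -d '{
  "template" : "winlogbeat-*",
  "settings" : { "number_of_replicas" : 0 }
}'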

curl -XGET "http://192.168.60.90:9200/_cat/recovery?v&active_only=true"

index shard time type stage source_host target_host repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog

curl -XGET "http://192.168.60.90:9200/_cat/pending_tasks?v"

insertOrder timeInQueue priority source

Yeah, I am using only 1 node. I'm not sure about disk space, but I think this is the relevant section from the node stats (the 2 GB heap = 50% of the RAM):

"jvm" : {
        "timestamp" : 1469087118033,
        "uptime_in_millis" : 58953671,
        "mem" : {
          "heap_used_in_bytes" : 1966507496,
          "heap_used_percent" : 94,
          "heap_committed_in_bytes" : 2075918336,
          "heap_max_in_bytes" : 2075918336,
          "non_heap_used_in_bytes" : 100099704,
          "non_heap_committed_in_bytes" : 102301696,
          "pools" : {
            "young" : {
              "used_in_bytes" : 509946200,
              "max_in_bytes" : 572653568,
              "peak_used_in_bytes" : 572653568,
              "peak_max_in_bytes" : 572653568
            },
            "survivor" : {
              "used_in_bytes" : 35278656,
              "max_in_bytes" : 71565312,
              "peak_used_in_bytes" : 71565312,
              "peak_max_in_bytes" : 71565312
            },
            "old" : {
              "used_in_bytes" : 1421282640,
              "max_in_bytes" : 1431699456,
              "peak_used_in_bytes" : 1431699456,
              "peak_max_in_bytes" : 1431699456
            }
          }

It seems like you might be running out of resources. How many shards do you have in the cluster? What is the average shard size?
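If you are not sure, something along these lines gives a quick overview (active_shards in _cluster/health is the total count):

curl -XGET "http://192.168.60.90:9200/_cat/indices?v"
curl -s -XGET "http://192.168.60.90:9200/_cat/shards" | wc -l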

@brandonmcgrath1,

I agree with @Christian_Dahlqvist - the heap usage percentage looks far too high. It's possible that the node ran out of memory, causing the problems.
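Assuming you are on the 2.x series (which your output suggests), the heap is normally raised via the ES_HEAP_SIZE environment variable, or the same variable in /etc/default/elasticsearch or /etc/sysconfig/elasticsearch for package installs - treat this as a sketch and adapt it to how your node is started:

export ES_HEAP_SIZE=4g    # example value only; keep the heap at or below ~50% of physical RAM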

Since there are no pending tasks or active recoveries, and shards are showing as unassigned, the logs should indicate why the shards cannot be started.

Having only 1 node will cause a yellow status, but should not cause red. Setting the number of replicas to 0 as @Ravi_Shanker_Reddy recommends will reduce the number of unassigned shards (replicas only) shown in your health check and in _cat/indices, making it easier for you to find the real shards with the issue.

Once you know which shards are problematic (and you have increased the Java heap to prevent this problem in the future), you can either delete those indices or perform an empty reroute:

curl -XPOST 'localhost:9200/_cluster/reroute?pretty&explain'

Please paste the output of the above command and tell me if it fixes any further unassigned shards.

As a last resort, if you don't want to delete the red indices, you can try to partially recover an index by forcing a primary shard allocation of a particular red shard, which can be identified via _cat/shards.
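For example, something like this should list just the shards that are not started (the grep is only a convenience filter):

curl -XGET 'localhost:9200/_cat/shards?v' | grep -v STARTED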

DANGER:

specifying allow_primary will result in data loss

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands" : [
        {
          "allocate" : {
              "index" : "my_index", "shard" : 1, "node" : "my_node_name", "allow_primary" : true
          }
        }
    ]
}'

HTH

contents of reroute?pretty&explain:
{ "acknowledged" : true, "state" : { "version" : 5455, "state_uuid" : "1k6uNcfHS9S1q1jDVAK6Vg", "master_node" : "QIiz4oEdTWCpLt6U8Yu05A", "blocks" : { }, "nodes" : { "QIiz4oEdTWCpLt6U8Yu05A" : { "name" : "node-1", "transport_address" : "192.168.60.90:9300", "attributes" : { } } }, "routing_table" : { "indices" : { "winlogbeat-2014.10.29" : { "shards" : { "1" : [ { "state" : "STARTED", "primary" : true, "node" : "QIiz4oEdTWCpLt6U8Yu05A", "relocating_node" : null, "shard" : 1, "index" : "winlogbeat-2014.10.29", "version" : 2, "allocation_id" : { "id" : "iugSdyWrTEC8yWqPqXD2Ng" } } ], "2" : [ { "state" : "STARTED", "primary" : true, "node" : "QIiz4oEdTWCpLt6U8Yu05A", "relocating_node" : null, "shard" : 2, "index" : "winlogbeat-2014.10.29", "version" : 2, "allocation_id" : { "id" : "ztTDXQLPQgG6gO71kza7zA" }

The contents were enormous, but it's basically a repetition of "state" : "STARTED" down to the allocation ID, with different indices.
JVM Heap size is at 2GB

From _cat/indices, everything is green except for several yellows, which are:
yellow open winlogbeat-2016.07.22 5 1 130217 0 82.6mb 82.6mb
yellow open topbeat-2016.07.21 5 1 345894 0 95.7mb 95.7mb
yellow open topbeat-2016.07.20 5 1 124225 0 34.4mb 34.4mb
yellow open topbeat-2016.07.22 5 1 121041 0 32.2mb 32.2mb
yellow open .kibana 1 1 3 0 16.7kb 16.7kb
These are among hundreds of greens.
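A quick way to pull just the non-green indices back out, in case it helps (the grep is only a convenience):

curl -s "http://192.168.60.90:9200/_cat/indices?v" | grep -v green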

cluster health has changed from red to yellow:
{ "cluster_name" : "eTech_cluster", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5091, "active_shards" : 5091, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 21, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 99.58920187793427 }