Elasticsearch unassigned shards

I have set up ES and Kibana to monitor some servers. We have roughly 150 servers that need to be monitored using Winlogbeat. I set it up and tested it with 2 fairly active servers, and it was working fine, so we decided to throw an extra 10 servers into ES using Winlogbeat. At this point, ES and Kibana died and are displaying status Red, as seen here: http://pastebin.com/0DDqi0fQ

The output of curl -XGET "http://192.168.60.90:9200/_cluster/health/?level=indices" is: http://www.pastebin.com/y0jLHDkP

I tried using curl -XGET 192.168.60.90:9200/_cat/recovery?v to see what was going on. There are thousands of entries and they are all yellow except for a few reds. Here is the output: http://pastebin.com/TC3b0A9X

One final command I found online that seems useful (though I'm not sure how to interpret it) is curl -XGET "http://192.168.60.90:9200/_cluster/state/routing_table,routing_node".
I got the following: http://www.pastebin.com/LQXh747y
This is just a snippet of that output, but as you can see there are several unassigned shards.

So, from what I can understand, some of the Winlogbeat indices have shards that have not been assigned. What's the fix here? Thanks.


Hi @brandonmcgrath1,

Please show the output of:

curl -XGET "http://127.0.0.1:9200/_cat/recovery?v&active_only=true"
curl -XGET "http://127.0.0.1:9200/_cat/pending_tasks?v"

Also, having a look at the Elasticsearch log file will be extremely helpful - if you can paste it, great.

I'm assuming you only have 1 elasticsearch node, correct?

Also, check disk space on this Elasticsearch node.
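For example, something like this shows disk use per node as Elasticsearch sees it (using the node address from your post; adjust as needed):

curl -XGET "http://192.168.60.90:9200/_cat/allocation?v"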

I think that is because of your unassigned shards. To make your ES status green, you can either:

  1. Add another node, which will take care of your replicas,
    or
  2. Set the number of replicas to zero:

curl -XPUT 'localhost:9200/winlogbeat*/_settings' -d ' { "index" : { "number_of_replicas" : 0 } }'
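Note that the above only changes existing indices - new daily winlogbeat indices will still be created with 1 replica. If you stay on a single node, an index template along these lines (the template name and pattern here are just examples) should make 0 replicas the default for new indices as well:

curl -XPUT 'localhost:9200/_template/winlogbeat_no_replicas' -d '{
  "template" : "winlogbeat-*",
  "settings" : { "number_of_replicas" : 0 }
}'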

curl -XGET "http://192.168.60.90:9200/_cat/recovery?v&active_only=true"

index shard time type stage source_host target_host repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog

curl -XGET "http://192.168.60.90:9200/_cat/pending_tasks?v"

insertOrder timeInQueue priority source

Yeah, I am using only 1 node. I'm not sure about disk space, but I think this is the relevant section from the node stats (the 2 GB heap = 50% of the RAM):

"jvm" : {
        "timestamp" : 1469087118033,
        "uptime_in_millis" : 58953671,
        "mem" : {
          "heap_used_in_bytes" : 1966507496,
          "heap_used_percent" : 94,
          "heap_committed_in_bytes" : 2075918336,
          "heap_max_in_bytes" : 2075918336,
          "non_heap_used_in_bytes" : 100099704,
          "non_heap_committed_in_bytes" : 102301696,
          "pools" : {
            "young" : {
              "used_in_bytes" : 509946200,
              "max_in_bytes" : 572653568,
              "peak_used_in_bytes" : 572653568,
              "peak_max_in_bytes" : 572653568
            },
            "survivor" : {
              "used_in_bytes" : 35278656,
              "max_in_bytes" : 71565312,
              "peak_used_in_bytes" : 71565312,
              "peak_max_in_bytes" : 71565312
            },
            "old" : {
              "used_in_bytes" : 1421282640,
              "max_in_bytes" : 1431699456,
              "peak_used_in_bytes" : 1431699456,
              "peak_max_in_bytes" : 1431699456
            }
          }

It seems like you might be running out of resources. How many shards do you have in the cluster? What is the average shard size?
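If you are not sure, something along these lines gives a quick overview (active_shards in _cluster/health is the total count):

curl -XGET "http://192.168.60.90:9200/_cat/indices?v"
curl -s -XGET "http://192.168.60.90:9200/_cat/shards" | wc -l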

@brandonmcgrath1,

I agree with @Christian_Dahlqvist - the heap usage percentage looks far too high. It's possible that the node ran out of memory, causing the problems.
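Assuming you are on the 2.x series (which your output suggests), the heap is normally raised via the ES_HEAP_SIZE environment variable, or the same variable in /etc/default/elasticsearch or /etc/sysconfig/elasticsearch for package installs - treat this as a sketch and adapt it to how your node is started:

export ES_HEAP_SIZE=4g    # example value only; keep the heap at or below ~50% of physical RAM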

Since there are no pending tasks or active recoveries, and shards are showing as unassigned, the logs should indicate why the shards cannot be started.

Having only 1 node will cause a yellow status, but should not cause red. Setting the number of replicas to 0 as @Ravi_Shanker_Reddy recommends will reduce the number of unassigned shards (replicas only) shown in your health check and in _cat/indices, making it easier for you to find the real shards with the issue.

Once you know which shards are problematic (and you have increased the Java heap to prevent this problem in the future), you can either delete those indices or perform an empty reroute:

curl -XPOST 'localhost:9200/_cluster/reroute?pretty&explain'

Please paste the output of the above command and tell me if it fixes any further unassigned shards.

As a last resort, if you don't want to delete the red indices, you can try to partially recover an index by forcing a primary shard allocation of a particular red shard, which can be identified via _cat/shards.
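For example, something like this should list just the shards that are not started (the grep is only a convenience filter):

curl -XGET 'localhost:9200/_cat/shards?v' | grep -v STARTED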

DANGER:

specifying allow_primary will result in data loss

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands" : [
        {
          "allocate" : {
              "index" : "my_index", "shard" : 1, "node" : "my_node_name", "allow_primary" : true
          }
        }
    ]
}'

HTH

contents of reroute?pretty&explain:
{ "acknowledged" : true, "state" : { "version" : 5455, "state_uuid" : "1k6uNcfHS9S1q1jDVAK6Vg", "master_node" : "QIiz4oEdTWCpLt6U8Yu05A", "blocks" : { }, "nodes" : { "QIiz4oEdTWCpLt6U8Yu05A" : { "name" : "node-1", "transport_address" : "192.168.60.90:9300", "attributes" : { } } }, "routing_table" : { "indices" : { "winlogbeat-2014.10.29" : { "shards" : { "1" : [ { "state" : "STARTED", "primary" : true, "node" : "QIiz4oEdTWCpLt6U8Yu05A", "relocating_node" : null, "shard" : 1, "index" : "winlogbeat-2014.10.29", "version" : 2, "allocation_id" : { "id" : "iugSdyWrTEC8yWqPqXD2Ng" } } ], "2" : [ { "state" : "STARTED", "primary" : true, "node" : "QIiz4oEdTWCpLt6U8Yu05A", "relocating_node" : null, "shard" : 2, "index" : "winlogbeat-2014.10.29", "version" : 2, "allocation_id" : { "id" : "ztTDXQLPQgG6gO71kza7zA" }

The contents were enormous, but it's basically a repetition of "state" : "STARTED" down to the allocation ID, with different indices.
JVM Heap size is at 2GB

From _cat/indices, everything is green except for several yellows, which are:
yellow open winlogbeat-2016.07.22 5 1 130217 0 82.6mb 82.6mb
yellow open topbeat-2016.07.21 5 1 345894 0 95.7mb 95.7mb
yellow open topbeat-2016.07.20 5 1 124225 0 34.4mb 34.4mb
yellow open topbeat-2016.07.22 5 1 121041 0 32.2mb 32.2mb
yellow open .kibana 1 1 3 0 16.7kb 16.7kb
These are among hundreds of greens.
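A quick way to pull just the non-green indices back out, in case it helps (the grep is only a convenience):

curl -s "http://192.168.60.90:9200/_cat/indices?v" | grep -v green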

cluster health has changed from red to yellow:
{ "cluster_name" : "eTech_cluster", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5091, "active_shards" : 5091, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 21, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 99.58920187793427 }