Shard stuck in INITIALIZING


#1

Hi guys,

I'm having an issue; I've tried multiple approaches and I'm unable to find a proper solution.

I thought it was something with unassigned shards:

curl -XGET 'http://183.*.*.200:9200/_cluster/health?pretty&level=indices'

  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 325,
  "active_shards" : 325,
  "relocating_shards" : 0,
  "initializing_shards" : 1,
  "unassigned_shards" : 326,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 49.84662576687116,

...

So I set index.number_of_replicas: 0 and that solved it.
Well, sort of. Then I found another issue:

  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 325,
  "active_shards" : 325,
  "relocating_shards" : 0,
  "initializing_shards" : 1,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.69325153374233,
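
For reference, here's a sketch of the replica change mentioned above (the host is a placeholder; adjust to your cluster). On a one-node cluster, replica shards can never be assigned, so setting replicas to 0 across all indices clears the unassigned-shard count:

```shell
# Drop replicas to 0 on all indices (ES 2.x indices settings API).
curl -XPUT 'http://localhost:9200/_settings' -d '{
  "index": { "number_of_replicas": 0 }
}'
```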

When I look at the shards, one seems to be stuck in INITIALIZING:

curl -XGET http://183.*.*.200:9200/_cat/shards 

tracking-2016.09.28 3 p INITIALIZING                 185.31.158.200 tracking 
tracking-2016.09.28 4 p STARTED      1575741 360.4mb 185.31.158.200 tracking 
tracking-2016.09.28 1 p STARTED      1577167 240.3mb 185.31.158.200 tracking 
tracking-2016.09.28 2 p STARTED      1575764   239mb 185.31.158.200 tracking 
tracking-2016.09.28 0 p STARTED                      185.31.158.200 tracking 
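
A quick way to surface only the problem shards in output like the above (host is a placeholder):

```shell
# Filter the shard listing down to anything not in STARTED state
# (INITIALIZING, UNASSIGNED, RELOCATING).
curl -s 'http://localhost:9200/_cat/shards' | grep -v STARTED
```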


curl -XGET 'http://183.*.*.200:9200/_cat/recovery?v'

index               shard time     type  stage    source_host    target_host    repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog 
tracking-2016.09.28 0     19       store done     185.31.158.200 185.31.158.200 n/a        n/a      0     0.0%          0     0.0%          0           0           0        100.0%           0              
tracking-2016.09.28 1     20       store done     185.31.158.200 185.31.158.200 n/a        n/a      0     0.0%          0     0.0%          0           0           0        100.0%           0              
tracking-2016.09.28 2     359751   store done     185.31.158.200 185.31.158.200 n/a        n/a      0     100.0%        0     100.0%        121         247805835   2978     100.0%           2978           
tracking-2016.09.28 3     89734989 store translog 185.31.158.200 185.31.158.200 n/a        n/a      0     100.0%        0     100.0%        109         259401405   0        -1.0%            -1             

I've taken a look at /storage/tracking/data/elasticsearch/nodes/0/indices/tracking-2016.09.28/3/translog and found a few translog files there:

-rw-r--r-- 1 nobody 4294967294      43 Sep 28 16:21 translog-10.tlog
-rw-r--r-- 1 nobody 4294967294      20 Sep 28 16:19 translog-8.ckp
-rw-r--r-- 1 nobody 4294967294 4514316 Sep 28 16:17 translog-8.tlog
-rw-r--r-- 1 nobody 4294967294      20 Sep 28 16:21 translog-9.ckp
-rw-r--r-- 1 nobody 4294967294    4124 Sep 28 16:20 translog-9.tlog
-rw-r--r-- 1 nobody 4294967294      20 Sep 29 18:43 translog.ckp

I've read that if I rename translog-9.ckp to translog.ckp it may resolve the stuck state, but nothing changed after restarting the Elasticsearch service.

Elasticsearch version: 2.3.4

Is there anyone able to guide me in the right direction?

TIA


(Mark Walkom) #2

What do your logs show?


#3

Well, in logstash.log (not logstash.err) I found a warning related to the problematic index and shard:

"create" => {
    "_index" => "tracking-2016.09.28", "_type" => "snapshots", "_id" => "AVdxYTvWa6yUWy4kpHrw", "status" => 404, "error" => {
        "type" => "engine_closed_exception", "reason" => "CurrentState[CLOSED] Closed", "shard" => "3", "index" => "tracking-2016.09.28", "caused_by" => {
            "type" => "out_of_memory_error", "reason" => "Java heap space"
        }
    }
}

out_of_memory_error ... Java heap space.

What do you suggest to recover the shard from its current INITIALIZING state, and what can be done to prevent this from happening in the future?
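
Since the root cause here is Java heap exhaustion, one mitigation is giving the node more heap. On ES 2.x this comes from the ES_HEAP_SIZE environment variable; a sketch, assuming a package install (the paths and the 4g value are assumptions, not from this thread):

```shell
# Set in /etc/default/elasticsearch (Debian/Ubuntu) or
# /etc/sysconfig/elasticsearch (RHEL/CentOS), then restart the service.
# A common guideline is ~50% of system RAM, kept below ~31 GB so the
# JVM can keep compressed object pointers enabled.
ES_HEAP_SIZE=4g
```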

TIA


(Mark Walkom) #4

You need to look in your ES logs; that is where the shard is, after all.


#5

Strangely, /var/log/elasticsearch/ is empty...
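
If the default log directory is empty, the node may be writing logs somewhere else. A sketch for finding the real location (hypothetical paths; the directory is whatever path.logs resolves to):

```shell
# Check the static config for an overridden log path...
grep 'path.logs' /etc/elasticsearch/elasticsearch.yml

# ...and ask the running node what settings it actually started with.
curl -s 'http://localhost:9200/_nodes/settings?pretty' | grep -i logs
```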

[edit]
Apparently, Elasticsearch was able to heal itself:

[2016-10-06 17:53:55,503][WARN ][index.translog           ] [tracking] [tracking-2016.09.28][3] deleted previously created, but not yet committed, next generation [translog-10.tlog]. This can happen due to a tragic exception when creating a new generation

[2016-10-06 17:54:00,760][INFO ][cluster.routing.allocation] [tracking] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[tracking-2016.09.28][3]] ...]).

Everything seems to be working fine: the shard state changed to STARTED, and the index status changed to GREEN.
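
For anyone hitting the same thing, the recovery can be double-checked like this (host is a placeholder):

```shell
# Confirm the cluster is green again...
curl -XGET 'http://localhost:9200/_cluster/health?pretty'

# ...and that every shard of the affected index is STARTED.
curl -XGET 'http://localhost:9200/_cat/shards/tracking-2016.09.28?v'
```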


(system) #6