Our production cluster went yellow last night after our logstash index
rolled over to the next day's index. I've seen this happen before, but this
time I decided to diagnose it properly and get some feedback on what might
be going on.
I'm happy to keep this cluster in a yellow state for a limited time so that
people in this group can help diagnose it properly, and maybe help some
others who face the same issues. However, I will need to fix this one way
or another before the end of the business day today. My plan is to perform
a rolling restart to see if reinitializing the nodes fixes things; if not,
I'll remove the problematic logstash index and move on. I'd love
suggestions for less intrusive solutions: I don't like losing data, and
rolling restarts are tedious to babysit (they tend to take 45 minutes or
so).
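For reference, my rolling restart procedure is roughly the following, run
once per data node (the service command is just whatever your init system
uses):

# step 1: disable shard allocation so the cluster doesn't start
# rebalancing while the node is down
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.enable" : "none" }
}'

# step 2: restart elasticsearch on the node and wait for it to rejoin
sudo service elasticsearch restart

# step 3: re-enable allocation and wait for recovery to settle before
# moving on to the next data node
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.enable" : "all" }
}'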
Below is some information I've gathered. Let me know if you need me to
extract more data.
First the obvious:
{
  "status" : 200,
  "name" : "192.168.1.13",
  "cluster_name" : "linko_elasticsearch",
  "version" : {
    "number" : "1.4.0",
    "build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
    "build_timestamp" : "2014-11-05T14:26:12Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}
[linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "linko_elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 221,
  "active_shards" : 619,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 1
}
So we're yellow, and the reason is the initializing and unassigned shards.
We have five nodes, of which three are data nodes. It looks like we're
hitting some kind of resilience issue; the three machines have plenty of
disk space and memory.
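If it helps to narrow things down, I can also list exactly which shard
copies are not started:

curl -s 'localhost:9200/_cat/shards?v' | grep -v STARTED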
I found this in the log of one of our es nodes:
[2014-11-27 10:15:12,585][WARN ][cluster.action.shard ] [192.168.1.13]
[logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
[-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message
[RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from
[192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/192.168.1.14:9300]]
into
[192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}];
nested:
RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]];
nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution
failed]; nested:
RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]];
nested: NumberFormatException[For input string: "finished"]; ]]
On the mentioned node there's a corresponding message:
[2014-11-27 10:17:54,187][WARN ][cluster.action.shard ] [192.168.1.14]
[logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
[-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform
[indices:data/write/bulk[s]] on replica, message
[RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]];
nested: NumberFormatException[For input string: "finished"]; ]]
All three data nodes have similar messages appearing over and over again.
Our cluster has been up for a couple of weeks and seems pretty happy
otherwise. I deleted some older logstash indices a few days ago. The
cluster holds logstash data and a few smallish indices we use for our
inbot.io service. The issue appears to be related to the logstash index
rollover. Our app servers and Kibana talk to the two non-data (client)
nodes that run on our application servers.
My next stop was Kibana, which we run against the same cluster and the
logstash index that is probably causing us issues. Looking at that, I
noticed a few interesting things:
- logstash indexing seems to be fine (good), and it appears there has been
no data loss yet
- our CPU load jumped around midnight and stayed elevated on all three
data nodes; we measure this with collectd, and both mean and max load
jumped to around 1 at about the time the index rollover happened (see the
hot_threads command below if that's worth digging into)
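If the CPU jump turns out to be relevant, I can also capture what the data
nodes are actually busy with:

curl 'localhost:9200/_nodes/hot_threads'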
My next step was curl -XGET 'localhost:9200/_cat/recovery?v'
All the indices listed there looked fine. I'll spare you the output, but
everything appeared to be in the 'done' stage.
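I can also scope that recovery output to just the suspect index if the
full detail would help anyone:

curl -XGET 'localhost:9200/_cat/recovery/logstash-2014.11.27?v'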
Finally, I did
[linko@es3 elasticsearch]$ curl -XGET
'localhost:9200/_cluster/health/logstash-2014.11.27/?pretty'
{
  "cluster_name" : "linko_elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 1
}
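Shard-level detail on the same health call should pin down exactly which
shard numbers are stuck; happy to pull that if it's useful:

curl -XGET 'localhost:9200/_cluster/health/logstash-2014.11.27?level=shards&pretty'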
So that confirms last night's new logstash index is the problem: it has
trouble getting two of its shards past initializing, plus one unassigned
shard. With some Googling I also found a recently closed issue
(e.g. https://github.com/elasticsearch/elasticsearch/issues/6808) that
seems related to this for 1.3.x. However, I'm on 1.4, where this is
supposedly fixed. The NumberFormatException mentioned in the logs seems
suspicious for sure.
My questions are: what is going on, how do I fix it properly without data
loss or downtime, and how do I stop this from happening again? A rolling
restart might fix things (I've seen that happen before).
Maybe there is some magic curl incantation that nudges things along that
I've overlooked? I've dug around on Google, but all I seem to find is
advice along the lines of deleting indices, or trying this or that by
poking around with various APIs in the Elasticsearch internals. I'm
looking for something a bit more responsible and structured, and am a bit
wary of the 'shotgun' approach here. Yes, I know I can fix things by
deleting stuff or restarting stuff and doing a Hail Mary.

However, my main concern is that the need for this seems to be a regularly
recurring thing (I've done it a dozen times over the course of the last
year for different issues). What I really need is simple instructions I
can give to our people to fix things when things go bad: when X happens,
do this and you'll be fine. The last thing I want is ops people who get
into the habit of deleting indices or doing random rolling restarts out of
magical thinking. That sounds like something that could cause more issues
than it solves, and I know from experience that a botched rolling restart
is not nice to deal with.
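For what it's worth, the one API that keeps coming up in that kind of
advice is _cluster/reroute, i.e. explicitly telling the cluster where to
place the unassigned replica. A sketch of what that would look like here
(the shard number and target node are just examples, and I haven't
actually tried it):

# allow_primary stays false so this can never overwrite primary data
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "allocate" : {
      "index" : "logstash-2014.11.27",
      "shard" : 4,
      "node" : "es3.inbot.io",
      "allow_primary" : false
    }
  } ]
}'

Is that considered a responsible move here, or does it fall into the same
shotgun category?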