I get back '"status":"green",..."unassigned_shards":0'. If I shut down graylog-server (ps shows it's not running) - so that I know there's no data flowing into elasticsearch - curl still shows "green...0". So far, so good.
If I then restart elasticsearch, that always comes up '"status":"red",..."unassigned_shards":3000' - or a similar number in the thousands. If I keep polling with curl, I see that unassigned shard number come down until it hits zero, and then the status turns to "green".
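For anyone who wants to reproduce the polling described above without retyping curl, here is a minimal sketch. It assumes Elasticsearch answers on localhost:9200 (the default HTTP port); the function names and the `fetch` parameter are made up for illustration and are not part of any Elasticsearch client.

```python
import json
import time
import urllib.request

# Assumption: Elasticsearch listens on localhost:9200 (default HTTP port).
HEALTH_URL = "http://localhost:9200/_cluster/health"


def fetch_health(url=HEALTH_URL):
    """Return the parsed JSON body of GET /_cluster/health."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def poll_until_green(fetch=fetch_health, interval=5.0, max_polls=1000):
    """Poll cluster health until the status is green or max_polls is reached.

    Returns the list of (status, unassigned_shards) samples that were seen,
    so the countdown of unassigned shards can be inspected afterwards.
    """
    samples = []
    for _ in range(max_polls):
        health = fetch()
        samples.append((health["status"], health["unassigned_shards"]))
        if health["status"] == "green":
            break
        time.sleep(interval)
    return samples
```

Calling `poll_until_green()` right after restarting Elasticsearch should return samples that start at "red" with a large unassigned count and end at "green" with zero, if the behaviour matches what is described above.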
My question is, is that expected behaviour? I am uncomfortable with seeing the "green" change to "red" with a mere restart - it implies there's some form of inconsistency that can only be resolved/recognized through a restart.
Elasticsearch recovers those shards on startup, one by one. This means it runs some checks before making each shard available for search. 3000 shards on a single system is far too many, and checking them all simply takes time, hence the delay - but this is OK.
OK, so does that mean that internally ES has some kind of queue for unprocessed shards, and it just so happened that it's running behind what's coming in, but it's only the formal shutdown/restart that exposes things in this queue to us via the /_cluster/health command? Is there an API call that would expose this backlog? I currently monitor /_cluster/health (which obviously isn't showing this) but would rather have a better option that actually told me a backlog was forming. We are currently on one node and will expand into a cluster - but I'd still like to know when this backlog starts happening, as it tells you something is going to go wrong soon if it only increases over time.
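As a rough illustration of what /_cluster/health can already tell a monitor, the response carries shard counters besides the status: `active_shards`, `initializing_shards` and `unassigned_shards`. The sketch below adds up the shards that are not yet active. Whether that number corresponds to a "backlog" in the sense asked about here is an assumption, and `shard_backlog` is a hypothetical helper name, not an Elasticsearch API.

```python
def shard_backlog(health):
    """Summarise the shard counters in a parsed /_cluster/health response.

    'pending' counts shards that are not yet active (initializing plus
    unassigned). Treating this as a backlog figure is an assumption.
    """
    active = health.get("active_shards", 0)
    initializing = health.get("initializing_shards", 0)
    unassigned = health.get("unassigned_shards", 0)

    pending = initializing + unassigned
    total = active + pending
    # With no shards at all there is nothing left to recover.
    percent_active = 100.0 * active / total if total else 100.0

    return {"pending": pending, "total": total, "percent_active": percent_active}
```

Graphing `pending` over time would show whether it only appears around restarts or grows while the node is running normally.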
I don't think I'm explaining myself very well. Let's try this:
shut down graylog (which is the only input into ES)
GET /_cluster/health == green, unassigned_shards:0 (i.e. "I am a happy ES")
shut down ES
start up ES
GET /_cluster/health == red, unassigned_shards:3000 (i.e. "I am unhappy")
In my mind I can't understand how that can be expected behaviour. A formal restart shouldn't change state - and yet it's gone red. It feels like it's equivalent to unmounting a file system and yet it being classified as "dirty" every time you remount it, and doing a fsck. A clean umount should mean no fsck (hope that analogy works).
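For scripting the restart sequence above, the health endpoint accepts `wait_for_status` and `timeout` query parameters, so a script can block until the node reports green instead of polling in a loop. A minimal sketch, again assuming localhost:9200; the helper names are hypothetical. Some versions answer a timed-out wait with an HTTP error status, so the sketch treats any HTTP error as "not ready yet".

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def health_wait_url(base="http://localhost:9200", status="green", timeout_s=60):
    """Build a /_cluster/health URL that blocks until `status` or the timeout."""
    query = urllib.parse.urlencode(
        {"wait_for_status": status, "timeout": f"{timeout_s}s"}
    )
    return f"{base}/_cluster/health?{query}"


def wait_for_status(base="http://localhost:9200", status="green", timeout_s=60):
    """Return True if the cluster reached `status` within the server-side timeout."""
    url = health_wait_url(base, status, timeout_s)
    try:
        # Client-side timeout is a little longer than the server-side wait.
        with urllib.request.urlopen(url, timeout=timeout_s + 10) as resp:
            health = json.load(resp)
    except urllib.error.HTTPError:
        # Treat an HTTP error (for example, a timed-out wait) as not ready.
        return False
    # The response reports whether the wait itself timed out.
    return not health.get("timed_out", True)
```

Running `wait_for_status(timeout_s=600)` straight after "start up ES" would then return once the shards have been recovered, or False if ten minutes were not enough.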