Nearly All Indices UNASSIGNED and All Are Red


(Todd L.) #1

I am trying to repair an Elasticsearch / Logstash / Kibana (ELK) installation implemented by a former coworker. The system seemed to be running along fine until it ran out of space (several times now). I have successfully re-worked various .yml and/or config files to the point where I can now use curl from another system to take action. I have successfully deleted the oldest date of the logstash indices. So, I'm confident I can take action on the installation, if only I knew what I was doing! :slight_smile:

So...
curl 'http://hostname:9200/_cat/shards?pretty' will list 500+ lines of shards (indices?). All but four of them are "UNASSIGNED", and the other four are INITIALIZING, and never change.
curl 'http://hostname:9200/_cat/indices?pretty' lists every entry as "red open logstash-yyyy.mm.dd" and then two numbers after that.

I'm too new at this to know what's going on, or the correct action to take. As I said, I successfully deleted one of the logstash dates... good or bad, I did it.

The Kibana browser page shows status red, and gives some stats, and indicates the indexes are initializing, try again in 2.5 seconds.
What can I do to get this thing back to functioning, again?
Would running out of disk space (and I mean, zero bytes free) cause the symptoms I am seeing?

I asked all of the above on a different forum and got one reply, indicating I should come here, instead, and that a full disk might cause these problems, and that I might need to delete some more indices. I am a complete novice at this, and wouldn't know how to determine which indices might need to be deleted (which ones should I target to maintain as much data as I can?).

Any help at all would be appreciated. Here is a _cluster/health from my installation, so you can see how bad it really is.
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 4,
"unassigned_shards" : 526,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 4,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 30,
"active_shards_percent_as_number" : 0.0
}


(Mark Walkom) #2

Check the ES logs; they should state whether there are problems assigning the shards. You may just need to delete the indices that aren't being assigned to clear it up.

Just use _cat/indices to see which ones are red.
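
For anyone following along, here is a minimal sketch of picking the red indices out of _cat/indices output. It assumes the text has already been fetched (for example with curl) and that each line starts with health, status, and index name, as in the output described in this thread. The red sample lines and the function name red_indices are made up for illustration.

```python
def red_indices(cat_indices_text):
    """Return the index names whose health column is 'red'."""
    names = []
    for line in cat_indices_text.splitlines():
        fields = line.split()
        # Expect at least: health, status, index name.
        if len(fields) >= 3 and fields[0] == "red":
            names.append(fields[2])
    return names


# Hypothetical sample; red lines may lack the doc count and size columns.
sample = """\
red open logstash-2016.01.09 5 1
green open logstash-2015.12.10 5 0 55336255 0 35.7gb 35.7gb
red open logstash-2016.01.10 5 1
"""

for name in red_indices(sample):
    print(name)
```

This only lists candidates. Whether a red index should be deleted or can still recover is something the logs have to answer.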


(Christian Dahlqvist) #3

Which Elasticsearch version are you using? How much heap do you have assigned?


(Todd L.) #4

Mark,
I can't view the logs, yet. I don't have a utility installed able to open the 16, 23, 31, and 33 GB log files from the last few days (some serious error logging going on there!)

Christian,
I am using ES 2.0.0 (as far as I can tell, as that is what is listed on the service). I haven't figured out how to determine how much heap I have assigned yet; I expect the guy who implemented this used whatever the default is. I'll try to find that.

Status update.
After finally being able to issue curl commands remotely (found an article describing what to put in the config file to get that to happen), I was able to start investigating. I found articles describing what you mention, Mark, and others. A "pending task" query let me see some tasks where certain logstash indices were getting errors consistently, and those happened to be the days when we ran out of space. I deleted those indices and other primary logstash shards started becoming active, allowing my cluster to go from red to yellow. As I only have the one node, and as I found an article that indicated replica shards in a cluster with only one node were useless and the cause of the "yellow" indices, I set the number_of_replicas value on all my indices to zero. Green status! I found where to set the default number_of_replicas for new indices, and figured I was ready for the next day.
However, I had already deleted the current day's index (this was 1/13), as it was empty (no data being used) in Windows Explorer, as were the previous two days' folders. I deleted them all. I still don't have any logstash folders for anything beyond 1/8. All of those dates were either corrupted by running out of space, or were empty, so I got rid of them.
My current status is... no new logstash folders are being created. (By the way, I did restart all the Windows services yesterday.) And, another weird thing... the logstash folders created since the first have a mixture of year indicators. On the first of January, I have two logstash folders for that day, one indicating 2016, one indicating 2015, and both have data. After that, through the 8th, I only had "2015" folders created. Starting the 9th, I had both "2015" and "2016" folders created, but only the 2015 folder had data; the other was roughly empty (I think it listed 2.5 K worth of start-up files, lock files, etc.). Starting on the 11th, only 2016 folders were getting created. I have gotten rid of everything beyond 1/8, as it was either empty, or corrupted.

tl;dr
My current issue is... no new logstash files are being created... and my Kibana page still shows an error, Status Red, and the plugin:elasticsearch shows "Unable to connect to Elasticsearch at http://localhost:9200. Retrying in 2.5 seconds."
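
The replica change described in this post can be sketched roughly as below. This is an illustration, not the exact command that was run: it assumes the index settings update endpoint (PUT to /_all/_settings) and the placeholder host http://hostname:9200 used earlier in the thread. The helper name replicas_request is hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def replicas_request(base_url, replicas=0, index_pattern="_all"):
    """Build (but do not send) a PUT request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# "http://hostname:9200" is the placeholder host from this thread.
req = replicas_request("http://hostname:9200")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually apply the change: urllib.request.urlopen(req)
```

On a single-node cluster a replica can never be assigned, because it may not live on the same node as its primary, which is why zero replicas turns yellow indices green.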


(Todd L.) #5

OK, now... I found a log file small enough to read. There are a ton of warnings and info messages indicating high disk watermark, and rerouting shards.

So, if the high watermark is hit (90%), it appears "shards will be relocated away from this node". I feel like saying "well, that's yer problem, right there, son!" but in a one-node setup, just where does that relocation go... the bit bucket? And if that's the case, why/how did it ever get to the point where it would fill up the disk? And is getting disk utilization below 90% my only option for getting new logstash indices created?

I also feel like I might be missing something in the implementation left to me: a clean-up process that takes the info gathered on previous days and somehow makes it smaller, while still usable. Does something like that exist? How might I tell if it is already installed? If it does exist, can you tell me what it's called (point me to the installation info)?


(Mark Walkom) #6

You probably need to remove some older indices, use Elasticsearch Curator for that.

You can also increase the watermarks - https://www.elastic.co/guide/en/elasticsearch/reference/2.1/disk-allocator.html
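
To make the watermark behaviour concrete, here is a small sketch of the comparison, assuming the default thresholds mentioned in this thread (85% low, 90% high). The function name disk_status and the disk sizes in the example are made up for illustration.

```python
def disk_status(free_gb, total_gb, low_pct=85.0, high_pct=90.0):
    """Classify disk usage against the low and high watermarks."""
    used_pct = 100.0 * (1.0 - free_gb / total_gb)
    if used_pct >= high_pct:
        state = "high watermark exceeded"
    elif used_pct >= low_pct:
        state = "low watermark exceeded"
    else:
        state = "below both watermarks"
    return round(used_pct, 1), state


# Hypothetical 2000 GB disk at three different fill levels.
for free in (100, 250, 400):
    print(free, disk_status(free, 2000))
```

Past the low watermark the node stops receiving newly allocated shards; past the high watermark Elasticsearch tries to move shards off the node, which has nowhere to go on a single-node cluster. The thresholds themselves are the cluster.routing.allocation.disk.watermark.low and cluster.routing.allocation.disk.watermark.high settings covered in the linked page.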


(Todd L.) #7

Thanks, Mark.
I found a folder with old, archived data from a previous log collection utility. I have moved all of that data and got the disk utilization down below 85%, but no new indices are being created. I have restarted all services related to ELK after I have cleared off disk space, yet no new indices or index folders get created.

This may not be relevant anymore now that I have diskspace utilization down below 85%, but it would be helpful info to have. Before I install and use Curator, I'd like to know a little more about selecting and deleting potential problem indices.

  1. Is there a way to identify the indices causing me a problem? Right now, all of them are reporting green, so I am not sure which one(s) is causing a problem.
  2. Is there a proper procedure for deleting indices? For now, I have just been using the -XDELETE commands.

My goal is to keep as much of the data we have collected so far, so I am very curious how to identify the indices that need deleting. Now that disk utilization is below the watermark, shouldn't the indices start creating again? I did see in the log that the re-routing had ended (or something to that effect) but nothing new was showing up.
Thanks much for your help!


(Mark Walkom) #8

Define problem, because if it's disk problems then just delete the old ones.


(Todd L.) #9

My current problem is... no new indices are getting created. I have 20% disk free, now, so that's not the issue anymore. I can't figure out why, or where to look, to get my indices creating, again.


(Mark Walkom) #10

What do the logs show?


(Jelmer Kuperus) #11

what is the output of http://elastichost:9200/_cluster/settings ?

anything in there for key cluster.routing.allocation.enable ?
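
A minimal sketch of that check, assuming the _cluster/settings response has already been parsed from JSON into a dict. The key may show up either flat or nested depending on how the response is rendered, so the sketch handles both. The sample response and the function name allocation_enable are made up for illustration.

```python
def allocation_enable(settings):
    """Find cluster.routing.allocation.enable in a _cluster/settings response."""
    found = {}
    for scope in ("persistent", "transient"):
        node = settings.get(scope, {})
        # Flat form: {"cluster.routing.allocation.enable": "none"}
        value = node.get("cluster.routing.allocation.enable")
        if value is None:
            # Nested form: {"cluster": {"routing": {"allocation": {"enable": ...}}}}
            for key in ("cluster", "routing", "allocation", "enable"):
                node = node.get(key) if isinstance(node, dict) else None
                if node is None:
                    break
            value = node
        if value is not None:
            found[scope] = value
    return found


# Hypothetical response where allocation was switched off and never re-enabled.
sample = {
    "persistent": {},
    "transient": {"cluster": {"routing": {"allocation": {"enable": "none"}}}},
}
print(allocation_enable(sample))
```

An empty result means the setting is not overridden, so the default (all) applies. A value of none would stop shards from being allocated.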


(Okan) #12

Hey Todd, did you try deleting the translog? If you delete the translogs you will probably solve the problem and your indices will turn yellow again, but you need to take a backup first, of course.


(Mark Walkom) #13

No! Don't do this!


(Todd L.) #14

Mark. No translogs have been deleted (at least not by a human, to my knowledge). Thanks for stopping me from doing that. I would have had to figure out how to do it, anyway. :wink:
As for what do the logs say.... nothing, since 1/14, when I last restarted all the services. Here is an excerpt...
[2016-01-14 11:36:41,628][INFO ][cluster.routing.allocation.decider] [Captain Universe] low disk watermark [85%] exceeded on [SYXHSEqLTPeElwJ0thmVjg][Captain Universe][G:\ELK_Stack\elasticsearch-2.0.0\data\elasticsearch\nodes\0] free: 291.5gb[14.9%], replicas will not be assigned to this node
[2016-01-14 11:37:12,182][INFO ][cluster.routing.allocation.decider] [Captain Universe] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2016-01-14 14:40:13,044][INFO ][node ] [Captain Universe] stopping ...

then a bunch of DEBUG entries for Captain Universe "failed to execute" and "no such index" messages for index folders that still do exist for the Captain Universe shutdown... pretty much all of them, but I didn't do a full check/inventory of that.

(I hope that was relevant info)

Then...
[2016-01-14 14:40:14,122][WARN ][transport ] [Captain Universe] Transport response handler not found of id [6250]
[2016-01-14 14:40:14,325][INFO ][node ] [Captain Universe] stopped
[2016-01-14 14:40:14,325][INFO ][node ] [Captain Universe] closing ...
[2016-01-14 14:40:14,325][INFO ][node ] [Captain Universe] closed
[2016-01-14 14:40:16,078][INFO ][node ] [Vibraxas] version[2.0.0], pid[7892], build[de54438/2015-10-22T08:09:48Z]
[2016-01-14 14:40:16,078][INFO ][node ] [Vibraxas] initializing ...
[2016-01-14 14:40:16,188][INFO ][plugins ] [Vibraxas] loaded [], sites []
[2016-01-14 14:40:16,344][INFO ][env ] [Vibraxas] using [1] data paths, mounts [[EVA2_1950GB (G:)]], net usable_space [324.8gb], net total_space [1.9tb], spins? [unknown], types [NTFS]
[2016-01-14 14:40:19,953][INFO ][node ] [Vibraxas] initialized
[2016-01-14 14:40:19,953][INFO ][node ] [Vibraxas] starting ...
[2016-01-14 14:40:20,219][INFO ][transport ] [Vibraxas] publish_address {10.48.32.123:9300}, bound_addresses {10.48.32.123:9300}
[2016-01-14 14:40:20,219][INFO ][discovery ] [Vibraxas] elasticsearch/5MWU_9vSSUqjQ6Fgpis_OQ
[2016-01-14 14:40:24,313][INFO ][cluster.service ] [Vibraxas] new_master {Vibraxas}{5MWU_9vSSUqjQ6Fgpis_OQ}{10.48.32.123}{10.48.32.123:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-01-14 14:40:24,688][INFO ][http ] [Vibraxas] publish_address {10.48.32.123:9200}, bound_addresses {10.48.32.123:9200}
[2016-01-14 14:40:24,688][INFO ][node ] [Vibraxas] started
[2016-01-14 14:40:26,016][INFO ][gateway ] [Vibraxas] recovered [44] indices into cluster_state

and nothing after that.


(Mark Walkom) #15

What do _cat/state and _cat/indices show?


(Todd L.) #16

_cat/state results in error, no feature for name [state], status 400. I tried stats, _stats, and _state, as well, but all resulted in some form of error.
_cat/indices lists 44 lines of logstash file names, all green and open. I also happen to have 44 logstash folders on the disk.


(Mark Walkom) #17

Err, dunno what I was thinking with _cat/state, maybe _cat/allocation?

Can we actually see the output?


(Todd L.) #18

_cat/allocation returns the following (IP address obfuscated).
220 1.5tb 324.8gb 1.9tb 83 AA.BB.CC.DD AA.BB.CC.DD Vibraxas

Did you want to see the output of the _cat/indices, too? Here's an excerpt. Note: all dates in January are supposed to be 2016.
green open logstash-2015.12.10 5 0 55336255 0 35.7gb 35.7gb
green open logstash-2015.12.17 5 0 55317850 0 35.8gb 35.8gb
green open logstash-2015.01.01 5 0 33542704 0 21gb 21gb
green open logstash-2015.12.06 5 0 43955294 0 27.6gb 27.6gb
green open logstash-2015.12.16 5 0 55482842 0 35.9gb 35.9gb
green open logstash-2015.12.26 5 0 43452017 0 27.3gb 27.3gb
green open logstash-2015.11.29 5 0 43266795 0 27.1gb 27.1gb


(Todd L.) #19

@warkolm, do you have any other advice or direction for me?


(Mark Walkom) #20

I am not seeing anything red in that output?