Shards Initializing Indefinitely?

Hello,

We are currently running version 2.4.0 in our production cluster. After node restarts I've noticed we have around 36 shards which seem to be stuck initializing.

We have been having problems with nodes crashing due to OOM errors. Over time (and many node restarts later -- that is a separate issue I'm trying to address) we have ended up with a bunch of initializing shards that never seem to finish. I'll update this with more info/logs when they become available.

I was wondering if there is anything I can do to address this issue other than deleting the index / restarting the node.
At this point the cluster is red. In this state, can that slow down overall performance? Could it slow down indexing because the cluster is also busy trying to initialize shards?
Is there a way to determine whether the shard has become corrupted and will simply never initialize?
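One thing I was considering (not sure if this is the right approach) is the shard stores API, which I believe should report a store exception for a copy that is corrupt. Something along these lines, with localhost:9200 standing in for one of our nodes:

curl -XGET 'http://localhost:9200/xyz-index/_shard_stores?status=all&pretty'
curl -XGET 'http://localhost:9200/_cat/shards/xyz-index?v'

The second call is just to confirm which copies are still showing as INITIALIZING.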

Running the _cat/recovery API tells me the following:
index shard time type stage source_host target_host repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog
xyz-index 0 596067535 store init 10.10.10.10 10.10.10.10 n/a n/a 0 0.0% 0 0.0% 0 0 0 -1.0% -1
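For reference, that output came from roughly the following call (localhost:9200 is a placeholder for one of our nodes); the second one is what I was planning to use to get more per-file detail on the stuck index:

curl -XGET 'http://localhost:9200/_cat/recovery?v'
curl -XGET 'http://localhost:9200/xyz-index/_recovery?detailed=true&pretty'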

Before this shard got stuck, I verified the index contained 57 docs and was only about 23 KB, so I would have expected it to initialize pretty quickly.

Any thoughts are greatly appreciated!


How many shards do you have? How many nodes?

We're running a 20-node cluster with ~48,600 shards.

{
"cluster_name" : "testxyz",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 20,
"active_primary_shards" : 48621,
"active_shards" : 48621,
"relocating_shards" : 0,
"initializing_shards" : 36,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 4,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 7893,
"active_shards_percent_as_number" : 99.92601270115297
}

I think you have too many shards, and possibly also too many indices. The fact that you do not seem to use any replica shards probably does not help either. I would recommend you read this blog post.


Please don't add replicas on this cluster though. You are already heavily oversharded, and you need to reduce that before you start adding replicas.


How much data do you have in the cluster?

We have approximately 17 TB of data in the cluster. Looking at a six-month sample of the index breakdown, about 90% of the indices are pretty small, anywhere from 50 KB up to 48 MB in size. We also have three groups of indices that make up most of the data, ranging from 11-14 GB each. The smaller indices are time-based and broken up by hour; the larger group is also time-based but broken up by day.
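(Doing the rough math on the numbers above: ~17 TB spread across ~48,600 primary shards works out to an average of roughly 0.35 GB per shard, which I gather is well below the multi-gigabyte shard sizes usually recommended.)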

It sounds like the recommendation is to decrease the number of shards by collapsing the smaller indices into larger ones. Do you have a recommendation for doing that with Elasticsearch 2.x? Would we just need to reindex the data? Any thoughts are appreciated!

While you can change settings for new indices getting created, you will indeed need to reindex older indices in order to reduce shard count.
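For the new-index side, an index template is the usual mechanism; a minimal sketch (the template name and index pattern here are just placeholders) would be:

curl -XPUT 'http://localhost:9200/_template/hourly_small' -d '
{
  "template": "metrics-*",
  "settings": {
    "index.number_of_shards": 1
  }
}'

Any index whose name matches the pattern then gets created with a single primary shard instead of the default.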

Unless you upgrade to 5.X (which you should totally do) and use the _shrink API :slight_smile:
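Roughly, the shrink flow on 5.x looks like this (index and node names are placeholders; the source index first has to be made read-only and have a copy of every shard on one node):

curl -XPUT 'http://localhost:9200/my_source_index/_settings' -d '
{
  "settings": {
    "index.routing.allocation.require._name": "shrink_node_name",
    "index.blocks.write": true
  }
}'

curl -XPOST 'http://localhost:9200/my_source_index/_shrink/my_target_index' -d '
{
  "settings": {
    "index.number_of_shards": 1
  }
}'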

Hey, thanks for all of the quick responses from everyone, very much appreciated! Let me know if you want me to open a new topic since this is a different question, but I was curious about something related to this cluster. Recently I updated the bootstrap.mlockall setting from false to true for half of the nodes (10). About 1-2 weeks later, I'm seeing that the nodes with that setting enabled are starting to use some swap space, and it's slowly climbing.

-- If the cluster is under a lot of stress due to too many indices/shards (as described above), is it possible this setting will be ignored, or do you think there is something else going on that would cause it to be ignored? If I understand correctly, enabling this setting should prevent the ES process memory from being swapped out, correct? I'm a little confused because I'm seeing something different. Again, I realize our cluster is not in an ideal state. Thanks again.
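Is checking the nodes API the right way to confirm the lock actually took effect? This is what I was planning to look at (localhost:9200 is a placeholder), since I believe _nodes/process reports the mlockall status per node:

curl -XGET 'http://localhost:9200/_nodes/process?pretty'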

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.