Initializing_shards got stuck

Hi Team,

initializing_shards got stuck. If I delete those shards, what will happen? Will the cluster become green and work smoothly?

The Kibana link works some of the time, but it intermittently shows a 30000ms timeout error.

{
  "cluster_name" : "OConnectElasticSearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 20474,
  "active_shards" : 40948,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.98046684246509
}

Almost 41,000 shards on 6 data nodes... which means around 6,800 shards per node.

You probably have too many shards per node.

May I suggest you look at the following resource about sizing:

https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right
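In the meantime, a quick way to confirm the per-node shard count (and disk usage) is the cat allocation API:

GET /_cat/allocation?v

Each row shows one node with its shard count, so you can see immediately how the ~41k shards are spread across the 6 data nodes.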


Hi David,

Greetings !!

Thanks for your reply. I understand, but I need a workaround to fix this now, because it is a production cluster.

I am also working on an ELK upgrade, so very soon I will create a new PROD environment and implement your suggestion there. Here I need a fix ASAP, as I said it is in production.

If I delete those 4 initializing shards, what could happen?

What is the output of:

GET /
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

It will stay RED. As shards will be missing.

Best guess: wait for the cluster to recover.
Or delete the indices which are RED, but you will then be missing some data.
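A sketch of that second option: list the indices, find the rows whose health column is red, then delete the ones you can afford to lose (the index name below is just an example; deleting an index discards whatever data it held):

GET /_cat/indices?v

DELETE /index-name-that-is-red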

GET /
{
  "name" : "OConnectManagementNode",
  "cluster_name" : "OConnectElasticSearch",
  "version" : {
    "number" : "2.3.4",
    "build_hash" : "e455fd0c13dceca8dbbdbb1665d068ae55dabe3f",
    "build_timestamp" : "2016-06-30T11:24:31Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}

GET /_cat/indices?v

GET /_cat/shards | grep INITIALIZING

vdc_b_e029-2020.04.12 5 p INITIALIZING OConnectDataNode04
vdc_b_err-2019.10.03  1 p INITIALIZING OConnectDataNode06
vdc_b_e029-2019.04.10 4 p INITIALIZING OConnectDataNode04
vdc_b_err-2020.09.18  1 p INITIALIZING OConnectDataNode04

Why would you have indices with fewer than 100 documents each, yet 6 primary shards and 6 replica shards? This is incredibly wasteful and explains why your cluster is so exceptionally oversharded. I would recommend you start addressing this ASAP (go to a single primary shard in your index templates, consolidate indices, switch from daily to e.g. monthly indices where the size is small), as it is otherwise just going to complicate the migration.
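As a sketch of the first step, a template along these lines would make every newly created matching index use a single primary shard (the template name and index pattern are hypothetical; note that on 2.x the pattern goes in the `template` field, which later versions renamed to `index_patterns`):

PUT /_template/single_shard_logs
{
  "template" : "vdc_b_*",
  "order" : 1,
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 1
  }
}

Existing indices are not affected by a template; those would need to be consolidated or reindexed separately.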

True!! I am working on that. It will soon be addressed when we move to the new ELK stack 7.9.

For now, what is the workaround to fix this?

If I don't need the data in those shards, shall I delete them? After that will the cluster be fine, or will it remain RED?

Note: the Kibana link works, but not consistently. Most of the time I get the 30000ms timeout error.

I want to fix the initializing_shards and Kibana timeout issues, at least as a workaround :frowning:

I would not be surprised if the Kibana timeout error originated from querying too many small shards.

So you mean this timeout error is not because of the stuck initializing_shards, but because of too many small shards?

Correct me if I am wrong.

It could be either. Querying large numbers of shards can be slow. Did you see timeouts before you had unallocated primary shards?

If you want to allocate the missing primary shards as empty shards you can use the cluster reroute API, but be aware you will lose the data in those shards.
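On 2.x that looks roughly like the following (the index, shard number, and node are taken from the INITIALIZING listing earlier in the thread; `allow_primary: true` is what forces an empty primary to be allocated, and whatever data that shard copy held is gone for good; later versions replaced this with an explicit `allocate_empty_primary` command):

POST /_cluster/reroute
{
  "commands" : [
    {
      "allocate" : {
        "index" : "vdc_b_err-2019.10.03",
        "shard" : 1,
        "node" : "OConnectDataNode06",
        "allow_primary" : true
      }
    }
  ]
}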

I have deleted all 4 of those shards' indices and the status turned green. But relocating_shards now shows 6 and it looks stuck :frowning:

GET /_cluster/health?pretty
{
  "cluster_name" : "OConnectElasticSearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 20472,
  "active_shards" : 40944,
  "relocating_shards" : 6,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 526,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 549051,
  "active_shards_percent_as_number" : 100.0
}

You don't need to wait for an upgrade to address that problem.
From tomorrow, just start new indices with only one shard and one replica.

Also, you have very old indices in your cluster, like vdc_b_e063-2019.06.01, which is more than one year old.
Do you still need those indices?

I have deleted those indices, but now the relocating shards seem stuck.

vdc_b_e061-2020.05.05 4 r RELOCATING 0   159b 10.46.xx.xx OConnectDataNode06 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05 3 r RELOCATING 0   160b 10.46.xx.xx OConnectDataNode04 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05 5 p RELOCATING 0   159b 10.46.xx.xx OConnectDataNode04 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05 1 r RELOCATING 0   160b 10.46.xx.xx OConnectDataNode04 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05 2 p RELOCATING 1 13.2kb 10.46.xx.xx OConnectDataNode03 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05 0 r RELOCATING 0   160b 10.46.xx.xx OConnectDataNode04 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05

If I can fix this relocating-shard issue then I am good. I have been working on this for the past 5 days; today the cluster finally became green, but with 6 relocating shards.

What is the full output of:

GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v
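Also, to see whether the relocations are actually making progress rather than being stuck, and how large the pending task backlog is, these can help:

GET /_cat/recovery?v

GET /_cat/pending_tasks?v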

GET /_cat/nodes?v
host ip heap.percent ram.percent load node.role master name
10.46.xx.xx 10.46.xx.xx 14 58 0.27 d m OConnectDataNode05
10.46.xx.xx 10.46.xx.xx 30 59 0.54 d m OConnectDataNode01
10.46.xx.xx 10.46.xx.xx 48 55 0.96 - * OConnectManagementNode
10.46.xx.xx 10.46.xx.xx 13 57 0.43 d m OConnectDataNode02
10.46.xx.xx 10.46.xx.xx 15 57 0.44 d m OConnectDataNode03
10.46.xx.xx 10.46.xx.xx 61 57 0.20 - - OConnectClientNode02
10.46.xx.xx 10.46.xx.xx 23 57 0.63 d - OConnectDataNode04
10.46.xx.xx 10.46.xx.xx 53 0 -1.00 - - OConnectClientNode01
10.46.xx.xx 10.46.xx.xx 59 57 0.54 d m OConnectDataNode06

GET /_cluster/health?pretty

{
  "cluster_name" : "BPOConnectElasticSearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 20472,
  "active_shards" : 40944,
  "relocating_shards" : 6,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 2409,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 2561610,
  "active_shards_percent_as_number" : 100.0
}

GET /_cat/indices?v

I can still see a lot of data from 2019.

So I don't think you did.

As per the client agreement we must retain 2 years of logs. I only deleted the indices whose shards were shown as unassigned, not all of the 2019 data.

How could I know that you did not follow the advice, although you said you did?

Anyway, I deeply agree with @Christian_Dahlqvist's advice.
