Initializing_shards got stuck

Hi Team,

Our initializing_shards are stuck. If I delete those shards, what will happen? Will the cluster become green and work smoothly again?

The Kibana link works some of the time, but it also sometimes shows a 30000ms timeout error.

{
  "cluster_name" : "OConnectElasticSearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 20474,
  "active_shards" : 40948,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.98046684246509
}

41,000 shards on 6 nodes... that is around 6,800 shards per node.

You probably have too many shards per node.

May I suggest you look at the following resources about sizing:

Using Rally to Get Your Elasticsearch Cluster Size Right | Elastic Videos
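In the meantime, you can see how those shards are spread across your data nodes with something like the following (read-only, safe to run):

GET /_cat/allocation?v

The shard count and disk columns per node will show how heavily loaded each data node is.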


Hi David,

Greetings !!

Thanks for your reply. I understand, but as a workaround I need to fix this now, because it is a production cluster.

I am also working on an ELK upgrade, so very soon I will create a new PROD environment and implement your suggestion there. Here I need a fix ASAP since, as I said, this is production.

If I delete those 4 initializing shards, what will happen?

What is the output of:

GET /
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

It will stay RED, as shards will be missing.

Best guess: wait for the cluster to recover.
Or delete the indices which are RED, but you will be missing some data.
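If you go the delete route, you can first list which indices are RED and then delete them one by one. The host below is just an example, adjust it to your cluster, and only delete an index if you are sure you can live without its data:

curl -s 'localhost:9200/_cat/indices?v' | grep ' red '
curl -XDELETE 'localhost:9200/<index_name>'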

GET /
{
  "name" : "OConnectManagementNode",
  "cluster_name" : "OConnectElasticSearch",
  "version" : {
    "number" : "2.3.4",
    "build_hash" : "e455fd0c13dceca8dbbdbb1665d068ae55dabe3f",
    "build_timestamp" : "2016-06-30T11:24:31Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}

GET /_cat/indices?v

_cat/shards | grep INITIALIZING

vdc_b_e029-2020.04.12 5 p INITIALIZING OConnectDataNode04
vdc_b_err-2019.10.03 1 p INITIALIZING OConnectDataNode06
vdc_b_e029-2019.04.10 4 p INITIALIZING OConnectDataNode04
vdc_b_err-2020.09.18 1 p INITIALIZING OConnectDataNode04

Why would you have indices with fewer than 100 documents each, yet 6 primary shards and 6 replica shards? This is incredibly wasteful, but it explains why your cluster is so exceptionally oversharded. I would recommend you start addressing this ASAP (go to a single primary shard in your index templates, consolidate indices, and switch from daily to e.g. monthly indices where the size is small), as it is otherwise just going to complicate the migration.
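As a rough sketch of what that could look like on 2.x (the template name and pattern here are only placeholders, adjust them to your index naming scheme), every new matching index would then get one primary shard and one replica:

PUT /_template/single_shard_logs
{
  "template": "vdc_*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

Note that this only affects indices created after the template exists; existing indices keep their current shard count.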

True!! I am working on that. It will soon be addressed when we move to the new ELK stack 7.9.

In the meantime, what is the workaround to fix this?

If I don't need the data in those shards, shall I just delete them? Will the cluster be fine after that, or will it remain RED?

Note: the Kibana link is working, but not consistently. Most of the time I get a 30000ms timeout error.

I want to fix the initializing_shards and Kibana timeout issues, at least as a workaround :frowning:

I would not be surprised if the Kibana timeout error originated from querying too many small shards.

So you mean this timeout error is not because the initializing_shards got stuck, but because of too many small shards?

Correct me if I am wrong.

It could be either. Querying large numbers of shards can be slow. Did you see timeouts before you had unallocated primary shards?

If you want to allocate the missing primary shards as empty shards you can use the cluster reroute API, but be aware you will lose the data in those shards.
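On 2.x that would look something like the following (the index, shard number and node are taken from your _cat/shards output above purely as an example; allocating an empty primary means the data in that shard is gone for good):

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate": {
        "index": "vdc_b_err-2019.10.03",
        "shard": 1,
        "node": "OConnectDataNode06",
        "allow_primary": true
      }
    }
  ]
}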

I have deleted all 4 shards and the status turned green. But relocating_shards is showing 6 and it looks stuck :frowning:

GET _cluster/health?pretty
{
  "cluster_name" : "OConnectElasticSearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 20472,
  "active_shards" : 40944,
  "relocating_shards" : 6,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 526,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 549051,
  "active_shards_percent_as_number" : 100.0
}

You don't need to wait for an upgrade to address that problem.
From tomorrow, just start new indices with only one shard and one replica.

Also, you have very old indices in your cluster, like vdc_b_e063-2019.06.01, which is more than a year old.
Do you still need those indices?
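If not, they can simply be deleted, for example:

DELETE /vdc_b_e063-2019.06.01

Every daily index you remove frees up all of its primary and replica shards at once.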

I have deleted those indices, but now the relocating shards are stuck.

vdc_b_e061-2020.05.05         4 r RELOCATING      0     159b 0.46.xx.xx OConnectDataNode06 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05         3 r RELOCATING      0     160b 0.46.xx.xx OConnectDataNode04 -> 10.46.xx.xx  jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05         5 p RELOCATING      0     159b 0.46.xx.xx OConnectDataNode04 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05         1 r RELOCATING      0     160b 0.46.xx.xx OConnectDataNode04 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05         2 p RELOCATING      1   13.2kb 0.46.xx.xx OConnectDataNode03 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05
vdc_b_e061-2020.05.05         0 r RELOCATING      0     160b 0.46.xx.xx OConnectDataNode04 -> 10.46.xx.xx jmmTwCkxQ_mQbLLllcd_Cg OConnectDataNode05

If I fix this relocating shard issue then I am good. I have been working on this issue for the past 5 days; today it finally turned green, but with 6 relocating shards.

What is the full output of:

GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v
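You can also keep an eye on the ongoing relocations and the task backlog while you wait (both calls are read-only):

GET /_cat/recovery?v
GET /_cat/pending_tasks?v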

GET /_cat/nodes?v
host ip heap.percent ram.percent load node.role master name
10.46.xx.xx 10.46.xx.xx 14 58 0.27 d m OConnectDataNode05
10.46.xx.xx 10.46.xx.xx 30 59 0.54 d m OConnectDataNode01
10.46.xx.xx 10.46.xx.xx 48 55 0.96 - * OConnectManagementNode
10.46.xx.xx 10.46.xx.xx 13 57 0.43 d m OConnectDataNode02
10.46.xx.xx 10.46.xx.xx 15 57 0.44 d m OConnectDataNode03
10.46.xx.xx 10.46.xx.xx 61 57 0.20 - - OConnectClientNode02
10.46.xx.xx 10.46.xx.xx 23 57 0.63 d - OConnectDataNode04
10.46.xx.xx 10.46.xx.xx 53 0 -1.00 - - OConnectClientNode01
10.46.xx.xx 10.46.xx.xx 59 57 0.54 d m OConnectDataNode06

health?v

{
  "cluster_name" : "BPOConnectElasticSearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 20472,
  "active_shards" : 40944,
  "relocating_shards" : 6,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 2409,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 2561610,
  "active_shards_percent_as_number" : 100.0
}

GET /_cat/indices?v

I can still see a lot of data from 2019.

So I don't think you actually deleted them.

As per the client agreement we have to retain 2 years of logs. I only deleted the shards that were shown as unassigned, not everything from 2019.

How can I know that you did not follow the advice when you said you did?

Anyway, I deeply agree with @Christian_Dahlqvist's advice above.
