Deleting too many indices at once turns the cluster red

I work on an on-premises ES 5.6 cluster, and we are working on migrating to ES 7.x.
Before that, we aim to fix our over-sharding problem: we generate too many indices (one per day when one per month would be enough), each with 5 shards (the default value)...
So we reindexed the bad daily indices into good monthly ones and then deleted the old ones.
Every time, a delete on my-index-...-* (about 400 indices behind this pattern) turns the cluster yellow (best case) or even red (worst case)...
I can imagine what is going on (too heavy a rebalance?), but is this behaviour avoidable?

It's hard to say without more info.
What do your Elasticsearch logs show when this happens?

This sounds like one of the side effects of having let the shard count get out of hand. I assume you have already fixed the sharding on the input side so you are now only generating monthly indices and are working through converting older indices. Is this correct?

If you are moving from ES 5.x to ES 7.x you will either have to reindex from remote or reindex in place while going via ES 6.8.x. If you are going the reindex-from-remote route and can spin up a second cluster, it might be worthwhile exploring whether you can start writing new data to both clusters in parallel and, at the same time, reindex the daily indices into monthly ones from remote. That way you would not need to delete old indices in the ES 5.x cluster, which would reduce the number of reallocations and cluster state updates.
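As a sketch of that route, a reindex-from-remote request that rolls one month of daily indices into a single monthly index could look like this (the host names and index pattern are hypothetical; the remote host also has to be whitelisted via `reindex.remote.whitelist` on the destination cluster):

```python
import json

# Hypothetical example: pull one month of daily indices from the old
# ES 5.x cluster into a single monthly index on the new ES 7.x cluster.
# "old-es5:9200" and "new-es7:9200" stand in for your real hosts.
reindex_body = {
    "source": {
        "remote": {"host": "http://old-es5:9200"},
        "index": "my-index-2020-07-*",      # all daily indices for July 2020
    },
    "dest": {"index": "my-index-2020-07"},  # one monthly index
}

# POST this body to _reindex on the ES 7.x cluster, e.g.:
#   curl -XPOST 'http://new-es7:9200/_reindex' \
#        -H 'Content-Type: application/json' -d @body.json
print(json.dumps(reindex_body, indent=2))
```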

I assume you have already fixed the sharding on the input side so you are now only generating monthly indices and are working through converting older indices. Is this correct?

That is correct, and thanks for your advice.

I will ask the ops team for more logs and get back to you.

At 11:25 we ran curl -i -XDELETE ...

Many entries like:

[2021-02-01T11:25:29,472][INFO ][o.e.c.m.MetaDataDeleteIndexService] [nodexxxx_master-adm_90] [my-index-2020-07-20/QGPFxEJfRjqlS_xkarBCeg] deleting index

and then:

[2021-02-01T11:26:00,916][WARN ][o.e.d.z.PublishClusterStateAction] [nodexxxx_master-adm_90] timed out waiting for all nodes to process published state [36397]

Many entries like:

[2021-02-01T11:31:31,860][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [nodexxxx-adm_90] failed to execute on node [u_ll-MJ1SA-aVkXkmscMCA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [nodexxxx_data_03][0.0.0.0:9303][cluster:monitor/nodes/stats[n]] request_id [559821693] timed out after [15023ms]

[2021-02-01T11:31:38,602][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [nodexxxx_master-adm_90] collector [index-recovery] timed out when collecting data

Many entries like:

[2021-02-01T11:32:24,167][WARN ][o.e.t.TransportService ] [nodexxxx_master-adm_90] Received response for a request that has timed out, sent [96919ms] ago, timed out [66917ms] ago, action [internal:discovery/zen/fd/ping], node [{nodexxxx_data_03}{u_ll-MJ1SA-aVkXkmscMCA}{O0gG1f-RR_ex9U3HpotRGA}{0.0.0.0}{0.0.0.0:9303}{rack_id=nodexxxx}], id [559820848]

[2021-02-01T11:32:26,476][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [nodexxxx_master-adm_90] [my-index-2020-11-08][2]: failed to list shard for shard_store on node [rCc5h51dRcGKUO9vD7tP0g]
org.elasticsearch.action.FailedNodeException: Failed node [rCc5h51dRcGKUO9vD7tP0g] at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:239) ~[elasticsearch-5.6.14.jar:5.6.14]

Caused by: org.elasticsearch.transport.RemoteTransportException: [nodexxx_data_04][0.0.0.0:9304][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[my-index-2020-11-08][2]] at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:114) ~[elasticsearch-5.6.14.jar:5.6.14]

Caused by: java.io.FileNotFoundException: no segments* file found in store(mmapfs(/var/opt/data/flat/elastic/files03/04/nodes/0/indices/v5KrU_ZQQi28oyNlxyxn3g/2/index)): files: [recovery.AXddJCueclgzgH391Ga1._1851.dii, recovery.AXddJCueclgzgH391Ga1._1851.fdx, recovery.AXddJCueclgzgH391Ga1._1851.fnm, recovery.AXddJCueclgzgH391Ga1._1851.nvm, recovery.AXddJCueclgzgH391Ga1._1851.si, recovery.AXddJCueclgzgH391Ga1._1851_1.liv, recovery.AXddJCueclgzgH391Ga1._1851_Lucene50_0.tip, recovery.AXddJCueclgzgH391Ga1._1851_Lucene54_0.dvm, recovery.AXddJCueclgzgH391Ga1._1b4p.cfe, recovery.AXddJCueclgzgH391Ga1._1b4p.si, recovery.AXddJCueclgzgH391Ga1._1hmg.dii, recovery.AXddJCueclgzgH391Ga1._1hmg.fdx, recovery.AXddJCueclgzgH391Ga1._1hmg.fnm, recovery.AXddJCueclgzgH391Ga1._1hmg.nvd, recovery.AXddJCueclgzgH391Ga1._1hmg.nvm, recovery.AXddJCueclgzgH391Ga1._1hmg.si, recovery.AXddJCueclgzgH391Ga1._1hmg_Lucene50_0.tip, recovery.AXddJCueclgzgH391Ga1._1hmg_Lucene54_0.dvm, recovery.AXddJCueclgzgH391Ga1._1i5t.cfe, recovery.AXddJCueclgzgH391Ga1._1i5t.si, recovery.AXddJCueclgzgH391Ga1._1jcq.cfe, recovery.AXddJCueclgzgH391Ga1._1jcq.si, recovery.AXddJCueclgzgH391Ga1._1kaw.cfe, recovery.AXddJCueclgzgH391Ga1._1kaw.si, recovery.AXddJCueclgzgH391Ga1._1lrs.cfe, recovery.AXddJCueclgzgH391Ga1._1lrs.si, recovery.AXddJCueclgzgH391Ga1._1lvv.cfe, recovery.AXddJCueclgzgH391Ga1._1lvv.si, recovery.AXddJCueclgzgH391Ga1._1m28.cfe, recovery.AXddJCueclgzgH391Ga1._1m28.cfs, recovery.AXddJCueclgzgH391Ga1._1m28.si, recovery.AXddJCueclgzgH391Ga1._1mb8.cfe, recovery.AXddJCueclgzgH391Ga1._1mb8.cfs, recovery.AXddJCueclgzgH391Ga1._1mb8.si, recovery.AXddJCueclgzgH391Ga1._1mea.cfe, recovery.AXddJCueclgzgH391Ga1._1mea.cfs, recovery.AXddJCueclgzgH391Ga1._1mea.si, recovery.AXddJCueclgzgH391Ga1._1mkv.cfe, recovery.AXddJCueclgzgH391Ga1._1mkv.si, recovery.AXddJCueclgzgH391Ga1._1mp1.cfe, recovery.AXddJCueclgzgH391Ga1._1mp1.cfs, recovery.AXddJCueclgzgH391Ga1._1mp1.si, recovery.AXddJCueclgzgH391Ga1._1msn.cfe, 
recovery.AXddJCueclgzgH391Ga1._1msn.cfs, recovery.AXddJCueclgzgH391Ga1._1msn.si, recovery.AXddJCueclgzgH391Ga1._1msw.cfe, recovery.AXddJCueclgzgH391Ga1._1msw.cfs, recovery.AXddJCueclgzgH391Ga1._1msw.si, recovery.AXddJCueclgzgH391Ga1._1mtz.cfe, recovery.AXddJCueclgzgH391Ga1._1mtz.cfs, recovery.AXddJCueclgzgH391Ga1._1mtz.si, recovery.AXddJCueclgzgH391Ga1._1muw.cfe, recovery.AXddJCueclgzgH391Ga1._1muw.cfs, recovery.AXddJCueclgzgH391Ga1._1muw.si, recovery.AXddJCueclgzgH391Ga1._1mv9.cfe, recovery.AXddJCueclgzgH391Ga1._1mv9.cfs, recovery.AXddJCueclgzgH391Ga1._1mv9.si, recovery.AXddJCueclgzgH391Ga1._1mvi.cfe, recovery.AXddJCueclgzgH391Ga1._1mvi.cfs, recovery.AXddJCueclgzgH391Ga1._1mvi.si, recovery.AXddJCueclgzgH391Ga1._1mvj.cfe, recovery.AXddJCueclgzgH391Ga1._1mvj.cfs, recovery.AXddJCueclgzgH391Ga1._1mvj.si, recovery.AXddJCueclgzgH391Ga1._1mvy.cfe, recovery.AXddJCueclgzgH391Ga1._1mvy.cfs, recovery.AXddJCueclgzgH391Ga1._1mvy.si, recovery.AXddJCueclgzgH391Ga1._1mvz.cfe, recovery.AXddJCueclgzgH391Ga1._1mvz.cfs, recovery.AXddJCueclgzgH391Ga1._1mvz.si, recovery.AXddJCueclgzgH391Ga1._1mw6.cfe, recovery.AXddJCueclgzgH391Ga1._1mw6.cfs, recovery.AXddJCueclgzgH391Ga1._1mw6.si, recovery.AXddJCueclgzgH391Ga1._1mw8.cfe, recovery.AXddJCueclgzgH391Ga1._1mw8.cfs, recovery.AXddJCueclgzgH391Ga1._1mw8.si, recovery.AXddJCueclgzgH391Ga1._1mw9.cfe, recovery.AXddJCueclgzgH391Ga1._1mw9.cfs, recovery.AXddJCueclgzgH391Ga1._1mw9.si, recovery.AXddJCueclgzgH391Ga1._1mwj.cfe, recovery.AXddJCueclgzgH391Ga1._1mwj.cfs, recovery.AXddJCueclgzgH391Ga1._1mwj.si, recovery.AXddJCueclgzgH391Ga1._1mwk.cfe, recovery.AXddJCueclgzgH391Ga1._1mwk.cfs, recovery.AXddJCueclgzgH391Ga1._1mwk.si, recovery.AXddJCueclgzgH391Ga1._1mwl.cfe, recovery.AXddJCueclgzgH391Ga1._1mwl.cfs, recovery.AXddJCueclgzgH391Ga1._1mwl.si, recovery.AXddJCueclgzgH391Ga1._1mwm.cfe, recovery.AXddJCueclgzgH391Ga1._1mwm.cfs, recovery.AXddJCueclgzgH391Ga1._1mwm.si, recovery.AXddJCueclgzgH391Ga1._1mwx.cfe, recovery.AXddJCueclgzgH391Ga1._1mwx.cfs, 
recovery.AXddJCueclgzgH391Ga1._1mwx.si, recovery.AXddJCueclgzgH391Ga1._1mwy.cfe, recovery.AXddJCueclgzgH391Ga1._1mwy.cfs, recovery.AXddJCueclgzgH391Ga1._1mwy.si, recovery.AXddJCueclgzgH391Ga1._eej.dii, recovery.AXddJCueclgzgH391Ga1._eej.fdx, recovery.AXddJCueclgzgH391Ga1._eej.fnm, recovery.AXddJCueclgzgH391Ga1._eej.nvm, recovery.AXddJCueclgzgH391Ga1._eej.si, recovery.AXddJCueclgzgH391Ga1._eej_Lucene50_0.tip, recovery.AXddJCueclgzgH391Ga1._eej_Lucene54_0.dvm, recovery.AXddJCueclgzgH391Ga1._qyr.dii, recovery.AXddJCueclgzgH391Ga1._qyr.fdx, recovery.AXddJCueclgzgH391Ga1._qyr.fnm, recovery.AXddJCueclgzgH391Ga1._qyr.nvm, recovery.AXddJCueclgzgH391Ga1._qyr.si, recovery.AXddJCueclgzgH391Ga1._qyr_Lucene50_0.tip, recovery.AXddJCueclgzgH391Ga1._qyr_Lucene54_0.dvm, recovery.AXddJCueclgzgH391Ga1._z18.dii, recovery.AXddJCueclgzgH391Ga1._z18.fdx, recovery.AXddJCueclgzgH391Ga1._z18.fnm, recovery.AXddJCueclgzgH391Ga1._z18.nvd, recovery.AXddJCueclgzgH391Ga1._z18.nvm, recovery.AXddJCueclgzgH391Ga1._z18.si, recovery.AXddJCueclgzgH391Ga1._z18_Lucene50_0.tip, recovery.AXddJCueclgzgH391Ga1._z18_Lucene54_0.dvm, recovery.AXddJCueclgzgH391Ga1.segments_4g, write.lock]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:687) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]

And finally:

[2021-02-01T11:33:01,377][INFO ][o.e.c.r.a.AllocationService] [nodexxxx_master-adm_90] Cluster health status changed from [YELLOW] to [RED] (reason: [{nodexxxx_data_03}{lfkgmXXCS2an-9vGMwJgMw}{yYYj3CmvTjWtlCIuh5xDqA}{0.0.0.0}{0.0.0.0.140:9303}{rack_id=nodexxxx} failed to ping, tried [3] times, each with maximum [30s] timeout

What is the output from the _cluster/stats?pretty&human API?

Here is the output from _cluster/stats:

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 4807
{
  "_nodes" : {
    "total" : 42,
    "successful" : 42,
    "failed" : 0
  },
  "cluster_name" : "##########",
  "timestamp" : 1612338624367,
  "status" : "green",
  "indices" : {
    "count" : 3073,
    "shards" : {
      "total" : 28923,
      "primaries" : 14445,
      "replication" : 1.0022845275181724,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 35,
          "avg" : 9.411975268467296
        },
        "primaries" : {
          "min" : 1,
          "max" : 6,
          "avg" : 4.700618288317605
        },
        "replication" : {
          "min" : 1.0,
          "max" : 34.0,
          "avg" : 1.010738691832086
        }
      }
    },
    "docs" : {
      "count" : 68461661389,
      "deleted" : 54085877
    },
    "store" : {
      "size" : "97.6tb",
      "size_in_bytes" : 107351103384485,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "56.1mb",
      "memory_size_in_bytes" : 58883064,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "35.3gb",
      "memory_size_in_bytes" : 38002057402,
      "total_count" : 902465343,
      "hit_count" : 136761051,
      "miss_count" : 765704292,
      "cache_size" : 596074,
      "cache_count" : 10200931,
      "evictions" : 9604857
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 436453,
      "memory" : "188.9gb",
      "memory_in_bytes" : 202936988938,
      "terms_memory" : "148gb",
      "terms_memory_in_bytes" : 158915678934,
      "stored_fields_memory" : "30.4gb",
      "stored_fields_memory_in_bytes" : 32734430952,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "586.6mb",
      "norms_memory_in_bytes" : 615113984,
      "points_memory" : "5.6gb",
      "points_memory_in_bytes" : 6102942496,
      "doc_values_memory" : "4.2gb",
      "doc_values_memory_in_bytes" : 4568822572,
      "index_writer_memory" : "62.9mb",
      "index_writer_memory_in_bytes" : 65983588,
      "version_map_memory" : "1.6mb",
      "version_map_memory_in_bytes" : 1701845,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 1612310410289,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 42,
      "data" : 35,
      "coordinating_only" : 4,
      "master" : 3,
      "ingest" : 0
    },
    "versions" : [
      "5.6.14"
    ],
    "os" : {
      "available_processors" : 1680,
      "allocated_processors" : 1344,
      "names" : [
        {
          "name" : "Linux",
          "count" : 42
        }
      ],
      "mem" : {
        "total" : "11.5tb",
        "total_in_bytes" : 12703406112768,
        "free" : "233.8gb",
        "free_in_bytes" : 251059400704,
        "used" : "11.3tb",
        "used_in_bytes" : 12452346712064,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 15
      },
      "open_file_descriptors" : {
        "min" : 1626,
        "max" : 3524,
        "avg" : 3126
      }
    },
    "jvm" : {
      "max_uptime" : "140.8d",
      "max_uptime_in_millis" : 12169334519,
      "versions" : [
        {
          "version" : "1.8.0_141",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.141-b16",
          "vm_vendor" : "Oracle Corporation",
          "count" : 42
        }
      ],
      "mem" : {
        "heap_used" : "844.4gb",
        "heap_used_in_bytes" : 906730397368,
        "heap_max" : "1.2tb",
        "heap_max_in_bytes" : 1343251021824
      },
      "threads" : 9945
    },
    "fs" : {
      "total" : "579.7tb",
      "total_in_bytes" : 637486114570240,
      "free" : "482.1tb",
      "free_in_bytes" : 530113496121344,
      "available" : "453tb",
      "available_in_bytes" : 498111363497984,
      "spins" : "true"
    },
    "plugins" : [
      {
        "name" : "search-guard-5",
        "version" : "5.6.14-19.2",
        "description" : "Provide access control related features for Elasticsearch 5",
        "classname" : "com.floragunn.searchguard.SearchGuardPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "x-pack",
        "version" : "5.6.14",
        "description" : "Elasticsearch Expanded Pack Plugin",
        "classname" : "org.elasticsearch.xpack.XPackPlugin",
        "has_native_controller" : true
      }
    ],
    "network_types" : {
      "transport_types" : {
        "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport" : 42
      },
      "http_types" : {
        "com.floragunn.searchguard.http.SearchGuardHttpServerTransport" : 42
      }
    }
  }
}

That's the cause: nearly 830 shards per node is way too many, and the average shard size is way too small as well.
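A quick back-of-the-envelope check of both claims from the stats above (the usual guidance is to keep shards in the tens-of-gigabytes range and the per-node shard count in the low hundreds):

```python
# Numbers taken from the _cluster/stats output pasted above.
total_shards = 28923
data_nodes = 35
store_bytes = 107351103384485  # "97.6tb"

shards_per_node = total_shards / data_nodes
avg_shard_gib = store_bytes / total_shards / 1024**3

print(f"{shards_per_node:.0f} shards per data node")  # ~826
print(f"{avg_shard_gib:.1f} GiB average shard size")  # ~3.5 GiB
```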

The short term solution is to do small batches of deletes, or add more nodes to the cluster to bring the per node shard count down.
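A minimal sketch of the batched approach (the endpoint and batch size are assumptions to adjust for your cluster); the point is to wait for the cluster to settle between deletes rather than drop 400 indices in one call:

```python
import time
import urllib.request

ES = "http://localhost:9200"  # hypothetical endpoint; adjust to your cluster

def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def delete_in_batches(indices, batch_size=20):
    """Delete indices a few at a time, waiting for the cluster to go
    green between batches instead of removing hundreds at once."""
    for batch in chunked(indices, batch_size):
        req = urllib.request.Request(ES + "/" + ",".join(batch),
                                     method="DELETE")
        urllib.request.urlopen(req)
        # Block until the resulting cluster state updates have been
        # processed before issuing the next batch of deletes.
        urllib.request.urlopen(
            ES + "/_cluster/health?wait_for_status=green&timeout=10m")
        time.sleep(1)
```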

You're heading in the right direction though!

We'll keep going then, thanks a lot.