ES 6.x / 7.x freezes on index deletion/creation for a long time ONLY with big amount of deprecation warnings

Hello everyone :wave: !

I'm getting a mysterious case on the CI.
ES freezes on an index deletion or creation for a long time, when I run application tests using ES 6.8.13 (same for 7.10.1) and the application is configured in such a way, that ES produces a big amount of deprecation warnings. It uses

  • {_all: {enabled: false}} index level setting, which causes warnings like [_all] is deprecated in 6.0+ and will be removed in 7.0. As a replacement, you can use [copy_to] on mapping fields to create your own catch all field.
  • and it uses big difference between min_gram & max_gram for the nGram tokenizer, without setting max_ngram_diff: 100 index level setting, which leads to Deprecated big difference between max_gram and min_gram in NGram Tokenizer,expected difference must be less than or equal to: [1]

If I remove the _all setting and add the max_ngram_diff setting then everything works fine.
If I suppress logging by setting logger.level=error then everything works fine.

But if there are deprecation warnings, then after some time, around 30+ minutes, ES cluster (this is CI, so single node cluster) starts giving timeouts on index create or index delete requests.
There are several lines like

2021-02-25T10:30:52,191][WARN ][r.suppressed             ] [R-M_ESE] path: /cucumber_company_representatives, params: {include_type_name=true, index=cucumber_company_representatives}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index [cucumber_company_representatives], cause [api]) within 30s


[2021-02-25T10:26:19,606][WARN ][r.suppressed             ] [qcsQvvf] path: /cucumber_%2A, params: {index=cucumber_*}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (delete-index [[cucumber_chewy_journal/OYkk5h4oT0uVmChTTyCGZg], [cucumber_chewy_specifications/J05Z0mNLQrikCJ6CXmjJmg], [cucumber_tasks/mpOSS_YAQbq6hs0Tn991vw]]) within 30s

in the log and then finally

[2021-02-25T10:31:34,083][WARN ][o.e.c.s.ClusterApplierService] [R-M_ESE] cluster state applier task [apply cluster state (from master [master {R-M_ESE}{R-M_ESEeTUy6GeUWzvaOxg}{0xVXi9TvRU6mW0I7yXDKqg}{}{}{ml.machine_memory=31616786432, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [3085] source [create-index [cucumber_company_representatives], cause [api]]])] took [8.7m] above the warn threshold of 30s


2021-02-25T10:39:27,959][WARN ][o.e.c.s.MasterService    ] [qcsQvvf] cluster state update task [shard-started StartedShardEntry{shardId [[cucumber_tasks][0]], allocationId [aXTu63OnRqexz5RnxMwaWA], primary term [1], message [after new shard recovery]}[StartedShardEntry{shardId [[cucumber_tasks][0]], allocationId [aXTu63OnRqexz5RnxMwaWA], primary term [1], message [after new shard recovery]}]] took [15.6m] above the warn threshold of 30s

8 or 15 minutes are far from maximum, I've seen 50+minutes.

When this is happening, cluster health is reported to be "green". Pending tasks consist of the only task to create/delete index, e.g. {"tasks"=>[{"insert_order"=>3095, "priority"=>"URGENT", "source"=>"create-index [cucumber_company_representatives], cause [api]", "executing"=>true, "time_in_queue_millis"=>511179, "time_in_queue"=>"8.5m"}]}. Nodes stats reports that memory usage, heap usage, disk usage are ok, there are no GC pauses and no any clues what's going on.

Please help to debug this mysterious case and find the root cause of the freezes.

For ES 7 set of features to generate deprecation warnings is a little bit different, but the overall tendency remains.
"Big" amount if deprecation warnings doesn't mean something really huge, full log with those warnings is usually around 1.5 Mb.

What do these tests do exactly?
How many indices and shards are in the cluster at that point, as well as how much data is it storing?
Is there anything in your hot threads?

Tests are just cucumber scenarios, i.e. feature tests for the application. Some of them utilize some ES functionality, some of them not, but each of them is surrounded by ES cleanup: a set of indexes got created before the scenario run and deleted right after. The set is 66 indexes, 66 shards, most of them empty, total docs count ~1250. The total size is about 200kb.
Thread pool shows nothing interesting, just with each minute of a task blocking the generic threads count increases by 1, i.e. if a task is blocking for 56m then it's "generic"=>{"threads"=>56, "queue"=>0, "active"=>55, "rejected"=>0, "largest"=>56, "completed"=>22275}.

What mappings are you using?

Sorry, didn't get the question, what do you mean?

The mappings for the indices, are they super large or complex?
What hardware does this node run on?

I'd say they're quite simple. Most of them just a text field, a keyword field and a date field, text field analyzer uses icu_folding plugin and edge_ngram tokenizer.

Node is a n1-standard-16 GCE instance.

That was an issue with the docker-compose: Logs stop consuming for container, blocking writes to stderr 路 Issue #6018 路 docker/compose 路 GitHub