Docker container becomes unresponsive with Elastic Logging Plugin when Elastic server is not available

Sometimes a container or a series of containers becomes unresponsive when Elastic server is not available so we can't: restart/stop/kill these containers. Docker logs command is not available for these containers. Docker stats is not available for all containers.

Last time we were able to reproduce the issue when there was no route no server via network while containers were running.

So here's an error message:

    dockerd[64602]: time="2021-04-26T15:48:51+03:00" level=error msg="2021-04-26T12:48:51.645Z\tERROR\t[publisher_pipeline_output]\tpipeline/output.go:155\tFailed to connect to backoff(elasticsearch(http://...)): Get http://...: dial tcp ...: connect: no route to host

I can see that it tries to reconnect:

    dockerd[64602]: time="2021-04-26T15:48:16+03:00" level=error msg="2021-04-26T12:48:16.915Z\tINFO\t[publisher_pipeline_output]\tpipeline/output.go:146\tAttempting to reconnect to backoff(elasticsearch
...)) with 89 reconnect attempt(s)" plugin=6c0cba97835793e6fa2056e601bee296d996ab03c23cd0076b5baaa7278dcd53
    Apr 26 15:48:16 ... dockerd[64602]: time="2021-04-26T15:48:16+03:00" level=error msg="2021-04-26T12:48:16.915Z\tINFO\t[PipelineManager]\tpipeline/retry.go:221\tretryer: send unwait signal to consumer" plugin=6c0cba97835
    Apr 26 15:48:16 .... dockerd[64602]: time="2021-04-26T15:48:16+03:00" level=error msg="2021-04-26T12:48:16.915Z\tINFO\t[PipelineManager]\tpipeline/retry.go:225\t done" plugin=6c0cba97835793e6fa2056e601bee296d996ab03c23c

If i search logs for a container that's what i see:

    Apr 26 15:02:20 ... dockerd[64602]: time="2021-04-26T15:02:20.808393367+03:00" level=info msg="Configured log driver does not support reads, enabling local file cache for container logs" container=095f572f6fd0233154210ba9f39b4073f44ab101431aa1fa22f711a981d242ef driver="elastic/elastic-logging-plugin:7.8.1"
    Apr 26 15:02:20 ... containerd[977]: time="2021-04-26T15:02:20.836300244+03:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/095f572f6fd0233154210ba9f39b4073f44ab101431aa1fa22f711a981d242ef pid=41008

When i tried to stop a container:

    Apr 26 15:26:19 ... dockerd[64602]: time="2021-04-26T15:26:19.797138343+03:00" level=info msg="Container 095f572f6fd0233154210ba9f39b4073f44ab101431aa1fa22f711a981d242ef failed to exit within 10 seconds of signal 15 - u
sing the force"
    Apr 26 15:26:21 .... dockerd[64602]: time="2021-04-26T15:26:21.954812808+03:00" level=info msg="ignoring event" container=095f572f6fd0233154210ba9f39b4073f44ab101431aa1fa22f711a981d242ef module=libcontainerd namespace=mo
by topic=/tasks/delete type="*events.TaskDelete"
    Apr 26 15:26:21 ... containerd[977]: time="2021-04-26T15:26:21.955558303+03:00" level=info msg="shim disconnected" id=095f572f6fd0233154210ba9f39b4073f44ab101431aa1fa22f711a981d242ef

Docker version 20.10.1

I suspect that this is an internal issue with docker that's a tad hard to fix: if we don't continuously read from the container's logging interface, stdio calls will eventually block. About how long is ES unavailable for?

Container gets stuck immediatly. If we get ES up it doesn't unstuck.
There was a bug in fluentd plugin (Properly stop logger during (re)connect failure · akerouanton/fluent-logger-golang@8ae97ca · GitHub) that caused people exactly the same situation we have with Elastic logging plugin, maybe it has the same or similar issue?

I assume i can create a github ticket with this issue?