Consequences of violating 'Blocking operation' assertion

In short: what are the risks of ignoring the 'Blocking operation' assert?

The Elasticsearch code contains some checks that verify an operation is not executed on a transport thread, for instance in the BaseFuture class. These checks only fire when asserts are enabled (which is the case when using ESIntegTestCase, for example). Typically, the symptom looks like this: https://github.com/elastic/elasticsearch/issues/17865
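
For readers who have not seen such a check, here is a minimal sketch of the idea in plain Java. It is not the actual BaseFuture code: the class name `GuardedFuture`, the method `blockingAllowed` and the thread-name marker `net_worker` are all made up for illustration.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for a future that refuses blocking reads on network threads.
class GuardedFuture<T> {
    // Assumed naming convention for network threads; not the real Elasticsearch names.
    static final String NETWORK_THREAD_MARKER = "net_worker";

    private final CompletableFuture<T> delegate = new CompletableFuture<>();

    // Blocking is allowed on any thread whose name does not carry the marker.
    static boolean blockingAllowed(Thread thread) {
        return !thread.getName().contains(NETWORK_THREAD_MARKER);
    }

    void complete(T value) {
        delegate.complete(value);
    }

    T blockingGet() {
        // Evaluated only when the JVM runs with -ea; skipped entirely otherwise.
        assert blockingAllowed(Thread.currentThread()) : "Blocking operation";
        return delegate.join(); // waits until complete() has been called
    }
}
```

With assertions disabled, the `assert` line costs nothing and the blocking call silently goes through, which is why the problem tends to show up only in test runs.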

I have seen some third-party plugins run into this when making blocking calls in a REST handler. For example, a plugin uses client() to query the cluster state or index documents right in the REST action class (i.e. in a class inheriting from BaseRestHandler). Since this is "just" an assert, nothing forces plugin authors to fix the issue unless they want to implement integration tests (and do not want to run them with -da).

My understanding is that this assert tells you that you are consuming resources from the generic thread pool, which is unbounded (at least in ES 2.x). If you run a blocking operation in this context, you risk creating far too many threads, and nothing will stop you except a shortage of hardware resources, which is what you really do not want to happen.

Is my understanding correct? Are there any other risks? And finally, why is this an assert and not an Exception?

Anyone?

If you run a blocking operation on a networking thread, that networking thread is tied up until the blocking operation returns. It's bad to block these networking threads since they are needed to handle responses and requests. Even worse: if all the networking threads are tied up waiting on blocking calls to complete, and those blocking calls are waiting on responses from other nodes, then there are no networking threads left to handle the responses, so the server is deadlocked. :scream:
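
The starvation described above can be reproduced with nothing but `java.util.concurrent`. This is a rough sketch, not Elasticsearch code: a fixed pool of one thread stands in for the networking threads (one thread keeps the outcome deterministic), and the names `StarvationDemo`, `blockingCallStarves` and `callbackStyle` are invented for the example. A timeout is used so the demo reports the stall instead of hanging forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class StarvationDemo {
    // Returns true if the handler timed out waiting for a response that
    // could only have been processed by the thread it was occupying.
    static boolean blockingCallStarves() {
        ExecutorService net = Executors.newFixedThreadPool(1);
        try {
            Future<Boolean> handler = net.submit(() -> {
                // The "response" needs a pool thread, but the only one is busy right here.
                Future<String> response = net.submit(() -> "response");
                try {
                    response.get(200, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true; // without the timeout this would wait forever
                }
            });
            return handler.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            net.shutdownNow();
        }
    }

    // Same pool, but the handler registers a continuation and returns,
    // which frees the thread to process the response.
    static String callbackStyle() {
        ExecutorService net = Executors.newFixedThreadPool(1);
        try {
            CompletableFuture<String> result = CompletableFuture
                .supplyAsync(() -> "request", net)
                .thenComposeAsync(
                    req -> CompletableFuture.supplyAsync(() -> "response to " + req, net),
                    net);
            return result.join(); // joined from the caller's thread, not a pool thread
        } finally {
            net.shutdownNow();
        }
    }
}
```

In this sketch `blockingCallStarves()` reports the stall, while `callbackStyle()` completes on the very same one-thread pool because no pool thread ever waits on another task.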

This is an assert and not an exception because we want to catch this during development. Exceptions are too soft (for example, if we threw an exception and it only led to a shard being failed and being relocated elsewhere, the cluster could recover from this and tests might not fail). Assertions are hard since they go uncaught (or kill the node) and uncaught errors automatically fail tests.
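
To make the "too soft" point concrete, here is a small sketch in plain Java. The recovery layer `runWithRecovery` is hypothetical and only mimics code that absorbs exceptions and carries on. The `AssertionError` is thrown explicitly so the sketch behaves the same with or without -ea; a failing `assert` statement raises the same error type.

```java
class AssertVsException {
    // Hypothetical recovery layer: catches exceptions and moves on,
    // so a test built on top of it would still pass.
    static String runWithRecovery(Runnable operation) {
        try {
            operation.run();
            return "ok";
        } catch (Exception e) {
            return "recovered from " + e.getClass().getSimpleName();
        }
    }

    // "Soft" failure: an exception that the recovery layer absorbs.
    static void failSoft() {
        throw new IllegalStateException("Blocking operation");
    }

    // "Hard" failure: AssertionError is an Error, not an Exception,
    // so catch (Exception e) does not intercept it.
    static void failHard() {
        throw new AssertionError("Blocking operation");
    }
}
```

Calling `runWithRecovery(AssertVsException::failSoft)` returns normally, whereas `runWithRecovery(AssertVsException::failHard)` lets the error escape to the caller, where a test runner would report it as a failure.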
