Null Responses and deadlocks when using the Elasticsearch.Net APIs

Hi folks,

We’re using the C# Elasticsearch.Net API to write to an index. We issue many bulk requests; most succeed, and some fail because of queue pressure, in which case we retry. Every response object from a client call is of type ElasticsearchResponse, which has a Response member containing the parsed JSON as a DynamicDictionary. Occasionally we see this member set to null even though the response status code is 200, which we don't expect. It turns out that these calls time out after a minute: if you call client.BulkAsync(...).Wait() instead of client.Bulk(), a timeout exception is thrown (the synchronous Bulk does not throw, even though the client is configured to throw exceptions).
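
To make the symptom concrete, here is roughly what our call site looks like (a simplified sketch, not our production code; bulkJson stands in for the NDJSON bulk payload we build elsewhere, and Log is our own logger):

var response = client.Bulk<DynamicDictionary>(bulkJson);

if (response.HttpStatusCode == 200 && response.Response == null)
{
    // Unexpected: a "successful" status with no parsed body. Blocking on the
    // async variant instead surfaces a timeout after about a minute:
    //   client.BulkAsync<DynamicDictionary>(bulkJson).Wait();
    Log.Warn("Bulk returned 200 with a null Response");
}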

The bigger problem is that there appears to be some sort of client-side connection starvation related to those errors. Eventually, after a random amount of time, all Bulk calls get stuck and never return, which blocks all our executor threads and effectively kills the client. Connection starvation happens when the caller of a web API does not clean up (dispose/close) network connections, usually on error paths (this may be the case here, but at this point I'm speculating).
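
For context, by "connection starvation" I mean the usual .NET pattern where a response (or its stream) is not disposed on some error path, so the underlying connection never goes back to the ServicePoint pool and new requests end up queuing for a free connection. A generic illustration of the clean-up I mean, not a claim about what the client actually does internally:

var request = (HttpWebRequest)WebRequest.Create("http://localhost:9200/");
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
using (var reader = new StreamReader(stream))
{
    var body = reader.ReadToEnd();
    // Disposing the response/stream returns the connection to the pool;
    // skipping this on exception paths is what eventually starves callers.
}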

Has anyone encountered this? Is it a known bug? Fixed in a newer version? We are using Elasticsearch.Net 1.7.2. The server runs 1.7.1.

Thanks
Fritz

Hey Fritz, could you provide some more details?

  1. What settings are you configuring the client with?
  2. How are you composing and sending bulk requests with the client? An example that replicates the issue would be really helpful
  3. How are you performing retries? Are you using client configuration for retrying, retrying manually, etc? On what condition are you retrying e.g. are you seeing EsRejectedExecutionExceptions being returned from Elasticsearch?
  4. Do you see any exceptions being thrown and if so, can you post the Stack Trace?
  5. What do TCP connections coming from the client machines look like (using something like TCPView)? Are they saturated? Are there connections stuck in CLOSE_WAIT?

I've used NEST 1.7.2 against Elasticsearch 1.7.1 previously without seeing these kinds of issues; I was indexing 10,000 complex documents (including geo_shape fields) roughly every 15 seconds for 5 hours (~12 million documents).

If you're seeing EsRejectedExecutionExceptions coming back, I would take a look at the threadpools for bulk

GET /_cat/thread_pool?v&h=host,bulk.queueSize,bulk.min,bulk.max

to see if you need to look at indexing fewer documents in each bulk request.
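
If the queue is the problem, capping the number of documents per request is usually the first thing to try. Something along these lines (a rough sketch; documents is whatever collection you're indexing and SendBulk is a placeholder for however you compose and send the bulk body):

const int batchSize = 1000; // tune based on document size and the thread pool output above
foreach (var batch in documents
    .Select((doc, i) => new { doc, i })
    .GroupBy(x => x.i / batchSize, x => x.doc))
{
    var response = SendBulk(client, batch.ToList());
    // inspect the per-item results in the response for rejections here
}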

Hi Russ,

Thanks for following up.
Yes, we see "EsRejectedExecutionException[rejected execution (queue capacity" errors, but those are not the big concern; I believe they're normal when the server is under pressure. For example, I send a bulk request with 100 documents, 80 succeed and 20 are rejected; when I retry, I resend just the 20 that failed. I wrote an adaptive algorithm that regulates client throughput and adjusts the number of concurrent threads based on the server's queue push-back, so over time the number of queue errors and retries is minimized. Basically, the goal is to not index documents faster than the server can process them (why the server sometimes seems slow is a different question).
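
In outline, the retry part looks something like this (a simplified sketch that ignores the adaptive concurrency; SendBulk, ExtractRejected and maxAttempts are placeholders for our own helpers and settings):

var pending = documents;          // documents still to be indexed
var backoff = TimeSpan.FromMilliseconds(500);
for (var attempt = 0; pending.Count > 0 && attempt < maxAttempts; attempt++)
{
    var response = SendBulk(client, pending);
    // keep only the items the server rejected because of queue pressure
    pending = ExtractRejected(response, pending);
    if (pending.Count > 0)
    {
        Thread.Sleep(backoff);                            // back off before retrying
        backoff = TimeSpan.FromTicks(backoff.Ticks * 2);  // and slow down further
    }
}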

What is causing our ultimate failures is that, after a random amount of time, all Bulk calls hang and never complete, which brings the whole client to a standstill. Also, some requests simply come back with a null Response member, which breaks our code's assumptions. I traced that back to a failure of the ElasticsearchDefaultSerializer to parse the response from the server; they look like timeouts, yet the status code is 200, which seems like a bug. I had to pull down the client library's source from GitHub to get more insight, as just looking at the call stacks in the Visual Studio debugger didn't help.

I'll follow up here when I have more details from my debugging. If you or anyone else knows more about this, I could use any available information to unblock me.

Hi Russ, folks,

Following up on this one

The deadlocks I mentioned happen when using the async version of the APIs (i.e. BulkAsync() instead of Bulk()); I have not seen them since I switched to the sync calls. When the deadlocks happened, I could see .NET tasks that never complete; my client would call WaitAny on them and block forever. This wasn't connection starvation: hung connections are pretty obvious on the call stack and would eventually time out trying to retrieve a response stream. More likely, a callback for a task is not invoked in some rare error situation, which leaves the task in limbo. I did not investigate further, however, since I had to make progress and the non-async API did not suffer from the same problem. Russ, did you fix anything like this in newer versions?

The second issue is that occasionally, even with non-async requests, not only bulk calls but also other calls, such as those retrieving a document, come back as a successful ElasticsearchResponse<DynamicDictionary> with an HTTP status of 200 but a null Response member. I debugged those a bit and found that simply reading from the response ConnectStream returns 0 bytes. I didn't dig higher up the call stack to figure out what the actual response was; I would have had to change the code a bit for that. These errors are pretty disturbing, as it's unclear what actually happens to that batch of bulk index operations. Did they succeed? Did they fail? Did some succeed and some fail? I am doing further testing to figure out whether the number of documents indexed actually matches the number of documents I thought I indexed. I am using automatic IDs, so if the documents in the null-response batch were actually indexed, I would get duplicate documents when I retry.
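
Right now I treat the null-Response case as "indeterminate" rather than as a plain failure, roughly like this (a sketch; SendBulk and the two collections are our own helpers, and I'm assuming the status is exposed via HttpStatusCode):

var response = SendBulk(client, batch);
if (response.Response != null)
{
    // normal path: inspect per-item results and retry only rejected items
}
else if (response.HttpStatusCode == 200)
{
    // 200 with a null body: unknown whether the batch was indexed. With
    // automatic IDs a blind retry risks duplicates, so park the batch and
    // verify (e.g. by comparing index counts) before resending.
    indeterminateBatches.Add(batch);
}
else
{
    // a genuine failure: safe to retry the whole batch
    retryQueue.Enqueue(batch);
}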

Russ, or the other owners of the API, can you help us understand this issue?

Generally speaking, you should avoid blocking (.WaitAny(), .Wait()) on a method that returns Task (and, likewise, blocking on .Result of a method that returns Task<T>) anywhere in an asynchronous pipeline other than at the top-level entry point, e.g. Main() in a console application.

Waiting on a task blocks, and there is the possibility of introducing a deadlock, which is what I think is happening in your case; the calling code blocks the context while waiting for the Task<T> to complete, and the Task<T> is waiting for the context to become free so it can complete, the result being a deadlock. It's difficult to know if this is definitely what is happening, but based on your description the indicators are there. What kind of application is it?
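
For reference, the pattern I'm describing looks like this in its simplest form (a generic illustration, assuming a single-threaded SynchronizationContext such as an ASP.NET request or UI context):

public async Task<string> GetDataAsync()
{
    // without ConfigureAwait(false), the continuation wants to resume
    // on the captured (single-threaded) context
    await Task.Delay(1000);
    return "done";
}

public string GetData()
{
    // blocks the context's only thread waiting for the task, while the
    // task's continuation needs that same thread to complete -> deadlock
    return GetDataAsync().Result;
}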

Within NEST 2.x, all async methods are configured not to capture the context, i.e. they all use .ConfigureAwait(false), so continuations can resume on a thread pool thread. In NEST 1.x, however, this isn't the case, as it uses the TPL instead of async/await (for backwards compatibility with .NET 4). As recommended, you shouldn't block on async code, so if you need to perform work against Elasticsearch synchronously, using the synchronous methods makes sense. But out of curiosity, if you're on .NET 4.5 or later, what happens if you change your code to

var bulkTask = client.BulkAsync(new BulkRequest()).ConfigureAwait(false);
var response = bulkTask.GetAwaiter().GetResult();

I would not recommend having this in your code, but I'm interested to know if the deadlock is due to context capturing.

Do you have a small reproducible example that can demonstrate the null Response issue?

Hi Russ, thanks for the follow-up.
I'm familiar with the sort of deadlock you refer to, but I do not think it has anything to do with my situation (I send a link to Stephen Cleary's blog to my junior devs when they run into those issues too ;-)). That is a common deadlock for ASP.NET web requests, where the captured context allows only a single thread. This isn't the case in my application (it's not a web service); the callers run on thread-pool threads, so WaitAny is a fine option here. Note that, depending on the size of the data I have to index, I create tens of thousands of such tasks and the vast majority of them work just fine; sometimes the deadlock doesn't happen at all, but with large data sets it eventually does.

As for the null response, which is more concerning, I'm not sure how I could provide a repro, but yes, it happens quite often; I will reply with statistics of what I see in a common bulk indexing scenario. I pulled down the source code and was able to debug it, but I didn't get all the way to the root of the issue; the only thing I discovered is that reading from the response stream yields 0 bytes, which don't deserialize to anything, so the Response stays null.

I'm not sure I understand what you're saying here.

I agree that if you're not seeing the deadlock all the time, it doesn't sound like an async deadlock problem.

I would look at the Elasticsearch logs to correlate with when you see this happening; it might also be worth enabling tracing on the client to get more information.

Hi Russ,

To clarify, the paragraph you quoted was just me paraphrasing the article on Stephen Cleary's blog that you referenced in your earlier message. The deadlock that happens when an async Task tries to re-enter the captured context is a side effect of capturing and re-entering a context that is single-threaded; this is typical of the ASP.NET request context and the WinForms UI context. The application I am writing is neither of those; all threads are thread-pool threads, and the thread pool's context is multi-threaded, so even if the context is captured by default by the BulkAsync call, it won't lead to a deadlock.
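
A quick way to see why: on a plain thread-pool thread there is no single-threaded SynchronizationContext to capture, so a blocked wait and the task's continuation never compete for the same thread. A minimal sketch (SomeAsyncCall stands in for the BulkAsync call):

ThreadPool.QueueUserWorkItem(_ =>
{
    // prints "no context" on a plain thread-pool thread
    Console.WriteLine(SynchronizationContext.Current == null
        ? "no context"
        : SynchronizationContext.Current.GetType().Name);

    // blocking here completes: the continuation resumes on some other
    // thread-pool thread rather than needing this blocked one
    var result = SomeAsyncCall().Result;
});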

I will look into the client tracing for better diagnostics. The deadlocks and the null responses seem to be different issues at this point; I will follow up with more details once I do some more testing.