We have a .NET backend server application that uses NEST 2.3.1 against Elasticsearch 2.2.0. It runs under IIS on multiple hosts, all of which perform the same queries against the same Elasticsearch cluster.
Recently we saw very strange behavior at a customer site that runs the application on 4 IIS hosts. Suddenly, all queries to Elasticsearch from one of the hosts started to fail, while the other 3 hosts continued to work fine.
The error Elasticsearch returned for queries from the failing host was the following:
StatusCode: 400, Method: POST, Url: https://--the-url--/_search?timeout=5s, Elasticsearch reason: [term] query does not support [has_child], Exception: System.Net.WebException: The remote server returned an error: (400) Bad Request.
   at System.Net.HttpWebRequest.GetResponse()
   at Elasticsearch.Net.HttpConnection.Request[TReturn](RequestData requestData)
After enabling verbose request logging on the application side, we found that we actually did send the following query:
This query is obviously malformed, so Elasticsearch is right to reject it. It should look like this, with "value" instead of "has_child" as the property name:
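For reference, Elasticsearch 2.x expects the long form of a term query to be `{"term": {"<field>": {"value": ...}}}`, and the error message shows `has_child` arriving where `value` belongs. A minimal sketch of a check that could be run over logged request bodies to flag this kind of corruption is below. It assumes System.Text.Json is available (built into .NET Core 3.0+, or via the System.Text.Json NuGet package on .NET Framework), the class and method names are hypothetical, and it accepts only `value` and `boost` inside the per-field object, which is deliberately narrower than what Elasticsearch's own parser allows.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of NEST or Elasticsearch.Net.
public static class TermQueryShapeCheck
{
    // Keys this sketch accepts inside {"term": {"<field>": {...}}}.
    // Deliberately narrow: Elasticsearch itself accepts a few more.
    private static readonly HashSet<string> AllowedKeys = new HashSet<string> { "value", "boost" };

    // Returns every unexpected key found inside long-form term queries anywhere
    // in the request body. An empty list means nothing suspicious was found.
    public static List<string> FindUnexpectedTermKeys(string requestBody)
    {
        var unexpected = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(requestBody))
        {
            Walk(doc.RootElement, unexpected);
        }
        return unexpected;
    }

    // Recursively visits objects and arrays so nested bool/filter clauses are covered.
    private static void Walk(JsonElement element, List<string> unexpected)
    {
        if (element.ValueKind == JsonValueKind.Object)
        {
            foreach (JsonProperty prop in element.EnumerateObject())
            {
                if (prop.Name == "term" && prop.Value.ValueKind == JsonValueKind.Object)
                {
                    foreach (JsonProperty field in prop.Value.EnumerateObject())
                    {
                        // Short form {"term": {"status": "active"}} has a scalar here; skip it.
                        if (field.Value.ValueKind != JsonValueKind.Object) continue;
                        foreach (JsonProperty inner in field.Value.EnumerateObject())
                        {
                            if (!AllowedKeys.Contains(inner.Name)) unexpected.Add(inner.Name);
                        }
                    }
                }
                Walk(prop.Value, unexpected);
            }
        }
        else if (element.ValueKind == JsonValueKind.Array)
        {
            foreach (JsonElement item in element.EnumerateArray()) Walk(item, unexpected);
        }
    }
}
```

With a hypothetical field `status`, a body such as `{"query":{"term":{"status":{"has_child":"active"}}}}` would be reported as containing `has_child`, while the same body with `value` yields an empty list.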
In the application's C# code, we simply build the term condition with the NEST client like this:
We saw exactly the same thing with regexp queries as well as term queries. It appears that the NEST client suddenly started serializing requests with "has_child" instead of "value" as the property name for the condition value. After we restarted the process (by recycling the IIS application pool), the problem disappeared.
I have browsed the NEST/Elasticsearch.Net source code to see how queries are serialized, including looking for thread-safety problems in FieldNameQueryJsonConverter, ReserializeJsonConverter and the GetCachedObjectProperties extension method, but I have not found anything that looks wrong.
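One way to probe the thread-safety hypothesis outside production is to serialize the same query from many threads at once and check that every call produces identical output. The following is a minimal, library-agnostic sketch: `CountDistinctOutputs` is a hypothetical helper, and the `serialize` delegate is a placeholder where you would plug in whatever call in your code turns the NEST query into a JSON string. It only uses the .NET base class library.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical diagnostic helper, not part of NEST or Elasticsearch.Net.
public static class SerializationStressTest
{
    // Invokes `serialize` `iterations` times in parallel on the thread pool and
    // returns each distinct output string with the number of times it was produced.
    // A result with more than one key means serialization was not deterministic.
    public static IDictionary<string, int> CountDistinctOutputs(Func<string> serialize, int iterations)
    {
        var counts = new ConcurrentDictionary<string, int>();
        Parallel.For(0, iterations, _ =>
        {
            string json = serialize();
            // AddOrUpdate retries on contention, so the final counts are exact.
            counts.AddOrUpdate(json, 1, (key, n) => n + 1);
        });
        return new Dictionary<string, int>(counts);
    }
}
```

A single distinct output does not rule out a rare race (the production failure may depend on timing, load, or process lifetime), but more than one distinct output would be strong evidence that serialization is not deterministic under concurrency.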
Has anyone seen anything similar? Any ideas about what could cause this, other than some sort of memory corruption?