Hi everyone,
I am hitting this "strange" behavior related to search_phase_execution_exceptions that I don't really understand and would like to know the underlying reasons for.
A bit of context; I'm running ES 7.8.1, the cluster has two ES nodes. I'm running it on my dev workstation (windows 10) with AdoptOpenJDK 11
The goal: To provide end users with the best possible "reason" for why a given search request failed (basically handling "all shards failed" scenarios)
I have defined a query string which I know will cause a lexical error:
Test """
Going native json, this is the query, sent to test_index/_search endpoint:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Test \"\"\""
}
}
]
}
}
}
And this is the response, 10/10 times:
{
"error": {
"root_cause": [
{
"type": "query_shard_exception",
"reason": "Failed to parse query [Test \"\"\"]",
"index_uuid": "8XXlJMXlRHyZkWFF3ilzfw",
"index": "test_index"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "test_index",
"node": "S1yQggTBSVW7hsZrg02rWA",
"reason": {
"type": "query_shard_exception",
"reason": "Failed to parse query [Test \"\"\"]",
"index_uuid": "8XXlJMXlRHyZkWFF3ilzfw",
"index": "test_index",
"caused_by": {
"type": "parse_exception",
"reason": "Cannot parse 'Test \"\"\"': Lexical error at line 1, column 9. Encountered: <EOF> after : \"\"",
"caused_by": {
"type": "token_mgr_error",
"reason": "Lexical error at line 1, column 9. Encountered: <EOF> after : \"\""
}
}
}
}
]
},
"status": 400
}
However, if I make use of the RestHighLevelClient and the searchAsync() method, I experience a different behavior:
Repeating the same request - the exception "object" I get back varies, sometimes it holds a ResponseException part of the Suppressed array - sometimes the Suppressed array is empty.
A code snippet from the onFailure handler, note the logging of the esEx here:
public void onFailure(Exception e) {
if(e instanceof ElasticsearchStatusException) {
ElasticsearchStatusException esEx = (ElasticsearchStatusException) e;
Throwable rootCause = esEx.getRootCause();
Throwable cause = esEx.getCause();
Throwable[] suppressed = rootCause.getSuppressed();
LOG.info("Cause: " + cause);
LOG.info("rootCause: " + rootCause);
LOG.info("Suppressed size: " + suppressed.length);
LOG.info("Suppressed: " + (suppressed.length > 0 ? suppressed[suppressed.length-1] : null));
}
LOG.error("Exception caught from " + Stream.of(stackTrace).map(StackTraceElement::toString).collect(Collectors.joining("\n")));
if(sr != null) {
LOG.error("ES request = {}", sr.source());
}
cf.completeExceptionally(exceptionTransformer.apply(e));
}
Most of the times (approx 60-70%) I end up with this printed from my log statements:
c.k.rest.async.CompletableFutureUtils : Cause: ElasticsearchException[Elasticsearch exception [type=token_mgr_error, reason=token_mgr_error: Lexical error at line 1, column 6. Encountered: <EOF> after : ""]]
c.k.rest.async.CompletableFutureUtils : rootCause: ElasticsearchException[Elasticsearch exception [type=token_mgr_error, reason=token_mgr_error: Lexical error at line 1, column 6. Encountered: <EOF> after : ""]]
c.k.rest.async.CompletableFutureUtils : Suppressed size: 0
c.k.rest.async.CompletableFutureUtils : Suppressed: null
However, the other 30-40% of the times I issue this request, this is printed (see the ResponseException part of Suppressed array):
c.k.rest.async.CompletableFutureUtils : Cause: null
c.k.rest.async.CompletableFutureUtils : rootCause: ElasticsearchStatusException[Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]]
c.k.rest.async.CompletableFutureUtils : Suppressed size: 1
c.k.rest.async.CompletableFutureUtils : Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://localhost:9202], URI [/test_index/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&preference=kadmin&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [1 \"\"\"]","index_uuid":"dAeiuu_tQ_SBKP1UAsnfVw","index":"test_index"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"test_index","node":"VBx7kpXwTaCtoq0cP_m_oA","reason":{"type":"query_shard_exception","reason":"Failed to parse query [1 \"\"\"]","index_uuid":"dAeiuu_tQ_SBKP1UAsnfVw","index":"test_index","caused_by":{"type":"parse_exception","reason":"Cannot parse '1 \"\"\"': Lexical error at line 1, column 6. Encountered: <EOF> after : \"\"","caused_by":{"type":"token_mgr_error","reason":"Lexical error at line 1, column 6. Encountered: <EOF> after : \"\""}}}}]},"status":400}
So to sum up;
Whenever the ResponseException is part of the exception returned, the cause is null but suppressed.length >= 1
If the ResponseException is not part of the exception returned, there is a cause object and suppressed.length == 0
Any clues to why I get this ambiguous behavior when the request is the same all the time? And can I do anything to get the ResponseException always?
I can add that regardless of which "flow" I end up in, the stacktrace is always coming from the same point in the ES code;
at org.elasticsearch.client.RestHighLevelClient$1.onFailure(RestHighLevelClient.java:1783) ~[elasticsearch-rest-high-level-client-7.8.1.jar:7.8.1]
at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:598) ~[elasticsearch-rest-client-7.8.1.jar:7.8.1]
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:343) ~[elasticsearch-rest-client-7.8.1.jar:7.8.1]
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:327) ~[elasticsearch-rest-client-7.8.1.jar:7.8.1]
Kind Regards
Eivind Rønnevik