Elasticsearch stops responding after a certain number of "Failed to parse query" errors

When the search string contains special characters, the 'query_string' search fails with an error like:

    Failed to parse query [sam]
    at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery

Our issue here is not this error itself; we can escape the special character or use the 'simple_query_string' query instead.
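
For example, both of these avoid the parse error for a search string ending in an unbalanced double quote (rough sketches only; the host, port and index name 'my-index' are placeholders):

    curl -s -H 'Content-Type: application/json' 'http://localhost:9200/my-index/_search' -d '{
      "query": {
        "query_string": {
          "query": "sam\\\"",
          "fields": ["data.contexts.all_values"]
        }
      }
    }'

    curl -s -H 'Content-Type: application/json' 'http://localhost:9200/my-index/_search' -d '{
      "query": {
        "simple_query_string": {
          "query": "sam\"",
          "fields": ["data.contexts.all_values"]
        }
      }
    }'

The first escapes the quote so the classic query parser treats it as a literal character; the second uses simple_query_string, which discards invalid syntax rather than throwing an exception.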

The issue we are having is that every time this error occurs, the search thread does not appear to be returned to the pool, so after a certain number of these errors the Elasticsearch nodes stop responding. We have a 7-node cluster. When we executed this query with the special character continuously, the nodes stopped responding and simply hung after roughly 60 attempts. We have to restart a node to get it back to a working state, and sometimes we have to restart more than one node or even the whole cluster. We were using Elasticsearch 5.2 and recently upgraded to 6.5.4, but the issue still seems to be there. We just want to confirm whether this is a problem with the error handling of the query_string search and, if it is, whether it has been fixed.

Thanks for the report. This does sound like a problem, but unfortunately there's not enough information in your post to help us to reproduce it, and without that we cannot solve it. This kind of problem can be very tricky.

Please can you share the query you're running and the error that you're seeing, including as much information as possible (full stack traces, all pertinent log messages etc.) to help out? Could you also share the full output of GET _tasks once the cluster has got into this state? There's probably quite a lot of information to share, so you might like to use https://gist.github.com/. If you would rather not make all this information public, I can provide my email address for you to share it privately.

Also, a thread dump from GET /_nodes/hot_threads?threads=9999 would be handy.
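
Concretely, something like the following run against any one of the nodes, saving the responses to files that you can then put in a gist (the host and port here are just the defaults, so adjust as needed):

    curl -s 'http://localhost:9200/_tasks?pretty' > tasks.json
    curl -s 'http://localhost:9200/_nodes/hot_threads?threads=9999' > hot_threads.txt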

We have the logs from multiple instances where this error happened. I don't see an option to attach files here, so if you can provide your email address I can send them over.

This has been happening for quite a while now, and every time the service goes into this state the logs end with this query error as the last entry. We never thought a parse error in the query could cause this, and we ran several unit tests to simulate it, but the issue never occurred during those tests. It only happens after the query has been run a certain number of times.

Here is a sample of the query:

"must" : [
{
"query_string" : {
"query" : "sam\",
"fields" : [
"data.contexts.all_values^1.0"
],
"use_dis_max" : true,
"tie_breaker" : 0.0,
"default_operator" : "and",
"auto_generate_phrase_queries" : false,
"max_determined_states" : 10000,
"enable_position_increment" : true,
"fuzziness" : "AUTO",
"fuzzy_prefix_length" : 0,
"fuzzy_max_expansions" : 50,
"phrase_slop" : 0,
"escape" : false,
"split_on_whitespace" : true,
"boost" : 1.0
}
}
],

The last log entries before the node stopped responding:

org.elasticsearch.transport.RemoteTransportException: [VD6DLR4][10.0.0.63:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.index.query.QueryShardException: Failed to parse query [sam]
at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:1034) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:97) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.query.BoolQueryBuilder.addBooleanClauses(BoolQueryBuilder.java:442) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.query.BoolQueryBuilder.doToQuery(BoolQueryBuilder.java:416) ~[elasticsearch-5.2.0.jar:5.2.0]

Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: parse_exception: Cannot parse 'sam': Lexical error at line 1, column 5. Encountered: <EOF> after : ""
at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:116) ~[lucene-queryparser-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:58:02]
at org.apache.lucene.queryparser.classic.MapperQueryParser.parse(MapperQueryParser.java:786) ~[elasticsearch-5.2.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:58:02]
at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:1032) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:97) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.query.BoolQueryBuilder.addBooleanClauses(BoolQueryBuilder.java:442) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.query.BoolQueryBuilder.doToQuery(BoolQueryBuilder.java:416) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:97) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.query.NestedQueryBuilder.doToQuery(NestedQueryBuilder.java:252) ~[elasticsearch-5.2.0.jar:5.2.0]

Sure. I'm david.turner@elastic.co.

We shared some information by email, including the hot threads output and a diagnostics dump, and it looks like the stuck node has many threads, including all of its network threads, busy writing to logs:

    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[2fFs0rv][transport_client_boss][T#2]'
     10/10 snapshots sharing following 72 elements
       java.io.FileOutputStream.writeBytes(Native Method)
       java.io.FileOutputStream.write(FileOutputStream.java:326)
       java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
       java.io.PrintStream.write(PrintStream.java:480)
       org.apache.logging.log4j.core.util.CloseShieldOutputStream.write(CloseShieldOutputStream.java:53)
       org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:252)
       org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:284)
       org.apache.logging.log4j.core.appender.OutputStreamManager.drain(OutputStreamManager.java:341)
       org.apache.logging.log4j.core.layout.TextEncoderHelper.drainIfByteBufferFull(TextEncoderHelper.java:147)
       org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeAsMuchAsPossible(TextEncoderHelper.java:125)
       org.apache.logging.log4j.core.layout.TextEncoderHelper.encode(TextEncoderHelper.java:109)
       org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(TextEncoderHelper.java:89)
       org.apache.logging.log4j.core.layout.StringBuilderEncoder.encodeSynchronized(StringBuilderEncoder.java:86)
       org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:61)
       org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)

I've suggested toning down the logging to see if that helps.

diff -U3 -r config-default/log4j2.properties config-0/log4j2.properties
--- config-default/log4j2.properties	2017-01-24 19:50:32.000000000 +0000
+++ config-0/log4j2.properties	2019-03-07 11:13:08.000000000 +0000
@@ -2,7 +2,7 @@

 # log action execution errors for easier debugging
 logger.action.name = org.elasticsearch.action
-logger.action.level = debug
+logger.action.level = info

 appender.console.type = Console
 appender.console.name = console
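
As an aside, the same logger can also be adjusted at runtime via the cluster settings API rather than by editing the file, along these lines (note that a transient setting is lost again on a full cluster restart):

    curl -s -H 'Content-Type: application/json' -X PUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "logger.org.elasticsearch.action": "info" }
    }'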

Thank you, we appreciate your timely help. We will try changing the logging level and give it a shot. What is still not clear to us is why these large log messages caused the threads to block. Now that you have given us some pointers, we will explore a bit on our side as well. Thanks again.

It's a good question, and the short answer is that I don't know. Can you tell us a bit more about how you are collecting logs from all these nodes? Is each node's log file being written to a disk local to that node or is it more complicated than that?

Also, it's not clear whether it's actually stuck or whether it's just proceeding slowly: the stack trace I quoted above has this thread all the way down in java.io.FileOutputStream.writeBytes().

I also think that things have changed in this area of Log4j since the version you're using; the code I looked at doesn't correspond to what we're seeing here. It's possible that upgrading to a newer version will help.

Agree. We are in the process of upgrading to 6.5.4.

To your question about log collection: each Elasticsearch node's logs are written to a local folder only, and we have a Filebeat service that ships them to a centralized ELK instance.

Thanks David, the change you suggested worked; the issue does not happen anymore.

