Surprising behaviour when escaping reserved char in query string [1.3.4]

mrec · October 13, 2015, 1:18pm

Trivial document, default (i.e. dynamic) mapping/analyzer.

PUT example/doc/1
{"s":">hello"}

Tokenization discards the >, but it's a reserved character, so you'd expect a naive unescaped querystring containing it to have problems:

GET example/_search
{"query":{"query_string":{"query":"(s:(>hello))"}}}

And it does, sort of; no errors, but no hits either. "query":"(s:(%hello))" does match the doc, and % isn't reserved, so the reserved-ness of > definitely seems to be the reason.

Where it gets weird is that "(s:(\\>hello))", i.e. one JSON-escaped backslash followed by >, doesn't match the doc either. "(s:(\\\\>hello))", which looks like it ought to be a Lucene-escaped backslash followed by an unescaped >, does match. So do "(s:(\\\\\\>hello))", "(s:(\\\\\\\\>hello))" and so on ad infinitum.

Can anyone make any sense of this? As a newbie I've been banging my head against it without result, and colleagues with much more ES experience are similarly stumped.

warkolm · October 14, 2015, 5:10am

Very weird, but I don't have an answer! It does it on 1.7.2 as well.
Let me ask one of the core team for some of their thoughts.

mrec · October 14, 2015, 11:24am

Thanks! Very relieved to hear it's not just me.

One additional piece of info: if I double-quote the query term instead of parenthesising it, it matches regardless of escaping. That is, "s:\">hello\"", "s:\"\\>hello\"", "s:\"\\\\>hello\"" etc. all match. (Obviously this changes the meaning of the query if it contains multiple tokens, so it's not really a workaround.)

cbuescher · October 14, 2015, 2:33pm

Hi,

notice that

GET example/_validate/query?explain
{"query":{"query_string":{"query":"(s:(>hello))"}}}

gives the lucene query explanation

"s:{hello TO *]"

and shows that > is shorthand for an unbounded range query that excludes its lower bound (here, hello). This may explain why that query doesn't return the document. If you use % instead, as in "(s:(%hello))", the character is simply dropped and the result is a normal term query that matches the doc.

As for the escaping, notice that

GET example/_validate/query?explain
{"query":{"query_string":{"query":"(s:(\\>hello))"}}}

=> "s:{hello TO *]"

but

GET example/_validate/query?explain
{"query":{"query_string":{"query":"(s:(\\\\>hello))"}}}

=> "s:hello"

So the first version does not escape the > character and thus creates a range query, whereas the latter results in a term query (the > gets dropped by the standard analyzer).

Hope this helps.

mrec · October 14, 2015, 2:54pm

Hi Christoph,

Thanks for investigating. To be clear, are you just expanding on what Lucene is doing under the covers, or are you saying that this is in fact correct behaviour? (In which case either the relevant docs are wrong or my reading comprehension is failing spectacularly.)

cbuescher · October 14, 2015, 3:36pm

Why \\> doesn't escape the reserved character surprises me too, given the docs. My comment was mostly about why s:(>hello) doesn't match the original doc.

jprante · October 14, 2015, 4:11pm

These are equivalent

GET /test/_search?explain&q=s:\\>hello

and

POST /test/_search
{
    "explain" : true,
    "query" : {
        "query_string" : {
            "query" : "s:\\\\>hello"
        }
    }
}

You need to double-backslash everything in the second variant because of the JSON parser which processes the request body. So \\\\ becomes \\ which is then passed to the Lucene query string parser.
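The JSON layer described above can be checked in isolation. The following is a minimal Python sketch, not anything Elasticsearch itself runs: it decodes two request bodies with the standard json module and shows the text that would be handed on to the query string parser. The helper name query_seen_by_parser is made up for this illustration.

```python
import json

def query_seen_by_parser(body_text):
    """Decode a JSON request body and return the query_string text (hypothetical helper)."""
    return json.loads(body_text)["query"]["query_string"]["query"]

# Raw strings keep the backslashes exactly as they are typed in the request body.
two_backslashes = r'{"query":{"query_string":{"query":"s:\\>hello"}}}'
four_backslashes = r'{"query":{"query_string":{"query":"s:\\\\>hello"}}}'

after_two = query_seen_by_parser(two_backslashes)    # one backslash survives
after_four = query_seen_by_parser(four_backslashes)  # two backslashes survive

print(after_two)
print(after_four)
```

Each pair of backslashes in the body collapses to a single backslash after JSON decoding, so the parser never sees the JSON-level escaping itself.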

Anyway, the escaping makes no difference to the search result; in both cases the token being searched for is hello.

mrec · October 14, 2015, 4:19pm

@jprante no, I don't think so. I'm not trying to pass a literal backslash to the query string parser, I'm trying to pass a literal >. So according to the escaping docs I should be passing \> to Lucene, meaning \\> in JSON.
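The two layers in that reasoning can be written out as a small Python sketch. It only shows what the documented escaping rule would produce, not what Elasticsearch 1.3.4 actually accepts; the helper name escape_char is made up for this illustration.

```python
import json

def escape_char(text, ch):
    """Prefix every occurrence of ch with a single backslash (hypothetical helper)."""
    return text.replace(ch, "\\" + ch)

# Layer 1: what the query string parser should receive according to the escaping docs.
lucene_level = escape_char(">hello", ">")

# Layer 2: the same text written as a JSON string literal for the request body.
json_level = json.dumps(lucene_level)

print(lucene_level)
print(json_level)
```

The first line printed contains one backslash and the second contains two, which is the "\\>hello" form that, as reported at the start of this thread, does not match.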

I understand that the escaped char is useless in this minimized example, but in the real system I'm looking at I don't control the raw strings and can't assume that the field being searched will always be tokenized in a way that ignores problematic characters (i.e. I can't just replace them with spaces).

jprante · October 14, 2015, 6:19pm

Ok, I was wrong. There are several effects. This will pass \>hello using the JSON escaper

POST /test/_search
{
    "query" : {
        "query_string" : {
            "escape" : false,
            "query" : "\\>hello",
            "default_field" : "s"
        }
    }
}

This will also pass \>hello but uses Lucene's query string escaper.

POST /test/_search
{
    "query" : {
        "query_string" : {
            "escape" : true,
            "query" : ">hello",
            "default_field" : "s"
        }
    }
}

Setting escape: true enables Lucene's query string escaping after the JSON has been parsed, but before the value is submitted to the search.

But even if you can pass >hello in multiple ways, it does not execute a search for this term.

First observation: if you index >hello with the ES default analyzer, you will index hello and searching for >hello will not return a hit.

Second observation: as @cbuescher wrote, ES does some smart preprocessing in query_string queries and evaluates a > symbol at position 0 for translating the query text to a range query in org.apache.lucene.queryparser.classic.MapperQueryParser

Now let's try to make this work in spite of that. Let's use the keyword analyzer, which will put the whole word >hello into the index.

PUT /test/
{
    "mappings" : {
        "docs" : {
            "properties" : {
                "s" : {
                    "type" : "string",
                    "analyzer" : "keyword"
                }
            }
        }
    }
}

PUT /test/docs/1
{"s":">hello"}

POST /test/_search
{
    "query" : {
        "query_string" : {
            "query" : ">hello",
            "default_field" : "s"
        }
    }
}

gives no hits.

If you want hits, you can use simple_query_string instead, which does not try to interpret > at position 0.

POST /test/_search
{
    "query" : {
        "simple_query_string" : {
            "query" : ">hello",
            "fields" : ["s"]
        }
    }
}

gives a hit.

Maybe this can help to find a solution.

mrec · October 14, 2015, 6:45pm

Thanks for the detailed reply. There's a lot to experiment with there, but from initial testing (again, against 1.3.4) neither

"escape":false,
"query":"\\>hello",

nor

"escape":true,
"query":">hello",

returns hits, but

"escape":true,
"query":"\\>hello",

does.
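One possible reading of these three results, offered as a sketch rather than a confirmed explanation: it assumes that escape: true applies Lucene's QueryParser.escape, and that this escaper covers the characters listed in the code below but not > or <. Both points are assumptions that should be checked against the Lucene version bundled with 1.3.4. The names assumed_lucene_escape and parser_input are made up for this illustration.

```python
import json

# ASSUMPTION: the set of characters Lucene's escaper prefixes with a backslash.
# Note that > and < are deliberately absent from this assumed set.
ASSUMED_LUCENE_ESCAPED = set('\\+-!():^[]"{}~*?|&/')

def assumed_lucene_escape(text):
    """Backslash-escape every character in the assumed set (hypothetical helper)."""
    return "".join("\\" + c if c in ASSUMED_LUCENE_ESCAPED else c for c in text)

def parser_input(json_literal, escape):
    """Decode the JSON string literal, then optionally apply the assumed escaper."""
    decoded = json.loads(json_literal)
    return assumed_lucene_escape(decoded) if escape else decoded

case_a = parser_input(r'"\\>hello"', escape=False)  # one backslash, then >
case_b = parser_input(r'">hello"', escape=True)     # > passes through untouched
case_c = parser_input(r'"\\>hello"', escape=True)   # the backslash itself gets escaped

for label, text in (("a", case_a), ("b", case_b), ("c", case_c)):
    print(label, text)
```

Under these assumptions only the third combination hands the parser two backslashes before >, which is the same text that "\\\\>hello" produces without escape, the form that cbuescher's explain output showed as a plain term query on hello.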

One other thing in your reply which confuses me:

"if you index >hello with the ES default analyzer, you will index hello and searching for >hello will not return a hit"

If you're talking about something more fundamental than Lucene's apparent divergence from its documented escaping rules... why not? Why does searching for %hello or $hello or ?hello return a hit but not \\>hello?
