Surprising behaviour when escaping reserved char in query string [1.3.4]


#1

Trivial document, default (i.e. dynamic) mapping/analyzer.

PUT example/doc/1
{"s":">hello"}

Tokenization discards the >, but it's a reserved character, so you'd expect a naive unescaped querystring containing it to have problems:

GET example/_search
{"query":{"query_string":{"query":"(s:(>hello))"}}}

And it does, sort of; no errors, but no hits either. "query":"(s:(%hello))" does match the doc, and % isn't reserved, so the reserved-ness of > definitely seems to be the reason.

Where it gets weird is that "(s:(\\>hello))", i.e. one JSON-escaped backslash followed by >, doesn't match the doc either. "(s:(\\\\>hello))", which looks like it ought to be a Lucene-escaped backslash followed by an unescaped >, does match. So do "(s:(\\\\\\>hello))", "(s:(\\\\\\\\>hello))" and so on ad infinitum.
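To keep the two escaping layers apart, it may help to decode each JSON literal on its own and look at the text that is handed to the query string parser. A minimal sketch in Python using only the standard json module; the variable names are illustrative, and this models the JSON layer only, not anything Elasticsearch or Lucene does afterwards:

```python
import json

# JSON string literals exactly as they appear in the request bodies above.
json_literals = [
    r'"(s:(>hello))"',
    r'"(s:(\\>hello))"',
    r'"(s:(\\\\>hello))"',
    r'"(s:(\\\\\\>hello))"',
]

# Decode each literal the way a JSON parser would; the result is the text
# passed on to the query string parser.
decoded = [json.loads(lit) for lit in json_literals]

for lit, text in zip(json_literals, decoded):
    backslashes = text.count("\\")
    print(f"{lit:28} -> {text:18} ({backslashes} backslash(es) before >)")
```

Each pair of backslashes in the JSON collapses to one, so the two-backslash form reaches the parser as \>hello and the four-backslash form as \\>hello.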

Can anyone make any sense of this? As a newbie I've been banging my head against it without result, and colleagues with much more ES experience are similarly stumped.


(Mark Walkom) #2

Very weird, but I don't have an answer! It does it on 1.7.2 as well.
Let me ask one of the core team for some of their thoughts.


#3

Thanks! Very relieved to hear it's not just me.

One additional piece of info: if I double-quote the query term instead of parenthesising it, it matches regardless of escaping. That is, "s:\">hello\"", "s:\"\\>hello\"", "s:\"\\\\>hello\"" etc all match. (Obviously this changes the meaning of the query if it contains multiple tokens, so it's not really a workaround.)


(Christoph) #4

Hi,

notice that

GET example/_validate/query?explain
{"query":{"query_string":{"query":"(s:(>hello))"}}}

gives the lucene query explanation

"s:{hello TO *]"

and shows that > is shorthand for a range query with an exclusive lower bound and no upper bound. This probably explains why that query doesn't return the document: the indexed term is hello itself, which the range excludes. If you use % as in "(s:(%hello))", the % is simply dropped and the result is a normal term query that matches the doc.
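The effect of the exclusive lower bound can be modelled in a few lines. This is only a sketch: in_open_lower_range is a hypothetical helper, and it uses Python's string comparison as a stand-in for Lucene's term ordering, which I am assuming agrees for plain ASCII terms like these.

```python
def in_open_lower_range(term: str, lower: str) -> bool:
    """Model of {lower TO *]: strictly greater than lower, no upper bound."""
    return term > lower

# What the default analyzer keeps from ">hello" (the ">" is discarded).
indexed_terms = ["hello"]

# Apply the range that the explain output reports for s:(>hello).
hits = [t for t in indexed_terms if in_open_lower_range(t, "hello")]
print(hits)
```

The only indexed term equals the lower bound, so the strict comparison excludes it; a term such as hellp would fall inside the range.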

As for the escaping, notice that

GET example/_validate/query?explain
{"query":{"query_string":{"query":"(s:(\\>hello))"}}}

=> "s:{hello TO *]"

but

GET example/_validate/query?explain
{"query":{"query_string":{"query":"(s:(\\\\>hello))"}}}

=> "s:hello"

so the first version does not escape the > character and thus creates a range query, whereas the latter results in a term query (the > gets dropped by the standard analyzer).

Hope this helps.


#5

Hi Christoph,

Thanks for investigating. To be clear, are you just expanding on what Lucene is doing under the covers, or are you saying that this is in fact correct behaviour? (In which case either the relevant docs are wrong or my reading comprehension is failing spectacularly.)


(Christoph) #6

Why \\> doesn't escape the reserved character surprises me too, given the docs. My comment was mostly about why s:(>hello) doesn't match the original doc.


(Jörg Prante) #7

These are equivalent

GET /test/_search?explain&q=s:\\>hello

and

POST /test/_search
{
    "explain" : true,
    "query" : {
        "query_string" : {
            "query" : "s:\\\\>hello"
        }
    }
}

You need to double-backslash everything in the second variant because of the JSON parser which processes the request body. So \\\\ becomes \\ which is then passed to the Lucene query string parser.

Anyway, the escaping does nothing useful for the search result: in both cases the token being searched for is hello.


#8

@jprante no, I don't think so. I'm not trying to pass a literal backslash to the query string parser, I'm trying to pass a literal >. So according to the escaping docs I should be passing \> to Lucene, meaning \\> in JSON.
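One way to avoid miscounting backslashes is to write the text intended for the parser once and let a JSON serializer add the JSON-level escaping. A minimal sketch, assuming the body is built in Python with the standard json module; lucene_query and body are illustrative names:

```python
import json

# The text the query string parser should see: a backslash-escaped ">".
lucene_query = "s:\\>hello"  # Python literal for  s:\>hello

# Let the serializer add the JSON-level escaping.
body = json.dumps({"query": {"query_string": {"query": lucene_query}}})
print(body)

# Round trip: decoding the body gives back the intended parser input.
round_tripped = json.loads(body)["query"]["query_string"]["query"]
```

The serialized body contains "s:\\>hello", i.e. the two-backslash JSON form, which decodes to a single backslash before the >.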

I understand that the escaped char is useless in this minimized example, but in the real system I'm looking at I don't control the raw strings and can't assume that the field being searched will always be tokenized in a way that ignores problematic characters (i.e. I can't just replace them with spaces).


(Jörg Prante) #9

Ok, I was wrong. There are several effects. This will pass \>hello using the JSON escaper

POST /test/_search
{
    "query" : {
        "query_string" : {
            "escape" : false,
            "query" : "\\>hello",
            "default_field" : "s"
        }
    }
}

This will also pass \>hello but uses Lucene's query string escaper.

POST /test/_search
{
    "query" : {
        "query_string" : {
            "escape" : true,
            "query" : ">hello",
            "default_field" : "s"
        }
    }
}

Setting escape: true enables Lucene's query string escaping, which is applied after the JSON has been parsed but before the value is submitted to search.

But even if you can pass >hello in multiple ways, it does not execute a search for this term.

First observation: if you index >hello with the ES default analyzer, you will index hello and searching for >hello will not return a hit.

Second observation: as @cbuescher wrote, ES does some smart preprocessing in query_string queries and treats a > symbol at position 0 as a signal to translate the query text into a range query, in org.apache.lucene.queryparser.classic.MapperQueryParser

Now let's try to make this work in spite of that. Let's use the keyword analyzer, which will put the word >hello into the index unchanged.

PUT /test/
{
    "mappings" : {
        "docs" : {
            "properties" : {
                "s" : {
                    "type" : "string",
                    "analyzer" : "keyword"
                }
            }
        }
    }
}

PUT /test/docs/1
{"s":">hello"}

POST /test/_search
{
    "query" : {
        "query_string" : {
            "query" : ">hello",
            "default_field" : "s"
        }
    }
}

gives no hits.

If you want hits, you can use simple_query_string instead, which does not try to interpret > at position 0.

POST /test/_search
{
    "query" : {
        "simple_query_string" : {
            "query" : ">hello",
            "fields" : ["s"]
        }
    }
}

gives a hit.

Maybe this can help to find a solution.


#10

Thanks for the detailed reply. There's a lot to experiment with there, but from initial testing (again, against 1.3.4) neither

"escape":false,
"query":"\\>hello",

nor

"escape":true,
"query":">hello",

returns hits, but

"escape":true,
"query":"\\>hello",

does.
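One possible reading of these three results, offered as a sketch rather than a confirmed explanation: if the escaper applied by escape: true does not treat > as a character to escape, then >hello passes through unchanged while \>hello has its backslash doubled. The character set below is what I believe Lucene's classic query parser escaper uses; treat it as an assumption and check it against the Lucene version bundled with your Elasticsearch release. lucene_escape_model is a hypothetical helper, not a real API.

```python
# ASSUMPTION: characters escaped by Lucene's classic query parser escaper.
# Note that "<" and ">" are absent from this assumed set.
ASSUMED_ESCAPED_CHARS = set('\\+-!():^[]"{}~*?|&/')

def lucene_escape_model(text: str) -> str:
    """Hypothetical model: prefix each assumed-special character with a backslash."""
    return "".join("\\" + ch if ch in ASSUMED_ESCAPED_CHARS else ch for ch in text)

# Parser input for the three cases above (values shown after JSON decoding).
cases = {
    "escape=false, \\>hello": "\\>hello",  # sent as-is
    "escape=true,  >hello": lucene_escape_model(">hello"),
    "escape=true,  \\>hello": lucene_escape_model("\\>hello"),
}
for label, parser_input in cases.items():
    print(f"{label:24} -> {parser_input}")
```

Under that assumption, only the third case reaches the parser with two backslashes before the >, which is the same shape as the four-backslash JSON form that matched in the first post.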

One other thing in your reply which confuses me:

"if you index >hello with the ES default analyzer, you will index hello and searching for >hello will not return a hit"

If you're talking about something more fundamental than Lucene's apparent divergence from its documented escaping rules... why not? Why does searching for %hello or $hello or ?hello return a hit but not \\>hello?

