Cannot search double quotes


(Hayk Hovhanisyan) #1

Hi folks,

Elasticsearch version 2.4
Java version 1.8

I get the following error when searching for a double quote ("):
java.lang.IllegalArgumentException: expected '"' at position 7.

Query body for the double-quote search:
{
  "from":0,
  "size":10,
  "query":{
    "bool":{
      "must":{
        "match_all":{}
      },
      "filter":[
        {
          "bool":{
            "filter":{
              "regexp":{
                "ppsn":{
                  "value":".*\"\"\".*",
                  "flags_value":65535
                }
              }
            }
          }
        },
        {
          "bool":{
            "must":{
              "term":{
                "deleted":"false"
              }
            }
          }
        }
      ]
    }
  }
}

(Ry Biesemeyer) #2

The double-quote character (") carries special meaning in the Lucene regexp engine. Outside a literal sequence, it starts one: everything up to the next " is treated as literal characters, not as a pattern expression. Inside a literal sequence, it ends that sequence (see the Lucene regexp docs).

Let's start with the expression in your JSON query:

".*\"\"\".*"

After being parsed into the string it represents, we get:

.*""".*
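
The JSON decoding step can be checked on its own. This is a small sketch in Python (not part of the original query or of Elasticsearch); the variable names are only for illustration:

```python
import json

# The "value" exactly as it appears in the request body. A raw string is used
# so that the backslashes below are the ones present in the JSON text.
json_text = r'".*\"\"\".*"'

# Decoding removes the JSON escaping and leaves the string that the
# regexp engine actually receives.
pattern = json.loads(json_text)

print(pattern)       # .*""".*
print(len(pattern))  # 7
```

The decoded string is 7 characters long, so "position 7" in the error message is the end of the pattern.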

This is given to the Lucene regexp engine, which parses it to mean:

  • .* any sequence, followed by
  • " a literal sequence (everything until the next ")
    • " literal sequence ends
  • " a literal sequence (everything until the next ")
    • . a literal dot
    • * a literal asterisk
    • UNEXPECTED END: no matching close ".
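
The walk-through above can be imitated with a tiny scanner. This is a simplified sketch in Python, not Lucene's actual parser: the function name scan and the piece labels are made up for illustration, and it only models double-quote and backslash handling, treating every other character as an opaque pattern character.

```python
def scan(pattern):
    """Split a pattern into (kind, text) pieces, modelling only quote handling."""
    pieces = []
    pos = 0
    while pos < len(pattern):
        ch = pattern[pos]
        if ch == '"':
            # A double quote opens a literal sequence that runs to the next one.
            end = pattern.find('"', pos + 1)
            if end == -1:
                # No closing quote before the end of the pattern.
                raise ValueError("expected '\"' at position %d" % len(pattern))
            pieces.append(("literal", pattern[pos + 1:end]))
            pos = end + 1
        elif ch == "\\" and pos + 1 < len(pattern):
            # A backslash outside a literal sequence escapes the next character.
            pieces.append(("escaped", pattern[pos + 1]))
            pos += 2
        else:
            # Anything else is left as an ordinary pattern character.
            pieces.append(("regex", ch))
            pos += 1
    return pieces


try:
    scan('.*""".*')
except ValueError as err:
    print(err)  # expected '"' at position 7
```

With two double quotes instead of three, scan('.*"".*') completes: the quotes open and close an empty literal sequence, so nothing is left unterminated.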

You may be able to escape the literal double-quote inside the literal sequence by prefixing it with a backslash (e.g., .*"\"".*, which itself would get escaped again when converted to JSON, becoming ".*\"\\\"\".*"), but escaping double-quotes inside a double-quoted sequence isn't clearly documented, so that may or may not work:

" <Unicode string without double-quotes> " (a string)
-- Lucene Automaton RegExp

My guess is that you're taking arbitrary input and simply concatenating (.*" + input + ".*). You may be able to avoid double-quoted sequences entirely by concatenating (.* + quote(input) + .*), where quote is some function that escapes all characters with special meaning by prefixing them with a backslash (\).
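
A minimal sketch of such a quote function, written in Python for brevity (the same idea ports directly to Java). The helper names quote, contains_pattern and regexp_filter are hypothetical, and the reserved-character set is an assumption taken from the Elasticsearch regexp syntax docs, including the optional operators that "flags_value":65535 turns on. It has not been tested against Elasticsearch 2.4:

```python
import json

# Assumed reserved characters: the standard operators plus the optional
# ones (# @ & < > ~) enabled when all regexp flags are switched on.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def quote(text):
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def contains_pattern(user_input):
    """Build a 'contains' pattern without using double-quoted sequences."""
    return ".*" + quote(user_input) + ".*"


def regexp_filter(field, user_input):
    """Build the regexp clause as a dict; json.dumps adds the JSON escaping."""
    return {
        "regexp": {
            field: {
                "value": contains_pattern(user_input),
                "flags_value": 65535,
            }
        }
    }


# Searching for a single double quote in the ppsn field:
print(contains_pattern('"'))
print(json.dumps(regexp_filter("ppsn", '"')))
```

Building the clause as a data structure and letting the JSON serializer handle the second layer of escaping avoids counting backslashes by hand.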


(Hayk Hovhanisyan) #3

Hi @yaauie,
thanks for the nice and thorough explanation.

regards

