Query String Query escaping spaces in V2.3

Hi there,

I am trying to match search values that contain spaces, such as "a b c". I found that in ES v5 you can unset the splitOnWhitespace flag, but I am using 2.3.5.

According to this document: Query String Query | Elasticsearch Guide [2.3] | Elastic

Watch this space
A space may also be a reserved character. For instance, if you have a synonym list which converts "wi fi" to "wifi", a query_string search for "wi fi" would fail. The query string parser would interpret your query as a search for "wi OR fi", while the token stored in your index is actually "wifi". Escaping the space will protect it from being touched by the query string parser: "wi\ fi".

So I should be able to escape those spaces with "\ ", which stops the query string parser from treating them as OR.

So I tried to use this search:
{
  "query": {
    "query_string": {
      "query": "*A\\ b\\ c*",
      "fields": ["Reporting_Name", "Source_Document_Ref", "client.Client_Number"]
    }
  }
}

This did not give me any results. Note that all of these fields are not_analyzed.

Help please?

Thanks

Hi @jasw,

I'm following up here on ticket #24082, which you created on GitHub. Let's start with a minimal, self-contained example (tested on Elasticsearch 2.4.4):

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
    "city": "A B C Cleaning"
}

PUT /my_index/my_type/2
{
    "city": "Cleaning"
}

Now you run this query:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*"
        }
    }
}

and you would expect 1 result: "A B C Cleaning" but Elasticsearch returns no result at all. Let me quote the docs:

Wildcarded terms are not analyzed by default — they are lower-cased (lowercase_expanded_terms defaults to true) but no further analysis is done, mainly because it is impossible to accurately analyze a word that is missing some of its letters. However, by setting analyze_wildcard to true, an attempt will be made to analyze wildcarded words before searching the term list for matching terms.
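
To make the likely cause of the empty result concrete, here is a rough Python model of that behaviour. It is not Elasticsearch code: standard_like_tokens is a crude stand-in for the standard analyzer, fnmatchcase stands in for Lucene's wildcard matching, and both helper names are made up for this sketch.

```python
from fnmatch import fnmatchcase

def standard_like_tokens(text):
    # Crude stand-in for the standard analyzer: split on whitespace, lowercase.
    return [tok.lower() for tok in text.split()]

def wildcard_hits(pattern, terms, lowercase_expanded_terms=True):
    # Wildcard terms are lowercased by default but not analyzed any further.
    if lowercase_expanded_terms:
        pattern = pattern.lower()
    # Case-sensitive match of the pattern against each indexed term.
    return [t for t in terms if fnmatchcase(t, pattern)]

doc = "A B C Cleaning"
analyzed_terms = standard_like_tokens(doc)  # roughly what "city" holds
raw_terms = [doc]                           # what "city.raw" holds (not_analyzed)

pattern = "*A B C*"  # the single term left once the parser unescapes "\ "

# No single analyzed token contains spaces, so nothing matches.
print(wildcard_hits(pattern, analyzed_terms))  # []
# The pattern is lowercased but the stored term is not, so nothing matches.
print(wildcard_hits(pattern, raw_terms))       # []
# In this model, keeping the original case makes the raw term match.
print(wildcard_hits(pattern, raw_terms, lowercase_expanded_terms=False))
```

In this model the analyzed field fails because the text was split into separate tokens, and the not_analyzed field fails because only the pattern was lowercased.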

So let's try to set analyze_wildcard to true:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*",
           "analyze_wildcard": true
        }
    }
}

Now, both documents are returned, but why? Let's use the explain parameter:

GET /my_index/_search?explain
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*",
           "analyze_wildcard": true
        }
    }
}

This reveals "description": "city:*a*, product of:". Both "A B C Cleaning" and "Cleaning" contain an "a", and that's why they both match. But what we actually want is something different: we want to match all documents that contain "A B C". This is called a phrase query. The syntax of a phrase query is just the search terms enclosed in double quotes. Because the query itself sits inside a JSON string, those double quotes have to be escaped. Let's try this:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city"],
           "query": "\"A B C\""
        }
    }
}

This returns only "A B C Cleaning", as expected.
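
If you build the request body in code rather than by hand, a JSON serializer adds the backslashes around the phrase quotes for you. A minimal Python sketch using only the standard library (the helper name phrase_query_body is made up for illustration):

```python
import json

def phrase_query_body(phrase, fields):
    # Backslash-escape characters that would end or distort the quoted phrase
    # in query_string syntax, then wrap the text in double quotes.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "query": {
            "query_string": {
                "fields": fields,
                "query": '"' + escaped + '"',
            }
        }
    }

body = phrase_query_body("A B C", ["city"])

# json.dumps escapes the double quotes for the JSON layer.
print(json.dumps(body, indent=4))
```

The printed "query" value comes out as "\"A B C\"", the same string as in the hand-written request above.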

A couple of thoughts:

  • Wildcards, and especially leading wildcards, are costly from a performance perspective and you should avoid them whenever you can. As we've seen, they are also not very intuitive, and there are often better alternatives.
  • You don't need to include the not_analyzed sub-field "city.raw" because you'd only benefit from it for exact matches (i.e. when you want to search for exactly the term "A B C Cleaning").
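
The difference between the two fields in the second point can be illustrated with a toy Python model. It is only a sketch: analyzed_tokens is a simplified stand-in for the standard analyzer, and all three function names are made up for illustration.

```python
def analyzed_tokens(text):
    # Simplified stand-in for the standard analyzer: whitespace split + lowercase.
    return [t.lower() for t in text.split()]

def phrase_match(phrase, text):
    # A phrase matches when its tokens appear consecutively, in order.
    want, have = analyzed_tokens(phrase), analyzed_tokens(text)
    n = len(want)
    return any(have[i:i + n] == want for i in range(len(have) - n + 1))

def exact_match(value, text):
    # A not_analyzed field stores the whole value as one term.
    return value == text

docs = ["A B C Cleaning", "Cleaning"]

# Phrase search on the analyzed field finds the first document only.
print([d for d in docs if phrase_match("A B C", d)])
# Exact search on the raw field finds nothing for a partial value...
print([d for d in docs if exact_match("A B C", d)])
# ...and only matches when the complete value is given.
print([d for d in docs if exact_match("A B C Cleaning", d)])
```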

I hope that clarifies the confusion and helps you to proceed.

Daniel
