Query String Query escaping spaces in V2.3

danielmitterdorfer · April 13, 2017, 8:06am

I'm following up here on ticket #24082, which you created on GitHub. Let's start with a minimal, self-contained example (tested on Elasticsearch 2.4.4):

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
    "city": "A B C Cleaning"
}

PUT /my_index/my_type/2
{
    "city": "Cleaning"
}

Now you run this query:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*"
        }
    }
}

and you would expect one result, "A B C Cleaning", but Elasticsearch returns no results at all. Let me quote the docs:

Wildcarded terms are not analyzed by default — they are lower-cased (lowercase_expanded_terms defaults to true) but no further analysis is done, mainly because it is impossible to accurately analyze a word that is missing some of its letters. However, by setting analyze_wildcard to true, an attempt will be made to analyze wildcarded words before searching the term list for matching terms.
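To make the quoted behaviour concrete, here is a rough Python simulation of why the first query finds nothing. It is only a sketch: analyzed_terms is a simplified stand-in for the real analyzer (it just splits on whitespace and lowercases), wildcard_hits is a hypothetical helper, and fnmatchcase stands in for Lucene's wildcard matching. The point it illustrates is that a wildcard pattern is compared against individual indexed terms, and only the pattern gets lowercased.

```python
from fnmatch import fnmatchcase


def analyzed_terms(text):
    # Simplified stand-in for an analyzer (assumption): split on whitespace, lowercase.
    return [token.lower() for token in text.split()]


def wildcard_hits(pattern, terms, lowercase_expanded_terms=True):
    # The pattern is matched against each indexed term on its own.
    # Only the pattern is lowercased; the indexed terms stay as they are.
    if lowercase_expanded_terms:
        pattern = pattern.lower()
    return [term for term in terms if fnmatchcase(term, pattern)]


doc = "A B C Cleaning"
pattern = "*A B C*"  # the query text once the escaped spaces are unescaped

city_terms = analyzed_terms(doc)  # analyzed field: one lowercased term per word
raw_terms = [doc]                 # not_analyzed field: the whole value as a single term

print(wildcard_hits(pattern, city_terms))  # [] because no single term contains "a b c"
print(wildcard_hits(pattern, raw_terms))   # [] because "*a b c*" is lowercase, the raw term is not
```

In this model, the analyzed field fails because no single term contains spaces, and the raw field fails only because of the case mismatch between the lowercased pattern and the stored value.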

So let's try setting analyze_wildcard to true:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*",
           "analyze_wildcard": true
        }
    }
}

Now, both documents are returned, but why? Let's use the explain parameter:

GET /my_index/_search?explain
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*",
           "analyze_wildcard": true
        }
    }
}

This reveals "description": "city:*a*, product of:". In other words, the query that actually ran against the city field was *a*. Both "A B C Cleaning" and "Cleaning" contain an "a", and that's why they both match. But what we actually want is something different: we want to match all documents that contain the words "A B C" in that order. This is called a phrase query. The syntax of a phrase query is just the search terms enclosed in double quotes. We need to escape the double quotes with backslashes because they appear inside a JSON string. Let's try this:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city"],
           "query": "\"A B C\""
        }
    }
}

This returns only "A B C Cleaning", as expected.
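If you build the request in code rather than by hand, the JSON-level escaping of the double quotes can be left to the serializer. The following Python sketch shows the idea; phrase_query_body is a hypothetical helper (not part of any Elasticsearch client), and it assumes that backslashes and double quotes inside the phrase itself should be backslash-escaped for the query_string syntax.

```python
import json


def phrase_query_body(field, phrase):
    # Hypothetical helper: wrap the phrase in double quotes for query_string.
    # Assumption: backslashes and double quotes inside the phrase are
    # escaped with a backslash so they are not read as query syntax.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "query": {
            "query_string": {
                "fields": [field],
                "query": '"' + escaped + '"',
            }
        }
    }


body = phrase_query_body("city", "A B C")

# json.dumps adds the JSON-level backslashes in front of the double quotes,
# so the serialized "query" value reads "\"A B C\"" as in the request above.
print(json.dumps(body, indent=4))
```

The serialized body can then be sent to /my_index/_search with whatever HTTP client you already use.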

A couple of thoughts:

Wildcards, and especially leading wildcards, are really bad for performance, and you should avoid them whenever you can. As we've seen, they are also not very intuitive, and there are often better alternatives.
You don't need to include the not_analyzed sub-field "city.raw", because you would only benefit from it for exact matches (i.e. when you want to search for exactly the term "A B C Cleaning").
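For the exact-match case mentioned in the second point, a term query against the not_analyzed sub-field is one option, since a term query compares its value against the indexed term without analyzing it. A minimal Python sketch that builds such a request body follows; exact_match_body is a hypothetical helper name used only for illustration.

```python
import json


def exact_match_body(field, value):
    # Hypothetical helper: a term query is not analyzed, so the value must
    # equal the indexed term exactly, including case and spaces.
    return {"query": {"term": {field: value}}}


body = exact_match_body("city.raw", "A B C Cleaning")
print(json.dumps(body, indent=4))
```

Against the example index, this body targets only documents whose city value is exactly "A B C Cleaning".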

I hope that clarifies the confusion and helps you to proceed.

Daniel
