Query String Query escaping spaces in V2.3

Hi @jasw,

I'm following up here on ticket #24082, which you created on GitHub. Let's start with a minimal, self-contained example (tested on Elasticsearch 2.4.4):

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
    "city": "A B C Cleaning"
}

PUT /my_index/my_type/2
{
    "city": "Cleaning"
}

Now you run this query:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*"
        }
    }
}

You would expect exactly one result, "A B C Cleaning", but Elasticsearch returns no results at all. Let me quote the docs:

Wildcarded terms are not analyzed by default — they are lower-cased (lowercase_expanded_terms defaults to true) but no further analysis is done, mainly because it is impossible to accurately analyze a word that is missing some of its letters. However, by setting analyze_wildcard to true, an attempt will be made to analyze wildcarded words before searching the term list for matching terms.
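To see which terms a wildcard is actually compared against, the _analyze API can be run against the field. This is a minimal sketch, assuming the index from above and that "city" uses the default standard analyzer (the mapping does not set one):

```console
GET /my_index/_analyze
{
    "field": "city",
    "text": "A B C Cleaning"
}
```

This should list the separate, lowercased terms a, b, c and cleaning. As far as I can tell, that explains the empty result: the escaped spaces make "*a b c*" a single wildcard term, and no term in "city" contains a space, while "city.raw" holds the original-case value "A B C Cleaning", which the lowercased pattern cannot match either.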

So let's try to set analyze_wildcard to true:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*",
           "analyze_wildcard": true
        }
    }
}

Now, both documents are returned, but why? Let's use the explain parameter:

GET /my_index/_search?explain
{
    "query": {
        "query_string": {
           "fields": ["city", "city.raw"],
           "query": "*A\\ B\\ C*",
           "analyze_wildcard": true
        }
    }
}

This reveals "description": "city:*a*, product of:". Both "A B C Cleaning" and "Cleaning" contain an "a", which is why both match. What we actually want is something different: to match all documents that contain "A B C". This is called a phrase query, and its syntax is simply the search terms enclosed in double quotes. Because the query sits inside a JSON string, the double quotes have to be escaped with a backslash. Let's try this:

GET /my_index/_search
{
    "query": {
        "query_string": {
           "fields": ["city"],
           "query": "\"A B C\""
        }
    }
}

This returns only "A B C Cleaning", as expected.
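If you don't need the query_string syntax, the match_phrase query expresses the same phrase search without any quote escaping. This is an untested sketch, assuming the same index and documents as above:

```console
GET /my_index/_search
{
    "query": {
        "match_phrase": {
            "city": "A B C"
        }
    }
}
```

The phrase is analyzed with the field's analyzer, so it is matched against the same lowercased terms that were indexed for "city".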

A couple of thoughts:

  • Wildcards, and especially leading wildcards, are expensive because Elasticsearch may have to examine a large share of the terms in the index to find matches, so you should avoid them whenever you can. They are also not very intuitive, as we've seen, and there are often better alternatives.
  • You don't need to include the not_analyzed sub-field "city.raw" here, because you'd only benefit from it for exact matches (i.e. when you want to search for exactly the term "A B C Cleaning").
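For the exact-match case mentioned in the last point, a term query against the not_analyzed sub-field is one option. This is a minimal sketch, assuming the index and documents from above:

```console
GET /my_index/_search
{
    "query": {
        "term": {
            "city.raw": "A B C Cleaning"
        }
    }
}
```

The value has to equal the stored term exactly, including case and spaces, because neither the sub-field nor the term query analyzes it.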

I hope that clarifies the confusion and helps you to proceed.

Daniel