Query_string breaks search term on space even when keyword tokenizer is used


(Tomekit) #1

I am using keyword tokenizer for one of my name fields.

 "analyzer": {            
            "lowercase_keyword_analyzer": {
              "filter": [
                "lowercase"
              ],
              "tokenizer": "keyword"
            }
          }       


... 
    "name": {
            "type": "text",
            "analyzer":"lowercase_keyword_analyzer"        
          }
...

Unfortunately, when I run a query_string search, my search phrase gets broken into multiple terms even though I've explicitly specified my lowercase_keyword_analyzer.

Query:

{"query": {"query_string" : {"default_field": "name", "query": "some-long (specific),phrase with spaces and wild*", "analyzer": "lowercase_keyword_analyzer", "quote_analyzer": "lowercase_keyword_analyzer"}}}

Explain:

"explanation":"+(name:some-long name:specific name:,phrase name:with name:spaces name:and name:wild*) #(#_type:substance)"}

Debug analyze:

{
  "analyzer": "lowercase_keyword_analyzer",
  "text": "some-long (specific),phrase with spaces and wild*"
}

{"token":"some-long (specific),phrase with spaces and wild*","start_offset":0,"end_offset":49,"type":"word","position":0}

Why does query_string ignore the analyzer?

I've found this old topic from 2011 where someone asks about a similar problem, with no real answer: Query string not working with keyword tokenizer


(Abdon Pijpelink) #2

Regardless of whether you use a keyword tokenizer, the query_string query breaks up a query string whenever it encounters an operator.

In your case, this happens for the parentheses ( and ), which are reserved characters. You need to escape them.

The same is true for the * wildcard: * is considered an operator, so the query parser will break on the space before it.

As a result, your query becomes a bool query for:

  • name:some-long
  • name:specific
  • name:,phrase with spaces and
  • name:wild*

You can see this by using the Validate API. It can show you the Lucene query that your query gets rewritten into:

GET test/_validate/query?rewrite=true
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "some-long (specific),phrase with spaces and wild*",
      "analyzer": "lowercase_keyword_analyzer",
      "quote_analyzer": "lowercase_keyword_analyzer"
    }
  }
}

Now, how to resolve this? You can escape the operators and whitespace to prevent the query parser from breaking up the query string on them:

GET test/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "some-long\\ \\(specific\\),phrase\\ with\\ spaces\\ and\\ wild*",
      "analyzer": "lowercase_keyword_analyzer",
      "quote_analyzer": "lowercase_keyword_analyzer"
    }
  }
}
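If the escaping gets tedious, it can be done client-side before the query is sent. Here is a minimal Python sketch (the helper is my own, not an Elasticsearch API): it backslash-escapes the reserved characters and spaces, but deliberately leaves * alone so it still acts as a wildcard, and leaves - alone since a hyphen inside a word is not treated as an operator:

```python
# Characters the query_string parser treats as operators, plus space.
# '*' is intentionally omitted so trailing wildcards keep working;
# '-' is omitted because it only acts as an operator at the start of a term.
RESERVED = set('+=&|><!(){}[]^"~?:\\/ ')

def escape_query_string(phrase: str) -> str:
    """Backslash-escape reserved characters so the whole phrase
    reaches the keyword analyzer as a single term."""
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in phrase)

print(escape_query_string("some-long (specific),phrase with spaces and wild*"))
# → some-long\ \(specific\),phrase\ with\ spaces\ and\ wild*
```

The printed string is what you would place in the "query" field (doubled backslashes when embedded in JSON).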

But a better approach may be not to use the query_string query in combination with an analyzed field and wildcards; things can get pretty complicated really fast. It may be easier to use a wildcard query on a keyword field, with a normalizer applied to lowercase the token:

PUT test
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_keyword_normalizer": {
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "keyword",
          "normalizer": "lowercase_keyword_normalizer"
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "name": "some-long (specific),phrase with spaces and wildcard"
}

GET test/_search
{
  "query": {
    "wildcard": {
      "name": "some-long (specific),phrase with spaces and wild*"
    }
  }
}

(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.