Understanding reserved character escaping


(Scott) #1

I'm hoping someone can help me understand when it's necessary to escape query terms and query_strings as defined here.

Let's say I have an index with one field that's of type keyword:

curl -s -XPUT 'localhost:9200/my_test_index' -d '
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings" : {
    "my_doc" : {
      "dynamic" : "false",
      "include_in_all" : false,
      "date_detection" : false,
      "numeric_detection" : false,
      "properties" : {
        "reference_id" : {
          "type" : "keyword",
          "include_in_all" : true
        }
      }
    }
  }
}'

Add some documents: one containing "-" characters and another containing "<" and ">":

curl -XPOST 'localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "my_test_index", "_type" : "my_doc", "_id" : "1" } }
{ "reference_id" : "1"}
{ "index" : { "_index" : "my_test_index", "_type" : "my_doc", "_id" : "2" } }
{ "reference_id" : "abc-def-1234567"}
{ "index" : { "_index" : "my_test_index", "_type" : "my_doc", "_id" : "3" } }
{ "reference_id" : "abc<def>1234567"}

'

This term query, wrapped in a bool must clause, returns 0 results when the "-" is escaped:

curl -s -XGET localhost:9200/my_test_index/_search?pretty -d ' {"query": {"bool": { "must": [{ "term": { "reference_id": "abc\\-def\\-1234567" } }] }}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

However, the same query works when the "-" is not escaped:

curl -s -XGET localhost:9200/my_test_index/_search?pretty -d ' {"query": {"bool": { "must": [{ "term": { "reference_id": "abc-def-1234567" } }] }}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9808292,
    "hits" : [
      {
        "_index" : "my_test_index",
        "_type" : "my_doc",
        "_id" : "2",
        "_score" : 0.9808292,
        "_source" : {
          "reference_id" : "abc-def-1234567"
        }
      }
    ]
  }
}

I would have expected that the "-" needs to be escaped. To clarify my understanding, is it safe to assume that term queries DO NOT need escaping?

Query string queries do not act exactly as I would expect either. If I escape the query string for an exact result, I get multiple results.

curl -s -XGET localhost:9200/my_test_index/_search?pretty -d '
{
  "query": {
      "query_string" : {
        "query" : "abc\\<def\\>1234567"
      }
  }
}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0911642,
    "hits" : [
      {
        "_index" : "my_test_index",
        "_type" : "my_doc",
        "_id" : "2",
        "_score" : 1.0911642,
        "_source" : {
          "reference_id" : "abc-def-1234567"
        }
      },
      {
        "_index" : "my_test_index",
        "_type" : "my_doc",
        "_id" : "3",
        "_score" : 1.0911642,
        "_source" : {
          "reference_id" : "abc<def>1234567"
        }
      }
    ]
  }
}

I'm assuming that my query is getting tokenized and hitting the _all field as it's treated as a standard text field?

And finally, if I specify the field in the query_string, escaping doesn't seem to matter:

curl -s -XGET localhost:9200/my_test_index/_search?pretty -d '
{
  "query": {
      "query_string" : {
        "fields" : ["reference_id"],
        "query" : "abc\\<def\\>1234567"
      }
  }
}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9808292,
    "hits" : [
      {
        "_index" : "my_test_index",
        "_type" : "my_doc",
        "_id" : "3",
        "_score" : 0.9808292,
        "_source" : {
          "reference_id" : "abc<def>1234567"
        }
      }
    ]
  }
}

And this:

curl -s -XGET localhost:9200/my_test_index/_search?pretty -d '
{
  "query": {
      "query_string" : {
        "fields" : ["reference_id"],
        "query" : "abc<def>1234567"
      }
  }
}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9808292,
    "hits" : [
      {
        "_index" : "my_test_index",
        "_type" : "my_doc",
        "_id" : "3",
        "_score" : 0.9808292,
        "_source" : {
          "reference_id" : "abc<def>1234567"
        }
      }
    ]
  }
}

Any help understanding would be greatly appreciated. Thanks!


(Clinton Gormley) #2

Hi @scotttam

Term queries (and match queries) do not need escaping. Query string queries (and simple query string queries) have a query syntax, and so reserved characters need escaping. However, < and > embedded in other strings are not special characters.
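To make the distinction concrete, here is a minimal Python sketch of building both kinds of request body. The helper names (`escape_query_string`, `term_body`, `query_string_body`) and the `RESERVED` set are assumptions for illustration, not part of any Elasticsearch client. The set is modeled on the characters that Lucene's classic query parser escapes, plus `=`. `<` and `>` are deliberately left alone here, since embedded ones are not special.

```python
import json

# Assumed set of characters to backslash-escape in query_string input:
# the ones Lucene's classic query parser treats as syntax, plus "=".
# "<" and ">" are intentionally not included (see the note above).
RESERVED = set('\\+-!():^[]"{}~*?|&/=')


def escape_query_string(value):
    """Prefix each reserved character with a backslash (hypothetical helper)."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in value)


def term_body(value, field="reference_id"):
    """Term queries have no query syntax, so the value is passed through as-is."""
    return {"query": {"term": {field: value}}}


def query_string_body(value, field="reference_id"):
    """query_string has a syntax, so the value is escaped before being embedded."""
    return {
        "query": {
            "query_string": {
                "fields": [field],
                "query": escape_query_string(value),
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(term_body("abc-def-1234567")))
    print(json.dumps(query_string_body("abc-def-1234567")))
```

Note that the JSON serializer doubles each backslash on its own, which is why the hand-written curl bodies need `\\-` to send a single `\-` to the query parser.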

Elasticsearch adds some syntax to the standard Lucene query string syntax, e.g. the ability to write field:>5 (where field is greater than 5). There is a bug in this parsing that prevents a leading < or > from being escaped properly (see https://github.com/elastic/elasticsearch/issues/21703).

Also, your assumption about the query string query querying the _all field by default (and so using the analyzer associated with the _all field) is correct.
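To see why the query without a field matched both documents 2 and 3, it can help to approximate what a standard-style analyzer does to these values. The sketch below is plain Python, not Elasticsearch code: `rough_standard_tokens` is a hypothetical helper that splits on anything that is not a letter, digit, or underscore and lowercases the result. That is only a crude stand-in for the real tokenizer, but it is close enough for these particular strings.

```python
import re


def rough_standard_tokens(text):
    """Crude approximation of a standard-style analyzer (an assumption,
    not the real implementation): split on non-word characters, lowercase."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


doc1 = rough_standard_tokens("1")
doc2 = rough_standard_tokens("abc-def-1234567")
doc3 = rough_standard_tokens("abc<def>1234567")
query = rough_standard_tokens("abc<def>1234567")

print(doc2)   # ['abc', 'def', '1234567']
print(doc3)   # ['abc', 'def', '1234567']

# Documents 2 and 3 produce the same tokens as the query, document 1 shares none.
print(set(query) & set(doc2), set(query) & set(doc1))
```

Under this approximation, the analyzed field cannot tell "-" apart from "<" and ">", so both documents match. The keyword field reference_id keeps the whole value as a single term, which is why naming the field returns only document 3.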

