I'm hoping someone can help me understand when it's necessary to escape reserved characters in term queries and query_string queries, as defined here.
Let's say I have an index with one field that's of type keyword:
curl -s -XPUT 'localhost:9200/my_test_index' -d '
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings" : {
    "my_doc" : {
      "dynamic" : "false",
      "include_in_all" : false,
      "date_detection" : false,
      "numeric_detection" : false,
      "properties" : {
        "reference_id" : {
          "type" : "keyword",
          "include_in_all" : true
        }
      }
    }
  }
}'
Add some documents: one containing "-" and another containing "<" and ">":
curl -XPOST 'localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "my_test_index", "_type" : "my_doc", "_id" : "1" } }
{ "reference_id" : "1"}
{ "index" : { "_index" : "my_test_index", "_type" : "my_doc", "_id" : "2" } }
{ "reference_id" : "abc-def-1234567"}
{ "index" : { "_index" : "my_test_index", "_type" : "my_doc", "_id" : "3" } }
{ "reference_id" : "abc<def>1234567"}
'
This term query inside a bool must clause returns 0 results when the "-" is escaped:
curl -s -XGET localhost:9200/my_test_index/_search?pretty -d ' {"query": {"bool": { "must": [{ "term": { "reference_id": "abc\\-def\\-1234567" } }] }}}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
However, the same query works without the escaping:
curl -s -XGET localhost:9200/my_test_index/_search?pretty -d ' {"query": {"bool": { "must": [{ "term": { "reference_id": "abc-def-1234567" } }] }}}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.9808292,
"hits" : [
{
"_index" : "my_test_index",
"_type" : "my_doc",
"_id" : "2",
"_score" : 0.9808292,
"_source" : {
"reference_id" : "abc-def-1234567"
}
}
]
}
}
I would have expected the '-' to need escaping. To clarify my understanding, is it safe to assume that values in term queries DO NOT need to be escaped?
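To make that assumption concrete, here is a minimal sketch of how I would build the request body if term queries really do take the value verbatim, which is what the two results above suggest. The helper name term_query_body is hypothetical; the only escaping it applies is JSON string escaping (backslash and double quote), with no query-syntax escaping at all. It does not handle control characters.

```shell
# Hypothetical helper: build a term query body for the reference_id field.
# Assumption: a term query takes the value as-is, so only JSON string
# escaping (backslash and double quote) is applied here.
term_query_body() {
  value=$(printf '%s' "$1" | sed -e 's#\\#\\\\#g' -e 's#"#\\"#g')
  printf '{"query":{"term":{"reference_id":"%s"}}}' "$value"
}

# Usage against the index above (not run here):
# curl -s -XGET 'localhost:9200/my_test_index/_search?pretty' \
#   -d "$(term_query_body 'abc-def-1234567')"
```

With this, "abc-def-1234567" is sent unchanged, matching the query that returned document 2.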
Query string queries do not behave exactly as I would expect either. If I escape the query string to get an exact match, I get multiple results:
curl -s -XGET localhost:9200/my_test_index/_search?pretty -d '
{
  "query": {
    "query_string" : {
      "query" : "abc\\<def\\>1234567"
    }
  }
}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0911642,
"hits" : [
{
"_index" : "my_test_index",
"_type" : "my_doc",
"_id" : "2",
"_score" : 1.0911642,
"_source" : {
"reference_id" : "abc-def-1234567"
}
},
{
"_index" : "my_test_index",
"_type" : "my_doc",
"_id" : "3",
"_score" : 1.0911642,
"_source" : {
"reference_id" : "abc<def>1234567"
}
}
]
}
}
I'm assuming that my query is being tokenized and run against the _all field, since _all is treated as a standard text field?
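For anyone reproducing this, here is a rough sketch of how the query_string value could be escaped before it goes into the request. It assumes the reserved-character list from the query string syntax documentation (+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /) and, as I read that documentation, that < and > cannot be escaped at all and have to be removed from the query string. The function name escape_query_string is hypothetical.

```shell
# Hypothetical helper: escape query_string reserved characters.
# Assumptions: "<" and ">" cannot be escaped and are dropped; every other
# reserved character is prefixed with a single backslash.
escape_query_string() {
  printf '%s' "$1" \
    | sed -e 's#[<>]##g' \
          -e 's#[][\\+=&|!(){}^"~*?:/-]#\\&#g'
}

# Example call:
escape_query_string 'abc-def-1234567'
```

The output is the escaped query text only. Each backslash still has to be doubled when the value is embedded in a JSON string, as in the "abc\\-def\\-1234567" request earlier.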
And finally, if I specify the field in the query_string, escaping doesn't seem to matter:
curl -s -XGET localhost:9200/my_test_index/_search?pretty -d '
{
  "query": {
    "query_string" : {
      "fields" : ["reference_id"],
      "query" : "abc\\<def\\>1234567"
    }
  }
}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.9808292,
"hits" : [
{
"_index" : "my_test_index",
"_type" : "my_doc",
"_id" : "3",
"_score" : 0.9808292,
"_source" : {
"reference_id" : "abc<def>1234567"
}
}
]
}
}
And the same query without escaping:
curl -s -XGET localhost:9200/my_test_index/_search?pretty -d '
{
  "query": {
    "query_string" : {
      "fields" : ["reference_id"],
      "query" : "abc<def>1234567"
    }
  }
}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.9808292,
"hits" : [
{
"_index" : "my_test_index",
"_type" : "my_doc",
"_id" : "3",
"_score" : 0.9808292,
"_source" : {
"reference_id" : "abc<def>1234567"
}
}
]
}
}
Any help understanding would be greatly appreciated. Thanks!