Hi @jasw,
I'm following up here on ticket #24082, which you created on GitHub. Let's start with a minimal, self-contained example (tested on Elasticsearch 2.4.4):
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
PUT /my_index/my_type/1
{
  "city": "A B C Cleaning"
}
PUT /my_index/my_type/2
{
  "city": "Cleaning"
}
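Before running any queries, it can help to check which terms the analyzed city field actually indexes for the first document. A minimal sketch using the _analyze API with the field's own analyzer (assuming the index and mapping above, and that no custom analyzer is configured, so the default standard analyzer applies):

```json
GET /my_index/_analyze
{
  "field": "city",
  "text": "A B C Cleaning"
}
```

With the standard analyzer this should list the lower-cased terms a, b, c and cleaning. These individual terms, not the original string, are what a query against city is matched against.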
Now you run this query:
GET /my_index/_search
{
  "query": {
    "query_string": {
      "fields": ["city", "city.raw"],
      "query": "*A\\ B\\ C*"
    }
  }
}
You would expect one result, "A B C Cleaning", but Elasticsearch returns no results at all. Let me quote the docs:
Wildcarded terms are not analyzed by default — they are lower-cased (lowercase_expanded_terms defaults to true) but no further analysis is done, mainly because it is impossible to accurately analyze a word that is missing some of its letters. However, by setting analyze_wildcard to true, an attempt will be made to analyze wildcarded words before searching the term list for matching terms.
So let's try setting analyze_wildcard to true:
GET /my_index/_search
{
  "query": {
    "query_string": {
      "fields": ["city", "city.raw"],
      "query": "*A\\ B\\ C*",
      "analyze_wildcard": true
    }
  }
}
Now both documents are returned, but why? Let's use the explain parameter:
GET /my_index/_search?explain
{
  "query": {
    "query_string": {
      "fields": ["city", "city.raw"],
      "query": "*A\\ B\\ C*",
      "analyze_wildcard": true
    }
  }
}
This reveals "description": "city:*a*, product of:". Both "A B C Cleaning" and "Cleaning" contain an "a", and that is why they both match.
What we actually want is something different: we want to match all documents that contain "A B C". This is called a phrase query. The syntax of a phrase query is simply the search terms enclosed in double quotes. We need to escape the double quotes because they sit inside a JSON string. Let's try this:
GET /my_index/_search
{
  "query": {
    "query_string": {
      "fields": ["city"],
      "query": "\"A B C\""
    }
  }
}
This returns only "A B C Cleaning", as expected.
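If query-string syntax is not needed, the same phrase search could probably also be expressed with a match_phrase query, which takes the phrase as a plain value, so no quotes need escaping. A sketch against the same index (an assumption on my part, not part of the test run on 2.4.4 described above):

```json
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "city": "A B C"
    }
  }
}
```

Like the quoted form, this analyzes the text and looks for the resulting terms next to each other in the given order.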
A couple of thoughts:
- Wildcards, and especially leading wildcards, are expensive to execute, so you should avoid them whenever you can. As we've seen, they are also not very intuitive, and there are often better alternatives.
- You don't need to include the not-analyzed sub-field "city.raw", because you'd only benefit from it for exact matches (i.e. when you want to search for exactly the term "A B C Cleaning").
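For completeness, an exact match against the not-analyzed sub-field could look like the sketch below. It assumes the mapping from the example above; because city.raw is not analyzed, the value has to match the stored string exactly, including case.

```json
GET /my_index/_search
{
  "query": {
    "term": {
      "city.raw": "A B C Cleaning"
    }
  }
}
```

A value such as "a b c cleaning" or "A B C" would not be expected to match here, since the whole field value is indexed as a single term.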
I hope that clears up the confusion and helps you proceed.
Daniel