Add a flag called `analyze_wildcard` to both `query_string` and `field` queries; once set, a best effort will be made to analyze wildcard and prefix queries as well.
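For illustration, something along these lines (just a sketch: the `body` field name is made up, and the exact name/placement of the parameter is of course open to discussion):

{"query_string" : {"default_field" : "body", "query" : "*phone*", "analyze_wildcard" : true}}

{"field" : {"body" : {"query" : "*phone*", "analyze_wildcard" : true}}}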
More details:
When we use an analyzer that stems terms into tokens, and later want to search against those analyzed terms using a wildcard, by default the search terms are not analyzed, since that analysis could produce several tokens and the search engine would not know which one to use:
http://www.jguru.com/faq/view.jsp?EID=538312
However, in certain circumstances, when that risk can be accepted in exchange for better expected results, it would be nice to be able to tell the search engine to analyze the wildcard terms before executing the search, allowing for a (presumably) more precise search.
Here is an example with the Spanish analyzer (which uses the snowball stemmer):
- We index the phrase "I have an iPhone"
- We index the phrase "I love the triad iPad/iPhone/iPod"
- We index the phrase "I found the perfect combination: iPhone/MP3"
If we use the current standard `query_string`, a search for "*phone*" will only return the last phrase, due to the way in which the terms have been analyzed:
"I have an iPhone":
{"tokens":[{"token":"i","start_offset":0,"end_offset":1,"type":"<ALPHANUM>","position":1},{"token":"hav","start_offset":2,"end_offset":6,"type":"<ALPHANUM>","position":2},{"token":"an","start_offset":7,"end_offset":9,"type":"<ALPHANUM>","position":3},{"token":"iphon","start_offset":10,"end_offset":16,"type":"<ALPHANUM>","position":4}]}
"I love the triad iPad/iPhone/iPod":
{"tokens":[{"token":"i","start_offset":0,"end_offset":1,"type":"<ALPHANUM>","position":1},{"token":"lov","start_offset":2,"end_offset":6,"type":"<ALPHANUM>","position":2},{"token":"the","start_offset":7,"end_offset":10,"type":"<ALPHANUM>","position":3},{"token":"tri","start_offset":11,"end_offset":16,"type":"<ALPHANUM>","position":4},{"token":"ipad","start_offset":17,"end_offset":21,"type":"<ALPHANUM>","position":5},{"token":"iphon","start_offset":22,"end_offset":28,"type":"<ALPHANUM>","position":6},{"token":"ipod","start_offset":29,"end_offset":33,"type":"<ALPHANUM>","position":7}]}
"I found the perfect combination: iPhone/MP3":
{"tokens":[{"token":"i","start_offset":0,"end_offset":1,"type":"<ALPHANUM>","position":1},{"token":"found","start_offset":2,"end_offset":7,"type":"<ALPHANUM>","position":2},{"token":"the","start_offset":8,"end_offset":11,"type":"<ALPHANUM>","position":3},{"token":"perfect","start_offset":12,"end_offset":19,"type":"<ALPHANUM>","position":4},{"token":"combination","start_offset":20,"end_offset":31,"type":"<ALPHANUM>","position":5},{"token":"iphone/mp3","start_offset":33,"end_offset":43,"type":"<NUM>","position":6}]}
See how the last phrase keeps "iPhone/MP3" as a single token, "iphone/mp3"? Hence it is the only one matching a `query_string` of "*phone*" (and similar 'unexpected' results occur when using just a leading or a trailing wildcard as well).
This result would be disappointing for the user, as she'd expect at least something like "iPhone" or even "telephone" to be returned, but since the Spanish analyzer removes the trailing 'e' from most such words, those documents won't be found.
So the enhancement would be to provide a mechanism, for instance in the form of a parameter to `query_string`, that tells the ES query parser to analyze those search terms surrounded by wildcards (i.e. either enclosed completely, or with just a leading or trailing wildcard).
Following our previous example, a `query_string` of "*phone*" would actually be analyzed by the Spanish analyzer into "*phon*", therefore matching all of the phrases indexed above, which would be the expected and reasonable behaviour from a user's perspective. Of course, it could have side effects on other searches, but as a parameter it would be up to the search designer to use it or not.
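A sketch of how such a search could look with the flag enabled (again reusing the hypothetical `test` index and `body` field from above; `analyze_wildcard` is the parameter being proposed here, not an existing one):

# hypothetical index/field names; "analyze_wildcard" is the proposed flag
curl -XGET 'localhost:9200/test/_search' -d '{"query":{"query_string":{"default_field":"body","query":"*phone*","analyze_wildcard":true}}}'

With the Spanish analyzer, "*phone*" would be rewritten to "*phon*", and all three documents above should then match.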