Hi, I am wondering why the following query does not hit. Here is the reproducer:
// put index
PUT /test
{
"mappings" : {
"properties" : {
"title": {
"type": "text",
"analyzer": "german"
}
}
}
}
// put test doc
POST /test/_doc
{
"title": "Foober Baren"
}
GET /_analyze
{
"analyzer": "german",
"text": "Foober Baren"
}
// Tokens are "foob" and "bar" as expected
GET /test/_search
{
"query": {
"query_string": {
"default_field": "title",
"analyze_wildcard": true,
"query": "*oober"
}
}
}
If I change the query string to *oob, it does hit. I would have expected the wildcard text to be analyzed as well. If I check how it would be analyzed:
GET /_analyze
{
"analyzer": "german",
"text": "oober"
}
// yields "oob" as token as expected
So *oober analyzed should be *oob and also hit. Did I misunderstand analyze_wildcard?
Searching for foobe* does hit, which would suggest that foobe* is analyzed to foob* and thus hits the foob token of the document.
It seems that left wildcards are only lowercased and then matched, while right wildcards are lowercased and analyzed, but that would be weird, inconsistent behaviour between the two. Can anyone confirm, deny, or explain this observation?
It would be awesome if anyone could explain why left wildcards are analyzed differently from right wildcards, and whether there is a reason for this or whether it should rather be reported as unexpected behavior / a bug.
The Elasticsearch query_string query passes the text to the analyzer for processing. However, terms with wildcards are not passed to the analyzer (leading or not). This explains the difference and the issues you are seeing - and the analyzer plays a part in transforming the token to its stem form in the OP's example.
This is not quite correct. This is what the analyze_wildcard parameter is for, as described in the Elastic documentation (to which you should also link, instead of promoting your own company with links).
During indexing, the analyzer outputs foob to the index. There is no foober.
During search, *oober is received, but the analyzer has nothing to do with it. No stemming algorithm can be executed here, as the whole logic depends on the full word structure, which is missing due to the wildcard.
The analyzer is skipped, the query is rewritten as a boolean query with a leading wildcard query for this term, and no match is found.
I think it might be a bug.
A prefix query (asterisk at the end) uses the analyzer when analyze_wildcard is set to true.
A wildcard query (asterisk not at the end) only uses the analyzer to normalize (regardless of analyze_wildcard).
While it is questionable whether analyzers can be expected to do the right thing with partial words, you could argue that they would generally work better on wildcard queries than on prefix queries - stemming being an example of typical analyzer activity that has rules for word endings. Prefix queries don't contain the ends of words. That's why the current logic is confusing - prefix queries are analyzed but wildcards aren't.
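To make the two code paths described above concrete, here is a minimal Python sketch. It is not Elasticsearch or Lucene code: `toy_analyze` is a hypothetical stand-in for the german analyzer (lowercasing plus stripping one suffix from a made-up list), and `rewrite_prefix`, `rewrite_wildcard`, and `hits` are illustrative names. Treating a prefix term as "normalize only" when `analyze_wildcard` is off is an assumption.

```python
import fnmatch

# Toy suffix list: an assumption for illustration, not Lucene's German stemmer.
SUFFIXES = ("ern", "er", "en", "es", "e", "n", "s")


def normalize(text):
    """Normalization step only: lowercase the text."""
    return text.lower()


def toy_analyze(word):
    """Lowercase, then strip at most one suffix while keeping 3+ characters."""
    word = normalize(word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def rewrite_prefix(term, analyze_wildcard):
    """Trailing-asterisk term such as 'foobe*': analyzed only when opted in."""
    head = term[:-1]
    head = toy_analyze(head) if analyze_wildcard else normalize(head)
    return head + "*"


def rewrite_wildcard(term):
    """Any other wildcard term such as '*oober': normalized, never stemmed."""
    return normalize(term)


def hits(pattern, tokens):
    """Return the indexed tokens that match the wildcard pattern."""
    return [t for t in tokens if fnmatch.fnmatchcase(t, pattern)]


# Index-time tokens for the test document from the reproducer.
indexed = [toy_analyze(w) for w in "Foober Baren".split()]

for raw, rewritten in [
    ("foobe*", rewrite_prefix("foobe*", analyze_wildcard=True)),
    ("*oober", rewrite_wildcard("*oober")),
    ("*oob", rewrite_wildcard("*oob")),
]:
    print(raw, "->", rewritten, "hits:", hits(rewritten, indexed))
```

With these toy rules, foobe* is rewritten to foob* and matches the foob token, while *oober is left unchanged and matches nothing, which mirrors the behaviour reported in this thread.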
Good point, Mark! Interestingly enough, it's been like that for 15 years, so I guess I got used to it.
I'd expect most useful analyzers (stemmers mostly) to act stupid here. You can't stem a partial word. And I won't be surprised if in some languages normalization would also be incorrect - as an example I can think of the words "it" (stop word) and IT (acronym) and I'm sure there's plenty more.
Thank you for digging up all the parts relevant here, much appreciated.
I'll try to rephrase/summarize in my own words to make sure I have understood:
if the wildcard is at the end, it is called a prefix query (that sounds kind of the wrong way around?)
if the wildcard is at the start, it is called wildcard query
Beyond the naming, if analyze_wildcard is active, the analysis step applied differs between prefix and wildcard queries.
A prefix query will use 'the analyzer' (does this mean the same analyzer that is applied to the field?) on the query before comparing with the tokens. This is why foobe* is reduced to foob* (which is surprising, since how does the stemmer know?), and thus the query matches the token.
A wildcard query is only normalized (lowercased?) but no stemming is applied. Thus '*oober' stays '*oober' and will not match any existing token (foob).
While I understand why stemming the 'right-side query' is complicated, it is still implemented. If so, why is it hard to apply stemming to a left-side wildcard? In both cases the word could be cut off anywhere, which makes it hard to map to a stemmable word.
That said, even though it is questionable whether stemming is possible at all or should be done (that's why analyze_wildcard is opt-in, I guess), shouldn't it be applied the same way to prefix and wildcard queries?
Analysing bits of words is always going to be a questionable practice, which is why users have to opt in by setting analyze_wildcard to true.
With this caveat in mind, it appears that even when it is set, the behaviour differs.
Prefix queries have an asterisk at the end and are called that because the user has supplied only the start, or prefix, of a word. They will use an analyzer associated with the field (or a custom one with the query) if the analyze_wildcard parameter is set.
Wildcard queries can have asterisks or question marks anywhere in the text, not just at the start, e.g. “ac?om*dation”. Despite the name, wildcard queries do not use a tokenizer when analyze_wildcard is set to true. That is the bug, which could require a fix to the code, the DSL parameter naming, or just the docs.
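The terminology above can be sketched as a small classifier. This is only an illustration of the naming used in this thread, not the actual query parser: `classify` is a hypothetical helper, and its handling of edge cases (such as a lone asterisk) is an assumption.

```python
def classify(term):
    """Label a query_string term by where its wildcard characters sit."""
    body, last = term[:-1], term[-1:]
    has_inner = "*" in body or "?" in body
    if last == "*" and not has_inner:
        # Only the start of the word was supplied: a prefix query.
        return "prefix"
    if has_inner or last in ("*", "?"):
        # Wildcards anywhere else, or a trailing '?': a wildcard query.
        return "wildcard"
    return "term"


for example in ["foobe*", "*oober", "*oob", "ac?om*dation", "foober"]:
    print(example, "->", classify(example))
```

Under this reading, foobe* is the only prefix query among the examples from this thread; *oober, *oob, and ac?om*dation are all wildcard queries, so they take the normalize-only path.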
Thank you for clarifying, also with the terminology of prefix-query vs wildcard-query.
I understand that analyzing partial words is a job that cannot succeed properly, but handling the two cases differently still does not seem right to me. So I would argue that either the prefix query should be downgraded to what the wildcard query does, or the other way around; going different routes seems at least debatable.
Should I open a GH issue for discussion on how to proceed and link this thread as an information source?
In any case, thank you for sharing all the insight!
Yes, please. A link to this thread would certainly help for background. Please close the loop and post a link to the issue in this thread so readers can follow. Thanks!