i noticed a weird search behaviour with the french analyzer (could probably also affect other languages).
i have boiled down our example to a minimal PoC; the original index & query were much bigger. our query is used in a type-ahead search. in our real index the field is of type "search_as_you_type", but the problem can also be reproduced with the simpler "text" type, so i'm using that for the PoC to reduce the complexity.
PUT /test1
{
  "mappings": {
    "properties": {
      "name_fr": {
        "type": "text",
        "analyzer": "french"
      }
    }
  }
}
POST /test1/_doc/CH
{
  "name_fr": "Suisse"
}
# test also with "Sui" and "Suiss" - both will work
GET /test1/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase_prefix": {
            "name_fr": {
              "query": "Suis"
            }
          }
        }
      ]
    }
  }
}
when searching with "Sui" it finds the entry, as well as when searching with "Suiss" - but with "Suis" it doesn't find it! this doesn't seem to make any sense to me since it can find it with one letter less and one letter more.
funny enough the french analyzer even tokenises "Suisse" as "suis":
POST /_analyze
{
  "analyzer": "french",
  "text": "Suisse"
}
note: when using the "standard" analyzer this works fine and it also finds the entry using "Suis". to test, just re-create the index above with that analyzer and run the example again.
am i misunderstanding something or is this a bug (=> in that case i'll report it on github as a bug)?
thanks a lot! that indeed is the problem! sorry, i didn't think about analyzing the query string, just the stored data
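for readers following along, the query-side analysis can be checked with the same _analyze call as above, only with the query text instead of the stored value. this is a minimal sketch and assumes that "suis" (a form of "être") is on the french stop word list; if so, the returned token list should be empty, which would leave the query with nothing to match:

```console
POST /_analyze
{
  "analyzer": "french",
  "text": "Suis"
}
```

running the same request with "Sui" and "Suiss" should show why those two inputs behave differently from "Suis".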
my analysis went in the wrong direction - i was focused on the length and wondered if it had anything to do with shingle sizes (because i wasn't using a normal text field).
would it make sense to set "search_analyzer": "standard" on the field rather than building up an analyser without stop words? that seems to do the trick but i'm unsure if it'd have any unexpected side-effects.
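as an illustration of that idea, the mapping could look roughly like the sketch below. the index name "test2" is made up for this example; "search_analyzer" is the standard mapping parameter for using a different analyzer at query time than at index time:

```console
PUT /test2
{
  "mappings": {
    "properties": {
      "name_fr": {
        "type": "text",
        "analyzer": "french",
        "search_analyzer": "standard"
      }
    }
  }
}
```

one possible side effect worth testing: with "standard" at search time the query terms are no longer stemmed, while the indexed terms still are. if "Suisse" is indexed as "suis" (as the _analyze output above suggests), a query for the full word "Suisse" would be searched as "suisse" and might no longer match.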
the use-case is a search for countries where we have the name stored in various languages (e.g. EN, DE, FR, IT, ES) and the user can search for it in any of the languages (though we do boost the match in their UI language).
i want to use the language-specific analyzers since in the real query we also do fuzzy search for the name in each language (we do both, the non-fuzzy match gets a boost compared to the fuzzy match since some countries have names which are similar).
is this what you had in mind? it seems to work for us now.
i'm wondering if we should also get rid of stemming since for country names it might not be the most important thing? however i haven't noticed any weird behaviour with it so far. and in turn i just noticed that ASCII folding isn't enabled (oddly enough even if i explicitly add it it still doesn't do it... 'ê' isn't turned into 'e').
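on the ASCII folding point: as far as i know the built-in language analyzers only accept a few parameters (such as "stopwords" and "stem_exclusion") and not an extra filter list, which might explain why adding it has no effect. a custom analyzer can include the "asciifolding" token filter explicitly. the following is only a sketch: the index name "test3" and the analyzer name "french_no_stop" are made up, and the filter chain is modelled on the documented rebuild of the french analyzer with the stop filter left out, so it may differ from what a given elasticsearch version ships:

```console
PUT /test3
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c", "jusqu", "quoiqu", "lorsqu", "puisqu"]
        },
        "french_stemmer": {
          "type": "stemmer",
          "language": "light_french"
        }
      },
      "analyzer": {
        "french_no_stop": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["french_elision", "lowercase", "asciifolding", "french_stemmer"]
        }
      }
    }
  }
}

POST /test3/_analyze
{
  "analyzer": "french_no_stop",
  "text": "Suisse forêt"
}
```

the _analyze call against the index is a quick way to check whether 'ê' is actually folded to 'e' and whether "Suis" survives as a token.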
from your experience: is this setup good for such a search or would you go about it another way? sorry, we're rather new to elasticsearch and still trying to find our way (planning on doing the course, but the budget hasn't been approved yet).
from an implementation point of view it doesn't really scale well with adding new languages: each language lives in a dedicated field, so we have to touch both the index and the query every time we add one. that seems to be necessary due to the language-specific analyzers, though. we'll also run into problems with other languages for which elasticsearch has no analyzer available out of the box (though there seem to be 3rd party plugins like rosette).
PS: we can split this away into a new thread if you want since the original question is hopefully solved by now.
may i ask why? building the full analyzer from the ground up means that with every upgrade we'll have to check what the current defaults for french are and then align our custom-built version again with that. with my approach we say "i'm happy with the french analyzer, but i want to change this specific setting (stopwords)" and thus we automatically get any update to the french analyzer applied as well.
i'm generally in favour of doing less custom things rather than doing more custom things since custom things are more effort (= more expensive) to maintain.
i wasn't aware that overwriting an existing analyzer was considered bad practice if one wanted to change the default behaviour of that analyzer in an index.
i did this partially due to laziness (didn't have to change the field definitions :)), but partially also to ensure that people don't have to remember to use a special one (since in this index we'd then never want to use the language analyzer with the stopwords). though of course the latter part can also be covered with code reviews.
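for context, the kind of override discussed here could look roughly like the sketch below. it is a guess at the setup, not the exact settings used: the index name "test4" is made up, and it assumes that an index-level analyzer named "french" takes precedence over the built-in one of the same name. "stopwords": "_none_" is the documented way to give a language analyzer an empty stop word list:

```console
PUT /test4
{
  "settings": {
    "analysis": {
      "analyzer": {
        "french": {
          "type": "french",
          "stopwords": "_none_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name_fr": {
        "type": "text",
        "analyzer": "french"
      }
    }
  }
}

POST /test4/_analyze
{
  "analyzer": "french",
  "text": "Suis"
}
```

if the override is in effect, the _analyze call against the index should return a token for "Suis" instead of an empty list. giving the analyzer its own name (e.g. "french_no_stop") would achieve the same thing without shadowing the built-in one, at the cost of changing the field definitions.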
if/when we get the budget we have to see whether we do an on-site training (there's one scheduled here - but let's see if we'll get a full lockdown due to the coronavirus which would cancel the course as well :() or if we do an online training. english is fine, local language as well (though i find it weird to discuss IT topics in languages other than english :)).
i'll create a new thread later on for the more general discussion (i need to organise the information a bit so that it's concise and clear).