Hi guys,
I have a general-purpose search field with this analyzer:
"analysis": {
"analyzer": {
"custom_brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"brazilian_stop",
"light_portuguese_stemmer"
]
}
},
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": [
"_brazilian_"
]
},
"light_portuguese_stemmer": {
"type": "stemmer",
"language": "light_portuguese"
}
}
}
So with the standard tokenizer and those filters, any search I make is tokenized like this:
search: "bota de trabalho"
tokens: "bota", "trabalho".
That is OK. But in Brazilian Portuguese there are compound terms made up of other words, for example:
"meia calça"
I don't want this term to produce the two tokens "meia" and "calça". I want it to stay as a single token, something like "meia calça" or "meia-calça".
The problem is, I don't want to change the tokenizer; it works great. There are just a few specific terms that I want to keep whole.
The best solution I have found so far is a mapping char filter that replaces "meia calça" with "meia_calça" before tokenization, so the tokenizer does not break it into two tokens.
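Roughly like this, added to the same analysis settings (compound_words is just what I'd call the char filter):

"char_filter": {
  "compound_words": {
    "type": "mapping",
    "mappings": [
      "meia calça => meia_calça"
    ]
  }
},
"analyzer": {
  "custom_brazilian": {
    "char_filter": ["compound_words"],
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "asciifolding",
      "brazilian_stop",
      "light_portuguese_stemmer"
    ]
  }
}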
But this does not feel good enough: because the char filter runs before the lowercase token filter, I would have to build a dictionary with every case variation of each term, like "Meia calça", "mEia calça"...
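So the mappings list would have to grow with every casing variant of every compound term, something like:

"mappings": [
  "meia calça => meia_calça",
  "Meia calça => meia_calça",
  "MEIA CALÇA => meia_calça"
]

and so on for each term I want to keep whole.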
I also want to avoid regex because of performance concerns.
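Just to be clear, the regex alternative I'm trying to avoid would be something like a pattern_replace char filter (compound_words_regex is just a placeholder name):

"char_filter": {
  "compound_words_regex": {
    "type": "pattern_replace",
    "pattern": "(?iu)meia\\s+calça",
    "replacement": "meia_calça"
  }
}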
My question is: is there a better solution to this problem?
Thanks a lot!