Analyzer: Problem when generating tokens

Hi guys,

I have a general-purpose search field with this analyzer:

"analysis": {
  "analyzer": {
    "custom_brazilian": {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "asciifolding",
        "brazilian_stop",
        "light_portuguese_stemmer"
      ]
    }
  },
  "filter": {
    "brazilian_stop": {
      "type": "stop",
      "stopwords": [
        "_brazilian_"
      ]
    },
    "light_portuguese_stemmer": {
      "type": "stemmer",
      "language": "light_portuguese"
    }
  }
}
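
For context, the field is wired to that analyzer in the index mapping, roughly like this (the field name "search_field" is just a placeholder, not my real field):

"mappings": {
  "properties": {
    "search_field": {
      "type": "text",
      "analyzer": "custom_brazilian"
    }
  }
}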

With the standard tokenizer and those filters, any search I make is tokenized like this:

search: "bota de trabalho"
tokens: "bota", "trabalho".

That is OK. But in Brazilian Portuguese there are some compound terms made up of other words, for example:
"meia calça"

I don't want this term to generate the two tokens "meia" and "calça". I want it to stay as a single token, something like "meia calça" or "meia-calça".

The problem is that I don't want to change the tokenizer; it works great in general, and there are only a few specific words that I want to keep intact.

The best solution I have found so far is to use a mapping char filter that replaces "meia calça" with "meia_calça", so the tokenizer does not break it into two tokens.
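
Something like this rough sketch (the char filter name "compound_words" is hypothetical, and my real mapping list would be much longer):

"char_filter": {
  "compound_words": {
    "type": "mapping",
    "mappings": [
      "meia calça => meia_calça"
    ]
  }
},
"analyzer": {
  "custom_brazilian": {
    "tokenizer": "standard",
    "char_filter": [
      "compound_words"
    ],
    "filter": [
      "lowercase",
      "asciifolding",
      "brazilian_stop",
      "light_portuguese_stemmer"
    ]
  }
}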

But this does not feel good enough: since the char filter runs before the lowercase token filter, I would have to build a dictionary with every case variation of each word, like "Meia calça", "mEia calça"...

I also want to avoid a regex-based approach (such as a pattern_replace char filter) because of performance concerns.

Is there a better solution to this problem?

Thanks a lot!
