Hi guys,
I have a general-purpose search field with this analyzer:
"analysis": {
"analyzer": {
"custom_brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"brazilian_stop",
"light_portuguese_stemmer"
]
}
},
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": [
"_brazilian_"
]
},
"light_portuguese_stemmer": {
"type": "stemmer",
"language": "light_portuguese"
}
}
}
So with the standard tokenizer and those filters, any search I make is tokenized like this:
search: "bota de trabalho"
tokens: "bota", "trabalho".
That is OK. But in Brazilian Portuguese there are compound terms made up of other words, for example:
"meia calça"
I don't want this term to produce the two tokens "meia" and "calça". I want it to stay as a single token, something like "meia calça" or "meia-calça".
The problem is, I don't want to change the tokenizer; it works great. There are just a few specific terms that I want to keep whole.
The best solution I have found so far is a mapping char filter that replaces "meia calça" with "meia_calça" before tokenization, so the tokenizer does not break it into two tokens.
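Roughly like this, added to the same analysis settings (compound_words is just what I'd call the char filter):

"char_filter": {
  "compound_words": {
    "type": "mapping",
    "mappings": [
      "meia calça => meia_calça"
    ]
  }
},
"analyzer": {
  "custom_brazilian": {
    "char_filter": ["compound_words"],
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "asciifolding",
      "brazilian_stop",
      "light_portuguese_stemmer"
    ]
  }
}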
But this does not feel good enough: because the char filter runs before the lowercase token filter, I would have to build a dictionary with every case variation of each term, like "Meia calça", "mEia calça"...
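So the mappings list would have to grow with every casing variant of every compound term, something like:

"mappings": [
  "meia calça => meia_calça",
  "Meia calça => meia_calça",
  "MEIA CALÇA => meia_calça"
]

and so on for each term I want to keep whole.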
I also want to avoid regex because of performance concerns.
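Just to be clear, the regex alternative I'm trying to avoid would be something like a pattern_replace char filter (compound_words_regex is just a placeholder name):

"char_filter": {
  "compound_words_regex": {
    "type": "pattern_replace",
    "pattern": "(?iu)meia\\s+calça",
    "replacement": "meia_calça"
  }
}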
My question is: is there a better solution to this problem?
Thanks a lot!