Combining ngram tokenizer with stopwords

Hello everyone

I'm trying to create a custom tokenizer with ngram.

GET /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit",
      "punctuation",
      "symbol",
      "whitespace"
    ]
  },
  "filter": [
    {
      "type": "stop",
      "ignore_case": true,
      "stopwords": [ "HELP" ]
    },
    {
      "type": "ngram",
      "min_gram": 3,
      "max_gram": 3
    }
  ],
  "text": "HELP TEST VALUE"
}

Of course it doesn't filter "HELP" out, because the ngram tokenizer has already split the text into trigrams before the stop filter runs, so the filter never sees the whole word. If I use "type": "standard" for the tokenizer instead, the stop filter works, but the standard tokenizer splits on whitespace first, so the ngram filter only produces ngrams within each word, which is not the result we want.

Can someone help with this?

Thanks


Hi @Dimitri_Gamkrelidze

According to the documentation, this behavior is correct: the tokenizer always runs before the token filters.
What result do you expect? You could use a tokenizer other than ngram to generate the tokens and apply ngram only in the filter step.
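For example, here is a minimal sketch of such a custom analyzer defined in the index settings (the index name and the my_* names are just placeholders):

PUT /my-index
// "my-index", "my_stop", "my_ngram" and "stop_then_ngram" are placeholder names
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "HELP" ]
        },
        "my_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "stop_then_ngram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "my_stop", "my_ngram" ]
        }
      }
    }
  }
}

Keep in mind that with a word-based tokenizer the ngrams will not cross word boundaries, which may or may not match what you need.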