Stopwords are not working in custom tokenizer

tusharl · April 1, 2021, 1:23pm

Consider the example below:

    GET _analyze
    {
      "tokenizer": {
        "pattern" : ",",
        "type" : "pattern"
      },
      "filter": [
        {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        {
          "type" : "stop",
          "stopwords" : [ "ab" ]
        }
      ],
      "text": "ab bcd,cde"
    }

With the above configuration, I expect the following tokens:
["bc", "bcd", "cd", "cde"]

but I am getting the following tokens instead:
["ab", "ab b", "ab bc", "ab bcd", "cd", "cde"]

What is wrong with my edge_ngram configuration?
How do I get the first set of tokens when using edge_ngram with a custom tokenizer?

Christian_Dahlqvist · April 1, 2021, 2:13pm

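Your tokenizer splits on "," only, giving the tokens "ab bcd" and "cde". These are then broken down into edge n-grams, which gives the result you see. The stop filter compares whole tokens against the stopword list, so it does not remove "ab" from inside a longer token. To get the results you want, you probably need to use a different tokenizer.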
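
To make the order of operations concrete, here is a rough Python model of the chain described above. It is not Elasticsearch code: the helper names (pattern_tokenize, edge_ngrams, stop) are hypothetical, and the behaviour is a simplified assumption about how a pattern tokenizer, an edge_ngram filter, and a stop filter act in sequence.

```python
import re

def pattern_tokenize(text, pattern=","):
    # Simplified model of a pattern tokenizer: split on the regex, drop empty pieces.
    return [t for t in re.split(pattern, text) if t]

def edge_ngrams(tokens, min_gram=2, max_gram=10):
    # Emit the prefixes of each token, from min_gram up to max_gram characters.
    out = []
    for tok in tokens:
        for n in range(min_gram, min(max_gram, len(tok)) + 1):
            out.append(tok[:n])
    return out

def stop(tokens, stopwords):
    # Whole-token comparison: a token is dropped only if it equals a stopword.
    return [t for t in tokens if t not in stopwords]

tokens = pattern_tokenize("ab bcd,cde")   # comma is the only separator
grams = edge_ngrams(tokens)               # prefixes of "ab bcd" keep the space
result = stop(grams, {"ab"})              # only an exact "ab" token is removed

print(tokens)  # ['ab bcd', 'cde']
print(result)  # ['ab ', 'ab b', 'ab bc', 'ab bcd', 'cd', 'cde']
```

In this model the stop filter runs last and only ever sees prefixes of "ab bcd", so every surviving n-gram still starts with "ab". The first surviving n-gram is "ab " with a trailing space, which may be what shows up as "ab" in the reported list.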

tusharl · April 1, 2021, 2:23pm

Yes, you are right: I get two tokens, "ab bcd" and "cde". But I also have a stop filter, which I expected to remove "ab" from "ab bcd" before more tokens are created by edge_ngram. If I use the standard tokenizer instead of the pattern tokenizer, everything works as expected (the stopword is removed), but the standard tokenizer is not what I want here.
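
The difference between the two tokenizers can be sketched in Python (a simplified model, not Elasticsearch code; all helper names are hypothetical). Variant A splits on commas or whitespace and applies the stop filter before the edge n-grams, which roughly corresponds to a pattern tokenizer with a pattern such as [,\s]+ and the stop filter listed ahead of edge_ngram. Variant B keeps the comma-only split and strips the stopword from inside each token first, which is roughly what a pattern_replace token filter placed ahead of edge_ngram might do. Both mappings to Elasticsearch settings are assumptions and have not been tested against a cluster.

```python
import re

def edge_ngrams(tokens, min_gram=2, max_gram=10):
    # Emit the prefixes of each token, from min_gram up to max_gram characters.
    out = []
    for tok in tokens:
        for n in range(min_gram, min(max_gram, len(tok)) + 1):
            out.append(tok[:n])
    return out

def stop(tokens, stopwords):
    # Whole-token comparison, as in the stop filter model.
    return [t for t in tokens if t not in stopwords]

def strip_stopwords(tokens, stopwords):
    # Remove stopwords that appear as whole words inside a longer token.
    alternatives = "|".join(re.escape(w) for w in sorted(stopwords))
    pattern = r"\b(?:" + alternatives + r")\b\s*"
    cleaned = (re.sub(pattern, "", t).strip() for t in tokens)
    return [t for t in cleaned if t]

STOPWORDS = {"ab"}
TEXT = "ab bcd,cde"

# Variant A: split on commas or whitespace, stop first, then edge n-grams.
words = [t for t in re.split(r"[,\s]+", TEXT) if t]
variant_a = edge_ngrams(stop(words, STOPWORDS))

# Variant B: comma-only split, strip stopwords inside tokens, then edge n-grams.
phrases = [t for t in TEXT.split(",") if t]
variant_b = edge_ngrams(strip_stopwords(phrases, STOPWORDS))

print(variant_a)  # ['bc', 'bcd', 'cd', 'cde']
print(variant_b)  # ['bc', 'bcd', 'cd', 'cde']
```

Variant A changes tokenization for every multi-word phrase, while Variant B leaves phrases intact apart from the removed stopwords, so the two only agree on inputs like this one.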

system · April 29, 2021, 2:24pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
