Stopwords are not working with a custom tokenizer

Consider the example below:

    GET _analyze
    {
      "tokenizer": {
        "pattern" : ",",
        "type" : "pattern"
      },
      "filter": [
        { "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        {
          "type" : "stop",
          "stopwords" : [ "ab" ]
        }
      ],
      "text": "ab bcd,cde"
    }

With the above configuration, I would expect the following tokens:
["bc", "bcd", "cd", "cde"]

but instead I am getting this list of tokens:
["ab", "ab b", "ab bc", "ab bcd", "cd", "cde"]

What is wrong with the edge_ngram configuration?
How do I get the first set of tokens when using edge_ngram with a custom tokenizer?

Your tokenizer splits on "," only, giving the two tokens "ab bcd" and "cde". Each of these is then broken down into edge n-grams, which gives the result you see. Note that the stop filter only removes tokens that exactly match "ab"; n-grams such as "ab b" that merely start with "ab" pass through untouched (the "ab" you still see is most likely the gram "ab " with a trailing space). To get the results you want, you probably need to use a different tokenizer.
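
For example, a pattern tokenizer that splits on both commas and whitespace, with the stop filter running before edge_ngram, should give the tokens you expect. This is just a sketch — the ",|\s+" pattern and the filter order are my suggestion, not something from your original request:

    GET _analyze
    {
      "tokenizer": {
        "type": "pattern",
        "pattern": ",|\\s+"
      },
      "filter": [
        {
          "type": "stop",
          "stopwords": [ "ab" ]
        },
        {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      ],
      "text": "ab bcd,cde"
    }

With whitespace also treated as a separator, "ab" becomes a token of its own, the stop filter can drop it, and edge_ngram then only sees "bcd" and "cde", producing "bc", "bcd", "cd" and "cde".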

Yes, you are right: I will get the two tokens "ab bcd" and "cde". But at the same time I have a stopword filter, which I expected to remove "ab" from "ab bcd", with edge_ngram then creating the remaining tokens. If I use the standard tokenizer instead of the pattern tokenizer then things work perfectly fine (meaning the stopword is removed), but the standard tokenizer is not desired here.
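
For reference, this is roughly what my working standard-tokenizer request looks like (a sketch, keeping the same filter order as my original request):

    GET _analyze
    {
      "tokenizer": "standard",
      "filter": [
        {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        {
          "type": "stop",
          "stopwords": [ "ab" ]
        }
      ],
      "text": "ab bcd,cde"
    }

The standard tokenizer splits on whitespace and punctuation, so "ab", "bcd" and "cde" become separate tokens; the only 2-to-10 gram of "ab" is "ab" itself, which the stop filter then removes, leaving "bc", "bcd", "cd" and "cde".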
