Using word_delimiter with edgeNGram ignores word_delimiter tokens


(Emil) #1

I have a custom analyzer, defined below, but I don't understand how to achieve my goal.

My goal is to have a whitespace-separated inverted index, but also an autocomplete feature once the user enters at least 3 characters. For that I thought to combine the word_delimiter and edgeNGram token filters as below:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "whitespace",
            "filter": [
              "standard",
              "lowercase",
              "my_word_delimiter",
              "my_edge_ngram_analyzer"
            ],
            "type": "custom"
          }
        },
        "filter": {
          "my_word_delimiter": {
            "catenate_all": true,
            "type": "word_delimiter"
          },
          "my_edge_ngram_analyzer": {
            "min_gram": 3,
            "max_gram": 10,
            "type": "edgeNGram"
          }
        }
      }
    }
  }
}

This gives the result below for "Brother TN-200". But I was expecting "tn" to also be in the inverted index, since I have the word_delimiter filter. Why is it not in the inverted index? How can I achieve this?

curl -XGET "localhost:9200/myIndex/_analyze?analyzer=my_analyzer&pretty=true" -d "Brother TN-200"
{
  "tokens" : [ {
    "token" : "bro",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "brot",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "broth",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "brothe",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "brother",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "tn2",
    "start_offset" : 22,
    "end_offset" : 28,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "tn20",
    "start_offset" : 22,
    "end_offset" : 28,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "tn200",
    "start_offset" : 22,
    "end_offset" : 28,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "200",
    "start_offset" : 25,
    "end_offset" : 28,
    "type" : "word",
    "position" : 4
  } ]
}

(Adrien Grand) #2

I suspect that the word_delimiter filter returns "tn", but then it gets removed by the edge ngram filter since it has fewer than 3 characters.
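A minimal Python sketch (not Elasticsearch code, just an illustration of the mechanics) of why this happens: an edge n-gram filter with min_gram=3 emits no grams at all for a 2-character token like "tn", so the token vanishes from the chain even though word_delimiter produced it:

```python
def word_delimiter(token):
    # Simplified word_delimiter with catenate_all: split on the hyphen
    # and also emit the concatenated form, e.g. "tn-200" -> ["tn", "200", "tn200"].
    parts = token.split("-")
    out = list(parts)
    if len(parts) > 1:
        out.append("".join(parts))
    return out

def edge_ngram(token, min_gram=3, max_gram=10):
    # Emits prefixes of length min_gram..max_gram.
    # A token shorter than min_gram yields NOTHING -- this is why "tn" disappears.
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

tokens = []
for t in "brother tn-200".split():       # whitespace tokenizer (input already lowercased)
    for part in word_delimiter(t):       # word_delimiter step emits "tn", "200", "tn200"
        tokens.extend(edge_ngram(part))  # edge ngram step swallows "tn"

print(tokens)
# "tn" is gone; every other token from the _analyze output above survives
```

Running this reproduces the token set from the _analyze call: "brother" and "tn200" expand into their prefixes, "200" passes through, and "tn" is silently dropped by the edge ngram step.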


(Emil) #3

So you mean that if I use word_delimiter with edgeNGram, word_delimiter has no use; it will be overridden by edgeNGram. This is really annoying, because I don't want to set min_gram to 2: "tn" can be found inside other words, and those would appear in the list as well.
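One common workaround (a sketch, not from this thread; the field name, type name, and "my_delimiter_analyzer" are illustrative) is to index the text twice via a multi-field: the main field uses the same analyzer chain minus the edge ngram filter, so short tokens like "tn" survive, while a sub-field applies the edge-ngram analyzer for autocomplete. Queries can then target both fields:

```json
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_delimiter_analyzer",
          "fields": {
            "autocomplete": {
              "type": "string",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
```

A match query against "name" then finds exact short tokens such as "tn", while "name.autocomplete" serves prefix suggestions, so neither requirement forces a compromise on min_gram.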


(system) #4