Hi there,
I'm using the edge_ngram tokenizer to support search autocomplete. Unlike other tokenizers, the edge_ngram tokenizer is relatively limited in its configurability. Take, for example, the string "AT&T". Even with "symbol" included in the edge_ngram tokenizer's token_chars
field, that string is tokenized as follows:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "at",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "t",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    }
  ]
}
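For reference, my setup looks roughly like this (the index and analyzer names are arbitrary, min_gram/max_gram are at their defaults of 1 and 2, and a lowercase filter accounts for the lowercased grams):

PUT autocomplete_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 2,
          "token_chars": ["letter", "digit", "symbol"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_edge",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

POST autocomplete_demo/_analyze
{
  "analyzer": "autocomplete",
  "text": "AT&T"
}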
Tokenizing the ampersand away yields three very high-frequency n-grams: two unigrams ("a" and "t") and one bigram ("at"). The problem is that any search for this string is then likely to return a lot of false positives. It seems to me that there should be a token whitelist filter that lets you preserve select terms/phrases (i.e. exceptions to tokenization). I've noticed there is a filter that prevents stemming -- the keyword_marker token filter -- but I've found no equivalent for allowing exceptions to tokenization rules.
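To illustrate the kind of escape hatch I mean: keyword_marker protects listed terms from stemmers later in the chain (the term "running" and the analyzer name here are just for illustration):

{
  "settings": {
    "analysis": {
      "filter": {
        "protect": {
          "type": "keyword_marker",
          "keywords": ["running"]
        }
      },
      "analyzer": {
        "stem_with_exceptions": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "protect", "porter_stem"]
        }
      }
    }
  }
}

With that in place, "running" survives porter_stem untouched while everything else is stemmed. But since token filters only run after the tokenizer, nothing analogous can stop "AT&T" from being split apart in the first place.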
Any tips would be greatly appreciated, thanks!