How to whitelist terms in a custom analyzer

Hi there,

I'm using the edge_ngram tokenizer to support search autocomplete. Compared to other tokenizers, the edge_ngram tokenizer is relatively limited in how it can be configured. Take for example the string "AT&T". Even with "symbol" included in the tokenizer's token_chars list, that string still gets split on the ampersand and tokenized as shown below.
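For reference, here's roughly the _analyze request I'm running (the gram sizes and token_chars values here are just an illustration of my setup, not anything special):

  POST _analyze
  {
    "tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2,
      "token_chars": ["letter", "digit", "symbol"]
    },
    "filter": ["lowercase"],
    "text": "AT&T"
  }

which returns: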

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "at",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "t",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    }
  ]
}

Tokenizing the ampersand away results in three very high-frequency n-grams (two unigrams and one bigram). The problem is that any search for this string is then likely to return a lot of false positives. It seems to me that there should be a token whitelist filter that lets you preserve select terms/phrases (i.e. exceptions to tokenization). I've noticed there is a filter that prevents stemming, the keyword_marker token filter, but I've found no equivalent for allowing exceptions to tokenization rules.
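To make the analogy concrete, here's a minimal sketch of the kind of escape hatch I mean, using keyword_marker to protect a term from stemming (the index, filter, and analyzer names and the example keyword are made up):

  PUT my-index
  {
    "settings": {
      "analysis": {
        "filter": {
          "protected_terms": {
            "type": "keyword_marker",
            "keywords": ["running"]
          }
        },
        "analyzer": {
          "stem_with_exceptions": {
            "tokenizer": "standard",
            "filter": ["lowercase", "protected_terms", "porter_stem"]
          }
        }
      }
    }
  }

What I'm after is the same idea, but applied at the tokenization stage rather than the stemming stage.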

Any tips would be greatly appreciated, thanks!

Yeah, the ngram tokenizer can be somewhat limiting, mainly because it's designed as a quick-and-dirty n-grammer for when you just want simple behavior.

For your case, you'd probably be better served by a less aggressive tokenizer (something like whitespace) followed by an edge_ngram token filter. The whitespace tokenizer will preserve the ampersand, and the edge_ngram filter will then generate the grams you expect.
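Something along these lines (the index, filter, and analyzer names and the gram sizes are just placeholders):

  PUT my-index
  {
    "settings": {
      "analysis": {
        "filter": {
          "autocomplete_grams": {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 10
          }
        },
        "analyzer": {
          "autocomplete": {
            "tokenizer": "whitespace",
            "filter": ["lowercase", "autocomplete_grams"]
          }
        }
      }
    }
  }

You can sanity-check it with _analyze:

  POST my-index/_analyze
  {
    "analyzer": "autocomplete",
    "text": "AT&T"
  }

which should give you a, at, at& and at&t as tokens, so the ampersand survives.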

I considered this, but it doesn't feel like the right approach because these sorts of strings are very much edge cases. In general, tokenizing on ampersands feels like the right thing to do.

That said, I wasn't aware of the existence of the edge ngram token filter (as an alternative to the edge ngram tokenizer), so thanks for the heads up.
