Hi there,
I'm using the edge_ngram tokenizer to support search autocomplete. Unlike other tokenizers, the edge_ngram tokenizer is relatively limited in its configurability. Take, for example, the string "AT&T". Even with "symbol" included in the edge_ngram tokenizer's token_chars field, that string is tokenized as follows:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "at",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "t",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    }
  ]
}
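For reference, a request along these lines reproduces the output above (I've added a lowercase filter, and min_gram/max_gram are left at their defaults of 1 and 2):

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "token_chars": ["letter", "digit", "symbol"]
  },
  "filter": ["lowercase"],
  "text": "AT&T"
}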
Tokenizing the ampersand away results in three very high-frequency n-grams (two unigrams, one bigram). The problem is that any search for this string is then likely to return a lot of false positives. It seems to me that there should be a token whitelist filter that lets you preserve select terms/phrases (i.e., exceptions to tokenization). I've noticed that there is a filter that prevents stemming -- the keyword_marker token filter -- but I've found no equivalent for allowing exceptions to tokenization rules.
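The closest approximation I've come up with is a mapping char filter that rewrites protected strings before the tokenizer sees them, something like the sketch below (the index name, analyzer names, gram sizes, and the "ATT" replacement are all just placeholders):

PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "protect_brands": {
          "type": "mapping",
          "mappings": ["AT&T => ATT"]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": ["letter", "digit", "symbol"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "char_filter": ["protect_brands"],
          "tokenizer": "autocomplete",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

That keeps "AT&T" together as a single run of characters ("att", "at", "a", etc. as prefixes of one word), but the mapping list has to be maintained by hand for every protected term, which is why a proper whitelist filter would be much nicer.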
Any tips would be greatly appreciated, thanks!