Hi there,
I'm using the edge_ngram tokenizer to support search autocomplete. Unlike other tokenizers, the edge_ngram tokenizer is relatively limited in its configurability. Take, for example, the string "AT&T". Even with "symbol" included in the edge_ngram tokenizer's token_chars
field, that string is tokenized as follows:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "at",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "t",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    }
  ]
}
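For reference, my setup looks roughly like this (the index and analyzer names are arbitrary, min_gram/max_gram are at their defaults of 1 and 2, and a lowercase filter accounts for the lowercased grams):

PUT autocomplete_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 2,
          "token_chars": ["letter", "digit", "symbol"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_edge",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

POST autocomplete_demo/_analyze
{
  "analyzer": "autocomplete",
  "text": "AT&T"
}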
Tokenizing the ampersand away yields three very high-frequency n-grams: two unigrams ("a" and "t") and one bigram ("at"). The problem is that any search for this string is then likely to return a lot of false positives. It seems to me that there should be a token whitelist filter that lets you preserve select terms/phrases (i.e. exceptions to tokenization). I've noticed there is a filter that prevents stemming -- the keyword_marker token filter -- but I've found no equivalent for allowing exceptions to tokenization rules.
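To illustrate the kind of escape hatch I mean: keyword_marker protects listed terms from stemmers later in the chain (the term "running" and the analyzer name here are just for illustration):

{
  "settings": {
    "analysis": {
      "filter": {
        "protect": {
          "type": "keyword_marker",
          "keywords": ["running"]
        }
      },
      "analyzer": {
        "stem_with_exceptions": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "protect", "porter_stem"]
        }
      }
    }
  }
}

With that in place, "running" survives porter_stem untouched while everything else is stemmed. But since token filters only run after the tokenizer, nothing analogous can stop "AT&T" from being split apart in the first place.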
Any tips would be greatly appreciated, thanks!