Overwrite Tokenizer of english analyzer


(Xudong You) #1

The ES built-in english analyzer treats "." as a valid token character. That is, www.google.com will be ONE token, "www.google.com", in the inverted index.

Now I want to treat "." as a delimiter in the english analyzer, so that www.google.com is tokenized into three tokens:
www
google
com

But I still want to keep all other default english analyzer behaviors as-is.

What is the best approach to do this?


(Christoph) #2

Hi,

this example shows how the english analyzer could be reimplemented as a custom analyzer. From there you can change any part of the analysis chain, e.g. use a different tokenizer that fits your needs.
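A minimal sketch of what such a reimplementation might look like, written as a Python dict that serializes to the index settings body. The names rebuilt_english, english_stop, english_stemmer and english_possessive_stemmer are illustrative labels, and the optional keyword_marker filter is left out on the assumption that no words need to be protected from stemming. Treat it as an approximation of the built-in analyzer, not a guaranteed exact copy.

```python
import json

# Index settings that approximate the built-in "english" analyzer as a
# custom analyzer. All filter/analyzer names below are illustrative.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Removes the default English stop words.
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                # Stems English words.
                "english_stemmer": {"type": "stemmer", "language": "english"},
                # Strips trailing 's from words.
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english",
                },
            },
            "analyzer": {
                "rebuilt_english": {
                    "type": "custom",
                    # This is the part to swap out or pre-process
                    # if a different tokenization is needed.
                    "tokenizer": "standard",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                    ],
                }
            },
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(settings, indent=2))
```

With the chain spelled out like this, the tokenizer (or anything in front of it) can be changed while the stop word and stemming filters stay as they are.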


(Xudong You) #3

Thanks for the reply.
Do you mean I can use a tokenizer such as the pattern tokenizer to split sentences on my own predefined word delimiters?

What I want is to extend the standard tokenizer, which is used by most of the language analyzers, so that it also splits tokens on "." in addition to all existing word boundaries.

The pattern tokenizer does not seem able to cover all the word boundaries that the standard tokenizer supports.
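One possible way to get this without giving up the standard tokenizer is to leave the tokenizer alone and put a pattern_replace char filter in front of it that turns every "." into a space. The sketch below is an assumption about how that could be wired up, not something confirmed in this thread: the helper add_dot_split and the names dot_to_space and my_english are hypothetical. Note that it would also split decimals such as 3.14 and abbreviations containing dots.

```python
import copy
import re


def add_dot_split(analysis, analyzer_name, filter_name="dot_to_space"):
    """Return a copy of `analysis` in which the named analyzer replaces
    every '.' with a space before its tokenizer runs."""
    result = copy.deepcopy(analysis)
    # Define a pattern_replace char filter that maps "." to " ".
    result.setdefault("char_filter", {})[filter_name] = {
        "type": "pattern_replace",
        "pattern": "\\.",
        "replacement": " ",
    }
    # Attach it to the analyzer, keeping any char filters already present.
    analyzer = result["analyzer"][analyzer_name]
    analyzer["char_filter"] = list(analyzer.get("char_filter", [])) + [filter_name]
    return result


# A deliberately small analyzer definition, used only for illustration.
analysis = {
    "analyzer": {
        "my_english": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase"],
        }
    }
}
updated = add_dot_split(analysis, "my_english")

# Local preview of the character replacement only, using Python's re module.
# This does not run Elasticsearch's analysis chain.
preview = re.sub(r"\.", " ", "www.google.com")
print(preview)  # www google com
```

Since the standard tokenizer then only sees "www google com", it should emit the three tokens while applying its usual word boundary rules to everything else.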

