Overwrite Tokenizer of english analyzer

Youxu · December 14, 2015, 4:18am

The ES built-in english analyzer treats "." as a valid token character. That is, www.google.com becomes ONE token, "www.google.com", in the inverted index.

Now I want to treat "." as a delimiter in the english analyzer, so that www.google.com is tokenized into three tokens:
www
google
com

But I still want to keep all other default english analyzer behaviors as-is.

What is the best approach?

cbuescher · December 14, 2015, 8:34am

Hi,

this example shows how the english analyzer could be reimplemented as a custom analyzer. From there you can change any part of the analysis chain, e.g. use a different tokenizer that fits your needs.
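
As a rough illustration of that idea, here is a minimal sketch in Python that builds index settings for a custom analyzer mirroring the english analyzer's chain (standard tokenizer, then possessive stemmer, lowercase, English stop words, English stemmer). The analyzer name `rebuilt_english` and the helper `build_english_settings` are made up for this example, and the optional keyword_marker filter from the documented chain is left out. Please check the filter list against the language analyzer docs for your version before relying on it.

```python
import json


def build_english_settings(analyzer_name="rebuilt_english", tokenizer="standard"):
    """Return index settings for a custom analyzer that mimics `english`.

    `analyzer_name` is a hypothetical name chosen for this sketch.
    `tokenizer` is the part you would swap to change how text is split.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # English stop word list shipped with Elasticsearch
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    # Main English stemmer
                    "english_stemmer": {"type": "stemmer", "language": "english"},
                    # Strips trailing 's before the other filters run
                    "english_possessive_stemmer": {
                        "type": "stemmer",
                        "language": "possessive_english",
                    },
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        "filter": [
                            "english_possessive_stemmer",
                            "lowercase",
                            "english_stop",
                            "english_stemmer",
                        ],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(build_english_settings(), indent=2))
```

The printed JSON would be used as the request body when creating the index. Changing the `tokenizer` argument is the one place where the splitting behavior is replaced while the rest of the chain stays the same.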

Youxu · December 17, 2015, 10:42am

Thanks for the reply.
Do you mean I can use a tokenizer such as the Pattern tokenizer to split sentences on my own predefined word delimiters?

What I want is to extend the standard tokenizer, which is used by most of the language analyzers, so that it also splits tokens on "." in addition to all of its existing word boundaries.

The pattern tokenizer does not seem able to cover all the word boundaries that the standard tokenizer supports.
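
One possible way to keep the standard tokenizer's word boundaries and still split on "." is to leave the tokenizer alone and put a character filter in front of it that rewrites each "." to a space. The sketch below assumes a `pattern_replace` char filter inside a custom analyzer. The names `dot_to_space`, `english_dot_split` and `build_dot_splitting_settings` are invented for this example, and the filter chain is an approximation of the english analyzer rather than an exact copy.

```python
import json


def build_dot_splitting_settings():
    """Index settings: english-like analyzer that also breaks tokens on '.'.

    All names below (dot_to_space, english_dot_split) are hypothetical.
    """
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    # Runs before tokenization: every literal '.' becomes a space,
                    # so the standard tokenizer sees "www google com".
                    "dot_to_space": {
                        "type": "pattern_replace",
                        "pattern": r"\.",
                        "replacement": " ",
                    }
                },
                "filter": {
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    "english_stemmer": {"type": "stemmer", "language": "english"},
                    "english_possessive_stemmer": {
                        "type": "stemmer",
                        "language": "possessive_english",
                    },
                },
                "analyzer": {
                    "english_dot_split": {
                        "type": "custom",
                        "char_filter": ["dot_to_space"],
                        "tokenizer": "standard",
                        "filter": [
                            "english_possessive_stemmer",
                            "lowercase",
                            "english_stop",
                            "english_stemmer",
                        ],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(build_dot_splitting_settings(), indent=2))
```

A caveat with this approach: the replacement applies to every ".", so decimals such as 3.14 and abbreviations such as U.S.A. would also be broken apart. Running a few sample strings through the analyze API would confirm whether that trade-off is acceptable.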
