Custom tokenizer for letters and digits


I searched the docs but could not find a way to create a custom tokenizer that breaks text at any char that is not a digit or Unicode letter (that is, any char for which Java's Character.isLetterOrDigit() returns false). In the past I coded one using pure Lucene...

In Elastic, I tried the simple pattern tokenizer with a regex listing all chars for which Java's Character.isLetterOrDigit() returns true (154,137 chars), but that caused a stack overflow and brought Elastic down.

I would rather not use the standard pattern tokenizer because it is slow. I have had bad experiences with Java regex in the past...

I found the char group tokenizer but, if I understood correctly, I need exactly the opposite: to define the valid token chars, not the delimiter chars.

Luis Nassif


I do not think there is an out-of-the-box tokenizer that does that for you. The LetterTokenizer checks for letters only, so you could probably use it as a starting point and write your own plugin based on it; for that you would need to implement AnalysisPlugin. See and for some help on how to do that.
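
To make the idea concrete, here is a minimal plain-Java sketch of the splitting rule such a tokenizer would apply. It is not Lucene or Elasticsearch code: the class name LetterOrDigitSplitter and the method tokenize are made up for illustration. As far as I know, LetterTokenizer is a CharTokenizer whose isTokenChar(int) check uses Character.isLetter, so a custom tokenizer would swap in the Character.isLetterOrDigit test shown here; treat that mapping as an assumption and check it against the Lucene version you run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Lucene or Elasticsearch).
final class LetterOrDigitSplitter {

    // Returns the maximal runs of code points for which
    // Character.isLetterOrDigit(int) is true; every other code point
    // acts as a delimiter and is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // Work on code points so supplementary characters are handled.
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A delimiter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        // Flush the last token if the text does not end with a delimiter.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Calling LetterOrDigitSplitter.tokenize("foo-bar_42, baz!") yields the tokens foo, bar, 42 and baz. The check is a single predicate per code point, so no regex and no list of valid chars is needed.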


Thank you for replying. I didn't know it was possible to write analysis plugins; I will take a look.

