Custom tokenizer for letters and digits

Hi,

I searched the docs but could not find a way to create a custom tokenizer that breaks text at any char that is not a digit or Unicode letter (that is, any char for which Java's Character.isLetterOrDigit() returns false). In the past I coded one using pure Lucene...

In Elasticsearch, I tried the simple pattern tokenizer with a regex listing every char accepted by Java's Character.isLetterOrDigit() (154,137 chars), but that caused a stack overflow and brought Elasticsearch down.

I would rather not use the standard pattern tokenizer because it is slow. I have had bad experiences with Java regex in the past...

I found the char group tokenizer but, if I understood correctly, I need exactly the opposite: to define the valid token chars, not the delimiter chars.

Thanks,
Luis Nassif

Hey,

I do not think there is an out-of-the-box tokenizer that does this for you. The LetterTokenizer accepts letters only, but you could probably base your own tokenizer on it. To make it available in Elasticsearch, you would need to write your own plugin that implements AnalysisPlugin. See https://github.com/elastic/elasticsearch/blob/master/plugins/analysis-stempel/src/main/java/org/elasticsearch/plugin/analysis/stempel/AnalysisStempelPlugin.java and https://github.com/elastic/elasticsearch/tree/master/plugins/examples for some help on how to do that.
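
To illustrate the rule such a tokenizer would apply, here is a rough sketch in plain Java with no Lucene or Elasticsearch dependency. LetterOrDigitSplitter and TOKEN_CHAR are made-up names for this example only. As far as I know, LetterTokenizer is built on Lucene's CharTokenizer, whose isTokenChar(int) hook decides which code points belong to a token, so a custom subclass would apply the same predicate there (Character.isLetterOrDigit instead of Character.isLetter). Treat it as a starting point under those assumptions, not a finished tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper for illustration; not part of Lucene or Elasticsearch.
final class LetterOrDigitSplitter {

    // Token chars are defined positively: letters and digits stay, everything else splits.
    static final IntPredicate TOKEN_CHAR = Character::isLetterOrDigit;

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Work on code points so supplementary characters are handled correctly.
            int cp = text.codePointAt(i);
            if (TOKEN_CHAR.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A non-token char ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo_bar-42 x9")); // prints [foo, bar, 42, x9]
    }
}
```

Because the check is a single method call per code point, there is no need to enumerate the accepted chars or to involve a regex engine at all.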

--Alex

Thank you for replying. I didn't know it was possible to write analysis plugins; I will take a look.

Luis
