Custom tokenizer for letter and digits

lfcnassif · March 27, 2020, 12:44am

Hi,

I searched the docs but could not find a way to create a custom tokenizer that breaks text at any character that is not a digit or a Unicode letter (that is, any character for which Java's Character.isLetterOrDigit() returns false). In the past I coded one using pure Lucene...

In Elasticsearch, I tried the simple pattern tokenizer with a regex listing all the characters accepted by Java's Character.isLetterOrDigit() (154,137 chars), but that caused a stack overflow and brought Elasticsearch down.

I would rather not use the standard pattern tokenizer because it is slow. I have had bad experiences with Java regex in the past...

I found the char group tokenizer but, if I understood correctly, I need exactly the opposite: to be able to define the valid token characters, not the delimiter characters.
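To make the requirement concrete, here is a minimal plain-Java sketch of the splitting behaviour described above: keep maximal runs of characters accepted by Character.isLetterOrDigit() and break on everything else. It is only an illustration of the rule, not a Lucene or Elasticsearch tokenizer; the class name LetterOrDigitSplitter and the method tokenize are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or Elasticsearch.
final class LetterOrDigitSplitter {

    // Returns the maximal runs of code points for which
    // Character.isLetterOrDigit(int) is true; every other code point is a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // Work on code points so supplementary characters are handled correctly.
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A delimiter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        // Flush the last token if the text did not end with a delimiter.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

For example, LetterOrDigitSplitter.tokenize("foo-bar_42, x.y") yields the tokens foo, bar, 42, x and y, since '-', '_', ',', ' ' and '.' are neither letters nor digits.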

Thanks,
Luis Nassif

spinscale · March 30, 2020, 3:30pm

Hey,

I do not think there is an out-of-the-box tokenizer that does this for you. The LetterTokenizer checks for letters only; you could probably use it as a starting point and write a plugin based on it. To do that, you would need to write your own plugin that implements AnalysisPlugin. See https://github.com/elastic/elasticsearch/blob/master/plugins/analysis-stempel/src/main/java/org/elasticsearch/plugin/analysis/stempel/AnalysisStempelPlugin.java and https://github.com/elastic/elasticsearch/tree/master/plugins/examples for some help on how to do that.

--Alex

lfcnassif · March 30, 2020, 3:49pm

Thank you for replying. I didn't know it was possible to write analysis plugins; I will take a look.

Luis

