I searched the docs but was not able to find a way to create a custom tokenizer that breaks text at any char that is not a digit or a Unicode letter (i.e. the chars for which Java's Character.isLetterOrDigit() returns true). In the past I coded one using pure Lucene...
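For context, a pure-Lucene tokenizer along those lines is basically a CharTokenizer subclass whose isTokenChar() delegates to Character.isLetterOrDigit(). A rough sketch (the class name is mine, and the CharTokenizer package shown is the Lucene 7.x/8.x analyzers-common layout, which has moved between versions):

```java
import org.apache.lucene.analysis.util.CharTokenizer;

public class LetterOrDigitTokenizer extends CharTokenizer {
    @Override
    protected boolean isTokenChar(int c) {
        // keep the code point inside the current token only if it is a Unicode letter or digit
        return Character.isLetterOrDigit(c);
    }
}
```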
In Elasticsearch I tried the simple_pattern tokenizer, building a regex that enumerates all chars accepted by Character.isLetterOrDigit() (154,137 chars), but that caused a stack overflow and brought Elasticsearch down.
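(Roughly, the character class was generated like this; a hypothetical sketch, not the exact code, with regex escaping omitted:)

```java
// Enumerate every code point accepted by Character.isLetterOrDigit()
// into one giant regex character class, e.g. "[abc...]+".
public class BuildLetterOrDigitPattern {
    public static void main(String[] args) {
        StringBuilder charClass = new StringBuilder("[");
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isLetterOrDigit(cp)) {
                charClass.appendCodePoint(cp); // escaping of regex metacharacters omitted
            }
        }
        charClass.append("]+");
        System.out.println("pattern length: " + charClass.length());
    }
}
```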
I would rather not use the standard pattern tokenizer because it is slow; I have had bad experiences with Java regexes in the past...
I found the char_group tokenizer, but, if I understood correctly, I need exactly the opposite: to define the valid token chars, not the delimiter chars.
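In plain Lucene both directions are available, and what I need is the second one. A sketch, assuming Lucene 7.3+ (where these CharTokenizer factory methods exist) and the analyzers-common module:

```java
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.CharTokenizer;

public class TokenizerDirections {
    // char_group style: enumerate the delimiter (separator) chars
    static Tokenizer splitOnSeparators() {
        return CharTokenizer.fromSeparatorCharPredicate(c -> c == ' ' || c == '-' || c == '_');
    }

    // what I'm after: define the valid token chars and split on everything else
    static Tokenizer keepLetterOrDigit() {
        return CharTokenizer.fromTokenCharPredicate(Character::isLetterOrDigit);
    }
}
```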