I searched the docs but could not find a way to create a custom tokenizer that breaks text at any character that is not a digit or a Unicode letter (that is, any character for which Java's Character.isLetterOrDigit() returns false). In the past I coded one using pure Lucene...
In Elastic, I tried the simple pattern tokenizer with a regex listing every character accepted by Java's Character.isLetterOrDigit() (154,137 characters), but that caused a stack overflow and brought Elastic down.
I would rather not use the standard pattern tokenizer because it is slow. I have had bad experiences with Java regex in the past...
I found the char group tokenizer but, if I understood correctly, I need exactly the opposite: the ability to define the valid token characters, not the delimiter characters.
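
To make the goal concrete, here is a minimal plain-Java sketch of the splitting behaviour I am after. It is not Lucene or Elastic code, and the class and method names (LetterOrDigitSplitter, tokenize) are made up for illustration only. It walks the text by code point and keeps runs of characters for which Character.isLetterOrDigit(int) returns true:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or Elastic.
final class LetterOrDigitSplitter {

    // Returns the runs of code points accepted by Character.isLetterOrDigit(int).
    // Every other code point acts as a delimiter and is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                // Valid token character: extend the current token.
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Delimiter: close the current token, if there is one.
                tokens.add(current.toString());
                current.setLength(0);
            }
            // Advance by 1 or 2 chars so supplementary code points stay intact.
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [foo, bar, 42, x9]
        System.out.println(tokenize("foo-bar_42, x9"));
    }
}
```

What I am looking for is a tokenizer in Elastic that behaves like this without going through a regex.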