I don't understand what task you are trying to solve.
But you can use the pattern tokenizer or the ngram tokenizer.
Here is an example: you can use a Java regex pattern to define where each token starts. This pattern splits your text into tokens that each begin with an uppercase letter: "pattern": "(?=\p{Upper})"
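To see what that pattern does outside Elasticsearch, here is a minimal sketch in plain Java (8 or later). The class and method names are made up for illustration. Note that inside a JSON settings body the backslash would most likely need to be doubled ("(?=\\p{Upper})"), just as it is in the Java string literal below.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Elasticsearch.
public class UpperSplitDemo {
    // Zero-width lookahead: matches the empty position right before an uppercase letter,
    // so splitting here keeps the uppercase letter at the start of the next token.
    private static final Pattern BEFORE_UPPER = Pattern.compile("(?=\\p{Upper})");

    static String[] splitOnUpper(String text) {
        return BEFORE_UPPER.split(text);
    }

    public static void main(String[] args) {
        // Prints [Hello, World, Foo]
        System.out.println(Arrays.toString(splitOnUpper("HelloWorldFoo")));
    }
}
```

Lowercase text before the first uppercase letter stays together as its own leading token, since the pattern only matches before uppercase letters.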
Thank you. Somehow my message got altered because I used "<" & ">" in my examples. I have just fixed it.
max_token_length seems to be the closest to what we want; however, it seems to split a token once it reaches the maximum token length. We want to truncate the token instead, since we don't care about the remainder.
Use Case:
Sometimes customers send in random gibberish text that does not make any sense,
e.g. "asdfasdf....." that may be 1 MB long, or some text in an unsupported language. We simply want to truncate such values instead of analyzing them.
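To make the difference concrete, here is a small plain-Java sketch of the two behaviours: splitting an over-long token into chunks (what max_token_length appears to do, as described above) versus keeping only the first chunk (what we are after). The class and method names are hypothetical and this is only an illustration, not Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it only illustrates split versus truncate.
public class TokenLengthDemo {
    // Split behaviour: the token is cut every `max` characters and ALL pieces are kept.
    // Assumes max > 0.
    static List<String> splitAtMax(String token, int max) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < token.length(); i += max) {
            parts.add(token.substring(i, Math.min(token.length(), i + max)));
        }
        return parts;
    }

    // Truncate behaviour: only the first `max` characters are kept, the rest is dropped.
    static String truncate(String token, int max) {
        return token.length() <= max ? token : token.substring(0, max);
    }

    public static void main(String[] args) {
        System.out.println(splitAtMax("asdfasdfas", 4)); // prints [asdf, asdf, as]
        System.out.println(truncate("asdfasdfas", 4));   // prints asdf
    }
}
```

On the Elasticsearch side, the truncate token filter (with its length setting) may be closer to this second behaviour than max_token_length; it is worth checking against the documentation for the version in use.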