I want a custom analyzer that can take a string like "((hello world!))" and give me a token list of:
["(", "(", "hello", "world", "!", ")", ")"]
That is to say, I basically want the "letter" tokenizer, but instead of discarding the non-letter characters I want to keep them, each as its own single-character token.
Ah, OK. This isn't well documented, but to write a pattern that matches the tokens themselves (rather than the delimiters, which is the default), you have to set the "group" value!
I've put together the following:
Ideally I'd like to use the UNICODE_CHARACTER_CLASS Java regex flag, but I get an error when I include it. Without it, characters in Chinese, Japanese, etc. are treated as non-letters, so each character becomes its own token. Is there any way to do this without using CJK?
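For illustration, here is a minimal plain-Java sketch of the token-matching idea, outside Elasticsearch. It assumes a pattern built on `\p{L}`, which in `java.util.regex` is a Unicode general category and matches letters from any script without the UNICODE_CHARACTER_CLASS flag (that flag changes the predefined classes such as `\w` and `\s` and the POSIX classes such as `\p{Alpha}`). The class name `TokenPatternSketch`, the method `tokenize`, and the pattern itself are hypothetical, not the pattern from the post, and whether the same pattern behaves identically inside the pattern tokenizer is an untested assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: mimics "match the tokens, not the delimiters".
final class TokenPatternSketch {
    // \p{L}+      a run of Unicode letters becomes one word token
    // [^\p{L}\s]  any single character that is neither a letter nor whitespace
    // Note: without UNICODE_CHARACTER_CLASS, \s covers ASCII whitespace only.
    static final Pattern TOKEN = Pattern.compile("\\p{L}+|[^\\p{L}\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // group(0) is the whole match, i.e. the token itself
            tokens.add(m.group(0));
        }
        return tokens;
    }
}
```

Calling `TokenPatternSketch.tokenize("((hello world!))")` yields the seven tokens listed at the top, and a run of Chinese or Japanese letters stays together as one token because those characters fall under `\p{L}`. Digits are not letters, so under this pattern each digit would come out as its own token.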