I'm trying to search for a word within these phrases/compound words; the data comes from a document/email.
SMC_otherword_
something1234
user12345
So for this instance I would like to be able to search for "SMC", "something", or "1234". Should I use a different analyzer or a custom tokenizer, or can this be handled by the search API? I'm currently using match_phrase_prefix, but that only puts a wildcard on the last word, so it isn't sufficient: I need to match words inside compound words that contain unusual characters like underscores.
Try using a pattern (regexp) tokenizer with custom rules, coupled with an ngram token filter. As Mark has rightly pointed out, beware of the number of tokens produced at the end of the analysis process; it can increase the index size drastically.
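A minimal sketch of that approach as index settings. The `pattern` tokenizer and `ngram` token filter are built into Elasticsearch; the index name, the regex, the gram limits, and all the custom names (`word_parts`, `part_ngrams`, `compound_word_analyzer`) are illustrative assumptions you'd tune for your data. The regex splits on any non-alphanumeric run (so `SMC_otherword_` yields `SMC` and `otherword`), and the ngram filter then breaks `something1234` into substrings that include `something` and `1234`:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "word_parts": {
          "type": "pattern",
          "pattern": "[^a-zA-Z0-9]+"
        }
      },
      "filter": {
        "part_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "compound_word_analyzer": {
          "type": "custom",
          "tokenizer": "word_parts",
          "filter": ["lowercase", "part_ngrams"]
        }
      }
    }
  }
}
```

You can sanity-check the output with the `_analyze` API before indexing anything; and per the warning above, keep `max_gram - min_gram` small, since every extra gram length multiplies the token count per word.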
Mark, do you mean a custom tokenizer or a custom analyzer? I guess there is no provision to write a custom tokenizer; we would need to alter the source code or write a plugin for that.