Searching a word within a word

Hi,

I'm trying to search for a word inside phrases/compound words like the ones below. The data comes from a document/email.

SMC_otherword_
something1234
user12345

So in this case I would like to be able to search for "SMC", "something" or "1234". Should I use a different analyzer or a different tokenizer, or can this be handled by the search API? I'm currently using match_phrase_prefix, but that only puts a wildcard on the last word, so it isn't sufficient: I need to match words inside compound words that contain some unusual characters.

You could try ngrams, but depending on how short/long you want them, it could end up costly.
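To get a feel for that cost, here is a rough plain-Python sketch (not Elasticsearch code) that counts the character ngrams produced for one of the sample terms. The helper name `char_ngrams` and the gram sizes 3 to 5 are just assumptions for illustration.

```python
def char_ngrams(text, min_gram, max_gram):
    """Return every substring of text whose length is between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


term = "something1234"
grams = char_ngrams(term, 3, 5)

print(len(grams))                        # 30 tokens for a single 13-character term
print("1234" in grams, "some" in grams)  # True True
```

One short term already turns into 30 tokens here, and widening the gram range makes that grow quickly.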

It depends on how you want to query. If that means splitting the letters from the digits, then a custom tokeniser might be best.
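As a quick illustration of what splitting letters from digits would do to the sample data, here is a small plain-Python sketch. The regex and the function name `split_alpha_numeric` are my own assumptions, not anything built into Elasticsearch.

```python
import re


def split_alpha_numeric(text):
    """Split text into runs of letters and runs of digits, dropping everything else."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


for sample in ["SMC_otherword_", "something1234", "user12345"]:
    print(sample, "->", split_alpha_numeric(sample))

# SMC_otherword_ -> ['SMC', 'otherword']
# something1234 -> ['something', '1234']
# user12345 -> ['user', '12345']
```

With tokens like these, "SMC", "something" and "1234" each become a term that can be matched directly.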

Try using a pattern (regexp) tokenizer with custom rules, coupled with an ngram token filter. As Mark has rightly pointed out, beware of the number of tokens produced at the end of the analysis process. It can increase the index size drastically.
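A minimal sketch of what such index settings might look like, written as a Python dict that you would send as the body of an index-creation request. The names `alnum_split`, `substring_filter` and `compound_analyzer` are hypothetical, and the pattern and gram sizes are assumptions to tune against your own data.

```python
import json

# Hypothetical analysis settings: a pattern tokenizer that emits runs of
# letters or digits, followed by lowercase and ngram token filters.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "alnum_split": {
                    "type": "pattern",
                    "pattern": "[A-Za-z]+|[0-9]+",
                    # group 0 emits the matched text as tokens instead of
                    # splitting on the pattern (the default behaviour)
                    "group": 0,
                }
            },
            "filter": {
                "substring_filter": {
                    "type": "ngram",
                    # assumed sizes; a wider range may require raising
                    # the index.max_ngram_diff setting
                    "min_gram": 3,
                    "max_gram": 4,
                }
            },
            "analyzer": {
                "compound_analyzer": {
                    "type": "custom",
                    "tokenizer": "alnum_split",
                    "filter": ["lowercase", "substring_filter"],
                }
            },
        }
    }
}

# Print the JSON body so it can be pasted into a create-index request.
print(json.dumps(index_settings, indent=2))
```

It is worth running a few sample values through the _analyze API with these settings before indexing real data, to check which tokens actually come out and how many.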

Mark, do you mean a custom tokenizer or a custom analyzer? I guess there is no provision to write a custom tokenizer; we would need to alter the source code or write a plugin for that.
