Searching a word within a word


I'm trying to search for a word inside a phrase or compound word. The data comes from a document/email.


So for this instance I would like to search for "SMC", "something" or "1234". Should I use a different analyzer or a different tokenizer, or can this be handled by the search API? I'm currently using match_phrase_prefix, but it only applies a wildcard to the last word, so it isn't sufficient: I need to match words inside compound words that contain some odd characters.

You could try ngrams, but depending on how short or long you make them, they could end up costly.
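To make the cost concrete, here is a rough sketch in plain Python (not Elasticsearch code) that counts the ngrams a single token produces. The function name `ngrams` and the choice of "something" as the sample token are only for illustration.

```python
def ngrams(token, min_gram, max_gram):
    """Return every substring of token whose length is in [min_gram, max_gram]."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


# A narrow range keeps the token count modest.
print(len(ngrams("something", 3, 4)))   # 13

# A wide range multiplies it: one 9-character word becomes 45 tokens.
print(len(ngrams("something", 1, 9)))   # 45
```

The wider the gap between the minimum and maximum lengths, the more tokens every word in every document contributes to the index.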

It depends on how you want to query. If that means splitting the alphabetic parts from the numeric ones, then a custom tokeniser might be best.
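As a rough illustration of what splitting alphas from numerics would produce, here is a plain Python sketch using a regular expression. The input string is made up for the example (the original data isn't shown here), and `split_alpha_numeric` is a hypothetical helper, not an Elasticsearch API.

```python
import re


def split_alpha_numeric(text):
    """Split text into runs of ASCII letters and runs of digits, dropping everything else."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


# Hypothetical compound word with odd separator characters.
sample = "SMC_something#1234"
print(split_alpha_numeric(sample))  # ['SMC', 'something', '1234']
```

With tokens like these in the index, an ordinary match query for "SMC", "something" or "1234" would find the document without needing wildcards.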

Try using a regex-based (pattern) tokenizer with custom rules, coupled with an ngram token filter. As Mark rightly pointed out, watch the number of tokens produced at the end of the analysis process; it can increase the index size drastically.
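A minimal sketch of what such index settings might look like, built as a Python dict and printed as JSON so it can be pasted into a create-index request. The names `split_on_symbols`, `substring_ngrams` and `compound_word_analyzer` are made up, and the regex and gram sizes are assumptions to adjust for your data. The gram sizes are kept one apart because recent Elasticsearch versions limit the difference between min_gram and max_gram through the index.max_ngram_diff setting.

```python
import json

# Assumed settings: split on anything that is not a letter or digit,
# lowercase the pieces, then emit 3- and 4-character ngrams of each piece.
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "split_on_symbols": {
                    "type": "pattern",
                    "pattern": "[^A-Za-z0-9]+",
                }
            },
            "filter": {
                "substring_ngrams": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 4,
                }
            },
            "analyzer": {
                "compound_word_analyzer": {
                    "type": "custom",
                    "tokenizer": "split_on_symbols",
                    "filter": ["lowercase", "substring_ngrams"],
                }
            },
        }
    }
}

print(json.dumps(settings, indent=2))
```

You can check what tokens an analyzer like this actually produces with the _analyze API before reindexing, which is also a quick way to see how many tokens each value turns into.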

Mark, do you mean a custom tokenizer or a custom analyzer? As far as I know there is no provision for writing a custom tokenizer directly; we would need to alter the source code or write a plugin for that.

