I'm trying to search for a word within these phrases/compound words; the data comes from a document/email.
SMC_otherword_
something1234
user12345
So for this instance I would like to be able to search for "SMC", "something", or "1234". Should I use a different analyzer or a custom tokenizer, or can this be handled by the search API? I'm currently using match_phrase_prefix, but that only puts a wildcard on the last word, so it isn't sufficient: I need to match words inside compound words that contain unusual characters like underscores.
Try using a pattern (regexp) tokenizer with custom rules, coupled with an ngram token filter. As Mark has rightly pointed out, beware of the number of tokens produced at the end of the analysis process; it can increase the index size drastically.
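A minimal sketch of that approach as index settings. The `pattern` tokenizer and `ngram` token filter are built into Elasticsearch; the index name, the regex, the gram limits, and all the custom names (`word_parts`, `part_ngrams`, `compound_word_analyzer`) are illustrative assumptions you'd tune for your data. The regex splits on any non-alphanumeric run (so `SMC_otherword_` yields `SMC` and `otherword`), and the ngram filter then breaks `something1234` into substrings that include `something` and `1234`:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "word_parts": {
          "type": "pattern",
          "pattern": "[^a-zA-Z0-9]+"
        }
      },
      "filter": {
        "part_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "compound_word_analyzer": {
          "type": "custom",
          "tokenizer": "word_parts",
          "filter": ["lowercase", "part_ngrams"]
        }
      }
    }
  }
}
```

You can sanity-check the output with the `_analyze` API before indexing anything; and per the warning above, keep `max_gram - min_gram` small, since every extra gram length multiplies the token count per word.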
Mark, do you mean a custom tokenizer or a custom analyzer? I guess there is no provision to write a custom tokenizer; we would need to alter the source code or write a plugin for that.