How to index words in both their actual and modified forms into Elasticsearch

Hi Team,

I want to index words into Elasticsearch in both their actual and modified forms.

Example:

The term "F-35" should be indexed as both "F35" and "F-35", so that searching for either F35 or F-35 returns the document.

Note: I am using the whitespace analyzer here, so the term is not split into two tokens.

Could someone suggest an option to achieve this?

Hi Karthik,

if you want to remove the punctuation as well, you should consider using the standard tokenizer.
If that removes more than you want, you can consider using a pattern analyzer. But be aware that this might be very slow.

Make sure to test your custom analyzers using the _analyze API.
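For example, to see how the whitespace analyzer handles your term, you can run this against the _analyze endpoint (e.g. from Kibana Dev Tools):

```json
POST _analyze
{
  "analyzer": "whitespace",
  "text": "F-35"
}
```

The response lists the tokens produced, so you can compare analyzers side by side before wiring one into your mapping.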

Hi SaskiaVola,

Thanks for taking the time to reply.

When we use the standard tokenizer, it splits "F-35" into two different tokens, "F" and "35".
But I am expecting single tokens like "F35" and "F-35".

Won't a pattern analyzer also split the token on punctuation?

Hi Karthik,

that's correct. So, depending on your data, if you can define a proper pattern for the cases you're referring to, you could apply a character filter first that removes the hyphen inside words containing numbers.
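A sketch of what that could look like, using a `pattern_replace` character filter in front of your whitespace tokenizer (the index, filter, and analyzer names are placeholders, and the regex assumes hyphens between a word character and a digit; test it on your data):

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_inner_hyphen": {
          "type": "pattern_replace",
          "pattern": "(\\w)-(\\d)",
          "replacement": "$1$2"
        }
      },
      "analyzer": {
        "hyphen_normalizing": {
          "type": "custom",
          "char_filter": ["strip_inner_hyphen"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Because the char filter runs at both index and search time, "F-35" and "F35" both normalize to the same token.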

Then a query for "F35" or "F-35" would match documents containing either variant, since both are normalized to the same token.

Hope that works for you.
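For completeness: if you really want both "F-35" and "F35" stored as separate tokens at index time (as the original question asked), one option worth testing is a `word_delimiter` token filter with `preserve_original` and `catenate_all`. This is a sketch with placeholder names, not something verified against your data:

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "join_hyphenated": {
          "type": "word_delimiter",
          "preserve_original": true,
          "catenate_all": true,
          "generate_word_parts": false,
          "generate_number_parts": false
        }
      },
      "analyzer": {
        "both_forms": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["join_hyphenated", "lowercase"]
        }
      }
    }
  }
}
```

With `preserve_original` the original token "F-35" is kept, and `catenate_all` adds the joined form "F35"; disabling the word/number parts avoids also emitting "F" and "35" on their own. Again, check the output with the _analyze API first.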

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.