Analyzing a Hebrew string

Hey guys. I need help with writing an analyzer for a Hebrew string.

Basically, Hebrew words can contain prefix/suffix characters that I want to tokenize. The catch is that I want my list of terms to include each word both with and without the prefix characters. For example:
"מלחמה ושלום" — this string contains the prefix letter "ו". I need the analyzer to produce the following terms:
['מלחמה', 'ושלום', 'שלום']. Naturally, I also need the analyzer to split the string on whitespace.

So far, I've written a basic regex pattern to locate the specific characters I want to capture, which works, but I'm not sure how to proceed from there. Any input on the matter would be appreciated.
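To make the expected behavior concrete, here's a minimal Python sketch of the tokenization I'm after (the prefix set and function name are my own placeholders; the real list of prefix letters would be longer):

```python
# Placeholder set of single-letter Hebrew prefixes (just ו "and" here;
# the full set of prefixes to handle is an open question).
PREFIXES = "ו"

def analyze(text):
    """Split on whitespace; for each token that starts with a known
    prefix letter, emit both the original token and the stripped form."""
    tokens = []
    for word in text.split():
        tokens.append(word)
        if len(word) > 1 and word[0] in PREFIXES:
            tokens.append(word[1:])
    return tokens

print(analyze("מלחמה ושלום"))
# → ['מלחמה', 'ושלום', 'שלום']
```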

Hello ProxyMasterMerger!

I apologize in advance, as I don't speak or read Hebrew. Elasticsearch ships with a bunch of different analyzers, and there are also some plugins you could use. In our documentation, we refer to this Hebrew analyzer: https://github.com/synhershko/elasticsearch-analysis-hebrew

Another option would be to use the regex you mentioned in a custom analyzer, via the pattern tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html . I don't understand Hebrew, but if 'ושלום' and 'שלום' mean the same thing, you could use synonyms. In your scenario, though, I think it would be better to have two analyzers: one using the whitespace analyzer and another custom one built from your regex, and then search both fields at the same time.
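As a rough skeleton (the index name, analyzer/tokenizer names, and the pattern are all placeholders; substitute your own regex), the custom analyzer plus a multi-field mapping could look something like:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      },
      "analyzer": {
        "hebrew_prefix": {
          "type": "custom",
          "tokenizer": "my_pattern_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "whitespace",
        "fields": {
          "prefixed": {
            "type": "text",
            "analyzer": "hebrew_prefix"
          }
        }
      }
    }
  }
}
```

Note that by default the pattern tokenizer splits on the regex; it also accepts a `group` parameter if you want it to capture matches instead.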


Thanks for the response. I'll look into the analyzer; perhaps it will clue me in on some additional details.

Hey, sorry for bringing this thread back up, but I have another question. I tried to find an answer in the docs, but to no avail...

I already know of a way to strip prefix/suffix characters thanks to the pattern replace filter, but I'm stuck on finding something that does the opposite. Namely, is there a token filter, or a tokenizer, that can take a token, prepend specific characters to the beginning of it, and then emit both the new token and the original?

For example, given the sentence "I have a cat" and the specified string "L", the expected output (assuming there's also a whitespace tokenizer) would be: ['I', 'LI', 'have', 'Lhave', 'a', 'La', 'cat', 'Lcat']. Having this option would really help me handle common prefix characters in my index.
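To make the requested behavior concrete, here's a minimal Python sketch (the function name is my own) of what such an expanding filter would emit:

```python
def prepend_expand(tokens, prefix):
    """For each input token, emit the original token followed by a
    copy with the given prefix string prepended."""
    out = []
    for tok in tokens:
        out.append(tok)
        out.append(prefix + tok)
    return out

print(prepend_expand("I have a cat".split(), "L"))
# → ['I', 'LI', 'have', 'Lhave', 'a', 'La', 'cat', 'Lcat']
```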

EDIT: Might as well throw in another question while I'm here. I started working on a synonym token filter and it's working great. So far I've been placing all of my synonyms directly inside the filter, but I want to move them into a separate file inside the Elasticsearch config directory. I intend for this synonyms file to be editable via a separate web application I'll be developing later, for ease of use. That means I need remote access to this file, but I'm not certain how to do that exactly... any suggestions?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.