Analyzing a Hebrew string

Hey guys. I need help with writing an analyzer for a Hebrew string.

Basically, Hebrew words can contain prefix/suffix characters that I want to tokenize. The catch is that I want my list of terms to include each word both with and without the prefix characters. For example:
"מלחמה ושלום" — this string contains the prefix letter "ו". I need the analyzer to produce the following terms:
['מלחמה', 'ושלום', 'שלום']. Naturally, I also need the analyzer to split the string on whitespace.

So far, I've written a basic regex pattern to locate the specific characters I want to capture, which works, but I'm not sure how to proceed from there. Any input on the matter would be appreciated.
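To make the expected behavior concrete, here's a minimal Python sketch of the tokenization I'm after (the prefix set and function name are my own placeholders; the real list of prefix letters would be longer):

```python
# Placeholder set of single-letter Hebrew prefixes (just ו "and" here;
# the full set of prefixes to handle is an open question).
PREFIXES = "ו"

def analyze(text):
    """Split on whitespace; for each token that starts with a known
    prefix letter, emit both the original token and the stripped form."""
    tokens = []
    for word in text.split():
        tokens.append(word)
        if len(word) > 1 and word[0] in PREFIXES:
            tokens.append(word[1:])
    return tokens

print(analyze("מלחמה ושלום"))
# → ['מלחמה', 'ושלום', 'שלום']
```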

Hello ProxyMasterMerger!

I apologize in advance, as I don't speak or read Hebrew. Elasticsearch ships with a bunch of different analyzers, and there are also some plugins you could use. In our documentation, we refer to this Hebrew analyzer: https://github.com/synhershko/elasticsearch-analysis-hebrew

Another option would be to use the regex you mentioned in a custom analyzer, via the pattern tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html . I don't understand Hebrew, but if 'ושלום' and 'שלום' mean the same thing, you could use synonyms. In your scenario, though, I think it would be better to have two analyzers: one using the whitespace analyzer and another custom one built from your regex, and then search both fields at the same time.
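As a rough skeleton (the index name, analyzer/tokenizer names, and the pattern are all placeholders; substitute your own regex), the custom analyzer plus a multi-field mapping could look something like:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      },
      "analyzer": {
        "hebrew_prefix": {
          "type": "custom",
          "tokenizer": "my_pattern_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "whitespace",
        "fields": {
          "prefixed": {
            "type": "text",
            "analyzer": "hebrew_prefix"
          }
        }
      }
    }
  }
}
```

Note that by default the pattern tokenizer splits on the regex; it also accepts a `group` parameter if you want it to capture matches instead.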


Thanks for the response. I'll look into the analyzer; perhaps it will clue me in on some additional details.

Hey, sorry for bringing this thread back up, but I have another question. I tried to find an answer in the docs, but to no avail...

I already know of a way to strip prefix/suffix characters thanks to the pattern replace filter, but I'm stuck on finding something that does the opposite. Namely, is there a token filter, or a tokenizer, that can take a token, prepend specific characters to the beginning of it, and then emit both the new token and the original?

For example, given the sentence "I have a cat" and the specified string "L", the expected output (assuming there's also a whitespace tokenizer) would be: ['I', 'LI', 'have', 'Lhave', 'a', 'La', 'cat', 'Lcat']. Having this option would really help me handle common prefix characters in my index.
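To make the requested behavior concrete, here's a minimal Python sketch (the function name is my own) of what such an expanding filter would emit:

```python
def prepend_expand(tokens, prefix):
    """For each input token, emit the original token followed by a
    copy with the given prefix string prepended."""
    out = []
    for tok in tokens:
        out.append(tok)
        out.append(prefix + tok)
    return out

print(prepend_expand("I have a cat".split(), "L"))
# → ['I', 'LI', 'have', 'Lhave', 'a', 'La', 'cat', 'Lcat']
```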

EDIT: Might as well throw in another question while I'm here. I started working on a synonym token filter and it's working great. So far I've been placing all of my synonyms directly inside the filter, but I want to move them into a separate file inside the Elasticsearch config directory. I intend for this synonyms file to be editable via a separate web application I'll be developing later, for ease of use. That means I need remote access to this file, but I'm not certain how to do that exactly... any suggestions?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.