Dealing with languages

James_Cook · October 21, 2010, 1:08pm

I am indexing documents where some are in English and others are in Arabic.
There is a locale property on each document to indicate its language. What
strategies exist to choose the correct analyzer to use on a field in this
document? Can I somehow apply a language-specific analyzer to, say, the
description field based on the value of the locale field? Or is there some
other approach?

rmuir · October 21, 2010, 1:15pm

If you have mixed English/Arabic, why not just make a combined
analyzer (e.g. one that uses both PorterStemFilter and
ArabicStemFilter/ArabicNormalizationFilter)?

Because the two scripts don't share any characters in Unicode, the
ArabicStemFilter will only work on Arabic text, and won't mess with
the English, and the PorterStemFilter will only work on Latin text,
and won't mess with your Arabic.

Same with the stopwords, etc.

So I don't think you need to use your locale field at all in this case...
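
To illustrate why chaining the two filters is harmless, here is a minimal Python sketch. It is not Lucene code: toy_english_stem and toy_arabic_normalize are hypothetical toy stand-ins for PorterStemFilter and ArabicStemFilter/ArabicNormalizationFilter, and the whitespace tokenizer is an assumption made only to keep the example short. The point it shows is that each step ignores tokens written in the other script.

```python
import re

# Basic Arabic Unicode block; used to decide whether a token is Arabic.
ARABIC = re.compile(r"[\u0600-\u06FF]")


def toy_english_stem(token):
    """Toy stand-in for a Porter-style stemmer (NOT the real algorithm).

    Only pure-ASCII tokens are touched, so Arabic passes through unchanged.
    """
    if not token.isascii():
        return token
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_arabic_normalize(token):
    """Toy stand-in for Arabic normalization and stemming (NOT Lucene's rules).

    Only tokens containing Arabic characters are touched.
    """
    if not ARABIC.search(token):
        return token
    # Fold alef variants (U+0622, U+0623, U+0625) to bare alef (U+0627).
    token = re.sub("[\u0622\u0623\u0625]", "\u0627", token)
    # Strip the definite article (alef + lam) from longer tokens.
    if token.startswith("\u0627\u0644") and len(token) > 4:
        token = token[2:]
    return token


def combined_analyze(text):
    """Run both toy filters over every token, regardless of language."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return [toy_arabic_normalize(toy_english_stem(t)) for t in tokens]


if __name__ == "__main__":
    # Mixed English and Arabic in one string; each filter acts on its own script.
    print(combined_analyze("indexing documents \u0627\u0644\u0643\u062A\u0627\u0628"))
```

Because the same chain runs at index time and at query time, nothing needs to know which language a given document or query is in.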

Clinton_Gormley · October 21, 2010, 1:21pm

I don't think you can. Don't forget that when you search for those
documents, your query string is analysed with the same analyzer - so
which should it choose? English or Arabic?

Don't know if this is the best solution, but you could either have two
different types, e.g. 'english_doc' and 'arabic_doc', or two different
fields, e.g. 'english_content' and 'arabic_content'.

Then when you search, you can search across both types or fields and the
correct analyzer would be applied to each.
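
A rough Python sketch of the two-fields idea, using a toy in-memory index rather than any real Elasticsearch API. The field names come from the suggestion above; the locale codes "en" and "ar", the helper names, and both toy analyzers are assumptions made for illustration. It shows the locale picking the field at index time, and the query string being analyzed once per field at search time.

```python
from collections import defaultdict

ARTICLE = "\u0627\u0644"  # Arabic definite article (alef + lam)


def english_analyze(text):
    # Toy English analyzer: lowercase and strip a trailing "s".
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in text.lower().split()]


def arabic_analyze(text):
    # Toy Arabic analyzer: drop the definite article from longer tokens.
    return [t[2:] if t.startswith(ARTICLE) and len(t) > 4 else t
            for t in text.split()]


# One analyzer per field; the locale decides which field a document fills.
FIELD_ANALYZERS = {
    "english_content": english_analyze,
    "arabic_content": arabic_analyze,
}
LOCALE_TO_FIELD = {"en": "english_content", "ar": "arabic_content"}

# Toy inverted index: (field, term) -> set of document ids.
index = defaultdict(set)


def index_doc(doc_id, locale, description):
    """Store the description in the field that matches the document's locale."""
    field = LOCALE_TO_FIELD[locale]
    for term in FIELD_ANALYZERS[field](description):
        index[(field, term)].add(doc_id)


def search(query):
    """Analyze the query separately for each field and merge the hits."""
    hits = set()
    for field, analyze in FIELD_ANALYZERS.items():
        for term in analyze(query):
            hits |= index.get((field, term), set())
    return hits
```

With this layout the search side never has to guess the query language: each field applies its own analyzer to the same query string, and only the field whose terms match contributes hits.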

(Not sure how the _all field works in this case, but if it is clever
enough, you should just be able to search against that, and your query
string would be analysed differently for each component field)

clint
