Dealing with languages


(James Cook) #1

I am indexing documents where some are English and others are in Arabic.
There is a locale property on each document to indicate its language. What
strategies exist to choose the correct analyzer to use on a field in this
document? Can I somehow apply a language specific analyzer to say the
description field based on the value of the locale field? Or, is there some
other approach?


#2

On Thu, Oct 21, 2010 at 9:08 AM, James Cook jcook@tracermedia.com wrote:

> I am indexing documents where some are English and others are in Arabic.
> There is a locale property on each document to indicate its language. What
> strategies exist to choose the correct analyzer to use on a field in this
> document? Can I somehow apply a language specific analyzer to say the
> description field based on the value of the locale field? Or, is there some
> other approach?

If you have mixed English/Arabic, why not just make a combined
analyzer (e.g. one that uses both PorterStemFilter and
ArabicStemFilter/ArabicNormalizationFilter)?

Because the two scripts don't share any characters in Unicode, the
ArabicStemFilter will only act on Arabic text and won't touch the
English, and the PorterStemFilter will only act on Latin text and
won't touch your Arabic.

Same with the stopwords, etc.

So I don't think you need to use your locale field at all in this case...
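
For what it's worth, here is a rough sketch of what such a combined analyzer might look like as Elasticsearch index settings, built as a plain Python dict so it can be serialized to JSON. It assumes a recent Elasticsearch (7 or later, typeless mappings), which is not what this thread was written against. The analyzer name "english_arabic" and the custom filter keys are made up for illustration; "lowercase", "stop", "stemmer", "porter_stem" and "arabic_normalization" are built-in names in current releases.

```python
import json


def combined_analyzer_settings():
    """Build an index body with one analyzer that chains English and Arabic filters.

    Assumptions: Elasticsearch 7+; "english_arabic", "english_stop",
    "arabic_stop" and "arabic_stemmer" are hypothetical names.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Built-in stopword lists, one filter per language.
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    "arabic_stop": {"type": "stop", "stopwords": "_arabic_"},
                    "arabic_stemmer": {"type": "stemmer", "language": "arabic"},
                },
                "analyzer": {
                    "english_arabic": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "english_stop",
                            "arabic_stop",
                            "arabic_normalization",  # normalize before stemming
                            "porter_stem",           # English stemming
                            "arabic_stemmer",        # Arabic stemming
                        ],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                # The same analyzer is used at index and search time.
                "description": {"type": "text", "analyzer": "english_arabic"}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(combined_analyzer_settings(), indent=2, ensure_ascii=False))
```

The printed JSON would be the body of an index-creation request. Because a single analyzer handles both scripts, the query string goes through the same chain and the locale field is not consulted.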


(Clinton Gormley) #3

On Thu, 2010-10-21 at 09:08 -0400, James Cook wrote:

> I am indexing documents where some are English and others are in
> Arabic. There is a locale property on each document to indicate its
> language. What strategies exist to choose the correct analyzer to use
> on a field in this document? Can I somehow apply a language specific
> analyzer to say the description field based on the value of the locale
> field? Or, is there some other approach?

I don't think you can. Don't forget that when you search for those
documents, your query string is analysed with the same analyzer, so
which should it choose: English or Arabic?

Don't know if this is the best solution, but you could either have two
different types, e.g. 'english_doc' and 'arabic_doc', or two different
fields, e.g. 'english_content' and 'arabic_content'.

Then when you search, you can search across both types or fields and
the correct analyzer would be applied to each.
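
A minimal sketch of the two-fields variant, again as plain Python dicts and again assuming a recent Elasticsearch (7+) rather than the version current when this was posted. The field names come from the suggestion above; the locale codes "en"/"ar", the helper names, and the use of the built-in "english" and "arabic" analyzers with a multi_match query are my assumptions.

```python
# Hypothetical mapping from a locale's language code to the field that holds
# the text. The "en"/"ar" codes are assumptions about what locale contains.
FIELD_BY_LOCALE = {"en": "english_content", "ar": "arabic_content"}


def index_mapping():
    """Mapping with one text field per language, each with its own analyzer."""
    return {
        "mappings": {
            "properties": {
                "locale": {"type": "keyword"},
                "english_content": {"type": "text", "analyzer": "english"},
                "arabic_content": {"type": "text", "analyzer": "arabic"},
            }
        }
    }


def to_indexed_doc(doc):
    """Copy the description into the field that matches the document's locale."""
    # Accept values such as "ar", "ar_EG" or "en-US"; keep only the language part.
    language = doc["locale"].split("_")[0].split("-")[0].lower()
    field = FIELD_BY_LOCALE[language]  # raises KeyError for an unmapped locale
    return {"locale": doc["locale"], field: doc["description"]}


def search_body(query_text):
    """Query both fields; each field analyzes the query text with its own analyzer."""
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": list(FIELD_BY_LOCALE.values()),
            }
        }
    }
```

With this layout the locale is only consulted at index time, to decide which field receives the text. At search time the query runs against both fields, so the caller does not need to know the language of the query string.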

(Not sure how the _all field works in this case, but if it is clever
enough, you should just be able to search against that, and your query
string would be analysed differently for each component field)

clint

