Indexing without knowing the language


(James Cook-3) #1

I have a use case where several different languages can be used in a forum.
I don't have any indication which language is used by which users, and in
some cases, multiple languages might be used in the same posts.

My document includes the typical properties one might expect for a forum
post along with the body of the message.

What strategies might I employ so users can search for the German posts
using German keywords, Japanese posts using Japanese keywords, etc.?


(Otis Gospodnetić) #2

Hi James,

Sounds like you should simply identify the language(s) of each document
before indexing it, and then analyze it with the matching analyzer.
Solr 3.5.0 includes a language identifier that you could pull out and
reuse.

Otis
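
The "identify, then analyze" idea above can be sketched roughly as follows. This is a minimal illustration, not the Solr identifier mentioned in the post: detect_language is a toy stopword counter standing in for a real language-identification library, and the stopword lists, the "lang" property, and the "body_<lang>" field names are all assumptions made up for the example. It also picks a single language per post, so it does not handle posts that mix languages.

```python
import re

# Hypothetical, tiny stopword lists. A real system would call a proper
# language-identification library here instead.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}


def detect_language(text, default="unknown"):
    """Guess a language code by counting stopword hits (toy stand-in)."""
    # Treat any hiragana or katakana character as a sign of Japanese.
    if re.search(r"[\u3040-\u30ff]", text):
        return "ja"
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def build_document(post):
    """Copy the post and store its body under a language-specific field."""
    lang = detect_language(post["body"])
    doc = dict(post)
    doc["lang"] = lang
    # Hypothetical naming scheme: each "body_<lang>" field would be mapped
    # to that language's analyzer in the index.
    doc["body_" + lang] = post["body"]
    return doc
```

With a document shaped like this, each language-specific field can be given its own analyzer in the mapping, which is the part the rest of the thread discusses.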

On Dec 17, 11:42 am, James Cook jc...@pykl.com wrote:

I have a use case where several different languages can be used in a forum.
I don't have any indication which language is used by which users, and in
some cases, multiple languages might be used in the same posts.

My document includes the typical properties one might expect for a forum
post along with the body of the message.

What strategies might I employ so users can search for the German posts
using German keywords, Japanese posts using Japanese keywords, etc.?


(Karel Minarik) #3

Hi James,

there's also Google's "Compact Language Detector",
http://code.google.com/p/chromium-compact-language-detector/, which
seemed very nice to me, and is easy to wrap from other languages (e.g.
http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html,
https://github.com/jtoy/cld).

Detecting the language won't be the hard part here, I think. The hard
part is defining a proper mapping for those languages (multifields?,
different properties?, etc.). I think multifields should work really
well here, but I don't know of any strategy which would allow
magically analyzing queries with German analyzers and searching German
fields, and vice versa for English/Japanese, etc., without the user
explicitly setting the language, or "trying out" the search against
all the multi fields and depending on the _score to sort them right...

This does not sound as "simple" as Otis' response suggests :), so I
may be missing something.

Karel
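
A rough sketch of the "try all the multi fields and let _score sort it out" option described above. It only builds the JSON bodies as Python dicts and does not talk to a cluster. Assumptions: it uses the `fields` mapping syntax of more recent Elasticsearch versions rather than the multi_field type from the time of this thread, the field name "body" and the sub-field names are invented for the example, and the built-in "cjk" analyzer is used for Japanese only because it needs no plugin.

```python
import json

# Assumed language codes mapped to built-in Elasticsearch analyzer names.
ANALYZERS = {"en": "english", "de": "german", "ja": "cjk"}


def build_mapping(field="body"):
    """Index the same text once per language, each copy with its own analyzer."""
    sub_fields = {
        lang: {"type": "text", "analyzer": analyzer}
        for lang, analyzer in ANALYZERS.items()
    }
    return {
        "mappings": {
            "properties": {field: {"type": "text", "fields": sub_fields}}
        }
    }


def build_query(text, field="body"):
    """Search every language sub-field and let the scores rank the hits."""
    fields = [f"{field}.{lang}" for lang in ANALYZERS]
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": fields,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_query("Suchmaschine"), indent=2))
```

The trade-off is the one raised in the post: nobody has to pick a language, but every post is analyzed once per language and ranking quality depends entirely on how the per-field scores combine.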



(James Cook-3) #4

I wonder if a "smart" analyzer could be employed at both index and search
time?

At index time, the smart analyzer would introduce a new multitype to the
mapping for the detected language, and at search time, the smart analyzer
would do something similar?


(Otis Gospodnetić) #5

Hi,

On Dec 18, 3:47 am, karmi karel.mina...@gmail.com wrote:

Detecting the language won't be the hard part here, I think. The hard
part is defining proper mapping for those languages (multifields?,
different properties?, etc). I think multifields should work really
well here, but I don't know of any strategy which would allow
magically analyzing queries with German analyzers and searching German
fields, and vice versa for English/Japanese, etc., without the user
explicitly setting the language, or "trying out" the search against
all the multi fields and depending on the _score to sort them right...

Are you referring to this?
http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html

I don't know whether this can be used for multilingual situations
like this one, but if it can, it would be great for the ES documentation
to include that example, because dealing with this sort of problem will
become a FAQ if it isn't one already.

Otis

Sematext is Hiring World-Wide -- http://sematext.com/about/jobs.html


(Otis Gospodnetić) #6

Hi,

On Dec 18, 8:04 pm, James Cook jc...@pykl.com wrote:

I wonder if a "smart" analyzer could be employed at both index and search
time?

At index time, the smart analyzer would introduce a new multitype to the
mapping for the detected language, and at search time, the smart analyzer
would do something similar?

I'm all eyeballs in this thread! :)

Re search-time language identification, the common problem is the
shortness of query strings :(
So in my experience one tries to figure out the user's language some
other way (explicit selection, from preferences, or...)

Otis
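
One way to read the advice above, as a small sketch: when the user's language is known from an explicit choice or a stored preference, search only that language's field; otherwise fall back to searching all of them rather than guessing from a short query. The field names and the preferred_lang parameter are hypothetical and only serve the illustration.

```python
# Hypothetical per-language field names; adjust to the actual mapping.
LANGUAGE_FIELDS = {"en": "body.en", "de": "body.de", "ja": "body.ja"}


def fields_for_search(preferred_lang=None):
    """Pick the fields to query from a known user language, if there is one."""
    if preferred_lang in LANGUAGE_FIELDS:
        # Language known from user settings or an explicit selection.
        return [LANGUAGE_FIELDS[preferred_lang]]
    # Unknown or unsupported language: search every language field.
    return list(LANGUAGE_FIELDS.values())
```

For example, fields_for_search("de") would restrict the search to the German field, while fields_for_search() with no preference would query all three.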


