Indexing without knowing the language

I have a use case where several different languages can be used in a forum.
I don't have any indication which language is used by which users, and in
some cases, multiple languages might be used in the same posts.

My document includes the typical properties one might expect for a forum
post along with the body of the message.

What strategies might I employ so users can search for the German posts
using German keywords, Japanese posts using Japanese keywords, etc.?

Hi James,

Sounds like you should simply identify the language(s) of documents
before indexing them and then analyze them appropriately.
There is a language identifier included in Solr 3.5.0 that you could
rip out and reuse.
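
In case it helps, here's a minimal sketch of that index-time approach in Python, assuming the langdetect package (any identifier would do, including the one from Solr); the field names (body_de, etc.) and the supported-language set are just made up for illustration:

```python
# Sketch: detect the language of a post before indexing and copy the body into a
# language-specific field, so that field can be mapped to a matching analyzer.
# Assumes the langdetect package (pip install langdetect); field names are made up.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

SUPPORTED = {"en", "de", "ja"}

def prepare_post(doc):
    """Return the document with its body copied into a per-language field."""
    try:
        lang = detect(doc["body"])
    except LangDetectException:
        lang = "en"                       # detection failed (e.g. empty or numeric text)
    if lang not in SUPPORTED:
        lang = "en"                       # unsupported languages go to the default field
    doc["lang"] = lang
    doc["body_" + lang] = doc["body"]     # e.g. body_de is mapped to a German analyzer
    return doc

print(prepare_post({"title": "Hallo", "body": "Das ist ein Beitrag in unserem Forum."}))
```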

Otis

On Dec 17, 11:42 am, James Cook jc...@pykl.com wrote:

I have a use case where several different languages can be used in a forum.
I don't have any indication which language is used by which users, and in
some cases, multiple languages might be used in the same posts.

My document includes the typical properties one might expect for a forum
post along with the body of the message.

What strategies might I employ so users can search for the German posts
using German keywords, Japanese posts using Japanese keywords, etc.?

Hi James,

there's also Google's "Compact Language Detector",
https://github.com/mikemccand/chromium-compact-language-detector, which
seemed very nice to me, with easy wrapping from other languages (e.g.
the Changing Bits post "Language detection with Google's Compact Language Detector",
https://github.com/jtoy/cld).
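
For reference, calling CLD from Python looks roughly like this; a sketch assuming the pycld2 binding (other wrappers, such as the Ruby one above, expose much the same call):

```python
# Sketch: calling the Compact Language Detector from Python via the pycld2 binding.
# detect() returns (is_reliable, bytes_found, details); details holds the top guesses
# as (language_name, language_code, percent, score) tuples, best guess first.
import pycld2

is_reliable, bytes_found, details = pycld2.detect("Das ist ein Beitrag in unserem Forum.")
lang = details[0][1] if is_reliable else "en"   # fall back when the detector is unsure
print(lang)                                     # expected: 'de'
```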

Detecting the language won't be the hard part here, I think. The hard
part is defining a proper mapping for those languages (multi-fields?
different properties? etc.). I think multi-fields should work really
well here, but I don't know of any strategy that would allow
magically analyzing queries with German analyzers and searching German
fields, and likewise for English/Japanese, etc., without the user
explicitly setting the language, or "trying out" the search against
all the multi-fields and relying on the _score to sort them right...
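
To make the multi-field idea concrete, here's a rough sketch (plain Python dicts in today's Elasticsearch request DSL); the index layout and sub-field names are made up, and "german"/"english"/"cjk" are built-in analyzers:

```python
# Sketch: one `body` field indexed several times with different language analyzers,
# plus a query that "tries out" all sub-fields and lets _score rank the matches.
mapping = {
    "properties": {
        "body": {
            "type": "text",                        # standard analyzer as a catch-all
            "fields": {
                "de": {"type": "text", "analyzer": "german"},
                "en": {"type": "text", "analyzer": "english"},
                "ja": {"type": "text", "analyzer": "cjk"},
            },
        }
    }
}

def search_body(user_query):
    """Query every language sub-field and rely on relevance to rank the matches."""
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                "fields": ["body", "body.de", "body.en", "body.ja"],
            }
        }
    }
```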

This does not sound as "simple" as Otis's response suggests :), so I
may be missing something.

Karel

On Dec 17, 5:42 pm, James Cook jc...@pykl.com wrote:

I have a use case where several different languages can be used in a forum.
I don't have any indication which language is used by which users, and in
some cases, multiple languages might be used in the same posts.

My document includes the typical properties one might expect for a forum
post along with the body of the message.

What strategies might I employ so users can search for the German posts
using German keywords, Japanese posts using Japanese keywords, etc.?

I wonder if a "smart" analyzer could be employed at both index and search
time?

At index time, the smart analyzer would introduce a new multi-field to the
mapping for the detected language, and at search time, the smart analyzer
would do something similar?
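
Something like this, perhaps, for the search side of that idea; a sketch assuming the langdetect package and the body.de/body.en/body.ja multi-fields from earlier in the thread, falling back to querying all fields when detection isn't confident:

```python
# Sketch: detect the query language and aim at the matching sub-field, otherwise
# "try out" all the multi-fields. Assumes the langdetect package and the
# body.de/body.en/body.ja sub-fields from the mapping sketch above.
from langdetect import detect_langs

ALL_FIELDS = ["body", "body.de", "body.en", "body.ja"]

def query_fields(query_text, min_confidence=0.9):
    """Return the list of fields to search for this query."""
    try:
        best = detect_langs(query_text)[0]    # most probable language, with a probability
    except Exception:                         # short or empty queries often fail to detect
        return ALL_FIELDS
    candidate = "body." + best.lang
    if best.prob >= min_confidence and candidate in ALL_FIELDS:
        return [candidate]
    return ALL_FIELDS

print(query_fields("Wie kann ich meine Beiträge durchsuchen?"))
```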

Hi,

On Dec 18, 3:47 am, karmi karel.mina...@gmail.com wrote:

Hi James,

there's also Google's "Compact Language Detector",
https://github.com/mikemccand/chromium-compact-language-detector, which
seemed very nice to me, with easy wrapping from other languages (e.g.
http://blog.mikemccandless.com/2011/10/language-detection-with-google...,
https://github.com/jtoy/cld).

Detecting the language won't be the hard part here, I think. The hard
part is defining a proper mapping for those languages (multi-fields?
different properties? etc.). I think multi-fields should work really
well here, but I don't know of any strategy that would allow
magically analyzing queries with German analyzers and searching German
fields, and likewise for English/Japanese, etc., without the user

Are you referring to?

If this can indeed be used for multilingual situations like this one
(I don't know), it would be great for the ES documentation to include
that example, because dealing with this sort of problem will become a
FAQ if it isn't one already.

Otis

Sematext is Hiring World-Wide

explicitly setting the language, or "trying out" the search against
all the multi-fields and relying on the _score to sort them right...

This does not sound as "simple" as Otis's response suggests :), so I
may be missing something.

Karel

On Dec 17, 5:42 pm, James Cook jc...@pykl.com wrote:

I have a use case where several different languages can be used in a forum.
I don't have any indication which language is used by which users, and in
some cases, multiple languages might be used in the same posts.

My document includes the typical properties one might expect for a forum
post along with the body of the message.

What strategies might I employ so users can search for the German posts
using German keywords, Japanese posts using Japanese keywords, etc.?

Hi,

On Dec 18, 8:04 pm, James Cook jc...@pykl.com wrote:

I wonder if a "smart" analyzer could be employed at both index and search
time?

At index time, the smart analyzer would introduce a new multi-field to the
mapping for the detected language, and at search time, the smart analyzer
would do something similar?

I'm all eyeballs in this thread! :)

Re search-time language identification, the common problem is the
shortness of query strings :(
So in my experience one tries to figure out the user's language some
other way (explicit selection, preferences, or...).
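
For example, something along these lines; the user-profile field and the Accept-Language fallback are made up for illustration, with detection of the (short) query string only as a last resort:

```python
# Sketch of that fallback order: explicit preference first, then the browser locale,
# and only then detection of the (short) query string. The profile dict and the
# Accept-Language parsing are made up for illustration.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def pick_search_language(profile, accept_language, query_text):
    if profile.get("language"):                              # 1. explicit user setting wins
        return profile["language"]
    if accept_language:                                      # 2. UI/browser locale as a hint
        return accept_language.split(",")[0].split("-")[0].strip()
    try:
        return detect(query_text)                            # 3. last resort: the query itself
    except LangDetectException:
        return "en"

print(pick_search_language({}, "de-DE,de;q=0.9", "Suche"))
```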

Otis

Sematext is Hiring World-Wide