Indexing without knowing the language

James_Cook_3 · December 17, 2011, 4:42pm

I have a use case where several different languages can be used in a forum.
I don't have any indication which language is used by which users, and in
some cases, multiple languages might be used in the same posts.

My document includes the typical properties one might expect for a forum
post along with the body of the message.

What strategies might I employ so users can search for the German posts
using German keywords, Japanese posts using Japanese keywords, etc.?

otisg · December 18, 2011, 5:45am

Hi James,

Sounds like you should simply identify the language(s) of documents
before indexing them and then analyze them appropriately.
There is a language identifier that's included in Solr 3.5.0 you could
rip out.
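
A minimal sketch of the "identify first, then analyze" idea, in Python. The stopword-counting detector is only a stand-in for a real language identifier, and the names `detect_language`, `build_document`, and the `body_<lang>` field naming are hypothetical, not taken from Solr or Elasticsearch.

```python
import re

# Tiny stopword lists standing in for a real language detector (illustrative only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ich", "ein"},
}

def detect_language(text, default="unknown"):
    """Guess a language by counting stopword hits; Japanese via kana characters."""
    # Hiragana or katakana almost always signals Japanese text.
    if re.search(r"[\u3040-\u30ff]", text):
        return "ja"
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in stops for w in words) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

def build_document(post_id, body):
    """Store the body in a field named after the detected language."""
    lang = detect_language(body)
    return {"id": post_id, "lang": lang, f"body_{lang}": body}
```

Each `body_<lang>` field could then be mapped to a language-specific analyzer. This sketch assigns one language per post, so a post that mixes languages would need detection per paragraph or sentence instead.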

Otis


karmi · December 18, 2011, 8:47am

Hi James,

there's also a Google "Compact Language Detector"
(GitHub - mikemccand/chromium-compact-language-detector), which
seemed very nice to me, with easy wrapping from other languages (e.g.
Changing Bits: Language detection with Google's Compact Language Detector,
and GitHub - jtoy/cld: compact language detection in ruby).

Detecting the language won't be the hard part here, I think. The hard
part is defining a proper mapping for those languages (multifields?,
different properties?, etc). I think multifields should work really
well here, but I don't know of any strategy which would allow
magically analyzing queries with German analyzers and searching German
fields, and vice versa for English/Japanese, etc., without the user
explicitly setting the language, or "trying out" the search against
all the multi fields and depending on the _score to sort them right...

This does not sound so "simple" as Otis' response suggests :), so I
may be missing something.
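
The "search against all the multi fields" option can be sketched as follows, in Python. This builds request bodies only. It assumes the sub-field (`fields`) mapping syntax of current Elasticsearch releases, which may differ from what was available when this thread was written, and the field name `body` and the language codes are hypothetical. The analyzer names `english`, `german`, and `cjk` are built-in Elasticsearch analyzers.

```python
import json

# Analyzer names are Elasticsearch built-ins; the field name "body" is an assumption.
LANGUAGE_ANALYZERS = {"en": "english", "de": "german", "ja": "cjk"}

def build_mapping(analyzers=LANGUAGE_ANALYZERS):
    """One 'body' field with a sub-field per language, each with its own analyzer."""
    sub_fields = {
        lang: {"type": "text", "analyzer": name} for lang, name in analyzers.items()
    }
    return {"mappings": {"properties": {"body": {"type": "text", "fields": sub_fields}}}}

def build_query(text, analyzers=LANGUAGE_ANALYZERS):
    """Search every language sub-field and let _score rank the hits."""
    fields = ["body"] + [f"body.{lang}" for lang in analyzers]
    return {
        "query": {
            "multi_match": {"query": text, "fields": fields, "type": "most_fields"}
        }
    }

if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_query("Kindergarten"), indent=2))
```

With this layout every post is analyzed once per language at index time, which costs index size but means no language has to be known at query time.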

Karel


James_Cook_3 · December 19, 2011, 1:04am

I wonder if a "smart" analyzer could be employed at both index and search
time?

At index time, the smart analyzer would introduce a new multitype to the
mapping for the detected language, and at search time, the smart analyzer
would do something similar?

otisg · December 19, 2011, 5:34am

Hi,

On Dec 18, 3:47 am, karmi karel.mina...@gmail.com wrote:

Hi James,

there's also a Google "Compact Language Detector", GitHub - mikemccand/chromium-compact-language-detector: Automatically exported from code.google.com/p/chromium-compact-language-detector, which
seemed very nice to me, with easy wrapping from other languages (eg. http://blog.mikemccandless.com/2011/10/language-detection-with-google..., GitHub - jtoy/cld: compact language detection in ruby).

Detecting the language won't be the hard part here, I think. The hard
part is defining proper mapping for those languages (multifields?,
different properties?, etc). I think multifields should work really
well here, but I don't know of any strategy which would allow
magically analyzing queries with german analyzers and searching german
fields, and vice versa for english/japanese, etc., without the user

Are you referring to?

If (don't know) this can indeed be used for multilingual situations
like this one, it would be great for ES documentation to include that
example, because dealing with this sort of problem will become a FAQ
if it's not already.

Otis

Sematext is Hiring World-Wide -- Jobs - Sematext


otisg · December 19, 2011, 5:37am

Hi,

I'm all eyeballs in this thread!

Re search-time language identification, the common problem is the
shortness of query strings: a one- or two-word query gives a detector
very little to go on.
So in my experience one tries to figure out the user's language some
other way (explicit selection from preferences or...)
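
A small sketch of that fallback logic, in Python, assuming posts are indexed with one sub-field per language. The function name `fields_for_search`, the `body.<lang>` field names, and the list of supported languages are all hypothetical.

```python
def fields_for_search(user_lang=None, supported=("en", "de", "ja")):
    """Pick the fields to search from the user's stated language, if any."""
    if user_lang in supported:
        # Known preference: search that language's sub-field plus the base field.
        return ["body", f"body.{user_lang}"]
    # No usable preference: fall back to every language sub-field.
    return ["body"] + [f"body.{lang}" for lang in supported]
```

The query string itself is never passed to a detector here; the language comes from the user's profile or an explicit selector, and only a missing preference widens the search.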

Otis

