I have a use case where several different languages can be used in a forum.
I don't have any indication which language is used by which users, and in
some cases, multiple languages might be used in the same posts.
My document includes the typical properties one might expect for a forum
post along with the body of the message.
What strategies might I employ so users can search for the German posts
using German keywords, Japanese posts using Japanese keywords, etc.?
Sounds like you should simply identify the language(s) of documents
before indexing them and then analyze them appropriately.
Solr 3.5.0 includes a language identifier that you could extract and
reuse.
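
To make the "identify, then analyze appropriately" idea concrete, here is a
minimal sketch in Python. Everything in it is an assumption for illustration:
the stopword lists are toys, guess_languages() is a crude heuristic standing in
for a real language identifier, and the body_<lang> field names are
hypothetical. It only shows the shape of the step: detect the language(s) of a
post, then copy the body into one field per detected language so each field can
get its own analyzer.

```python
import re

# Toy stopword lists (assumption): a real setup would use a proper
# language-identification library instead of this heuristic.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "ich", "mit"},
    "en": {"the", "and", "is", "not", "a", "an", "i", "with", "of", "to"},
}


def is_japanese_char(ch):
    """True for Hiragana, Katakana, or CJK ideographs.

    Note: CJK ideographs are shared with Chinese, so this over-matches.
    """
    code = ord(ch)
    return (0x3040 <= code <= 0x30FF) or (0x4E00 <= code <= 0x9FFF)


def guess_languages(text):
    """Return the set of language codes that appear to be present in text."""
    found = set()
    if any(is_japanese_char(ch) for ch in text):
        found.add("ja")
    # Runs of Unicode letters, lowercased, as a rough word list.
    words = set(re.findall(r"[^\W\d_]+", text.lower()))
    for lang, stops in STOPWORDS.items():
        # Require two distinct stopwords to reduce accidental matches.
        if len(words & stops) >= 2:
            found.add(lang)
    return found


def build_document(post):
    """Copy the body into one hypothetical field per detected language."""
    doc = dict(post)
    for lang in guess_languages(post["body"]):
        doc["body_" + lang] = post["body"]
    return doc
```

A post that mixes languages simply ends up with more than one body_<lang>
field, which covers the "multiple languages in the same post" case.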
Detecting the language won't be the hard part here, I think. The hard
part is defining a proper mapping for those languages (multi-fields?
different properties? etc.). I think multi-fields should work really
well here, but I don't know of any strategy that would magically
analyze queries with German analyzers and search German fields, and
vice versa for English/Japanese, etc., without the user explicitly
setting the language, or "trying out" the search against all the
multi-fields and depending on the _score to sort them right...
This does not sound as "simple" as Otis' response suggests :), so I
may be missing something.
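
For reference, the "try the search against all the multi-fields" option could
look roughly like the sketch below. It is written as Python dictionaries that
serialize to request bodies, and it assumes the mapping syntax of recent
Elasticsearch releases (a "fields" block on a text field); older versions
express multi-fields differently. The field name "body" and the sub-field names
are hypothetical. The analyzer names "german", "english" and "cjk" are built-in
language analyzers; a dedicated Japanese analyzer would need a plugin.

```python
import json

# Hypothetical sub-field name -> built-in Elasticsearch analyzer name.
LANGUAGE_ANALYZERS = {"de": "german", "en": "english", "ja": "cjk"}


def build_mapping():
    """Mapping with one sub-field of "body" per language analyzer."""
    sub_fields = {
        lang: {"type": "text", "analyzer": analyzer}
        for lang, analyzer in LANGUAGE_ANALYZERS.items()
    }
    return {"properties": {"body": {"type": "text", "fields": sub_fields}}}


def build_query(user_input):
    """Search every language sub-field and let the scores sort the hits.

    multi_match analyzes the query text with each field's own analyzer,
    and "most_fields" combines the scores of all matching fields.
    """
    fields = ["body"] + ["body." + lang for lang in LANGUAGE_ANALYZERS]
    return {
        "query": {
            "multi_match": {
                "query": user_input,
                "fields": fields,
                "type": "most_fields",
            }
        }
    }


def to_json(body):
    """Serialize a request body for sending to the cluster."""
    return json.dumps(body, ensure_ascii=False, indent=2)
```

The trade-off is the one raised above: the user never has to pick a language,
but every query runs against every sub-field and ranking depends entirely on
_score.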
I wonder if a "smart" analyzer could be employed at both index and search
time?
At index time, the smart analyzer would introduce a new multitype to the
mapping for the detected language, and at search time, the smart analyzer
would do something similar?
If this can indeed be used for multilingual situations like this one
(I don't know), it would be great for the ES documentation to include
that example, because dealing with this sort of problem will become a
FAQ if it isn't one already.
I'm all eyeballs in this thread!
Re search-time language identification, the common problem is that
query strings are too short to identify reliably.
So in my experience one tries to figure out the user's language some
other way (explicit selection, from preferences, or...)
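
A minimal sketch of that fallback logic, in Python: if the user's language is
known from an explicit selection or a stored preference, search only that
language's field; otherwise search all of them. The language codes, the
function name search_fields() and the "body.<lang>" field naming are all
hypothetical, chosen only for illustration.

```python
# Languages the forum is assumed to support (hypothetical list).
ALL_LANGUAGES = ("de", "en", "ja")


def search_fields(preferred_language=None, base_field="body"):
    """Pick the fields to query from the user's known language, if any.

    No attempt is made to detect the language of the query string itself,
    since short queries are hard to identify reliably.
    """
    if preferred_language in ALL_LANGUAGES:
        return [f"{base_field}.{preferred_language}"]
    # Unknown or unsupported preference: fall back to every language field.
    return [f"{base_field}.{lang}" for lang in ALL_LANGUAGES]
```

For example, search_fields("de") restricts the search to the German field,
while search_fields() with no preference falls back to all language fields.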