I am trying to figure out how to index the following in ES.
I have documents representing location names (geonames.org). Each
document has a category such as airport, restaurant, river, beach,
etc.
I have translated the categories into 25 languages. For each document I
want to add the translated categories.
Should I:
- create one field per category translation and use a different
analyzer, with a different language, for each category field? or
- create one general category object field with an array of
translations? In that case, how should I set up analysis?
No real production experience yet, but I am using the approach of having one
field per language (with language-specific analyzer configurations attached
to them).
For the bonus question: often you will have some context in your app that
defines the language (e.g. the user selecting a language for their browsing
session, since the rest of the page content will most likely have to be shown
in the correct language too). In that case it is trivial to select the
correct field in the query. Without session context you could run language
detection on the user input and select the correct field based on that. If
you do not have a language detection library, you could try running the
search across all language fields (this may generate some noise, though).
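The field-per-language approach above can be sketched as a mapping with one analyzed field per language, plus a query that targets either the session language or, as a fallback, every language field. Field names (`category_en`, etc.) and the three-language set are illustrative assumptions, not from the thread; the analyzer names are Elasticsearch's built-in language analyzers.

```python
# Hypothetical language set: ISO code -> built-in ES language analyzer.
LANGS = {"en": "english", "fr": "french", "de": "german"}

def category_mapping(langs=LANGS):
    """Build one analyzed string field per language, e.g. category_en."""
    props = {
        f"category_{code}": {"type": "string", "analyzer": analyzer}
        for code, analyzer in langs.items()
    }
    return {"properties": props}

def search_body(query, lang=None, langs=LANGS):
    """Query one language field when the session language is known;
    otherwise fall back to searching every language field (noisier)."""
    if lang is not None:
        fields = [f"category_{lang}"]
    else:
        fields = [f"category_{code}" for code in langs]
    return {"query": {"multi_match": {"query": query, "fields": fields}}}
```

With session context, `search_body("airport", lang="en")` hits only `category_en`; without it, the query fans out across all three fields.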
Check the ML archive, I just asked a similar question the other day,
and the approach we'll be taking is the one where we have a single set
of fields and at index time we explicitly specify the analyzer for
each field depending on the field or document language. We'll be
using our own language identifier library for that (from Sematext). You
could use a language identifier to detect the query language, too.
Precision may suffer if queries are very short or ambiguous (is "die"
an English verb? An English noun? Or a German article?), though this
can be addressed through the UI: giving people options to select from
one or a few guessed languages, allowing people to permanently
store/remember their language selection, and such.
Otis
Sematext is hiring Elasticsearch / Solr developers --
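The query-side routing described above, with a fallback for ambiguous input like "die", might look like the toy sketch below. `guess_languages` is a stand-in for a real language identifier library; the tiny word lists exist only to make the example self-contained, and all names are assumptions.

```python
# Toy ambiguity table: "die" is an English verb/noun or a German article.
AMBIGUOUS = {"die": ["en", "de"]}
KNOWN = {"airport": ["en"], "flughafen": ["de"]}

def guess_languages(query):
    """Stand-in for a real language identifier: return candidate codes."""
    word = query.strip().lower()
    return AMBIGUOUS.get(word) or KNOWN.get(word) or ["en", "de", "fr"]

def fields_for(query, stored_preference=None):
    """Prefer a remembered user language selection; otherwise search the
    field for each guessed language (a UI could instead ask the user to
    pick among the guesses)."""
    if stored_preference:
        return [f"category_{stored_preference}"]
    return [f"category_{code}" for code in guess_languages(query)]
```

Short, ambiguous queries thus degrade gracefully into a multi-field search rather than a wrong single-field guess.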
It really depends; there are several ways to do it. You can create an index
per language, each with its own mapping that has a language-specific
analyzer on the relevant field. Another option is to use a separate field
name for each language, each with its own analyzer associated with it.
Those are usually the best two options.
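The first option, index per language, can be sketched as one settings body per index plus a naming scheme to route reads and writes. The index name pattern, type name, and field name are illustrative assumptions.

```python
def index_settings(analyzer):
    """Per-language index: same mapping shape everywhere, only the
    analyzer on the category field differs."""
    return {
        "mappings": {
            "place": {
                "properties": {
                    "category": {"type": "string", "analyzer": analyzer}
                }
            }
        }
    }

def index_name(lang):
    """Route reads and writes to the per-language index."""
    return f"places_{lang}"
```

A French session would then create/query only `places_fr`, built with `index_settings("french")`; the mapping stays simple because each index holds exactly one language.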
I started working through the field-per-language approach. I guess
I'll be using variants of the snowball analyzer for each language when
available, and a regular language analyzer when not.
For the query part I can play with both a user setting and query
language identification (whenever possible).
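The snowball-with-fallback choice above could be expressed as a small lookup that emits the analysis settings fragment for each language. The language lists here are illustrative, not exhaustive; the `snowball` analyzer with a `language` parameter and the named language analyzers (`thai`, `greek`) are standard Elasticsearch configuration.

```python
# Languages with a snowball stemmer variant (illustrative subset).
SNOWBALL_LANGS = {"en": "English", "fr": "French", "fi": "Finnish"}
# Languages covered only by a plain language analyzer (illustrative).
LANGUAGE_ANALYZERS = {"th": "thai", "el": "greek"}

def analyzer_settings(code):
    """Return the analyzer settings fragment for one language field:
    snowball when available, a language analyzer otherwise, and the
    standard analyzer as a last resort."""
    if code in SNOWBALL_LANGS:
        return {"type": "snowball", "language": SNOWBALL_LANGS[code]}
    if code in LANGUAGE_ANALYZERS:
        return {"type": LANGUAGE_ANALYZERS[code]}
    return {"type": "standard"}
```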
Hi.
Perhaps this is only relevant for a language like Finnish, but we have
indexed two versions of every field: one with no language analysis
for exact matches, and the other with a language analyzer for expanded
matches. Documents are built like this:
title: "some text" -> default analyzer
title_en: "some text" -> english analyzer
language: english
When searching, the language-independent field is given a slight boost over
the linguistically analyzed field. Our documents only contain text in one
language and are indexed with a language field. Searches are then filtered
by this field on the user's chosen language.
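The dual-field query described above, with the exact field slightly boosted and a filter on the document language, might look like the sketch below. The field names follow the example above; the boost value and the `filtered` query shape (current ES syntax at the time of this thread) are assumptions.

```python
def dual_field_query(text, code, name):
    """Search the exact (default-analyzed) title and the
    language-analyzed variant, boosting the exact field, filtered
    to documents in the user's chosen language.
    code: field suffix, e.g. "en"; name: language field value,
    e.g. "english" (matching the example document above)."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "multi_match": {
                        "query": text,
                        # exact field boosted over the stemmed one
                        "fields": ["title^2", f"title_{code}"],
                    }
                },
                "filter": {"term": {"language": name}},
            }
        }
    }
```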
There is also the option of specifying an analyzer for each field at
index time, right?
Are there drawbacks to that approach that make you say that index
per language and field set per language are usually the best
options?
Thanks,
Otis
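For reference, the index-time option being asked about existed as a mapping feature in Elasticsearch releases of this era (removed in 2.0): a document-level `_analyzer` entry could point at a field in the document, so each document's own language value selected the analyzer. The type and field names below are illustrative.

```python
def per_document_analyzer_mapping():
    """Mapping where the analyzer is read per document from the
    "language" field at index time (pre-2.0 Elasticsearch feature)."""
    return {
        "place": {
            "_analyzer": {"path": "language"},  # analyzer name from the doc
            "properties": {
                "category": {"type": "string"},
                "language": {"type": "string", "index": "not_analyzed"},
            },
        }
    }
```

A document indexed with `"language": "french"` would then have its `category` text analyzed with the `french` analyzer, without per-language fields or indices.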