I am facing the same problem and cannot decide which option to use.
I have one document type with the fields id, name, description, datetime and userid.
Of these, only two fields, name and description, can be in any
language (English, German, etc.).
Could you please explain the following sentence with an example, or suggest
which approach I should follow for better performance?
How about using a single field called x using the standard analyzer, and
x_[langId] for each language? You can use dynamic mapping to automatically
map analysis parameters for *_en, or *_de etc.
Please give an example of mapping the analysis parameters automatically.
I have to use the Java API for this. Is that possible with the Java API?
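For what it's worth, here is a minimal sketch of what the quoted dynamic-mapping suggestion could look like. It only builds the mapping JSON with plain Java, so it runs without Elasticsearch on the classpath. The type name "document", the template names lang_en / lang_de and the choice of the "english" and "german" analyzers are my assumptions, not something stated in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DynamicTemplateSketch {

    // Builds one dynamic template: any field whose name matches "*_<suffix>"
    // is mapped as a string analyzed with the given analyzer.
    // The template name "lang_<suffix>" is an arbitrary, hypothetical label.
    public static String template(String suffix, String analyzer) {
        return "{\"lang_" + suffix + "\":{"
                + "\"match\":\"*_" + suffix + "\","
                + "\"mapping\":{\"type\":\"string\",\"analyzer\":\"" + analyzer + "\"}}}";
    }

    // Builds the whole mapping for one type from a suffix -> analyzer table.
    public static String buildMapping(String type, Map<String, String> analyzersBySuffix) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"").append(type).append("\":{\"dynamic_templates\":[");
        boolean first = true;
        for (Map.Entry<String, String> e : analyzersBySuffix.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append(template(e.getKey(), e.getValue()));
            first = false;
        }
        sb.append("]}}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed language suffixes and analyzer names.
        Map<String, String> analyzers = new LinkedHashMap<>();
        analyzers.put("en", "english");
        analyzers.put("de", "german");
        System.out.println(buildMapping("document", analyzers));
    }
}
```

The resulting string would be sent as the mapping source through the put mapping API (from Java, presumably via the admin indices client). With such a mapping in place, a document would carry name (default standard analyzer) plus name_en or name_de depending on its language, and the same for description.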
On Wednesday, March 7, 2012 4:58:17 PM UTC+5:30, kimchy wrote:
It makes sense, the problem with using different analyzers on the same
field is that all those tokens, from the different languages, end up under
the same field, so it's "kinda dirty". How about using a single field
called x using the standard analyzer, and x_[langId] for each language? You
can use dynamic mapping to automatically map analysis parameters for *_en,
or *_de (and so on, see more here under dynamic templates:
On Tuesday, March 6, 2012 at 11:17 PM, barnybug wrote:
Thanks for the response.
Currently we're indexing a set of documents in different languages and
using _analyzer mapping to determine the per doc stemming analyzer.
What we'd like to do is index some fields of the documents both stemmed
and unstemmed (e.g. the english analyzer to produce stemmed English and the
'standard' analyzer to produce unstemmed). Using a multi_field seems
applicable, but then the two analyzers are fixed. We would need to specify
two _analyzer fields.
Essentially the customer wants to be able to do both stemmed (language
specific) searches and unstemmed (general) searches. This comes down to a
requirement to be able to match names, proper nouns, etc in cases where
stemming may interfere but there's no definitive list of these terms that
should not be stemmed.
We considered an index per language, but it's quite a high number of
languages we're dealing with, so that would likely be too many indexes.
Using a field per language also presents issues - to do the general
unstemmed searches would require querying across many fields.
Alternatively, we were considering whether it'd be easy to develop a tokenizer
that wrapped existing stemming tokenizers but also produced the original
term in addition to the stemmed term.
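To make the idea concrete, here is a rough sketch in plain Java of a wrapper that emits each original term and, when it differs, the stemmed term as well. It is not a Lucene tokenizer or filter; toyStem is a hypothetical stand-in for a real language-specific stemmer, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class StemAndOriginalSketch {

    // Toy stand-in for a real stemmer (hypothetical, illustration only):
    // strips a trailing "ing" or a trailing "s" from longer terms.
    public static String toyStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Emits every original term, followed by its stem when the stem differs.
    public static List<String> expand(List<String> terms, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(term);
            String stem = stemmer.apply(term);
            if (!stem.equals(term)) {
                out.add(stem);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("searching", "names", "kimchy");
        // prints [searching, search, names, name, kimchy]
        System.out.println(expand(terms, StemAndOriginalSketch::toyStem));
    }
}
```

In a real analysis chain the stem would presumably be emitted at the same position as the original term, so phrase queries keep working. The drawback is the one raised elsewhere in this thread: stemmed and unstemmed tokens end up mixed in one field, so an unstemmed-only search can still match a stem.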
Sorry if that makes less than perfect sense!
On Tuesday, 6 March 2012 20:29:57 UTC, kimchy wrote:
No, you can't specify it per field, though why do you want it? Usually,
having a different analyzer for each document doesn't make a lot of sense.
Usually, it makes more sense to have different fields.
On Tuesday, March 6, 2012 at 6:01 PM, barnybug wrote:
I understand you can specify the analyzer per document at index time
using the _analyzer field in mapping, but is it possible to specify it
in the same way but per field at index time?
Or, if not currently possible, how easy would it be to add (happy to have a
crack at it myself)?