Specifying analyzer on a per field basis at index time

barnybug · March 6, 2012, 4:01pm

I understand you can specify the analyzer per document at index time
using the _analyzer field in mapping, but is it possible to specify it
in the same way but per field at index time?

Or, if it's not currently possible, how easy would it be to add (happy to
have a crack at it myself)?

thanks

Barnaby

kimchy · March 6, 2012, 8:29pm

No, you can't specify it per field, though why do you want it? Usually, having a different analyzer for each document doesn't make a lot of sense. Usually, it makes more sense to have different fields.

barnybug · March 6, 2012, 9:17pm

Hi,

Thanks for the response.

Currently we're indexing a set of documents in different languages and
using _analyzer mapping to determine the per doc stemming analyzer.

What we'd like to do is index some fields of the documents both stemmed and
unstemmed (e.g. the english analyzer to produce stemmed English and the
'standard' analyzer to produce unstemmed). So using a multi_field seems
applicable, but then the two analyzers are fixed. We'd effectively need to
specify two _analyzer fields.

Essentially the customer wants to be able to do both stemmed (language
specific) searches and unstemmed (general) searches. This comes down to a
requirement to be able to match names, proper nouns, etc in cases where
stemming may interfere but there's no definitive list of these terms that
should not be stemmed.

We considered an index per language, but we're dealing with quite a high
number of languages, so that would likely mean too many indexes.
Using a field per language also presents issues: the general unstemmed
searches would require querying across many fields.

Alternatively we were considering if it'd be easy to develop a tokenizer
that wrapped existing stemming tokenizers but also produced the original
term in addition to the stemmed term.
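
To make the idea concrete, here is a minimal sketch (not from the thread, and not an Elasticsearch or Lucene API) of a wrapper that emits each original term and, at the same position, its stemmed form when the two differ. The names `with_originals` and `toy_stem` are hypothetical, and `toy_stem` is a deliberately naive stand-in for a real language stemmer.

```python
def with_originals(tokens, stem):
    """Yield (position, term) pairs for each original term, followed by
    its stem at the same position when the stem differs from the term."""
    for position, term in enumerate(tokens):
        yield position, term
        stemmed = stem(term)
        if stemmed != term:
            yield position, stemmed


def toy_stem(term):
    # Assumption: a placeholder stemmer that only strips a trailing "s".
    # A real analyzer would use a proper language-specific stemmer here.
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term


if __name__ == "__main__":
    for position, term in with_originals(["running", "shoes"], toy_stem):
        print(position, term)
```

Keeping the stem at the same position as the original is one possible design choice; it would let phrase matching treat the two forms as alternatives rather than as consecutive words.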

Sorry if that makes less than perfect sense!

thanks,

Barnaby

kimchy · March 7, 2012, 11:28am

It makes sense. The problem with using different analyzers on the same field is that all those tokens, from the different languages, end up under the same field, so it's "kinda dirty". How about using a single field called x using the standard analyzer, and x_[langId] for each language? You can use dynamic mapping to automatically map analysis parameters for *_en, or *_de (and so on; see the dynamic templates section of the Elasticsearch mapping documentation).
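
As a rough sketch of that layout (not taken from the thread), a mapping body could pair a catch-all field on the standard analyzer with one dynamic template per language suffix. The type name `doc`, the field name `x`, the template names, and the suffix table are all hypothetical. The `"type": "string"` syntax reflects Elasticsearch releases of that period, and the built-in analyzer names `english` and `german` are assumed to exist in the version in use.

```python
import json

# Assumption: a hypothetical suffix -> analyzer table; extend per language.
LANGUAGE_ANALYZERS = {"en": "english", "de": "german"}


def build_mapping(type_name="doc", base_field="x"):
    """Build a mapping body with one dynamic template per language suffix
    and an explicit base field that uses the standard analyzer."""
    templates = []
    for suffix, analyzer in sorted(LANGUAGE_ANALYZERS.items()):
        templates.append({
            "template_" + suffix: {
                # Any field whose name ends in _<suffix> gets this analyzer.
                "match": "*_" + suffix,
                "mapping": {"type": "string", "analyzer": analyzer},
            }
        })
    return {
        type_name: {
            "dynamic_templates": templates,
            "properties": {
                base_field: {"type": "string", "analyzer": "standard"},
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
```

With a mapping along these lines, a field named `x_en` that first appears in a document would pick up the English analyzer without being declared in advance, while `x` stays unstemmed for general searches.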

barnybug · March 13, 2012, 9:30am

Good plan, thanks for the suggestion.

Barnaby

Sapana_Patel · January 24, 2013, 12:19pm

Hi,

I am facing the same problem but am not able to decide which option to use.
I have one document with the fields id, name, description, datetime and userid.
Of these, only two fields, name and description, can be in any language
(English, German, etc.).

Can you please explain the following sentence with an example, or suggest
which approach I should follow for better performance?

How about using a single field called x using the standard analyzer, and
x_[langId] for each language? You can use dynamic mapping to automatically
map analysis parameters for *_en, or *_de etc.

Please give an example of automatically mapping the analysis parameters.

I have to use the Java API for this. Is that possible with the Java API?

--
Thanks
Sapana
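
For a document shaped like the one described above, the client-side half of the x / x_[langId] approach can be sketched as follows. This is an illustration under assumptions, not an answer given in the thread: the helper name `add_language_fields` and the sample values are hypothetical, no Elasticsearch client call is shown, and it is written in Python only for brevity (the same copy step can be done on a Java `Map` before indexing).

```python
def add_language_fields(doc, lang, text_fields=("name", "description")):
    """Return a copy of doc with a <field>_<lang> copy of each text field.

    The original fields stay in place for unstemmed (standard analyzer)
    searches; the suffixed copies are meant to match a *_<lang> dynamic
    template that applies the language-specific analyzer.
    """
    out = dict(doc)
    for field in text_fields:
        if field in doc:
            out[field + "_" + lang] = doc[field]
    return out


if __name__ == "__main__":
    # Hypothetical sample document with the fields named in the post.
    source = {
        "id": 1,
        "name": "Laufschuhe",
        "description": "Leichte Schuhe zum Laufen",
        "datetime": "2013-01-24T12:19:00",
        "userid": 42,
    }
    print(add_language_fields(source, "de"))
```

A general search would then query `name` and `description`, while a German-specific search would query `name_de` and `description_de`.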
