Search across types with same field but different analyzers

Jorn_Kottmann · July 9, 2012, 1:41pm

Hi all,

I found a couple of posts about multilingual indexing.
In them it was recommended to either use one index per language
or use a field per language.

I am currently looking into a system that defines one type
per language, where all types use the same field for the text. The
analyzer then varies per type, so multiple analyzers are used for the
same field.

Is that ok to do?

The behavior I am trying to debug is inconsistent responses. The index
does not change, but repeating the same query gives a varying
number of hits, or only one hit when there should be twenty.

Thanks for any help!

Jörn

jprante · July 9, 2012, 4:17pm

Hi Jörn,

In Lucene/Elasticsearch, when you search a field, the query words are
analyzed once at search time, without regard to the analyzer used at index
time, so you can expect a varying number of hits when searching over many
types where different analyzers have been used.

A better solution for multilingual search is yakaz' combo analyzer. The
combo analyzer is a single analyzer but combines multiple subanalyzers by
concatenating the generated tokens (so, many analyzers can be applied to a
single field which is usually not possible in Lucene).

See yakaz/elasticsearch-analysis-combo (Elasticsearch Combo Analyzer) on GitHub.

Best regards,

Jörg

Jorn_Kottmann · July 9, 2012, 4:33pm

On 07/09/2012 06:17 PM, Jörg Prante wrote:

> in Lucene/Elasticsearch, if you search a field, the words are analyzed
> once at search time without notice of the analyzer used at index time,
> so it's obvious you get varying number of hits searching over many
> types where different analyzers have been used.

Thanks for your answer.

Can you explain that a bit further? Even if the analyzers do not match,
I would still expect to get the same response
from ES when I re-send the same query. The result might not be
the desired one, but it should be reproducible.

Doesn't the mapping of a field to an analyzer take care of choosing the
right analyzer at index and search time?

Jörn

jprante · July 9, 2012, 5:45pm

I probably misunderstood; it's hard to follow without a real example. Can
you provide a gist with a little demo where the varying results can
be reproduced?

Best regards,

Jörg

Jorn_Kottmann · July 11, 2012, 1:24pm

Hello,

I was unable to reproduce the exact issue I am seeing in the system I
am investigating.
But a similar problem seems to exist in my sample as well.

The index with the name "type-test" is created like this:

{
  "settings": {
    "number_of_shards": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "snowball_eng": {
            "type": "snowball",
            "language": "English"
          },
          "snowball_fra": {
            "type": "snowball",
            "language": "French"
          }
        }
      }
    }
  },
  "mappings": {
    "type_eng": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "field1": {
          "type": "string",
          "analyzer": "snowball_eng"
        }
      }
    },
    "type_fra": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "field1": {
          "type": "string",
          "analyzer": "snowball_fra"
        }
      }
    }
  }
}

Please note that there is one field called "field1" which is used in the
two types type_eng and type_fra.
Each type has a different snowball analyzer configured for it.

I put some content there:

... /type-test/type_eng/1
{
  "field1": "Fina"
}

.../type-test/type_fra/2
{
  "field1": "Fina"
}

The word Fina is stemmed like this [1]:
English: Fina -> Fina
French: Fina -> Fin

Searching:
../type-test/_search
{
  "query": {
    "field": {
      "field1": "Fina"
    }
  }
}

That just returns the document with id 2.
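
One way to see how a single hit can come about is a toy model of the setup above. This is only a sketch under assumptions: the stems are hard-coded from the values quoted in this post (lowercased, on the assumption that the analyzers lowercase tokens), and the names `TOY_STEMS`, `analyze`, and `search` are hypothetical helpers, not Elasticsearch or Snowball APIs. It shows that if the query text is analyzed only once, with one language's analyzer, it can match only the document indexed with that same analyzer.

```python
# Toy model: stems are hard-coded from the values quoted in the thread,
# lowercased as an assumption about the analyzers. This is not Snowball.
TOY_STEMS = {
    "english": {"fina": "fina"},
    "french": {"fina": "fin"},
}


def analyze(text, language):
    """Lowercase the text and look up its toy stem for the given language."""
    token = text.lower()
    return TOY_STEMS[language].get(token, token)


# Index time: each type analyzes field1 with its own analyzer.
DOCS = {
    "1": ("type_eng", "english", "Fina"),
    "2": ("type_fra", "french", "Fina"),
}
INDEXED = {
    doc_id: analyze(text, language)
    for doc_id, (_type_name, language, text) in DOCS.items()
}


def search(query_text, query_language):
    """Return ids whose indexed term equals the query term, analyzed once."""
    term = analyze(query_text, query_language)
    return sorted(doc_id for doc_id, indexed in INDEXED.items() if indexed == term)


print(search("Fina", "french"))
print(search("Fina", "english"))
```

In this model, analyzing the query with the French stemmer yields only document 2, which is consistent with the result reported above; analyzing it with the English stemmer would yield only document 1.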

To me it looks like using types to have different analyzers for
different languages is not
supported. Is that right, or was something done wrong?

Thanks for your help!

Jörn

[1] Python NLTK Stemming and Lemmatization Demo

Clinton_Gormley · July 12, 2012, 1:59pm

Hiya

> Please note that there is one field called "field1" which is used in
> the two types type_eng and type_fra.

This is the issue.

> {
>   "query": {
>     "field": {
>       "field1": "Fina"
>     }
>   }
> }

ES finds the first field called 'field1' and uses the mapping for that
field. So you may randomly get English or French analysis happening.

Either specify the actual field 'type_eng.field1' or specify the
search_analyzer to use in your search request.
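
A minimal sketch of the first option, assuming the request body keeps the same `field` query shape used earlier in the thread and only the field name changes to the type-prefixed form. The helper name `field_query` is hypothetical, and sending the body to the `_search` endpoint is left out.

```python
import json


def field_query(field_name, text):
    """Build a `field` query body shaped like the one posted in the thread."""
    return {"query": {"field": {field_name: text}}}


# Type-prefixed field names, one request body per type.
eng_body = field_query("type_eng.field1", "Fina")
fra_body = field_query("type_fra.field1", "Fina")

print(json.dumps(eng_body, indent=2))
print(json.dumps(fra_body, indent=2))
```

With one body per type, each search names the field whose mapping should be used, rather than leaving the choice to whichever 'field1' is found first.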

clint

Jorn_Kottmann · July 12, 2012, 2:14pm

Thanks for clarifying that. When searching
for the 'actual' field or fields, it works.

Does ES internally use one field or multiple fields for field1?

I am asking because I would like to understand whether the scoring
is done differently compared to a one-field-per-language approach.

Jörn

Jorn_Kottmann · July 12, 2012, 2:25pm

On 07/12/2012 03:59 PM, Clinton Gormley wrote:

> ES finds the first field called 'field1' and uses the mapping for that
> field. So you may randomly get English or French analysis happening.

Should we open an issue for this? Is it considered a bug?

Jörn

Clinton_Gormley · July 13, 2012, 8:50am

On Thu, 2012-07-12 at 16:25 +0200, Jörn Kottmann wrote:

> On 07/12/2012 03:59 PM, Clinton Gormley wrote:
>
>> ES finds the first field called 'field1' and uses the mapping for that
>> field. So you may randomly get English or French analysis happening.
>
> Should we open an issue for this? Is it considered a bug?

I don't think there is much that can be done about it.

That said, for certain types of query it might be possible to
detect multiple fields with the same name but different mappings and
do a dismax query on those. Doing the Right Thing automatically
wouldn't always be possible.

Open an issue to track it, and to alert others who run into the same
problem. But for now, I think we just have to live with it.
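
Until something like that is automatic, the dismax idea can be written out by hand. The following is a rough sketch, assuming one `field` clause per type-prefixed field name wrapped in a `dis_max` query. The helper name `dis_max_over_types` is hypothetical, and whether the resulting scores match a one-field-per-language setup is not something this thread confirms.

```python
import json


def dis_max_over_types(type_names, field_name, text):
    """Build a dis_max query with one `field` clause per type-prefixed field."""
    queries = [
        {"field": {f"{type_name}.{field_name}": text}}
        for type_name in type_names
    ]
    return {"query": {"dis_max": {"queries": queries}}}


body = dis_max_over_types(["type_eng", "type_fra"], "field1", "Fina")
print(json.dumps(body, indent=2))
```

Each clause names a concrete field, so each one would be analyzed according to its own type's mapping, and `dis_max` takes the best-scoring clause per document.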

clint
