Search across types with same field but different analyzers


(Jörn Kottmann) #1

Hi all,

I found a couple of posts about multilingual indexing.
In them it was recommended to either use one index per language
or use a field per language.

I am currently looking into a system where they defined a type
per language and all types use the same field for the text. The
analyzers vary per type, so it's using multiple analyzers for the
same field.

Is that ok to do?

The behavior I am trying to debug is random responses: the index is
not changed, but repeating the same query gives a varying
number of hits, or only one hit when there should be twenty.

Thanks for any help!

Jörn


(Jörg Prante) #2

Hi Jörn,

In Lucene/Elasticsearch, when you search a field, the query words are
analyzed once at search time, without regard to the analyzer used at index
time, so it's not surprising that you get a varying number of hits when
searching over many types where different analyzers have been used.

A better solution for multilingual search is yakaz' combo analyzer. The
combo analyzer is a single analyzer but combines multiple subanalyzers by
concatenating the generated tokens (so, many analyzers can be applied to a
single field which is usually not possible in Lucene).

See https://github.com/yakaz/elasticsearch-analysis-combo

Best regards,

Jörg


(Jörn Kottmann) #3

On 07/09/2012 06:17 PM, Jörg Prante wrote:

in Lucene/Elasticsearch, if you search a field, the words are analyzed
once at search time without notice of the analyzer used at index time,
so it's obvious you get varying number of hits searching over many
types where different analyzers have been used.

Thanks for your answer.

Can you explain that to me a bit further? If the analyzers do not match,
I would still expect to get the same response from ES if I re-send the
same query. The result might not be the desired one, but it should be
reproducible.

Doesn't the mapping of a field to an analyzer take care of choosing the
right analyzer at index and search time?

Jörn


(Jörg Prante) #4

I probably misunderstood; it's hard to follow without a real example. Can
you provide a gist with a little demo where the varying results can
be reproduced?

Best regards,

Jörg


(Jörn Kottmann) #5

Hello,

I was unable to reproduce the exact issue I am seeing in the system I
am investigating, but a similar problem seems to exist in my sample too.

The index with the name "type-test" is created like this:

{
  "settings": {
    "number_of_shards": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "snowball_eng": {
            "type": "snowball",
            "language": "English"
          },
          "snowball_fra": {
            "type": "snowball",
            "language": "French"
          }
        }
      }
    }
  },
  "mappings": {
    "type_eng": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "field1": {
          "type": "string",
          "analyzer": "snowball_eng"
        }
      }
    },
    "type_fra": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "field1": {
          "type": "string",
          "analyzer": "snowball_fra"
        }
      }
    }
  }
}

Please note that there is one field called "field1" which is used in the
two types type_eng and type_fra.
Each type is configured with a different snowball analyzer.

I put some content there:

.../type-test/type_eng/1
{
  "field1": "Fina"
}

.../type-test/type_fra/2
{
  "field1": "Fina"
}

The word Fina is stemmed like this [1]:
English: Fina -> Fina
French: Fina -> Fin

Searching:
../type-test/_search
{
  "query": {
    "field": {
      "field1": "Fina"
    }
  }
}

That just returns the document with id 2.

To me it looks like using types to have different analyzers for
different languages is not supported. Is that right, or was something
done wrong?

Thanks for your help!

Jörn

[1] http://text-processing.com/demo/stem/


(Clinton Gormley) #6

Hiya

Please note that there is one field called "field1" which is used in
the two types type_eng and type_fra.

This is the issue.

{
  "query": {
    "field": {
      "field1": "Fina"
    }
  }
}

ES finds the first field called 'field1' and uses the mapping for that
field. So you may randomly get English or French analysis happening.
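
To make the effect concrete, here is a small Python toy model of the sample index. It is only a sketch: the two "analyzers" are hypothetical stand-ins hard-coded to the stems reported earlier in the thread (Fina -> Fina for English, Fina -> Fin for French, plus lowercasing), not the real Snowball stemmers, and `search` simply takes the resolved type as an argument instead of modelling how ES picks it.

```python
# Toy model only: the analyzers below are hard-coded stand-ins, not Snowball.

def analyze_eng(text):
    # Stand-in for snowball_eng: lowercase, "fina" stays "fina".
    return text.lower()

def analyze_fra(text):
    # Stand-in for snowball_fra: lowercase, "fina" becomes "fin".
    token = text.lower()
    return token[:-1] if token.endswith("a") else token

ANALYZERS = {"type_eng": analyze_eng, "type_fra": analyze_fra}

# Both types write into the same field name, so the indexed terms differ.
DOCS = {"1": ("type_eng", "Fina"), "2": ("type_fra", "Fina")}
INDEX = {doc_id: ANALYZERS[t](text) for doc_id, (t, text) in DOCS.items()}

def search(query, resolved_type):
    """Analyze the query with the mapping of whichever type was resolved."""
    term = ANALYZERS[resolved_type](query)
    return sorted(doc_id for doc_id, indexed in INDEX.items() if indexed == term)

print(search("Fina", "type_fra"))  # French mapping resolved for field1
print(search("Fina", "type_eng"))  # English mapping resolved for field1
```

In this model the French mapping finds only document 2 and the English mapping finds only document 1, which matches the single hit reported in the sample.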

Either specify the actual field 'type_eng.field1', or specify the
search_analyzer to use in your search request.
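
As a rough sketch, the two options could look like the request bodies built below in Python. The helper names are made up for illustration. The second form assumes that the `field` query accepts an `analyzer` parameter in the same way `query_string` does; please check that against the ES version you are running.

```python
import json

def field_query_by_type(type_name, field, text):
    # Option 1: address the field through its type, e.g. "type_eng.field1",
    # so the mapping (and analyzer) of that type is used.
    return {"query": {"field": {f"{type_name}.{field}": text}}}

def field_query_with_analyzer(field, text, analyzer):
    # Option 2: keep the bare field name but name the analyzer explicitly.
    # Assumption: "analyzer" is accepted here as it is by query_string.
    return {"query": {"field": {field: {"query": text, "analyzer": analyzer}}}}

print(json.dumps(field_query_by_type("type_eng", "field1", "Fina"), indent=2))
print(json.dumps(field_query_with_analyzer("field1", "Fina", "snowball_eng"), indent=2))
```

Note that the second form analyzes the query with one analyzer only, so it would still match only the documents that were indexed with that analyzer.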

clint


(Jörn Kottmann) #7

Thanks for clarifying that. When searching for the 'actual' field
or fields, it works.

Does ES internally use one or multiple fields for field1?

I am asking because I would like to understand whether the scoring
is done differently compared to a one-field-per-language approach.

Jörn


(Jörn Kottmann) #8

On 07/12/2012 03:59 PM, Clinton Gormley wrote:

ES finds the first field called 'field1' and uses the mapping for that
field. So you may randomly get English or French analysis happening.

Should we open an issue for this? Is it considered a bug?

Jörn


(Clinton Gormley) #9

On Thu, 2012-07-12 at 16:25 +0200, Jörn Kottmann wrote:

On 07/12/2012 03:59 PM, Clinton Gormley wrote:

ES finds the first field called 'field1' and uses the mapping for that
field. So you may randomly get English or French analysis happening.

Should we open an issue for this? Is it considered a bug?

I don't think there is much that can be done about it.

Although, for certain types of query, it might be possible to
detect multiple fields with the same name but different mappings and
do a dismax query on those. Doing the Right Thing automatically
wouldn't always be possible.
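
Done by hand, such a dismax query might look roughly like the body built below in Python. This is only a sketch: the helper name is hypothetical, it reuses the `field` query from the sample in this thread, and it queries each type-qualified field so that each sub-query is analyzed with its own mapping.

```python
import json

def dis_max_over_types(field, text, type_names):
    # One sub-query per type-qualified field, e.g. "type_eng.field1".
    queries = [{"field": {f"{t}.{field}": text}} for t in type_names]
    # dis_max scores each document by its best-matching sub-query.
    return {"query": {"dis_max": {"queries": queries}}}

body = dis_max_over_types("field1", "Fina", ["type_eng", "type_fra"])
print(json.dumps(body, indent=2))
```

With the sample index from earlier in the thread, the intent is that document 1 matches through the English sub-query and document 2 through the French one.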

Open an issue to track it, and to alert others who run into the same
problem. But for now, I think we just have to live with it.

clint

