Failing to use dis_max to get accurate scores on multiple fields

AEvar_Arnfjord_Bjarm · March 17, 2012, 2:33pm

I'm having problems getting the best name matches when querying
across multiple fields with dis_max. Here's a complete test case for
what I'm doing:

https://gist.github.com/2059195

In summary I'm trying to search through data like:

doc 1:
"object_name" : "Albino Elephant"
doc 2:
"object_name" : "Cute Albino Elephant"
"object_name_other_language" : "Cute Albino Elephant"
doc 3:
"object_name" : "The Cutest Albino Elephant"
"object_name_other_language" : "The Cutest Albino Elephant"

I.e. a bunch of objects whose names have been translated into
different languages. Each language gets its own field because I want
more accurate matches than I'd get from an array, and I want special
analyzers per language.

When I do a search for "Albino Elephant" with this query:

    "dis_max" : {
        "queries" : [
            {
                "text" : {
                    "object_name.unmunged" : "Albino Elephant"
                }
            },
            {
                "text" : {
                    "object_name_other_language.unmunged" : "Albino Elephant"
                }
            }
        ]
    }

doc #3 is going to be the highest scored hit, but if I were to change
doc #1 to:

doc 1:
"object_name" : "Albino Elephant"
"object_name_other_language" : "Albino Elephant"

it would be the #1 hit, so dis_max seems to be behaving like a
bool/should query and aggregating the scores. I thought the whole
point of dis_max was to "execute all these queries, compute their
scores, and pick the highest one", so you could search through
e.g. multiple translations without giving objects extra score just
because they have more translations.

Or maybe something else is up here. Clinton Gormley suggested on IRC
that this might be because I had more than one shard, but I changed
the gist to create only one shard and still get the same results.

AEvar_Arnfjord_Bjarm · March 22, 2012, 12:14pm

I've played with this some more but I haven't gotten it working.
Anyone have an idea what might be going on here?

kimchy · March 25, 2012, 11:21am

Heya,

If you set explain to true, you will see part of how the score is
calculated. You will see that dis_max is doing its job: it takes the
score of the highest-ranking query that matched as the final score of
the document. But the higher-scoring match is the one on
object_name_other_language, and that's mainly because of the inverse
document frequency: those terms are rarer in that field than in
object_name.

-shay.banon
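
To make the IDF effect concrete, here is a rough Python sketch over the three documents from the first post. It assumes the idf formula of Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)), and deliberately ignores term frequency, field norms, query normalization and whatever analyzer sits behind the .unmunged sub-field, so the numbers will not match real explain output. The helper names (classic_idf, clause_score, dis_max, rank) are made up for this illustration.

```python
import math

# Toy copy of the three documents from the first post.
DOCS = {
    1: {"object_name": "Albino Elephant"},
    2: {"object_name": "Cute Albino Elephant",
        "object_name_other_language": "Cute Albino Elephant"},
    3: {"object_name": "The Cutest Albino Elephant",
        "object_name_other_language": "The Cutest Albino Elephant"},
}
FIELDS = ["object_name", "object_name_other_language"]


def tokens(doc, field):
    """Lowercased whitespace tokens of a field, or [] if the field is absent."""
    return doc.get(field, "").lower().split()


def classic_idf(doc_freq, num_docs):
    """idf as in Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def clause_score(docs, doc, field, terms):
    """Simplified per-field score: sum of idf^2 over the matching terms."""
    total = 0.0
    for term in terms:
        if term in tokens(doc, field):
            # Document frequency is counted per field, not across fields.
            df = sum(1 for d in docs.values() if term in tokens(d, field))
            total += classic_idf(df, len(docs)) ** 2
    return total


def dis_max(scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the sum of the other scores."""
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


def rank(docs, query):
    """Score every document with a dis_max over one clause per field."""
    terms = query.lower().split()
    return {
        doc_id: dis_max([clause_score(docs, doc, f, terms) for f in FIELDS])
        for doc_id, doc in docs.items()
    }


scores = rank(DOCS, "Albino Elephant")
for doc_id, score in sorted(scores.items()):
    print(doc_id, round(score, 4))
```

Under these assumptions docs 2 and 3 outscore doc 1 only because "albino" and "elephant" occur in fewer documents in object_name_other_language than in object_name. If doc 1 is given the second field as well, the per-field idf becomes the same everywhere and the factors this sketch leaves out decide the order.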

AEvar_Arnfjord_Bjarm · March 26, 2012, 4:47pm

On Sun, Mar 25, 2012 at 13:21, Shay Banon kimchy@gmail.com wrote:

If you set explain to true, you will see part of how the score is
calculated. You will see that dismax is doing its job, choosing the higher
ranking query that matched as the final score of the document. But, the
higher ranking doc is the one on object_name_other_language, and thats
because of the inverse document frequency mainly, because those terms are
more unique on the mentioned field than the object_name.

Ah, is there some way to work around that? I see why that's happening,
but it means that with one field per language, documents that happen
to have translations in relatively rare languages will be bumped up
over other documents.

kimchy · March 27, 2012, 1:18pm

Maybe boost the regular name one?
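
A boosted version of the query from the first post might look roughly like the sketch below, built as a Python dict and serialized to JSON. The boost value of 2.0 is arbitrary, the helper name boosted_dis_max is hypothetical, and the object form of the text query with "query" and "boost" keys is an assumption that should be checked against the documentation for the version in use.

```python
import json


def boosted_dis_max(query_text, field_boosts):
    """Build a dis_max query with one boosted text clause per field."""
    return {
        "dis_max": {
            "queries": [
                # Assumed object form of the text query: {"query": ..., "boost": ...}
                {"text": {field: {"query": query_text, "boost": boost}}}
                for field, boost in field_boosts.items()
            ]
        }
    }


body = {
    "query": boosted_dis_max(
        "Albino Elephant",
        {
            "object_name.unmunged": 2.0,  # arbitrary example boost
            "object_name_other_language.unmunged": 1.0,
        },
    ),
    # explain shows how each hit's score was put together
    "explain": True,
}

print(json.dumps(body, indent=2))
```

Running the search with explain enabled makes it possible to check whether the boost actually outweighs the idf difference between the two fields.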

AEvar_Arnfjord_Bjarm · March 28, 2012, 9:47am

On Tue, Mar 27, 2012 at 15:18, Shay Banon kimchy@gmail.com wrote:

Maybe boost the regular name one?

The issue is that I don't know in advance what the regular name is,
since the actual use case is that I'm searching through a bunch of
languages, so proper boost values would depend on how they're
distributed throughout the various documents.

What I'm really trying to do is this:

Search through documents with fields like these:

name_en: London
name_is: Lundúnir
name_it: Londres

Where any one of those fields may or may not be present. I know
that given how Lucene works this is probably impossible, but I'd
like the IDF to be computed based not on any one of those fields
but on the aggregate of those fields.
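
As an illustration of what "IDF on the aggregate of those fields" would mean, here is a small Python sketch that counts a document once if the term appears in any of its name fields. It only shows the arithmetic, not something Lucene or Elasticsearch is claimed to do. The function names are hypothetical, the idf formula is the classic 1 + ln(numDocs / (docFreq + 1)), and the second and third documents are invented for the example.

```python
import math

NAME_FIELDS = ["name_en", "name_is", "name_it"]


def aggregate_doc_freq(docs, term, fields=NAME_FIELDS):
    """Number of documents containing the term in at least one name field."""
    term = term.lower()
    return sum(
        1
        for doc in docs
        if any(term in doc.get(f, "").lower().split() for f in fields)
    )


def aggregate_idf(docs, term, fields=NAME_FIELDS):
    """Classic idf computed from the cross-field document frequency."""
    df = aggregate_doc_freq(docs, term, fields)
    return 1.0 + math.log(len(docs) / (df + 1))


places = [
    {"name_en": "London", "name_is": "Lundúnir", "name_it": "Londres"},
    {"name_en": "London"},  # invented: a second London without translations
    {"name_en": "Paris"},   # invented
]

for term in ["London", "Lundúnir", "Paris"]:
    print(term, aggregate_doc_freq(places, term), round(aggregate_idf(places, term), 4))
```
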
The reason I don't structure it like this instead:
```
names:
 -
   name: London
   lc: en
 -
   name: Lundúnir
   lc: is
 -
   name: Londres
   lc: it
```
Is twofold. Firstly, I found that Lucene would effectively normalize
this to "London Lundúnir Londres" internally, so a location with
more translations would get penalized since it wouldn't be an exact
match anymore. I filed issue #1806 on elastic/elasticsearch ("Provide
a way to set the getPositionIncrementGap() in the mapping for
multivalued fields"), which might allow me to get around this issue.

The second reason is that by representing the data like that I
can't have per-language analyzers. If #1806 got fixed I could work
around this to a large extent by normalizing the data like this:
```
names_standard_analyzer:
 -
   name: London
   lc: en
 -
   name: Lundúnir
   lc: is
 -
   name: Londres
   lc: it

names_cjk_analyzer:
 -
   name: 伦敦
   lc: zh
```
Which would still give me the same IDF problems, just to a lesser
extent.

One way to work around this I guess would be to just declare that
English is the default language, and then if I don't have a
translation for a given language slot just insert the English
one. This would result in a lot of redundant data being added to the
index, but might alleviate the IDF problem.
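
That fallback could be applied at index time with something like the Python sketch below. The function name fill_missing_names and the name_<lang> field convention are assumptions taken from the example fields above, and whether this actually evens out the IDF would still need to be checked with explain.

```python
def fill_missing_names(doc, languages, default_lang="en"):
    """Return a copy of doc where every missing name_<lang> field
    is filled with the default-language name, if there is one."""
    default_field = f"name_{default_lang}"
    filled = dict(doc)
    if default_field not in doc:
        # Nothing to fall back on; leave the document as it is.
        return filled
    for lang in languages:
        filled.setdefault(f"name_{lang}", doc[default_field])
    return filled


doc = fill_missing_names(
    {"name_en": "London", "name_is": "Lundúnir"},
    languages=["en", "is", "it"],
)
print(doc)
```

Because every document then has a value in every language field, the per-field document counts move closer together, at the cost of the redundant data mentioned above.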
