Failing to use dis_max to get accurate scores on multiple fields


(Ævar Arnfjörð Bjarmason) #1

I'm having problems getting the best name matches when querying
across multiple fields with dis_max. Here's a complete test case for
what I'm doing:

https://gist.github.com/2059195

In summary I'm trying to search through data like:

doc 1:
"object_name" : "Albino Elephant"
doc 2:
"object_name" : "Cute Albino Elephant"
"object_name_other_language" : "Cute Albino Elephant"
doc 3:
"object_name" : "The Cutest Albino Elephant"
"object_name_other_language" : "The Cutest Albino Elephant"

I.e. a bunch of objects that have multiple names that have been
translated into different languages (because I want more accurate
matches than using an array, and I want special analyzers per
language).

When I do a search for "Albino Elephant" with this query:

    "dis_max" : {
        "queries" : [
            {
                "text" : {
                    "object_name.unmunged" : "Albino Elephant"
                }
            },
            {
                "text" : {
                    "object_name_other_language.unmunged" : "Albino Elephant"
                }
            }
        ]
    }

doc #3 is the highest-scoring hit, but if I were to change
doc #1 to:

doc 1:
"object_name" : "Albino Elephant"
"object_name_other_language" : "Albino Elephant"

It would be the #1 hit, so dis_max seems to be behaving like a
bool/should query and summing the scores. I thought the whole
point of dis_max was to "execute all these queries, compute their
scores, and pick the highest one" so you could search through
e.g. multiple translations and not give objects extra score by virtue
of having more translations.
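
For reference, the difference between the two combination rules can be sketched in a few lines of Python. This is a simplified model and not Lucene's code: it ignores the coord factor and query normalization, the per-clause scores are made-up numbers, and tie_breaker stands in for the dis_max parameter of the same name (0.0 by default).

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best matching clause score, plus tie_breaker times the other clauses."""
    matched = [s for s in clause_scores if s > 0.0]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


def bool_should_score(clause_scores):
    """Simplified bool/should: the scores of all matching clauses are summed."""
    return sum(s for s in clause_scores if s > 0.0)


# Hypothetical per-field scores for the same query text.
one_field = [0.5, 0.0]   # the name matches in one field only
two_fields = [0.5, 0.5]  # the same name matches in both fields

dis_max_one = dis_max_score(one_field)
dis_max_two = dis_max_score(two_fields)
should_two = bool_should_score(two_fields)
```

Under this model, and with tie_breaker left at 0, an extra field matching with the same score does not raise the dis_max result, while it doubles the bool/should one. If adding a second field still changes the ranking, the per-clause scores themselves must differ between the fields.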

Or maybe something else is going on here. Clinton Gormley suggested on IRC
that this might be because I had more than one shard, but I changed the gist
to create only one shard and still get the same results.


(Ævar Arnfjörð Bjarmason) #2

I've played with this some more but I haven't gotten it working.
Anyone have an idea what might be going on here?


(Shay Banon) #3

Heya,

If you set explain to true, you will see part of how the score is
calculated. You will see that dis_max is doing its job, choosing the highest
scoring query that matched as the final score of the document. But the
higher scoring match is the one on object_name_other_language, and that's
mainly because of the inverse document frequency: those terms are
rarer in that field than in object_name.

-shay.banon
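
To make the IDF effect concrete, here is a small sketch using the idf formula from Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The counts are assumptions taken from the three-document example in this thread: a query term that occurs in object_name for all three documents but in object_name_other_language for only two. The real explain output also includes term frequency, field norm, and query norm, which are left out here.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


num_docs = 3  # documents in the index (assumed, from the example)

# "albino" appears in object_name for all three documents ...
idf_object_name = classic_idf(doc_freq=3, num_docs=num_docs)

# ... but in object_name_other_language for only two of them.
idf_other_language = classic_idf(doc_freq=2, num_docs=num_docs)
```

With these counts the term is worth more in the sparser field, so the clause on object_name_other_language scores higher than the one on object_name for the same text, and dis_max picks it.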


(Ævar Arnfjörð Bjarmason) #4

Ah, is there some way to work around that? I see why that's happening
but that means in a document with a field per language the ones that
happen to have translations in relatively rare languages will be
bumped up more than other documents.


(Shay Banon) #5

Maybe boost the regular name one?
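
A rough sketch of what per-clause boosting could look like, built as a Python dict. It assumes the expanded form of the text query accepts a boost key next to query (as the later match query does); the boost values and the boosted_dis_max helper are hypothetical and only illustrate the shape of the request.

```python
import json


def boosted_dis_max(query_text, field_boosts):
    """Build a dis_max body with one text clause per field, each with its own boost."""
    return {
        "query": {
            "dis_max": {
                "queries": [
                    {"text": {field: {"query": query_text, "boost": boost}}}
                    for field, boost in field_boosts.items()
                ]
            }
        }
    }


# Hypothetical boosts: favour the regular name over the translated one.
body = boosted_dis_max(
    "Albino Elephant",
    {
        "object_name.unmunged": 2.0,
        "object_name_other_language.unmunged": 1.0,
    },
)
payload = json.dumps(body, indent=2)
```

The boost multiplies that clause's score before dis_max takes the maximum, so it can offset a lower IDF on the regular name field, but the right value depends on the data.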


(Ævar Arnfjörð Bjarmason) #6

The issue is that I don't know in advance what the regular name is,
since the actual use case is that I'm searching through a bunch of
languages, so proper boost values would depend on how they're
distributed throughout the various documents.

What I'm really trying to do is this:

  • Search through documents with fields like these:

    name_en: London
    name_is: Lundúnir
    name_it: Londres

    Any one of those fields may be missing or present. I know
    that given how Lucene works this is probably impossible, but I'd
    like the IDF to be computed based not on any one of those fields
    but on the aggregate of those fields.

  • The reason I don't structure it like this instead:

    names:
     -
       name: London
       lc: en
     -
       name: Lundúnir
       lc: is
     -
       name: Londres
       lc: it
    

    is twofold. Firstly, I found that Lucene would effectively normalize
    this to "London Lundúnir Londres" internally, so a location with
    more translations would get penalized since it wouldn't be an exact
    match anymore. I filed
    https://github.com/elasticsearch/elasticsearch/issues/1806 which
    might allow me to get around this issue.

    The second reason is that by representing the data like that I
    can't have per-language analyzers. If #1806 got fixed I could work
    around this to a large extent by normalizing the data like this:

    names_standard_analyzer:
     -
       name: London
       lc: en
     -
       name: Lundúnir
       lc: is
     -
       name: Londres
       lc: it
    
    names_cjk_analyzer:
     -
       name: 伦敦
       lc: zh
    

    Which would still give me the same IDF problems, just to a lesser
    extent.

One way to work around this I guess would be to just declare that
English is the default language, and then if I don't have a
translation for a given language slot just insert the English
one. This would result in a lot of redundant data being added to the
index, but might alleviate the IDF problem.
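
A minimal sketch of that fallback, applied to a document before indexing. The fill_missing_names helper and the language list are hypothetical; the field names follow the name_en / name_is / name_it pattern above.

```python
def fill_missing_names(doc, languages, default_lang="en"):
    """Return a copy of doc where every missing name_<lang> field holds the default-language name."""
    default_field = f"name_{default_lang}"
    filled = dict(doc)
    if default_field not in doc:
        # No default name to copy from; leave the document as it is.
        return filled
    for lang in languages:
        # setdefault only writes the field when it is absent.
        filled.setdefault(f"name_{lang}", doc[default_field])
    return filled


doc = {"name_en": "London", "name_is": "Lundúnir"}
filled = fill_missing_names(doc, ["en", "is", "it"])
```

Every name field then exists on every document that has an English name, which should bring the per-field document frequencies closer together. The cost is that the copied values add English terms to the other fields' statistics.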

