Score is lower if text is longer


(Robert) #1

Hi!

I indexed a lot of documents. Two are very similar as they have the same
name and the same city. The only difference is that one of these two has
some fields with a lot of text in them. When I search over the _all field, I
would expect both results to have a very similar score. But the one with
the fields full of text has a significantly lower score than the one with the
short text. In my case the scores are ~0.2 for the one with long texts and
~0.6 for the one with the short text.

So, how can I make sure these two documents get a similar score?

Robert.

My analyzers:
{'analysis':{
'analyzer':{
'indexAnalyzer':{
'type':'custom',
'tokenizer':'standard',
'filter':['lowercase','mynGram']
},
'searchAnalyzer':{
'type':'custom',
'tokenizer':'standard',
'filter':['standard','lowercase','mynGram']
}
},
'filter':{
'mynGram':{
'type':'nGram',
'min_gram':2,
'max_gram':50
}
}
}}

My mapping:
{
'name':{
'type':'string',
'include_in_all':true,
'boost':5
},
'city':{
'type':'string',
'include_in_all':true
},
'someTextField':{
'type':'string',
'include_in_all':true
},
'someOtherTextField':{
'type':'string',
'include_in_all':true
}
}

My documents:
{
'name':'Wirtschaftsinformatik',
'city':'Hamburg',
'someTextField':'',
'someOtherTextField':''
}
{
'name':'Wirtschaftsinformatik',
'city':'Hamburg',
'someTextField':'A long text. Bla bla bla.',
'someOtherTextField':'Another long text. Bla bla bla.'
}


(Robert) #2

I forgot: The query performed on the _all field is: 'Wirtschaftsinformatik AND
Hamburg'.


(Lukáš Vlček) #3

Hi,

this is how Lucene scoring works. If you want the documents with the same
name/city to score similarly, you can use index-time boosting for the fields
name/city.

Regards,
Lukas


(Lukáš Vlček) #4

Actually, you can also check
http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query.html
(see the "Multi Field" chapter).
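
As a rough illustration of the multi-field form of the query string query, here is a minimal Python sketch that builds such a request body. The field names and the boost of 5 are taken from the mapping in the first post; whether this scores the two documents more evenly is an assumption that would need testing against the real index.

```python
import json

# Sketch only: a query_string query that names the fields explicitly
# instead of searching _all. "name^5" applies a query-time boost of 5.
query = {
    "query": {
        "query_string": {
            "fields": ["name^5", "city"],
            "query": "Wirtschaftsinformatik AND Hamburg",
            # Combine the per-field queries with dis_max (the documented default).
            "use_dis_max": True,
        }
    }
}

body = json.dumps(query, indent=2)
print(body)
```

Because the long text fields are not listed in "fields", their length does not enter the score for this query.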


(Robert) #5

Hi!

I already boost the fields as necessary. Name, for example, is way more
important than the other fields.

One solution that I have in mind would be to calculate a boost for each
document depending on the length of the texts. But that does not seem like
a good approach to me...

I will try out dis_max and report whether it helped.

Any other ideas?


(Lukáš Vlček) #6

Hey,

I am not sure that boosting based on the length of the text is a good
approach. I think it could quickly get complicated when you try
to use more advanced queries. But do not take my word for it, I am not an
expert on Lucene Similarity
(http://lucene.apache.org/java/3_4_0/api/all/org/apache/lucene/search/Similarity.html),
which by the way should take the length of the text into account. As far as
I know, by default Lucene scores shorter texts higher.
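
To make the length effect concrete, here is a minimal Python sketch of the classic Lucene length normalization, 1 / sqrt(number of terms in the field), applied to the two example documents from the first post. It rests on several assumptions: a simple regex stands in for the standard tokenizer, the n-gram filter is modeled as emitting every substring of 2 to 50 characters per word, and all fields are treated as one combined field. Real scores also depend on boosts, tf/idf and the lossy way norms are stored, so only the direction of the effect is meaningful here.

```python
import math
import re

def ngram_count(token: str, min_gram: int = 2, max_gram: int = 50) -> int:
    """Number of n-grams of sizes min_gram..max_gram contained in one token."""
    longest = min(max_gram, len(token))
    return sum(len(token) - n + 1 for n in range(min_gram, longest + 1))

def term_count(text: str) -> int:
    """Terms produced for a text: split into words, then n-gram each word."""
    return sum(ngram_count(word) for word in re.findall(r"\w+", text.lower()))

def length_norm(num_terms: int) -> float:
    """Classic Lucene length normalization: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)

short_doc = "Wirtschaftsinformatik Hamburg"
long_doc = short_doc + " A long text. Bla bla bla. Another long text. Bla bla bla."

for label, text in (("short", short_doc), ("long", long_doc)):
    terms = term_count(text)
    print(label, terms, round(length_norm(terms), 4))
```

The document with more text produces more terms and therefore a smaller norm, which lowers its score for the same matching terms.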

I think what you need is to answer a simple question for yourself: "What is
relevancy in my domain model?" Once you know the answer, you can try to
tackle it with various options (boosted fields, dis max, custom scoring...).

Just out of curiosity, when searching on the _all field, did you see any
effect from boosting the Name field? I remember that I noticed some issues
with it but did not have a chance to nail it down (I switched to dis max
instead).

Regards,
Lukas


(Robert) #7

I could finally test it, but using dis_max makes no difference in the
results. Not even the score changed.

I am still not sure how to analyze the data. I have to search over a lot of
fields and can't limit it to one. Documents look like the example in my first
post. Which is the best analyzer to be able to search for parts of a word,
too? For example, I would like to find the two docs from my first post when
searching for 'Wirtschaft'. Documents that match the search query exactly
should have a higher score. Stemming didn't give me the wanted results,
that's why I use nGram at the moment.
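
For illustration, here is a small Python sketch of why an n-gram filter finds parts of a word. It assumes the filter emits every substring of 2 to 50 characters of the lowercased token, which approximates the 'mynGram' filter from the first post rather than reproducing it exactly.

```python
def ngrams(token: str, min_gram: int = 2, max_gram: int = 50) -> list:
    """All substrings of the lowercased token with lengths min_gram..max_gram."""
    token = token.lower()
    longest = min(max_gram, len(token))
    return [
        token[start:start + size]
        for size in range(min_gram, longest + 1)
        for start in range(len(token) - size + 1)
    ]

indexed_terms = set(ngrams("Wirtschaftsinformatik"))

# The partial word is one of the indexed terms, so it can match.
print("wirtschaft" in indexed_terms)
# The full word is indexed as well, as just one gram among many.
print("wirtschaftsinformatik" in indexed_terms)
```

Since the exact word is only one term among all the grams, an exact match does not stand out by itself in this setup.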

Regards, Robert.


(egaumer) #8

The omit_norms mapping attribute allows you to disable length normalization
and index-time boosting for the field. This means the underlying Lucene
scoring algorithm will not be affected by the length of the text associated
with the given field.

http://www.elasticsearch.org/guide/reference/mapping/core-types.html
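
A minimal sketch of where the attribute would go, written as a Python dict based on the mapping from the first post. Setting omit_norms only on the two long text fields is an assumption made here so that the index-time boost on 'name' is kept. The sketch only shows the placement; it does not settle whether a search on _all is affected.

```python
import json

# Sketch only: the mapping from the first post with omit_norms added
# to the long text fields (an illustrative choice, not a confirmed fix).
mapping = {
    "name": {"type": "string", "include_in_all": True, "boost": 5},
    "city": {"type": "string", "include_in_all": True},
    "someTextField": {
        "type": "string",
        "include_in_all": True,
        "omit_norms": True,  # no length normalization for this field
    },
    "someOtherTextField": {
        "type": "string",
        "include_in_all": True,
        "omit_norms": True,
    },
}

print(json.dumps(mapping, indent=2))
```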

-Eric


(Robert) #9

It makes no difference in the results. Maybe because I'm using an
nGram analyzer?

Any suggestions on which analyzer and which filters to use?

Thanks, Robert.

