Scoring problem between 2 machines


(Sicker) #1

I created a search query that orders by score, but the results do not come
back in the same order.

These are some items from the first result.

{
_shard: 2
_node: U4zV_tkeRpyRi59NNmNWQQ
_index: directory
_type: profile
_id: RM631130
_score: 0.55288595
},

{
_shard: 4
_node: 3KNS8qPTQEWxuCRzwgMYLw
_index: directory
_type: profile
_id: RM631126
_score: 1.7044709
}

These are some items from the second result.

{
_shard: 4
_node: U4zV_tkeRpyRi59NNmNWQQ
_index: directory
_type: profile
_id: RM631126
_score: 0.55287325
},
{
_shard: 2
_node: 3KNS8qPTQEWxuCRzwgMYLw
_index: directory
_type: profile
_id: RM631130
_score: 1.6957934
}


(Ivan Brusic) #2

Scoring might be different due to the distributed nature of
ElasticSearch. Try adjusting the search type:
http://www.elasticsearch.org/guide/reference/api/search/search-type.html

There is a tradeoff between performance and accuracy of scoring.

--
Ivan

On Wed, Jul 18, 2012 at 2:26 AM, Sicker sicker27@gmail.com wrote:

I created a search query that orders by score, but the results do not come
back in the same order.



(Clinton Gormley) #3

On Thu, 2012-07-19 at 10:32 -0700, Ivan Brusic wrote:

Scoring might be different due to the distributed nature of
ElasticSearch. Try adjusting the search type:
http://www.elasticsearch.org/guide/reference/api/search/search-type.html

There is a tradeoff between performance and accuracy of scoring.

Also, as the quantity of data you have grows, these differences tend to
even out.


(Radim) #4

To be a little less hand-wavy (please correct me if I'm wrong): some
stats used in the scoring, like IDF, are computed per shard, by
default. These stats are effectively computed only from the document
set present in that one shard. This means that the same document can
be scored differently, depending on which shard it ends up in.

By changing the search-type, you can change this behaviour so that the
stats are computed on index-level (not shard-level), i.e. from the
document set present in the entire index. This helps to score
consistently within one index.
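
To make the shard-level effect concrete, here is a minimal sketch. It assumes the classic Lucene-style IDF formula, 1 + ln(numDocs / (docFreq + 1)), and the shard names and counts are made up for illustration; they are not taken from the results posted above.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style IDF (assumed here): 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for one term on two shards: (doc_freq, num_docs).
shard_stats = {"shard_2": (3, 1000), "shard_4": (40, 1000)}

# Shard-level: each shard computes IDF from its own documents only.
per_shard_idf = {name: idf(df, n) for name, (df, n) in shard_stats.items()}

# Combined: add up the statistics from every shard first, then compute one IDF.
total_df = sum(df for df, _ in shard_stats.values())
total_docs = sum(n for _, n in shard_stats.values())
global_idf = idf(total_df, total_docs)

for name, value in per_shard_idf.items():
    print(f"{name}: idf={value:.3f}")
print(f"combined: idf={global_idf:.3f}")
```

The same term gets a different weight on each shard, while the combined statistics give a single weight that every shard would share.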

AFAIK there is no way to run cross-index queries accurately. You can
rely on the "evening out" that Clinton mentions. In that case you need
to be careful your routing doesn't skew the stats distribution too
much -- if each shard receives very different data, then the stats
will never even out. The default routing is fine, as it spreads
documents evenly across shards (using a hash of the id field).
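
A rough sketch of hash-based routing, for illustration only. `zlib.crc32` is a stand-in and is not the hash function Elasticsearch itself uses; the shard count and the id range are also assumptions.

```python
import zlib
from collections import Counter

NUM_SHARDS = 5  # hypothetical shard count


def pick_shard(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document id to a shard number by hashing the id."""
    # zlib.crc32 is a stand-in hash, NOT the one Elasticsearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# Hypothetical ids shaped like the ones in this thread.
doc_ids = [f"RM{n}" for n in range(631000, 632000)]

# Count how many documents land on each shard.
counts = Counter(pick_shard(doc_id) for doc_id in doc_ids)
for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {counts[shard]} documents")
```

The shard depends only on the id, so a given document always lands on the same shard, and the ids are spread over the shards without regard to document content.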

HTH,
Radim

On Jul 20, 10:20 am, Clinton Gormley cl...@traveljury.com wrote:

Also, as the quantity of data you have grows, these differences tend to
even out.


(Clinton Gormley) #5

Hiya Radim

By changing the search-type, you can change this behaviour so that the
stats are computed on index-level (not shard-level), i.e. from the
document set present in the entire index. This helps to score
consistently within one index.

Not just at the index-level, but for all the shards involved in your
query. So if you're doing a multi-index search and you use

search_type=dfs_query_then_fetch

then it will fetch the term frequencies from all shards (from all
indices in your query) before executing it.
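
As a hedged illustration, the snippet below only builds the URL for such a multi-index search; it does not send a request. The host `http://localhost:9200` and the second index name `other_index` are assumptions, and `build_search_url` is a hypothetical helper, not part of any client library.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, indices: list[str], search_type: str) -> str:
    """Build a _search URL covering several indices with an explicit search_type."""
    index_part = ",".join(indices)  # multi-index search: comma-separated names
    query = urlencode({"search_type": search_type})
    return f"{base_url}/{index_part}/_search?{query}"


# Hypothetical host and second index; "directory" is the index from this thread.
url = build_search_url(
    "http://localhost:9200",
    ["directory", "other_index"],
    "dfs_query_then_fetch",
)
print(url)
```

The query body itself stays the same; only the `search_type` parameter on the request changes.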

AFAIK there is no way to run cross-index queries accurately. You can
rely on the "evening out" that Clinton mentions. In that case you need
to be careful your routing doesn't skew the stats distribution too
much -- if each shard receives very different data, then the stats
will never even out. The default routing is fine, as it sends out
documents to random shards evenly (using hash of the id field).

Sure, but for typical use cases, you'll be routing on (e.g.) a client, and
searching within just that client, so terms will be evenly distributed
for that client.

clint

