Scoring inconsistency across shards


(Arun Rajagopalan) #1

Hi, I am trying to run a like_text query with ignore_tf set to true. The queryNorm is inconsistent across shards, and this is causing issues with sorting.
Here is my query:

    {
        "flt_field": {
            "identity.first_name": {
                "like_text": "mike",
                "ignore_tf": true
            }
        }
    }, {
        "flt_field": {
            "identity.last_name": {
                "like_text": "roger",
                "ignore_tf": true
            }
        }
    }

Here are my results:

        "_shard": 3,
        "_node": "NeAtgK8zT9GGDiRpql921Q",
        "_index": "myindex",
        "_type": "v1",
        "_id": "226912189",
        "_score": 0.76921964,
        "fields": {
            "identity.last_name": [ "ROGER" ],
            "identity.first_name": [ "MIKE" ]
        },
        "_explanation": {
            "value": 0.7692197,
            "description": "sum of:",
            "details": [ {
                "value": 0.04058554,
                "description": "ConstantScore(cache(internal.status.value:A)), product of:",
                "details": [ {
                    "value": 1.0,
                    "description": "boost"
                }, {
                    "value": 0.04058554,
                    "description": "queryNorm"
                } ]
            }
    }, {
        "_shard": 3,
        ....
        ....
        }
    }, {
        "_shard": 2,
        "_node": "NeAtgK8zT9GGDiRpql921Q",
        "_index": "myindex",
        "_type": "v1",
        "_id": "380786027",
        "_score": 0.7689439,
        "fields": {
            "identity.last_name": [ "ROGER" ],
            "identity.first_name": [ "MIKE" ]
        },
        "_explanation": {
            "value": 0.76894397,
            "description": "sum of:",
            "details": [ {
                "value": 0.040466454,
                "description": "ConstantScore(cache(internal.status.value:A)), product of:",
                "details": [ {
                    "value": 1.0,
                    "description": "boost"
                }, {
                    "value": 0.040466454,
                    "description": "queryNorm"
                } ]
            }, {
                "value": 0.2910072,
                "description": "sum of:",
                "details": [ {
                    "value": 0.2910072,
                    "description": "ConstantScore(identity.first_name:mike)^7.1913195, product of:",
                    "details": [ {
                        "value": 7.1913195,
                        "description": "boost"
                    }, {
                        "value": 0.040466454,
                        "description": "queryNorm"
                    } ]
                } ]
            }, ]
        } ]
        }

    }

Here you can see 3 records, all with the same first name and last name. The first two are from one shard and the third is from a different shard. The scores vary even though the values are the same.


(Christoph) #2

Hi,
Scoring is tricky, so just a few questions to better understand what you are trying to do. Do you have a special reason for using the "Fuzzy Like This" query for what you are trying to achieve? And are you running this query on real data, or just on a few documents as toy data?

In the latter case it would help to know more about what kind of documents you are using and how many. As for the queryNorm, I think this scoring factor is calculated using IDF values, which are not necessarily consistent across shards due to the distributed nature of Elasticsearch, as explained here. Have you tried using search_type=dfs_query_then_fetch, and does it make any difference?
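To make the effect concrete, here is a rough Python sketch that replays the arithmetic from the explain output in the first post. It assumes Lucene's classic TF-IDF similarity, where queryNorm is 1 / sqrt(sumOfSquaredWeights) and a constant-score clause contributes boost * queryNorm. The shard 3 boost for the first-name clause is not shown in the post, so reusing the shard 2 boost there is purely an assumption for illustration.

```python
import math

# Values copied from the explain output in the first post.
BOOST_FIRST_NAME_SHARD2 = 7.1913195
QUERY_NORM_SHARD2 = 0.040466454
QUERY_NORM_SHARD3 = 0.04058554


def query_norm(sum_of_squared_weights):
    """queryNorm under Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 / math.sqrt(sum_of_squared_weights)


def constant_score(boost, norm):
    """Score of a constant-score clause: 'product of' boost and queryNorm."""
    return boost * norm


# Reproduces the 0.2910072 reported for identity.first_name on shard 2.
shard2_clause = constant_score(BOOST_FIRST_NAME_SHARD2, QUERY_NORM_SHARD2)

# ASSUMPTION: shard 3 uses the same boost (its clause is elided in the post).
# Even then, the different queryNorm alone shifts the clause score.
shard3_clause = constant_score(BOOST_FIRST_NAME_SHARD2, QUERY_NORM_SHARD3)

# Sum of squared weights implied by each shard's queryNorm.
implied_sum_shard2 = 1.0 / QUERY_NORM_SHARD2 ** 2
implied_sum_shard3 = 1.0 / QUERY_NORM_SHARD3 ** 2

if __name__ == "__main__":
    print("shard 2 clause score:", shard2_clause)
    print("shard 3 clause score (assumed boost):", shard3_clause)
    print("implied sum of squared weights:", implied_sum_shard2, implied_sum_shard3)
```

Since the sum of squared weights includes the per-clause boosts, any boost derived from shard-local term statistics would produce a slightly different queryNorm on each shard, which is consistent with the small differences in the reported scores.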


(Arun Rajagopalan) #3

Hi,

I am querying against real data. To give you an idea, I am querying against a 16-node cluster with more than 10M records. Also, as I mentioned earlier, I am using search_type=dfs_query_then_fetch. I need to use the fuzzy_like_this query to get both exact matches and names that could be misspelled.
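For reference, a request of this kind could be assembled roughly as in the Python sketch below. The index name and the two flt_field clauses come from the first post; the host, the helper names, and the bool/should wrapper are assumptions made for illustration, since the full query is not shown in this thread. The sketch only builds the URL and body and does not send anything.

```python
import json
from urllib.parse import urlencode


def flt_clause(field, text):
    """One flt_field clause as posted above, with term frequency ignored."""
    return {"flt_field": {field: {"like_text": text, "ignore_tf": True}}}


def build_search(index, host="http://localhost:9200"):
    """Return (url, json_body) for a DFS search with explain enabled.

    ASSUMPTIONS: the host is a placeholder, and combining the clauses in a
    bool/should is a guess at the surrounding query, which the thread omits.
    """
    params = urlencode({"search_type": "dfs_query_then_fetch", "explain": "true"})
    url = f"{host}/{index}/_search?{params}"
    body = {
        "query": {
            "bool": {
                "should": [
                    flt_clause("identity.first_name", "mike"),
                    flt_clause("identity.last_name", "roger"),
                ]
            }
        }
    }
    return url, json.dumps(body, indent=2)


if __name__ == "__main__":
    url, body = build_search("myindex")
    print(url)
    print(body)
```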

