Efficient similarity scoring question


(Chris-6) #1

Hi,

Efficiency question for you guys.

I have a set of documents with a repeated integer:
{
  "document": {
    "properties": {
      "id": {
        "index": "no",
        "type": "string"
      },
      "fprint": {
        "postings_format": "bloom_pulsing",
        "type": "long"
      },
      "fprint_size": {
        "include_in_all": false,
        "store": true,
        "type": "integer"
      }
    }
  }
}

My query is a set of fingerprints, and I would like the final score to be the
number of matching fingerprints in the document, normalized by the total
number of fingerprints in the document. This query retrieves the right set,
but does not normalize:

{
  "query": {
    "bool": {
      "should": [
        { "term": { "morgan_fprint": 632180975 } },
        { "term": { "morgan_fprint": 1039876598 } },
        { "term": { "morgan_fprint": 2246728737 } },
        { "term": { "morgan_fprint": 2264700157 } }
      ],
      "minimum_should_match": 1,
      "boost": 1.0
    }
  }
}

The advantage of this query is that it takes only ~50ms.
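
For reference, the score being asked for can be written out directly. The sketch below is a plain-Python illustration, not Elasticsearch code: the function name `normalized_score` and the sample document values are made up for the example, and it assumes each matching fingerprint counts once.

```python
def normalized_score(doc_fprints, query_fprints):
    """Fraction of the document's fingerprints that also appear in the query."""
    if not doc_fprints:
        return 0.0
    query = set(query_fprints)
    # Count the document fingerprints that the query contains.
    matches = sum(1 for fp in doc_fprints if fp in query)
    # Normalize by the document's fingerprint count (the fprint_size field).
    return matches / len(doc_fprints)


# Query fingerprints taken from the post; the document below is hypothetical.
query = [632180975, 1039876598, 2246728737, 2264700157]
doc = [632180975, 1039876598, 111, 222]
print(normalized_score(doc, query))  # 2 of 4 fingerprints match -> 0.5
```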

This query does the normalization, but is significantly slower (e.g. 250ms):

{
  "query": {
    "custom_score": {
      "script": "(_score / doc.morgan_fprint_size.value)",
      "query": {
        "bool": {
          "should": [
            { "term": { "morgan_fprint": 632180975 } },
            { "term": { "morgan_fprint": 1039876598 } },
            { "term": { "morgan_fprint": 2246728737 } },
            { "term": { "morgan_fprint": 2264700157 } }
          ],
          "minimum_should_match": 1,
          "boost": 1.0
        }
      }
    }
  }
}
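
Since the term clauses are repetitive, the request body above can be generated rather than written by hand. This is a minimal sketch, assuming the field names used in the queries (`morgan_fprint`, `morgan_fprint_size`); the helper name `build_normalized_query` is hypothetical, and it only builds the JSON without sending it anywhere.

```python
import json


def build_normalized_query(fprints, field="morgan_fprint",
                           size_field="morgan_fprint_size"):
    """Build the custom_score request body for a list of fingerprints."""
    # One term clause per query fingerprint.
    should = [{"term": {field: fp}} for fp in fprints]
    return {
        "query": {
            "custom_score": {
                # Divide the bool query's score by the stored fingerprint count.
                "script": "(_score / doc.%s.value)" % size_field,
                "query": {
                    "bool": {
                        "should": should,
                        "minimum_should_match": 1,
                        "boost": 1.0,
                    }
                },
            }
        }
    }


body = build_normalized_query([632180975, 1039876598, 2246728737, 2264700157])
print(json.dumps(body, indent=2))
```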

====

I assume this is because of how morgan_fprint_size.value is retrieved from
disk during query execution. Is there a good way of structuring the index
so that the query is both fast and normalized?

I attempted to do an index-time, per-document boost (e.g. something like
"fprint": { "_value": 12345, "_value": 67890, "_boost": 0.5 }). However, I
got this error:
"You cannot set an index-time boost on an unindexed field, or one that
omits norms"

... So that didn't work.

Thanks,
Chris

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Chris-6) #2

And sorry for the confusion between "fprint" and "morgan_fprint". I had a
copy-paste error; please ignore the "morgan_" prefix.

- Chris

On Tuesday, October 1, 2013 3:46:03 PM UTC-7, Chris wrote:



(benjamin leviant) #3

Hi Chris,

Numeric fields are indexed without norms, so they are not length-normalized.
This is also why your index-time boost was rejected.

If you want to use the default normalization provided by the Lucene index
(which should be much faster), you should index your fprint values as
strings. If the numeric format is mandatory in your case, you can use
multi_field to index the field as both string and long types.
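
As a rough sketch of the multi_field suggestion, the mapping might look like the following, written here as a Python dict so it can be serialized easily. The sub-field name `stringify` is taken from the `fprint.stringify` path used further down in this reply; everything else (dropping `postings_format`, leaving the string sub-field at its default settings) is an assumption to check against your Elasticsearch version.

```python
import json

# Hypothetical mapping: "fprint" stays a long, and "fprint.stringify"
# indexes the same values as strings so that norms can apply to them.
mapping = {
    "document": {
        "properties": {
            "fprint": {
                "type": "multi_field",
                "fields": {
                    "fprint": {"type": "long"},
                    "stringify": {"type": "string"},
                },
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```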

Be warned that the default normalization is not very precise: small
differences in field length may not be taken into account in the final score.
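
To illustrate why small length differences disappear: as far as I understand it, Lucene's default similarity of that era computes a length norm of 1/sqrt(number of terms) and stores it in a single byte. The sketch below is modeled on that byte encoding (Lucene's `SmallFloat.floatToByte315`); the Python function names are mine, boosts are ignored, and it should be read as an illustration rather than a reference implementation.

```python
import math
import struct


def float_to_byte315(value):
    """Compress a float into one byte, keeping only its top few bits."""
    bits = struct.unpack(">i", struct.pack(">f", value))[0]
    small = bits >> 21
    floor = (63 - 15) << 3
    if small <= floor:
        return 0 if bits <= 0 else 1  # zero/negative, or underflow
    if small >= floor + 0x100:
        return 255  # overflow: clamp to the largest code
    return small - floor


def byte315_to_float(code):
    """Expand the one-byte code back into a float."""
    if code == 0:
        return 0.0
    bits = ((code & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_norm(num_terms):
    """Length norm 1/sqrt(n) after a round trip through the byte encoding."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))


# 90, 100 and 110 terms all collapse to the same stored norm, 0.09375.
for n in (90, 100, 110, 130):
    print(n, stored_norm(n))
```

With this encoding, documents holding 90, 100 or 110 fingerprints would be normalized identically, which is the imprecision described above.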

To avoid this problem, I can see two solutions:

  • You can try to amplify the norm variation by using a field boost, as you
    did previously. But first you will need to understand how Lucene processes
    field boosts. For more info on this subject:

https://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)

The correct syntax would be something like this:
{
  "fprint.stringify": [
    {
      "_value": "12345",
      "_boost": 0.5
    },
    {
      "_value": "67890",
      "_boost": 0.5
    }
  ]
}

  • Another solution would be to use a custom similarity class that uses a
    more precise normalization, but this is not trivial and requires
    understanding how Lucene indexes norms.
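
The document body from the first suggestion can be produced programmatically. This is a minimal sketch: the helper name `boosted_fprint_values` is hypothetical, and the boost of 0.5 is just the placeholder value from the example above, not a recommended setting.

```python
import json


def boosted_fprint_values(fprints, boost):
    """Wrap each fingerprint as a string value carrying an index-time boost."""
    return [{"_value": str(fp), "_boost": boost} for fp in fprints]


# Same shape as the hand-written example: one entry per fingerprint.
doc = {"fprint.stringify": boosted_fprint_values([12345, 67890], 0.5)}
print(json.dumps(doc, indent=2))
```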

I hope this can help you.

Regards

Benjamin

On Wed, Oct 2, 2013 at 12:47 AM, Chris chris.vana@gmail.com wrote:


