We've noticed a strange issue with similarity scores on ES. The outline of
the bug is that we get different tfidf scores back for exactly the same
documents (e.g. duplicate documents) after a number of document inserts.
Steps to reproduce:
Start a clean ES setup
No index/type mapping should be created
Add some content:
curl -XPUT 'http://localhost:9200/items/item/1' -d '{"language":"en","description":"some crm description data","title":"some crm title data"}'
This gives 1 document in a new index called items with a type called item. Documents 2-5 are identical copies of document 1.
Add some new docs with slightly different text:
curl -XPUT 'http://localhost:9200/items/item/6' -d '{"language":"en","description":"some crm description data","title":"some crm title data crm"}'
curl -XPUT 'http://localhost:9200/items/item/7' -d '{"language":"en","description":"some crm description crm","title":"crm crm"}'
Notice that not all the docs get the same score. Obviously I would have
expected different scores for documents 6 and 7, but documents 1-5 (which
are all identical) do not all get the same score, even though the
title, description, etc. are the same. Is this issue related to sharding
(e.g. documents mapping to a particular shard) or something else? The same
issue is seen if we add 6 identical documents to an index with 5 shards:
the explain output seems to show that maxDocs for tfidf is calculated per
shard rather than over the whole index. Is this expected?
We are using a completely blank setup of ES 0.90.0, with no complex
analyzer/settings/mapping.
It's not strange, it's expected. Try the DFS_QUERY_THEN_FETCH search type
to solve this problem. Also set omit_norms if you don't want or need tfidf
in a field.
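For anyone following along, here is a minimal sketch of what such a request might look like. The search type is passed as the search_type URL parameter. The query (title:crm) and the helper names dfs_search_url and dfs_search are assumptions for illustration only, since the original post does not show the query that was run.

```shell
# Base URL of the node; localhost:9200 matches the curl commands in the thread.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helper: build a search URL for the items index that asks
# Elasticsearch to gather term statistics from all shards before scoring.
dfs_search_url() {
  # $1 is a query-string query, e.g. title:crm (an assumed example query)
  printf '%s/items/item/_search?search_type=dfs_query_then_fetch&q=%s\n' "$ES_URL" "$1"
}

# Hypothetical helper: run the search against a live node.
dfs_search() {
  curl -s -XGET "$(dfs_search_url "$1")"
}

# Print the URL that would be requested.
dfs_search_url 'title:crm'
```

Calling dfs_search 'title:crm' against a running node should then score identical documents using index-wide statistics rather than per-shard ones, at the cost of an extra round trip to the shards.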
(In reply to Derry O' Sullivan, derryos@gmail.com, Tue, May 21, 2013 at 9:33 AM.)
Thanks for the response. I guess the real question now is why the ES
default is 5 shards for new indexes (I understand sharding from an
indexing speed vs. search speed perspective). If I redo the steps from my
original post with 1 shard, I won't get this issue, so for a 'small' index
it would make sense to use fewer shards.
I guess I would have expected this to be a bit more obvious in the
documentation (e.g. watch out for unusual score values when documents are
unevenly spread across shards). Is the assumption that documents are
(without routing) spread randomly among shards, so the issue is not seen in
a large dataset?
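As a rough sketch of the 1-shard variant mentioned above: the shard count has to be set when the index is created, before any documents are added. The helper names shard_settings and create_items_index are hypothetical, used here only to keep the example short.

```shell
# Base URL of the node; localhost:9200 matches the curl commands in the thread.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helper: JSON settings body requesting a given number of
# primary shards for a new index.
shard_settings() {
  printf '{"settings":{"index":{"number_of_shards":%d}}}\n' "$1"
}

# Hypothetical helper: create the items index with that many shards.
# This must run before the first document is indexed.
create_items_index() {
  curl -s -XPUT "$ES_URL/items" -d "$(shard_settings "$1")"
}

# Print the settings body for a single-shard index.
shard_settings 1
```

Running create_items_index 1 on a clean node and then repeating the inserts should put every document on the same shard, so all of them are scored against the same term statistics.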
The results that you are seeing are an artefact of having too few docs in a
distributed environment. With a real application, you have many more docs,
so the differences even out.
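To put some numbers on that, here is a small sketch assuming the idf formula of Lucene's default TF-IDF similarity, idf = 1 + ln(maxDocs / (docFreq + 1)), evaluated separately on each shard. The shard sizes are made up for illustration and the function name idf is just a local helper.

```shell
# idf under Lucene's default TF-IDF similarity (assumed here):
#   1 + ln(maxDocs / (docFreq + 1))
# $1 = maxDocs on the shard, $2 = docFreq of the term on that shard
idf() {
  awk -v n="$1" -v df="$2" 'BEGIN { printf "%.4f\n", 1 + log(n / (df + 1)) }'
}

# Tiny index: one shard holds 1 matching doc, another holds 2.
idf 1 1        # prints 0.3069
idf 2 2        # prints 0.5945

# Larger index: shards holding 1000 and 1001 matching docs.
idf 1000 1000  # prints 0.9990
idf 1001 1001  # prints 0.9990
```

With one or two documents per shard, a single extra document nearly doubles the idf, which is why identical documents can score so differently. With around a thousand documents per shard the same one-document imbalance disappears at four decimal places.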
Thanks for that. I understand the concept that over time, with lots of
documents, the scoring stabilizes. I guess the real question is about a
beginner database (e.g. from the tutorial) where you insert a small number
of documents: this is probably going to be quite confusing, as the default
number of shards is 5. Then again, I guess people are not going to be
adding the same content multiple times.