Strange issue with similarity scores on ES 0.90.0 - possibly shard related?

Derry_O_Sullivan · May 21, 2013, 4:33pm

Hi all,

We've noticed a strange issue with similarity scores on ES. The outline of
the bug is that we get different tfidf scores back for exactly the same
documents (e.g. duplicate documents) after a number of document inserts.

Steps to reproduce:

Start a clean ES setup
No index/type mapping should be created
Add some content:
curl -XPUT 'http://localhost:9200/items/item/1' -d '{"language":"en","description":"some crm description data","title":"some crm title data"}'

So 1 document in a new index called items with a type called item:

Search for that 1 doc and check the score:
http://localhost:9200/items/item/_search?q=crm
0.13561106
Start adding that same doc multiple times (search every few occurrences):
curl -XPUT 'http://localhost:9200/items/item/2' -d '{"language":"en","description":"some crm description data","title":"some crm title data"}'
curl -XPUT 'http://localhost:9200/items/item/3' -d '{"language":"en","description":"some crm description data","title":"some crm title data"}'
curl -XPUT 'http://localhost:9200/items/item/4' -d '{"language":"en","description":"some crm description data","title":"some crm title data"}'
curl -XPUT 'http://localhost:9200/items/item/5' -d '{"language":"en","description":"some crm description data","title":"some crm title data"}'
Search again:
http://localhost:9200/items/item/_search?q=crm
We should get the same score as above for all the docs.
Search with title field:
http://localhost:9200/items/item/_search?q=title:crm
Same score for all docs (slightly higher, but only because we search 1 field instead of all):
0.15342641

Gets interesting here!

Add some new docs with slightly different text:
curl -XPUT 'http://localhost:9200/items/item/6' -d '{"language":"en","description":"some crm description data","title":"some crm title data crm"}'
curl -XPUT 'http://localhost:9200/items/item/7' -d '{"language":"en","description":"some crm description crm","title":"crm crm"}'

Now search again:
http://localhost:9200/items/item/_search?q=title:crm

Notice that not all the docs get the same score. Obviously I would have
expected different scores for documents 6 and 7, but documents 1-5 (which
are all identical) do not all get the same score:

{
  "_index": "items",
  "_type": "item",
  "_id": "2",
  "_score": 0.2972674,
  "_source": {
    "language": "en",
    "description": "some crm description data",
    "title": "some crm title data"
  }
},
{
  "_index": "items",
  "_type": "item",
  "_id": "4",
  "_score": 0.15342641,
  "_source": {
    "language": "en",
    "description": "some crm description data",
    "title": "some crm title data"
  }
}

Same if we don't search title explicitly:
http://localhost:9200/items/item/_search?q=crm
{
  "_index": "items",
  "_type": "item",
  "_id": "2",
  "_score": 0.26274976,
  "_source": {
    "language": "en",
    "description": "some crm description data",
    "title": "some crm title data"
  }
},
{
  "_index": "items",
  "_type": "item",
  "_id": "4",
  "_score": 0.13561106,
  "_source": {
    "language": "en",
    "description": "some crm description data",
    "title": "some crm title data"
  }
}

Note the scores are completely different for the same
title/description/etc. Is this issue related to sharding (e.g. documents
mapping to a particular shard) or something else? The same issue is seen
if we add 6 documents (all the same) to an index with 5 shards. The
explain output seems to show that maxDocs for tfidf is calculated per
shard rather than over the whole index. Is this expected?

We are using a completely blank setup of ES 0.90.0, with no complex
analyzer/settings/mapping.

http://localhost:9200/items/_settings

{
  "items": {
    "settings": {
      "index.number_of_shards": "5",
      "index.number_of_replicas": "1",
      "index.version.created": "900099"
    }
  }
}

http://localhost:9200/_mapping
{
  "items": {
    "item": {
      "properties": {
        "description": {
          "type": "string"
        },
        "language": {
          "type": "string"
        },
        "title": {
          "type": "string"
        }
      }
    }
  }
}

Help as always greatly appreciated

Derry

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Randall_McRee · May 21, 2013, 9:04pm

It's not strange, it's expected. Try the DFS_QUERY_THEN_FETCH search type
to solve this problem. Also use omit_norms if you don't want/need tfidf in a
field.
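
For anyone following along, here is a minimal sketch of what that suggestion could look like against the search URLs from the original post. It only builds the URLs and does not call a server. The helper name search_url is made up for illustration; search_type=dfs_query_then_fetch is the request parameter that asks each shard to exchange term statistics before scoring.

```python
from urllib.parse import urlencode


def search_url(base, index, doc_type, query, search_type=None):
    """Build a URI-search URL like the ones in the original post.

    search_url is a hypothetical helper, not part of any client library.
    """
    params = {"q": query}
    if search_type is not None:
        # Ask the shards to share term statistics before scoring.
        params["search_type"] = search_type
    return f"{base}/{index}/{doc_type}/_search?{urlencode(params)}"


# Default search: each shard scores with its own local statistics.
default_url = search_url("http://localhost:9200", "items", "item", "title:crm")

# Same search with index-wide statistics gathered first.
dfs_url = search_url(
    "http://localhost:9200", "items", "item", "title:crm",
    search_type="dfs_query_then_fetch",
)

print(default_url)
print(dfs_url)
```

The extra round trip to collect statistics costs some latency, which is presumably why it is not the default.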


Derry_O_Sullivan · May 21, 2013, 11:08pm

Hi Randall,

Thanks for the response. I guess the real question now is why the ES
default is 5 shards for new indexes (I understand sharding from an
indexing speed vs search speed perspective). If I redo the steps from my
original post with 1 shard, I don't get this issue, so for a 'small' index
it would make sense to use fewer shards.

I guess I would have expected this to be a bit more obvious in the
documentation (e.g. watch out for unusual score values among uneven (if
that is the term) shards). Is the assumption that documents are spread
randomly among shards (without routing), so the issue is not seen in a
large dataset?

Derry
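
A minimal sketch of the single-shard setup mentioned above, assuming the index is created explicitly before any documents are added. It only prepares the request and does not send it. The helper name create_index_request is hypothetical; number_of_shards and number_of_replicas are the same settings shown in the _settings output of the original post.

```python
import json


def create_index_request(base, index, shards=1, replicas=1):
    """Return (method, url, body) for creating an index with explicit shard settings.

    create_index_request is an illustrative helper, not a client API.
    """
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return "PUT", f"{base}/{index}", json.dumps(body)


method, url, body = create_index_request("http://localhost:9200", "items")
print(method, url)
print(body)
```

With a single shard, every document is scored against the same term statistics, so identical documents get identical scores. The shard count has to be chosen when the index is created.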

Clinton_Gormley · May 23, 2013, 10:06am

The results that you are seeing are an artefact of having too few docs in a
distributed environment. With a real application, you have many more docs,
so the differences even out.
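
To make the effect concrete, here is a rough worked example that reproduces the two title:crm scores from the original post. It is a sketch under stated assumptions, not the actual scoring code: it uses the classic Lucene TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)), assumes a field norm of 0.5 for the four-term title, and assumes (inferred from the numbers, not stated in the thread) that doc 4 sits alone on its shard while doc 2 shares a shard with one other document containing "crm".

```python
import math


def idf(doc_freq, max_docs):
    # Classic Lucene TF-IDF: idf = 1 + ln(maxDocs / (docFreq + 1)),
    # computed from the statistics of ONE shard by default.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def title_score(doc_freq, max_docs, term_freq=1, field_norm=0.5):
    # For a single-term query the query norm cancels the query-side idf,
    # leaving sqrt(tf) * idf * fieldNorm.
    # field_norm=0.5 is an assumption: 1 / sqrt(4 terms in the title).
    return math.sqrt(term_freq) * idf(doc_freq, max_docs) * field_norm


# Assumed shard layouts, inferred from the reported scores.
alone_on_shard = title_score(doc_freq=1, max_docs=1)   # doc 4 in the post
shares_a_shard = title_score(doc_freq=2, max_docs=2)   # doc 2 in the post

print(f"{alone_on_shard:.4f}")  # the post reports 0.15342641
print(f"{shares_a_shard:.4f}")  # the post reports 0.2972674
```

The document text is identical in both cases; only the per-shard docFreq and maxDocs differ. With many documents spread over the shards, those per-shard ratios converge and the gap disappears.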


Derry_O_Sullivan · May 23, 2013, 10:09am

Hey Clinton,

Thanks for that. I understand the concept that over time/with lots of
documents, the scoring stabilizes. I guess the real question is about a
beginner setup (e.g. from the tutorial) where you insert a low number of
documents: this is probably going to be quite confusing, as the default
shard count is 5. Then again, I guess people are not going to be adding the
same content multiple times.

Derry


Clinton_Gormley · May 23, 2013, 10:53am

It is a common source of confusion. But being able to scale out
automatically by default has greater benefit, hence the multiple shards.

clint

