Strange issue with similarity scores on ES 0.90.0 - possibly shard related?

Hi all,

We've noticed a strange issue with similarity scores on ES. The outline of
the bug is that we get different TF-IDF scores back for exactly the same
documents (i.e. duplicate documents) after a number of document inserts.

Steps to reproduce:

  1. Start a clean ES setup
  2. No index/type mapping should be created
  3. Add some content:
    curl -XPUT 'http://localhost:9200/items/item/1' -d \
    '{"language":"en","description":"some crm description data","title":"some crm title data"}'

So 1 document in a new index called items with a type called item:

  1. Search for that 1 doc and check the score:
    http://localhost:9200/items/item/_search?q=crm
    0.13561106

  2. Start adding that same doc multiple times (search every few occurrences)
    curl -XPUT 'http://localhost:9200/items/item/2' -d \
    '{"language":"en","description":"some crm description data","title":"some crm title data"}'
    curl -XPUT 'http://localhost:9200/items/item/3' -d \
    '{"language":"en","description":"some crm description data","title":"some crm title data"}'
    curl -XPUT 'http://localhost:9200/items/item/4' -d \
    '{"language":"en","description":"some crm description data","title":"some crm title data"}'
    curl -XPUT 'http://localhost:9200/items/item/5' -d \
    '{"language":"en","description":"some crm description data","title":"some crm title data"}'

  3. Search again:
    http://localhost:9200/items/item/_search?q=crm
    All the docs should get the same score as above.

  4. Search with title field:
    http://localhost:9200/items/item/_search?q=title:crm
    Same score for all docs (slightly higher, but only because we search 1
    field rather than all of them)
    0.15342641

Gets interesting here!

  1. Now search again:
    http://localhost:9200/items/item/_search?q=title:crm

Notice that not all the docs get the same score. Obviously I would have
expected different scores for documents 6/7, but documents 1-5 (which are
all identical) do not all get the same score:

    {
      "_index": "items",
      "_type": "item",
      "_id": "2",
      "_score": 0.2972674,
      "_source": {
        "language": "en",
        "description": "some crm description data",
        "title": "some crm title data"
      }
    },
    {
      "_index": "items",
      "_type": "item",
      "_id": "4",
      "_score": 0.15342641,
      "_source": {
        "language": "en",
        "description": "some crm description data",
        "title": "some crm title data"
      }
    }

  1. Same if we don't search title explicitly:
    http://localhost:9200/items/item/_search?q=crm

    {
      "_index": "items",
      "_type": "item",
      "_id": "2",
      "_score": 0.26274976,
      "_source": {
        "language": "en",
        "description": "some crm description data",
        "title": "some crm title data"
      }
    },
    {
      "_index": "items",
      "_type": "item",
      "_id": "4",
      "_score": 0.13561106,
      "_source": {
        "language": "en",
        "description": "some crm description data",
        "title": "some crm title data"
      }
    }

Note the scores are completely different for the same
title/description/etc. Is this issue related to sharding (e.g. documents
mapping to a particular shard) or something else? The same issue is seen
if we add 6 documents (all the same) to an index with 5 shards. The explain
output seems to show that maxDocs for the TF-IDF calculation is computed per
shard rather than over the whole index. Is this expected?
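
For illustration, the scores quoted above can be reproduced by hand under two assumptions: that scoring uses Lucene's classic TF-IDF similarity, and that each shard computes idf from its own document counts only. The sketch below is not taken from ES itself. The function names are made up for this example, the title norm of 0.5 assumes a 4-term title (1/sqrt(4)), and 0.3125 is an assumed value for the 9-term _all field after Lucene's lossy one-byte norm encoding.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(term_freq: int, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """Score of a single-term query: sqrt(tf) * idf * fieldNorm.

    For a one-term query the query norm cancels the query-side idf,
    so only the field weight remains.
    """
    return math.sqrt(term_freq) * idf(doc_freq, max_docs) * field_norm


# Assumed norms (hypothetical constants for this illustration).
TITLE_NORM = 0.5    # 1/sqrt(4) for the 4-term title
ALL_NORM = 0.3125   # assumed encoded value of 1/sqrt(9) for the _all field

# A shard holding one matching doc vs. a shard holding two matching docs.
title_one = term_score(1, doc_freq=1, max_docs=1, field_norm=TITLE_NORM)
title_two = term_score(1, doc_freq=2, max_docs=2, field_norm=TITLE_NORM)
# "crm" occurs twice in _all (once from description, once from title).
all_one = term_score(2, doc_freq=1, max_docs=1, field_norm=ALL_NORM)
all_two = term_score(2, doc_freq=2, max_docs=2, field_norm=ALL_NORM)

print(round(title_one, 6), round(title_two, 6))
print(round(all_one, 6), round(all_two, 6))
```

Under these assumptions the four values line up with 0.15342641 / 0.2972674 for q=title:crm and 0.13561106 / 0.26274976 for q=crm, which would be consistent with document 2 sharing its shard with one other document while document 4 sits alone on its shard.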

We are using a completely blank setup of ES 0.90.0, with no complex
analyzers, settings or mappings.

http://localhost:9200/items/_settings

{
  "items": {
    "settings": {
      "index.number_of_shards": "5",
      "index.number_of_replicas": "1",
      "index.version.created": "900099"
    }
  }
}

http://localhost:9200/_mapping
{
  "items": {
    "item": {
      "properties": {
        "description": {
          "type": "string"
        },
        "language": {
          "type": "string"
        },
        "title": {
          "type": "string"
        }
      }
    }
  }
}

Help as always greatly appreciated

Derry

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

It's not strange, it's expected. Try the DFS_QUERY_THEN_FETCH search type
to solve this problem:
http://www.elasticsearch.org/guide/reference/api/search/search-type/

Also use omit_norms if you don't want/need tfidf in a field.
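
As a rough sketch of the first suggestion, the search type can be passed as a URL parameter on the same query-string searches used in the original post. The helper name `search_url` is made up for this example, and the host, index and type are simply those from the reproduction steps.

```python
from urllib.parse import urlencode


def search_url(base: str, index: str, doc_type: str, query: str,
               search_type: str = "dfs_query_then_fetch") -> str:
    """Build a query-string search URL that asks for a specific search type."""
    params = urlencode({"q": query, "search_type": search_type})
    return f"{base}/{index}/{doc_type}/_search?{params}"


# Same search as in the original post, but with term statistics
# gathered across all shards before scoring.
url = search_url("http://localhost:9200", "items", "item", "title:crm")
print(url)
```

The extra round trip to collect term statistics makes this search type somewhat more expensive than the default, so it is usually a trade-off rather than a free fix.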

On Tue, May 21, 2013 at 9:33 AM, Derry O' Sullivan derryos@gmail.com wrote:

> Hi all,
>
> We've noticed a strange issue with similarity scores on ES. The outline of
> the bug is that we get different TF-IDF scores back for exactly the same
> documents (i.e. duplicate documents) after a number of document inserts.

Hi Randall,

Thanks for the response. I guess the real question now is why the ES
default is 5 shards for new indexes (I understand sharding from an
indexing speed vs search speed perspective). If I redo the steps from my
original post with 1 shard, I won't get this issue, so for a 'small' index
it would make sense to have fewer shards.
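
A minimal sketch of the single-shard variant, assuming the index is created explicitly before any documents are added. The helper name `create_index_request` is hypothetical, and the request is only built here, not sent, so the snippet runs without a server.

```python
import json
from urllib.request import Request


def create_index_request(base: str, index: str, shards: int = 1,
                         replicas: int = 1) -> Request:
    """Build (but do not send) a PUT request that creates an index
    with an explicit shard count instead of the default of 5."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return Request(
        f"{base}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = create_index_request("http://localhost:9200", "items")
# urllib.request.urlopen(req) would actually send it.
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With a single shard every document is scored against the same term statistics, which is why the duplicate documents would be expected to score identically.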

I guess I would have expected this to be a bit more obvious in the
documentation (e.g. watch out for unusual score values among uneven (if
that is the term) shards). Is the assumption that documents are (without
routing) spread randomly among shards, so the issue is not seen in a large
dataset?

Derry

On 21 May 2013 22:04, "Randall McRee" randall.mcree@gmail.com wrote:

> It's not strange, it's expected. Try the DFS_QUERY_THEN_FETCH search type
> to solve this problem:
> http://www.elasticsearch.org/guide/reference/api/search/search-type/
>
> Also use omit_norms if you don't want/need tfidf in a field.

The results that you are seeing are an artefact of having too few docs in a
distributed environment. With a real application, you have many more docs,
so the differences even out.
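
To illustrate how the differences shrink as the index grows, here is a small sketch that compares the per-shard idf of a term present in every document. It assumes Lucene's classic idf formula and uses round-robin placement as a simplified stand-in for hash-based routing. Real routing spreads documents less evenly than this, but the trend should be similar. The function names are made up for this example.

```python
import math
from collections import Counter


def idf(doc_freq: int, max_docs: int) -> float:
    """Classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def idf_spread(num_docs: int, num_shards: int = 5) -> float:
    """Largest minus smallest per-shard idf for a term found in every doc."""
    # Round-robin placement (assumption): doc ids 1..num_docs, id % shards.
    per_shard = Counter(doc_id % num_shards for doc_id in range(1, num_docs + 1))
    # Every doc contains the term, so docFreq equals maxDocs on each shard.
    idfs = [idf(count, count) for count in per_shard.values()]
    return max(idfs) - min(idfs)


for n in (6, 601, 60001):
    print(n, idf_spread(n))
```

With 6 documents one shard holds two documents and the others hold one, which gives a large idf gap; with tens of thousands of documents the gap becomes negligible.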

Hey Clinton,

Thanks for that. I understand the concept that over time/with lots of
documents, the scoring stabilizes. I guess the real question is about a
beginner's index (e.g. from the tutorial/etc) where you insert a small
number of documents: this is probably going to be quite confusing, as the
default shard count is 5. Then again, I guess people are not going to be
adding the same content multiple times :wink:

Derry

On 23 May 2013 11:06, Clinton Gormley clint@traveljury.com wrote:

> The results that you are seeing are an artefact of having too few docs in
> a distributed environment. With a real application, you have many more
> docs, so the differences even out.

It is a common source of confusion, but being able to scale out
automatically by default has greater benefit, hence the multiple shards.

clint

On 23 May 2013 12:09, Derry O' Sullivan derryos@gmail.com wrote:

> Hey Clinton,
>
> Thanks for that. I understand the concept that over time/with lots of
> documents, the scoring stabilizes. I guess the real question is about a
> beginner's index (e.g. from the tutorial/etc) where you insert a small
> number of documents: this is probably going to be quite confusing, as the
> default shard count is 5. Then again, I guess people are not going to be
> adding the same content multiple times :wink: