Scores and order differ after reindex

Fluckx · November 22, 2013, 10:27am

Hello,

I am currently writing a unit test to verify the responses from
Elasticsearch, so that we can run the same tests against newer versions of
Elasticsearch and see whether it is safe to upgrade.

The flow of the unit test:

Create a new index
Put the mapping
Insert data in bulk
Flush the index
Optimize the index to 1 segment
Refresh the index
Perform queries and assertions
Remove the index
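
For readers who want to reproduce this setup, the flow above maps roughly onto the REST calls listed in the sketch below. This is not the poster's actual test code: the index name `test_idx`, the type name `doc`, and the helper `build_test_plan` are made up for illustration, and the paths are the 0.90-era endpoints (`_optimize` was renamed `_forcemerge` in later releases).

```python
def build_test_plan(index="test_idx", doc_type="doc"):
    """Return the (HTTP method, path) pairs for one test run, in order.

    The index and type names are hypothetical placeholders.
    """
    return [
        ("PUT", f"/{index}"),                                # create a new index
        ("PUT", f"/{index}/{doc_type}/_mapping"),            # put the mapping
        ("POST", f"/{index}/_bulk"),                         # insert data in bulk
        ("POST", f"/{index}/_flush"),                        # flush the index
        ("POST", f"/{index}/_optimize?max_num_segments=1"),  # optimize to 1 segment
        ("POST", f"/{index}/_refresh"),                      # refresh the index
        ("POST", f"/{index}/_search"),                       # queries and assertions
        ("DELETE", f"/{index}"),                             # remove the index
    ]


if __name__ == "__main__":
    for method, path in build_test_plan():
        print(method, path)
```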

The problem is that the unit tests are unreliable: every time I run them,
the results come back in a different order.
For example:

On the first run of the unit test I get the results of a query in this order:

Document A
Document B
Document C
Document D
Document E
Document F

On the second run of the unit test (immediately after), the results look
something like this:

Document A
Document B
Document D
Document C
Document E
Document F

The third run gives something similar again.

Looking at the result set, I noticed that the scores are different on each run
(keep in mind that every run creates a new index). When I recreate the same
index 10 times and run my queries, some items suddenly score higher than
others, even though the Elasticsearch version is the same and the data is
exactly the same (it is a file that contains all the bulk data).

Can anybody explain why this happens, or how I can get around it?
I'd assume that running a query on an index that is built in exactly the same
way 5 times should return the same results every time, especially since I
flush, optimize and refresh. I assume all the documents are indexed.

The index isn't that big (around 8000 documents).

Extra information:

Version: 0.90.5
OS: linux

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · November 22, 2013, 1:23pm

When you use the bulk API, are you providing an id for each doc?
Or are they auto-generated?

I suppose that you have more than 1 shard for your index, right?

On a side note, you probably don't need to optimize your index.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Fluckx · November 22, 2013, 2:54pm

Hi David!

Thanks for the reply.

All the documents are indexed with their own ids (not auto-generated).

This unit test runs its queries on a single node with 1 shard (the
production cluster has replication and multiple shards of course, but this
unit test just creates an index, inserts data, tries the queries, and
removes the index again).

I have also discovered what the problem was. It's really stupid, but the
reason that some documents kept switching order is that they had the
exact same score.
So I decided to add a sort to the query so the return order is more
consistent.
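
A sort that breaks ties might look like the sketch below. It assumes each document carries a unique, not-analyzed field, here called `doc_id` (a hypothetical name; the thread does not say which field was used), and the `match` query on `title` is likewise a placeholder. The request body sorts by `_score` first and falls back to `doc_id`; the small `rank` function mimics that ordering locally to show that equal scores no longer lead to an arbitrary order.

```python
import json

# Search body: relevance first, then a unique field as tie-breaker.
# "doc_id", "title" and the query text are hypothetical.
search_body = {
    "query": {"match": {"title": "example"}},
    "sort": [
        {"_score": {"order": "desc"}},
        {"doc_id": {"order": "asc"}},
    ],
}


def rank(hits):
    """Order hits the way the sort above would: score desc, then doc_id asc."""
    return sorted(hits, key=lambda h: (-h["_score"], h["doc_id"]))


if __name__ == "__main__":
    print(json.dumps(search_body, indent=2))
    run_1 = [{"doc_id": "C", "_score": 1.5}, {"doc_id": "D", "_score": 1.5}]
    run_2 = list(reversed(run_1))  # same hits, arriving in a different order
    print(rank(run_1) == rank(run_2))
```

The tie-breaker only helps when the scores are exactly equal; it does nothing for scores that differ slightly between runs.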

dadoonet · November 22, 2013, 3:01pm

Ha! Thanks for the update.

Fluckx · November 29, 2013, 1:43pm

Hello again.

I seem to be running into the same issue again. Unfortunately it's not as
simple as the sort order this time.
It works most of the time, but occasionally the last two items switch order.

The _score values of the two items differ, but they are never very far apart
(at most 0.03). Occasionally they switch order because the
last item scores marginally higher than the item before it.

For clarity:

If I run my query multiple times on the same index, the scores don't
change. But since the index is recreated every time the unit test runs,
the scores do change (which is a little weird, I suppose).

Elasticsearch version is still 0.90.5.
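
Until the cause of the score drift is known, one possible workaround on the test side is to assert on the ranking only where the scores are clearly separated. The sketch below is an illustration under stated assumptions, not something proposed in the thread: the helper `group_by_score` and the `tolerance` of 0.05 are made up, the latter chosen only because the observed difference was at most 0.03. It groups hits (already sorted by score, descending) into buckets of near-equal score, so a test can compare each bucket as a set instead of as an ordered list.

```python
def group_by_score(hits, tolerance=0.05):
    """Group (doc_id, score) pairs, sorted by score descending, into buckets.

    A hit joins the current bucket when its score is within `tolerance`
    of the bucket's first (highest) score; otherwise it starts a new one.
    """
    buckets = []
    bucket_top = None
    for doc_id, score in hits:
        if bucket_top is not None and bucket_top - score <= tolerance:
            buckets[-1].add(doc_id)
        else:
            buckets.append({doc_id})
            bucket_top = score
    return buckets


if __name__ == "__main__":
    # Invented scores: the last two hits swap places between runs.
    run_1 = [("A", 3.20), ("B", 2.10), ("E", 1.02), ("F", 1.00)]
    run_2 = [("A", 3.21), ("B", 2.09), ("F", 1.03), ("E", 1.01)]
    print(group_by_score(run_1) == group_by_score(run_2))
```

A fixed threshold has an obvious weakness: two scores that straddle a bucket boundary can still land in different buckets on different runs, so this hides the symptom and does not explain it.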

dadoonet · November 29, 2013, 1:53pm

Are you indexing new documents in the meantime?

Fluckx · November 29, 2013, 2:35pm

Hi David!

Thanks for the swift reply.
There are no documents being indexed in the meantime.

I generate a unique name for the index.

I am currently:

Creating the index
Putting the mapping
Inserting the data from the bulk file
Flushing and refreshing the index:

    $client->indices()->flush(array(
        'index'   => $indexname,
        'refresh' => true,
    ));

This is all done in the setup of the unit test (before any test is run),
and the index is removed after all tests have run.

Henrik_Nordvik · December 2, 2013, 9:23am

Hi,
Are you using dfs_query_then_fetch? It should help with getting the same score.

If that doesn't help, then it could be because Lucene uses internal doc ids
as a tie-breaker when the scores are the same. Not sure what the best fix for
this is, though.
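
To illustrate why dfs_query_then_fetch can matter: with the default search type each shard scores with its own term statistics, while the DFS variant first collects them across shards. The sketch below uses the IDF formula from Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)); the shard sizes and document frequencies are invented numbers, not data from this thread.

```python
import math


def idf(num_docs, doc_freq):
    """IDF as in Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Invented per-shard statistics for one term:
# (documents in the shard, documents containing the term)
shards = [(4000, 40), (4000, 10)]

# What each shard would use on its own.
per_shard = [idf(n, df) for n, df in shards]

# What every shard would use once the statistics are combined.
global_idf = idf(sum(n for n, _ in shards), sum(df for _, df in shards))

if __name__ == "__main__":
    print("per-shard idf:", [round(v, 3) for v in per_shard])
    print("global idf:   ", round(global_idf, 3))
```

The search type is passed as a URL parameter, for example `_search?search_type=dfs_query_then_fetch`. With a single shard, as in the test setup described earlier in the thread, shard-local and global statistics coincide, so this would presumably not change the scores there.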

github.com/elastic/elasticsearch

inconsistent sorting by score due to differences between primary and replicas

opened 09:59AM - 27 Aug 13 UTC

closed 10:29AM - 27 Aug 13 UTC

lmenezes

not really sure if this is a bug or not. the first part is: - is it normal that shards have different number of max_docs? what could cause that? a fast insertion + delete(that my guess is wont be replicated to the other shards). and of course, i guess that if they have different number of max docs, they most likely will also have different term freq and whatnot. the second, based on if the previous is true: - is it then possible to have a consistent sorting(based on score) with this scenario? i currently have for a simple match query, completely different result lists based on the shards that the query hits.

http://web.archiveorange.com/archive/v/AAfXfnLZCbyQTykIeQWm

Henrik Nordvik

On Friday, November 29, 2013 3:35:52 PM UTC+1, Fluckx wrote:

Hi David!

Thanks for the swift reply.
There are no documents being indexed in the meantime.

I generate a unique name for the index.

I am currently

Creating the index

Putting the mapping

Inserting the data from the bulkfile

flushing and refreshing the index ( $client->indices()->flush ( array(
'index' => $indexname,
'refresh' => true,
) );

this is all done in the setup of the unittest ( before any test is run )
and the index is removed after all tests are run.

On Friday, 29 November 2013 14:53:25 UTC+1, David Pilato wrote:
Are you indexing new documents in the meantime?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfrhttps://twitter.com/elasticsearchfr

Le 29 novembre 2013 at 14:43:36, Fluckx (filip.van...@gmail.com) a écrit:

Hello again.

I seem to be running into the same issue again. Unfortunately it's not as
simple as the sorting order this time.
It works most of the time, but occasionally the last two items switch
order.

The _score of both items also differ, but they're never very fart apart(
up to maximum 0.03 difference ). Occasionally they switch order because the
last item scores minimally higher than the item before it.

For clarity:

if i run my query multiple times on the same index - the scores don't
change. But since the index is recreated every time the unittest is run -
the scores do change ( which is a little weird i suppose ).

Elasticsearch version is still 0.90.5.

On Friday, 22 November 2013 16:01:11 UTC+1, David Pilato wrote:
Ha! Thanks for the update.
 -- 
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfrhttps://twitter.com/elasticsearchfr

Le 22 novembre 2013 at 15:54:32, Fluckx (filip.van...@gmail.com) a
écrit:

Hi David!

Thanks for the reply.

All the documents are indexed with their own id's ( not auto generated ).

This unittest runs it's queries on a single node with 1 shard ( The
production cluster has replication and multiple shards of course, but this
unittest just creates an index - inserts data - tries the queries - and
removes the index again ).

I have also discovered what the problem was. It's really stupid, but the
reason that some documents kept switching order is because they had the
exact same score.
So I decided to add a sort to the query so the return order is more
consistent.

On Friday, 22 November 2013 14:23:47 UTC+1, David Pilato wrote:
When you use the bulk, are you providing id for each doc?
Or are they auto generated?

I suppose that you have more than 1 shard for your index, right?

On a side note, you probably don't need to optimize your index.
 -- 
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr


Fluckx · December 2, 2013, 10:12am

Hey,

I doubt that will fix the problem. My queries run on a node that has only
1 shard and 0 replicas, so I don't think dfs_query_then_fetch will
solve this. I will try it though. If it doesn't work I will try to post a
sample to further demonstrate my problem.

The weirdest thing for me is that the scores differ. Say I create a file
with bulk inserts, create 2 indexes, and insert this bulk data into
both.
If I then run the same query against both indexes, shouldn't they return
the exact same result with the exact same scores? The data is identical and
both indexes are optimized and refreshed after the insert ( to make sure
no data is lagging behind ). The "scoring" algorithm has the same
information ( apart from the indexes being created at different times ).

On Monday, 2 December 2013 10:23:21 UTC+1, Henrik Nordvik wrote:

Hi,
Are you using dfs_query_then_fetch? It should help getting the same score.

Elasticsearch Platform — Find real-time answers at scale | Elastic

If that doesn't help then it could be because Lucene uses internal doc-ids
as tie-breaker when the scores are the same. Not sure what the best fix for
this is though.
inconsistent sorting by score due to differences between primary and replicas · Issue #3578 · elastic/elasticsearch · GitHub
http://web.archiveorange.com/archive/v/AAfXfnLZCbyQTykIeQWm

Henrik Nordvik


Fluckx · December 2, 2013, 10:43am

Hello again,

I noticed that the max number of documents and the number of deleted
documents differed between indexes. I don't even know how this is possible
( considering the same file should yield the same results every time? ).

I optimized my index with max_num_segments set to 1 after inserting the
data.
Ever since, the scoring has been "stable" across all the indexes.

If I have some spare time I should try to see why 1 bulkfile produces
different numbers of deleted documents across indexes. It does explain
why the scoring was different every time I queried and why it's now the same.
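
A rough sketch of why leftover deleted documents can shift scores, assuming the index uses Lucene's classic TF-IDF similarity, where idf = 1 + ln(maxDoc / (docFreq + 1)). Both statistics still count deleted documents until a merge removes them. The numbers below are made up for illustration and are not measured from the index in this thread.

```python
import math


def classic_idf(doc_freq, max_doc):
    # Classic Lucene TF-IDF inverse document frequency:
    # 1 + ln(maxDoc / (docFreq + 1)).
    return 1.0 + math.log(max_doc / (doc_freq + 1))


# Illustrative numbers only.
live_docs = 8000   # documents that are actually searchable
term_docs = 40     # live documents containing the query term

# Index A: merged down to one segment, no deleted documents left.
idf_clean = classic_idf(term_docs, live_docs)

# Index B: 150 deleted documents not yet merged away,
# 3 of which contain the term.
idf_with_deletes = classic_idf(term_docs + 3, live_docs + 150)

difference = abs(idf_clean - idf_with_deletes)
```

With these made-up numbers the two idf values differ by a few hundredths, which is the same order of magnitude as the score drift described earlier in the thread. Merging down to a single segment drops the deleted documents, so both indexes end up with the same statistics.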

