Scores and order differ after reindex


(Fluckx) #1

Hello,

I am currently writing a unit test to verify the responses of
Elasticsearch, so that we can run the same tests against newer versions of
Elasticsearch and see whether it is safe to upgrade.

The flow of the unittest:

  1. Create a new index
  2. Put the mapping
  3. Insert data in bulk
  4. Flush the index
  5. Optimize the index to 1 segment
  6. Refresh the index
  7. Perform queries and assertions
  8. Remove the index
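
For reference, the flow above roughly corresponds to the REST calls listed below on a 0.90-era node. This is only a sketch in Python that builds the list of requests without sending anything. The index name `test_index`, the type name `doc`, and the helper `build_test_flow` are made up for illustration, and the exact endpoints should be checked against the documentation of the version under test.

```python
def build_test_flow(index, doc_type):
    """Return the (method, path) pairs for one test run, in order.

    Nothing is sent over the network; this only documents the sequence.
    Endpoint names follow Elasticsearch 0.90 (for example _optimize,
    which later versions renamed).
    """
    return [
        ("PUT", "/%s" % index),                                # 1. create the index
        ("PUT", "/%s/%s/_mapping" % (index, doc_type)),        # 2. put the mapping
        ("POST", "/%s/_bulk" % index),                         # 3. insert data in bulk
        ("POST", "/%s/_flush" % index),                        # 4. flush
        ("POST", "/%s/_optimize?max_num_segments=1" % index),  # 5. optimize to 1 segment
        ("POST", "/%s/_refresh" % index),                      # 6. refresh
        ("POST", "/%s/_search" % index),                       # 7. run the queries
        ("DELETE", "/%s" % index),                             # 8. remove the index
    ]


# Hypothetical names, used only to print the sequence.
for method, path in build_test_flow("test_index", "doc"):
    print(method, path)
```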

The problem is that these unit tests are unreliable: every time I run
them, the results come back in a different order.
For example:

On the first run of the unit test, I get the results of a query in this order:

Document A
Document B
Document C
Document D
Document E
Document F

On the second run of the unit test ( immediately after ), the results look
something like this:

Document A
Document B
Document D
Document C
Document E
Document F

The third run is something similar again.

Looking at the result set, I noticed that the scores differ with each run
( keep in mind that every run creates a new index ). When I recreate the
same index 10 times and run my queries, some items suddenly score higher
than others, even though the Elasticsearch version is the same and the
data is exactly the same ( it's a file that contains all the bulk data ).

Can anybody explain why this happens, or how I can get around it?
I'd assume that running a query on an index that is built the exact same
way 5 times should return the same results every time, especially since I
flush, optimize and refresh, so all the documents should be indexed.

The index isn't that big ( around 8000 documents ).

Extra information:

Version: 0.90.5
OS: linux

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

When you use the bulk API, are you providing an id for each doc?
Or are they auto-generated?

I suppose that you have more than 1 shard for your index, right?

On a side note, you probably don't need to optimize your index.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

(In reply to Fluckx's message of 22 November 2013 at 11:27:50.)


(Fluckx) #3

Hi David!

Thanks for the reply.

All the documents are indexed with their own ids ( not auto-generated ).

This unit test runs its queries on a single node with 1 shard ( the
production cluster has replication and multiple shards of course, but this
unit test just creates an index, inserts data, tries the queries, and
removes the index again ).

I have also discovered what the problem was. It's really stupid, but the
reason some documents kept switching order is that they had the exact
same score.
So I added a sort to the query to make the return order consistent.
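
To illustrate the fix: when two hits have exactly the same score, their relative order is not something the test can rely on, so a secondary sort key is needed. A minimal Python sketch follows. The field name `doc_id` and the sample hits are invented for illustration, and the request body only shows the general shape of a `sort` clause, not the actual query used here.

```python
def with_tiebreaker(query, unique_field):
    """Return a search body sorted by score, then by a unique field."""
    return {
        "query": query,
        "sort": [
            {"_score": {"order": "desc"}},
            {unique_field: {"order": "asc"}},  # hypothetical unique field
        ],
    }


def rank(hits):
    """Local equivalent: order by score descending, then doc_id ascending."""
    return sorted(hits, key=lambda h: (-h["score"], h["doc_id"]))


# The same hits arriving in two different orders, with C and D tied on score.
run_1 = [{"doc_id": "C", "score": 1.5}, {"doc_id": "D", "score": 1.5},
         {"doc_id": "A", "score": 3.0}]
run_2 = [{"doc_id": "D", "score": 1.5}, {"doc_id": "A", "score": 3.0},
         {"doc_id": "C", "score": 1.5}]

print([h["doc_id"] for h in rank(run_1)])  # ['A', 'C', 'D']
print([h["doc_id"] for h in rank(run_2)])  # ['A', 'C', 'D']
```

The secondary field has to be unique per document and sortable (for example a number or a not-analyzed string), otherwise ties can still occur.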

(In reply to David Pilato's message of Friday, 22 November 2013 14:23:47 UTC+1.)


(David Pilato) #4

Ha! Thanks for the update.

(In reply to Fluckx's message of 22 November 2013 at 15:54:32.)


(Fluckx) #5

Hello again.

I seem to be running into the same issue again. Unfortunately, it's not as
simple as the sort order this time.
It works most of the time, but occasionally the last two items switch order.

The _score values of the two items differ, but they're never very far
apart ( 0.03 at most ). Occasionally they switch order because the
last item scores minimally higher than the item before it.

For clarity:

If I run my query multiple times on the same index, the scores don't
change. But since the index is recreated every time the unit test runs,
the scores do change ( which is a little weird, I suppose ).
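
One possible way to make such a test less brittle, offered as a suggestion rather than something taken from this thread, is to assert the order only between hits whose scores differ by more than a chosen tolerance. The sketch below is Python. The helper name `order_violations`, the tolerance of 0.05, and the sample scores are all assumptions for illustration.

```python
def order_violations(expected_ids, hits, tolerance):
    """List pairs whose order contradicts the expectation by a clear margin.

    expected_ids: ids in the order the test expects.
    hits: (doc_id, score) tuples as returned by the search; every expected
          id is assumed to be present.
    A pair counts as a violation only if the hit expected first scores
    lower than the other one by more than `tolerance`.
    """
    score = dict(hits)
    violations = []
    for i, first in enumerate(expected_ids):
        for second in expected_ids[i + 1:]:
            if score[second] - score[first] > tolerance:
                violations.append((first, second))
    return violations


expected = ["A", "B", "C"]
# B and C swap places, but only by 0.03, which is inside the tolerance.
near_tie = [("A", 2.40), ("C", 1.13), ("B", 1.10)]
# Here C clearly outranks A and B, which should fail the test.
real_change = [("C", 3.00), ("A", 2.40), ("B", 1.10)]

print(order_violations(expected, near_tie, tolerance=0.05))     # []
print(order_violations(expected, real_change, tolerance=0.05))  # [('A', 'C'), ('B', 'C')]
```

This does not explain why the scores move between index rebuilds; it only keeps the test from failing on differences smaller than the tolerance.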

Elasticsearch version is still 0.90.5.

(In reply to David Pilato's message of Friday, 22 November 2013 16:01:11 UTC+1.)


(David Pilato) #6

Are you indexing new documents in the meantime?

(In reply to Fluckx's message of 29 November 2013 at 14:43:36.)


(Fluckx) #7

Hi David!

Thanks for the swift reply.
There are no documents being indexed in the meantime.

I generate a unique name for the index.

I am currently:

  1. Creating the index
  2. Putting the mapping
  3. Inserting the data from the bulk file
  4. Flushing and refreshing the index:

    $client->indices()->flush(array(
        'index'   => $indexname,
        'refresh' => true,
    ));

This is all done in the setup of the unit test ( before any test is run ),
and the index is removed after all tests have run.

(In reply to David Pilato's message of Friday, 29 November 2013 14:53:25 UTC+1.)


(Henrik Nordvik) #8

Hi,
Are you using dfs_query_then_fetch? It should help with getting the same score.

If that doesn't help, then it could be because Lucene uses internal doc ids
as a tie-breaker when the scores are the same. I'm not sure what the best
fix for this is, though.
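
As background on why dfs_query_then_fetch can matter: with the TF-IDF scoring that Lucene used by default at the time, the inverse document frequency of a term depends on how many documents contain it, and with the default search type each shard computes that from its own documents. The Python sketch below uses the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)). The shard sizes and term counts are invented for illustration, and on a single-shard index the local and global values coincide.

```python
import math


def idf(num_docs, doc_freq):
    """Classic Lucene TF-IDF inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


# Hypothetical two-shard index:
# (documents on the shard, documents on the shard containing the term)
shards = [(4000, 10), (4000, 200)]

# Default search type: each shard scores with its own statistics.
local = [idf(n, df) for n, df in shards]

# dfs_query_then_fetch: statistics are summed across shards first.
total_docs = sum(n for n, _ in shards)
total_df = sum(df for _, df in shards)
global_idf = idf(total_docs, total_df)

print("per-shard idf:", [round(v, 3) for v in local])
print("global idf:   ", round(global_idf, 3))
```

The same term gets a noticeably different weight on each shard unless the statistics are combined, which is what the dfs phase does.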


http://web.archiveorange.com/archive/v/AAfXfnLZCbyQTykIeQWm

Henrik Nordvik

(In reply to Fluckx's message of Friday, November 29, 2013 3:35:52 PM UTC+1.)

Hi David!

Thanks for the swift reply.
There are no documents being indexed in the meantime.

I generate a unique name for the index.

I am currently

  1. Creating the index
  2. Putting the mapping
  3. Inserting the data from the bulkfile
  4. flushing and refreshing the index ( $client->indices()->flush ( array(
    'index' => $indexname,
    'refresh' => true,
    ) );

this is all done in the setup of the unittest ( before any test is run )
and the index is removed after all tests are run.

On Friday, 29 November 2013 14:53:25 UTC+1, David Pilato wrote:

Are you indexing new documents in the meantime?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfrhttps://twitter.com/elasticsearchfr

Le 29 novembre 2013 at 14:43:36, Fluckx (filip.van...@gmail.com) a écrit:

Hello again.

I seem to be running into the same issue again. Unfortunately it's not as
simple as the sorting order this time.
It works most of the time, but occasionally the last two items switch
order.

The _score of both items also differ, but they're never very fart apart(
up to maximum 0.03 difference ). Occasionally they switch order because the
last item scores minimally higher than the item before it.

For clarity:

if i run my query multiple times on the same index - the scores don't
change. But since the index is recreated every time the unittest is run -
the scores do change ( which is a little weird i suppose ).

Elasticsearch version is still 0.90.5.

On Friday, 22 November 2013 16:01:11 UTC+1, David Pilato wrote:

Ha! Thanks for the update.

 -- 

David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfrhttps://twitter.com/elasticsearchfr

Le 22 novembre 2013 at 15:54:32, Fluckx (filip.van...@gmail.com) a
écrit:

Hi David!

Thanks for the reply.

All the documents are indexed with their own id's ( not auto generated ).

This unittest runs it's queries on a single node with 1 shard ( The
production cluster has replication and multiple shards of course, but this
unittest just creates an index - inserts data - tries the queries - and
removes the index again ).

I have also discovered what the problem was. It's really stupid, but the
reason that some documents kept switching order is because they had the
exact same score.
So I decided to add a sort to the query so the return order is more
consistent.

On Friday, 22 November 2013 14:23:47 UTC+1, David Pilato wrote:

When you use the bulk, are you providing id for each doc?
Or are they auto generated?

I suppose that you have more than 1 shard for your index, right?

On a side note, you probably don't need to optimize your index.

 -- 

David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr



(Fluckx) #9

Hey,

I doubt that will fix the problem. My queries run on a node that has only
1 shard and 0 replicas, so I don't think dfs_query_then_fetch will
solve this. I will try it though. If it doesn't work I will try to post a
sample to further demonstrate my problem.

The weirdest thing for me is that the scores differ. Say I create a file
with bulk inserts, create 2 indexes, and insert this bulk data into both.
If I then run the same query against both indexes, shouldn't they return
the exact same result with the exact same scores? The data is identical and
both indexes are optimized and refreshed after the insert (to make sure
no data is lagging behind). The scoring algorithm has the same
information in both cases (apart from the indexes being created at
different times).

On Monday, 2 December 2013 10:23:21 UTC+1, Henrik Nordvik wrote:

Hi,
Are you using dfs_query_then_fetch? It should help getting the same score.

http://www.elasticsearch.org/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch/

If that doesn't help, then it could be because Lucene uses internal doc IDs
as a tie-breaker when the scores are the same. Not sure what the best fix
for this is though.
https://github.com/elasticsearch/elasticsearch/issues/3578
http://web.archiveorange.com/archive/v/AAfXfnLZCbyQTykIeQWm

Henrik Nordvik
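
A toy illustration of what dfs_query_then_fetch changes, in Python. It assumes the classic Lucene TF-IDF inverse document frequency, `1 + ln(maxDoc / (docFreq + 1))`, and made-up per-shard statistics. It is a sketch of the idea only, not Elasticsearch code.

```python
import math


def idf(doc_freq, max_doc):
    """Classic Lucene TF-IDF inverse document frequency (assumed here)."""
    return 1.0 + math.log(max_doc / (doc_freq + 1))


# Hypothetical statistics for one term on two shards:
# (documents containing the term, documents in the shard).
shards = [(10, 4000), (40, 4000)]

# query_then_fetch: each shard scores with its own local statistics.
local_idfs = [idf(df, n) for df, n in shards]

# dfs_query_then_fetch: statistics are summed across shards first,
# so every shard scores with the same global numbers.
global_df = sum(df for df, _ in shards)
global_n = sum(n for _, n in shards)
global_idf = idf(global_df, global_n)

# With a single shard, local and global statistics are the same thing.
single = [(10, 4000)]
single_local = idf(*single[0])
single_global = idf(sum(df for df, _ in single), sum(n for _, n in single))
```

The last two values are identical, which is why the search type is not expected to matter for an index with one shard.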

On Friday, November 29, 2013 3:35:52 PM UTC+1, Fluckx wrote:

Hi David!

Thanks for the swift reply.
There are no documents being indexed in the meantime.

I generate a unique name for the index.

I am currently

  1. Creating the index
  2. Putting the mapping
  3. Inserting the data from the bulk file
  4. Flushing and refreshing the index:

    $client->indices()->flush(array(
        'index' => $indexname,
        'refresh' => true,
    ));

This is all done in the setup of the unit test (before any test is run),
and the index is removed after all tests have run.
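
The setup steps above can be sketched as a list of REST calls, here built in Python without sending anything. The index name prefix `unittest_` and the type name `doc` are hypothetical, and flush and refresh are shown as two separate calls rather than a flush with a refresh flag.

```python
import uuid


def unique_index_name(prefix="unittest_"):
    """Generate a fresh index name for each test run (hypothetical prefix)."""
    return prefix + uuid.uuid4().hex


def setup_requests(index, doc_type="doc"):
    """Return the (method, path) pairs for the test setup, in order."""
    return [
        ("PUT", f"/{index}"),                      # 1. create the index
        ("PUT", f"/{index}/{doc_type}/_mapping"),  # 2. put the mapping
        ("POST", f"/{index}/_bulk"),               # 3. insert the bulk file
        ("POST", f"/{index}/_flush"),              # 4a. flush
        ("POST", f"/{index}/_refresh"),            # 4b. refresh
    ]


def teardown_requests(index):
    """Return the call that removes the index after all tests have run."""
    return [("DELETE", f"/{index}")]


index_name = unique_index_name()
plan = setup_requests(index_name)
```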

On Friday, 29 November 2013 14:53:25 UTC+1, David Pilato wrote:

Are you indexing new documents in the meantime?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr



(Fluckx) #10

Hello again,

I noticed that my max number of documents and my number of deleted
documents differed between indexes. I don't even know how this is possible
(considering the same file should yield the same results every time?).

I optimized my index with max_num_segments set to 1 after inserting the
data.
Ever since, the scoring has been "stable" across all the indexes.

If I have some spare time I should try to find out why one bulk file
produces different numbers of deleted documents across indexes. It does
explain why the scoring differed between runs and why it's now the same.
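
A rough sketch of why leftover deleted documents can shift scores, in Python. It assumes the classic Lucene TF-IDF inverse document frequency, `1 + ln(maxDoc / (docFreq + 1))`, and assumes that deleted documents keep counting toward those statistics until a merge drops them. All numbers are made up for illustration.

```python
import math


def idf(doc_freq, max_doc):
    """Classic Lucene TF-IDF inverse document frequency (assumed here)."""
    return 1.0 + math.log(max_doc / (doc_freq + 1))


live_docs = 8000      # documents the bulk file ends up with
live_term_docs = 120  # live documents containing the query term

# Run A: 50 deleted documents still sit in unmerged segments,
# 5 of them containing the term.
idf_run_a = idf(live_term_docs + 5, live_docs + 50)

# Run B: merges happened at different moments, leaving 300 deleted
# documents behind, 40 of them containing the term.
idf_run_b = idf(live_term_docs + 40, live_docs + 300)

# After optimizing down to one segment (assuming the merge drops the
# deleted documents), both runs see only the live documents.
idf_after_optimize = idf(live_term_docs, live_docs)
```

Under these assumptions the two runs score the same term differently until the optimize step removes the deleted documents, at which point the statistics depend only on the bulk data.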



(system) #11