Duplicate data in same index

aschaar · December 9, 2011, 1:40am

Here are the duplicate records:

http://pastebin.mozilla.org/1397732

ES Configuration:
http://oremj.pastebin.mozilla.org/1397404

Back story:
We pushed a new release on Wednesday around 2pm. At that time we deleted the index, set up the mapping, and reindexed everything. I watched the results on our search page drop to 0, then climb back up to around 1k as the indexing task completed. Everything looked fine.

However, this morning (Thursday) search was suddenly returning twice as many results, over 2k. We do have a nightly index refresh task that kicks off around 2:30am. Notice the "last_update": "2011-12-08T02:30:28" value on the second record, while the first record has "last_update": "2011-12-07T03:12:17", which is Tuesday night, before we did the release and before we deleted the index.

So the question is, how did these records come back from the dead? Replication error? Nodes not all in sync?

Let me know if you need any more information.

Arron
#flightdeck @ irc.mozilla.org

kimchy · December 9, 2011, 3:26pm

Which version are you using? Are you using routing or parent/child mapping?
Can you run the same search with explain set to true and gist the result?

aschaar · December 9, 2011, 5:38pm

Version 0.17.4.

No routing or parent/child mapping.

Gist with explain=true
http://oremj.pastebin.mozilla.org/1398661

aschaar · December 12, 2011, 5:31pm

Bump...

aschaar · December 13, 2011, 7:03pm

Servers:
http://pastebin.mozilla.org/1404356

Here are the results from running a query with explain=true from last week.
http://pastebin.mozilla.org/1404340

Notable differences

_shard: 1, _node: "wI_Er7xrQLWd7J1Kdh7Zsw" has the old package of "last_update": "2011-12-07T03:12:17"

_shard: 4 of each node had the recent package "last_update": "2011-12-09T02:30:30"

Today we ran the query again and here are the results:
http://pastebin.mozilla.org/1404342

Notable differences

_shard: 1 of each node has the old package of "last_update": "2011-12-12T02:30:48"

_shard: 4, _node: "G5EUD-YFTA2-ymrLpTCodA" has the most recent version of the package "last_update": "2011-12-13T02:37:33"
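
For anyone repeating this comparison, here is a minimal Python sketch of the check: group the hits of an explain=true response by type and ID, and report any document that comes back from more than one shard. It relies on the `_shard` field that shows up on each hit in the results above. The sample response, the `package` type name, and the ID `123` are made up for illustration and are not taken from the pastes.

```python
from collections import defaultdict


def find_duplicate_hits(response):
    """Return {(type, id): [shards]} for documents found on more than one shard."""
    shards_by_doc = defaultdict(set)
    for hit in response["hits"]["hits"]:
        key = (hit["_type"], hit["_id"])
        shards_by_doc[key].add(hit["_shard"])
    # Keep only documents that were returned by two or more distinct shards.
    return {
        key: sorted(shards)
        for key, shards in shards_by_doc.items()
        if len(shards) > 1
    }


# Hypothetical, heavily trimmed search response (explain=true adds _shard/_node).
sample = {
    "hits": {
        "hits": [
            {"_shard": 1, "_node": "node-a", "_type": "package", "_id": "123",
             "_source": {"last_update": "2011-12-07T03:12:17"}},
            {"_shard": 4, "_node": "node-b", "_type": "package", "_id": "123",
             "_source": {"last_update": "2011-12-13T02:37:33"}},
            {"_shard": 2, "_node": "node-a", "_type": "package", "_id": "456",
             "_source": {"last_update": "2011-12-13T02:37:40"}},
        ]
    }
}

for (doc_type, doc_id), shards in find_duplicate_hits(sample).items():
    print(f"{doc_type}/{doc_id} appears on shards {shards}")
```

A document that should be unique but shows up on two shard numbers, as here, points at routing rather than at replicas of a single shard drifting apart.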

Thoughts?

kimchy · December 13, 2011, 10:28pm

When did you start using elasticsearch (since which version)? Maybe it was
before a version that used the type when hashing (pre 0.13.0)? If so, any
later version running on the same data should have
cluster.routing.operation.use_type set to true in the settings; otherwise
you might get into this situation.
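
To illustrate why a mismatch in that setting could put the same document on two shards, here is a rough Python sketch. It assumes the shard is picked as a hash of the routing key modulo the number of shards, and that the key is either the ID alone or the type plus the ID. The hash (`zlib.crc32`), the shard count of 5, and the `package` type name are stand-ins chosen for the example, not Elasticsearch's actual hash function or this cluster's confirmed settings.

```python
import zlib

NUM_SHARDS = 5  # assumption for the example, not read from the cluster


def stand_in_hash(text):
    """Stand-in hash; Elasticsearch uses its own function internally."""
    return zlib.crc32(text.encode("utf-8"))


def shard_for(doc_type, doc_id, use_type, num_shards=NUM_SHARDS):
    """Pick a shard from the ID alone, or from type + ID when use_type is set."""
    key = doc_type + doc_id if use_type else doc_id
    return stand_in_hash(key) % num_shards


def ids_routed_differently(doc_type, doc_ids, num_shards=NUM_SHARDS):
    """List the IDs whose shard changes depending on whether the type is hashed."""
    return [
        doc_id
        for doc_id in doc_ids
        if shard_for(doc_type, doc_id, True, num_shards)
        != shard_for(doc_type, doc_id, False, num_shards)
    ]


doc_ids = [str(i) for i in range(200)]
moved = ids_routed_differently("package", doc_ids)
print(f"{len(moved)} of {len(doc_ids)} IDs land on a different shard")
```

If one copy of a document was written under one scheme and a later copy under the other, the second write would not overwrite the first, because the two would sit on different shards, and a search would return both.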
