Possible collision with child document _id generation?

Dan_Pilone · July 25, 2011, 8:29pm

All,
We're indexing millions of documents into a 3 (sometimes 4) node
Elasticsearch cluster with 3 shards. Starting with 0.17.1 (we didn't try
0.17.0) and continuing to today's source, I'm getting an occasional
exception during bulk indexing. I'm seeing this:

{"create":{"_index":"echo","_type":"measured_parameter","_id":"6RGi7lBOTd-H28o31svvfQ","error":"DocumentAlreadyExistsEngineException[[echo][2] [measured_parameter][6RGi7lBOTd-H28o31svvfQ]: document already exists]"}}

This is buried in with lots of successful creations. We have tens of
millions of documents already indexed and we're hitting the cluster
with three separate "indexers" which are generating the JSON. These
are child documents and I'm not sure how we're colliding on the IDs.
How are IDs generated? Could they result in a collision? Is there
something we can do to prevent it? Thanks -- Dan
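
For anyone trying to spot these failures among the successful creations, here is a minimal Python sketch that scans a bulk response for items carrying an "error" field. The per-item shape is taken from the error shown above; the top-level "items" array, the "ok" field on the successful item, and the function name failed_bulk_items are assumptions for illustration, not a confirmed description of the 0.17 response format.

```python
import json

def failed_bulk_items(response_text):
    """Return (action, _id, error) for every bulk item that reports an error."""
    response = json.loads(response_text)
    failures = []
    # Assumption: per-item results live under a top-level "items" array.
    for item in response.get("items", []):
        # Each item is a one-key object such as {"create": {...}}.
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result.get("_id"), result["error"]))
    return failures

# Sample response: the failing item mirrors the one in the post; the
# successful item (and its "ok" field) is made up for the example.
sample = json.dumps({"items": [
    {"create": {"_index": "echo", "_type": "measured_parameter",
                "_id": "example-ok-id", "ok": True}},
    {"create": {"_index": "echo", "_type": "measured_parameter",
                "_id": "6RGi7lBOTd-H28o31svvfQ",
                "error": "DocumentAlreadyExistsEngineException[[echo][2] "
                         "[measured_parameter][6RGi7lBOTd-H28o31svvfQ]: "
                         "document already exists]"}},
]})

for action, doc_id, error in failed_bulk_items(sample):
    print(action, doc_id, error)
```

Logging the failing IDs this way would at least show whether the same ID was also reported as successfully created earlier in the same batch.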

kikster · July 25, 2011, 8:45pm

I'm sure you've already thought of this but...couldn't you just explicitly set the ID of the document you want to index (if the children already have unique ids associated with them)?

curl -XPOST 'http://localhost:9200/echo/measured_parameter/3' -d .......
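
Since the failures here come from bulk requests, the same idea can be sketched for a bulk body. The Python below builds newline-delimited action/source pairs with an explicit _id taken from each record. The helper name bulk_create_lines and the field names param_id and value are hypothetical, and the parent reference that child documents also need is left out of this sketch.

```python
import json

def bulk_create_lines(index, doc_type, docs, id_field):
    """Build a bulk request body that sets _id explicitly for each document."""
    lines = []
    for doc in docs:
        # Action line: "create" fails if a document with this _id already exists.
        action = {"create": {"_index": index,
                             "_type": doc_type,
                             "_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body is newline-delimited and ends with a trailing newline.
    return "\n".join(lines) + "\n"

# Hypothetical records; "param_id" stands in for whatever unique key exists.
body = bulk_create_lines(
    "echo", "measured_parameter",
    [{"param_id": 3, "value": 1.5}, {"param_id": 4, "value": 2.5}],
    "param_id",
)
print(body, end="")
```

With IDs chosen by the indexers, a "document already exists" error can at least be traced back to a specific source record.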

kimchy · July 26, 2011, 5:51am

Heya,

Strange... the id is generated using a version 4, 128-bit UUID (basically,
Java UUID generation). There shouldn't be conflicts. Is it something
that you can reproduce each time? On a smaller scale?
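
To put a number on "there shouldn't be conflicts", here is a rough Python sketch. It generates a version 4 UUID and estimates the chance of any collision with the standard birthday approximation. The 22-character ID in the error message is consistent with 16 bytes encoded as URL-safe Base64 without padding, but that encoding is an assumption here, and the function names are made up for the example.

```python
import base64
import math
import uuid

def random_doc_id():
    """Version 4 UUID encoded as 22 URL-safe Base64 characters (assumed encoding)."""
    raw = uuid.uuid4().bytes  # 16 bytes, 122 of the 128 bits are random
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def collision_probability(n_ids, random_bits=122):
    """Birthday approximation: p ≈ 1 - exp(-n^2 / 2^(bits + 1))."""
    return -math.expm1(-(n_ids ** 2) / 2.0 ** (random_bits + 1))

print(random_doc_id())
print(collision_probability(1_000_000_000))
```

Even at a billion documents, the estimate stays far below one in a billion billion, so a genuine random collision is a very unlikely explanation as long as the random source behaves properly.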

-shay.banon


Clinton_Gormley · July 26, 2011, 8:20pm

Hiya

kimchy wrote:

Strange..., the id is generated using version 4 128bit UUID
(basically, Java UUID generation). There shouldn't be conflicts... .
Is it something that you can reproduce each time? On a smaller scale?

I haven't experienced this myself, but I've heard reports from two
people while bulk re-indexing millions of records from one index to
another.

The process suddenly dies with 'document already exists', but it can't
already exist, because the only thing writing to the new index is the
reindexing process.

My gut feeling is that this is a timing issue, and the bulk indexer in
ES ends up trying to index the same doc twice on one shard, by mistake.

clint

Dan_Pilone · July 26, 2011, 10:48pm

I don't think I can reproduce this on a small scale. We're indexing hundreds
of millions of docs as fast as we can push them and we see it maybe once
an hour. I hadn't ever seen it before we really ratcheted up the rate we
were indexing documents so I'm inclined to agree that it's a timing/race
kind of scenario. Interestingly, I think we only see it with child
documents, but that may be a red herring. Right now our ratio of child to
parent documents is about 40:1 so obviously we're far more likely to see it
there regardless of whether it could happen to any document. -- Dan

--
Dan Pilone
Managing Partner, Element 84 LLC
www.element84.com / dan@element84.com / 703-622-7370

