Possible collision with child document _id generation?


(Dan Pilone) #1

All,
We're indexing millions of documents into a 3 (sometimes 4) node
Elasticsearch cluster with 3 shards. Starting with 0.17.1 (we didn't try
0.17.0) and continuing to today's source, I'm getting an occasional
exception while bulk indexing. I'm seeing this:

{"create":
  {"_index":"echo","_type":"measured_parameter",
   "_id":"6RGi7lBOTd-H28o31svvfQ",
   "error":"DocumentAlreadyExistsEngineException[[echo][2] [measured_parameter][6RGi7lBOTd-H28o31svvfQ]: document already exists]"}}

This is buried in with lots of successful creations. We have tens of
millions of documents already indexed, and we're hitting the cluster
with three separate "indexers" that generate the JSON. These are child
documents, and I'm not sure how we're colliding on the IDs. How are IDs
generated? Could they result in a collision? Is there something we can
do to prevent it? Thanks -- Dan
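Since the failures are buried among many successful creations, a small helper that pulls only the failed items out of a bulk response may make them easier to track. This is a minimal sketch, not part of the original post: it assumes the response has been parsed into a dict with an "items" list shaped like the error above, and the helper name and the "example-id-1" entry are made up for illustration.

```python
def failed_items(bulk_response):
    """Return (action, id, error) for every bulk item that carries an 'error' field."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a one-key dict such as {"create": {...}}.
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result.get("_id"), result["error"]))
    return failures


# Hypothetical parsed response: one successful create, one failure.
sample_response = {
    "items": [
        {"create": {"_index": "echo", "_type": "measured_parameter",
                    "_id": "example-id-1", "ok": True}},
        {"create": {"_index": "echo", "_type": "measured_parameter",
                    "_id": "6RGi7lBOTd-H28o31svvfQ",
                    "error": "DocumentAlreadyExistsEngineException[[echo][2] "
                             "[measured_parameter][6RGi7lBOTd-H28o31svvfQ]: "
                             "document already exists]"}},
    ]
}

failures = failed_items(sample_response)
for action, doc_id, error in failures:
    print(action, doc_id, error)
```

Logging the failed IDs together with a timestamp and the indexer that sent them would help show whether the same ID was really submitted twice.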


(kikster) #2

I'm sure you've already thought of this but...couldn't you just explicitly set the ID of the document you want to index (if the children already have unique ids associated with them)?

curl -XPOST 'http://localhost:9200/echo/measured_parameter/3' -d .......
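For bulk indexing, the same idea could look roughly like the sketch below, which builds a newline-delimited bulk body with an explicit ID on every action line. It assumes the bulk action line accepts "_parent" alongside "_id" in your version; the function name, the parent ID and the document field are hypothetical.

```python
import json


def bulk_create_body(index, doc_type, children):
    """Build a newline-delimited bulk body with explicit _id values.

    `children` is an iterable of (doc_id, parent_id, source_dict) tuples.
    """
    lines = []
    for doc_id, parent_id, source in children:
        # Action line: the ID is supplied by the client, not generated by the server.
        meta = {"create": {"_index": index, "_type": doc_type,
                           "_id": doc_id, "_parent": parent_id}}
        lines.append(json.dumps(meta))
        lines.append(json.dumps(source))
    # Every line of a bulk body, including the last, ends with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical child document with ID "3" under parent "parent-1".
body = bulk_create_body("echo", "measured_parameter",
                        [("3", "parent-1", {"name": "temperature"})])
print(body)
```

With client-supplied IDs, a "document already exists" error on create would point at a real duplicate submission rather than at ID generation.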


(Shay Banon) #3

Heya,

Strange... The ID is generated as a version 4 (random) 128-bit UUID
(basically, Java UUID generation). There shouldn't be conflicts. Is it
something that you can reproduce each time? On a smaller scale?

-shay.banon
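To put a number on "there shouldn't be conflicts", here is a rough sketch. It is not Elasticsearch's code: it assumes the ID is the URL-safe base64 form of a random UUID with padding stripped (which is consistent with the 22-character ID in the error), and it uses the standard birthday-bound approximation for the collision probability.

```python
import base64
import uuid


def random_doc_id():
    """URL-safe base64 of a random (version 4) UUID, padding stripped: 22 chars."""
    raw = uuid.uuid4().bytes  # 16 bytes; 122 of the 128 bits are random
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def collision_probability(n_ids, random_bits=122):
    """Birthday-bound approximation p ~ n^2 / (2 * 2^bits), valid while p is small."""
    return n_ids ** 2 / (2 * 2 ** random_bits)


doc_id = random_doc_id()
p = collision_probability(100_000_000)  # 100 million generated IDs
print(doc_id, p)
```

For 100 million IDs the estimate is on the order of 1e-21, so seeing a duplicate about once an hour would not be explained by random UUID collisions, provided the random source behaves properly.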

On Mon, Jul 25, 2011 at 11:29 PM, Dan Pilone dan@element84.com wrote:



(Clinton Gormley) #4

Hiya

> Strange..., the id is generated using version 4 128bit UUID
> (basically, Java UUID generation). There shouldn't be conflicts... .
> Is it something that you can reproduce each time? On a smaller scale?

I haven't experienced this myself, but I've heard reports from two
people while bulk re-indexing millions of records from one index to
another.

The process suddenly dies with 'document already exists', but it can't
already exist, because the only thing writing to the new index is the
reindexing process.

My gut feeling is that this is a timing issue, and the bulk indexer in
ES ends up trying to index the same doc twice on one shard, by mistake.

clint


(Dan Pilone) #5

I don't think I can reproduce this on a small scale. We're indexing hundreds
of millions of docs as fast as we can push them, and we see it maybe once
an hour. I had never seen it before we really ratcheted up our indexing
rate, so I'm inclined to agree that it's a timing/race kind of scenario.
Interestingly, I think we only see it with child documents, but that may be
a red herring. Right now our ratio of child to parent documents is about
40:1, so obviously we're far more likely to see it there regardless of
whether it could happen to any document. -- Dan
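As a quick check on the red-herring point, here is a back-of-the-envelope sketch. It assumes the failure hits documents uniformly at random, which is only an assumption, and the count of 20 observed failures is made up for illustration.

```python
children_per_parent = 40  # ratio of child to parent documents from the post

# Chance that a uniformly random document is a child document.
p_child = children_per_parent / (children_per_parent + 1)

# Chance that 20 independent failures (hypothetical count) all land on children.
observed_failures = 20
p_all_children = p_child ** observed_failures

print(round(p_child, 4), round(p_all_children, 2))
```

Under that assumption about 97.6% of failures would land on child documents, and 20 failures in a row all being children would still happen about 61% of the time, so seeing it only on children so far says little by itself.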

--
Dan Pilone
Managing Partner, Element 84 LLC
www.element84.com / dan@element84.com / 703-622-7370

On Tue, Jul 26, 2011 at 4:20 PM, Clinton Gormley clinton@iannounce.co.uk wrote:


