How to properly bulk index while defining a custom id?

Shane_Witbeck · October 24, 2011, 5:57pm

I've noticed different behavior when bulk indexing using a custom
document id versus not defining a document id. By not defining an id,
I get the desired behavior, which is that all documents are indexed. If I
attempt to define an id, only one document gets indexed as opposed to
all the documents defined in a bulk iteration.

https://gist.github.com/digitalsanctum/1309649

gistfile1.java

    @Override
    public int reIndexPosts(int minPostID, int maxPostID) {

        if (minPostID < 1 || maxPostID < 1 || (minPostID > maxPostID)) {
            log.warn("invalid args, skipping re-indexing of posts; minPostID=" + minPostID + ", maxPostID=" + maxPostID);
            return 0;
        }

        List<Post> posts = getPosts(minPostID, maxPostID);
        if (posts == null || posts.isEmpty()) return 0;

This file has been truncated.

How do you properly index all documents in a bulk request while
defining a custom document id?

Thanks,
Shane

kimchy · October 24, 2011, 10:34pm

What is the failure that you get? You should also see it in the logs.

Shane_Witbeck · October 24, 2011, 11:01pm

Thanks for the reply. I see no exceptions or errors in the logs. One
other thing I noticed is that the counts (in elasticsearch-head) are
something like:

docs: {
    num_docs: 3
    max_doc: 2196
    deleted_docs: 2193
}

which seems to indicate that all but one of the docs are getting
deleted for each bulk iteration.

Any additional guidance is appreciated.

Shane

Clinton_Gormley · October 25, 2011, 7:19am

You don't provide the code for getPosts or getPostID, but I would suspect
that you are reusing the same ID over and over again.

Check the _version of the single doc that you manage to index. I bet it
is high (when it should be 1).

clint
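
To illustrate the suspicion above, here is a minimal sketch in plain Java. It is a toy model, not the Elasticsearch client API: `ToyIndex`, its methods, and the id value 42 are hypothetical names made up for this example. It assumes only what the thread describes: indexing a document under an id that already exists replaces the old document, bumps its version, and leaves the old copy counted as deleted until it is cleaned up.

```java
import java.util.HashMap;
import java.util.Map;

public class BulkIdDemo {

    /** Toy stand-in for an index. This is NOT the Elasticsearch API. */
    static final class ToyIndex {
        private final Map<String, String> sourceById = new HashMap<>();
        private final Map<String, Integer> versionById = new HashMap<>();
        private int writes = 0; // every write takes a slot, loosely like max_doc

        /** Index a document; an existing id is overwritten and its version bumped. */
        void index(String id, String source) {
            sourceById.put(id, source);
            versionById.merge(id, 1, Integer::sum);
            writes++;
        }

        int numDocs()     { return sourceById.size(); }          // live documents
        int maxDoc()      { return writes; }                     // live + replaced
        int deletedDocs() { return writes - sourceById.size(); } // replaced copies
        int version(String id) { return versionById.getOrDefault(id, 0); }
    }

    public static void main(String[] args) {
        // Suspected bug: the same id is reused for every document in the batch.
        ToyIndex buggy = new ToyIndex();
        int sameId = 42; // hypothetical id computed once, outside the loop
        for (int i = 0; i < 5; i++) {
            buggy.index(String.valueOf(sameId), "post " + i);
        }
        System.out.println("buggy: num_docs=" + buggy.numDocs()
                + " max_doc=" + buggy.maxDoc()
                + " deleted_docs=" + buggy.deletedDocs()
                + " version=" + buggy.version("42"));

        // Expected usage: each document carries its own distinct id.
        ToyIndex fixed = new ToyIndex();
        for (int i = 0; i < 5; i++) {
            fixed.index(String.valueOf(100 + i), "post " + i);
        }
        System.out.println("fixed: num_docs=" + fixed.numDocs()
                + " max_doc=" + fixed.maxDoc()
                + " deleted_docs=" + fixed.deletedDocs()
                + " version=" + fixed.version("100"));
    }
}
```

In this model the buggy loop ends with one live document at version 5 and four replaced copies, while the fixed loop ends with five live documents, each at version 1. That is the same shape as the stats reported earlier in the thread, where num_docs plus deleted_docs equals max_doc (3 + 2193 = 2196).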