Bulk Insertion and Aliasing

Hi guys,

I need to do a bulk insert into a cluster, but I want to use aliases. Is
there any way? I'll be using aliases to route all of a user's data to the
same shard when inserting - I think that's the most important thing to be
able to do when I do a bulk insert. I noticed there's a routing option on
the bulk insert; would this be the way to go? Would you just perform your
bulk insert, creating a routing value out of, say, the user-id, and then
elsewhere in your application, where you do a single insert / search, use
an alias that routes on the user-id and filters on the user-id?

Cheers,
James

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi James

> I need to do a bulk insert into a cluster but I want to use aliases.
> Is there any way?

Yes, you can use an alias instead of the index name. But for indexing
purposes, your alias should only point to one index. For searching, you
can use a different alias which points to multiple indices.

> I'll be using aliases to route all of a user's data to the same shard
> when inserting - I think that's the most important thing to be able to
> do when I do a bulk insert. I noticed there's a routing option on the
> bulk insert, would this be the way to go? Would you just perform your
> bulk insert creating a route out of say the user-id and then elsewhere
> in your application where you do a single insert / search use an alias
> that routed to the user-id and filtered on the user-id?

You can do that, or you can specify a routing and a filter when you
create an alias, then you don't need to worry about having to specify it
each time.
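
As a rough sketch of the alias approach, the body for the `_aliases` endpoint could be built as below. The index name `my_index`, the `user_<id>` alias naming, the `user_id` field and the helper name `user_alias_action` are all assumptions made for illustration; only the `add` action with its `index`, `alias`, `routing` and `filter` keys comes from the aliases API itself.

```python
import json


def user_alias_action(index, user_id):
    """Build one 'add' action for a per-user alias (hypothetical helper).

    The alias carries both a routing value and a filter, so requests that
    go through the alias do not need to repeat either of them.
    """
    routing = "user_%s" % user_id  # assumed naming scheme
    return {
        "add": {
            "index": index,
            "alias": routing,
            "routing": routing,
            # Assumes documents store the owner in a "user_id" field.
            "filter": {"term": {"user_id": user_id}},
        }
    }


# Request body to POST to the /_aliases endpoint.
body = json.dumps({"actions": [user_alias_action("my_index", 1)]})
print(body)
```

Indexing or searching through the alias `user_1` would then apply the routing and the filter without the application passing them each time.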

Have a look at
http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html
and
http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html

clint


Hey, cheers for the reply. I think what I'm having trouble with at the
moment is how to use the correct routing when performing a bulk insert:
the batch will have 100,000 records in it, all from different users.
Ultimately, I want to use routing so that one user's documents are all on
the same shard (a la the user data flow discussed in that video you
referenced).

Any ideas if that's possible?

Regards,
James


On Tue, 2013-02-05 at 13:05 +0000, James Lewis wrote:

> Hey cheers for the reply, I think what I'm having trouble with at the
> moment is how to use the correct routing when performing a bulk insert
> - so the batch will have 100,000 records in it all from different
> users. Ultimately, I want to use routing so that one user's documents
> are all on the same shard (a la the user data flow discussed in that
> video you referenced).
>
> Any ideas if that's possible?

Yes, that's possible. Either by specifying the routing manually or by
using aliases with routing built in (as explained in that video).

clint


OK - let's say I create 100,000 JSON documents from some data stored in an
SQL database. The documents are in no way grouped, so this could be an
entire batch of a single user's data, 1 document each for 100,000
different users, or any combination in between. There's no way of knowing.
When I insert the 100,000 documents I want to make sure that each user's
data is routed correctly - so a user's data will go to 1 shard using a
routing based on their id. Every batch of 100,000 documents will be as
different as the last.

I understand that if I insert each document separately I can determine what
the user id is and use an alias which contains a routing based on the id.
However, I want to use bulk insert. So what I was looking for was how I
might perform the routing (or aliasing) without needing to group my bulk
insertions together by user-id or something.

In the ES API docs it says you can add the _routing field per bulk item.
So I guess the answer is to do a bit of processing on the client to make
sure that for each bulk item, the _routing field is set correctly. I was
having trouble understanding how to do this because it's currently
unsupported by my .NET client. Also, the bulk API docs say nothing about
aliases - I guess these should be created after a bulk backfill or on the
fly as new documents are added.

Cheers!
James


On Tue, 2013-02-05 at 14:21 +0000, James Lewis wrote:

> OK - let's say I create 100,000 JSON documents from some data stored in
> an SQL database. The documents are in no way grouped so this could
> either be an entire batch of a single user's data, 1 document for
> 100,000 different users or any combination in between. There's no way
> of knowing. When I insert the 100,000 documents I want to make sure
> that each user's data is routed correctly - so a user's data will go to
> 1 shard using a routing based on their id. Every batch of 100,000
> documents will be as different as the last.

curl -XPOST 'http://127.0.0.1:9200/my_index/couchbaseDocument/_bulk?pretty=1' -d '
{"index" : {"_id" : 1, "_routing" : "user_1", "_type" : "user"}}
{"foo" : "bar"}
{"index" : {"_routing" : "user_1", "_type" : "comment"}}
{"user_id" : 1, "foo" : "bar"}
'
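
For a batch where users are mixed in arbitrary order, the action lines can be generated on the client, one per document. Below is a minimal sketch in Python, assuming each record is a dict carrying a `user_id` field and that the routing value follows the `user_<id>` pattern used in the curl request above. `routing_key` and `build_bulk_body` are hypothetical helper names, not part of any client library, and the `_routing` / `_type` metadata names are the ones used in this thread (other Elasticsearch versions may spell them differently).

```python
import json


def routing_key(user_id):
    """Assumed routing scheme: 'user_<id>', as in the curl example."""
    return "user_%s" % user_id


def build_bulk_body(records, doc_type="comment"):
    """Turn ungrouped records into a newline-delimited bulk request body.

    Each record gets its own action line whose _routing is derived from
    that record's user_id, so the batch never has to be grouped by user.
    """
    lines = []
    for record in records:
        meta = {"_type": doc_type, "_routing": routing_key(record["user_id"])}
        lines.append(json.dumps({"index": meta}))  # action line
        lines.append(json.dumps(record))           # document source line
    # The bulk body is one JSON object per line, ending with a newline.
    return "\n".join(lines) + "\n"


# Records in arbitrary order, from a mix of users.
records = [
    {"user_id": 1, "foo": "bar"},
    {"user_id": 2, "foo": "baz"},
    {"user_id": 1, "foo": "qux"},
]
body = build_bulk_body(records)
print(body)
```

The resulting string can be sent as the body of the `_bulk` request. Because every action line carries its own routing value, both documents for user 1 are routed with `user_1` even though another user's document sits between them.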

clint
