Considering bulk upserts from Hadoop [Hadoop]


(Val Vakar) #1

Hello Experts,

I'd really like to leverage bulk upserts from Hive. I could implement that
in elasticsearch-hadoop, but I need a good way to specify the id (see
https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/-RxDUq3zBjg).
That's totally doable, but it's also awkward since the last time we have
the deserialized object is in ESSerDe.serialize().

Has there been any thought on this on your side?

Thanks!
-Val

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Costin Leau) #2

Hi Val,

Can you expand on what you're trying to achieve? Some example of the data in Hive and ES would help.

Currently we don't recognize an ID on the target source since, in general, IDs in Hadoop are different from (and can
collide with) those in ES.
However, we might add an option to recognize it for folks who really want to use it - in the meantime, one can define a
mapping for it that points to the desired field loaded from Hadoop.

P.S. Note that Hadoop, and especially Hive, does not have the concept of an ID (as in a unique identifier). Your data set
can have duplicate entries with the same ID which are treated as separate by Hive but will obviously be merged/overwritten in ES.

Cheers,


--
Costin



(Val Vakar) #3

Hey Costin,

You're right that Hive doesn't have the concept of an ID. ES handles such
cases elegantly through the _id path mapping, so the client doesn't need to
understand IDs; I'd like that exact same concept for bulk loading
through Hive.

My goal is to run different processing streams (concurrently or days/weeks
apart) on Hadoop - and possibly elsewhere - that compute various parts of
the same document at their own pace. Whenever something is computed, it's
upserted into ES. It's hard to describe my exact business case, but as a
very contrived example let's say we're tracking sales data for electronic
devices: there's a data stream from brick-and-mortar stores, another stream
from web sales, etc. - we would like to accumulate all that under each
device's own document in ES, say:

{ "device": "iPhone",
  "sales_data": {
    "January":  { "brick_and_mortar": 305, "website": 298 },
    "February": { "brick_and_mortar": 225, "website": 168 },
    "March": ... } }

Basically, it's in the upsert sweet spot.

What I'm thinking of is - say we define the table using:

TBLPROPERTIES(
    'es.host' = 'myhost',
    'es.resource' = 'myindex/mytype',
    'es.insert.strategy' = 'upsert',
    'es.id.path' = 'page_id'
)

where es.insert.strategy = [index]/upsert and es.id.path is the path to the
id property (much like the server-side _id path mapping).
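For illustration, resolving such an id path against a deserialized record could look like this (a hedged sketch; resolve_id_path is a hypothetical helper, not part of elasticsearch-hadoop):

```python
# Hypothetical helper: resolve an 'es.id.path' value (e.g. "page_id", or a
# dotted path like "meta.page_id") against a deserialized record, the way
# the server-side _id path mapping walks the document.
def resolve_id_path(doc, path):
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None  # no id found -> caller falls back to auto-generated ids
        node = node[key]
    return str(node)

record = {"meta": {"page_id": 42}, "body": "..."}
resolve_id_path(record, "meta.page_id")   # -> "42"
resolve_id_path(record, "missing.field")  # -> None
```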

This would make it possible to construct the REST request with
doc_as_upsert semantics - but since we need that id mapping, I realize this
feels pretty awkward to do in ESSerDe and elsewhere.

What do you think?

Thanks,
-Val



(Costin Leau) #4

Awareness of the id is something we're planning to address after releasing the current version of
elasticsearch-hadoop.
This means there will be three main strategies for sending a document to ES:

  1. create (put-if-absent) - applies also when there's no id
  2. index (op_type=index) - create the document if missing, replace it if present
  3. upsert - create the document if missing, _update it if present
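As an illustrative sketch (the strategy names and their mapping to bulk op types are assumptions about the eventual implementation, not the connector's final API), the three strategies could translate into bulk-API action lines roughly like this:

```python
import json

# Sketch: how each write strategy could map onto a bulk-API action line.
# The strategy names mirror the list above; the mapping is an assumption.
def action_line(strategy, index, doc_type, doc_id=None):
    meta = {"_index": index, "_type": doc_type}
    if doc_id is not None:
        meta["_id"] = doc_id
    if strategy == "create":   # 1. put-if-absent; rejected if the id exists
        return json.dumps({"create": meta})
    if strategy == "index":    # 2. create if missing, replace if present
        return json.dumps({"index": meta})
    if strategy == "upsert":   # 3. create if missing, merge via _update
        return json.dumps({"update": meta})
    raise ValueError("unknown strategy: %s" % strategy)

action_line("upsert", "myindex", "mytype", "iPhone")
```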

If my understanding is correct, you're looking for option 3 - to update/merge the document rather than replace it (i.e. upsert).

Let me know if I've missed/misunderstood anything.

Thanks,

P.S. We're also planning to add parent/child support - which is also based on id awareness.


--
Costin



(Val Vakar) #5

Yes, you're right - I would like #3. It's awesome that you guys are working
on that.

For my project it would be very useful to know:

  • What might your solution look like (client-side / server-side / etc)?
  • When do you expect to release it?

Thanks a lot!
-Val



(Costin Leau) #6

You can track the progress through the GitHub tracker [1] - I've cleaned it up so you can see how issues are
assigned per milestone. In short, the plan is to wrap up the current release (working on the docs now) and push it out
as soon as possible - probably by early next week.

The second milestone should follow shortly after, probably in about 1.5 months. I'm talking about the actual release date -
the features will obviously be available in master (and through the nightly builds) long before that.
For your use case, you might be interested in issue #69 [2] - feel free to add context to it, or open another issue for
Hive if you'd like; the more info there is, the better we can define it.

Thanks,

[1] https://github.com/elasticsearch/elasticsearch-hadoop/issues/milestones
[2] https://github.com/elasticsearch/elasticsearch-hadoop/issues/69


--
Costin



(Val Vakar) #7

Excellent! You rock.



(Costin Leau) #8

A quick update - as mentioned on the initial issue [1], master has been updated to support not just the update and
create operations (in addition to index) but also parent/child relationships, timestamp, version, routing, etc.

Please see [1] for more information - feedback is welcome!

[1] https://github.com/elasticsearch/elasticsearch-hadoop/pull/83


--
Costin


