Considering bulk upserts from Hadoop [Hadoop]


(Val Vakar) #1

Hello Experts,

I'd really like to leverage bulk upserts from Hive. I could implement that
in elasticsearch-hadoop, but I need a good way to specify the id (see
https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/-RxDUq3zBjg).
That's totally doable, but it's also awkward since the last time we have
the deserialized object is in ESSerDe.serialize().

Has there been any thought on this on your side?

Thanks!
-Val

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Costin Leau) #2

Hi Val,

Can you expand on what you're trying to achieve? Some example of the data in Hive and ES would help.

Currently we don't recognize an ID on the target source since, in general, IDs in Hadoop are different from (and can
collide with) those in ES.
However, we might add an option to recognize it for folks who really want to use it - in the meantime, one can define a
mapping for it that points to the desired field loaded from Hadoop.

P.S. Note that Hadoop, and especially Hive, does not have the concept of an ID (as in a unique identifier). Your data set
can have duplicate entries with the same ID which are treated as separate by Hive but will obviously be merged/overwritten in ES.

Cheers,


--
Costin



(Val Vakar) #3

Hey Costin,

You're right that Hive doesn't have the concept of an ID. ES handles such
cases elegantly through the _id path mapping, so the client doesn't need to
understand IDs; I'd like that exact same concept for bulk loading
through Hive.

My goal is to run different processing streams (concurrently or days/weeks
apart) on Hadoop - and possibly elsewhere - that compute various parts of
the same document at their own pace. Whenever something is computed, it's
upserted into ES. It's hard to describe my exact business case, but as a
very contrived example let's say we're tracking sales data for electronic
devices: there's a data stream from brick-and-mortar stores, another stream
from web sales, etc. - we would like to accumulate all that under each
device's own document in ES, say:

{ "device": "iPhone",
  "sales_data": {
    "January":  { "brick_and_mortar": 305, "website": 298 },
    "February": { "brick_and_mortar": 225, "website": 168 },
    "March": ... } }

Basically, it's in the upsert sweet spot.

What I'm thinking of is - say we define the table using:

TBLPROPERTIES(
    'es.host' = 'myhost',
    'es.resource' = 'myindex/mytype',
    'es.insert.strategy' = 'upsert',
    'es.id.path' = 'page_id'
)

where es.insert.strategy = [index]/upsert and es.id.path is the path to the
id property (much like the server-side _id path mapping).
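For illustration, resolving such an id path against a deserialized record could look like this (a hedged sketch; resolve_id_path is a hypothetical helper, not part of elasticsearch-hadoop):

```python
# Hypothetical helper: resolve an 'es.id.path' value (e.g. "page_id", or a
# dotted path like "meta.page_id") against a deserialized record, the way
# the server-side _id path mapping walks the document.
def resolve_id_path(doc, path):
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None  # no id found -> caller falls back to auto-generated ids
        node = node[key]
    return str(node)

record = {"meta": {"page_id": 42}, "body": "..."}
resolve_id_path(record, "meta.page_id")   # -> "42"
resolve_id_path(record, "missing.field")  # -> None
```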

This would make it possible to construct the REST request with
doc_as_upsert semantics - but since we need that id mapping, I realize this
feels pretty awkward to do in ESSerDe and elsewhere.

What do you think?

Thanks,
-Val



(Costin Leau) #4

Awareness of the id is something we're planning to address after releasing the current version of
elasticsearch-hadoop.
This means there will be three main strategies for sending a document to ES:

  1. create (put-if-absent) - applies also when there's no id
  2. index (op_type=index) - create the document if missing, replace it if present
  3. upsert - create the document if missing, _update it if present
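As an illustrative sketch (the strategy names and their mapping to bulk op types are assumptions about the eventual implementation, not the connector's final API), the three strategies could translate into bulk-API action lines roughly like this:

```python
import json

# Sketch: how each write strategy could map onto a bulk-API action line.
# The strategy names mirror the list above; the mapping is an assumption.
def action_line(strategy, index, doc_type, doc_id=None):
    meta = {"_index": index, "_type": doc_type}
    if doc_id is not None:
        meta["_id"] = doc_id
    if strategy == "create":   # 1. put-if-absent; rejected if the id exists
        return json.dumps({"create": meta})
    if strategy == "index":    # 2. create if missing, replace if present
        return json.dumps({"index": meta})
    if strategy == "upsert":   # 3. create if missing, merge via _update
        return json.dumps({"update": meta})
    raise ValueError("unknown strategy: %s" % strategy)

action_line("upsert", "myindex", "mytype", "iPhone")
```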

If my understanding is correct, you're looking for option 3 - to update/merge the document rather than replace it (i.e. upsert).

Let me know if I've missed/misunderstood anything.

Thanks,

P.S. We're also planning to add parent/child support - which is also based on id awareness.


--
Costin



(Val Vakar) #5

Yes, you're right - I would like #3. It's awesome that you guys are working
on that.

For my project it would be very useful to know:

  • What might your solution look like (client-side / server-side / etc)?
  • When do you expect to release it?

Thanks a lot!
-Val



(Costin Leau) #6

You can track the progress through the GitHub tracker [1] - I've cleaned it up so you can see how issues are
assigned per milestone. In short, the plan is to wrap up the current release (working on the docs now) and push it out
as soon as possible - probably by early next week.

The second milestone should follow shortly after, probably in about 1.5 months. I'm talking about the actual release date -
the features will obviously be available in master (and through the nightly builds) long before that.
For your use case, you might be interested in issue #69 [2] - feel free to add context to it, or open another issue for
Hive if you'd like; the more info there is, the better we can define it.

Thanks,

[1] https://github.com/elasticsearch/elasticsearch-hadoop/issues/milestones
[2] https://github.com/elasticsearch/elasticsearch-hadoop/issues/69


--
Costin



(Val Vakar) #7

Excellent! You rock.



(Costin Leau) #8

A quick update - as mentioned on the initial issue [1], master has been updated to support not just the update and
create operations (in addition to index) but also parent/child relationships, timestamp, version, routing, etc.

Please see [1] for more information - feedback is welcome!

[1] https://github.com/elasticsearch/elasticsearch-hadoop/pull/83


--
Costin


