Update


#1

Any thoughts on supporting an "update" feature in ES?

We need to update a number of documents. Rather than rebuild each ES
document from its source and re-index it, we'd rather read the data back
from ES (since it already has all the data), modify it, and reindex. That
would be much snappier, as re-assembling the ES document is a bit costly
for us.

What would be one step better is calling update with a query/filter and
passing, say, a JS function to do our update, and having ES execute it and
re-index everything under the hood. That way we avoid writing a bunch of
code to scroll through large result sets and batch the indexing operations,
and we avoid shipping all the data back to the client and then back to the
ES node(s).

Thoughts?
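
To make the request concrete, here is a minimal Python sketch of the read, modify, and re-index loop described above. It runs against a plain in-memory dict rather than a real cluster, so `scroll`, `bulk_index`, and `update_by_query` are hypothetical helper names chosen for illustration, not ES API calls.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Doc = Dict[str, int]


def scroll(index: Dict[str, Doc], matches: Callable[[Doc], bool],
           page_size: int) -> Iterator[List[Tuple[str, Doc]]]:
    """Yield matching (id, doc) pairs one page at a time (hypothetical helper)."""
    # Snapshot the matching ids first so that writes made while paging
    # do not change which documents are visited.
    ids = [doc_id for doc_id, doc in index.items() if matches(doc)]
    for start in range(0, len(ids), page_size):
        yield [(doc_id, dict(index[doc_id])) for doc_id in ids[start:start + page_size]]


def bulk_index(index: Dict[str, Doc], batch: List[Tuple[str, Doc]]) -> None:
    """Write a batch of documents back, replacing the stored versions."""
    for doc_id, doc in batch:
        index[doc_id] = doc


def update_by_query(index: Dict[str, Doc], matches: Callable[[Doc], bool],
                    update: Callable[[Doc], Doc], page_size: int = 2) -> int:
    """Read matching docs, apply `update` to each, and re-index them in batches."""
    updated = 0
    for page in scroll(index, matches, page_size):
        batch = [(doc_id, update(doc)) for doc_id, doc in page]
        bulk_index(index, batch)
        updated += len(batch)
    return updated


# Toy data standing in for an index; field names are made up for the example.
index = {
    "1": {"qty": 5},
    "2": {"qty": 0},
    "3": {"qty": 7},
}
count = update_by_query(
    index,
    matches=lambda d: d["qty"] > 0,
    update=lambda d: {**d, "qty": d["qty"] - 1},
)
```

With a real cluster, every page and every batch in this loop is a network round trip between client and nodes, which is the cost the post is asking to avoid.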


(Shay Banon) #2

Hi,

Yes, that certainly makes sense. The difficulty of handling this revolves
around the distributed nature (more specifically, replication) of
operations. There are different ways to implement it:

  1. Have the update js function run on the primary shard, and then batch
    the changes using a mechanism similar to the batch API. This means
    replicating the data to the replicas. This is simpler to implement, though
    the update is not atomic or blocking (other index operations on the same
    data might "get in").

    This does require refreshing the relevant shard(s) before execution in
    order to see the latest data.

  2. Have the update function run on the primary and the replicas. This is
    more efficient because the data does not need to be transferred to the
    replicas, but the query will be executed on all replicas, and it's much
    harder to maintain consistency between a shard and its replicas in this
    case (this must be maintained, of course).
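
The trade-off between the two options can be sketched in a few lines of Python. This is only a toy model under stated assumptions: shard copies are plain dicts, and `approach_1` and `approach_2` are hypothetical names, not ES internals. Each function returns the number of documents that would have to be shipped to each replica.

```python
import copy
from typing import Callable, Dict, List

Doc = Dict[str, int]
Shard = Dict[str, Doc]


def approach_1(primary: Shard, replicas: List[Shard],
               matches: Callable[[Doc], bool], update: Callable[[Doc], Doc]) -> int:
    """Run the update on the primary only, then ship the changed docs to replicas."""
    changed = {doc_id: update(doc) for doc_id, doc in primary.items() if matches(doc)}
    primary.update(changed)
    for replica in replicas:
        # Full documents are copied to each replica, as in a batched index request.
        replica.update(copy.deepcopy(changed))
    return len(changed)


def approach_2(primary: Shard, replicas: List[Shard],
               matches: Callable[[Doc], bool], update: Callable[[Doc], Doc]) -> int:
    """Run the query and the update independently on every copy of the shard."""
    for shard in [primary, *replicas]:
        for doc_id, doc in list(shard.items()):
            if matches(doc):
                shard[doc_id] = update(doc)
    return 0  # only the function travels; no documents are transferred


def make_copies():
    """Build a primary and one identical replica (toy data)."""
    primary = {"1": {"views": 10}, "2": {"views": 0}}
    return primary, [copy.deepcopy(primary)]


matches = lambda d: d["views"] > 0
bump = lambda d: {**d, "views": d["views"] + 1}

p1, r1 = make_copies()
sent_1 = approach_1(p1, r1, matches, bump)

p2, r2 = make_copies()
sent_2 = approach_2(p2, r2, matches, bump)
```

When nothing goes wrong, both approaches leave the primary and the replica identical. The difference is that option 1 moves documents across the network, while option 2 depends on every copy seeing the same data and running the same function successfully.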

aparo has been talking about it as well (on IRC), and even went ahead and
implemented a proof of concept.

-shay.banon

On Thu, Nov 11, 2010 at 8:04 PM, Mooky nick.minutello@gmail.com wrote:


#3

Cool

On 12 November 2010 08:13, Shay Banon shay.banon@elasticsearch.com wrote:


(Alberto Paro-2) #4

On 12/nov/2010, at 19.36, Nick Minutello wrote:

Cool

I've implemented an update for ES for one of my clients, following approach 2. The problem, as Shay said, is the consistency of a shard and its replicas.
There is a high risk of ending up with broken/corrupted data in your index if there are problems during the update (verified in some of my edge cases).
So now I'm working on implementing the parent/children approach that Shay suggested to me.
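
As an illustration of how such divergence could arise with approach 2, here is a small Python sketch. It is a toy model, not a reproduction of the cases mentioned above: shard copies are plain dicts, and `run_update` with its `fail_after` parameter is a hypothetical helper that simulates a failure part-way through the update on one copy.

```python
import copy
from typing import Callable, Dict, Optional

Doc = Dict[str, int]
Shard = Dict[str, Doc]


def run_update(shard: Shard, matches: Callable[[Doc], bool],
               update: Callable[[Doc], Doc], fail_after: Optional[int] = None) -> int:
    """Apply `update` to matching docs in place; optionally fail part-way through."""
    done = 0
    for doc_id in sorted(shard):
        if not matches(shard[doc_id]):
            continue
        if fail_after is not None and done == fail_after:
            raise RuntimeError("simulated failure during update")
        shard[doc_id] = update(shard[doc_id])
        done += 1
    return done


primary = {"1": {"views": 1}, "2": {"views": 2}, "3": {"views": 3}}
replica = copy.deepcopy(primary)
matches = lambda d: True
bump = lambda d: {**d, "views": d["views"] + 1}

run_update(primary, matches, bump)  # completes on the primary
try:
    run_update(replica, matches, bump, fail_after=1)  # replica stops after one doc
except RuntimeError:
    pass

# Ids whose content now differs between the two copies.
diverged = sorted(k for k in primary if primary[k] != replica[k])
```

Because the update here is not idempotent, simply re-running it on the replica would not repair things: document "1" would be incremented a second time.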

Hi,
Alberto Paro


(Shay Banon) #5

Hi Alberto,

I am going to work on parent/child after 0.13... :) No promises, though...

On Tue, Nov 16, 2010 at 10:15 AM, Alberto Paro alberto.paro@gmail.com wrote:

