Any thoughts on supporting an "update" feature in ES?
We have a need to update a number of documents - and rather than
rebuild the ES document and re-index, we'd rather read the data from
ES (since it has all the data), modify it, and re-index (this will be
much snappier, as re-assembling the ES document is a bit costly for us).
What would be one step better is calling update with a query/filter and
passing, say, a JS function to do our update, and having ES execute it
and re-index all under the hood. That way we avoid having to write a
bunch of code for scrolling through large result sets and batching
indexing operations, and we avoid shipping all the data back to the
client and then back to the ES node(s).
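The requested semantics might look roughly like the following in-memory sketch. Everything here (the `update_by_query` name, the `matches` filter, the update function) is hypothetical, illustrating the requested behavior rather than any ES API that existed at the time:

```python
# In-memory sketch of the proposed "update with query/filter + function".
# Server-side, this would find the matching docs, run the function, and
# re-index the results, without round-tripping data to the client.
# All names are hypothetical, not a real Elasticsearch API.

def update_by_query(index, matches, update_fn):
    """Apply update_fn to every document matching the filter, in place."""
    for doc_id, doc in index.items():
        if matches(doc):
            index[doc_id] = update_fn(doc)

index = {
    "1": {"product": "widget", "quantity": 3},
    "2": {"product": "gadget", "quantity": 7},
}

# e.g. "increment the quantity on every widget"
update_by_query(
    index,
    matches=lambda d: d["product"] == "widget",
    update_fn=lambda d: {**d, "quantity": d["quantity"] + 1},
)
```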
Yes, that certainly makes sense. The difficulty of handling this revolves
around the distributed nature (more specifically, the replication) of
operations. There are different ways to implement it:
Have the update js function run on the primary shard, and then batch the
changes using a mechanism similar to the batch API. This will mean
replicating the data to the replicas. This is simpler to implement, though
the update is not atomic or blocking (other index operations on the same
data might "get in").
This does require refreshing the relevant shard(s) before execution in
order to see the latest data.
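A minimal simulation of this first approach (all names are illustrative, not Elasticsearch internals): the function executes only on the primary, and the resulting documents, not the function itself, are shipped to the replicas, much like a batch of ordinary index operations:

```python
# Sketch of approach 1: the update function runs on the primary only;
# the *resulting documents* are then replicated, like batched index ops.
# All names are illustrative, not Elasticsearch internals.

def update_on_primary(primary, replicas, update_fn):
    # Run the function once, on the primary's copy of the data.
    changed = {doc_id: update_fn(doc) for doc_id, doc in primary.items()}
    primary.update(changed)
    # Replicate the concrete documents (not the function) to each replica.
    for replica in replicas:
        replica.update(changed)

primary = {"1": {"count": 1}}
replica = {"1": {"count": 1}}
update_on_primary(primary, [replica], lambda d: {**d, "count": d["count"] + 1})
# Primary and replica hold identical documents afterwards.
```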
Have the update function run on both the primary and the replicas. This is
more efficient in that the data does not need to be transferred to the
replicas, but the query will be executed on all replicas, and it's much
harder to maintain consistency between a shard and its replicas in this
case (which must be maintained, of course).
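Why this second approach makes consistency hard can be seen with a small simulation (again, all names are illustrative): if the update function is not deterministic, for example because it reads a node-local clock, the primary and the replica silently diverge when each executes it independently:

```python
# Sketch of approach 2: the update function runs independently on the
# primary and on each replica. If the function is non-deterministic
# (here: it reads a node-local clock), the copies silently diverge.
# All names are illustrative, not Elasticsearch internals.

def make_clock(start):
    """Per-node clock; different nodes are rarely in perfect sync."""
    state = {"t": start}
    def now():
        state["t"] += 1
        return state["t"]
    return now

def run_update(shard, update_fn, now):
    for doc_id, doc in shard.items():
        shard[doc_id] = update_fn(doc, now)

primary = {"1": {"msg": "hi"}}
replica = {"1": {"msg": "hi"}}
stamp = lambda doc, now: {**doc, "updated_at": now()}

run_update(primary, stamp, make_clock(100))  # primary's clock
run_update(replica, stamp, make_clock(250))  # replica's clock, out of sync

# primary and replica now disagree on "updated_at" for the same document.
```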
aparo has been talking about this as well (on IRC), and even went ahead and
implemented proof-of-concept code.
-shay.banon
I've implemented an update feature in ES for a client of mine following
approach 2. The problem, as Shay said, is the consistency between a shard
and its replicas.
There is a high risk of ending up with broken/corrupted data in your index
if there are problems during the update (verified in some of my border
cases).
So now I'm working on implementing the parent/child approach, as Shay
suggested to me.