Has anyone tried replicating one cluster to a new cluster and keeping the
two in "sync"? For example, I have a production cluster and I need to
reindex all of its data. I would like to do this in a second cluster so I
can compare the changes, but if an update happens on the original index I
want it reflected in the replicated one.
I am pretty sure I can whip something up with scroll/scan, but if someone
has done this before and has code to share, that would be great.
First and most important, the good news: ES 1.0.0.Beta2 has the
snapshot/restore feature in place, so it should be easy to snapshot the
source cluster and restore the result to a target cluster. The snapshots
are also incremental.
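A rough sketch of the snapshot/restore round trip, written as the list of
REST calls involved rather than a finished tool. It assumes a shared
filesystem repository that both clusters can reach; the repository name,
snapshot name, location, and index name are made up for illustration, and
the helper snapshot_restore_requests is hypothetical. It only builds the
requests and does not send anything.

```python
import json


def snapshot_restore_requests(repo, snapshot, location, indices=None):
    """Return (cluster, method, path, body) tuples for a snapshot/restore copy.

    Hypothetical helper: it only describes the REST calls, it sends nothing.
    """
    repo_body = {"type": "fs", "settings": {"location": location}}
    # Optionally limit the snapshot to some indices (comma-separated names).
    snapshot_body = {"indices": ",".join(indices)} if indices else {}
    snapshot_path = "/_snapshot/%s/%s" % (repo, snapshot)
    return [
        # Register the repository on the source cluster and take the snapshot.
        ("source", "PUT", "/_snapshot/%s" % repo, repo_body),
        ("source", "PUT", snapshot_path + "?wait_for_completion=true", snapshot_body),
        # Register the same repository on the target cluster and restore.
        ("target", "PUT", "/_snapshot/%s" % repo, repo_body),
        ("target", "POST", snapshot_path + "/_restore", {}),
    ]


# Assumed example values, not taken from the thread.
for cluster, method, path, body in snapshot_restore_requests(
    "my_backup", "snapshot_1", "/mount/backups/my_backup", ["products"]
):
    print(cluster, method, path, json.dumps(body, sort_keys=True))
```

Because snapshots are incremental, repeating the snapshot step under a new
snapshot name should only store what changed since the previous one. The
restore step still has to be run again on the target to pick those changes
up.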
Second, there is also news about the knapsack plugin.
In the next knapsack plugin version, due this week, a full copy from
cluster1 to cluster2 will be as simple as
The limitations are that the knapsack plugin must be installed on
cluster1node, cluster1 and cluster2 must run the same JVM version and the
same ES version, and all your indexes must have stored fields, preferably
the _source field. Also, cluster1 must not modify the indexes while the
_export/copy is running, or cluster2 may end up with different data
(there is no inherent locking).
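Since nothing locks the source during a copy, it can be worth checking
afterwards whether the two sides drifted apart. The following is only a
minimal sketch of one way to do that, not part of knapsack: it assumes
you have already pulled the hits (with _id and _source) from the same
index and type on both clusters, for example with scan/scroll, and the
helper names fingerprint and diff_copies are made up.

```python
import hashlib
import json


def fingerprint(hits):
    """Map each document id to a hash of its _source (keys sorted for stability)."""
    result = {}
    for hit in hits:
        payload = json.dumps(hit["_source"], sort_keys=True).encode("utf-8")
        result[hit["_id"]] = hashlib.sha1(payload).hexdigest()
    return result


def diff_copies(source_hits, target_hits):
    """Report ids missing from the target, changed on one side, or only on the target."""
    src = fingerprint(source_hits)
    dst = fingerprint(target_hits)
    return {
        "missing": sorted(set(src) - set(dst)),
        "changed": sorted(i for i in src if i in dst and src[i] != dst[i]),
        "extra": sorted(set(dst) - set(src)),
    }


# Toy data standing in for scan/scroll results from each cluster.
source = [{"_id": "1", "_source": {"name": "a"}}, {"_id": "2", "_source": {"name": "b"}}]
target = [{"_id": "1", "_source": {"name": "a"}}, {"_id": "2", "_source": {"name": "old"}}]
print(diff_copies(source, target))
```

Comparing hashes of _source rather than _version is deliberate: a document
that was reindexed into the target does not necessarily carry the same
version number as the original.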
In the new knapsack export version, you will be able to use arbitrary ES
queries to select subsets of the cluster data to copy, so only the hits of
a query can be transferred.
A colleague of mine here at TaskRabbit whipped up a node.js-based tool
similar to these:
His main use case is replicating an ES cluster from our production system
to a staging/test environment. I believe it has the same requirement as
other similar tools, mainly that the source index needs to have the
original documents stored in the _source field.
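To show why _source matters for this kind of tool, here is a minimal
sketch (in Python, not the node.js tool mentioned above) of the core step:
turning hits read from the source cluster into a bulk request body for the
target. The helper name hits_to_bulk and the sample index names are made
up, and the hit layout with _index, _type, and _id is assumed to be the
ES 1.x-era one.

```python
import json


def hits_to_bulk(hits, target_index=None):
    """Build a newline-delimited bulk body that re-indexes the given hits.

    Each hit must carry _source; without it there is nothing to re-index.
    """
    lines = []
    for hit in hits:
        if "_source" not in hit:
            raise ValueError("document %s has no _source" % hit.get("_id"))
        meta = {
            "_index": target_index or hit["_index"],
            "_type": hit["_type"],
            "_id": hit["_id"],
        }
        # Bulk format: one action line, then one document line.
        lines.append(json.dumps({"index": meta}, sort_keys=True))
        lines.append(json.dumps(hit["_source"], sort_keys=True))
    # Every line, including the last, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


# Toy hit standing in for one page of scan/scroll results.
page = [{"_index": "products", "_type": "product", "_id": "1", "_source": {"name": "a"}}]
print(hits_to_bulk(page, target_index="products_v2"), end="")
```

The resulting string would be sent to the target cluster's _bulk endpoint,
one scroll page at a time. Keeping the original _id means a later run
overwrites documents instead of duplicating them.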