We now have a need for remote cluster replication. (Can't wait for native replication, whenever that might be coming...) I don't need active-active (yet), but I do need to replicate all the indices within the cluster to a remote cluster. The sources writing to Elasticsearch shouldn't need to know where the remote, read-only clusters are. So I've been playing with the idea of using an index listener to watch for changes/deletions within each index. Those changes get written as a bulk message that gets dropped onto a Kafka broker/topic. Kafka then replicates its topic log to the remote data center. From there a Kafka consumer picks up the topic changes and pushes them to its local Elasticsearch cluster.
I've written a little POC plugin that does exactly this. Obviously it assumes that the index mappings and settings already exist in the remote cluster.
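For anyone following along, here is a rough sketch of the "changes written as a bulk message" step. It is not the plugin's actual code. The change-event shape (`op`, `index`, `id`, `source`) and the function name `to_bulk_payload` are made up for illustration. The only thing it relies on is the Elasticsearch bulk format: newline-delimited JSON with an action line, followed by a source line for index operations, and a trailing newline.

```python
import json


def to_bulk_payload(changes):
    """Serialize change events into a bulk-style NDJSON string.

    Each change is a dict with hypothetical keys:
    op ("index" or "delete"), index, id, and source (index ops only).
    """
    lines = []
    for change in changes:
        meta = {"_index": change["index"], "_id": change["id"]}
        if change["op"] == "delete":
            # Delete actions carry no source line.
            lines.append(json.dumps({"delete": meta}))
        else:
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(change["source"]))
    if not lines:
        return ""
    # The bulk format expects every line, including the last, to end in "\n".
    return "\n".join(lines) + "\n"


changes = [
    {"op": "index", "index": "logs", "id": "1", "source": {"msg": "hello"}},
    {"op": "delete", "index": "logs", "id": "2"},
]
payload = to_bulk_payload(changes)
```

The resulting string is what would be published as the message value on the topic, and the consumer on the other side could hand it to its local cluster's bulk endpoint unchanged.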
I'm just looking for some feedback on this approach.
Is this a taboo question?
I believe LinkedIn does something similar, but I cannot find the article
that I read. Not sure if they use an index listener.
If you are going to index to another cluster, why don't you simply index
twice from the start? Each cluster has its own Kafka subscriber.
You mean write to Kafka initially, distribute the log, and then have a Kafka consumer index in each data center? Such a better idea. So our teams never write to Elasticsearch directly. I kinda feel silly for not thinking of this.
- Where the individual clusters are no longer matters to the development team... they just write once to a kafka topic
- I decide where the topic is distributed
- I don't worry about them flooding the cluster because I control the number of consumers indexing
I way overcomplicated it... Thanks Ivan.
Don't get me wrong, the plugin you wrote could be very useful. If the
original indexed content did not come from Kafka or some replicated source,
a post-index tool is helpful. There is an old issue requesting such a
feature that is still very active:
But as you discovered, if you have control over the data pipeline, you
might as well exploit the pubsub architecture and have multiple subscribers
for each topic. You get to decouple the indexing process from the rest of
the app. I believe queues should be the norm when indexing content nowadays.
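
To make the "multiple subscribers per topic" point concrete, here is a toy sketch. It uses an in-memory stand-in for a single-partition topic, not a real Kafka client, and every name in it (`TopicLog`, `index_into`, the group names) is invented for the example. The idea it tries to show is that each cluster's consumer group keeps its own offset into the same log, so producers write once and every cluster converges on the same documents independently.

```python
from collections import defaultdict


class TopicLog:
    """In-memory stand-in for one Kafka topic partition (append-only log)."""

    def __init__(self):
        self.records = []
        # Consumer group name -> next offset that group will read.
        self.offsets = defaultdict(int)

    def publish(self, record):
        self.records.append(record)

    def poll(self, group, max_records=100):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


def index_into(cluster, batch):
    """Apply a batch of change records to a dict standing in for a cluster."""
    for record in batch:
        if record["op"] == "delete":
            cluster.pop(record["id"], None)
        else:
            cluster[record["id"]] = record["source"]


topic = TopicLog()
# Producers write once; they know nothing about the clusters.
topic.publish({"op": "index", "id": "a", "source": {"title": "first"}})
topic.publish({"op": "index", "id": "b", "source": {"title": "second"}})
topic.publish({"op": "delete", "id": "a"})

# One subscriber (consumer group) per data center, each with its own cluster.
dc_east, dc_west = {}, {}
index_into(dc_east, topic.poll("dc-east"))
index_into(dc_west, topic.poll("dc-west"))
```

After both polls, `dc_east` and `dc_west` hold the same documents, and adding a third data center is just another group name reading the log from the start. With a real broker the offsets would be tracked by Kafka per consumer group rather than in a dict.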