Synchronize data between two Elasticsearch clusters of different versions.

Hi all, recently I was exploring how to synchronize data between two Elasticsearch clusters of different versions, and luckily I came across a fantastic open-source tool on GitHub called INFINI Gateway.

According to the official description, INFINI Gateway is a reverse proxy for Elasticsearch clusters that can do many things, such as traffic control, query result caching, request logging and analysis, as well as traffic replication.

Traffic replication means that INFINI Gateway replicates the traffic it receives to multiple clusters. These can be different versions of Elasticsearch or even OpenSearch, which is amazing.

I had also considered using a message queue to achieve synchronization between multiple ES clusters, but I believe that INFINI Gateway’s solution is better, lighter, and offers more extensibility features that I may consider using in the future.

Now, let me briefly talk about how I used INFINI Gateway to achieve data synchronization.

  1. Download
    Click here.
  2. Modify the INFINI Gateway configuration file
    The default config file is not set up for traffic replication. Download this configuration file from GitHub and modify it for your environment.
    I only modified the resource definition section at the top.
  #primary
  PRIMARY_ENDPOINT: http://192.168.56.3:7171
  PRIMARY_USERNAME: elastic
  PRIMARY_PASSWORD: password
  PRIMARY_MAX_QPS_PER_NODE: 10000
  PRIMARY_MAX_BYTES_PER_NODE: 104857600 #100MB/s
  PRIMARY_MAX_CONNECTION_PER_NODE: 200
  PRIMARY_DISCOVERY_ENABLED: false
  PRIMARY_DISCOVERY_REFRESH_ENABLED: false
  #backup
  BACKUP_ENDPOINT: http://192.168.56.3:9200
  BACKUP_USERNAME: admin
  BACKUP_PASSWORD: admin
  BACKUP_MAX_QPS_PER_NODE: 10000
  BACKUP_MAX_BYTES_PER_NODE: 104857600 #100MB/s
  BACKUP_MAX_CONNECTION_PER_NODE: 200
  BACKUP_DISCOVERY_ENABLED: false
  BACKUP_DISCOVERY_REFRESH_ENABLED: false
  3. Run
    ./gateway-linux-amd64 -config replication_via-disk.yml

  4. Testing
    Send your bulk requests to the Gateway's endpoint, which defaults to http://Your-IP:18000
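
For anyone who wants to script the test, here is a minimal Python sketch of sending a bulk request through the gateway. It uses only the standard library. The index name and documents are made up for illustration, and I am assuming the gateway endpoint accepts the request without extra authentication headers; adjust if your setup differs.

```python
import json
import urllib.request

# Gateway endpoint from the testing step; replace Your-IP with your host.
GATEWAY_URL = "http://Your-IP:18000"


def build_bulk_body(index, docs):
    """Build an NDJSON _bulk body: one action line plus one source line per document."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def send_bulk(body, base_url=GATEWAY_URL):
    """POST the bulk body to the gateway and return the parsed JSON response."""
    request = urllib.request.Request(
        base_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling `send_bulk(build_bulk_body("test-index", {"1": {"title": "hello"}}))` should index one document via the gateway; `test-index` is a hypothetical index name.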

Once the PRIMARY has executed the bulk request successfully, the Gateway then replicates the bulk request to the BACKUP. Otherwise, the Gateway just returns the error information from the PRIMARY to the client that sent the bulk request.

If you send a query to the gateway, it will be forwarded to the PRIMARY for execution, and the results will be returned to the client.

Interestingly, if the PRIMARY is unavailable when you send the query, the gateway forwards it to the BACKUP for execution and returns the BACKUP's results to the client.

I hope this can help those in need.

How does it handle the case when the PRIMARY succeeds but the BACKUP is unavailable or fails? Does it queue up requests and execute them in order when the cluster is again available or will it accept that the data will be inconsistent?

Will this lead to inconsistency, or is data also queued up for the PRIMARY when it is not available?

Yes, the gateway has a local disk queue module, which is enabled by default. Once the PRIMARY has executed the bulk request successfully, the gateway places the request into the queue, where other threads consume it and replay it to the BACKUP. (The local disk queue can be replaced by Kafka.)

If the BACKUP is unavailable or fails, there is a retry mechanism. If the retries exceed a certain threshold, the request is placed in a dead-letter queue to await manual intervention (restoring the BACKUP cluster, the network, etc.).
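
To make the queue, retry, and dead-letter flow concrete, here is a toy Python model of it. This is not the gateway's implementation: the in-memory `deque` stands in for the disk queue, `MAX_RETRIES` is a hypothetical threshold, and `make_flaky_backup` is a fake BACKUP invented for the demo.

```python
from collections import deque

# Hypothetical threshold; the gateway's real setting is not given in this thread.
MAX_RETRIES = 3


def replay(queue, send_to_backup, max_retries=MAX_RETRIES):
    """Consume queued requests in order, retrying failures; return the dead-letter list."""
    dead_letter = []
    while queue:
        request = queue.popleft()
        for _ in range(max_retries):
            if send_to_backup(request):
                break
        else:
            # Every attempt failed: park the request for manual intervention.
            dead_letter.append(request)
    return dead_letter


def make_flaky_backup(failures_by_request):
    """Return a fake BACKUP that fails a set number of times per request, then succeeds."""
    remaining = dict(failures_by_request)
    delivered = []

    def send(request):
        if remaining.get(request, 0) > 0:
            remaining[request] -= 1
            return False
        delivered.append(request)
        return True

    return send, delivered
```

With `make_flaky_backup({"bulk-2": 2, "bulk-3": 5})` and a queue of `bulk-1`, `bulk-2`, `bulk-3`, the first two requests are delivered (the second on its third attempt) and `bulk-3` ends up in the dead-letter list.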

When the PRIMARY is not available, only search requests are forwarded to the BACKUP for collecting results. Bulk/write requests are still forwarded to the PRIMARY, and subsequent error messages are returned to the client. These are the default behaviors of the gateway and can be modified.
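
The default behaviors described in this thread can be summarized as a small decision table. The Python sketch below is only a model of that description; `Decision` and `decide` are names I made up, and the real gateway is configured through its pipeline settings rather than code like this.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    target: str      # cluster that executes the request
    replicate: bool  # whether the request is queued for replay to the BACKUP


def decide(is_write, primary_ok):
    """Model the default routing: writes go to the PRIMARY, searches may fail over."""
    if is_write:
        # Writes always target the PRIMARY. Only requests the PRIMARY executed
        # successfully are queued for the BACKUP; otherwise the client gets the error.
        return Decision(target="PRIMARY", replicate=primary_ok)
    if primary_ok:
        return Decision(target="PRIMARY", replicate=False)
    # Search requests fall back to the BACKUP when the PRIMARY is unavailable.
    return Decision(target="BACKUP", replicate=False)
```

For example, `decide(is_write=False, primary_ok=False)` yields a decision targeting the BACKUP, while `decide(is_write=True, primary_ok=False)` still targets the PRIMARY with no replication.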