Reindexing indices between clusters

Hi @Igor_Motov,

Following our thread (closed),
We are considering upgrading both clusters (the one that bakes the index and the one that serves it) from 1.4.7 to Elasticsearch 6 in order to use the reindex feature. It is going to be a very challenging upgrade, so I want to make sure that this is the right solution for us. The motivation: every baked/ready index would trigger a reindex operation in the serving cluster (this can be done in parallel, right?) instead of using snapshot & restore. What do you think?
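
For reference, this is roughly what such a cross-cluster reindex could look like via reindex-from-remote (available since Elasticsearch 5.x), sketched here with the elasticsearch-py client; the host and index names are placeholders, and the baking cluster's host must be whitelisted in `reindex.remote.whitelist` on the serving cluster:

```python
from elasticsearch import Elasticsearch

# Client pointed at the *serving* cluster (placeholder host).
serving = Elasticsearch(["http://serving-cluster:9200"])

# Pull a freshly baked index from the baking cluster into the serving cluster.
# The baking host must be listed in reindex.remote.whitelist on the serving nodes.
serving.reindex(
    body={
        "source": {
            "remote": {"host": "http://baking-cluster:9200"},
            "index": "catalog_v2",
        },
        "dest": {"index": "catalog_v2"},
    },
    wait_for_completion=False,  # returns a task id; poll the Tasks API to track progress
)
```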

Thanks

What problem are you trying to solve with all this complexity?

Moving/updating indices from one source cluster to another (serving) cluster.
There are hundreds of indices, and each one should be updated every hour.
Each index contains between 10k and 10M documents.

Why couldn't you just update them in the serving cluster? What's the purpose of having two clusters instead of one?

One is "baking" the index (do calculations for documents data and the second one just serving. To serve in high performance (latency) we are using aliases in serving cluster. Per each alias we maintain two indices - one that currently serving and another one that is waiting to be updated (today by s3 snapshot, and that is what we want to change to reindex). Once it's updated the alias points to this "most updated" index which serve until the next update and so on..

Why can this not be done in a single cluster? If you want these to run on different nodes, you could set up two zones using shard allocation filtering and have preparation only be done on some of the nodes. When it is complete you can reallocate them to the other zone and flip the alias.
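
As a rough illustration of that idea (node attribute, host, and index names below are made up, and the calls follow the elasticsearch-py client): tag each node with a custom attribute in elasticsearch.yml, e.g. `node.attr.zone: bake` or `node.attr.zone: serve` (older versions use the `node.zone` form), then pin the index being built to the "bake" zone with an index-level allocation filter.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Build the new version of the index on the "bake" nodes only.
# Assumes each node carries a custom attribute, e.g. node.attr.zone: bake / serve.
es.indices.create(
    index="catalog_v2",
    body={
        "settings": {
            "index.routing.allocation.require.zone": "bake",
            "number_of_shards": 5,
        }
    },
)

# ... bulk index / enrich catalog_v2 here, without touching the serving nodes ...
```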

I was typing pretty much what @Christian_Dahlqvist said. I just want to add that switching from restore to reindex is not going to help you here, since restore is mostly an I/O operation while reindex is going to tax both CPU and I/O. So reindexing is going to negate the benefit of having a separate cluster.

What do I achieve by that? Can I control which nodes serve and which don't? How? And what is the cost of reallocating shards to the other zone?

If you want the indexing to not affect the searching, you can have that happen on different sets of nodes within the cluster. I assumed this was why you were trying to use 2 separate clusters. If this is not the reason you were contemplating two clusters, can you please explain what you were aiming to achieve using this approach?

What is the cost of reallocating shards? How much time does it take? Only I/O? Copying entire shards? Does it work with deltas?

This approach assumes you work with different versions of the index and replace one with the other once it is ready. It moves shards by copying files, which is the same mechanism Elasticsearch uses to redistribute data when nodes fail or are added to the cluster.
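
As a sketch of what triggers that shard movement (reusing the hypothetical `zone` attribute and index name from the earlier example, via the elasticsearch-py client): changing the index-level allocation filter makes the cluster relocate the shards by copying their files to the target nodes, and progress can be watched through the recovery API.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Retag the rebuilt index so its shards relocate to the "serve" nodes.
es.indices.put_settings(
    index="catalog_v2",
    body={"index.routing.allocation.require.zone": "serve"},
)

# Shard-by-shard copy progress is visible in the recovery API.
print(es.cat.recovery(index="catalog_v2", v=True))
```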

Does it have concurrency limitations? (Like snapshots). What if I want to move shards of more than one index at a time?

Thanks for the quick reply. My name is Yaron, working with Itay.

  1. Can 'moving index between shards' work with Elasticsearch 1.4.7, or should we upgrade to 6 for that?
  2. How will 'moving shards' work if we only want to update the delta (for example if only 0.1% of the index has changed)? Updates / inserts / deletions...
  3. Note that in the serving region we have 2 separate clusters (each in a different Availability Zone) - both clusters receive traffic - how will 'moving shards' work in this case? It will require updating the index on both clusters, right?
    We use 2 clusters for serving for redundancy and to allow fallback from one cluster to the other if one of them
    cannot serve for some reason (e.g. an index was deleted, or maintenance tasks are being run on the cluster) -
    these cases require us to temporarily divert traffic to the other cluster while the first one is down.

Moving indices between zones (not shards) relies on shard allocation filtering, which is also available in Elasticsearch 1.x.

The solution I proposed relies on having separate indices that you switch between, which is what I thought you were doing based on the initial description. If you are only updating a small portion of the data, you may be better off just updating a single index in place.

If you could describe the solution and requirements in greater detail, it would probably be easier to provide appropriate advice.

When you have multiple active-active clusters it is generally recommended to update both clusters in parallel.

  1. We are working on a recommendation system that is supposed to render personalized strategies within milliseconds.
  2. We have a periodic Celery-based application indexing customers' catalogs into its own Elasticsearch cluster, enriching the catalog with collected statistics (for example, assigning a popularity score to each of the products in the catalog), which the index is then sorted by.
  3. Once we are done updating the index, a snapshot is created and waits to be published to the serving region.
  4. In the serving region we have an active-active pair of Elasticsearch clusters, which our backend application queries in order to return personalized recommendations to the client.
  5. We are working with customers' catalogs that contain up to 10-30 million products.
  6. We have around 200 catalogs (indices) at the moment.
  7. In order to shorten catalog update cycles we try to update only the deltas (updates, inserts, deletions) as soon as the catalog changes (a product's metadata changes, etc.) - our goal is to have the index in the serving region updated with the latest version in near real time, within 2-3 minutes after we are notified of the change in the catalog.
  8. Once or twice a day we are required to update the entire index (not only metadata changes, but full statistics on all of the products).
  9. In order to improve performance we split an index into multiple shards only when it exceeds 2.5-3 million products (i.e. a 15-million-product index consists of 5 shards).
  10. The way we currently update the index in the serving region is by publishing a snapshot (restoring from S3 with the Elasticsearch AWS plugin). This is our main bottleneck since it is a single-process flow; each snapshot creation (for indices containing more than 1 million products, for example) takes several minutes. The restore process also takes several minutes for such big indices.
  11. Waiting time for publishing can grow up to 4 hours... this is unacceptable, so we are trying to come up with a new architecture: a new way to publish the snapshot and build the index in the serving region, only deltas, and so on.

@Christian_Dahlqvist @Igor_Motov Can you give an advice please? What would you do?

Thanks,
Itay

I am assuming that you are applying delta updates directly to the serving clusters and that this is not currently a problem.

With respect to the full refreshes, I would probably consider adding a couple of nodes to the serving cluster and setting it up with two distinct zones using shard allocation filtering as I described earlier. One zone would only hold the indices being queried, while the other would only hold the ones being built. This will separate the load and prevent the indexing-heavy update process from affecting the search load (at least if you use coordinating-only nodes to route the traffic).

Once an index has been rebuilt, you can move it to the serving zone by changing settings on the index. The process to relocate the index is the same one that the restore process uses, but you can tune the concurrency level and change the settings on indices to 'queue up' relocations, so you should not suffer the same issues as when restoring multiple snapshots. If you have an alias in front of each index, you can then change this to point to the new version of the index once it is deemed ready and has completed relocating.
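
A minimal sketch of that last part, assuming the placeholder index/alias names used above and the elasticsearch-py client (the recovery-throttling settings shown exist in current versions, but their names and defaults vary between releases):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Optionally raise the relocation/recovery concurrency and bandwidth caps.
es.cluster.put_settings(
    body={
        "transient": {
            "cluster.routing.allocation.node_concurrent_recoveries": 4,
            "indices.recovery.max_bytes_per_sec": "200mb",
        }
    }
)

# Wait until the rebuilt index has finished relocating to the serving zone.
es.cluster.health(index="catalog_v2", wait_for_no_relocating_shards=True, timeout="30m")

# Atomically flip the alias from the old version to the new one.
es.indices.update_aliases(
    body={
        "actions": [
            {"remove": {"index": "catalog_v1", "alias": "catalog"}},
            {"add": {"index": "catalog_v2", "alias": "catalog"}},
        ]
    }
)
```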

Thanks for the response.
Is it possible to copy a shard from one cluster to another (not only within the same cluster)?

Copying indices between clusters is what snapshot/restore is for.
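
For completeness, a cross-cluster copy via a shared S3 repository looks roughly like this (repository, bucket, and index names are placeholders; the `s3` repository type comes from the AWS/repository-s3 plugin the thread already mentions, and the target index must not already exist open on the serving cluster):

```python
from elasticsearch import Elasticsearch

baking = Elasticsearch(["http://baking-cluster:9200"])
serving = Elasticsearch(["http://serving-cluster:9200"])

# Both clusters register the same S3 bucket as a snapshot repository.
# (On the serving side it is common to register it as read-only to avoid write conflicts.)
repo_body = {"type": "s3", "settings": {"bucket": "my-snapshots"}}
baking.snapshot.create_repository(repository="catalog-repo", body=repo_body)
serving.snapshot.create_repository(repository="catalog-repo", body=repo_body)

# Baking cluster writes the snapshot...
baking.snapshot.create(
    repository="catalog-repo",
    snapshot="catalog_v2",
    body={"indices": "catalog_v2"},
    wait_for_completion=True,
)

# ...and the serving cluster restores it.
serving.snapshot.restore(
    repository="catalog-repo",
    snapshot="catalog_v2",
    body={"indices": "catalog_v2"},
    wait_for_completion=True,
)
```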

Ok, Thanks!