Reindexing indices between clusters

Hi @Igor_Motov,

Following our thread (closed),
We are considering upgrading both clusters (the one that bakes the index and the one that serves it) from 1.4.7 to Elasticsearch 6 in order to use the reindex feature. It is going to be a very challenging upgrade, so I want to make sure that this is the right solution for us. The motivation: every baked/ready index would trigger a reindex operation in the serving cluster (this can be done in parallel, right?) instead of using snapshot & restore. What do you think?
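
For reference, this is roughly what such a cross-cluster reindex could look like via reindex-from-remote (available since Elasticsearch 5.x), sketched here with the elasticsearch-py client; the host and index names are placeholders, and the baking cluster's host must be whitelisted in `reindex.remote.whitelist` on the serving cluster:

```python
from elasticsearch import Elasticsearch

# Client pointed at the *serving* cluster (placeholder host).
serving = Elasticsearch(["http://serving-cluster:9200"])

# Pull a freshly baked index from the baking cluster into the serving cluster.
# The baking host must be listed in reindex.remote.whitelist on the serving nodes.
serving.reindex(
    body={
        "source": {
            "remote": {"host": "http://baking-cluster:9200"},
            "index": "catalog_v2",
        },
        "dest": {"index": "catalog_v2"},
    },
    wait_for_completion=False,  # returns a task id; poll the Tasks API to track progress
)
```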

Thanks

What problem are you trying to solve with all this complexity?

Moving/updating indices from one source cluster to another (serving) cluster.
There are hundreds of indices, and each one should be updated every hour.
Each index contains between 10k and 10M documents.

Why couldn't you just update them in the serving cluster? What's the purpose of having two clusters instead of one?

One is "baking" the index (do calculations for documents data and the second one just serving. To serve in high performance (latency) we are using aliases in serving cluster. Per each alias we maintain two indices - one that currently serving and another one that is waiting to be updated (today by s3 snapshot, and that is what we want to change to reindex). Once it's updated the alias points to this "most updated" index which serve until the next update and so on..

Why can this not be done in a single cluster? If you want these to run on different nodes, you could set up two zones using shard allocation filtering and have preparation only be done on some of the nodes. When it is complete you can reallocate them to the other zone and flip the alias.
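
As a rough illustration of that idea (node attribute, host, and index names below are made up, and the calls follow the elasticsearch-py client): tag each node with a custom attribute in elasticsearch.yml, e.g. `node.attr.zone: bake` or `node.attr.zone: serve` (older versions use the `node.zone` form), then pin the index being built to the "bake" zone with an index-level allocation filter.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Build the new version of the index on the "bake" nodes only.
# Assumes each node carries a custom attribute, e.g. node.attr.zone: bake / serve.
es.indices.create(
    index="catalog_v2",
    body={
        "settings": {
            "index.routing.allocation.require.zone": "bake",
            "number_of_shards": 5,
        }
    },
)

# ... bulk index / enrich catalog_v2 here, without touching the serving nodes ...
```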

I was typing pretty much what @Christian_Dahlqvist said. I just want to add that switching from restore to reindex is not going to help you here, since restore is mostly an I/O operation while reindex is going to tax both CPU and I/O. So reindexing is going to negate the benefit of having a separate cluster.

What do I achieve by that? Can I control which nodes serve and which don't? How? And what is the cost of reallocating shards to the other zone?

If you want the indexing to not affect the searching, you can have that happen on different sets of nodes within the cluster. I assumed this was why you were trying to use 2 separate clusters. If this is not the reason you were contemplating two clusters, can you please explain what you were aiming to achieve using this approach?

What is the cost of reallocating shards? How much time does it take? Only I/O? Copying entire shards? Does it work with deltas?

This approach assumes you work with different versions of the index and replace one with the other once it is ready. It moves shards by copying files, which is the same mechanism Elasticsearch uses to redistribute data when nodes fail or are added to the cluster.
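
As a sketch of what triggers that shard movement (reusing the hypothetical `zone` attribute and index name from the earlier example, via the elasticsearch-py client): changing the index-level allocation filter makes the cluster relocate the shards by copying their files to the target nodes, and progress can be watched through the recovery API.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Retag the rebuilt index so its shards relocate to the "serve" nodes.
es.indices.put_settings(
    index="catalog_v2",
    body={"index.routing.allocation.require.zone": "serve"},
)

# Shard-by-shard copy progress is visible in the recovery API.
print(es.cat.recovery(index="catalog_v2", v=True))
```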

Does it have concurrency limitations? (Like snapshots). What if I want to move shards of more than one index at a time?

Thanks for the quick reply. My name is Yaron, working with Itay.

  1. Can 'moving index between shards' work with Elasticsearch 1.4.7, or should we upgrade to 6 for that?
  2. How will 'moving shards' work if we only want to update the delta (for example if only 0.1% of the index has changed)? Updates / inserts / deletions...
  3. Note that in the serving region we have 2 separate clusters (each in a different Availability Zone) - both clusters receive traffic - how will 'moving shards' work in this case? It will require updating the index on both clusters, right?
    We use 2 clusters for serving for redundancy and to allow fallback from one cluster to the other if one of them
    cannot serve for some reason (e.g. an index was deleted, or maintenance tasks are being run on the cluster) -
    these cases require us to temporarily divert traffic to the other cluster while the first one is down.

Moving indices between zones (not shards) relies on shard allocation filtering, which is also available in Elasticsearch 1.x.

The solution I proposed relies on having separate indices that you switch between, which is what I thought you were doing based on the initial description. If you are only updating a small portion of the data, you may be better off just updating a single index in place.

If you could describe the solution and requirements in greater detail, it would probably be easier to provide appropriate advice.

When you have multiple active-active clusters it is generally recommended to update both clusters in parallel.

  1. We are working on a recommendation system that is supposed to render personalized strategies within milliseconds.
  2. We have a periodic Celery-based application indexing customers' catalogs into its own Elasticsearch cluster, enriching the catalog with collected statistics (for example, assigning a popularity score to each of the products in the catalog), which the index is then sorted by.
  3. Once we are done updating the index, a snapshot is created and waits to be published to the serving region.
  4. In the serving region we have an active-active pair of Elasticsearch clusters, which our backend application queries in order to return personalized recommendations to the client.
  5. We are working with customers' catalogs that contain up to 10-30 million products.
  6. We have around 200 catalogs (indices) at the moment.
  7. In order to shorten catalog update cycles we try to update only the deltas (updates, inserts, deletions) as soon as the catalog changes (a product's metadata changes, etc.) - our goal is to have the index in the serving region updated with the latest version in near real time, within 2-3 minutes after we are notified of the change in the catalog.
  8. Once or twice a day we are required to update the entire index (not only metadata changes, but full statistics on all of the products).
  9. In order to improve performance we split an index into multiple shards only when it exceeds 2.5-3 million products (i.e. a 15-million-product index consists of 5 shards).
  10. The way we currently update the index in the serving region is by publishing a snapshot (restoring from S3 with the Elasticsearch AWS plugin). This is our main bottleneck since it is a single-process flow; each snapshot creation (for indices containing more than 1 million products, for example) takes several minutes. The restore process also takes several minutes for such big indices.
  11. Waiting time for publishing can grow up to 4 hours... this is unacceptable, so we are trying to come up with a new architecture: a new way to publish the snapshot and build the index in the serving region, only deltas, and so on.

@Christian_Dahlqvist @Igor_Motov Can you give an advice please? What would you do?

Thanks,
Itay

I am assuming that you are applying delta updates directly to the serving clusters and that this is not currently a problem.

With respect to the full refreshes, I would probably consider adding a couple of nodes to the serving cluster and setting it up with two distinct zones using shard allocation filtering as I described earlier. One zone would only hold the indices being queried, while the other would only hold the ones being built. This will separate the load and prevent the indexing-heavy update process from affecting the search load (at least if you use coordinating-only nodes to route the traffic).

Once an index has been rebuilt, you can move it to the serving zone by changing settings on the index. The process to relocate the index is the same one that the restore process uses, but you can tune the concurrency level and change the settings on indices to 'queue up' relocations, so you should not suffer the same issues as when restoring multiple snapshots. If you have an alias in front of each index, you can then change this to point to the new version of the index once it is deemed ready and has completed relocating.
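
A minimal sketch of that last part, assuming the placeholder index/alias names used above and the elasticsearch-py client (the recovery-throttling settings shown exist in current versions, but their names and defaults vary between releases):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Optionally raise the relocation/recovery concurrency and bandwidth caps.
es.cluster.put_settings(
    body={
        "transient": {
            "cluster.routing.allocation.node_concurrent_recoveries": 4,
            "indices.recovery.max_bytes_per_sec": "200mb",
        }
    }
)

# Wait until the rebuilt index has finished relocating to the serving zone.
es.cluster.health(index="catalog_v2", wait_for_no_relocating_shards=True, timeout="30m")

# Atomically flip the alias from the old version to the new one.
es.indices.update_aliases(
    body={
        "actions": [
            {"remove": {"index": "catalog_v1", "alias": "catalog"}},
            {"add": {"index": "catalog_v2", "alias": "catalog"}},
        ]
    }
)
```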

Thanks for the response.
Is it possible to copy a shard from one cluster to another (not only within the same cluster)?

Copying indices between clusters is what snapshot/restore is for.
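
For completeness, a cross-cluster copy via a shared S3 repository looks roughly like this (repository, bucket, and index names are placeholders; the `s3` repository type comes from the AWS/repository-s3 plugin the thread already mentions, and the target index must not already exist open on the serving cluster):

```python
from elasticsearch import Elasticsearch

baking = Elasticsearch(["http://baking-cluster:9200"])
serving = Elasticsearch(["http://serving-cluster:9200"])

# Both clusters register the same S3 bucket as a snapshot repository.
# (On the serving side it is common to register it as read-only to avoid write conflicts.)
repo_body = {"type": "s3", "settings": {"bucket": "my-snapshots"}}
baking.snapshot.create_repository(repository="catalog-repo", body=repo_body)
serving.snapshot.create_repository(repository="catalog-repo", body=repo_body)

# Baking cluster writes the snapshot...
baking.snapshot.create(
    repository="catalog-repo",
    snapshot="catalog_v2",
    body={"indices": "catalog_v2"},
    wait_for_completion=True,
)

# ...and the serving cluster restores it.
serving.snapshot.restore(
    repository="catalog-repo",
    snapshot="catalog_v2",
    body={"indices": "catalog_v2"},
    wait_for_completion=True,
)
```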

Ok, Thanks!