How many processes can update the same index in parallel?

I have an index in ES that contains 50-100 fields.
I'm getting all those fields from 10-15 different data sources (as partial documents).
For each data source I want to configure one or more indexer processes that will update the index in ES with the new data.

  1. Are there any limitations on parallel indexing for ES?
  2. Can I have, for example, 5 instances of each indexer (50-100 instances working in parallel against the same index)?
  3. How do I calculate the capacity of the index?
  4. Is it possible to corrupt the index data by having too many indexers running in parallel?

How large is the data set in terms of document count?

How frequently is each individual document likely to be updated?

Are you going to use bulk updates?

Which version of Elasticsearch are you using?

Are you updating individual documents directly by ID?

250-500M

Continuously, but the frequency varies. Some parts will be updated continuously, others once per day. Over 24 hours, it is unlikely that more than 10M documents will be updated.

Unlikely because I'm using Kafka as the source of item changes.

7.6.1 with Lucene 8.4

yes

Does this mean several updates per second? If not, what periodicity does it translate to?

I do not have a lot of recent experience with high-update use cases, but can provide some pointers based on what I have seen here in the forum.

  1. Updating documents without using bulk requests can lead to a lot of small segments being generated, which is inefficient and can have a huge negative impact on performance. I believe some improvements might have been made in later versions, so I would recommend that you upgrade to the latest version.
  2. Elasticsearch has never been optimised for very frequent updates, so if you have e.g. counters that are updated several times per second, you may be better off aggregating these outside Elasticsearch and updating periodically.
  3. I do not think you will see corruption if you have highly concurrent updates, but throughput may be poor and you might get errors due to version conflicts. You need to benchmark to tell for sure as it is a somewhat unusual use case.
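
To make points 1 and 2 a bit more concrete, here is a rough sketch in Python (standard library only), not a tested production setup. It merges the partial documents that arrive for the same ID within one batch and then builds a single newline-delimited body for the `_bulk` API, using `update` actions with `retry_on_conflict` so that a version conflict between concurrent indexers is retried rather than returned as an error straight away. The function names, the index name `items`, and the sample fields are made up for illustration, and `doc_as_upsert` is an assumption about the desired behaviour (create the document if it does not exist yet).

```python
import json


def coalesce_partials(events):
    """Merge partial documents per ID; later values overwrite earlier ones."""
    merged = {}
    for doc_id, partial in events:
        merged.setdefault(doc_id, {}).update(partial)
    return merged


def build_bulk_update_body(index, merged, retry_on_conflict=3):
    """Build an NDJSON body for POST /_bulk with one update action per ID."""
    lines = []
    for doc_id, partial in merged.items():
        # Action metadata line: which document to update and how often to retry.
        action = {
            "update": {
                "_index": index,
                "_id": doc_id,
                "retry_on_conflict": retry_on_conflict,
            }
        }
        lines.append(json.dumps(action))
        # Payload line: the partial document to merge into the stored one.
        lines.append(json.dumps({"doc": partial, "doc_as_upsert": True}))
    # The bulk API requires every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical batch, e.g. drained from a Kafka consumer for one data source.
events = [
    ("42", {"price": 10}),
    ("7", {"title": "lamp"}),
    ("42", {"stock": 3}),
    ("42", {"price": 12}),
]
merged = coalesce_partials(events)
body = build_bulk_update_body("items", merged)
print(body)
```

The body would be sent with `Content-Type: application/x-ndjson`. Merging per ID before sending means a document that changes several times within a batch is written once, which is the "aggregate outside Elasticsearch and update periodically" idea applied to partial documents. Batch size and the number of parallel senders still need to be found by benchmarking.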

Thank you for the response. We had it working in Azure Search with 9-12 concurrent, continuously running indexers. The fastest stream was processed by 5 single-threaded processes with a constant queue of updates.
Now we are looking to change the scaling approach, which may lead to more indexers running in parallel.

What did that setup look like? How many indices and shards? What throughput did you get to? What is your target now?
