We are upserting almost 500M documents a day via the Index API. Each document can have 50-300 fields, and across all documents there are more than 3k unique mappings/fields. In reality, only a handful of fields change per document, but since we use the Index API, I think ES re-analyzes every field even when it hasn't changed. Because of this, we had to scale up the cluster to keep CPU and GC under control.
The other option I could think of is the Update API, but since we use external versioning, that is not possible either.
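For context, here is roughly what our write path looks like; a minimal sketch using the Python client, with hypothetical index, field, and version values:

```python
# Minimal sketch of the current write path: a full-document index request
# with external versioning. Index and field names here are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster for the example

doc = {"field_a": "unchanged value", "field_b": 42}  # real docs have 50-300 fields

es.index(
    index="my-index",         # hypothetical index name
    id="doc-1",
    body=doc,
    version=1700000000,       # external version, e.g. a timestamp from the source system
    version_type="external",  # the external versioning is what rules out the Update API for us
)
```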
So, I'm just looking to see if there is any other approach that we can take.
The Update API has the advantage that you only need to send a request with the changes rather than the full document, but the rest is the same: because of Lucene's immutable data structures, you'll still need to replace the document and write it anew (re-analyzing, rewriting, merging the replaced document away, ...).
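To make that concrete, a partial update looks roughly like this (Python client, hypothetical names): only the changed field is sent over the wire, but the document is still rewritten in full behind the scenes.

```python
# Sketch of a partial update: only the delta travels over the wire,
# but the document is still fetched, merged, and re-indexed as a whole.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.update(
    index="my-index",               # hypothetical index name
    id="doc-1",
    body={"doc": {"field_b": 43}},  # only the changed field
)
```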
One approach to the problem is parent-child: basically putting the static data into one document and the frequently changing data into another, much smaller one. BUT it comes with a heavy read overhead (since you're now dealing with more than one document per entry) and is most likely not what you want.
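If you did want to explore it, a rough sketch of that split with a join field could look like this (hypothetical index, relation, and field names):

```python
# Parent holds the bulky static fields, the child holds the small,
# frequently changing part; only the child gets rewritten on updates.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="my-split-index",
    body={
        "mappings": {
            "properties": {
                "doc_relation": {
                    "type": "join",
                    "relations": {"static": "dynamic"},  # parent -> child
                }
            }
        }
    },
)

# Parent: written once, rarely touched.
es.index(
    index="my-split-index",
    id="p1",
    body={"doc_relation": "static", "big_static_field": "..."},
)

# Child: small document rewritten on every change; it must live on the
# parent's shard, hence the routing value.
es.index(
    index="my-split-index",
    id="c1",
    routing="p1",
    body={"doc_relation": {"name": "dynamic", "parent": "p1"}, "hot_field": 42},
)
```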
Other knobs to tune that come to mind first:
Refresh interval: Are you using the default of 1s and could you increase that? That will normally help with heavy write activity (but has the tradeoff of data "freshness"); see the sketch after this list.
Elasticsearch upgrade; 7.10 is missing years of optimizations (including in Lucene and the default JDK version that is shipped with it).
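The refresh interval in particular is trivial to change and to revert, something along these lines (Python client, hypothetical index name):

```python
# Raise the refresh interval from the 1s default to trade some data
# "freshness" for lighter write-side load.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_settings(
    index="my-index",                             # hypothetical index name
    body={"index": {"refresh_interval": "30s"}},  # default is 1s
)
```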
Good to have that understanding of the Update API. I was mistaken in thinking it would be faster.
Refresh interval: Are you using the default of 1s and could you increase that?
Yup, that has already been increased to 5 seconds.
Elasticsearch upgrade; 7.10 is missing years of optimizations
Yes, that is part of the plan, probably in 8-10 months, but doesn't ES still have some limitations around how many mappings can be used? I guess my question is: would we see just a small improvement or a huge improvement moving to 8 with this?
The immutability of Lucene doesn't change, so performance improvements will only be incremental though they add up over time.
And I see 30s refresh intervals quite frequently as well. While it has a tradeoff, it's something that's easy to change or try out and should have a very direct impact.
Then there are more complicated changes that will also require better knowledge of the data or access patterns and will have more of an "it depends" performance improvement:
Since less than 10% of the available fields are actually used in any single document, could you group similar documents together into different groups of indices to make them "denser"? (This was one of the motivations for creating data streams; see the sketch after this list.)
Is it really the optimal shard count and size?
Is it the optimal hardware profile (maybe you could go with fewer nodes that have more CPU)?
Parent/child, or, if you have repetitive data in the "static" part of the documents, maybe a lookup runtime field?
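On the "denser indices" idea, a loose sketch of what I mean, assuming a hypothetical doc_type field that identifies which group of fields a document actually uses:

```python
# Route documents that share the same field profile into their own index,
# so each index only carries a fraction of the 3k+ mappings.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def to_action(doc):
    group = doc.pop("doc_type", "default")  # hypothetical grouping key
    return {
        "_op_type": "index",
        "_index": f"events-{group}",  # hypothetical naming scheme, e.g. events-orders
        "_id": doc["id"],
        "_source": doc,
    }

docs = [
    {"id": "1", "doc_type": "orders", "order_total": 100},
    {"id": "2", "doc_type": "clicks", "page": "/home"},
]

helpers.bulk(es, (to_action(d) for d in docs))
```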