We are upserting almost 500M documents a day via the Index API. Each document can have 50-300 fields, and across all documents there are more than 3k unique mappings/fields. In reality, only a handful of fields change per document, but since we use the Index API, I think ES re-analyzes every field even when it hasn't changed. Because of this, we had to scale up the cluster to keep CPU and GC under control.
The other option I could think of is the Update API, but since we use external versioning, that is not possible either.
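For context, here is roughly what our write path looks like; a minimal sketch using the Python client, with hypothetical index, field, and version values:

```python
# Minimal sketch of the current write path: a full-document index request
# with external versioning. Index and field names here are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster for the example

doc = {"field_a": "unchanged value", "field_b": 42}  # real docs have 50-300 fields

es.index(
    index="my-index",         # hypothetical index name
    id="doc-1",
    body=doc,
    version=1700000000,       # external version, e.g. a timestamp from the source system
    version_type="external",  # the external versioning is what rules out the Update API for us
)
```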
So, I'm just looking to see if there is any other approach that we can take.
The Update API has the advantage that you only need to send a request with the changes rather than the full document, but the rest is the same: because of Lucene's immutable data structures, you'll still need to replace the document and write it anew (re-analyzing, rewriting, merging the replaced document away, ...).
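To make that concrete, a partial update looks roughly like this (Python client, hypothetical names): only the changed field is sent over the wire, but the document is still rewritten in full behind the scenes.

```python
# Sketch of a partial update: only the delta travels over the wire,
# but the document is still fetched, merged, and re-indexed as a whole.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.update(
    index="my-index",               # hypothetical index name
    id="doc-1",
    body={"doc": {"field_b": 43}},  # only the changed field
)
```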
One approach to the problem is parent-child: basically putting the static data into one document and the frequently changing data into another, much smaller one. BUT it comes with a heavy read overhead (since you're now dealing with more than one document per entry) and is most likely not what you want.
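If you did want to explore it, a rough sketch of that split with a join field could look like this (hypothetical index, relation, and field names):

```python
# Parent holds the bulky static fields, the child holds the small,
# frequently changing part; only the child gets rewritten on updates.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="my-split-index",
    body={
        "mappings": {
            "properties": {
                "doc_relation": {
                    "type": "join",
                    "relations": {"static": "dynamic"},  # parent -> child
                }
            }
        }
    },
)

# Parent: written once, rarely touched.
es.index(
    index="my-split-index",
    id="p1",
    body={"doc_relation": "static", "big_static_field": "..."},
)

# Child: small document rewritten on every change; it must live on the
# parent's shard, hence the routing value.
es.index(
    index="my-split-index",
    id="c1",
    routing="p1",
    body={"doc_relation": {"name": "dynamic", "parent": "p1"}, "hot_field": 42},
)
```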
Other knobs to tune that come to mind first:
Refresh interval: Are you using the default of 1s and could you increase that? That will normally help with heavy write activity (but has the tradeoff of data "freshness"); see the sketch after this list.
Elasticsearch upgrade; 7.10 is missing years of optimizations (including in Lucene and the default JDK version that is shipped with it).
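The refresh interval in particular is trivial to change and to revert, something along these lines (Python client, hypothetical index name):

```python
# Raise the refresh interval from the 1s default to trade some data
# "freshness" for lighter write-side load.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_settings(
    index="my-index",                             # hypothetical index name
    body={"index": {"refresh_interval": "30s"}},  # default is 1s
)
```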
Good to have that understanding of the Update API. I was mistaken in thinking it would be faster.
Refresh interval: Are you using the default of 1s and could you increase that?
Yup, that has already been increased to 5 seconds.
Elasticsearch upgrade; 7.10 is missing years of optimizations
Yes, that is part of the plan, probably in 8-10 months, but doesn't ES still have some limitations around how many mappings can be used? I guess my question is: would we see just a small improvement or a huge improvement moving to 8 with this?
The immutability of Lucene doesn't change, so performance improvements will only be incremental though they add up over time.
And I see 30s refresh intervals quite frequently as well. While it has a tradeoff, it's something that's easy to change or try out and should have a very direct impact.
Then there are more complicated changes that will also require better knowledge of the data or access patterns and will have more of an "it depends" performance improvement:
Since less than 10% of the available fields are actually used in any single document, could you group similar documents together into different groups of indices to make them "denser"? (This was one of the motivations for creating data streams; see the sketch after this list.)
Is it really the optimal shard count and size?
Is it the optimal hardware profile (maybe you could go with fewer nodes that have more CPU)?
Parent/child, or, if you have repetitive data in the "static" part of the documents, maybe a lookup runtime field?
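On the "denser indices" idea, a loose sketch of what I mean, assuming a hypothetical doc_type field that identifies which group of fields a document actually uses:

```python
# Route documents that share the same field profile into their own index,
# so each index only carries a fraction of the 3k+ mappings.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def to_action(doc):
    group = doc.pop("doc_type", "default")  # hypothetical grouping key
    return {
        "_op_type": "index",
        "_index": f"events-{group}",  # hypothetical naming scheme, e.g. events-orders
        "_id": doc["id"],
        "_source": doc,
    }

docs = [
    {"id": "1", "doc_type": "orders", "order_total": 100},
    {"id": "2", "doc_type": "clicks", "page": "/home"},
]

helpers.bulk(es, (to_action(d) for d in docs))
```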