Hi,
I am in the process of upgrading to the latest version of Elasticsearch, and during our reindex testing from the old cluster to the new one (we are jumping from 6.8 to 8.x), we ran into a couple of issues with the source data being too large:
- we hit the 100 MB memory buffer limit on the reindex-from-remote request (using the default batch size of 1,000 documents; request sketched below), and
- some documents had more than 10K items in a nested array
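For context, the reindex-from-remote request we're testing looks roughly like this; hosts, index names, and credentials are placeholders, and I've sketched it in Python just for readability:

```python
# Rough sketch of the reindex-from-remote call we're testing with.
# Host names, index names, and credentials are placeholders.
import requests

NEW_CLUSTER = "https://new-cluster:9200"

reindex_body = {
    "source": {
        "remote": {
            "host": "https://old-cluster-6-8:9200",
            "username": "reindex_user",   # placeholder
            "password": "changeme",       # placeholder
        },
        "index": "source-index",
        # batch size defaults to 1000 docs per scroll request; with
        # oversized documents this is where the 100 MB buffer gets hit
        "size": 1000,
    },
    "dest": {"index": "dest-index"},
}

resp = requests.post(
    f"{NEW_CLUSTER}/_reindex?wait_for_completion=false",
    json=reindex_body,
    auth=("elastic", "changeme"),  # placeholder credentials
)
print(resp.json())
```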
We solved other field issues with a migration pipeline, which may also solve issue #2, but I have not looked into that just yet. That error appeared to describe a source-cluster problem, which seemed odd, but it's moot until we can solve issue #1.
Issue #1, however, is presenting quite a problem. I attempted to run an _update_by_query call (sketched below) with a script to find large messages in the data and add a new boolean field called "oversized" to each document, so we could skip the oversized documents during reindexing. But it threw document version conflict errors even though I was running it on a dead index for testing purposes, so I abandoned that idea. It was also taking a long time to process.
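For reference, the call looked roughly like this. The index name and size threshold are placeholders; I believe adding conflicts=proceed would make it skip the version conflicts instead of aborting, but I haven't retested with that:

```python
# Rough sketch of the _update_by_query flagging approach described above.
# Index name and threshold are placeholders.
import requests

ES = "http://localhost:9200"   # placeholder
INDEX = "my-index"             # placeholder

flag_body = {
    "script": {
        "lang": "painless",
        # crude size check: stringify the source and measure its length
        "source": """
            if (ctx._source.toString().length() > params.max_chars) {
              ctx._source.oversized = true;
            } else {
              ctx.op = 'noop';
            }
        """,
        "params": {"max_chars": 1000000},
    },
}

resp = requests.post(
    f"{ES}/{INDEX}/_update_by_query?conflicts=proceed&wait_for_completion=false",
    json=flag_body,
)
print(resp.json())
```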
We shifted to an external approach of running scrolls on one index. Testing on a smaller index (180K documents) was fine, but when we moved to a larger index (550K docs), the cluster began hitting garbage collection timeouts and ground to a halt.
The logic here was: get 10K documents with a scroll ID, process those docs (eventually we intend to put some of them back after reducing their size), get the next 10K docs with the scroll ID, process those, repeat, and finally delete the scroll after the last call (sketch below). The first version of this didn't put anything back into Elasticsearch; it just looked each document over, found problem areas, and stored some metrics so we could identify where fixes are needed. We've disabled the writes until we get this part working.
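Here's a trimmed-down sketch of that loop; the URL and index name are placeholders, and the processing step is stubbed out:

```python
# Trimmed-down sketch of the scroll loop described above.
# URL and index name are placeholders; processing is stubbed out.
import requests

ES = "http://localhost:9200"   # placeholder
INDEX = "my-index"             # placeholder
PAGE_SIZE = 10000

# open the scroll
resp = requests.post(
    f"{ES}/{INDEX}/_search?scroll=5m",
    json={"size": PAGE_SIZE, "query": {"match_all": {}}, "sort": ["_doc"]},
).json()
scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

while hits:
    for doc in hits:
        # inspect the doc, record metrics on oversized fields, etc.
        # (write-back is currently disabled)
        pass

    # fetch the next page with the same scroll id
    resp = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "5m", "scroll_id": scroll_id},
    ).json()
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

# clean up the scroll context when done
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": scroll_id})
```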
So now it seems we can't update every message via an external call using scroll, because it causes too much memory pressure. I'm not clear on why the memory pressure seemed to get worse with each scroll request we made. With the first batch, we sent about 7 scroll requests (one scroll ID) and had no noticeable issues, though of course that batch had a smaller footprint overall. With the second batch, we made it to about 32 scroll requests before Elasticsearch hung and stopped responding because it was spending more time in GC than processing data. It took about 5 minutes to recover.
Does scroll set aside memory at the very beginning for the entire scrolled result set, or does it use more memory with each scroll request (10K docs at a time)? It appears to be the latter based on my observation, because the first 32 scroll requests came back fine over several minutes, and then the memory issues started.
My thought here is that we could reduce the scroll window so that a full scroll always covers closer to 100K documents, and maybe sleep between batches to let Elasticsearch recover its memory (rough sketch below). But we have hundreds of millions of documents to process, so I'm not sure that's the best way forward.
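Concretely, I was picturing something like this layered on the loop above; the timestamp field, window size, and sleep time are all guesses:

```python
# Variation on the scroll loop above: limit each full scroll to a narrower
# slice (here a range on a hypothetical @timestamp field) and pause between
# pages. Field name, window size, and sleep time are all guesses.
import time

windowed_query = {
    "size": 10000,
    "sort": ["_doc"],
    "query": {
        "range": {
            "@timestamp": {            # hypothetical timestamp field
                "gte": "2024-01-01",
                "lt": "2024-01-02",    # pick windows of roughly ~100K docs
            }
        }
    },
}

# ...open the scroll with windowed_query as before, then between pages:
time.sleep(5)  # give the cluster a few seconds to recover between pages
```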
As a last resort, I suppose we may have to change the input data source to enforce small messages and let that run for our retention period before we migrate, just so we know we won't hit the 100 MB buffer issue during migration. Unfortunately, some of our documents came out to over 15 MB during my research, though they should typically be under 1 MB.
Does anyone know of a way to retrieve the size of a document efficiently? My approach was the update_by_query script above: convert the document to a string and measure its length. I couldn't find any info on a better way to accomplish this.
Thanks for any suggestions or alternative approaches I have not considered.