Root cause of old gen memory use during bulk reindexing

Justin_Treher · June 25, 2018, 2:40pm

Normally, we have zero old gen collections. The exception is bulk upserts of our entire index, refreshing product data from our DB to Elasticsearch. While this is not necessarily a problem now, it gives me "pause." Is this due to using the default refresh interval, which causes fielddata to reload constantly? We have about 150 MB of fielddata (5 GB JVM). While ingesting upserts, these nodes are also receiving search traffic. Even if we moved our analyzed strings over to doc_values, I assume that the global ordinals would still be in memory. Because fielddata is immutable, I assume that after every refresh, the fielddata would be invalidated and dereferenced in memory. Or is it only after a segment merge?

Is there anything else about indexing that would cause old gen to rapidly fill with dereferenced objects, such that old gen both fills up and is collected successfully (so not a leak)? Is there anything about segment merging that might use memory?

My current thought is that, during the bulk reindex period, we should change the index settings to lengthen the refresh interval. This should keep fielddata from having to reload (assuming it is the culprit). After bulk indexing, we would set it back to the default for one-off document updates, as recommended for indexing performance.
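A minimal sketch of the settings change described above, assuming the standard update index settings endpoint (`PUT /<index>/_settings`) and the `index.refresh_interval` setting. The host, the index name `products`, and the `30s` value are hypothetical and do not come from this thread; the helper only builds the requests, and the calls that would send them are left commented out.

```python
import json
import urllib.request


def refresh_settings_request(base_url, index, interval):
    """Build (but do not send) a PUT request that updates index.refresh_interval.

    Passing interval=None sends a JSON null, which resets the setting
    to the Elasticsearch default.
    """
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, index name, and interval; adjust for a real cluster.
before_bulk = refresh_settings_request("http://localhost:9200", "products", "30s")
after_bulk = refresh_settings_request("http://localhost:9200", "products", None)

# urllib.request.urlopen(before_bulk)   # lengthen the interval before the bulk upserts
# ... run the bulk upserts ...
# urllib.request.urlopen(after_bulk)    # restore the default afterwards
```

Because these nodes also serve search traffic, a longer interval rather than disabling refresh entirely (`-1`) keeps updated documents becoming visible, only with more delay.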

loren · June 25, 2018, 6:53pm

Apologies for asking a different question instead of just answering yours, but this caught my eye: "bulk upserts of our entire index." Have you considered just building a fresh copy of the index and switching over to it, versus updating records in place? I have a similar situation and found this to be more performant; it also makes it easier to change mappings, experiment with shard counts, etc. Heavy updates, in my experience, put a strain on everything: CPU, RAM, and disk IO.
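The post does not say how the switch is done; one common mechanism is an index alias that search traffic queries, repointed in a single `POST /_aliases` call. The sketch below only builds that request body. The alias name `products` and the index names `products_v1` and `products_v2` are hypothetical.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints an alias.

    Both actions go in one request, so the alias moves from the old
    index to the freshly built one in a single step.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: search clients query "products", never the versioned indices.
swap = alias_swap_body("products", "products_v1", "products_v2")
print(json.dumps(swap, indent=2))
```

With this layout, the new index can be built with different mappings or shard counts while the old one keeps serving searches until the swap.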

Justin_Treher · June 25, 2018, 7:33pm

@loren Historically we always reindexed that way; now we do it only a couple of times a week. I think we are going to switch back to doing it every time because, as you mentioned, performance is better and ES doesn't have to worry about finding the doc by id, marking it for deletion, and inserting the new one.

The benefit to "in place" upserting for us was that we could start having fresh data right away for those docs popped off the queue first rather than waiting for the whole shebang to finish and swapping.

system · July 23, 2018, 7:33pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
