Root cause of old gen memory use during bulk reindexing

(Justin Treher) #1

Normally, we have zero old gen collections. The exception is during bulk upserts of our entire index, when we refresh product data from our DB into Elasticsearch. While this is not necessarily a problem now, it gives me pause. Is this due to using the default refresh interval and having fielddata constantly reload? We have about 150MB of fielddata (5GB JVM heap). While ingesting upserts, these nodes are also receiving search traffic. Even if we moved our analyzed strings over to doc_values, I assume the global ordinals would still be held in memory. Since fielddata is immutable, I assume that after every refresh the fielddata would be invalidated and dereferenced in memory. Or does that only happen after a segment merge?

Is there anything else regarding indexing that would cause old gen to rapidly fill with dereferenced objects such that old gen collection is both filling up and collected successfully (not a leak)? Is there anything about segment merging that might use memory?

My current thought is that, during the bulk reindex period, we should change the index settings to extend the refresh interval. This should keep fielddata from having to reload (assuming it is the culprit). After bulk indexing, we would set it back to the default for one-off document updates, as recommended for indexing performance.
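For reference, a sketch of that settings change via the index settings API (assuming a hypothetical index named `products` on `localhost:9200`):

```shell
# Disable automatic refresh on the index before the bulk run
curl -X PUT "localhost:9200/products/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "-1"}}'

# ... run the bulk upserts ...

# Setting the value to null resets it to the default (1s)
curl -X PUT "localhost:9200/products/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": null}}'
```

A less drastic option is a longer interval like `"60s"` instead of `-1`, so searchers still see reasonably fresh data during the run.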

(Loren Siebert) #2

Apologies for asking a different question instead of just answering yours, but this caught my eye: "bulk upserts of our entire index." Have you considered building a fresh copy of the index and switching over to it, versus updating records in place? I have a similar situation and found this to be more performant; it also makes it easier to change mappings, experiment with shard counts, etc. Heavy updates, in my experience, put a strain on everything: CPU, RAM, and disk IO.
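The switch-over described above is typically done with an index alias, so searches never see a half-built index. A minimal sketch, assuming hypothetical indices `products_v1`/`products_v2` behind an alias `products`:

```shell
# Build products_v2 offline, then atomically repoint the "products" alias
curl -X POST "localhost:9200/_aliases" \
  -H 'Content-Type: application/json' \
  -d '{
    "actions": [
      {"remove": {"index": "products_v1", "alias": "products"}},
      {"add":    {"index": "products_v2", "alias": "products"}}
    ]
  }'

# Once traffic is confirmed healthy on products_v2, drop the old index
curl -X DELETE "localhost:9200/products_v1"
```

Because both alias actions happen in a single `_aliases` call, there is no window where queries hit zero or two copies of the data.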

(Justin Treher) #3

@loren Historically we always reindexed that way; now we do it only a couple of times a week. I think we are going to switch back to doing it every time because, as you mentioned, performance is better and ES doesn't have to find each doc by id, mark it for deletion, and insert the new one.

The benefit of in-place upserting for us was that we got fresh data right away for the docs popped off the queue first, rather than waiting for the whole shebang to finish before swapping.

(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.