A single large bulk import operation results in high merge sizes but a relatively small index

I'm trying to understand why I'm seeing high memory usage after importing a large number of documents into my cluster.

I'm running ES 1.4.4 on EC2 m3.xlarge instances, so each machine has about 8GB of committed memory, and swapping is disabled. Starting from a node with no data, I run my application to import a large number of documents, each of which in turn has a large number of nested documents. The total is 3.9 million documents, of which only 21K are non-nested, so the vast majority are nested. The end result of the import is an index size of 622 MB. I use the BulkProcessor with its default values to throttle the incoming bulk requests. In terms of memory usage, I start at 100MB of heap, but the import spikes it up to 4.0GB+ and it stays there for a while. The node holding the primary shard went back down to 1GB, while the replica still holds at 4GB. I've also enabled routing, so all the data hits only those 2 nodes.
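For reference, the throttling the BulkProcessor provides is essentially bounded batching. Here's a minimal Python sketch of that idea; the 1000-action batch size mirrors the client's default, and everything else here (names, structure) is just illustrative:

```python
def chunk_actions(actions, max_actions=1000):
    """Group an iterable of bulk actions into bounded batches.

    Mirrors the basic BulkProcessor behavior: flush a bulk request
    once max_actions actions have accumulated, then start a new batch.
    """
    batch = []
    for action in actions:
        batch.append(action)
        if len(batch) >= max_actions:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

With 3.9 million documents and the defaults, this would translate to roughly 3,900 bulk requests rather than one enormous one.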

Looking at the stats, I do see very large merge sizes of 14.8GB and 13.5GB as a result of that one operation. The HQ plugin also tells me my I/O is slow: warnings were issued on refresh, flush, and documents deleted. Besides that, everything is pretty much at default settings.

It seems the issue for me is merging, and possibly the large number of nested documents is causing the large number of merges. The documentation suggests that merge throttling should be enabled by default, but from what I can tell it's not: the site plugins I'm using report that my index is not throttled. The two issues that concern me are the large memory spike needed to ingest the documents and the failure to reclaim that memory afterwards; I'm worried I may be hitting a memory leak in my situation.
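If the throttle really is off, it can be set dynamically in ES 1.x via the index settings API. Assuming the setting names from the 1.x docs (the index name and the 20mb rate are just placeholders; 20mb/s is the documented default when throttling is on), the request would look something like:

```json
PUT /my_index/_settings
{
  "index.store.throttle.type": "merge",
  "index.store.throttle.max_bytes_per_sec": "20mb"
}
```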

Should I apply new settings to throttle differently, or throttle more aggressively by pausing to give the nodes time to ingest the data and complete any in-flight merges?

Is all this actually causing problems, or do you just want to reduce the numbers? It's not really clear.

It does cause problems eventually, because my node goes down. I'd like to get the numbers down, or at least understand what the peak usage will be, so I can allocate more memory if necessary and stabilize the node.

One thing I don't understand, which hopefully someone can explain, is when ES decides that merging no longer needs to happen. For example, I reran that bulk insert scenario and I'm seeing interesting results. After indexing 3.9 million documents, memory usage is fairly moderate, about 1GB. I check the segments and everything is flushed. I leave the node alone with no activity (no reads or writes) and just let it sit there. I then see the heap going up and start to observe merge activity. Merging happens periodically, and I let it sit idle for 24+ hours. So I'm assuming ES is merging what is stored in memory and the translog to disk via segments. During the merging I do observe ES writing new, smaller segments (using SegmentSpy and Whatson) and then merging those. So I check what is left to write segments for, but the translog size is small to none, and so is the indexing buffer. I tried flushing, but merging still occurs hours later.

I'm now at the point where I really want to understand where to find how much data remains for ES to push into new segments (which will eventually get merged), in the hope that I can work out how to optimize the merge policy for my situation.
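The translog, indexing-buffer, and segment numbers described above are all visible through the stats APIs. Assuming a node reachable on localhost:9200 and an index named my_index (both placeholders), something like:

```shell
# Per-index merge and translog stats (current merges, translog size)
curl 'localhost:9200/my_index/_stats/merge,translog?pretty'

# Segment-level view: each segment's size and whether it is committed/searchable
curl 'localhost:9200/my_index/_segments?pretty'

# Node-level indices stats, including segment memory
curl 'localhost:9200/_nodes/stats/indices?pretty'
```

If the translog size and indexing buffer are near zero but merges keep appearing, the merges are being triggered by the merge policy rebalancing existing segments rather than by new data waiting to be flushed.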

ES merges if there are documents being indexed; are you sure nothing is being added or changed?

Yeah, positive. The document count is always the same. I've stopped all traffic from hitting my cluster.

So I found one reason for the constant merging. During ingestion, there are requests to partially update the same document. If the partial update request comes first, it automatically creates the initial version 1 of the document. If an index request comes in afterwards, it does an upsert on the document (upsert is actually the default). If the partial update request comes in afterwards, it updates that portion of the document. I set the retry option to 5 for partial update requests to ensure they won't fail.
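For context, a partial update with the upsert-on-missing behavior and the retry setting described above looks roughly like this against the 1.x update API (index, type, id, and the field being updated are all placeholders):

```json
POST /my_index/my_type/1/_update?retry_on_conflict=5
{
  "doc": { "status": "processed" },
  "doc_as_upsert": true
}
```

With `doc_as_upsert`, the partial document becomes the initial document if the id doesn't exist yet; otherwise it is merged into the existing one. Note that every such update marks the old version of the document as deleted, which is what feeds the merge activity.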

I then filtered out all the partial update requests and allowed only the index requests. This dramatically dropped the deleted-document count to near zero. In contrast, with partial update requests the deleted-document count was very high; the line graph in Whatson made this apparent.

I'm wondering, then, whether my ingestion process exposes an underlying performance issue in ES 1.4.4 (at least for partial document updates), or whether I'm doing something wrong in my approach to updating the same document from multiple feeds.

Don't mix inserts (creations) with upserts (modifications). Do all inserts first, then change documents afterwards. The idea is to concentrate all deletions at one point in the run, or nearby, so they land in a single segment or just a few. If deletions are spread over the whole indexing procedure, they create a vast number of extra segments that have to be merged. There is no "performance issue"; that's how the inverted-index building algorithm works.
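The two-phase ordering suggested above can be sketched like this (the send callbacks are hypothetical stand-ins for whatever bulk client you use):

```python
def two_phase_ingest(full_docs, partial_updates, send_index, send_update):
    """Index all full documents first, then apply partial updates.

    Keeping the phases separate concentrates the deletions caused by
    updates at the end of the run, so they land in a few segments
    instead of being scattered across every segment written during
    the initial indexing.
    """
    for doc in full_docs:            # phase 1: pure inserts, no deletions
        send_index(doc)
    for update in partial_updates:   # phase 2: each update marks an old doc deleted
        send_update(update)
```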

Yeah. However, an upsert where I replace the entire document is OK; I reran my upsert logic and that behaves fine. It's the partial update requests that cause major issues on my heap. When I first inserted all the documents at once and then issued a lot of partial update requests afterwards, the heap started to jump up and the merge sizes jumped up.

It seems like I shouldn't even be using partial updates if they span potentially thousands of documents. As soon as I enabled partial-update ingestion, my heap usage went from 1 GB to 6-7 GB within 10 minutes.

FYI: OK, so I tested version 1.4.4 vs. 1.7.2. Version 1.7.2 handles it much better than 1.4.4.

I'm assuming this is the issue that I was seeing in my version of ES.

It seems 1.4.5 includes the Lucene bug-fix release that 1.4.4 doesn't have. Merge sizes are still the same, but heap usage is much better and doesn't grow like crazy the way it did before.