I'm trying to understand why I'm seeing high memory usage after importing a large number of documents into my cluster.
I'm running ES 1.4.4 on EC2 m3.xlarge instances, so each machine has about 8GB of committed memory. Swapping is also disabled. Starting from a node with no data, I run my application to import a large number of documents, each of which contains a large number of nested documents. The total is 3.9 million documents, of which only 21K are top-level (non-nested), so it's a lot of nested documents. The import results in an index size of 622 MB. I use the bulk processor with its default values to throttle the number of incoming bulk requests. In terms of memory, I start at 100MB of heap usage, but the import spikes it up past 4.0GB, and it stays there for a while. The node holding the primary shard eventually went back down to 1GB, while the replica is still holding at 4GB. I also enabled routing, so all the data hits only those two nodes.
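For context, the heap numbers above come from watching the nodes stats API; this is roughly how I'm checking (the localhost host/port is just an assumption, adjust for your cluster):

```shell
# Heap used vs. committed per node, via the ES 1.x nodes stats API
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
```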
Looking at the stats, I do see very large merge sizes, 14.8GB and 13.5GB, as a result of that one operation. The HQ plugin also tells me my IO is slow; warnings were issued for refresh, flush, and deleted documents. Besides that, everything is pretty much at default settings.
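The merge and segment figures come from the index stats API; the index name below is a placeholder for mine:

```shell
# Merge totals and segment memory for one index (ES 1.x indices stats API)
curl -s 'localhost:9200/my_index/_stats/merge,segments?pretty'
```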
It seems the issue for me is merging, and the large number of nested documents is possibly what's causing the large number of merges. The documentation suggests merge throttling should be enabled by default, but from what I can tell it isn't: the site plugins I'm using report that my index is not throttled. The two things that concern me are the large spike in memory needed to ingest the documents, and the lack of reclaiming that memory afterwards, which makes me wonder whether I'm hitting a memory leak.
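If it helps, this is the kind of dynamic cluster setting I'd expect to control merge throttling on 1.x; the values shown are the documented defaults, so treat it as a sketch of what I'd re-assert rather than something I've already applied:

```shell
# Re-assert store throttling for merges (these values are the ES 1.x defaults)
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "20mb"
  }
}'
```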
Should I apply new settings to throttle differently, or throttle more aggressively by pausing to give the nodes time to ingest the data and complete any merges they're in the middle of?