New 1.7.5 docker-based image, 1000 documents causing OOM?

I'm setting up a new development environment using the elasticsearch:1.7.5 Docker image. To import data from production, I'm using bulk writes, retrieving 1000 documents at a time.

The problem is this: I can get about 1000 documents in before ES fails with a Java OOM error. As I understand it, the Docker image has 2GB of memory by default, and the default JVM heap setting for ES is 1GB. So, that should be good.

The documents I'm importing vary in size. There are a few large ones, and many small ones. I've taken to writing the bulk in batches of 10 documents (pretty small), which helps but takes forever. And it can still hit an OOM error even going this slowly.

In the node stats, the heap used shows 84%. There is 0 field_data in the node. The total stored is around 500mb.

I'm sure there's something about how we have done our mapping.. that makes this painful. But, OOM'ing out on less than 1000 documents just seems wrong.

Am I doing something wrong here?

Here is a link to the mapping.. and the log files.

I get that this is a complex document. But this is a machine with 2GB RAM, 100% dedicated to ES. Should 1k to 2k documents really cause it to OOM?

By the way, I report if any document is greater than 1,000,000 bytes. None of them are.

I'm at a loss at this point. I get the usual "you need more RAM or more nodes" advice.. but for 2k documents?

[cluster.routing.allocation.decider] [audienti-dev-1] high disk watermark [90%] exceeded on

That suggests you're running out of disk space.
Read more here.

What is the maximum and average document size? How many indices and shards do you have?

Hi AJ_2.. there is about 32 Gig of disk space left.. this is on a 400 Gig disk, so I don't think it's that.. Here is the initial message on cluster boot.

(]], net usable_space [29.1gb], net total_space [446.3gb], types [nfs]

So, no document is greater than 1MB: the client writes a log notice for any document larger than 1MB, and none has appeared. The average size is probably 300k. This is a development server, so I believe it has 1 shard and 1 index. This is a brand new, freshly booted container with no data before I start the process that dies 1000 to 2000 documents in.

A batch of 1000 documents at an average size of 300kB is around 300MB, so I am not surprised that you are experiencing OOMs, given that you have a small 1GB heap. With documents that size, you will need to send considerably smaller batches.
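One way to keep request size under control when documents vary this much is to cap each bulk batch by bytes rather than by document count. The following is only a minimal sketch: the helper name `batches_by_bytes` and the 5 MB default cap are assumptions for illustration, not values from this thread or from the Elasticsearch documentation, and the serialized JSON length is used as a rough proxy for what the request will occupy on the wire.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def batches_by_bytes(
    docs: Iterable[Dict[str, Any]],
    max_bytes: int = 5 * 1024 * 1024,  # assumed cap, tune for your heap
    max_docs: int = 1000,
) -> Iterator[List[Dict[str, Any]]]:
    """Group documents into batches limited by total serialized size."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch if adding this document would exceed a limit.
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_docs):
            yield batch
            batch, batch_bytes = [], 0
        # A single document larger than max_bytes still gets a batch of its own.
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

With a cap like this, a run of small documents still goes out in large batches, while a few 300kB documents in a row trigger a flush early instead of piling up into one 300MB request.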

Thanks Christian..

So, help me out here. This is a clean, new index. If the "data stored" is 300MB, that would mean I would need 3x the RAM in the Java heap space to avoid an OOM as a general rule?

I understand from my dealings with ES that basically you have to have everything in RAM, and have enough RAM to cover the ES working/operation. But, 3x?

Or is this a special case? I'm getting to the point with ES that I'm wondering if it's the right solution (after 2+ years and ES in production at our site), because it seems that to get a stable, reliable platform I'm going to have to throw so much hardware at ES to make it happy....

Our production environment is 5 servers on AWS with 64GB RAM.. we have about 800 GB of data.. and they are painfully slow... and we get frequent OOM errors.. if I have to work at 3x RAM.. that would be 800*3 or 2.4TB of RAM... or 38 servers... to handle 800GB of data.

That just seems out of whack for the data stored. At that rate I could store it all in RAM many times over.

Are there any general rules here? Because with a fresh index and a data size of a third of the heap allocation, it doesn't seem "obvious" that I'd get OOM errors.

I would recommend testing different bulk sizes to find the optimal one. Start small and gradually increase until you see no further improvement in indexing throughput. That is typically the ideal bulk size. While you are doing this, monitor the system and try to identify what is limiting performance. Is it network speed, CPU or perhaps disk I/O? Also check the logs for signs of extensive garbage collection or any other errors.
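The tuning procedure described above can be sketched roughly as follows. This is a hedged illustration, not a prescribed method: `send_bulk` is a hypothetical callable standing in for whatever your client uses to submit one batch, and the 5% `min_gain` threshold for "no further improvement" is an arbitrary assumption.

```python
import time
from typing import Any, Callable, List, Sequence, Tuple


def measure_rate(
    send_bulk: Callable[[Sequence[Any]], Any],  # hypothetical: submits one batch
    docs: Sequence[Any],
    bulk_size: int,
) -> float:
    """Index docs in batches of bulk_size and return documents per second."""
    start = time.perf_counter()
    for i in range(0, len(docs), bulk_size):
        send_bulk(docs[i:i + bulk_size])
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return len(docs) / elapsed


def pick_bulk_size(
    measurements: List[Tuple[int, float]],
    min_gain: float = 0.05,  # assumed threshold for a meaningful improvement
) -> int:
    """Return the last size that still improved throughput by min_gain.

    measurements is a list of (bulk_size, docs_per_second) pairs,
    ordered from the smallest bulk size to the largest.
    """
    best_size, best_rate = measurements[0]
    for size, rate in measurements[1:]:
        if rate < best_rate * (1 + min_gain):
            break  # throughput has flattened out or dropped
        best_size, best_rate = size, rate
    return best_size
```

For example, measuring sizes 10, 20, 40 and 80 against a sample of your real documents and feeding the results to `pick_bulk_size` gives a starting point. Watching heap usage and GC logs during each run matters just as much as the throughput number.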

You can also try sending multiple bulk requests in parallel to see if that improves indexing throughput. Start with a single connection and slowly increase until no further improvement in indexing throughput can be observed.
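A minimal sketch of sending batches over several connections at once, using only the Python standard library. Again, `send_bulk` is a hypothetical callable that submits one batch and returns its result, and the default of 2 workers is just an assumed starting point to increase gradually.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Sequence


def send_in_parallel(
    batches: Iterable[Sequence[Any]],
    send_bulk: Callable[[Sequence[Any]], Any],  # hypothetical: submits one batch
    workers: int = 2,
) -> List[Any]:
    """Submit batches with up to `workers` requests in flight at a time.

    Results come back in the same order as the input batches.
    Note that pool.map reads all batches up front, so very large
    inputs should be fed in chunks to keep client memory bounded.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_bulk, batches))
```

Keep in mind that every concurrent request has to fit in the heap at the same time, so on a 1GB heap more connections generally means each batch needs to be smaller.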