New 1.7.5 docker-based image, 1000 documents causing OOM?

I'm using the elasticsearch:1.7.5 Docker image to set up a new development environment. To import data from production, I'm using bulk writes, retrieving 1000 documents at a time.

The problem is this: I can get about 1000 documents in, and then ES fails with a Java OOM error. As I understand it, the Docker image has 2GB of memory by default, and the default JVM heap for ES is 1GB, so that should be fine.

The documents I'm importing vary in size: there are a few large ones and many small ones. I've taken to writing the bulk in batches of 10 documents (pretty small), which helps but takes forever. Even going this slow, it can still hit an OOM error.

In the node stats, heap used shows 84%, there is 0 field_data on the node, and the total stored is around 500MB.
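Those numbers come from the node stats API; something like the following sketch pulls them out (localhost:9200 is an assumption about the dev container, not my exact script):

```python
# Rough sketch: read heap usage, fielddata and store size from node stats.
# The localhost:9200 address is an assumption about the dev container.
import requests

stats = requests.get("http://localhost:9200/_nodes/stats/jvm,indices").json()

for node_id, node in stats["nodes"].items():
    print(node["name"],
          "heap used %:", node["jvm"]["mem"]["heap_used_percent"],
          "fielddata bytes:", node["indices"]["fielddata"]["memory_size_in_bytes"],
          "store bytes:", node["indices"]["store"]["size_in_bytes"])
```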

I'm sure there's something about how we've done our mapping that makes this painful. But OOM'ing out on fewer than 1000 documents just seems wrong.

Am I doing something wrong here?

Here is a link to the mapping.. and the log files.

I get that this is a complex document. But this is a machine with 2GB RAM, 100% dedicated to ES. Would 1k to 2k documents really cause it to OOM?

By the way, I log a notice if any document is greater than 1,000,000 bytes. None of them are.

I'm at a loss at this point. I get the usual "you need more RAM or more nodes" advice, but for 2k documents?

[cluster.routing.allocation.decider] [audienti-dev-1] high disk watermark [90%] exceeded on

That suggests you're running out of disk space.
Read more here.
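If disk space does turn out to be the culprit, the disk-based allocation watermarks can be inspected and adjusted through the cluster settings API. A rough sketch (the host/port and the 95% value are placeholders, not recommendations):

```python
# Sketch: show changed cluster settings and, if needed, temporarily raise
# the high disk watermark above its 90% default while debugging.
import requests

# Settings that have been changed from their defaults.
print(requests.get("http://localhost:9200/_cluster/settings").json())

# Placeholder value; raise only while investigating, then revert.
requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.disk.watermark.high": "95%"}},
)
```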

What is the maximum and average document size? How many indices and shards do you have?
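The index and shard counts are easy to check with the cat APIs; a quick sketch (localhost:9200 assumed):

```python
# Quick sketch: list indices and shards with document counts and store sizes.
import requests

print(requests.get("http://localhost:9200/_cat/indices?v").text)
print(requests.get("http://localhost:9200/_cat/shards?v").text)
```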

Hi AJ_2, there is about 32GB of disk space left on a 400GB disk, so I don't think it's that. Here is the initial message on cluster boot:

(192.168.64.1:/Users/wflanagan/containers/audienti/data/elasticsearch/data)]], net usable_space [29.1gb], net total_space [446.3gb], types [nfs]

So, no document is greater than 1MB; in the client there is a log that writes a notice of any document larger than 1MB. The average size is probably 300KB. This is a development server, so I believe it has 1 index and 1 shard. This is a brand new, freshly booted container with no data before I start the process, which dies 1000 to 2000 documents in.

A batch of 1000 documents at an average size of 300kB is around 300MB, so I am not surprised that you are experiencing OOMs, given that you have a small 1GB heap. With documents that size, you will need to send considerably smaller batches.
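One way to act on that is to cap each bulk request by payload size rather than by document count, so a handful of large documents can't inflate a single request. A rough sketch (the index/type names and the 5MB cap are illustrative assumptions, not recommendations):

```python
# Sketch: build bulk requests capped by approximate payload size instead
# of a fixed document count.
import json
import requests

MAX_BATCH_BYTES = 5 * 1024 * 1024  # cap each bulk body at roughly 5 MB

def send_bulk(lines):
    # The bulk API expects newline-delimited JSON with a trailing newline.
    resp = requests.post("http://localhost:9200/_bulk", data="\n".join(lines) + "\n")
    resp.raise_for_status()

def bulk_index(docs):
    lines, size = [], 0
    for doc_id, doc in docs:
        action = json.dumps({"index": {"_index": "myindex", "_type": "doc", "_id": doc_id}})
        source = json.dumps(doc)
        if size and size + len(action) + len(source) > MAX_BATCH_BYTES:
            send_bulk(lines)
            lines, size = [], 0
        lines += [action, source]
        size += len(action) + len(source) + 2  # +2 for the newlines
    if lines:
        send_bulk(lines)
```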

Thanks Christian..

So, help me out here. This is a clean, new index. If the "data stored" is 300MB, does that mean that, as a general rule, I need 3x the data size in Java heap space to avoid an OOM?

I understand from my dealings with ES that you basically have to have everything in RAM, plus enough RAM to cover ES's working operations. But 3x?

Or is this a special case? I'm getting to the point with ES (after 2+ years of running it in production at our site) where I'm wondering if it's the right solution, because it seems that to get a stable, reliable platform I'm going to have to throw an enormous amount of hardware at ES to make it happy.

Our production environment is 5 servers on AWS with 64GB RAM each. We have about 800GB of data, the servers are painfully slow, and we get frequent OOM errors. If I have to work at 3x RAM, that would be 800 * 3, or 2.4TB of RAM, which is roughly 38 servers at 64GB each, all to handle 800GB of data.

That just seems out of whack for the amount of data stored. I could easily store the whole dataset in RAM many times over.

Are there any general rules here? With a fresh index holding data a third the size of the heap allocation, it doesn't seem "obvious" that I'd get OOM errors.

I would recommend testing different bulk sizes to find the optimal one. Start small and gradually increase until you see no further improvement in indexing throughput. That is typically the ideal bulk size. While you are doing this, monitor the system and try to identify what is limiting performance. Is it network speed, CPU or perhaps disk I/O? Also check the logs for signs of extensive garbage collection or any other errors.
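A loop like the following is roughly all that experiment needs; index_batch() is a stand-in for whatever bulk call your importer already makes (a sketch, not a benchmark harness):

```python
# Sketch of the "start small and increase" experiment: time each candidate
# bulk size and watch where throughput stops improving.
import time

def compare_bulk_sizes(docs, index_batch, sizes=(10, 50, 100, 250, 500)):
    for size in sizes:
        start = time.time()
        for i in range(0, len(docs), size):
            index_batch(docs[i:i + size])
        elapsed = time.time() - start
        print("bulk size %4d -> %6.1f docs/sec" % (size, len(docs) / elapsed))
```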

You can also try sending multiple bulk requests in parallel to see if that improves indexing throughput. Start with a single connection and slowly increase until no further improvement in indexing throughput can be observed.
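If you happen to be using the Python client, its bulk helpers make the parallel version fairly simple; a sketch (the index/type names, thread_count and chunk_size are assumptions to tune, not recommendations):

```python
# Hedged sketch using the elasticsearch-py bulk helpers.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch(["http://localhost:9200"])

def actions(docs):
    for doc_id, doc in docs:
        yield {"_index": "myindex", "_type": "doc", "_id": doc_id, "_source": doc}

def parallel_import(docs):
    # parallel_bulk yields a (success, info) tuple per action.
    for ok, info in parallel_bulk(es, actions(docs), thread_count=2, chunk_size=100):
        if not ok:
            print("failed:", info)
```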