OutOfMemory Exceptions during bulk insert

Hi, I'm having a problem when performing bulk inserts into Elasticsearch.
The short story: I'm inserting batches of about 100,000 documents (70 MB)
about once every 1.5 minutes, and after about 10 million documents have
been inserted I start getting OutOfMemory exceptions and ES becomes
unresponsive until I restart it.

Here's the longer version:

Background:

I'm running Elasticsearch 0.19.11 on Debian Squeeze with OpenJDK. Here's
the output from java -version:

java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.13) (6b18-1.8.13-0+squeeze2)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)

My elasticsearch.yml is here: https://gist.github.com/4318237
And my elasticsearch startup script is here:
https://gist.github.com/4318572

Problem Description:

I have some C# code which denormalizes some SQL data and inserts it into my
Elasticsearch instance using the Bulk API (via the NEST .NET Elasticsearch
client). My batches are around 100,000 documents per request, which is
about 70 MB of data. I insert around once every 1.5 minutes.

Everything seems fine to start with. My heap memory goes up and down in a
pattern I would expect (sorry, no screenshot of that) until I get to around
5 million documents inserted.

At this point, I start to get a lot of gc ConcurrentMarkSweep warnings in
the log. Here's a capture from my log when these warnings start to
appear: https://gist.github.com/4318175. The API is now lagging, taking
around 5 seconds to respond. Also, here's the output from the
hot_threads api: https://gist.github.com/4318193.

From then on, the time a ConcurrentMarkSweep gc takes increases constantly
along with the size of the heap. Here's another capture from the log
showing the increase in both heap size and gc duration:
https://gist.github.com/4318188. And here is the output from hot_threads
again: https://gist.github.com/4318197. Now the API is really unresponsive,
taking around 12s to respond (I guess this will always be relative to the
gc duration).

At this point, the heap looks to be increasing in size constantly and the
garbage collections seem to just reduce it a tiny amount. Here's what
bigdesk looked like at this point: http://tinypic.com/r/1611gtl/6. Here's
another bigdesk screenshot taken a little while later showing the heap
increasing: http://tinypic.com/view.php?pic=2z7o5tu&s=6.

This behaviour carries on until the heap size is at its limit and garbage
collections are taking > 20s. At this point the API is almost unresponsive
and I start getting OutOfMemory exceptions. Here's the log output at this
point: https://gist.github.com/4318255. Here's bigdesk at this point:
http://tinypic.com/r/nlnyms/6.

If I restart Elasticsearch then the heap goes back to normal size, the API
becomes responsive again and all the warnings stop until, of course, I've
inserted another 5 million through my batch inserter. Here's bigdesk after
I've done a restart of elasticsearch: http://tinypic.com/r/290sivc/6.

*The Question:*

What could be causing this behaviour? I've read this article (awesome
article btw):
http://jprante.github.com/2012/11/28/Elasticsearch-Java-Virtual-Machine-settings-explained.html
but due to my lack of knowledge about the JVM I'm not sure if this could be
a case of setting the ES_HEAP_SIZE too large (it's 6gb) or if it's
something else like my version of the OpenJDK?

Any thoughts greatly appreciated and if you need more info please ask.

Regards,

James

--

Two things:
Move to Java 7.
Try a batch size around 1,000 docs.
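
A minimal sketch of what smaller batches could look like on the client side, assuming the documents are available as an in-memory or streamed sequence. `BulkBatching` and `InBatches` are hypothetical names for illustration only and are not part of NEST; the actual bulk call is left as a comment because it depends on the client version in use.

```csharp
using System;
using System.Collections.Generic;

public static class BulkBatching
{
    // Hypothetical helper (not a NEST API): lazily splits a sequence into
    // consecutive batches of at most batchSize items. The final batch may
    // be smaller than batchSize.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        // Because this is an iterator, the check runs on first enumeration.
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }

        // Emit whatever is left over after the last full batch.
        if (batch.Count > 0) yield return batch;
    }
}
```

Each list yielded by `BulkBatching.InBatches(documents, 1000)` would then be sent as one bulk request, so only one batch's worth of documents needs to be held by the client at a time.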

--

I'll give Java 7 a go today. I did wonder if this would make a difference -
I've read there have been improvements to the garbage collection?

I'm reluctant to reduce the batch size though. I'm backfilling 45 million
documents and batches of 1,000 are just too slow for our requirements (we've
tried it).

Cheers,

James

--

I also had problems with slow performance and ES getting stuck during bulk
inserts with NEST. I was inserting records from MongoDB in .NET, and I
thought the problem was in the MongoDB driver code. I inserted 500 hundred
for (int NumSent = 0; NumSent < 100000; )
{
    ...
    // Fire off one asynchronous bulk request for the next slice of documents.
    client.IndexManyAsync(all.Documents.Skip(NumSent).Take(noToSend));
    NumSent += noToSend;
    ...
}

But when I use some custom code to generate data (like in NEST demos)
everything works ok.

--

What's your maximum async connection count? I noticed that you are using
IndexManyAsync and iterating over it a hundred thousand times; maybe you
are initiating a lot of non-blocking requests concurrently?
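
One way to rule that out is to cap the number of requests in flight on the client side. The following is only a sketch under stated assumptions: `ThrottledSender`, `SendAllAsync` and the `sendBatchAsync` delegate are hypothetical names, not NEST APIs, and the delegate stands in for whatever asynchronous bulk call the client actually exposes.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledSender
{
    // Hypothetical helper (not a NEST API): sends every batch through
    // sendBatchAsync while allowing at most maxInFlight sends to be
    // running at the same time.
    public static async Task SendAllAsync<T>(
        IEnumerable<T> batches,
        Func<T, Task> sendBatchAsync,
        int maxInFlight)
    {
        using (var gate = new SemaphoreSlim(maxInFlight))
        {
            var pending = new List<Task>();
            foreach (var batch in batches)
            {
                // Waits here once maxInFlight sends are already running.
                await gate.WaitAsync();
                pending.Add(SendOneAsync(batch, sendBatchAsync, gate));
            }

            // Wait for the remaining sends; any failure is rethrown here.
            await Task.WhenAll(pending);
        }
    }

    private static async Task SendOneAsync<T>(
        T batch, Func<T, Task> sendBatchAsync, SemaphoreSlim gate)
    {
        try
        {
            await sendBatchAsync(batch);
        }
        finally
        {
            // Free the slot whether the send succeeded or failed.
            gate.Release();
        }
    }
}
```

With this shape the loop itself slows down when the server falls behind, instead of queuing up an unbounded number of outstanding requests.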

--

I set the maximum async connections in my .NET client to 16. If you look at
the bigdesk plugin outputs, there are a huge number of threads running by
the time I get my OOM exception.

--

Just testing this with the Oracle 7 JDK - so far everything is running
smoothly. I'm up to 5.5 million docs, which is where I was having problems
previously. If this works I'm going to be over the moon!

James

--

Looks like it was the JVM. Now that we've switched to the Oracle 7 JDK we're
not seeing the same symptoms (currently at ~10 million docs as I write
this). I've not changed my code, my batch size or any Elasticsearch
setup. I just switched to Oracle and all is well.

Cool - Once all 45 million documents are in I'll update this thread with my
final thoughts.

James

--

OK - this was definitely the JVM in OpenJDK 6 that was causing the issue.
I haven't dug deep enough to be able to tell why, but installing Oracle 7 on
my box worked. I reran my test followed by my entire backfill with no
issues. I still had a few gc ConcurrentMarkSweep warnings where they took
around 6 seconds to run - but that's it. No errors / exceptions and the
API was responsive for the duration.

Cheers,

James

--