Multi-threaded Bulk Indexing

I'm currently working on a multi-threaded process where each thread
makes bulk index requests. I'm using 8 worker threads, each submitting
batches of 200 items at a time (items are around 10-15 KB). I keep
running into an OutOfMemory error on one of the Elasticsearch servers.
I have 3 servers, each running with a 4 GB heap, and I hit the
following error in under 80K items.

Here's the snippet from the logs:

http://xstevens.pastebin.mozilla.org/1360477

Here's the code I'm using for this as well:

https://github.com/mozilla-metrics/bagheera/blob/master/src/main/java/com/mozilla/bagheera/elasticsearch/ElasticSearchIndexer.java

Anyone have any ideas on what I can do to stop running into this? I
could always try a larger heap size, but my gut says that will only
delay the issue.
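For reference, the worker setup roughly looks like this (a minimal sketch against the 0.1x-era Java API with a node client; the cluster name, index name, and the nextBatch() source of JSON documents are placeholders, not the actual Bagheera code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class BulkIndexSketch {
    public static void main(String[] args) {
        // Join the cluster as a non-data client node ({client=true, data=false}).
        Node node = NodeBuilder.nodeBuilder().clusterName("my-cluster").client(true).node();
        final Client client = node.client();

        ExecutorService pool = Executors.newFixedThreadPool(8); // 8 worker threads
        for (int t = 0; t < 8; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    // One 200-item bulk request per submission; loop over batches in real code.
                    BulkRequestBuilder bulk = client.prepareBulk();
                    for (String json : nextBatch(200)) {
                        bulk.add(client.prepareIndex("my-index", "my-type").setSource(json));
                    }
                    BulkResponse response = bulk.execute().actionGet("60s");
                    if (response.hasFailures()) {
                        System.err.println("bulk had failures, took " + response.took());
                    }
                }
            });
        }
        pool.shutdown();
    }

    private static List<String> nextBatch(int size) {
        throw new UnsupportedOperationException("placeholder: pull documents from your source");
    }
}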

So I increased the heaps and started over. This time I ran into an
issue where a couple of the ES servers failed to connect back to the
client.

This repeatedly spit out errors such as:

[2011-10-20 11:40:30,068][WARN ][cluster.service ] [Phade] failed to reconnect to node [Kristoff Vernard][Z1j5sDdJRX2397jZrI43tA][inet[/X.X.X.X:9300]]{client=true, data=false}
org.elasticsearch.transport.ConnectTransportException: [Kristoff Vernard][inet[/X.X.X.X:9300]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:562)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:505)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:484)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:121)
    at org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes.run(InternalClusterService.java:325)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:384)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:354)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:276)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
    ... 3 more

Do I just need to increase the number of client threads, perhaps?

-Xavier

On Oct 20, 6:14 pm, Shay Banon kim...@gmail.com wrote:

That's strange... how much memory does the client have? I see that you
wait 60 seconds on the future to return; do you see failures on the
client side? Do you by any chance have a heap dump that I could access?

The exception you posted is caused by a failure to connect to the relevant
node (opening a socket to it).
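For what it's worth, checking for client-side failures around that 60-second wait looks roughly like this (a minimal sketch against the 0.1x-era Java API; the wrapper class and method names are illustrative, not from the Bagheera code):

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.common.unit.TimeValue;

class BulkFailureCheck {
    /** Waits up to 60 seconds on the bulk future and logs any per-item failures. */
    static boolean indexBatch(BulkRequestBuilder bulk) {
        BulkResponse response = bulk.execute().actionGet(TimeValue.timeValueSeconds(60));
        if (response.hasFailures()) {
            for (BulkItemResponse item : response.items()) {
                if (item.failed()) {
                    System.err.println("item " + item.id() + " failed: " + item.failureMessage());
                }
            }
            return false;
        }
        return true;
    }
}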

I don't have a heap dump for you. We're now trying to revert back to
my single-threaded version of the indexing class:

GitHub - mozilla-metrics/bagheera at a5de0978a590fb33ba5234429a41200bc78617da...

The problem I'm getting now is that batches of 100 items sometimes
take longer than 60 seconds to process. Here are a few lines showing
the timings:

2011-10-26 07:56:04,863 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Scanned hbase in roughly: 584 ms
2011-10-26 07:56:04,863 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Indexing batch size: 100
2011-10-26 07:56:08,808 INFO com.mozilla.bagheera.dao.ElasticSearchDao: BulkRequest took: 3815 ms
2011-10-26 07:56:08,896 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Scanned hbase in roughly: 88 ms
2011-10-26 07:56:08,896 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Indexing batch size: 100
2011-10-26 07:56:12,807 INFO com.mozilla.bagheera.dao.ElasticSearchDao: BulkRequest took: 3903 ms
2011-10-26 07:56:12,883 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Scanned hbase in roughly: 76 ms
2011-10-26 07:56:12,883 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Indexing batch size: 100
2011-10-26 07:56:13,624 INFO com.mozilla.bagheera.dao.ElasticSearchDao: BulkRequest took: 734 ms
2011-10-26 07:56:13,697 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Scanned hbase in roughly: 73 ms
2011-10-26 07:56:13,697 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Indexing batch size: 100
2011-10-26 07:56:14,451 INFO com.mozilla.bagheera.dao.ElasticSearchDao: BulkRequest took: 746 ms
2011-10-26 07:56:14,573 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Scanned hbase in roughly: 122 ms
2011-10-26 07:56:14,573 INFO com.mozilla.bagheera.elasticsearch.ElasticSearchIndexer: Indexing batch size: 100
2011-10-26 07:57:33,119 INFO com.mozilla.bagheera.dao.ElasticSearchDao: BulkRequest took: 78539 ms

Any ideas on why that is taking so long? I don't see much in the logs.

-Xavier

It seems to spend a large amount of time in:

org.elasticsearch.common.util.concurrent.AbstractFuture$Sync.get()

java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int)

This originates from under the (BulkRequest's)
AdapterActionFuture.actionGet() call.

On Oct 26, 1:43 pm, Shay Banon kim...@gmail.com wrote:

It might simply take more than 60 seconds to index the bulk request, I guess.
Where you see the time being spent is on the client side, where it waits for
a response, not on the server side, where the operation actually executes.
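If a single bulk really can take more than 60 seconds, the two obvious client-side options are to give actionGet() more headroom or to stop blocking the worker thread altogether. A rough sketch against the 0.1x-era Java API (class and method names are illustrative):

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;

class BulkWaitOptions {
    /** Option 1: keep blocking, but allow more than the hard 60-second cap. */
    static BulkResponse indexBlocking(BulkRequestBuilder bulk) {
        return bulk.execute().actionGet("5m");
    }

    /** Option 2: handle the response asynchronously instead of blocking the worker. */
    static void indexAsync(BulkRequestBuilder bulk) {
        bulk.execute(new ActionListener<BulkResponse>() {
            public void onResponse(BulkResponse response) {
                System.out.println("bulk of " + response.items().length + " items took " + response.took()
                        + (response.hasFailures() ? " (with failures)" : ""));
            }

            public void onFailure(Throwable t) {
                t.printStackTrace();
            }
        });
    }
}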

So, general patterns I'm seeing:

actionGet() takes longer on the first sets of documents. Probably due
to mapping?

actionGet() takes longer on the later sets of documents. Possibly due
to garbage collection pauses. I'm not sure what is using all of the
memory on the server side. Is it caching? Is there a way to disable
caching when doing bulk index runs, and do I need to?

FYI: we're running with 12GB heaps now, with 24GB total RAM per
machine, so I have room to go up if I have to.

-Xavier

On Wed, Oct 26, 2011 at 11:16 PM, Xavier xavier.stevens@gmail.com wrote:

> So, general patterns I'm seeing:

actionGet() is just waiting for the operation to execute, so it's really the
bulk execution that takes the time, not actionGet() itself...

> actionGet() takes longer on the first sets of documents. Probably due
> to mapping?

This might just be warm-up...

> actionGet() takes longer on the later sets of documents. Possibly due
> to garbage collection pauses. I'm not sure what is using all of the
> memory on the server side. Is it caching? Is there a way to disable
> caching when doing bulk index runs, and do I need to?

It does not sound like garbage collection, though it might be. The more you
index, the more work the index has to do to maintain itself (internal merging
and so on), and that adds up while indexing. It eventually reaches a
consistent state, but the initial indexing is not representative of that.
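As an aside (not something suggested in this thread): a common way to take some of that maintenance pressure off during a pure bulk load is to relax refresh and replication while loading and restore them afterwards. A rough sketch against the 0.1x-era Java admin client, assuming these settings are dynamically updatable in your version; the index name is a placeholder:

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

class BulkLoadSettings {
    /** Relax refresh/replication before a bulk load; put the old values back afterwards. */
    static void beforeBulkLoad(Client client) {
        client.admin().indices().prepareUpdateSettings("my-index")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", "-1")   // no periodic refresh while loading
                        .put("index.number_of_replicas", 0)    // re-add replicas after the load
                        .build())
                .execute().actionGet();
    }
}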
