Python vs Java bulk indexing

eunever32 · March 28, 2014, 10:16am

Hi,

When running the bulk indexing with Python everything works fine: good,
solid throughput for the full indexing run.

When doing the same with the Java API, thousands of client threads are
created (7000).

The server then stops indexing, and the client just hangs while reporting
direct buffer memory errors, i.e.

Exception: error [Direct buffer memory]

I also notice this in dmesg: possible SYN flooding on port 9300.
Sending cookies. (not sure if it is related)

I can't understand why ES is creating so many client threads, because I'm
using:

BulkResponse bulkResponse = bulkRequest.execute().actionGet();

which is synchronous, isn't it? So the ES threads should not exceed my
client threads?
I have tried both nodeClient and transportClient, with the same result.

Any help appreciated.

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1b8ec428-0e0b-4d33-8d02-0c9d50436d43%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

eunever32 · March 28, 2014, 1:35pm

If it's any help, this is the error when the threads start to hang:

2014-03-28 13:34:39,845 [elasticsearch[Cerberus][transport_client_worker][T#16]{New I/O worker #2832}] (Log4jESLogger.java:129) WARN org.elasticsearch.netty.channel.socket.nio.AbstractNioSelector - Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Bits.java:658)
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
    at org.elasticsearch.common.netty.channel.socket.nio.SocketReceiveBufferAllocator.newBuffer(SocketReceiveBufferAllocator.java:64)
    at org.elasticsearch.common.netty.channel.socket.nio.SocketReceiveBufferAllocator.get(SocketReceiveBufferAllocator.java:41)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:62)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)


brian_yoder · March 28, 2014, 2:44pm

When I use the Java TransportClient and the BulkRequest builder, my
throughput is like a scalded cat racing a bolt of greased lightning, with
the cat way ahead!

"the Java API" does not say how you are using it. Since I cannot see your
code, I cannot comment on where your mistake is located.

But I have noticed that a small number of folks use BulkRequestBuilder but
keep adding documents to it. Since BulkRequestBuilder is additive, their
first bulk load batch contains N documents, their second contains 2N (the
first batch plus the second batch), their third batch contains 3N, and so
on until they crash the JVM with OOM errors.

So if this is your mistake, then simply create a new BulkRequestBuilder for
each batch of documents to submit, and let the previous BulkRequestBuilder
get garbage collected, and your Java build will run lightning fast and
never run into memory or thread issues.
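
A minimal sketch of the growth described above, using only the Java standard library. FakeBulkBuilder and AdditiveBatchDemo are hypothetical stand-ins written for this illustration, not Elasticsearch classes; the only behaviour assumed is that add() accumulates and nothing clears the builder.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a bulk request builder: add() accumulates and nothing clears it.
// This is NOT the Elasticsearch class, only a stdlib model of the same behaviour.
final class FakeBulkBuilder {
    private final List<String> actions = new ArrayList<>();

    void add(String doc) {
        actions.add(doc);
    }

    int numberOfActions() {
        return actions.size();
    }
}

final class AdditiveBatchDemo {
    // Returns the size of each batch that would be sent.
    // reuseBuilder = true models the mistake (same builder for every batch);
    // reuseBuilder = false models creating a fresh builder per batch.
    static List<Integer> sentBatchSizes(int totalDocs, int chunkSize, boolean reuseBuilder) {
        List<Integer> sizes = new ArrayList<>();
        FakeBulkBuilder builder = new FakeBulkBuilder();
        int addedSinceLastSend = 0;
        for (int i = 0; i < totalDocs; i++) {
            builder.add("doc-" + i);
            addedSinceLastSend++;
            if (addedSinceLastSend >= chunkSize) {
                // "Send" the batch: record how many actions it carries.
                sizes.add(builder.numberOfActions());
                addedSinceLastSend = 0;
                if (!reuseBuilder) {
                    // The old builder becomes garbage; the next batch starts empty.
                    builder = new FakeBulkBuilder();
                }
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println("reused builder: " + sentBatchSizes(30, 10, true));  // [10, 20, 30]
        System.out.println("fresh builder:  " + sentBatchSizes(30, 10, false)); // [10, 10, 10]
    }
}
```

With a reused builder every request resends everything added so far, so the payload grows by N each time; with a fresh builder each request stays at N.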

If not, the problem is still in your Java code and not in ES. I have been
working with ES at the Java API level for over a year now. I cannot recall
any issue that I've had that was not my own fault (a few breaking changes
during release upgrades have given me some problems, but none that I
couldn't solve). ES has been remarkably rock solid, and for something as
elemental as bulk loading, it's the Rock of Gibraltar.

Hope this helps.

Brian


eunever32 · March 28, 2014, 4:53pm

You could be right. I can't test right now, but this is my code
(there may be 20 worker threads).
As you can see, after each thread submits a batch it calls
client.prepareBulk() again. Is that sufficient to clear out the documents?

workerThread() {
    Client client = getMyGlobalTransportClient();
    BulkRequestBuilder bulkRequest = client.prepareBulk();
    for (...) {
        bulkRequest.add(...);
        if (bulkRequest.numberOfActions() >= chunksize) {
            // actionGet() blocks this thread until the bulk response arrives
            BulkResponse bulkResponse = bulkRequest.execute().actionGet();
            if (bulkResponse.hasFailures()) {
                ...
            } else {
                ...
            }
            // start a fresh builder so the next batch begins empty
            bulkRequest = client.prepareBulk();
        }
    }
}

etc


jprante · March 28, 2014, 6:00pm

Your code has no precautions against overwhelming the cluster. 20 worker
threads that are not coordinated are a challenge.

I recommend the BulkProcessor class at
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/bulk/BulkProcessor.java

The SYN flood message is not related to ES. If you have opened your port to
public internet access, take care! Don't do it; the risk of a DoS attack is
too high.

Jörg


brian_yoder · March 28, 2014, 7:19pm

Yes, that is sufficient to clear out the documents. But... take the advice
given by Jörg to heart.

Elasticsearch is already optimized to process a bulk request as fast as it
can. There should not be more than one of them in flight at a time; no gain
will be seen, and (as you have seen) the results can be bad.

What you could do is use something like the LMAX Disruptor
(http://lmax-exchange.github.io/disruptor/) and set it up for multiple
producers and one handler thread (or worker thread, either one in this case).
Your own 20 (or whatever) worker threads should publish to the Disruptor's
ring buffer. The handler thread would then own the BulkRequestBuilder and
process incoming documents as you show in your code snippet.

Or do the same thing with a Java queue of some kind that your workers
store into and your one processor thread pulls from to do the bulk
request processing. I only recommend the Disruptor because it's an
incredibly awesome thing that is small and very easy to use; once you get
up to speed, it takes a few lines of code to pass through millions of
events per second (yes, you read that right; I've seen it for myself on my
little old laptop). Of course, ES won't keep up, but the Disruptor will not
add any perceptible latency to your processing. It's a thing of beauty and
joy, just like Elasticsearch is for search.
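
The plain-queue variant can be sketched with the standard library alone. This is only an illustration under stated assumptions: SingleSenderDemo, the POISON end marker, and the queue capacity of 1000 are made up for the example, and the point where a real sender would execute the bulk request is marked with a comment instead of calling Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class SingleSenderDemo {
    // Hypothetical end-of-input marker; each producer publishes it once when done.
    private static final String POISON = "__END__";

    // Starts `producers` worker threads that each publish `docsPerProducer` documents,
    // plus one sender thread that groups them into batches of at most `chunkSize`.
    // Returns the sizes of the batches the sender would have submitted.
    static List<Integer> run(int producers, int docsPerProducer, int chunkSize) {
        // Bounded queue: producers block in put() when the sender falls behind.
        final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        final List<Integer> batchSizes = new ArrayList<>();

        Thread sender = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            int finishedProducers = 0;
            try {
                while (finishedProducers < producers) {
                    String doc = queue.take();
                    if (POISON.equals(doc)) {
                        finishedProducers++;
                        continue;
                    }
                    batch.add(doc);
                    if (batch.size() >= chunkSize) {
                        // A real sender would build and execute one bulk request here.
                        batchSizes.add(batch.size());
                        batch = new ArrayList<>();
                    }
                }
                if (!batch.isEmpty()) {
                    batchSizes.add(batch.size()); // flush the final partial batch
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();

        List<Thread> workers = new ArrayList<>();
        for (int p = 0; p < producers; p++) {
            final int id = p;
            Thread worker = new Thread(() -> {
                try {
                    for (int i = 0; i < docsPerProducer; i++) {
                        queue.put("p" + id + "-" + i);
                    }
                    queue.put(POISON);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(worker);
            worker.start();
        }

        try {
            for (Thread worker : workers) {
                worker.join();
            }
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return batchSizes;
    }
}
```

Because only the sender thread ever touches the batch, there is never more than one bulk request being assembled or sent at a time, however many workers feed the queue.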

Regards,
Brian


eunever32 · March 29, 2014, 2:00pm

Guys,
I appreciate the suggestions.
But shouldn't actionGet() block?
So there should only be 20 threads (maybe another 20 for ES).

I mean, are we saying client threads are just being created for each bulk request?
How does it work for other applications?
I notice search has threading options such as single thread and no threads.
Is something similar available for bulk?

Cheers


eunever32 · March 29, 2014, 7:19pm

By the way, I can successfully run 16 Python processes with no problem.
So the server can handle concurrent bulk requests.
The problem is with my Java code, as it somehow starts threads indefinitely.


jprante · March 30, 2014, 1:15am

If you run 16 Python processes, why do you run 20 Java threads and not 16?

Most important are the bulk action size (how many requests are sent), the
concurrency (how many bulk requests are active), and the bulk request
volume.

I recommend controlling the concurrency; your code does not do it. It is not
a question of actionGet() blocking. You blindly push data to the cluster no
matter what other threads are doing. A more polite way would be to check
whether a concurrency or volume threshold has been exceeded, so the client
can decide whether to wait for bulk responses before sending a new bulk
request.

All of this is solved by BulkProcessor.
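
To make the concurrency threshold concrete, here is a rough sketch using java.util.concurrent.Semaphore. It is not BulkProcessor and does not talk to Elasticsearch: BoundedBulkDemo is a hypothetical class, and the 1 ms sleep is a placeholder for the real bulk round trip. It only shows how many worker threads can share a fixed number of in-flight requests.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedBulkDemo {
    // Simulates `workers` threads each sending `requestsPerWorker` bulk requests,
    // while allowing at most `maxInFlight` requests to be active at once.
    // Returns the highest number of simultaneously active requests observed.
    static int maxObservedInFlight(int workers, int requestsPerWorker, int maxInFlight) {
        final Semaphore permits = new Semaphore(maxInFlight);
        final AtomicInteger inFlight = new AtomicInteger();
        final AtomicInteger peak = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                for (int r = 0; r < requestsPerWorker; r++) {
                    try {
                        // Wait here while the concurrency threshold is reached.
                        permits.acquire();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(1); // placeholder for the real bulk round trip
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    } finally {
                        // Always free the slot, even if the request failed.
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }
}
```

For example, maxObservedInFlight(20, 5, 4) runs 20 workers but never reports more than 4 active requests, which is the kind of coordination the 20 independent worker threads lack.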

Regarding your exception message, you should post complete code (both
Python and Java). It is not possible to trace bugs in your code by looking
at code fragments or pseudocode.

Jörg

