RabbitMQ river & bulksize


(Max Kossatz) #1

Hi,

We are importing millions of documents a day from RabbitMQ into
Elasticsearch 0.19.2 through the RabbitMQ river. Works great!
The index part of the river is configured like this:
    "index": {
        "bulk_size": 100,
        "bulk_timeout": "10s",
        "ordered": true
    }

The messages we send to RabbitMQ for indexing are themselves bulk inserts,
each containing anywhere from 1 to 100 inserts.
The question is:
Would it be better to put only one insert per RabbitMQ message, so that
Elasticsearch always has the same amount of data to index, or does this make
no difference?
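
For illustration, a message carrying several inserts in the Elasticsearch bulk format (one action line followed by one source line per document, newline-delimited) could be built roughly as below. This is only a sketch: the helper name `build_bulk_message` and the index and type names are hypothetical, not taken from the setup described above.

```python
import json

def build_bulk_message(docs, index="myindex", doc_type="mytype"):
    """Build a bulk-format message body from (doc_id, source) pairs.

    `index` and `doc_type` are placeholder names for this example.
    """
    lines = []
    for doc_id, source in docs:
        # Action/metadata line: tells Elasticsearch where the document goes.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

body = build_bulk_message([("1", {"title": "first"}), ("2", {"title": "second"})])
print(body)
```

A message with one insert and a message with 100 inserts differ only in how many action/source pairs the body contains.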

I am asking because on the search side the time taken for the same search
varies widely, from milliseconds up to 20 seconds, and I thought this might
be related to the varying bulk sizes that Elasticsearch has to index:
sometimes indexing takes longer, sometimes not.

The refresh interval for Elasticsearch is set to 10s (changed from the default
1s because we saw much lower CPU load with a higher value).

Thank you for your help,
Max


(benny.sadeh) #2

My understanding, which the RabbitMQ river source code confirms, is that
the river batches the requests it pulls from the queue.

The bulk_size, to the best of my knowledge, cannot be changed
dynamically.

But maybe someone more knowledgeable can chime in ...


(Shay Banon) #3

The bulk size logic works like this: the first message is read and all the
bulk items in it are added to the bulk request (regardless of the bulk
size). If the bulk size has not yet been reached, the river tries to read
more messages (without a timeout) to reach it. Once the bulk size is
reached, or there are no more messages, the bulk is executed.
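
The logic described above can be sketched in a few lines of Python. This is a simplified model, not the river's actual code: the queue is assumed to be an in-memory deque, each message is assumed to be a list of bulk items, and the name `drain_bulk` is made up for the example.

```python
from collections import deque

def drain_bulk(queue, bulk_size):
    """Collect the items for one bulk request from `queue`.

    `queue` is a deque of messages; each message is a list of bulk items.
    """
    if not queue:
        return []
    # The first message is always taken in full, regardless of bulk_size.
    bulk = list(queue.popleft())
    # Keep reading while below bulk_size and messages are immediately
    # available (no waiting in this model).
    while len(bulk) < bulk_size and queue:
        bulk.extend(queue.popleft())
    return bulk

# Three messages carrying 60, 70 and 5 items.
messages = deque([["a"] * 60, ["b"] * 70, ["c"] * 5])
first = drain_bulk(messages, bulk_size=100)
second = drain_bulk(messages, bulk_size=100)
print(len(first), len(second))  # prints "130 5"
```

In this model a single bulk can exceed bulk_size (130 items here), because messages are never split, and a bulk can also be much smaller when the queue runs empty.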

