We are importing millions of documents a day via RabbitMQ, through the
RabbitMQ river, into Elasticsearch 0.19.2. Works great!
The index part of the river is configured like this:
"index":{
"bulk_size":100,
"bulk_timeout":"10s",
"ordered":true
}
The messages we send to RabbitMQ for indexing are themselves bulk requests
with up to 100 inserts per message; the size varies from 1 to 100
inserts.
The question is: would it be better to put only one insert into each RabbitMQ
message, so that Elasticsearch always has the same amount of data to index,
or does this make no difference?
I am asking because on the search side the same search takes very different
amounts of time, anywhere from a few milliseconds up to 20 seconds. I thought
this might be related to the varying bulk sizes that Elasticsearch has to
index: sometimes indexing takes longer, sometimes not.
The refresh interval for Elasticsearch is set to 10s (changed from the default
of 1s because we saw much lower CPU load with a higher value).
The bulk size logic works like this: the first message is read and all of
its bulk items are added to the bulk request (regardless of the bulk size).
If the bulk size has not yet been reached, the river tries to read
(without a timeout) more messages to reach it. Once the bulk size is
reached, or there are no more messages, the bulk is executed.
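To make the consequence for your question concrete, here is a rough Python simulation of the logic described above. It is only a sketch, not the river's actual code: the function name drain_into_bulks is made up for illustration, each message is modelled as a plain list of bulk items, and "no more messages" is modelled as an empty in-memory queue.

```python
from collections import deque


def drain_into_bulks(messages, bulk_size):
    """Group messages into bulks following the logic described above.

    messages  -- list of messages, each a list of bulk items (hypothetical model)
    bulk_size -- the river's "bulk_size" setting
    Returns the list of bulks in the order they would be executed.
    """
    queue = deque(messages)
    bulks = []
    while queue:
        # The first message is always added whole, regardless of bulk_size.
        bulk = list(queue.popleft())
        # Keep reading whole messages while the bulk is still below bulk_size
        # and the queue has something left.
        while len(bulk) < bulk_size and queue:
            bulk.extend(queue.popleft())
        bulks.append(bulk)
    return bulks


# Messages carrying 100, 1, 60, 70 and 5 inserts, with bulk_size = 100.
messages = [["doc"] * n for n in (100, 1, 60, 70, 5)]
sizes = [len(b) for b in drain_into_bulks(messages, bulk_size=100)]
print(sizes)  # [100, 131, 5]
```

Because messages are always added whole, a bulk can end up larger than bulk_size when the messages are themselves bulks (131 items in the second bulk here). With one insert per message, no bulk in this model would go above bulk_size.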