Hi guys, I'm trying to import my data source using the _bulk API. So far I have:
- predefined mappings (with 5 different types of analysers, using edgeNGram filters)
- turned off refresh_interval
- set max_num_segments to 5
- and I'm bulk inserting in batches of 1000

Is there any other optimization I should do? With the current settings, inserting a batch of 1000 records still takes around 1-2 minutes. I have only inserted around 6 batches before halting it, so I'm not certain whether this performance will degrade over time.
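For reference, the equivalent of my setup looks roughly like the sketch below, written against the Java client (my actual importer uses the C# client; the index/type names and document source here are placeholders I made up):

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;

    public class BulkImport {
        // Disable periodic refreshes for the duration of the import;
        // "-1" switches automatic refreshing off entirely.
        public static void disableRefresh(Client client) {
            client.admin().indices().prepareUpdateSettings("myindex")
                    .setSettings(ImmutableSettings.settingsBuilder()
                            .put("index.refresh_interval", "-1")
                            .build())
                    .execute().actionGet();
        }

        // Send one batch of documents as a single _bulk request.
        public static void indexBatch(Client client, String[] jsonDocs) {
            BulkRequestBuilder bulk = client.prepareBulk();
            for (String doc : jsonDocs) {
                bulk.add(client.prepareIndex("myindex", "mytype").setSource(doc));
            }
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        }
    }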
Hello Shawn,

On Fri, Nov 9, 2012 at 12:38 PM, Shawn Ritchie <xritchie@gmail.com> wrote:
> Hi guys, I'm trying to import my data source using the _bulk API. So far
> I have:
> - predefined mappings (with 5 different types of analysers, using
>   edgeNGram filters)
> - turned off refresh_interval
> - set max_num_segments to 5

I'm not sure I understand this one. Do you optimize after each bulk, or?

> - and I'm bulk inserting in batches of 1000
>
> Is there any other optimization I should do? With the current settings,
> inserting a batch of 1000 records still takes around 1-2 minutes. I have
> only inserted around 6 batches before halting it, so I'm not certain
> whether this performance will degrade over time.
Maybe you already went through these, but it's worth a shot:
- How much memory did you allocate to ES out of the total RAM?
- You can disable _all if you don't need it (see the sketch after this list).
- Test to find the optimum batch size; maybe it works better with smaller batches.
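For the _all part, a minimal sketch of what the mapping change looks like through the Java API, assuming you create the index yourself (the index, type, and field names here are placeholders):

    import org.elasticsearch.client.Client;

    public class DisableAll {
        // Create the index with _all disabled in the type mapping,
        // so the catch-all field is not built during indexing.
        public static void createIndexWithoutAll(Client client) {
            String mapping = "{"
                    + "  \"mytype\": {"
                    + "    \"_all\": { \"enabled\": false },"
                    + "    \"properties\": {"
                    + "      \"title\": { \"type\": \"string\" }"
                    + "    }"
                    + "  }"
                    + "}";
            client.admin().indices().prepareCreate("myindex")
                    .addMapping("mytype", mapping)
                    .execute().actionGet();
        }
    }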
As long as you don't evaluate the BulkResponses from the _bulk requests, there is no safeguard against flooding ES, and degrading insertion performance over time will be unavoidable.

Your strategy should be: estimate the data volume of the 1000 requests in a single bulk. Issue a BulkRequest, do not wait for the response, issue more BulkRequests, then wait for the incoming BulkResponses. Limit the number of concurrent BulkRequests by waiting for the corresponding BulkResponses. Check your heap settings to make sure you can handle (number of max concurrent bulks * number of requests in a bulk). Adjust the length of a bulk request or the number of concurrent bulks until you hit the sweet spot of your configuration. This way you can balance the total volume of bulk data you are sending between the C# client and the ES cluster without flooding the system.

Shay has developed the class org.elasticsearch.action.bulk.BulkProcessor as an example of how the throughput and concurrency of bulk ingesting can be controlled by using the BulkResponses.
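A minimal sketch of wiring it up, assuming the builder-style API of that class (the batch size and concurrency numbers are just starting points to tune):

    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.Client;

    public class BulkIngest {
        public static BulkProcessor build(Client client) {
            return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                @Override
                public void beforeBulk(long executionId, BulkRequest request) {
                    // called just before a bulk is executed
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request,
                                      BulkResponse response) {
                    // evaluate the response; log failures or back off here
                    if (response.hasFailures()) {
                        System.err.println(response.buildFailureMessage());
                    }
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request,
                                      Throwable failure) {
                    // a whole bulk failed, e.g. because the node was flooded
                    failure.printStackTrace();
                }
            })
            .setBulkActions(1000)     // flush after every 1000 requests
            .setConcurrentRequests(2) // at most 2 bulks in flight at once
            .build();
        }
    }

Documents are then fed to it one by one, and the processor takes care of flushing and of limiting the number of concurrent bulks by waiting for the responses.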
Cheers,
Jörg