Optimize bulk insertion

Hi guys, I'm trying to import my data source using the _bulk API.

I predefined the mappings (they use 5 different analysers and edgeNGram filters),
turned off refresh_interval (see the sketch below),
set max_num_segments to 5,
and I'm bulk inserting in batches of 1000.
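
For reference, a rough sketch of the refresh_interval part as it would look through
the Java client of the time (I'm actually calling the HTTP API from my own code; the
index name, node address and exact builder methods below are assumptions, and the same
settings can also be sent as JSON to the _settings endpoint):

// Rough sketch only: "myindex" and the node address are placeholders, and the
// exact builder methods may differ slightly between ES versions.
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class BulkLoadSettings {
    public static void main(String[] args) {
        TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Disable automatic refresh while bulk loading.
        client.admin().indices().prepareUpdateSettings("myindex")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", "-1")
                        .build())
                .execute().actionGet();

        // ... run the bulk import here ...

        // Restore a normal refresh interval once the import is done.
        client.admin().indices().prepareUpdateSettings("myindex")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", "1s")
                        .build())
                .execute().actionGet();

        client.close();
    }
}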

Are there any other optimizations I should do? With the current settings, inserting
a batch of 1000 records still takes around 1-2 minutes, and I have only inserted
around 6 batches before I halted it, so I'm not certain whether this performance
will degrade over time.

Any help is appreciated.

Regards
Shawn

--

I'm also getting this in the logs when I do bulk insertion. Would anyone mind
explaining what these warnings are?

[2012-11-09 10:49:37,678][WARN ][monitor.jvm ] [Frank Castle] [gc][ParNew][340020][1597] duration [2.8s], collections [1]/[3.8s], total [2.8s]/[22.2m], memory [547.5mb]->[571.6mb]/[989.8mb], all_pools {[Code Cache] [6.5mb]->[6.5mb]/[48mb]}{[Par Eden Space] [1.1mb]->[3.8mb]/[273mb]}{[Par Survivor Space] [30.2mb]->[33.5mb]/[34.1mb]}{[CMS Old Gen] [516.5mb]->[534.6mb]/[682.6mb]}{[CMS Perm Gen] [30.5mb]->[30.5mb]/[82mb]}
[2012-11-09 10:50:55,045][WARN ][monitor.jvm ] [Frank Castle] [gc][ParNew][340095][1625] duration [1s], collections [1]/[1.9s], total [1s]/[22.2m], memory [540.8mb]->[571.3mb]/[989.8mb], all_pools {[Code Cache] [6.5mb]->[6.5mb]/[48mb]}{[Par Eden Space] [12.4mb]->[11.8mb]/[273mb]}{[Par Survivor Space] [28.7mb]->[34.1mb]/[34.1mb]}{[CMS Old Gen] [499.6mb]->[525.3mb]/[682.6mb]}{[CMS Perm Gen] [30.5mb]->[30.5mb]/[82mb]}
[2012-11-09 10:51:14,384][WARN ][monitor.jvm ] [Frank Castle] [gc][ParNew][340097][1626] duration [17.6s], collections [1]/[18.3s], total [17.6s]/[22.5m], memory [763.1mb]->[586.5mb]/[989.8mb], all_pools {[Code Cache] [6.5mb]->[6.5mb]/[48mb]}{[Par Eden Space] [203.6mb]->[3.3mb]/[273mb]}{[Par Survivor Space] [34.1mb]->[33.9mb]/[34.1mb]}{[CMS Old Gen] [525.3mb]->[549.3mb]/[682.6mb]}{[CMS Perm Gen] [30.5mb]->[30.5mb]/[82mb]}
[2012-11-09 10:51:40,359][WARN ][monitor.jvm ] [Frank Castle] [gc][ParNew][340122][1636] duration [1.1s], collections [1]/[1.8s], total [1.1s]/[22.5m], memory [573.5mb]->[311.7mb]/[989.8mb], all_pools {[Code Cache] [6.5mb]->[6.5mb]/[48mb]}{[Par Eden Space] [271.5mb]->[1.2mb]/[273mb]}{[Par Survivor Space] [24.4mb]->[11.2mb]/[34.1mb]}{[CMS Old Gen] [277.5mb]->[299.2mb]/[682.6mb]}{[CMS Perm Gen] [30.5mb]->[30.5mb]/[82mb]}

--

Are you doing bulk in Java?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

No, I'm using HttpRequests in C#, running on the same server as Elasticsearch.

--

I will try out the same thing using curl and see how long it takes.

--

OK. A common error in Java is reusing the same bulk request at each iteration; you have to recreate it after each execution.
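
In Java terms, a minimal sketch of that pattern (the index and type names are made up;
the point is only that prepareBulk() is called again for every batch):

import java.util.List;
import java.util.Map;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class BatchIndexer {
    // One bulk round-trip per batch, building a *fresh* BulkRequestBuilder each time.
    public static void indexBatch(Client client, List<Map<String, Object>> batch) {
        BulkRequestBuilder bulk = client.prepareBulk();   // recreated per batch, never reused
        for (Map<String, Object> doc : batch) {
            bulk.add(client.prepareIndex("myindex", "mytype").setSource(doc));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            // At minimum, look at what failed before sending the next batch.
            System.err.println(response.buildFailureMessage());
        }
    }
}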

So that's not your problem here. I'm afraid I can't help further and hope you will find answers from others...

Cheers

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

No, I'm recreating the bulk with a fresh 1000 records at each iteration. :smiley:

--

Same performance using curl.

--

Hello Shawn,

On Fri, Nov 9, 2012 at 12:38 PM, Shawn Ritchie xritchie@gmail.com wrote:

set max_num_segments to 5

I'm not sure I understand this one. Do you optimize after each bulk, or?

Maybe you already went through this but it's worth a shot :slight_smile:

  • how much memory did you allocate to ES out of the total RAM?
  • you can disable _all if you don't need it (sketch below)
  • test to find the optimum batch size; maybe it works better with smaller batches
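
For the _all point, a rough sketch of what disabling it looks like when creating the
index through the Java client (the index, type and field names are placeholders; the
same mapping JSON can be sent over HTTP instead):

import org.elasticsearch.client.Client;

public class IndexSetup {
    // Create the index with _all disabled in the type mapping.
    // "myindex", "mytype" and "title" are placeholders.
    public static void createIndex(Client client) {
        client.admin().indices().prepareCreate("myindex")
              .addMapping("mytype",
                  "{\"mytype\":{\"_all\":{\"enabled\":false},"
                + "\"properties\":{\"title\":{\"type\":\"string\"}}}}")
              .execute().actionGet();
    }
}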

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

As long as you don't evaluate the BulkResponses from the _bulk requests, there is
no safeguard against flooding ES, and degrading insertion performance over time
will be unavoidable.

Your strategy should be: estimate the data volume of your 1000 requests in a single
bulk. Issue a BulkRequest, don't wait for the response, issue more BulkRequests, then
wait for the incoming BulkResponses. Limit the number of concurrent BulkRequests by
waiting for the corresponding BulkResponses. Check whether your heap settings can
handle (max number of concurrent bulks * number of requests per bulk). Adjust the
length of a bulk request and the number of concurrent bulks until you hit the sweet
spot of your configuration. That way you can balance the total volume of bulk data
you send between the C# client and the ES cluster without flooding the system.

Shay has developed the class org.elasticsearch.action.bulk.BulkProcessor as an
example of how the throughput and concurrency of bulk ingestion can be controlled
by using the BulkResponses.
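
A rough sketch of driving the import that way (index and type names, batch size and
concurrency are placeholders, and the exact builder methods may vary between versions):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;

public class BulkLoader {
    // Let BulkProcessor cut bulks of 1000 actions and cap how many bulk requests
    // are in flight at once, reacting to each BulkResponse as it comes back.
    public static void load(Client client) {
        BulkProcessor processor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            public void beforeBulk(long executionId, BulkRequest request) { }
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    System.err.println(response.buildFailureMessage());
                }
            }
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                failure.printStackTrace();
            }
        })
        .setBulkActions(1000)       // flush a bulk every 1000 actions
        .setConcurrentRequests(2)   // at most 2 bulks in flight at once
        .build();

        // Feed documents; the processor batches and throttles them against the responses.
        processor.add(new IndexRequest("myindex", "mytype").source("{\"title\":\"example\"}"));
        // ...
        processor.close();          // flush whatever is still pending
    }
}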

Cheers,

Jörg

--

Those warnings are harmless as they indicate the GC is stepping in.

Jörg

--