Bulk Loading Performance slows to a crawl


(derrickburns) #1

I am trying to index 80M tweets with a bulk loader that I wrote using
the Java API on 0.16.4.

When I get to about 2M tweets, indexing performance drops from 4K
indexed tweets per second to under 300 tweets per second. CPU load
goes from 30% to 3%. Heap memory bounces between 300 MB and 600 MB and
then spikes to 900 MB.

Here are my index settings:

    index.engine.robin.refresh_interval, 10
    indices.memory.index_buffer_size, 0.50
    index.number_of_shards, 4
    index.number_of_replicas, 0
    index.merge.policy.merge_factor, 30
    index.merge.policy.use_compound_file, false
    index.refresh_interval, "-1"

I am doing this on an index I just created on a local node.

I am running on a new quad-core i7 MacBook Pro.

Ideas???


(derrickburns) #2

BTW, I noticed over 3000 open files, mostly segment files of the form:

/Users/derrickrburns/Documents/workspace/com.rascal.growl/data/elasticsearch/nodes/0/indices/twitter/0/index/_0.frq
/Users/derrickrburns/Documents/workspace/com.rascal.growl/data/elasticsearch/nodes/0/indices/twitter/0/index/_0.prx
/Users/derrickrburns/Documents/workspace/com.rascal.growl/data/elasticsearch/nodes/0/indices/twitter/0/index/_0.fdt
/Users/derrickrburns/Documents/workspace/com.rascal.growl/data/elasticsearch/nodes/0/indices/twitter/0/index/_0.fdx
/Users/derrickrburns/Documents/workspace/com.rascal.growl/data/elasticsearch/nodes/0/indices/twitter/0/index/_0.nrm
/Users/derrickrburns/Documents/workspace/com.rascal.growl/data/elasticsearch/nodes/0/indices/twitter/0/index/_1.frq


(Shay Banon) #3
  • Why are you setting index.engine.robin.refresh_interval? Don't set it (you are actually using a refresh interval of 10 milliseconds). index.refresh_interval is enough.
  • How many threads are you indexing with?
  • I assume it's on a single server? Merging will start to kick in and slow things down a bit; that's how it works.
  • How do you index the data? Which client?
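
A minimal sketch of the first point, assuming the settings are collected as plain key/value pairs before being handed to whatever client does the indexing. The class name BulkLoadSettings and its build() method are hypothetical and not part of the Elasticsearch API; the values are copied as posted in #1, with only index.engine.robin.refresh_interval left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the settings from post #1 as plain strings.
// Passing them to Elasticsearch is left to the caller and is not shown here.
final class BulkLoadSettings {

    static Map<String, String> build() {
        Map<String, String> settings = new LinkedHashMap<>();
        // index.engine.robin.refresh_interval is deliberately not set;
        // index.refresh_interval alone controls refreshing.
        settings.put("index.refresh_interval", "-1");
        settings.put("indices.memory.index_buffer_size", "0.50");
        settings.put("index.number_of_shards", "4");
        settings.put("index.number_of_replicas", "0");
        settings.put("index.merge.policy.merge_factor", "30");
        settings.put("index.merge.policy.use_compound_file", "false");
        return settings;
    }

    public static void main(String[] args) {
        // Print each setting so the final list can be checked by eye.
        build().forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```

Keeping the settings in one place like this makes it easy to confirm that only one refresh-related key is being sent.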