Bulk load has spiky behavior

We have a write-heavy cluster that shows only spiky activity during bulk
loads. The incoming write rate is ~1500 documents per second, but the
indexing rate is significantly slower, ~1500 a MINUTE, and the CPU
utilization is incredibly spiky (i.e. high CPU and network I/O for a short
time, followed by no CPU or network for a short time, repeat). The data is
being written to ~10 object_types spread across 2 indices. Any ideas on how
to smooth this out and optimize for this volume of writes?

The loaders are on a separate cluster (4 Amazon c1.xlarges) and are
configured to be transport nodes only (no data or http).

The indices themselves are stored on a 6-node Elasticsearch cluster of
(Amazon) m1.xlarges (these are the machines showing the spiky load behavior).
This is what the index settings look like:
"settings" : {
"index.number_of_shards" : "12",
"index.number_of_replicas" : "2",
"index.version.created" : "190899",
"index.gateway.snapshot_interval" : "1200s"
}

and the configuration file (with some private info removed):

# ElasticSearch config file

# File paths

path:
  home: /usr/local/share/elasticsearch
  conf: /etc/elasticsearch
  logs: /var/log/elasticsearch

# http://www.elasticsearch.com/docs/elasticsearch/modules/node/

node:
  data: true
  master: true

# http://www.elasticsearch.com/docs/elasticsearch/modules/http/

http:
  enabled: true
  port: 9200-9300
  max_content_length: 100mb

cluster:
  routing:
    allocation:
      node_initial_primaries_recoveries: 4
      concurrent_recoveries: 5

# http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/1f3001f43266879a/06d62ea3ceb4db30?lnk=gst&q=translog#06d62ea3ceb4db30
indices:
  cache:
    filter:
      size: 20%
  memory:
    index_buffer_size: 10%

index:
  number_of_shards: 12
  number_of_replicas: 2
  translog:
    flush_threshold_ops: 5000
    flush_threshold_size: 200mb
    flush_threshold_period: 60s

  merge:
    policy:
      max_merge_at_once: 10
      segments_per_tier: 10
      use_compound_file: false
      floor_segment: 2.7mb
  refresh_interval: 1s

  shard:
    recovery:
      concurrent_streams: 7

  engine:
    robin:
      term_index_interval: 1024

gateway:
  snapshot_interval: 10s
  snapshot_on_close: true

# http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-shutdown.html
action:
  disable_shutdown: false

# http://www.elasticsearch.com/docs/elasticsearch/modules/transport/

transport:
  tcp:
    port: 9300-9400
    connect_timeout: 2m
    compress: true

# http://www.elasticsearch.com/docs/elasticsearch/modules/jmx/

jmx:
  create_connector: true
  port: 9400-9500
  domain: elasticsearch

monitor.jvm.gc.ParNew.warn: 1000ms
monitor.jvm.gc.ParNew.info: 700ms
monitor.jvm.gc.ParNew.debug: 400ms
monitor.jvm.gc.ConcurrentMarkSweep.warn: 10s
monitor.jvm.gc.ConcurrentMarkSweep.info: 5s
monitor.jvm.gc.ConcurrentMarkSweep.debug: 2s

What kind of client do you use? How many clients run in parallel? Do you
throttle concurrent bulk requests?

index.merge.policy.max_merged_segment is 5g by default; did you consider
decreasing it?

If you have slow disks and need store throttling, you can try something
like

index.store.throttle.type: merge
index.store.throttle.max_bytes_per_sec: 1m

(1m may be too low, maybe 5m is better)

Jörg
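
For reference, throttling concurrent bulk requests on the client side can be
as simple as a shared semaphore around the bulk call, so the data nodes get a
steady stream of requests instead of bursts. A minimal sketch against the
0.19-era Java transport client; the class name, MAX_IN_FLIGHT and the batch
handling are made up for illustration:

import java.util.concurrent.Semaphore;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

// Caps how many of the parallel loader threads may have a bulk request in
// flight at the same time. With 10 loaders and MAX_IN_FLIGHT = 4, at most 4
// bulks hit the cluster concurrently; the rest block in acquire().
public class ThrottledBulkLoader {

    private static final int MAX_IN_FLIGHT = 4;   // tune against the observed indexing rate

    private final Semaphore slots = new Semaphore(MAX_IN_FLIGHT);
    private final Client client;

    public ThrottledBulkLoader(Client client) {
        this.client = client;
    }

    // Called from each loader thread with one batch of JSON documents.
    public void indexBatch(String index, String type, Iterable<String> jsonDocs)
            throws InterruptedException {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (String json : jsonDocs) {
            bulk.add(client.prepareIndex(index, type).setSource(json));
        }
        slots.acquire();
        try {
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                // inspect the individual item responses and retry as appropriate
                System.err.println("bulk request had failures");
            }
        } finally {
            slots.release();
        }
    }
}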

We use transport clients, 10 of them running in parallel (one for each
object_type), and we do not throttle concurrent bulk requests (at least not
intentionally through any settings).

The rest of your settings suggestions seem to assume that merges are what's
slowing down the bulk indexing. Is there a way to confirm this? During the
periods of no writes (in between spikes) there is no disk activity either,
which makes me think that merges are not occurring. Am I incorrect in that
assumption?
Currently the index has ~100,000,000 documents in it, and each shard contains
~3 GB of data, pretty evenly distributed. The disks themselves are not slow
(by conventional standards).
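
One way to confirm or rule out merges is the indices stats API, which has a
merge section with "current" and "total" counters; if "current" stays at 0
while indexing is stalled, merges are probably not the bottleneck. A rough
sketch with the 0.19-era Java admin API (exact response class and accessor
names shifted between releases, so treat this as illustrative):

import org.elasticsearch.action.admin.indices.stats.IndicesStats;
import org.elasticsearch.client.Client;

public class MergeActivityCheck {

    // Prints merge counters for one index; run it during a quiet period and
    // again during a spike, and compare.
    public static void printMergeStats(Client client, String index) {
        IndicesStats stats = client.admin().indices().prepareStats(index)
                .setMerge(true)
                .execute().actionGet();
        System.out.println("merges in progress: " + stats.getTotal().getMerge().getCurrent());
        System.out.println("merges completed:   " + stats.getTotal().getMerge().getTotal());
    }
}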

We use the native Java clients, sorry for the omission, and in case it wasn't
clear from the above, we are on Elasticsearch 0.19.8.

This all looks quite reasonable as far as I can tell.

Throttling the store is just a shot in the dark. I think it helps most
when indices are building up and disks are known to be slow, so I/O
waits do not block the machine.

You have ~3g-sized shards. Just one data point for comparison: my shards are
below 1g. I'm not sure about a sweet spot, but I think you should consider
reducing the shard size. The smaller the shards/segments, the lower the
chance of I/O-induced spikes, simply because less data is involved in each
operation.

Jörg

Thanks for the suggestions. I'm going to try fiddling with some of those
knobs and see if anything changes. I'll keep posting with any findings.

Update: During loads, when the Elasticsearch nodes are exhibiting the
on-again, off-again behavior, triggering a flush has been successful in
shaking them back to life. This makes me think that something related to the
translog is going on. Does anyone have suggested tuning/monitoring steps I
can try to prevent the aforementioned spiky load behavior?
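
For what it's worth, that manual "shake it back to life" flush can also be
issued from the Java client, which makes it easy to script while you watch
the node stats (the index names below are placeholders):

import org.elasticsearch.client.Client;

public class ManualFlush {

    // Forces a flush on the bulk-load indices, i.e. the same thing the flush
    // REST endpoint does, so it can be wired into a cron job or a monitoring
    // hook while the spiky behavior is being investigated.
    public static void flushIndices(Client client) {
        client.admin().indices()
                .prepareFlush("index1", "index2")
                .execute().actionGet();
    }
}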

hey,

I guess you disabled the refresh, so you will only flush once you have enough
documents buffered in RAM or when the auto flush kicks in, right? I assume
you are not using the 0.90 beta yet (I didn't see it mentioned), and I guess
you have given the JVM a reasonable amount of memory?

My first bet would be that you are buffering up a ton of docs in RAM and then
everything needs to be flushed to disk (which is single-threaded and blocking
in Lucene 3.x). This means you are doing nothing for the time being. I wrote
a blog post last year when I committed the concurrent flushing to Lucene to
fix this problem (4.0 only) - I think there are still copies out there, like
here: http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!

maybe this explains some of the spikes you are seeing?

simon
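
Rough numbers on that, assuming the 10% index buffer is taken from the full
7g heap: that is on the order of 700mb of buffered documents per node, spread
across the shards on that node. Writing a buffer of that size out in a
single-threaded, blocking flush could easily take several seconds on
m1.xlarge disks, which would line up with the pattern of short bursts of work
followed by dead time.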

Maybe? The interesting thing to note is that there is no disk activity while
the load is not spiking. CPU, network and disk I/O drop to all but 0, and
then spike back up. If I were waiting on a flush thread, wouldn't there be
higher disk usage? And isn't flushing triggered per shard? So even if a
particular shard is blocked for a while, shouldn't the others be accepting
data? The index has 12 shards...
Is there anything to do besides upgrading Lucene? That isn't out of the realm
of possibility, but I would like to be able to tune/configure around the
issue with my current install if I can.

FYI: we are on ES 0.19.8, and ES_MIN_MEM and ES_MAX_MEM are both set to 7g.

So, this Lucene flush is very CPU-intensive, since it merges the thread
states in memory and then flushes the data to disk. But this was just a first
guess from a brief read through your problem. Lemme ask some more questions:

  • can you set the refresh interval to -1 while you are bulk loading data?
  • maybe unrelated, but why did you set engine.robin.term_index_interval: 1024?
  • are you using a shared gateway, or do you use the default?
  • you set "index.gateway.snapshot_interval" to 1200s; can you try setting it to -1 while you are bulk loading?
  • if you are bulk indexing, you might want to set the replicas to 0 or 1 and then, once you are done, expand them back to 2? (see the sketch after this list)

simon
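
The refresh_interval and replica suggestions above can be applied to a live
index (and reverted afterwards) through the update settings API. A sketch
with the 0.19-era Java client; the index name and the restored values are
placeholders for whatever the real setup uses:

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class BulkLoadSettings {

    // Turn off periodic refreshes and drop replicas before a bulk load...
    public static void beforeBulkLoad(Client client, String index) {
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", "-1")
                        .put("index.number_of_replicas", 0)
                        .build())
                .execute().actionGet();
    }

    // ...and restore them once the load is finished.
    public static void afterBulkLoad(Client client, String index) {
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", "1s")
                        .put("index.number_of_replicas", 2)
                        .build())
                .execute().actionGet();
    }
}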

The data is being streamed in, so there isn't an opportunity to optimize the
cluster for bulk and then change it back for search.
I am using a shared gateway (S3).
engine.robin.term_index_interval is set to 1024 from a related CPU-thrash
problem that we have since solved; as a consequence we kept some of the
less-aggressive settings around as a precaution. Why would this setting be
related? It wouldn't be much trouble to change it back.

The problem we are trying to overcome is that the spiky load isn't ALWAYS the
case; it's actually a state the cluster develops after a while. Streaming
load works as needed for a time, and then (sometimes after a long while,
sometimes after a short while) the load begins to spike up and down and can
no longer keep up with the stream rate.
