ELK stack needs tuning

Hi all,

At bol.com we use ELK for a log search platform, running on 3 machines.

We need fast indexing (to not lose events) and want fast & near-realtime
search. The search is currently not fast enough. A simple "give me the last
50 events from the last 15 minutes, from any type, from today's indices,
without any terms" search query may take 1.0 sec, sometimes even passing
30 seconds.
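
To give an idea of the query shape: our Kibana panels and the nagios check
mentioned further down boil down to something like this (the index pattern and
the @timestamp field follow the usual logstash conventions; the exact names
here are illustrative):

    curl -s 'localhost:9200/logstash-*-2014.04.16/_search?pretty' -d '{
      "size": 50,
      "sort": [ { "@timestamp": { "order": "desc" } } ],
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": { "range": { "@timestamp": { "gte": "now-15m" } } }
        }
      }
    }'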

It currently does 3k docs added per second, but we expect 8k/sec by the end
of this year.

I have included lots of specs/config at the bottom of this e-mail.

We found 2 reliable knobs to turn:

  1. index.refresh_interval. At 1 sec fast search seems impossible. When
    upping the refresh to 5 sec, search gets faster. At 10 sec it's even faster.
    But when you search during the refresh (wouldn't a splay be nice?) it's slow
    again. And a refresh every 10 seconds is not near-realtime anymore. No
    obvious bottlenecks present: cpu, network, memory, disk i/o all OK. (The
    settings call right after this list shows how we change it.)
  2. deleting old indices. No clue why this improves things. And we really
    do not want to delete old data, since we want to keep at least 60 days of
    data online. But after deleting old data the search speed slowly crawls back
    up again...
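
For reference, bumping the refresh interval on live indices is a plain
settings update; the index pattern below is illustrative:

    curl -XPUT 'localhost:9200/logstash-*-2014.04.16/_settings' -d '{
      "index": { "refresh_interval": "5s" }
    }'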

We have zillions of metrics ("measure everything") for OS, ES and JVM using
Diamond and Graphite. Too much to include here.
We use a nagios check that simulates Kibana queries to monitor the search
speed every 5 minutes.

When comparing behaviour at refresh_interval 1s vs 5s we see:

  • system% cpu load: varies per server: 150 vs 80, 100 vs 50, 40 vs 25
    == lower
  • ParNew GC run frequency: 1 vs 0.6 (per second) == less
  • CMS GC run frequency: 1 vs 4 (per hour) == more
  • avg index time: 8 vs 2.5 (ms) == lower
  • refresh frequency: 22 vs 12 (per second) -- still high numbers at 5
    sec because we have 17 active indices every day == less
  • merge frequency: 12 vs 7 (per second) == less
  • flush frequency: no difference
  • search speed: at 1s way too slow; at 5s (in tests timed between the
    refresh bursts) search calls take ~50ms.

We already looked at the threadpools (checked via the _cat call after this
list):

  • we increased the bulk pool
  • we currently do not have any rejects in any pools
  • the only pool that shows queueing (a spike every 1 or 2 hours) is the
    'management' pool (but that's probably Diamond)
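
The _cat call we use for this is nothing exotic:

    curl 'localhost:9200/_cat/thread_pool?v'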

We have a feeling something blocks/locks under high index and high search
load. But what? I have looked at nearly all metrics and _cat output.

Our current list of untested/wild ideas:

  • Is the index.codec.bloom.load=false on yesterday's indices really the
    magic bullet? We haven't tried it yet (a sketch of how we would apply it
    follows this list).
  • Adding a 2nd JVM per machine is an option, but as long as we do not
    know the real cause it's not a real option (yet).
  • Lowering the heap from 48GB to 30GB, to avoid the uncompressed 64-bit
    pointer overhead.
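
For the bloom filter idea, what we would try is a settings update on
yesterday's indices, roughly like this (the index name is illustrative, and we
have not verified whether the setting needs a close/open cycle to take
effect):

    curl -XPUT 'localhost:9200/logstash-syslog-2014.04.15/_settings' -d '{
      "index.codec.bloom.load": false
    }'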

What knobs do you suggest we start turning?

Any help is much appreciated!

A little present from me in return: I suggest you read
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
and decide if you need dynamic scripting enabled (the default), as it allows
for remote code execution via the REST API. Credits go to Byron at Trifork!
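
If you decide you do not need it, turning it off is a single line in
elasticsearch.yml (this is the setting name in the 1.x line):

    script.disable_dynamic: true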

More details:

Versions:

  • ES 1.0.1 on: java version "1.7.0_17", Java(TM) SE Runtime Environment
    (build 1.7.0_17-b02), Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01,
    mixed mode)
  • Logstash 1.1.13 (with a backported elasticsearch_http plugin, for
    idle_flush_time support)
  • Kibana 2

Setup:

  • we use several types of shippers/feeders, all sending logging to a set
    of redis servers (the log4j and accesslog shippers/feeders use the logstash
    json format to avoid grokking on the logstash side)
  • several logstash instances consume the redis list, process and store
    into ES using the bulk API (we use bulk because we dislike the version
    lock-in of the native transport)
  • we use bulk async (we thought it would speed up indexing, which it
    doesn't)
  • we use a bulk batch size of 1000 and an idle flush of 1.0 second (see
    the output config sketch after this list)
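
For context, the relevant part of our logstash output config looks roughly
like this (the host name is a placeholder; idle_flush_time is what the
backported plugin adds):

    output {
      elasticsearch_http {
        host            => "es-node-1.example.com"
        flush_size      => 1000
        idle_flush_time => 1
      }
    }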

Hardware for ES:

  • 3x HP 360G8 24x core
  • each machine has 256GB RAM (1 ES jvm running per machine with 48GB
    heap, so lots of free RAM for caching)
  • each machine has 8x 1TB SAS (1 for OS and 7 as separate disks for use
    in ES' -Des.path.data=....)

Logstash integration:

  • using the bulk API, to avoid the version lock-in (maybe slower, which we
    can fix by scaling out / adding more logstash instances)
  • 17 new indices every day (e.g. syslog, accesslogging, log4j +
    stacktraces)

ES configuration:

  • ES_HEAP_SIZE: 48gb
  • index.number_of_shards: 5
  • index.number_of_replicas: 1
  • index.refresh_interval: 1s
  • index.store.compress.stored: true
  • index.translog.flush_threshold_ops: 50000
  • indices.memory.index_buffer_size: 50%
  • default index mapping
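
Expressed as elasticsearch.yml, the non-default settings amount to the
following (the data paths are hypothetical placeholders, we actually pass
them via -Des.path.data on the command line, and the heap comes from
ES_HEAP_SIZE):

    path.data: ["/data1", "/data2", "/data3", "/data4", "/data5", "/data6", "/data7"]
    index.number_of_shards: 5
    index.number_of_replicas: 1
    index.refresh_interval: 1s
    index.store.compress.stored: true
    index.translog.flush_threshold_ops: 50000
    indices.memory.index_buffer_size: 50%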

Regards,
Renzo Toma
Bol.com

p.s. we are hiring! :-)

Well, once you go over 31-32GB of heap you lose pointer compression
(compressed oops), which can actually slow you down. You might be better off
reducing the heap and running multiple instances per physical machine.
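
A quick way to check whether a given heap size still gets compressed oops is
to dump the JVM flags (nothing ES-specific):

    java -Xmx30g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
    java -Xmx48g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops

At 48g you should see UseCompressedOops reported as false; at 30g it should
still be true.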

Since 0.90.4 or so compression is on by default, so there is no need to
specify that. You might also want to change the shard count to a factor of
your node count, e.g. 3, 6 or 9, for more even allocation.
Also try moving to java 1.7u25, as that is the generally agreed version to
run. We run u51 with no issues though, so that might be worth trialling if
you can.

Finally, what are you using to monitor the actual cluster? Something like
ElasticHQ or Marvel will probably provide greater insights into what is
happening and what you can do to improve performance.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

Hi Mark,

Thank you for your comments.

Regarding the monitoring: we use the Diamond ES collector, which saves
metrics every 30 seconds in Graphite. ElasticHQ is nice, but it does its
diagnostics calculations over the whole runtime of the cluster instead of the
last X minutes. It does have nice diagnostics rules though, so I created
Graphite dashboards for them. Marvel is surely nice, but with the exception
of Sense it does not offer me anything I do not already have with Graphite.

New findings:

  • Setting index.codec.bloom.load=false on yesterday's and older indices
    frees up memory from the fielddata pool. This stays released even when
    searching.
  • Closing older indices speeds up indexing & refreshing (see the example
    after this list).
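
For anyone else trying this: closing an older index is a single call (the
index name is illustrative):

    curl -XPOST 'localhost:9200/logstash-syslog-2014.02.15/_close'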

Regarding the closing benefit: the impact on refreshing is great, but from a
functional point of view it's bad. I know about the 'overhead per index', but
cannot find a solution for it.

Does anyone know how to get an ELK stack with "unlimited" retention?

Regards,
Renzo

"17 new indices every day" - whew. Why don't you use shard overallocating?

https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ

Jörg

Hi Jörg,

Thank you for pointing me to this article. I needed to read it twice, but I
think I understand it now.

I believe shard overallocation works for use cases where you want to store
& search 'users' or 'products'. Such data allows you to divide all
documents into groups to be stored in different shards using routing. All
shards get indexed & searched.

But how does this work for logstash indices? I could create 1 index with
365 shards (if I want 1 year of retention) and use alias routing (alias per
date with routing to a shard) to index into a different shard every day,
but after 1 year I need to purge a shard. And purging a shard is not easy.
It would require a delete of every document in the shard.
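
To make that concrete, the construction I have in mind would look something
like this (the index name, alias name and shard count are hypothetical):

    curl -XPUT 'localhost:9200/logstash-longterm' -d '{
      "settings": { "index.number_of_shards": 365, "index.number_of_replicas": 1 }
    }'

    curl -XPOST 'localhost:9200/_aliases' -d '{
      "actions": [
        { "add": {
            "index": "logstash-longterm",
            "alias": "logstash-2014.04.17",
            "index_routing": "2014.04.17",
            "search_routing": "2014.04.17"
        } }
      ]
    }'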

Or am I missing something?

Regards,
Renzo

If you want unlimited retention you're going to have to keep adding more
nodes to the cluster to deal with it.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com
