Sudden high "OS Load", then ES VM disappears

Hey all!

I'm having a serious problem with my ES cluster. Every now and then,
when writing to the cluster, a machine (or two) will suddenly spike in
OS load, writing will come to a screeching halt (>5s for 1k docs, as
opposed to ~100ms normally), and shortly after that, the VM that was
spiking will simply disappear from the cluster.

https://lh3.googleusercontent.com/-tof8Eiwa1hw/U8-44SG-VGI/AAAAAAAAAAM/cQHRGV43r3s/s1600/Screenshot+2014-07-23+09.28.30.png
Another thing I've noticed is that the "Document Count" dips to 0 when
this happens. FWIW, the VMs have 15 GB of memory and 4 cores each, running Docker.

https://lh5.googleusercontent.com/-Clp-mK2fddQ/U8-5EhQrlZI/AAAAAAAAAAU/TtgFCAne6aE/s1600/Screenshot+2014-07-23+09.28.10.png

Here are some logs from around the time this happened:

Hot threads: https://gist.github.com/schonfeld/370f9c32dbefce59e628
Nodes stats: https://gist.github.com/schonfeld/5f528949d3a6341417dc
Screenshots:
https://www.dropbox.com/s/gthreucugoz0opm/Screenshot%202014-07-23%2009.28.30.png
&&
https://www.dropbox.com/s/ae1vijcdgv5sv7u/Screenshot%202014-07-23%2009.28.10.png
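
(For anyone following along, the node membership and document counts shown in
those screenshots can also be pulled straight from the REST API; assuming an
ES 1.x cluster on the default port, something like

  curl -s 'localhost:9200/_cat/nodes?v'           # which nodes the master currently sees
  curl -s 'localhost:9200/_cluster/health?pretty'  # status, node count, unassigned shards
  curl -s 'localhost:9200/_cat/count?v'            # cluster-wide doc count

is enough to confirm when a node drops out and the count dips.)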

Any clues, insights, and thoughts will be greatly appreciated!

Thanks,

  • Michael


One additional piece of information -- the .yml conf file we
use: conf · GitHub


I'm not sure what "OS Load" is in this context, but I'm guessing it is load
average. The shape of the memory usage graph suggests that the orange
node might be stuck in a garbage collection storm - the heap numbers
aren't going up and down, just staying constant while the load is pretty
high. Might not be it, but it'd be nice to see garbage collection
counts/times.

You also might want to look at what the CPU is doing - in particular the
% of time the CPU spends in iowait. If that jumped up, then something went
wrong with the disk. Another way to tell is to look at the reads and writes
that Elasticsearch reports - they're called write_bytes and read_bytes or
something like that. If you have a graph of those you can spot events
like shards moving from one node to another (a write spike near the
configured maximum throttle - 20 or 40 MB/sec - with a network spike followed
by a read spike), as opposed to regular operation (steady-state reading), and
as opposed to a big merge (looks like a shard move without the network spike
but with a user CPU spike).

The idea is that you can see if any of these events occurred right before
your problem.
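
A quick way to watch the iowait side on each VM (assuming the usual Linux
tools are available in the container or on the host) is something like:

  top -b -n 1 | head -5   # the 'wa' value on the %Cpu(s) line is iowait
  vmstat 5                # 'wa' column, sampled every 5 seconds
  iostat -x 5             # per-device utilisation and await, if sysstat is installed

If iowait stays low while the load explodes, the problem is more likely CPU or
GC than the disk.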


Looking at the JVM GC graphs, I do see some increases there, but I'm not
sure those are enough to cause this storm?

https://lh6.googleusercontent.com/-4wVrdN5UNRY/U8_DuSsh15I/AAAAAAAAAAk/prHDyOwB_gE/s1600/Screenshot+2014-07-23+10.05.10.png
The disk graphs in Marvel don't show anything out of the ordinary. I'm not
sure how to check on those write_bytes and read_bytes... Where does ES
report those? I'm using Google Compute Engine, and according to their
minimal graphs, while there was a small spike in the disk I/Os, it wasn't
anything insane.

Right after the spike happened, the indexing rate spiked to 6.6k / second.
Notice that the first tag marks the VM that left the cluster, and the second
tag shows the cluster going back to "green". Considering this happened after
the node "left", does it give us any clues as to the reason?

https://lh3.googleusercontent.com/-8hmcRo-kH6k/U8_D4QliadI/AAAAAAAAAAs/U69wNWbvbcA/s1600/Screenshot+2014-07-23+10.08.01.png

A few observations:

  • The es process is still running on the "dead machine". I can see it when
    I ssh into the VM (via ps aux and docker)
  • The VM doesn't show anything in the error logs, etc.
  • Running "ps aux" on that VM actually freezes (after showing the es
    process)
  • Here is the thread dump from the dead VM:

Here is some more detailed information about the cluster (a rough sketch of
the corresponding settings follows the list):

  • VMs are Google Compute Engine machines: n1-standard-4 (4 virtual CPUs,
    15 GB RAM), running CoreOS 367.1.0 with Docker 1.0.1.
  • The disks are all 100 GB SSDs, delivering 3,000 read and 3,000 write IOPS
    (or 48.00 MB/s). The Docker containers run with 10 GB max memory allocated.
  • Java version "1.7.0_55"; Java(TM) SE Runtime Environment (build
    1.7.0_55-b13); Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed
    mode)
  • ES_HEAP_SIZE = 7168M
  • mlockall is enabled
  • memlock ulimit is set to unlimited
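
Pieced together from the list above (the image name, ports and paths here are
placeholders, not our actual setup), the launch roughly looks like:

  # docker run flags: hard memory cap plus the heap size for the ES script
  docker run -d -m 10g -e ES_HEAP_SIZE=7168m -p 9200:9200 -p 9300:9300 \
      -v /data/es:/data our-es-image

  # elasticsearch.yml excerpt
  bootstrap.mlockall: true

  # memlock ulimit inside the container
  ulimit -l unlimited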



On Wed, Jul 23, 2014 at 10:19 AM, michael@modernmast.com wrote:

> Looking at the JVM GC graphs, I do see some increases there, but I'm not
> sure those are enough to cause this storm?
>
> https://lh3.googleusercontent.com/-tof8Eiwa1hw/U8-44SG-VGI/AAAAAAAAAAM/cQHRGV43r3s/s1600/Screenshot+2014-07-23+09.28.30.png

That looks like it wasn't the problem.

> The disk graphs in Marvel don't show anything out of the ordinary. I'm not
> sure how to check on those write_bytes and read_bytes... Where does ES
> report those? I'm using Google Compute Engine, and according to their
> minimal graphs, while there was a small spike in the disk I/Os, it wasn't
> anything insane.

Elasticsearch reports them in the same API as the number of reads. I
imagine they are right next to each other in marvel but I'm not sure.
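
For reference, the per-node disk counters live in the node stats API; on a
1.x cluster something like

  curl -s 'localhost:9200/_nodes/stats/fs?pretty'

should show, under fs.data for each data path, fields along the lines of
disk_reads, disk_writes, disk_read_size_in_bytes and disk_write_size_in_bytes
(the exact names vary by version, and some of them can be missing if the
underlying OS probe isn't available).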

> Right after the spike happened, the indexing rate spiked to 6.6k / second.
> Notice that the first tag marks the VM that left the cluster, and the second
> tag shows the cluster going back to "green". Considering this happened after
> the node "left", does it give us any clues as to the reason?
>
> https://lh3.googleusercontent.com/-8hmcRo-kH6k/U8_D4QliadI/AAAAAAAAAAs/U69wNWbvbcA/s1600/Screenshot+2014-07-23+10.08.01.png
>
> A few observations:
>
>   • The es process is still running on the "dead machine". I can see it when
>     I ssh into the VM (via ps aux and docker)
>   • The VM doesn't show anything in the error logs, etc.
>   • Running "ps aux" on that VM actually freezes (after showing the es
>     process)

That seems pretty nasty. Does the node respond to stuff like curl localhost:9200? Sounds like the hardware/docker is just angry.

Nik


No, the VM does not respond to curl requests. The closest thing I found to
those read bytes in the API was the _cluster/stats endpoint
--> GET _cluster/stats · GitHub

Were you referring to a different endpoint?

What're your thoughts re "angry hardware"? Insufficient resources? Are
there any known issues with CoreOS + ES?


I wanted to add another five cents worth to what Michael already described.

Before, when this happened, we used to run the Docker container without a
memory limit and gave ES a HEAP_SIZE of 10GB (10240M to be exact). When
it happened the first time, free -m revealed that the system was left with
barely a few hundred megabytes free. The second time around we figured we'd
tweak it a bit: give the Docker container a 10G memory limit and cap
ES_HEAP_SIZE at 7GB (7168M). This time it crashed and is still hung as we
speak (I've been trying to get a thread dump, but that's not happening...
jstack can't even connect to the process, it's so deeply hung), and 'top'
shows the memory usage at 9.9GB for that process. Basically the system seems
to be starved of memory.
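
(Side note, in case it helps anyone hitting the same wall: when plain jstack
can't attach, a forced dump will sometimes still come out of a wedged JVM -

  jstack -F <pid>
  kill -3 <pid>   # SIGQUIT; the dump goes to the JVM's stdout, i.e. docker logs <container>

- no guarantees, but worth a try before giving up on the thread dump.)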

The question is... how did it exceed the heap size it was given? And more to
the point, just how much memory should we be allocating? If this is simply a
shortage of memory, spinning up a new machine with a bunch more RAM is no
problem. But we just wish we weren't shooting in the dark. We've already
tried increasing the SSD size to get more IOPS (GCE ties IOPS bandwidth to
disk size).

Last but not least, is there something we should be doing to tweak the
lucene segment size?

Thanks for all your thoughts!


Here's top and df -h output... that's the best I can get for now from inside
that container.

https://gist.github.com/danielschonfeld/d75c43cce34a16a57926


Heap size isn't total memory size. It's the size of the pool Java allocates
objects from. There are tons of other memory costs, but the rule of thumb is
to set the heap to no more than 30GB and to around half of physical memory.
I imagine Docker is complicating things.

I'm not sure what Docker does with memory-mapped files, for instance.
Elasticsearch uses tons of them, so it uses tons of virtual memory. That
shouldn't be a problem, because it's virtual - though I don't know what
Docker thinks of that.
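
One way to see where the memory is actually going (field names here are from
the 1.x node stats and may differ slightly in other versions) is to compare
the JVM's view with the OS's view for the node:

  curl -s 'localhost:9200/_nodes/stats/jvm,process?pretty'

jvm.mem.heap_used_in_bytes / heap_committed_in_bytes should stay near
ES_HEAP_SIZE, while process.mem.resident_in_bytes is roughly what top reports
and process.mem.total_virtual_in_bytes includes all the mmapped index files,
so it can legitimately be far larger than the heap.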


How do you estimate the memory to configure?

Here is my rough estimation:

Docker limit = 10G
Kernel, OS services etc. ~ 1G
OS filesystem cache for ES ~50% of 10G ~ 5G
ES JVM + direct buffer + heap = 10G - 1G - 5G ~ 4G

So if you estimate ~1G for the ES JVM itself plus direct buffers, you have
3G (maybe 4G) left for ES_HEAP_SIZE. I recommend starting with 3G and
increasing slowly, but going no higher than 5G.

With mlockall=true you force Linux to abort processes that are not mlocked
(see the OOM killer - it may be what is rendering the Docker LXC unusable),
so it is not a good idea to start with mlockall=true as long as the overall
memory allocation in the Docker LXC is only a rough guess. Once you find a
safe configuration where everything is balanced out, you can enable
mlockall=true.
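
Put as concrete settings (just a sketch of the recommendation above, not a
tested configuration), that would mean starting from something like:

  # environment for the container
  ES_HEAP_SIZE=3g

  # elasticsearch.yml
  bootstrap.mlockall: false

  # docker run
  docker run -m 10g ...

and only turning bootstrap.mlockall back on once the heap size has settled.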

Jörg
