I wanted to add another five cents worth to what Michael already described.
Before, when this happened, we were running the Docker container without a
memory limit and giving ES a HEAP_SIZE of 10GB (10240M to be exact). The
first time it happened, running free -m revealed that the system was left
with barely a few hundred megabytes free. The second time around we figured
we'd tweak it a bit: give the Docker container a 10G memory limit and cap
ES_HEAP_SIZE at 7.5GB (7168M). This time it crashed and is still hung as we
speak (we've been trying to get a thread dump, but that's not happening...
jstack can't even connect to that process, it's so deeply hung). Running
'top' shows memory usage at 9.9GB for that process. Basically the system
seems starved of memory.
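For concreteness, the second setup was roughly along these lines (the image
name and ports here are placeholders, not our exact command):

    # second attempt: 10G container limit, 7GB (7168M) heap
    docker run -d --name elasticsearch \
      -m 10g \
      -e ES_HEAP_SIZE=7168m \
      -p 9200:9200 -p 9300:9300 \
      dockerfile/elasticsearch

The first run was the same thing, just without the -m limit and with
ES_HEAP_SIZE=10240m.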
The question is... how did it exceed the heap size we gave it? And more to
the point, just how much memory should we be allocating? If this is simply
a memory shortage, spinning up a new machine with a bunch more RAM is no
problem. But we'd rather not be shooting in the dark.
We've already tried increasing the SSD size to get more IOPS (GCE ties
IOPS bandwidth to disk size).
Last but not least, is there something we should be doing to tweak the
Lucene segment size?
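(By segment size I mean something like the tiered merge policy settings;
if I have the setting names right, that would be along the lines of the
following, where our_index is just a placeholder and the values are what I
understand the defaults to be:

    curl -XPUT 'localhost:9200/our_index/_settings' -d '{
      "index.merge.policy.max_merged_segment": "5gb",
      "index.merge.policy.segments_per_tier": 10
    }'

...but we haven't touched any of that yet.)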
Thanks for all your thoughts!
On Wednesday, July 23, 2014 12:15:51 PM UTC-4, mic...@modernmast.com wrote:
No, the VM does not respond to curl requests. The closest thing I found to
those read bytes in the API was the _cluster/stats endpoint -->
GET _cluster/stats · GitHub
Were you referring to a different endpoint?
What're your thoughts re "angry hardware"? Insufficient resources? Are
there any known issues with CoreOS + ES?
On Wednesday, July 23, 2014 11:12:10 AM UTC-4, Nikolas Everett wrote:
On Wed, Jul 23, 2014 at 10:19 AM, mic...@modernmast.com wrote:
Looking at the JVM GC graphs, I do see some increases there, but not
sure those are enough to cause this storm?
https://lh6.googleusercontent.com/-4wVrdN5UNRY/U8_DuSsh15I/AAAAAAAAAAk/prHDyOwB_gE/s1600/Screenshot+2014-07-23+10.05.10.png
That looks like it wasn't the problem.
The disk graphs in Marvel don't show anything out of the ordinary. I'm
not sure how to check on those write_bytes and read_bytes... Where does ES
report those? I'm using Google Compute Engine, and according to their
minimal graphs, while there was a small spike in the disk I/Os, it wasn't
anything insane.
Elasticsearch reports them in the same API as the number of reads. I
imagine they are right next to each other in Marvel but I'm not sure.
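If I remember right it's the node stats API, something like:

    curl 'localhost:9200/_nodes/stats/fs?pretty'

and the fs section should have disk_read_size_in_bytes /
disk_write_size_in_bytes next to the disk_reads / disk_writes counters,
but double-check the exact field names, I'm going from memory.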
Right after the spike happened, the indexing rate jumped to 6.6k/second.
Notice that the first tag marks the VM that left the cluster, and the
second tag shows the cluster going back to "green". Considering this
happened after the node "left", does this give us any clues as to the
reason?
https://lh3.googleusercontent.com/-8hmcRo-kH6k/U8_D4QliadI/AAAAAAAAAAs/U69wNWbvbcA/s1600/Screenshot+2014-07-23+10.08.01.png
A few observations:
- The ES process is still running on the "dead machine". I can see it
when I ssh into the VM (through ps aux, and Docker)
- The VM doesn't show anything in the error logs, etc.
- Running "ps aux" on that VM actually freezes (after showing the ES
process)
That seems pretty nasty. Does the node respond to stuff like curl localhost:9200?
Sounds like the hardware/docker is just angry.
Nik