Java version is 1.7.0_55. The servers have a 32GB heap, 96GB of memory, 12 logical cores, and 4 spinning disks.
Currently we have about 450GB of data on each machine; average doc size is about 1.5KB. We create a new index (4 shards, 1 replica) every N days. Right now we have 12 indices, meaning about 24 shards per node (12 indices x 4 shards x 2 copies / 4 nodes).
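For anyone checking my math, here's the back-of-the-envelope calculation (it just assumes shards end up spread evenly across the 4 data nodes):

# Rough shard-per-node count, assuming even allocation across nodes.
indices = 12
primaries_per_index = 4
copies = 2  # 1 primary + 1 replica
nodes = 4

total_shards = indices * primaries_per_index * copies   # 96
shards_per_node = total_shards // nodes                  # 24
print(total_shards, shards_per_node)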
Looking at ElasticHQ, I noticed some warnings around deleted documents. Our deleted-document percentages are in the 70s, while the pass level is 10% (!). Due to our business requirements, we have to use TTL. My understanding is that this leads to a lot of document deletions and increased merge activity. However, it seems that segments with lots of deletes aren't being merged. We stopped indexing temporarily and no merges are occurring anywhere in the system, so it isn't a throttling issue. We are using almost all default settings, but is there a particular setting I should look at?
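In case it helps, this is roughly how I've been pulling the deleted-doc numbers outside of ElasticHQ. It's just a sketch using the requests library against a node on localhost:9200; the host and the 50% flag threshold are placeholders:

# Sketch: per-index live vs. deleted doc counts from the stats API.
# Host and the 50% threshold below are placeholders.
import requests

ES = "http://localhost:9200"

stats = requests.get(ES + "/_stats/docs").json()
for name, idx in sorted(stats["indices"].items()):
    docs = idx["primaries"]["docs"]
    live, deleted = docs["count"], docs["deleted"]
    total = live + deleted
    ratio = (float(deleted) / total) if total else 0.0
    flag = "  <-- mostly deletes" if ratio > 0.5 else ""
    print("%-30s live=%-12d deleted=%-12d %.0f%%%s"
          % (name, live, deleted, ratio * 100, flag))

My understanding is that an explicit optimize with only_expunge_deletes=true would force merges that reclaim this space, but I'd rather understand why the merge policy isn't doing it on its own.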
On Jun 10, 2014, at 3:41 PM, Mark Walkom markw@campaignmonitor.com wrote:
Are you using a monitoring plugin such as Marvel or ElasticHQ? If not, installing one will give you better insight into your cluster.
You can also check the hot threads endpoint (/_nodes/hot_threads) on each node.
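Something like the following will dump the hottest threads on every node while the load is happening. This is just a rough sketch; the host, thread count, and polling interval are placeholders, so adjust to taste:

# Poll the hot threads API a few times and print the plain-text report.
# Host, thread count and sleep interval are placeholders.
import time
import requests

ES = "http://localhost:9200"

for _ in range(3):
    resp = requests.get(ES + "/_nodes/hot_threads", params={"threads": 5})
    print(resp.text)
    time.sleep(10)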
Providing a bit more info on your cluster setup may help as well: index size and count, server specs, Java version, that sort of thing.
Regards,
Mark Walkom
Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com
On 11 June 2014 00:41, Kireet Reddy kireet@feedly.com wrote:
On our 4-node test cluster (1.1.2), seemingly out of the blue one node experienced very high CPU usage and became unresponsive, and then about 8 hours later another node hit the same issue. The processes themselves stayed alive, GC activity was normal, and they didn't throw an OutOfMemoryError. The nodes did leave the cluster, though, perhaps due to the unresponsiveness. The only errors in the log files were a bunch of messages like:
org.elasticsearch.search.SearchContextMissingException: No search context found for id ...
along with errors about the search queue being full. We see the SearchContextMissingException occasionally during normal operation, but during the high-CPU period it happened far more often.
I don't think we had an unusually high query load during that time: the other 2 nodes showed normal CPU usage, and things had run smoothly for the prior week.
We are going to restart testing, but is there anything we can do to better understand what happened? Maybe change a particular log level or do something while the problem is happening, assuming we can reproduce the issue?
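For example, would it be worth capturing something like the search thread pool counters on each node while the spike is happening? A rough sketch of what I have in mind (the host is a placeholder, and the field names are as I read them from the node stats output):

# Sketch: search thread pool queue depth and rejection counts per node.
import requests

ES = "http://localhost:9200"

stats = requests.get(ES + "/_nodes/stats/thread_pool").json()
for node_id, node in stats["nodes"].items():
    search = node["thread_pool"]["search"]
    print("%s active=%d queue=%d rejected=%d"
          % (node.get("name", node_id),
             search["active"], search["queue"], search["rejected"]))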