Node failures


(Kireet Reddy) #1

On our 4 node test cluster (1.1.2), seemingly out of the blue we had one
node experience very high cpu usage and become unresponsive and then after
about 8 hours another node experienced the same issue. The processes
themselves stayed alive, gc activity was normal, they didn't experience an
OutOfMemoryError. The nodes left the cluster though, perhaps due to the
unresponsiveness. The only errors in the log files were a bunch of messages
like:

org.elasticsearch.search.SearchContextMissingException: No search context
found for id ...

and errors about the search queue being full. We see the
SearchContextMissingException occasionally during normal operation, but
during the high cpu period it happened quite a bit.

I don't think we had an unusually high number of queries during that time
because the other 2 nodes had normal cpu usage and for the prior week
things ran smoothly.

We are going to restart testing, but is there anything we can do to better
understand what happened? Maybe change a particular log level or do
something while the problem is happening, assuming we can reproduce the
issue?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/58351342-da89-43ad-a1be-194d8b608457%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2

Are you using a monitoring plugin such as Marvel or ElasticHQ? If not,
installing one will give you better insight into your cluster.
You can also query the hot threads endpoint to check each node:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html
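
As a rough sketch of how the hot threads endpoint could be polled while a node is misbehaving, the Python below builds the request URL and fetches the plain-text report. The host, port, and node name are placeholders, not values from this thread, and the `threads` and `interval` query parameters are used as described in the linked documentation.

```python
import urllib.request
from urllib.parse import urlencode


def hot_threads_url(host="localhost", port=9200, node=None, threads=3, interval="500ms"):
    """Build the URL for the nodes hot threads endpoint.

    host, port, and node are placeholders; adjust them for your cluster.
    With node=None the request covers every node in the cluster.
    """
    if node is None:
        path = "/_nodes/hot_threads"
    else:
        path = "/_nodes/{}/hot_threads".format(node)
    query = urlencode({"threads": threads, "interval": interval})
    return "http://{}:{}{}?{}".format(host, port, path, query)


def fetch_hot_threads(url, timeout=10):
    """Return the plain-text hot threads report (needs a reachable node)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_hot_threads(hot_threads_url(node="node-1"))` a few times during the high cpu period, and saving each report, should show which threads (search, merge, TTL purging, and so on) are consuming the cpu.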

Providing a bit more info on your cluster setup may help as well: index
size and count, server specs, Java version, that sort of thing.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(Kireet Reddy) #3

Java version is 1.7.0_55. The servers have a 32GB heap, 96GB of memory, 12 logical cores, and 4 spinning disks.

Currently we have about 450GB of data on each machine, average doc size is about 1.5KB. We create an index (4 shards, 1 replica) every N days. Right now we have 12 indices, meaning about 24 shards/node (12 * 4 * 2 / 4).
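
The shards-per-node figure follows from 12 indices, each with 4 primary shards and 1 replica (so 2 copies of every shard), spread over 4 nodes. A small Python sketch of that arithmetic, assuming shards are allocated evenly across nodes (the function name is only illustrative):

```python
def shards_per_node(indices, primaries_per_index, replicas, nodes):
    """Average number of shard copies per node, assuming even allocation."""
    copies_per_index = primaries_per_index * (1 + replicas)
    return indices * copies_per_index / nodes


# 12 indices * 4 primaries * 2 copies / 4 nodes
print(shards_per_node(indices=12, primaries_per_index=4, replicas=1, nodes=4))  # 24.0
```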

Looking at ElasticHQ, I noticed some warnings around documents deleted. Our percentages are in the 70s and the pass level is 10% (!). Due to our business requirements, we have to use TTL. My understanding is this leads to a lot of document deletions and increased merge activity. However it seems that maybe segments with lots of deletes aren't being merged? We stopped indexing temporarily and there are no merges occurring anywhere in the system so it's not a throttling issue. We are using almost all default settings, but is there some setting in particular I should look at?



(Mark Walkom) #4

TTL does use a lot of resources as it constantly scans for expired docs.
It'd be more efficient to switch to daily indexes and then drop them,
though that might not fit your business requirements.

You can try forcing an optimise on an index,
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html.
It's very resource intensive, but if it reduces your segment count
then it may point to where the problem lies.
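
A minimal sketch of issuing such an optimise request from Python follows. The host, port, and index name are placeholders. The `only_expunge_deletes` flag is the optimize API option that, as far as I know, restricts the operation to segments containing deleted documents; treat its exact behaviour on your version as something to confirm against the linked documentation.

```python
import urllib.request
from urllib.parse import quote, urlencode


def optimize_url(index, host="localhost", port=9200, only_expunge_deletes=True):
    """Build the optimize URL for one index (host/port/index are placeholders)."""
    query = urlencode({"only_expunge_deletes": str(only_expunge_deletes).lower()})
    return "http://{}:{}/{}/_optimize?{}".format(host, port, quote(index, safe=""), query)


def run_optimize(url, timeout=3600):
    """POST the optimize request and return the JSON response body as text.

    The call is resource intensive and may take a long time on a large
    index, hence the generous timeout.
    """
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Running it against one index at a time, during a quiet period, keeps the extra merge load contained.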

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(Kireet Reddy) #5

As soon as we restarted indexing, we saw a lot of merge activity and the deleted documents percentage went to around 25%. Does indexing activity trigger merges? Currently, there is not much merge activity, but some indices still have high deleted document counts. E.g. we have one index with count around 17m and deleted at 15m, but no merge activity. I am wondering if merges aren't scheduled for that index because writes to that index are infrequent.
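
To track which indices still carry a lot of deletes, something like the following Python sketch could be run against the `docs` section of the index stats. It assumes `count` holds live documents and `deleted` holds documents marked deleted but not yet merged away, and it defines the ratio as deleted / (count + deleted); a monitoring tool may define its percentage differently. The index names and the 25% threshold are only illustrative.

```python
def deleted_ratio(docs):
    """Fraction of documents in an index that are marked deleted.

    `docs` is the "docs" object from the index stats, for example
    {"count": 17000000, "deleted": 15000000}.
    """
    total = docs["count"] + docs["deleted"]
    return docs["deleted"] / total if total else 0.0


def indices_over_threshold(stats_by_index, threshold=0.25):
    """Return index names whose deleted ratio exceeds `threshold`, worst first."""
    ratios = {name: deleted_ratio(docs) for name, docs in stats_by_index.items()}
    flagged = [name for name, ratio in ratios.items() if ratio > threshold]
    return sorted(flagged, key=lambda name: ratios[name], reverse=True)
```

For the index mentioned above (about 17m live and 15m deleted), this definition gives a ratio of roughly 0.47.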



(Kireet Reddy) #6

The problem seems to be occurring again. This time we have better
monitoring set up. The node i/o seems normal, but CPU is hovering around
80% usage rather than the normal 5-10%. Document indexing rates seem
normal, but merge activity on the node is much higher than usual. I am not
sure if it matters, but the ‘current time’ index that is actually taking
writes seems to always have active merges.
The documentation warns about merges on low i/o environments, but what is
puzzling to me is that we aren't seeing any drastic increase in i/o
activity. The problem seems to be heavy cpu usage, not i/o wait. And if I
refresh the stats endpoint, I see the merge count continually increasing,
almost as if the merge process is caught in some sort of endless loop and
it's going as fast as the node's cpu will allow. The other nodes' merge
counts also increase, but cpu and i/o are normal there and pretty low (5%
cpu usage, fairly low i/o as well).
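
Rather than refreshing the stats endpoint by hand, the merge counters could be sampled twice and compared. The Python sketch below assumes each sample is the `merges` object from a node's stats (with `total` and `total_time_in_millis` fields) and that the caller supplies the time between samples; the function name and the example numbers are made up for illustration.

```python
def merge_rates(before, after, elapsed_seconds):
    """Compare two samples of a node's "merges" stats object.

    Each sample is a dict such as
    {"current": 2, "total": 1200, "total_time_in_millis": 340000}.
    """
    merges = after["total"] - before["total"]
    millis = after["total_time_in_millis"] - before["total_time_in_millis"]
    return {
        # How quickly the merge count is climbing.
        "merges_per_minute": 60.0 * merges / elapsed_seconds,
        # Many very short merges would suggest small segments being merged repeatedly.
        "avg_millis_per_merge": millis / merges if merges else 0.0,
    }


before = {"total": 1000, "total_time_in_millis": 200000}
after = {"total": 1060, "total_time_in_millis": 203000}
print(merge_rates(before, after, elapsed_seconds=30))
```

Comparing these two numbers between the busy node and a healthy node would show whether the busy node is doing more merges, longer merges, or both.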



(system) #7