ES alerting mechanism for failure scenarios, high latency situations


(T Vinod Gupta) #1

Is there a plugin or API support for monitoring key ES metrics and alerting
DevOps when a node in a cluster fails or latency spikes for whatever
reason?

What are the best practices here, and what do people usually do?

thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAHau4yv9L%2B5zXtDQcNKmK-b_30Q2MdrTtPjHUWsDYKEgFX8hnQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Otis Gospodnetić) #2

Hi,

We use our own SPM for Elasticsearch. It has classic threshold-based
alerts as well as alerts based on automatic anomaly detection.
It's a SaaS, not a plugin, but maybe it would work for you.

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



(Jörg Prante) #3

This is a very good point; I have been thinking about this for years.

Node failures should be easy to monitor with OS services. But latency spikes
are a totally different matter.
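
For the node-failure half of the question, one option besides OS-level checks is to poll the cluster health API (`GET /_cluster/health`) and compare what it reports with what you expect. A minimal sketch, assuming a node reachable at `http://localhost:9200` and a known expected node count; the function names and the problem messages are made up for illustration:

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url="http://localhost:9200", timeout=5):
    """Return the parsed JSON body of GET /_cluster/health.

    base_url is an assumption; point it at any node of your cluster.
    """
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def health_problems(health, expected_nodes):
    """Return a list of human-readable problems found in a health response."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes are in the cluster")
    return problems
```

Something like `health_problems(fetch_cluster_health(), expected_nodes=3)` could run from cron and mail or page on a non-empty result. If no node answers, `urlopen` raises an `OSError` subclass, which is itself worth alerting on.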

It is a very, very hard job to measure anomalies in latency correctly. Just
consider the skew introduced by faulty measurement code, or by the hostile
environments JVMs run in (clocks, OSes, VMs, ...). If anomalies are detected
wrongly, alerts are either missed or false, and all of the effort leads to
annoyance or frustration.

Recently I read about Gil Tene's LatencyUtils

https://github.com/LatencyUtils/LatencyUtils

https://groups.google.com/forum/#!topic/mechanical-sympathy/oZSv5QnpAYs

which looks like a promising tool for capturing latency anomalies in histograms.

Some of this could probably be implemented as an ES plugin, but I
haven't tried LatencyUtils yet, and how to connect it to ES metrics
is still an open question for me.
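
On the ES metrics side, one possible starting point is the node stats API (`GET /_nodes/stats`), which exposes cumulative search counters per node: `indices.search.query_total` and `indices.search.query_time_in_millis`. Sampling them twice and dividing the deltas gives an average query latency for the interval. This is only a mean, so it hides exactly the tail behaviour a histogram tool is after, and the counters cover the shard-level query phase rather than the end-to-end time a client sees. A rough sketch; the helper name and the sample numbers are invented:

```python
def average_query_latency_ms(previous, current):
    """Average search query latency in ms between two samples of one node.

    Each sample is the "search" object found under nodes.<node_id>.indices
    in a GET /_nodes/stats response. Returns None when no queries ran in
    the interval or the counters went backwards (e.g. after a node restart).
    """
    queries = current["query_total"] - previous["query_total"]
    millis = current["query_time_in_millis"] - previous["query_time_in_millis"]
    if queries <= 0 or millis < 0:
        return None
    return millis / queries


# Two made-up samples taken one minute apart.
earlier = {"query_total": 1000, "query_time_in_millis": 50000}
later = {"query_total": 1200, "query_time_in_millis": 80000}
print(average_query_latency_ms(earlier, later))  # 150.0
```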

Jörg



(T Vinod Gupta) #4

I was playing around with Marvel on a test machine, and it is clear that a
lot of thought, effort and time has gone into building it. It is super. But
what would really take it to the next level is alerts: let users configure
certain kinds of events to trigger an alert, and then define rules around
latency spikes. I agree that full automation can lead to false triggers and
annoyance. But it could work like AWS CloudWatch triggers, where you say
that if the cluster/node is in a certain state (e.g. search latency > 1s
for a period of 10 min), then trigger.
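
To make the "latency > 1s for 10 min" kind of rule concrete, here is a small sketch of a sustained-threshold check. It is not how Marvel or CloudWatch implement anything; the class name, the one-sample-per-minute feed and the data are all assumptions for illustration:

```python
class SustainedThresholdRule:
    """Fire when a metric stays above a threshold for a whole time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.breach_started = None  # timestamp of the first sample of the current breach

    def observe(self, timestamp, value):
        """Feed one sample; return True when the rule should trigger."""
        if value <= self.threshold:
            self.breach_started = None  # back to normal, forget the breach
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.window_seconds


# Search latency in seconds, one made-up sample per minute.
rule = SustainedThresholdRule(threshold=1.0, window_seconds=600)
samples = [0.2, 1.4, 1.6, 0.8] + [1.5] * 11
fired_at = [
    minute for minute, latency in enumerate(samples)
    if rule.observe(minute * 60, latency)
]
print(fired_at)  # [14]
```

The short spike at minutes 1 and 2 does not fire because latency recovers at minute 3; only the breach that starts at minute 4 and lasts a full 10 minutes triggers. As written the rule keeps returning True on every later sample while the breach continues, so a real setup would also want some de-duplication.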

thanks


