Recommendations for health monitoring

We currently monitor our app by having a monitoring tool (Pingdom) retrieve
a health page from our app that retrieves and displays the Elasticsearch
cluster info, e.g.

{
  "status": 200,
  "name": "whatever",
  "cluster_name": "whatever_dev",
  "version": {
    "number": "1.4.4",
    "build_hash": "c38f773fc81201d1abdfde1ca2746fab58efa912",
    "build_timestamp": "2015-02-19T13:05:36Z",
    "build_snapshot": false,
    "lucene_version": "4.10.3"
  },
  "tagline": "You Know, for Search"
}

If the monitoring process can't reach our app, or our app can't reach
Elasticsearch, we'll get an error and an alert. However, this doesn't tell
us anything about node and index health. I've made a page that calls
ClusterClient.health(level='indices'), but I want to confirm:

  1. Is this sufficient for surfacing any issue with our Elasticsearch
    infrastructure? and
  2. Does this call block query requests/backups, consume a lot of
    resources, or otherwise create impacts such that we wouldn't want to be
    calling it every 60 seconds 24x7?

We don't need our monitoring page to give us a full diagnosis of all
conceivable issues. We just need it to trigger an alert when there is an
issue, so we know we have some work to do, while having minimal impact on
overall application performance.
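
For illustration, here is a minimal sketch of the kind of check such a page could run on the dict returned by ClusterClient.health(level='indices'). The function name, the worst_allowed and expected_nodes parameters, and the choice of HTTP 503 as the alert signal are all assumptions made for this example, not part of our current page. The keys it reads (status, timed_out, number_of_nodes, indices) are the ones the cluster health API returns.

```python
# Severity ranking for the three health colours Elasticsearch reports.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def evaluate_cluster_health(health, worst_allowed="green", expected_nodes=None):
    """Map a cluster health response to (http_status, problems).

    `health` is the dict returned by, for example,
    Elasticsearch().cluster.health(level="indices").
    `worst_allowed` and `expected_nodes` are hypothetical knobs for this sketch.
    """
    limit = SEVERITY[worst_allowed]
    problems = []

    if health.get("timed_out"):
        problems.append("health request timed out")

    # A missing or unknown status is treated as red.
    status = health.get("status", "red")
    if SEVERITY.get(status, 2) > limit:
        problems.append("cluster status is " + status)

    nodes = health.get("number_of_nodes", 0)
    if expected_nodes is not None and nodes < expected_nodes:
        problems.append("only %d of %d nodes present" % (nodes, expected_nodes))

    # Per-index entries are only present when level="indices" (or "shards").
    for name, index in sorted(health.get("indices", {}).items()):
        index_status = index.get("status", "red")
        if SEVERITY.get(index_status, 2) > limit:
            problems.append("index %s is %s" % (name, index_status))

    # A non-200 status is what the external monitor would alert on.
    return (503 if problems else 200), problems
```

On a single-node dev cluster, replicas can never be assigned, so the status stays yellow. Passing worst_allowed="yellow" there would avoid a permanent alert.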

Any recommendations on what we should monitor to achieve those two mandates
would be greatly appreciated.

Thanks,

-joel

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d9290f69-5150-4824-9ef4-6011b35ed959%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You probably want to monitor each node as well. _nodes/stats has useful
disk/cpu/heap/gc stats. It also reports thread usage and completed task
counts, which you can use to monitor search/index growth.
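
As a rough sketch of how those stats could feed an alert, the function below scans a _nodes/stats response for high heap usage, low disk space and thread pool rejections. The function name and the thresholds are made up for this example. The field paths (jvm.mem.heap_used_percent, fs.total.available_in_bytes, thread_pool.<pool>.rejected) are what I'd expect from a 1.x response, so check them against the output of your own cluster before relying on them.

```python
def check_node_stats(stats, max_heap_percent=85, min_disk_free_fraction=0.15):
    """Return a list of warning strings for a _nodes/stats response dict.

    Thresholds are illustrative defaults, not recommendations.
    """
    warnings = []
    for node_id, node in sorted(stats.get("nodes", {}).items()):
        name = node.get("name", node_id)

        # JVM heap pressure.
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap > max_heap_percent:
            warnings.append("%s: heap at %d%%" % (name, heap))

        # Disk space summed over the node's data paths.
        fs_total = node.get("fs", {}).get("total", {})
        total = fs_total.get("total_in_bytes")
        available = fs_total.get("available_in_bytes")
        if total and available is not None:
            fraction = available / float(total)
            if fraction < min_disk_free_fraction:
                warnings.append("%s: only %.0f%% disk available" % (name, 100.0 * fraction))

        # Rejections are cumulative since node start; a real check would
        # compare against the previous sample instead of against zero.
        for pool, pool_stats in sorted(node.get("thread_pool", {}).items()):
            rejected = pool_stats.get("rejected", 0)
            if rejected > 0:
                warnings.append("%s: %s pool rejected %d tasks" % (name, pool, rejected))
    return warnings
```

With elasticsearch-py, the input would come from something like Elasticsearch().nodes.stats().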

I don't fully know the answer to #2, but I assume _nodes and _cluster
requests are served by management threads. We hit _nodes/stats and
_cluster/health every 5 minutes and haven't seen any issues. Depending on
your cluster size, I'm not sure I'd poll every 60 seconds: _nodes/stats can
take some time to gather if there are a lot of nodes.
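
One way to find out for your own cluster is to time the calls and compare the result with the polling interval you have in mind. This is only a sketch: timed_call and its max_fraction budget are hypothetical names I made up here, and fetch stands for whatever function issues the request (for example a wrapper around the _nodes/stats call).

```python
import time


def timed_call(fetch, interval_seconds=60.0, max_fraction=0.1):
    """Run fetch() and report how long it took.

    Returns (result, elapsed_seconds, too_slow), where too_slow is True if
    the call used more than max_fraction of the polling interval.
    """
    start = time.monotonic()
    result = fetch()
    elapsed = time.monotonic() - start
    too_slow = elapsed > interval_seconds * max_fraction
    return result, elapsed, too_slow
```

If too_slow comes back True regularly, that suggests polling less often or dropping the more expensive call from the frequent check.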

(In reply to Joel Potischman's message of Monday, March 23, 2015 at 11:11:36 AM UTC-4.)