We currently monitor our app by having a monitoring tool (Pingdom) request
a health page from our app, which in turn retrieves and displays the
Elasticsearch cluster info, e.g.
If the monitoring process can't reach our app, or our app can't reach
Elasticsearch, we'll get an error and an alert. However, this doesn't tell
us anything about node and index health. I've made a page that calls
ClusterClient.health(level='indices'), but want to confirm:
1. Is this sufficient for surfacing any issue with our Elasticsearch
   infrastructure? and
2. Does this call block query requests/backups, consume a lot of
   resources, or otherwise have enough impact that we wouldn't want to be
   calling it every 60 seconds 24x7?
We don't need our monitoring page to give us a full diagnosis of all
conceivable issues. We just need it to trigger an alert that there is an
issue, so we know we have some work to do, while having minimal impact on
overall application performance.
Any recommendations on what we should monitor to achieve those two mandates
would be greatly appreciated.
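For illustration, here is a rough sketch of how a health page might turn the
response of ClusterClient.health(level='indices') into a pass/fail signal.
The function name, the expected_nodes parameter, and the choice of which
statuses count as healthy are all assumptions made for this example, not
anything prescribed by Elasticsearch. The response fields used (status,
timed_out, number_of_nodes, indices) are the ones the cluster health API
returns at the indices level.

```python
def evaluate_cluster_health(health, expected_nodes=None, ok_statuses=("green",)):
    """Return a list of problems found in a _cluster/health response dict.

    `health` is the parsed response, e.g. from elasticsearch-py:
        health = Elasticsearch().cluster.health(level="indices")
    An empty list means nothing looked wrong.
    """
    problems = []

    status = health.get("status")
    if status not in ok_statuses:
        problems.append("cluster status is %s" % status)

    if health.get("timed_out"):
        problems.append("health request timed out")

    # expected_nodes is a hypothetical setting: the node count you normally run.
    nodes = health.get("number_of_nodes", 0)
    if expected_nodes is not None and nodes < expected_nodes:
        problems.append(
            "only %d of %d expected nodes present" % (nodes, expected_nodes)
        )

    # With level="indices" the response carries a per-index status as well.
    for name, index in sorted(health.get("indices", {}).items()):
        if index.get("status") not in ok_statuses:
            problems.append("index %s is %s" % (name, index.get("status")))

    return problems
```

The health page could then return an HTTP error status whenever the list is
non-empty, so that Pingdom raises an alert. On a single-node cluster with
replicas configured, yellow is the normal state, so you would probably pass
ok_statuses=("green", "yellow") there.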
You probably want to monitor each node as well. _nodes/stats has useful
disk/cpu/heap/gc stats. It also has information about thread usage and
completed tasks, which you can use to monitor search/index growth.
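As a rough sketch of what checking those node stats could look like: the
function below walks a parsed _nodes/stats response and flags high heap use,
low disk, and deep thread pool queues. The thresholds and the function name
are made up for the example, and the field paths (jvm.mem.heap_used_percent,
fs.total.available_in_bytes, thread_pool.<name>.queue) are what I'd expect
to see in the response, so check them against your Elasticsearch version
before relying on this.

```python
def check_node_stats(stats, max_heap_percent=85, min_disk_free_ratio=0.15,
                     max_queue=100):
    """Return a list of per-node problems found in a _nodes/stats response."""
    problems = []
    for node_id, node in sorted(stats.get("nodes", {}).items()):
        name = node.get("name", node_id)

        # JVM heap pressure.
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap > max_heap_percent:
            problems.append("%s: heap at %d%%" % (name, heap))

        # Remaining disk space, as a fraction of the total.
        total = node.get("fs", {}).get("total", {})
        size = total.get("total_in_bytes")
        avail = total.get("available_in_bytes")
        if size and avail is not None:
            ratio = float(avail) / size
            if ratio < min_disk_free_ratio:
                problems.append(
                    "%s: only %.0f%% disk available" % (name, 100.0 * ratio)
                )

        # Work piling up in any thread pool (search, index, bulk, ...).
        for pool_name, pool in sorted(node.get("thread_pool", {}).items()):
            if pool.get("queue", 0) > max_queue:
                problems.append(
                    "%s: %s queue at %d" % (name, pool_name, pool["queue"])
                )
    return problems
```

Missing fields are skipped rather than treated as failures, so a response
from a version that lays the stats out differently produces no false alarms,
but also no warnings.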
I don't fully know the answer to #2, but I assume _nodes and _cluster are
served by management threads. We hit _nodes/stats and _cluster/health
every 5 minutes and haven't seen any issues. Depending on your cluster
size, I don't know if I'd do 60 seconds; _nodes/stats can take some time to
gather if there are a lot of nodes.
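If the monitoring tool has to poll every 60 seconds but you'd rather hit
_nodes/stats less often, one option is to cache the result inside the health
page. This is only a sketch of that idea, not something we run: the
CachedCheck class, its parameters, and the 300-second default are all
invented for the example.

```python
import time


class CachedCheck(object):
    """Reuse the result of an expensive check for ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.time):
        self._fetch = fetch          # zero-argument callable that does the real work
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._expires_at = None
        self._value = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # If fetch raises, nothing is cached and the error propagates,
            # so an unreachable cluster still fails the health page.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


# Example wiring with elasticsearch-py (not executed here):
#   es = Elasticsearch()
#   node_stats = CachedCheck(lambda: es.nodes.stats(), ttl_seconds=300)
#   stats = node_stats.get()
```

The trade-off is that a node problem can go unnoticed for up to ttl_seconds,
so the cheaper _cluster/health call is probably better left uncached.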
On Monday, March 23, 2015 at 11:11:36 AM UTC-4, Joel Potischman wrote: