Not receiving email for CPU usage

alerting

(Iqbal Nazir) #1

I am working with Watcher. I don't receive any email for CPU and memory usage. I know my email configuration in
elasticsearch.yml is correct because I receive email for another watch (i.e. event_critical_watch). I have followed https://www.elastic.co/guide/en/watcher/current/watching-marvel-data.html#watching-cpu-usage and set the CPU usage threshold to 5% just to check whether I receive any email. After reading a similar post in this forum, I ran POST _watcher/watch/cpu_usage/_execute, which gives me the output shown in the first comment below.

I have checked in Marvel that my node is consuming more than 10% CPU all
the time. Still, I don't receive any email. Does anyone have a solution for
me?

Just FYI: I have checked the .marvel-* index in Kibana, but I didn't find any "os.cpu.user" field as mentioned in the guide. Could that be the reason? If so, why is that field not there?

(I'm a beginner in Elasticsearch and everything, so a detailed answer
would be really appreciated.)
thanks in advance.
--Iqbal


(Iqbal Nazir) #2

{
"_id": "cpu_usage_220-2016-06-09T10:35:15.687Z",
"watch_record": {
"watch_id": "cpu_usage",
"state": "execution_not_needed",
"trigger_event": {
"type": "manual",
"triggered_time": "2016-06-09T10:35:15.687Z",
"manual": {
"schedule": {
"scheduled_time": "2016-06-09T10:35:15.687Z"
}
}
},
"input": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
".marvel-*"
],
"types": [],
"body": {
"size": 0,
"query": {
"filtered": {
"filter": {
"range": {
"@timestamp": {
"gte": "now-2m",
"lte": "now"
}
}
}
}
},
"aggs": {
"minutes": {
"date_histogram": {
"field": "@timestamp",
"interval": "minute"
},
"aggs": {
"nodes": {
"terms": {
"field": "node.name.raw",
"size": 10,
"order": {
"cpu": "desc"
}
},
"aggs": {
"cpu": {
"avg": {
"field": "os.cpu.user"
}
}
}
}
}
}
}
}
}
}
},
"condition": {
"script": "if (ctx.payload.aggregations.minutes.buckets.size() == 0) return false; def latest = ctx.payload.aggregations.minutes.buckets[-1]; def node = latest.nodes.buckets[0]; return node && node.cpu && node.cpu.value >= 5;"
},
"messages": [],
"result": {
"execution_time": "2016-06-09T10:35:15.687Z",
"execution_duration": 1,
"input": {
"type": "search",
"status": "success",
"payload": {
"_shards": {
"total": 2,
"failed": 0,
"successful": 2
},
"hits": {
"hits": [],
"total": 0,
"max_score": 0
},
"took": 1,
"timed_out": false,
"aggregations": {
"minutes": {
"buckets": []
}
}
},
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
".marvel-*"
],
"types": [],
"template": {
"template": {
"size": 0,
"query": {
"filtered": {
"filter": {
"range": {
"@timestamp": {
"gte": "now-2m",
"lte": "now"
}
}
}
}
},
"aggs": {
"minutes": {
"date_histogram": {
"field": "@timestamp",
"interval": "minute"
},
"aggs": {
"nodes": {
"terms": {
"field": "node.name.raw",
"size": 10,
"order": {
"cpu": "desc"
}
},
"aggs": {
"cpu": {
"avg": {
"field": "os.cpu.user"
}
}
}
}
}
}
}
},
"params": {
"ctx": {
"metadata": null,
"watch_id": "cpu_usage",
"id": "cpu_usage_220-2016-06-09T10:35:15.687Z",
"trigger": {
"triggered_time": "2016-06-09T10:35:15.687Z",
"scheduled_time": "2016-06-09T10:35:15.687Z"
},
"vars": {},
"execution_time": "2016-06-09T10:35:15.687Z"
}
}
}
}
}
},
"condition": {
"type": "script",
"status": "success",
"met": false
},
"actions": []
}
}
}


(Alexander Reelsen) #3

Hey

The last five lines tell you the important part:

...
"condition": {
"type": "script",
"status": "success",
"met": false
},
...

This means that the condition returned false. You should go back and evaluate the condition more closely. Maybe you referenced a wrong path somewhere?

--Alex


(Iqbal Nazir) #4

Hi..
Thanks for the quick reply. My condition field is the same as mentioned in the link (https://www.elastic.co/guide/en/watcher/current/watching-marvel-data.html#watching-cpu-usage). I have just changed 75 to 5 to check if I receive email.

"condition": {
"script": "if (ctx.payload.aggregations.minutes.buckets.size() == 0) return false; def latest = ctx.payload.aggregations.minutes.buckets[-1]; def node = latest.nodes.buckets[0]; return node && node.cpu && node.cpu.value >= 5;"
},

What could be the mistake here? Should I adjust anything according to my settings in Elasticsearch? I have only one node, called 'My 1st node'.
...
Iqbal


(Alexander Reelsen) #5

Hey,

please take your time and examine the result of the execute watch API. Check the result, which contains the search response: you will find that no hits at all are returned and the buckets array is empty. So either you are querying the wrong index, or the index does not exist on your local cluster.
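As a sketch of that check (assuming the execute-API response has been parsed into a dict, like the payload shown earlier in this thread):

```python
# The two things worth checking in the execute result before debugging
# the condition: total hits and the histogram buckets.
payload = {
    "hits": {"hits": [], "total": 0, "max_score": 0},
    "aggregations": {"minutes": {"buckets": []}},
}

no_hits = payload["hits"]["total"] == 0
no_buckets = len(payload["aggregations"]["minutes"]["buckets"]) == 0

if no_hits:
    print("query matched no documents - check the index name and time range")
if no_buckets:
    print("no histogram buckets - the condition will always return false")
```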

--Alex


(Alexander Reelsen) #6

Hey,

I took a look at the example, and I think it does not reflect the current Marvel stats.

  • Can you replace the two occurrences of @timestamp with timestamp?
  • Can you replace the mention of node.name.raw with node.name?
  • Can you change the heap percent mention from jvm.mem.heap_used_percent to node_stats.jvm.mem.heap_used_percent?
  • Last but not least, can you add "types": ["node_stats"] after the indices part so the query is restricted to the correct type?

Let's see if that changes anything!

--Alex


(Iqbal Nazir) #7

Hi Alex,

jvm.mem.heap_used_percent is not in the CPU usage example, but in the memory usage example. Still, I tried with the memory usage example and changed the terms as you suggested, but there is no change in the execute result:

        }
      },
      "condition": {
        "type": "script",
        "status": "success",
        "met": false
      },
      "actions": []
    }
  }

(Alexander Reelsen) #8

Hey,

can you provide the full watch you are testing with? Also, please put it in appropriate formatting tags; see here how to use code blocks. This makes it much easier for others.

--Alex


(Iqbal Nazir) #9

Hi,
please find my complete watch below:

PUT _watcher/watch/mem_watch
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          ".marvel-*"
        ],
        "types": [
          "node_stats"
        ],
        "body": {
          "size": 0,
          "query": {
            "filtered": {
              "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-2m",
                    "lte": "now"
                  }
                }
              }
            }
          },
          "aggs": {
            "minutes": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "minute"
              },
              "aggs": {
                "nodes": {
                  "terms": {
                    "field": "node.name",
                    "size": 10,
                    "order": {
                      "memory": "desc"
                    }
                  },
                  "aggs": {
                    "memory": {
                      "avg": {
                        "field": "node_stats.jvm.mem.heap_used_percent"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "throttle_period": "2m",
  "condition": {
    "script": "if (ctx.payload.aggregations.minutes.buckets.size() == 0) return false; def latest = ctx.payload.aggregations.minutes.buckets[-1]; def node = latest.nodes.buckets[0]; return node && node.memory && node.memory.value >= 5;"
  },
  "actions": {
    "send_email": {
      "transform": {
        "script": "def latest = ctx.payload.aggregations.minutes.buckets[-1]; return latest.nodes.buckets.findAll { return it.memory && it.memory.value >=5 };"
      },
      "email": {
        "to": "user@mycompany.com",
        "subject": "Watcher Notification - HIGH MEMORY USAGE",
        "body": "Nodes with HIGH MEMORY Usage (above 5%):\n\n{{#ctx.payload._value}}\"{{key}}\" - Memory Usage is at {{memory.value}}%\n{{/ctx.payload._value}}"
      }
    }
  }
}

(Alexander Reelsen) #10

Hey,

the field for the terms agg must be source_node.name instead of node.name; my fault.

--Alex


(Iqbal Nazir) #11

Hi Alex,
It has worked :smiley: Thanks a lot.
Is there any way to edit a watch?
To make a small change, I have to DELETE and PUT again to get "created": true


(Alexander Reelsen) #12

Hey,

just put it again, it's fine. The watch will be overwritten.

--Alex


(Iqbal Nazir) #13
Hi,
Thanks again.
Now could you please review my cpu_usage watch? I'm not receiving any email for high CPU usage. Here is my watch for that:


        PUT _watcher/watch/cpu_usage
        {
          "trigger": {
            "schedule": {
              "interval": "1m"
            }
          },
          "input": {
            "search": {
              "request": {
                "indices": 
                "types":["node_stats"]
                [
                  ".marvel-*"
                ],
                "body": {
                  "size" : 0,
                  "query": {
                    "filtered": {
                      "filter": {
                        "range": {
                          "timestamp": {
                            "gte": "now-2m",
                            "lte": "now"
                          }
                        }
                      }
                    }
                  },
                  "aggs": {
                    "minutes": {
                      "date_histogram": {
                        "field": "timestamp",
                        "interval": "minute"
                      },
                      "aggs": {
                        "nodes": {
                          "terms": {
                            "field": "source_node.name",
                            "size": 10,
                            "order": {
                              "cpu": "desc"
                            }
                          },
                          "aggs": {
                            "cpu": {
                              "avg": {
                                "field": "os.cpu.user"
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          "throttle_period": "1m", 
          "condition": {
            "script":  "if (ctx.payload.aggregations.minutes.buckets.size() == 0) return false; def latest = ctx.payload.aggregations.minutes.buckets[-1]; def node = latest.nodes.buckets[0]; return node && node.cpu && node.cpu.value >= 5;"
          },
          "actions": {
            "send_email": { 
              "transform": {
                "script": "def latest = ctx.payload.aggregations.minutes.buckets[-1]; return latest.nodes.buckets.findAll { return it.cpu && it.cpu.value >= 5 };"
              },
              "email": {
                "to": "user@mycompany.com", 
                "subject": "Watcher Notification - HIGH CPU USAGE",
                "body": "Nodes with HIGH CPU Usage (above 5%):\n\n{{#ctx.payload._value}}\"{{key}}\" - CPU Usage is at {{cpu.value}}%\n{{/ctx.payload._value}}"
              }
            }
          }
        }

(Alexander Reelsen) #14

Hey,

please execute the query standalone first and see if you get back any buckets. If not, execute a search and see where the documents differ.

--Alex


(Iqbal Nazir) #15

Hi Alex,

I have done POST _watcher/watch/cpu_usage/_execute and I think I haven't got any buckets.

   "buckets": [
                      {
                        "doc_count": 4,
                        "cpu": {
                          "value": null

Then I also compared mem_watch and cpu_usage, and changed os.cpu.user to node_stats.os.cpu.user, with no success so far. I am not an expert, and maybe that's why I'm missing something.


(Alexander Reelsen) #16

Hey,

this is not what I meant. You should manually execute the search operation that you refer to in the watch, and also run a search by index and type without specifying any query - just do GET /.marvel-/node_stats/_search - this allows you to check whether the fields you refer to in your query are actually set in the returned documents.
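Once you have a hit's _source parsed as a dict, checking for a dotted field path can be sketched like this (a hypothetical helper for illustration, not part of any Elasticsearch client):

```python
def has_field(doc, path):
    """Return True if the dotted field path exists in the document."""
    cur = doc
    for part in path.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return False
        cur = cur[part]
    return True

# A trimmed node_stats _source (illustrative shape only):
doc = {"node_stats": {"process": {"cpu": {"percent": 3}}}}
print(has_field(doc, "node_stats.os.cpu.user"))          # prints False
print(has_field(doc, "node_stats.process.cpu.percent"))  # prints True
```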

--Alex


(Iqbal Nazir) #17

Hi,
sorry for my ignorance. I did GET /.marvel-/node_stats/_search but received a 404; then I did GET /.marvel-*/node_stats/_search (please note the *), which returned a lot of results, but they don't contain any node_stats.os.cpu.user field.
Thanks.


(Alexander Reelsen) #18

Hey,

if you execute

GET /.marvel-es-*/node_stats/_search
{
  "size" : 1,
  "sort" : [ { "timestamp" : "desc" } ]
}

You can see the JSON structure of the latest node_stats document. The CPU load is now part of the process JSON being returned, and it only covers the load of this process rather than the whole OS, which is currently not monitored.

I will update the watches in the docs over the next few days, but until then you can fix your watch by adapting it to the JSON that is returned.
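To make the new location concrete, here is a sketch of resolving that dotted path against a trimmed node_stats document (an illustrative shape, not a complete Marvel document):

```python
# The CPU figure now lives under "process", so the aggregation field
# has to be the dotted path node_stats.process.cpu.percent.
doc = {
    "node_stats": {
        "process": {"cpu": {"percent": 7}},
        "jvm": {"mem": {"heap_used_percent": 42}},
    }
}

value = doc
for part in "node_stats.process.cpu.percent".split("."):
    value = value[part]
print(value)  # prints 7
```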

--Alex


(Iqbal Nazir) #19

Hi again,

Thanks for the clarification. Could you please tell me how to adjust the watch? After executing your search, I found that the CPU percent is under process in the last node_stats. I then changed the field from node_stats.os.cpu.user to node_stats.process.cpu.percent, with no success. Do I have to change anything else in the watch?


(Alexander Reelsen) #20

This watch works for me:

PUT _watcher/watch/cpu_usage
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          ".marvel-es-1-*"
        ],
        "types" : [
          "node_stats"
        ],
        "body": {
          "size" : 0,
          "query": {
            "filtered": {
              "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-2m",
                    "lte": "now"
                  }
                }
              }
            }
          },
          "aggs": {
            "minutes": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "minute"
              },
              "aggs": {
                "nodes": {
                  "terms": {
                    "field": "source_node.name",
                    "size": 10,
                    "order": {
                      "cpu": "desc"
                    }
                  },
                  "aggs": {
                    "cpu": {
                      "avg": {
                        "field": "node_stats.process.cpu.percent"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "throttle_period": "30m", <1>
  "condition": {
    "script":  "if (ctx.payload.aggregations.minutes.buckets.size() == 0) return false; def latest = ctx.payload.aggregations.minutes.buckets[-1]; def node = latest.nodes.buckets[0]; return node && node.cpu && node.cpu.value >= 75;"
  },
  "actions": {
    "send_email": { <2>
      "transform": {
        "script": "def latest = ctx.payload.aggregations.minutes.buckets[-1]; return latest.nodes.buckets.findAll { return it.cpu && it.cpu.value >= 75 };"
      },
      "email": {
        "to": "user@example.com", <3>
        "subject": "Watcher Notification - HIGH CPU USAGE",
        "body": "Nodes with HIGH CPU Usage (above 75%):\n\n{{#ctx.payload._value}}\"{{key}}\" - CPU Usage is at {{cpu.value}}%\n{{/ctx.payload._value}}"
      }
    }
  }
}
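For reference, the transform plus the Mustache body template amount to the following (a Python sketch with made-up node names; the actual filtering is done by the Groovy transform and the rendering by Watcher's Mustache engine):

```python
# The transform keeps only the buckets at/above the threshold; the email
# body then iterates over that filtered list as ctx.payload._value.
latest_nodes = [
    {"key": "node-1", "cpu": {"value": 91.0}},
    {"key": "node-2", "cpu": {"value": 12.0}},
]

hot = [n for n in latest_nodes if n.get("cpu") and n["cpu"]["value"] >= 75]

body = "Nodes with HIGH CPU Usage (above 75%):\n\n" + "".join(
    '"{key}" - CPU Usage is at {value}%\n'.format(key=n["key"], value=n["cpu"]["value"])
    for n in hot
)
print(body)
```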

--Alex