Unable to parse custom X-Pack watcher

Hi,

I am unable to parse the custom watcher I've written. I am getting errors in the chain transform:

Watcher: [parse_exception] could not parse [search] transform for watch [inlined]. unexpected field [indices]

Here's my custom watcher:

{
	"trigger": {
		"schedule": {
			"interval": "30m"
		}
	},
	"input": {
		"search": {
			"request": {
				"indices": [
					"heartbeat-*"
				],
				"body": {
					"size": 0,
					"query": {
						"match_all": {}
					}
				}
			}
		}
	},
	"condition": {
		"script": {
			"source": "return ctx.payload.hits.hits._source.http.response.status > params.status",
			"lang": "painless",
			"params": {
				"status": 500
			}
		}
	},
	"transform": {
		"chain": [
		    {
				"search": {
					"indices": ["heartbeat-*"],
					"body": {
						"size": 0,
						"query": {
							"match": {
								"ctx.payload.hits.hits._source.monitor.status": "down"
							}
						}
					}
				}
			},
			{
				"script": "return [ host_name : ctx.payload.hits.hits._source.monitor.host ]"
			}
		]
	},
	"actions": {
		"send_email": {
			"email": {
				"to": "aaaaa@gmail.com",
				"subject": "Watcher Notification",
				"body": "{{host_name}} is down"
			}
		}
	}
}

All I want to do is:

  1. Use the heartbeat index to get records where the response status is greater than 500, extract the hostnames of those records, and create an alert for them.

Any help would be appreciated, as there aren't many examples on using the chain transform.

You are missing the request part in the transform, which you included correctly in the search input.
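
I.e. the first link of your chain would become something like this (a sketch, untested). Note also that the field in the match query should be the plain document field monitor.status, not a ctx.payload path, and that you will need a non-zero size on the search if you want to read fields from the hits afterwards:

{
  "search": {
    "request": {
      "indices": [ "heartbeat-*" ],
      "body": {
        "query": {
          "match": {
            "monitor.status": "down"
          }
        }
      }
    }
  }
}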

Corrected. But now I am getting

Watcher: An internal server error occurred while simulating the watcher.

Because Variable [host_name] is not defined. Something seems to be wrong, but I'm not sure what.

Also, the example mentioned here - https://www.elastic.co/guide/en/x-pack/current/transform-chain.html - is missing the request part.

Edit 1:
While the above issue is resolved, I am now getting an error while parsing the heartbeat JSON. Here's the error:

"exception": {
    "type": "script_exception",
    "reason": "runtime error",
    "script_stack": [
      "return ctx.payload.hits.hits._source.http.response.status > params.status",
      "                            ^---- HERE"
    ],
    "script": "return ctx.payload.hits.hits._source.http.response.status > params.status",
    "lang": "painless",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Illegal list shortcut value [_source].",
      "stack_trace": "java.lang.IllegalArgumentException: Illegal list shortcut value [_source].\n\tat org.elasticsearch.painless.Def.lookupGetter(Def.java:454)\n\tat org.elasticsearch.painless.DefBootstrap$PIC.lookup(DefBootstrap.java:149)\n\tat org.elasticsearch.painless.DefBootstrap$PIC.fallback(DefBootstrap.java:203)\n\tat org.elasticsearch.painless.PainlessScript$Script.execute(return ctx.payload.hits.hits._source.http.response.status > params.status:29)\n\tat org.elasticsearch.painless.ScriptImpl.run(ScriptImpl.java:105)\n\tat org.elasticsearch.xpack.watcher.condition.ScriptCondition.doExecute(ScriptCondition.java:85)\n\tat org.elasticsearch.xpack.watcher.condition.ScriptCondition.execute(ScriptCondition.java:76)\n\tat org.elasticsearch.xpack.watcher.execution.ExecutionService.executeInner(ExecutionService.java:466)\n\tat org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:317)\n\tat org.elasticsearch.xpack.watcher.transport.actions.execute.TransportExecuteWatchAction$1.doRun(TransportExecuteWatchAction.java:165)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"
    },
    "stack_trace": "ScriptException[runtime error]; nested: IllegalArgumentException[Illegal list shortcut value [_source].];\n\tat org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:101)\n\tat org.elasticsearch.painless.PainlessScript$Script.execute(return ctx.payload.hits.hits._source.http.response.status > params.status:8)\n\tat org.elasticsearch.painless.ScriptImpl.run(ScriptImpl.java:105)\n\tat org.elasticsearch.xpack.watcher.condition.ScriptCondition.doExecute(ScriptCondition.java:85)\n\tat org.elasticsearch.xpack.watcher.condition.ScriptCondition.execute(ScriptCondition.java:76)\n\tat org.elasticsearch.xpack.watcher.execution.ExecutionService.executeInner(ExecutionService.java:466)\n\tat org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:317)\n\tat org.elasticsearch.xpack.watcher.transport.actions.execute.TransportExecuteWatchAction$1.doRun(TransportExecuteWatchAction.java:165)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: java.lang.IllegalArgumentException: Illegal list shortcut value [_source].\n\tat org.elasticsearch.painless.Def.lookupGetter(Def.java:454)\n\tat org.elasticsearch.painless.DefBootstrap$PIC.lookup(DefBootstrap.java:149)\n\tat org.elasticsearch.painless.DefBootstrap$PIC.fallback(DefBootstrap.java:203)\n\tat org.elasticsearch.painless.PainlessScript$Script.execute(return ctx.payload.hits.hits._source.http.response.status > params.status:29)\n\t... 13 more\n"
  }

Here's the JSON I am trying to parse:

"hits": {
    "total": 6120,
    "max_score": 0.00008168934,
    "hits": [
      {
        "_index": "heartbeat-6.2.4-2018.05.08",
        "_type": "doc",
        "_id": "JsIMPWMBT22YMjfjXauE",
        "_score": 0.00008168934,
        "_source": {
          "tcp": {
            "rtt": {
              "connect": {
                "us": 1227
              }
            },
            "port": 80
          },
          "resolve": {
            "ip": "54.319.1.87",
            "host": "aaaaaa",
            "rtt": {
              "us": 6103
            }
          },
          "beat": {
            "name": "STOOR",
            "hostname": "STOOR",
            "version": "6.2.4"
          },
          "@timestamp": "2018-05-08T00:00:09.367Z",
          "type": "monitor",
          "http": {
            "response": {
              "status": 200
            },
            "rtt": {
              "write_request": {
                "us": 33
              },
              "total": {
                "us": 2737
              },
              "response_header": {
                "us": 1422
              },
              "validate": {
                "us": 1459
              },
              "content": {
                "us": 37
              }
            },
            "url": "http://aaaaaaaa"
          },
          "@version": "1",
          "host": "STOOR",
          "tags": [
            "beats_input_raw_event"
          ],
          "monitor": {
            "id": "http@http://aaaaaaaa",
            "status": "up",
            "ip": "54.319.1.87",
            "type": "http",
            "scheme": "http",
            "name": "http",
            "duration": {
              "us": 8898
            },
            "host": "aaaaaa"
          }
        }
      },

Note: I've only pasted the relevant JSON record from the complete output.

You are not taking into account that ctx.payload.hits.hits is an array, not just an ordinary structure. I'd recommend taking a look at our examples repository, which contains a few scripts and watches to get inspired by.
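
For example, a transform script could iterate over the hits like this (a sketch; note the quoted map key, since an unquoted host_name in a Painless map literal is read as a variable, which is what caused your "Variable [host_name] is not defined" error):

def hosts = [];
for (def hit : ctx.payload.hits.hits) {
  hosts.add(hit._source.monitor.host);
}
return [ 'host_names' : hosts ];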

I've gone through the examples, but came back here unhappier, with the feeling that Painless is not actually all that painless.

At first I didn't think it would be so difficult to accomplish this with Painless.

Still looking for a better explanation of the simpler constructs, something that can help me accomplish this use case.

Maybe it makes sense to step back a second and explain what this watch should actually do.

The direct solution to your problem would be to use ctx.payload.hits.hits[0]._source..., but this means you would only check the first hit of your query; there could be many other documents found that then go fully unchecked.
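
For illustration, your condition script would then read:

return ctx.payload.hits.hits[0]._source.http.response.status > params.status

but again, this only checks the first hit.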

You also don't take into account that your query needs to be limited by time; otherwise you are searching over the whole time range with every watch execution, which means that a single host being down once will result in emails being sent forever.
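
For example, a time filter in the query could look like this (a sketch, assuming the standard @timestamp field and a window matching your 30-minute trigger interval):

"query": {
  "bool": {
    "filter": [
      { "term": { "monitor.status": "down" } },
      { "range": { "@timestamp": { "gte": "now-30m" } } }
    ]
  }
}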

This (and not the concrete issue) led me to the hint that it might make sense to step back a bit and study the examples more closely first, and then come back with the concrete problem you are trying to solve.

Hope this makes sense!

The problem is simple:

  1. I have N number of websites
  2. I need a watcher which can monitor these websites, check the response code of each of them, and alert me if any website goes down.

Pretty much what web-based services like UptimeRobot, Pingdom, etc. do. All I want is to accomplish this with the ELK stack, as we already have it and folks here favor the ELK stack for this purpose.

So the direct solution ctx.payload.hits.hits[0]._source... wouldn't work here. If you insist, I'll still review the examples, but I hope you get the problem statement now.

There is a reason why there are fully fledged services out there for this: it is not as easy as it sounds. Even getting a small service up and running to alert on your internal infrastructure requires some serious thought. Let's take a look at your requirements, and also at what else to keep in mind.

The first question you have to ask yourself is whether you can write a query that returns the proper data you need for an alert.

You probably could. You need to receive all events from the last 5 minutes (a time range) where the status was down, grouped by hostname. Or maybe not by hostname: if a host runs two services and only one is down, you want to be notified just for the failing service.

So this means a query filtering for status and time, and an aggregation grouping by monitor.id (as an example).

Is this enough? Hard to tell. Do you constantly want to alert on the same service being down, or do you only want to alert on new down events? If so, you would need the data from the previous run as well (maybe with a second query using the chain input, or with a second aggregation that has different time filters).
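
A rough sketch of the chain input idea (the input names current and previous are just illustrative):

"input": {
  "chain": {
    "inputs": [
      {
        "current": {
          "search": {
            "request": {
              "indices": [ "heartbeat-*" ],
              "body": {
                "query": {
                  "range": { "@timestamp": { "gte": "now-5m" } }
                }
              }
            }
          }
        }
      },
      {
        "previous": {
          "search": {
            "request": {
              "indices": [ "heartbeat-*" ],
              "body": {
                "query": {
                  "range": { "@timestamp": { "gte": "now-10m", "lt": "now-5m" } }
                }
              }
            }
          }
        }
      }
    ]
  }
}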

Also, is it sufficient to monitor a specific endpoint, or would it make more sense to monitor what is happening inside your application (e.g. using APM)? It might make a difference whether your website is merely reachable or whether users are unable to put any item in their shopping cart.

What is your alerting strategy? Do you want to escalate immediately via PagerDuty? Should a single email be sent, then throttled for 10 minutes before another one goes out, while you keep sending updates to Slack?

This is just the start; no special cases, no corner cases yet. This also requires you to take a deeper look at the search and aggregation features of Elasticsearch, throttling of watches, conditional actions inside watches, etc.
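
Throttling, for instance, is just a per-action setting; a sketch based on the action from your watch:

"actions": {
  "send_email": {
    "throttle_period": "10m",
    "email": {
      "to": "aaaaa@gmail.com",
      "subject": "Watcher Notification",
      "body": "One or more monitors are down"
    }
  }
}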

This query might help you as a start

GET heartbeat-6.2.4-2018.05.11/_search
{
  "size": 0, 
  "query": {
    "bool": {
      "filter": {
        "term": {
          "monitor.status": "down"
        }
      }
    }
  }, 
  "aggs": {
    "status": {
      "terms": {
        "field": "monitor.id"
      }
    }
  }
}

Again, it is just a start: you only get back ten buckets by default in the aggregation, and you may have more hosts being down.
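
If you expect more, raise the size on the terms aggregation (sketch):

"aggs": {
  "status": {
    "terms": {
      "field": "monitor.id",
      "size": 50
    }
  }
}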


Thanks for the head start. There are certainly more moving parts than I thought when I first started this. I'll still give this a go and see how much I can accomplish.
