Error when navigating to Watchers

Hi Guys,

So...

I deleted my index today and then added it again. After doing so, I tried to navigate to the watches within Kibana and got the following error message:

Error: Watcher: Error 503 Service Unavailable: [search_phase_execution_exception] all shards failed

The health of my .watches-6 index is red.

How do I recover the health of my index without deleting it or losing any data within it?

Any advice?

J

When I attempted to run

GET .watches/_search/

I received the following error:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": []
  },
  "status": 503  
}

Do you mean you deleted an index of your own data (not a system index)?

What method did you use to delete the data from Elasticsearch? It should have worked fine if you sent a DELETE request to Elasticsearch:

DELETE /mydata

Meaning, send that as a request in the Kibana Console tool (located under Dev Tools). If you went into the data directory on the filesystem and deleted shard files, then you'll have a problem, because you may have deleted primary shards of other indices.

Check if you have any other indices with red status by running the following in the Kibana Console tool:

GET /_cat/indices?v

Hi Tim,

Thanks for your response.

I deleted an index of my own data, correct. I used that delete request but without the "/". Not sure if that makes a difference.

When I ran:

GET /_cat/indices

I only have one index that is red.

red    open   .watches-6        2W_on-IVS26NWi65qA5Vbg   1   1       

When I run

GET /_cat/shards?v

The state of a few of my indices is "unassigned".

.watcher-history-6-2018.02.05                    0     r      UNASSIGNED    
.watches-6                                       0     p      UNASSIGNED                               
.watches-6                                       0     r      UNASSIGNED       
.kibana-6                                        0     r      UNASSIGNED                      
.watcher-history-6-2018.02.03                    0     r      UNASSIGNED          
.triggered_watches-6                             0     r      UNASSIGNED                             
.watcher-history-6-2018.02.06                    0     r      UNASSIGNED   
.watcher-history-6-2018.02.04                    0     r      UNASSIGNED           

The error I receive is that ALL my shards failed.

It might be worth noting that I only use one index pattern for my logs. That is the one that I deleted and added again.

Somehow your .watches-6 index lost all of its shards, which means the watch data is lost. Fortunately, it looks like it only happened to that one index.
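
If you want to confirm why the shards are unassigned, the cluster allocation explain API can tell you. A minimal sketch (shard 0 is assumed here only because it's the one shard number your _cat/shards output shows for .watches-6):

GET /_cluster/allocation/explain
{
  "index": ".watches-6",
  "shard": 0,
  "primary": true
}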

If you have a snapshot backup, you can restore the data: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html

Otherwise you'll just have to delete the index and re-create the watches.

Deleting the .watches index can be tricky, because it doesn't usually allow direct access. There's a troubleshooting guide here: https://www.elastic.co/guide/en/watcher/2.4/troubleshooting.html#_dynamic_mapping_error_when_trying_to_add_a_watch

Hi Tim, we do have snapshots for the cluster. However, there was an error when I tried to restore the snapshot, due to duplicated aliases. Is there a way I can increase the data retention period of those snapshots so that I have more time to recover the data? All of the recent ones have been failing, and I would like to keep the older snapshots that were successful.

Hm, not sure I understand the question, since snapshots are incremental. If you grab snapshots on a regular basis and just keep the old ones as you go, you'll have data backed up as far back as when you started the process. I usually only create snapshots for testing, but when I do, I append a number to the name of the snapshot: snap_04, snap_05, etc. (Using a date pattern would probably make more sense, though.)
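
If you do go with date-based names, Elasticsearch supports date math in snapshot names; a sketch, assuming a repository registered as my_backup (a placeholder name), with the date math URL-encoded as the API requires:

# creates a snapshot named e.g. snap-2018.02.13 (this is <snap-{now/d}> URL-encoded)
PUT /_snapshot/my_backup/%3Csnap-%7Bnow%2Fd%7D%3E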

If you haven't already, you should probably check out that troubleshooting guide I linked to above, because restoring the snapshot for the watcher indices might require configuring direct access to the index first. Also, when you restore a snapshot on top of an already-existing index, you need to close or delete that index first; hopefully that was the reason for the error you saw.

When you restore a snapshot, you can restore just specific indices, or a single index that's backed up. There is an indices property available in the POST body JSON (see the Snapshot module page in the Elasticsearch Guide), so you probably want to use that property and restore just the watches index.
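
A minimal sketch of that (my_backup and snap_04 are placeholders for your actual repository and snapshot names; the existing red index has to be closed or deleted before the restore can replace it):

# close the existing red index so the restore can replace it
POST /.watches-6/_close

# restore only the watches index from the snapshot
POST /_snapshot/my_backup/snap_04/_restore
{
  "indices": ".watches-6",
  "include_global_state": false
}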

Hey,

Is it possible that you are trying to restore your .watches index, which right now also points to an alias (which in turn points to .watches-6)?

You might want to take a look at the rename_pattern and rename_replacement options in the restore docs.
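
A sketch of what that could look like (again, my_backup and snap_04 are placeholder names; the index comes back as restored_.watches-6, which avoids clashing with the existing alias):

POST /_snapshot/my_backup/snap_04/_restore
{
  "indices": ".watches-6",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1",
  "include_aliases": false
}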

--Alex

Hi @tsullivan

With regards to the snapshots: originally I tried to create a new cluster and load a snapshot into it, but that didn't seem to work for me. So instead I created the cluster from the snapshot, and that worked. I've managed to access what I needed from the watchers and have fixed the issue.

@spinscale thanks for replying, but as you've probably read, I've managed to fix the issue!

Thank you for your time

J

@spinscale Hi.

Is it possible that deleting my .watches-6 index would mean that the watchers no longer execute when the conditions are met?

I tested it by triggering an event that matches one of my watchers (an already working, tested watcher). When manually triggering the watcher it says "execution not needed".

Could this be to do with my watcher's mapping?

Your watches are loaded from the .watches index or alias. However, in this case it points to an index that is not available, so there is nothing that can be executed at all.
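
If you want to check what .watches currently resolves to, a quick sketch:

# list the indices behind the .watches alias (if it is an alias)
GET /_alias/.watches

# or list all aliases and the indices they point to
GET /_cat/aliases?v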

The watcher points to my tid* index pattern, correct? The tid* indices have data in them that meets the conditions.

  {
  "trigger": {
    "schedule": {
      "interval": "2m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "tid*"
        ],

When I run

GET _cat/indices

green  open .watches   2VOq9oBUQ1-7QYjnSDrQBA 1 1       5    0   79.2kb  52.8kb

I do have a .watches index. I'm not 100% sure how it works, but at a high level I believe the watches index now cannot communicate with my tid* indices?

That looks good.

Can you share the history entry of a watch execution? There is a dedicated watch history index for each day, where each watch execution is stored, so that you get a history of records.

You can search through it like this:

GET .watcher-history-6-2018.02.13/_search
{
  "query": {
    "term": {
      "watch_id" : "YOUR_WATCH_ID_HERE"
    }
  },
  "sort": [
    {
      "trigger_event.triggered_time": {
        "order": "desc"
      }
    }
  ]
}

Here is the result of that search:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 55,
    "max_score": null,
    "hits": [
      {
        "_index": ".watcher-history-6-2018.02.13",
        "_type": "doc",
        "_id": "Account_Locked_Domain_95306c02-12a2-47bb-bf28-86eea45abd0b-2018-02-13T12:50:20.936Z",
        "_score": null,
        "_source": {
          "watch_id": "Account_Locked_Domain",
          "node": "bpFt6tPXR5q4jjhKbNluig",
          "state": "execution_not_needed",
          "status": {
            "state": {
              "active": true,
              "timestamp": "2018-02-13T11:08:15.762Z"
            },
            "last_checked": "2018-02-13T12:50:20.936Z",
            "actions": {
              "my-logging-action": {
                "ack": {
                  "timestamp": "2018-02-13T11:08:15.762Z",
                  "state": "awaits_successful_execution"
                }
              },
              "web_hook": {
                "ack": {
                  "timestamp": "2018-02-13T11:08:15.762Z",
                  "state": "awaits_successful_execution"
                }
              }
            },
            "version": -1
          },
          "trigger_event": {
            "type": "schedule",
            "triggered_time": "2018-02-13T12:50:20.936Z",
            "schedule": {
              "scheduled_time": "2018-02-13T12:50:20.906Z"
            }
          },
          "input": {
            "search": {
              "request": {
                "search_type": "query_then_fetch",
                "indices": [
                  "tid*"
                ],
                "types": [],
                "body": {
                  "size": 0,
                  "query": {
                    "bool": {
                      "filter": [
                        {
                          "range": {
                            "@timestamp": {
                              "gte": "now-{{ctx.metadata.window_period}}"
                            }
                          }
                        },
                        {
                          "term": {
                            "event_type_group": "domain_account_locked"
                          }
                        }
                      ]
                    }
                  },
                  "aggs": {
                    "reporting_ip": {
                      "terms": {
                        "field": "reporting_ip.keyword"
                      },
                      "aggs": {
                        "user": {
                          "terms": {
                            "field": "user.keyword"
                          },
                          "aggs": {
                            "compass_tenantId": {
                              "terms": {
                                "field": "compass_tenantId.keyword"
                              },
                              "aggs": {
                                "events": {
                                  "top_hits": {
                                    "size": 100,
                                    "_source": [
                                      "@timestamp",
                                      "event_type",
                                      "reporting_ip",
                                      "source_ip",
                                      "user",
                                      "computer",
                                      "win_logon_type",
                                      "raw_event_log"
                                    ]
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
  },
      "condition": {
        "script": {
          "source": """
  def offenders = [];
  for (def reporting_ip: ctx.payload.aggregations.reporting_ip.buckets) {
    for (def user: reporting_ip.user.buckets) {
      for (def compass_tenantId: user.compass_tenantId.buckets) {
       if (compass_tenantId.doc_count >= 1 ) {
            offenders.add([
              'reporting_ip': reporting_ip.key,
              'user': user.key, 
              'compass_tenantId': compass_tenantId.key, 
              'attempts': compass_tenantId.doc_count,
              'events': compass_tenantId.events,
              'incident_name': 'ACCOUNT_LOCKOUT_DOMAIN',
              'status_open' : 'open',
              'description' : 'Accound Locked: Domain',
              'incident_severity' : '10',
              'conditions' : 'DomainAcctLockout, 10mins'
            ]);
          }
        }
    }
  }
  ctx.payload.offenders = offenders;
  return offenders.size() > 0;
""",
          "lang": "painless"
        }
      },
      "metadata": {
        "window_period": "3h"
      },
      "result": {
        "execution_time": "2018-02-13T12:50:20.936Z",
        "execution_duration": 4,
        "input": {
          "type": "search",
          "status": "success",
          "payload": {
            "_shards": {
              "total": 5,
              "failed": 0,
              "successful": 5,
              "skipped": 0
            },
            "hits": {
              "hits": [],
              "total": 0,
              "max_score": 0
            },
            "took": 3,
            "timed_out": false,
            "offenders": [],
            "aggregations": {
              "reporting_ip": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": []
              }
            }
          },
          "search": {
            "request": {
              "search_type": "query_then_fetch",
              "indices": [
                "tid*"
              ],
              "types": [],
              "body": {
                "size": 0,
                "query": {
                  "bool": {
                    "filter": [
                      {
                        "range": {
                          "@timestamp": {
                            "gte": "now-3h"
                          }
                        }
                      },
                      {
                        "term": {
                          "event_type_group": "domain_account_locked"
                        }
                      }
                    ]
                  }
                },
                "aggs": {
                  "reporting_ip": {
                    "terms": {
                      "field": "reporting_ip.keyword"
                    },
                    "aggs": {
                      "user": {
                        "terms": {
                          "field": "user.keyword"
                        },
                        "aggs": {
                          "compass_tenantId": {
                            "terms": {
                              "field": "compass_tenantId.keyword"
                            },
                            "aggs": {
                              "events": {
                                "top_hits": {
                                  "size": 100,
                                  "_source": [
                                    "@timestamp",
                                    "event_type",
                                    "reporting_ip",
                                    "source_ip",
                                    "user",
                                    "computer",
                                    "win_logon_type",
                                    "raw_event_log"
                                  ]
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        },
        "condition": {
          "type": "script",
          "status": "success",
          "met": false
        },
        "actions": []
      },
      "messages": []
    },
    "sort": [
      1518526220936
    ]

Check the result.input.payload JSON snippet from the history entry: it contains your search response, and it shows that your search did not match anything and your aggregation buckets were empty, so the condition was not met.
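
One way to double-check (a minimal sketch, reusing the same filters shown in the history entry above) is to run the watch's input search directly against the tid* indices and see whether it returns any hits:

GET tid*/_search
{
  "size": 1,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-3h" } } },
        { "term": { "event_type_group": "domain_account_locked" } }
      ]
    }
  }
}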

Here is a very minimal watch which checks a single field (this field 100% exists within the time period), yet the execution is still not needed:

   {
      "trigger": {
        "schedule": {
          "interval": "2m"
        }
      },
      "input": {
        "search": {
          "request": {
            "search_type": "query_then_fetch",
            "indices": [
              "tid*"
            ],
            "types": [],
            "body": {
              "size": 0,
              "query": {
                "bool": {
                  "filter": [
                    {
                      "range": {
                        "@timestamp": {
                          "gte": "now-{{ctx.metadata.window_period}}"
                        }
                      }
                    },
                    {
                      "term": {
                        "event_type_group": "office_365_logon_success"
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "source_ip": {
                  "terms": {
                    "field": "source_ip.keyword"
                  }
                }
              }
            }
          }
        }
      },
      "condition": {
        "script": {
          "source": """
          def offenders = [];
          for (def source_ip: ctx.payload.aggregations.source_ip.buckets) {

                  if (source_ip.doc_count >= 1 ) { 
                  offenders.add([
                    'source_ip': source_ip.key,
                    'attempts': source_ip.doc_count,
                    'events': source_ip.events.hits
                  ]);
                }
              
            }
          ctx.payload.offenders = offenders;
          return offenders.size() > 0;
    """,
          "lang": "painless"
        }
      },
      "actions": {
        "web_hook": {
          "webhook": {
            "scheme": "https",
            "host": "staging-api.aurigacompass.com",
            "port": 443,
            "method": "post",
            "path": "/api/alert/post",
            "params": {},
            "headers": {
              "Authorization": "rryfCCgAH3xlgo4IElYHi9hlrTN2VlXTxW9H95f6t3gxpFd5abTGgxM7NfffjRMBAAoD0rU3TtAkuFuKoAjMnN4FLDhaej1Qr-xytxLKy10Wz3d8_yJ8Li-qF0Eg-xST1wXF25tVw77UXpgFSo15xeWDsnRaAUrd2iIh2eR_i0nPdNvGmzw5uChq34pfH_qsw7zhL1P",
              "Content-Type": "application/json"
            },
            "body": "{{#toJson}}ctx.payload.offenders{{/toJson}}"
          }
        },
        "my-logging-action": {
          "logging": {
            "level": "info",
            "text": "There are {{ctx.payload.hits.total}} documents -  {{ctx.payload.hits.hit}}"
          }
        }
      },
      "metadata": {
        "window_period": "20m"
      },
      "throttle_period_in_millis": 120000
    }

The result:

"result": {
            "execution_time": "2018-02-13T14:07:16.737Z",
            "execution_duration": 4,
            "input": {
              "type": "search",
              "status": "success",
              "payload": {
                "_shards": {
                  "total": 5,
                  "failed": 0,
                  "successful": 5,
                  "skipped": 0
                },
                "hits": {
                  "hits": [],
                  "total": 20,
                  "max_score": 0
                },
                "took": 2,
                "timed_out": false,
                "offenders": [],
                "aggregations": {
                  "source_ip": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": []

Nothing appears to be wrong at all... but it's still not executing. :face_with_monocle: