Failure in Writing Watcher History – Seeking Reliable Logging Mechanism

Hi Folks!

We are currently encountering challenges with debugging Watcher execution using the default .watcher-history-* indices. Our production Elasticsearch cluster hosts over 200 complex Watcher definitions, and we are seeing the following error in the cluster logs:

[2025-01-26T21:20:34,915][ERROR][o.e.x.w.Watcher] [node-1] watch history could not be written [mAql909PRIafzGS79f2KgA_elasticsearch_cluster_status_c87540ce-801c-4038-a074-03520920e616-2025-01-26T21:20:34.231251784Z], failure [java.lang.IllegalArgumentException: Limit of total fields [1000] has been exceeded]

Because of this, the execution history is only intermittently recorded, which makes it difficult to reliably debug and audit Watcher executions.

We are seeking a solution that allows us to capture and store the complete Watcher execution history—including inputs, conditions, actions, and results—in a custom index. Our goal is to persist the full execution context, not just the ctx object.
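To illustrate, the kind of thing we are picturing is an extra index action on each watch that writes a transformed view of the execution context into a dedicated index. The sketch below is only a rough idea; the action name, index name, and selected fields are placeholders rather than something we already have in place:

"actions": {
  "persist_execution_context": {
    "transform": {
      "script": {
        "source": "return ['watch_id': ctx.watch_id, 'execution_time': ctx.execution_time, 'trigger': ctx.trigger, 'metadata': ctx.metadata, 'payload': ctx.payload]"
      }
    },
    "index": {
      "index": "custom-watcher-history"
    }
  }
}

We are not sure, though, whether this is the right way to also capture condition and action results, which is what prompts the question below.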

Is there a supported or recommended approach to achieve this level of detailed Watcher logging in a custom index?

Please advise.

Thanks a Ton!

Kind regards,
Souvik

It is interesting that your .watcher-history index mapping has so many fields. You might try temporarily increasing index.mapping.total_fields.limit on the current .watcher-history write index to, say, 2000, and then take a look at what the mappings contain after some successful writes. Maybe something in the output of one or more of your watches is resulting in more history fields being written than we would expect. For example, call

GET _data_stream/.watcher-history-17/

and grab the last index_name out of the indices field in the response. That will be the current write index. Then, on that write index:

PUT .ds-.watcher-history-17-2025.05.28-000002/_settings
{
  "index.mapping.total_fields.limit": 2000
}

That will change the field limit only on the current write index, and those backing indices are rolled over automatically every few days. But it might be enough to help with your debugging, and to let you find (and maybe fix) the source of the large number of fields.
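If it helps to narrow that down, you could also pull the mapping from that same write index once some history documents have been written successfully, e.g.:

GET .ds-.watcher-history-17-2025.05.28-000002/_mapping

Whichever sections of that mapping contain the most entries should point you at the watches (or the parts of their output) that are generating all of the fields.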

Hi @Keith_Massey,

I've implemented the changes you suggested, and everything is working as expected now—the execution history is being captured correctly. Thanks a lot!

Currently, there are over 200 watches deployed in the production cluster. A majority of them use chained inputs with multiple levels of aggregation, which leads to a high number of unique fields in the history index. Please refer to the screenshot below.

Would you happen to have any recommendations on how we can streamline these watches to help reduce the field count?
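For context, a simplified sketch of the general shape of one of these chained inputs is below; the input names, indices, fields, and aggregation levels are illustrative rather than copied from an actual watch:

"input": {
  "chain": {
    "inputs": [
      {
        "latency_per_service": {
          "search": {
            "request": {
              "indices": ["metrics-*"],
              "body": {
                "size": 0,
                "aggs": {
                  "by_host": {
                    "terms": { "field": "host.name" },
                    "aggs": {
                      "by_service": {
                        "terms": { "field": "service.name" },
                        "aggs": {
                          "avg_latency": { "avg": { "field": "transaction.duration.us" } }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      },
      {
        "thresholds": {
          "simple": {
            "warning_ms": 500,
            "critical_ms": 1000
          }
        }
      }
    ]
  }
}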