OK folks, it looks like we have found a solution for our stability issue.
After we sent ES a heap dump from one of our spikes, they found a bug that affects filters in aggregations. In filtered aggs like the one below, the bug causes the filter to be run as a bool query, so it gets scored. This is a big issue for us because we make A LOT of these.
{
  "agg_name": {
    "filter": {
      "terms": {
        "id": [1,2,...]
      }
    }
  }
}
The temp fix is to wrap the filter in constant_score like this, so the filter is not scored.
{
  "agg_name": {
    "filter": {
      "constant_score": {
        "filter": {
          "terms": {
            "id": []
          }
        }
      }
    }
  }
}
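If you build a lot of these aggs in code, the wrapping can be applied in one place instead of by hand. Here is a rough Python sketch of that idea, not what we actually run: the function name `wrap_filter_aggs` is made up for illustration, and it assumes your aggregations are plain dicts with sub-aggregations under `aggs` or `aggregations`. It only touches single `filter` aggs (not the plural `filters` agg) and leaves already wrapped filters alone.

```python
import copy
import json


def wrap_filter_aggs(aggs):
    """Return a copy of an aggregations dict where every `filter` agg
    has its filter wrapped in constant_score.

    Hypothetical helper for illustration; not part of any ES client.
    """
    wrapped = copy.deepcopy(aggs)  # never mutate the caller's dict
    for body in wrapped.values():
        if not isinstance(body, dict):
            continue
        flt = body.get("filter")
        # Wrap only plain filters; skip ones that are already wrapped.
        if isinstance(flt, dict) and "constant_score" not in flt:
            body["filter"] = {"constant_score": {"filter": flt}}
        # Recurse into sub-aggregations under either accepted key.
        for key in ("aggs", "aggregations"):
            if isinstance(body.get(key), dict):
                body[key] = wrap_filter_aggs(body[key])
    return wrapped


if __name__ == "__main__":
    # Example ids are placeholders.
    original = {"agg_name": {"filter": {"terms": {"id": [1, 2]}}}}
    print(json.dumps(wrap_filter_aggs(original), indent=2))
```

Running it on the first agg above (with real ids filled in) should give the constant_score form. Calling it twice is harmless because wrapped filters are skipped.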
That change looks to have completely stabilized us. It went out at noon, and this is how the graphs look pre and post change: a healthy sawtooth. We will be monitoring it closely and will post again if anything changes. Thanks everyone for the help!