Metric Threshold Alert reporting incorrect document count

I have a metric threshold alert that will trigger when document count is above 30. This alert seems to trigger just fine. For the body I'm setting this:

And this is the data that I get back when the alert fires up:

{"alertId":"SOMEMADEUPID","alertName":"Aborted Alert","spaceId":"default","tags":["Dev"],"alertInstanceId":"US0418,TPASDEMO","alertActionGroup":"metrics.threshold.fired","alertActionGroupName":"Alert","context":{"group":"US0418,TPASDEMO","alertState":"ALERT","reason":"Document count is 62,474 in the last 2 hrs for US0418,TPASDEMO. Alert when > 30.","viewInAppUrl":"SOMEURL","timestamp":"2023-11-01T19:29:19.858Z","value":{"condition0":"62,474"},"threshold":{"condition0":["30"]},"metric":{}},"date":"2023-11-01T19:29:23.552Z","state":{"start":"2023-10-17T17:54:53.840Z","duration":"1301364441000000"},"kibanaBaseUrl":"SOMEBASEURL","params":{"criteria":[{"comparator":">","timeSize":2,"aggType":"count","threshold":[30],"timeUnit":"h"}],"sourceId":"default","alertOnNoData":true,"alertOnGroupDisappear":true,"groupBy":["labels.storeName","labels.retailer"],"filterQueryText":"labels.http_route: "/pos/order/{orderId}/{version}/void" and url.path : *"},"rule":{"id":"SOMEID","name":"Aborted Alert","type":"metrics.alert.threshold","spaceId":"default","tags":["Dev"]},"alert":{"id":"US0418,TPASDEMO","actionGroup":"metrics.threshold.fired","actionGroupName":"Alert"}}

The document count is an unusually large and incorrect number. I haven't quite understood why it's reporting that number or where it's coming from. Any insight would be appreciated.

Hi @vsabado

You need to share the entire Alert Configuration so we can perhaps help.

Metrics create many documents ... so a document count on metrics may not be what you expect.

The way I debug these is to go to Discover with the same criteria and compare side by side.
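
For example, you can run the equivalent count in Dev Tools and compare it with what the alert reports. This is only a sketch based on the payload above: I'm guessing traces-apm* as the index pattern and assuming the labels.* fields are keyword-mapped, so adjust to your setup:

GET traces-apm*/_count
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-2h" } } },
        { "term": { "labels.http_route": "/pos/order/{orderId}/{version}/void" } },
        { "exists": { "field": "url.path" } },
        { "term": { "labels.storeName": "US0418" } },
        { "term": { "labels.retailer": "TPASDEMO" } }
      ]
    }
  }
}

The range filter is the rule's 2 hr window, the term and exists clauses are the filterQueryText, and the last two terms are the group that fired. If that count matches Discover but not the alert's "reason" text, the rule is almost certainly reading from more (or different) indices than you think.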

Here's my complete setup for this alert:

Under Observability -> Infrastructure -> Settings -> Indices -> Metric Indices, I have this value:

(rum-data-view)*,traces-apm*,apm-*,logs-apm*,apm-*,metrics-apm*,apm-*

When I plug that same filter into the Discover page, I get this:

So there's no explanation as to why I'm getting such a huge document count. Please let me know if there's any further information I can provide.

What version of the Stack?

We are on version 8.5.2

Show me this exactly as a screenshot... (rum-data-view)* concerns me...
That setting is very syntax-sensitive...

Exactly which index pattern do you expect that alert data to come from?

I would debug by JUST setting that as the only index in the Metric Indices setting and testing again...

Not getting any data now. At least before, there was data populated in the graph

Where did you get that data view name... something not right (I think)

Go to Discover and find out what Data Stream your documents are in...

Then try that...
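
If it's not obvious from Discover, a quick way to see where the matching documents actually live is to aggregate on _index with the same filter. Again just a sketch, reusing the field from your payload:

GET *apm*/_search
{
  "size": 0,
  "query": {
    "term": { "labels.http_route": "/pos/order/{orderId}/{version}/void" }
  },
  "aggs": {
    "backing_indices": {
      "terms": { "field": "_index", "size": 20 }
    }
  }
}

The bucket keys are the backing indices (and therefore the data streams) your documents are sitting in.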

Huh, I see that in a few places. Can you just try

traces-apm*

I am checking internally on

(rum-data-view)*

that does not look correct, but I found it on one of my clusters too!

The graph is still showing data so that's a good sign. I'll get the alert to fire up and report back on what it says the document count is


Yeah, I THINK that is a bug / not a valid name. I would take that out of the Data View and the Metric Indices setting...

Okay, I got it 4 times in the last 20 mins. This rule checks every 5 mins.

Document count is 5,939 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 7,669 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 8,065 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 8,113 in the last 2 days for US0418,TPASDEMO. Alert when > 1

That is really, really strange. The document count is trending upwards, but it's changing by too much each time.

This is just with traces-apm*

I think your group by is not working the way you think... it is an OR, not an AND. Try a single group by.

Hmm, I didn't know that. How would I make it behave as AND instead of OR? That's pivotal functionality for us for filtering the large amount of data that we have.

So filtering by just one gives me this:

Document count is 392,244 in the last 2 days for US0418. Alert when > 1.


Seems like about half, but that's still far more than what I really have.

Those charts can be a bit misleading, try the actual alert...
There is a fix/workaround for the "AND" if you need it (sketch below)...
Plus I commend you for your attempted workaround with the metrics alert... soon I think there will be a "generic" alert that lets you do everything you are trying, but that will be in an upcoming release (you are a ways behind... 8.5.x)
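
For example, one workaround of that kind (a sketch only; the values are just the ones from your example group) is to pin the combination in the rule's filter query itself instead of relying on the group by:

labels.http_route : "/pos/order/{orderId}/{version}/void" and url.path : * and labels.storeName : "US0418" and labels.retailer : "TPASDEMO"

That turns the rule into a single check for that one store/retailer pair, which is less flexible but unambiguous.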

BTW I need to triple-check the Group By... the explanation is misleading

BTW, did you try Logs Threshold?

I need to check the Group By... but won't be able to do that now

@vsabado Looks like the GROUP BY fields are ANDed after all. I am seeing unique combinations of the 2 group bys I have done!!! So you were right!

My test was host.name and kubernetes.namespace

4587 log entries in the last 5 mins for gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, recommendation. Alert when > 75.
log-test is active.
gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, recommendation - 4587 log entries have matched the following conditions: host.name does not equal vader
1862 log entries in the last 5 mins for gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, ad. Alert when > 75.
log-test is active.
gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, ad - 1862 log entries have matched the following conditions: host.name does not equal vader

That is the same host with 2 different namespaces so we are good to go on that... apologies for the confusion
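
In query terms, that behaviour lines up with bucketing on the unique combination of both group by fields, roughly like a composite aggregation (a sketch using my test fields, not the literal query the rule runs):

GET logs-*/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-5m" } } },
  "aggs": {
    "groups": {
      "composite": {
        "size": 100,
        "sources": [
          { "host": { "terms": { "field": "host.name" } } },
          { "namespace": { "terms": { "field": "kubernetes.namespace" } } }
        ]
      }
    }
  }
}

Each bucket key is a host.name plus kubernetes.namespace pair, which matches the "host, namespace" alert instances above.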

Still not sure what is going on with yours.... got to step away.

Good morning Stephen,

Thank you so much for confirming that! Relieved to see that it's functioning as we had hoped. I just set up a log threshold alert. I'll try it now and report back

I got no alert from the log threshold alert. But I'm seeing something really funky now: alerts that really shouldn't be firing are firing.

This is my settings on the discover page:

I have 9 hits for this in the last hour, which is accurate. However, the metric threshold rule for this exact filter is firing even though it's only supposed to fire when the count is > 10.

My Metric Indices setting is exactly the same.

I'm not sure why I'm seeing this mismatched behavior here.
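
For reference, this is roughly the check I'm running in Dev Tools to compare against the rule (a sketch; the index pattern and field are placeholders based on my earlier payload, and the time range mirrors the last hour I looked at in Discover):

GET traces-apm*/_count
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "labels.http_route": "/pos/order/{orderId}/{version}/void" } }
      ]
    }
  }
}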

@vsabado

I guess this got lost along the way... I am all but positive that the

(rum-data-view)*

is a bug and at least partly related to the inconsistency you are seeing.

Take it out of the data view and those settings.

Hey @stephenb, I stripped it down to just traces-apm* and still got the same result. It works fine on my Discover page though with the same data view.