Metric Threshold Alert reporting incorrect document count

vsabado · November 6, 2023, 6:57pm

I have a metric threshold alert that will trigger when document count is above 30. This alert seems to trigger just fine. For the body I'm setting this:

And this is the data that I get back when the alert fires up:

{"alertId":"SOMEMADEUPID","alertName":"Aborted Alert","spaceId":"default","tags":["Dev"],"alertInstanceId":"US0418,TPASDEMO","alertActionGroup":"metrics.threshold.fired","alertActionGroupName":"Alert","context":{"group":"US0418,TPASDEMO","alertState":"ALERT","reason":"Document count is 62,474 in the last 2 hrs for US0418,TPASDEMO. Alert when > 30.","viewInAppUrl":"SOMEURL","timestamp":"2023-11-01T19:29:19.858Z","value":{"condition0":"62,474"},"threshold":{"condition0":["30"]},"metric":{}},"date":"2023-11-01T19:29:23.552Z","state":{"start":"2023-10-17T17:54:53.840Z","duration":"1301364441000000"},"kibanaBaseUrl":"SOMEBASEURL","params":{"criteria":[{"comparator":">","timeSize":2,"aggType":"count","threshold":[30],"timeUnit":"h"}],"sourceId":"default","alertOnNoData":true,"alertOnGroupDisappear":true,"groupBy":["labels.storeName","labels.retailer"],"filterQueryText":"labels.http_route: "/pos/order/{orderId}/{version}/void" and url.path : *"},"rule":{"id":"SOMEID","name":"Aborted Alert","type":"metrics.alert.threshold","spaceId":"default","tags":["Dev"]},"alert":{"id":"US0418,TPASDEMO","actionGroup":"metrics.threshold.fired","actionGroupName":"Alert"}}

The document count is an unusually large and incorrect number. Haven't quite understood why it's reporting that number and where it's coming from. Any insight would be appreciated

stephenb · November 6, 2023, 7:00pm

Hi @vsabado

You need to share the entire Alert Configuration so we can perhaps help.

Metrics create many documents, so a document count on metrics may not be what you expect.

The way I debug these is to go to Discover with the same criteria and compare side by side.
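For a side-by-side check outside Discover, the rule parameters from the alert payload above can be turned into a plain Elasticsearch search body. This is only a sketch: `build_count_query` is a hypothetical helper, the translation of the KQL filter into `term` and `exists` clauses is done by hand and assumes `labels.http_route` is a keyword field, and the body is an approximation, not necessarily the query the rule executor actually runs.

```python
import json


def build_count_query(time_size, time_unit, group_by, filters):
    """Build a search body that counts documents per group-by combination.

    Hypothetical helper: approximates a document-count rule, it is not
    taken from Kibana's source.
    """
    # One composite source per group-by field; each bucket is then one
    # unique combination of values, with its own doc_count.
    sources = [
        {field.replace(".", "_"): {"terms": {"field": field}}}
        for field in group_by
    ]
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{time_size}{time_unit}", "lte": "now"}}},
                    *filters,
                ]
            }
        },
        "aggs": {"groups": {"composite": {"size": 100, "sources": sources}}},
    }


# Values copied from the "params" section of the alert payload.
query = build_count_query(
    time_size=2,
    time_unit="h",
    group_by=["labels.storeName", "labels.retailer"],
    filters=[
        # Hand translation of the KQL filter (assumption: keyword field).
        {"term": {"labels.http_route": "/pos/order/{orderId}/{version}/void"}},
        {"exists": {"field": "url.path"}},
    ],
)
print(json.dumps(query, indent=2))
```

The printed body can be pasted into Dev Tools under `GET traces-apm*/_search`, once per index pattern listed in the Metric Indices setting, to see which pattern contributes the unexpected documents.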

vsabado · November 6, 2023, 7:09pm

Here's my complete setup for this alert:

Under Observability -> Infrastructure -> Settings -> Indices -> Metric Indices, I have this value:

(rum-data-view)*,traces-apm*,apm-*,logs-apm*,apm-*,metrics-apm*,apm-*

When I plug that same filter into the Discover page, I get this:

So no explanation as to why I'm getting such a huge document count. Please let me know any further information that I can provide

stephenb · November 6, 2023, 7:11pm

What version of the Stack?

vsabado · November 6, 2023, 7:12pm

We are on version 8.5.2

stephenb · November 6, 2023, 7:13pm

Show me this exactly as a screen shot... (rum-data-view)* that concerns me...
That setting is very syntax sensitive ...

Exactly which index pattern do you expect that alert data to come from?

I would debug by JUST setting that as the only index in that Metrics Indices and test again...

vsabado · November 6, 2023, 7:19pm

Not getting any data now. At least before, there was data populated in the graph

stephenb · November 6, 2023, 7:23pm

Where did you get that data view name... something not right (I think)

Go to Discover and find out what Data Stream your documents are in...

Then try that...

stephenb · November 6, 2023, 7:29pm

Huh, I see that in a few places. Can you just try

traces-apm*

I am checking internally on

(rum-data-view)*

that does not look correct, but I found it on one of my clusters too!

vsabado · November 6, 2023, 7:37pm

The graph is still showing data so that's a good sign. I'll get the alert to fire up and report back on what it says the document count is

stephenb · November 6, 2023, 7:40pm

Yeah, I THINK that is a bug / not a valid name. I would take that out of the Data View and the Metric Indices setting...

vsabado · November 6, 2023, 8:03pm

Okay, I got it 4 times in the last 20 mins. This rule checks every 5 mins.

Document count is 5,939 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 7,669 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 8,065 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 8,113 in the last 2 days for US0418,TPASDEMO. Alert when > 1

That is really, really strange. The document count is trending upwards, but it's changing by too much each time.

This is just with traces-apm*

stephenb · November 6, 2023, 8:06pm

I think your group by is not working the way you think... it is an OR, not an AND. Try a single group by.

vsabado · November 6, 2023, 8:14pm

Hm, I didn't know that. How would I make it behave as AND instead of OR? That's pivotal functionality for us for filtering down the large amount of data that we have.

So grouping by just one field gives me this:

Document count is 392,244 in the last 2 days for US0418. Alert when > 1.

Seems like half, but that's still far more than what I really have.

stephenb · November 6, 2023, 8:22pm

Those charts can be a bit misleading, try the actual alert...
There is a fix/workaround for the "AND" if you need it...
Plus I commend you for your attempted workaround with the metrics alert... soon I think there will be a "generic" alert to let you do everything you are trying, but that will be in an upcoming release (you are a ways behind... 8.5.x)

BTW I need to triple check the Group By... the explanation is misleading

stephenb · November 6, 2023, 8:25pm

BTW Did you try Logs Threshold?

I need to check the Group By... but won't be able to do that now

@vsabado Looks like the GROUP BY fields are ANDed after all. I am seeing unique combinations of the 2 group-by fields I used! So you were right!

My Test Was host.name and kubernetes.namespace

4587 log entries in the last 5 mins for gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, recommendation. Alert when > 75.
log-test is active.
gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, recommendation - 4587 log entries have matched the following conditions: host.name does not equal vader

1862 log entries in the last 5 mins for gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, ad. Alert when > 75.
log-test is active.
gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, ad - 1862 log entries have matched the following conditions: host.name does not equal vader

That is the same host with 2 different namespaces so we are good to go on that... apologies for the confusion
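To illustrate what "unique combination" means here, the sketch below counts documents per tuple of group-by values, so the same host with two different namespaces yields two separate groups. The sample documents are made up for illustration, `count_by_group` is a hypothetical helper, and skipping documents that lack one of the fields is an assumption about how the grouping treats missing values.

```python
from collections import Counter


def count_by_group(docs, group_by):
    """Count documents per unique combination of the group-by fields."""
    counts = Counter()
    for doc in docs:
        key = tuple(doc.get(field) for field in group_by)
        if None in key:
            # Assumption: documents missing a group-by field fall into no group.
            continue
        counts[key] += 1
    return counts


# Made-up sample documents: one host, two namespaces.
docs = [
    {"host.name": "host-a", "kubernetes.namespace": "recommendation"},
    {"host.name": "host-a", "kubernetes.namespace": "recommendation"},
    {"host.name": "host-a", "kubernetes.namespace": "ad"},
    {"host.name": "host-a"},  # no namespace, so it is not counted
]

groups = count_by_group(docs, ["host.name", "kubernetes.namespace"])
for key, count in sorted(groups.items()):
    print(", ".join(key), "->", count)
```

Each group is evaluated against the threshold on its own, which is why the two alerts above carry different counts for the same host.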

Still not sure what is going on with yours.... got to step away.

vsabado · November 7, 2023, 3:14pm

Good morning Stephen,

Thank you so much for confirming that! Relieved to see that it's functioning as we had hoped. I just set up a log threshold alert. I'll try it now and report back

vsabado · November 7, 2023, 3:53pm

I got no alert from the log threshold rule. But I'm seeing something really funky now: alerts that really shouldn't be firing are firing.

These are my settings on the Discover page:

I have 9 hits for this in the last hour, which is accurate. However, the metric threshold rule for this exact filter is firing even though it's only supposed to fire when the count is > 10.

My Metric Indices setting is exactly the same.

I'm not sure why I'm seeing mismatched behavior here.

stephenb · November 7, 2023, 10:18pm

@vsabado

I guess this got lost along the way... I am all but positive that the

(rum-data-view)*

is a bug and at least partly related to the inconsistency you are seeing.

Take it out of the data view and those settings.

vsabado · November 7, 2023, 11:13pm

Hey @stephenb, I stripped it down to just traces-apm* and still get the same result. It works fine on my Discover page, though, with the same data view.
