Metric Threshold Alert reporting incorrect document count

What version of the Stack?

We are on version 8.5.2

Show me this exactly as a screenshot... it is that (rum-data-view)* that concerns me...
That setting is very syntax-sensitive ...

Exactly which index pattern do you expect that alert data to come from?

I would debug by JUST setting that as the only index in the Metrics Indices setting and testing again...

Not getting any data now. At least before, there was data populated in the graph

Where did you get that data view name... something not right (I think)

Go to Discover and find out what Data Stream your documents are in...

Then try that...
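
If it helps, you can also check this from Dev Tools rather than Discover. A rough sketch, assuming the usual traces-apm* naming (adjust the pattern to yours); the _index on the returned hit shows the concrete backing index / data stream the document actually lives in:

    # List the data streams in the cluster
    GET _data_stream

    # Pull back one document; the _index on the hit shows where it actually landed
    GET traces-apm*/_search?size=1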

Huh, I see that in a few places. Can you just try

traces-apm*

I am checking internally on

(rum-data-view)*

that does not look correct, but I found it on one of my clusters too!

The graph is still showing data so that's a good sign. I'll get the alert to fire up and report back on what it says the document count is


Yeah, *** I THINK *** that is a bug / not a valid name. I would take that out of the Data View and the Metrics Indices setting...
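
For what it's worth, one quick way to see whether a pattern like that even points at anything is the resolve index API in Dev Tools. A minimal sketch; the second request is just there for comparison against a pattern you know is real:

    # Show which indices, aliases, or data streams the suspicious pattern matches (likely nothing)
    GET _resolve/index/(rum-data-view)*

    # Compare against a pattern you know carries data
    GET _resolve/index/traces-apm*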

Okay, I got it 4 times in the last 20 mins. This rule checks every 5 mins.

Document count is 5,939 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 7,669 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 8,065 in the last 2 days for US0418,TPASDEMO. Alert when > 1

Document count is 8,113 in the last 2 days for US0418,TPASDEMO. Alert when > 1

That is really, really strange. The document count is trending upwards, but it's changing by too much each time.

This is just with traces-apm*

I think your group by is not working the way you think... it is an OR, not an AND. Try a single group by.

Hm, didn't know that. How would I make it behave as AND instead of OR? That's pivotal functionality for us for filtering the large amount of data we have.

So filtering by just one gives me this:

Document count is 392,244 in the last 2 days for US0418. Alert when > 1.


Seems like about half, but that's still far more than what I really have.

Those charts can be a bit misleading, try the actual alert...
There is a fix/workaround for the "AND" if you need it ....
Plus I commend you for your attempted workaround with the metrics alert... soon I think there will be a "generic" alert to let you do everything you are trying, but that will be in an upcoming release (you are a ways behind... 8.5.x)
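
One more cross-check you can do outside the charts is to run the count yourself for a single group over the same 2-day window. A rough Dev Tools sketch; service.name here is only a guess at the field US0418 actually lives in, so substitute your real group-by field:

    # Count documents in the last 2 days for a single group-by value
    GET traces-apm*/_count
    {
      "query": {
        "bool": {
          "filter": [
            { "range": { "@timestamp": { "gte": "now-2d" } } },
            { "term": { "service.name": "US0418" } }
          ]
        }
      }
    }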

BTW I need to triple-check the Group By... the explanation is misleading.

BTW Did you try Logs Threshold?

I need to check the Group By... but won't be able to do that now

@vsabado Looks like the GROUP BY fields are ANDed after all. I am seeing the unique combination of the 2 group bys I have set!!! So you were right!

My test was host.name and kubernetes.namespace.

4587 log entries in the last 5 mins for gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, recommendation. Alert when > 75.
log-test is active.
gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, recommendation - 4587 log entries have matched the following conditions: host.name does not equal vader
1862 log entries in the last 5 mins for gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, ad. Alert when > 75.
log-test is active.
gke-stephen-brown-gke-dev-larger-pool-17282d5a-j2a3, ad - 1862 log entries have matched the following conditions: host.name does not equal vader

That is the same host with 2 different namespaces so we are good to go on that... apologies for the confusion
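
If you want to see those unique combinations directly, a composite aggregation over the same two fields reproduces what the rule groups on, with a document count per combination. A sketch only; the logs-* pattern and the 5-minute window are placeholders matching the test above:

    # Count log entries per unique host.name + kubernetes.namespace combination over the last 5 minutes
    GET logs-*/_search
    {
      "size": 0,
      "query": { "range": { "@timestamp": { "gte": "now-5m" } } },
      "aggs": {
        "per_group": {
          "composite": {
            "size": 100,
            "sources": [
              { "host": { "terms": { "field": "host.name" } } },
              { "namespace": { "terms": { "field": "kubernetes.namespace" } } }
            ]
          }
        }
      }
    }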

Still not sure what is going on with yours.... got to step away.

Good morning Stephen,

Thank you so much for confirming that! Relieved to see that it's functioning as we had hoped. I just set up a log threshold alert. I'll try it now and report back

Got no alert coming from the log alert. But I'm seeing something really funky now. Alerts that really shouldn't be firing are firing.

These are my settings on the Discover page:

I have 9 hits for this in the last hour, which is accurate. However, the metric threshold rule for this exact filter is firing off even though its threshold is > 10.

My Metrics Indices setting is exactly the same.

I'm not sure why I'm seeing a mismatching behavior here

@vsabado

I guess this got lost along the way .... I am all but positive that the

(rum-data-view)*

is a bug and is at least partly related to the inconsistency you are seeing.

Take it out of the data view and those settings.
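
If it still misbehaves after that, one more thing that might narrow it down is breaking the recent document counts down by concrete index, so you can see where the extra hits the rule is counting are actually coming from. A sketch only, assuming a broad wildcard search is acceptable on your cluster (_index is aggregatable like a keyword field):

    # Count last-hour documents per index to see where extra hits might come from
    GET */_search
    {
      "size": 0,
      "query": { "range": { "@timestamp": { "gte": "now-1h" } } },
      "aggs": {
        "by_index": { "terms": { "field": "_index", "size": 20 } }
      }
    }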

Hey @stephenb, I stripped it down to just traces-apm* and still the same result. Works fine on my discover page though with the same data view.

Okay, wanted to give a quick update here. I updated our Elasticsearch deployment to 8.11 so I'm really excited about all the new stuff here. Mainly the custom threshold alert! Playing with that now to see if it works better.

Back to the metric threshold, this is what I have for my data view

Now my rule looks like this:

What I see in the graph, and even in Metrics Explorer, actually looks accurate. That's about what I expected. But the document count is still reported as some large number. Unfortunately, my alerts now fire when they really shouldn't, because the rule is using that count and not what we're seeing here.

Okay, final update on this. I deleted all of my rules and recreated them (exact same settings). They're now firing off properly and reporting the correct document count.

It looks like old rules don't retroactively update their data views when you change the settings under Infrastructure. Seems to be a bug in that case.

The custom threshold is fantastic! Worked perfectly for my case as well.


Good to hear,
Thanks for posting your findings / solutions

Wow 8.11 ... right to the top, this is a BIG release.
Keep an eye out for an 8.11.1 etc., as usually there is a patch release in the next couple of weeks... I would apply that when it comes out.

BTW the guidance I was given is if you see this

(rum-data-view)*

You should remove that. It is a "leftover"
