Creating multi-metric job can only use distinct count on IP

Heya,

Maybe I have misunderstood something when trying to set up my anomaly detection. I want to track an unusual number of messages being sent from a sender IP and then split that by service. But for some reason, in Kibana I can only select distinct count on the sender IP, and not count?

I'm not totally sure why this is; is it an underlying constraint of some kind? The same goes for a few other fields, but those are regular strings/keywords I would like to track on as well.

Hi there,

The Multi-metric job wizard prevents you from selecting fields to "split on" unless they are of type keyword, so you cannot "split" on numerical fields, etc. The Advanced Job wizard doesn't really have these constraints, and as such allows you to "shoot yourself in the foot" if you make an incoherent choice.
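
If you want to double-check how a given field is mapped (and therefore what the wizards will let you do with it), you can ask Elasticsearch for the field's mapping; the index pattern here is just an example:

  # check how the field is mapped (index pattern is a placeholder)
  GET smtp-logs-*/_mapping/field/senderIp

Fields mapped as keyword are the ones the Multi-metric wizard will offer as split candidates.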

Also, it's hard to tell from your description exactly what you're trying to accomplish. Using distinct_count(IP) makes sense, but distinct_count(count)? That doesn't. Please elaborate: describe the fields you wish to analyze, their types, and perhaps a little more detail about the specific use case you are trying to satisfy.

Heya,

What I'm trying to do is detect anomalies in outbound traffic from an SMTP server. So I have a few fields of relevance:

  • senderIp
  • senderDomain
  • senderEmail
  • stateOfDelivery
  • service

So I want to split on a per-service basis and detect anomalies in the rest of the fields: spikes and drops in traffic, or perhaps an unusual number of emails with the state rejected.

Just curious: if you want to split on service, can you give me an idea of the cardinality of this field (dozens, hundreds, thousands, etc.)?

I'll assume that the cardinality of service is only dozens or hundreds (which is totally fine). As such, here are some example job ideas; configure these with the Advanced Job wizard (a sketch of one of them as a full job config follows the list).

  1. "find unusual senderIps (compared to other senderIps) along the dimension of event volume, split on service"
    function: count
    partition_field_name: service
    over_field_name: senderIp
    influencers: senderIp, service

  2. "find unusually high volume split by stateOfDelivery and also partitioned on service (a double split)"
    function: count
    by_field_name: stateOfDelivery
    partition_field_name: service
    influencers: stateOfDelivery, service

  3. "find unusual senderDomain (compared to other senderDomains) along the dimension of distinct senderIps, split on service"
    function: distinct_count
    field_name: senderIp
    over_field_name: senderDomain
    partition_field_name: service
    influencers: service, senderDomain
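
As a rough sketch only, here is approximately what job idea 1 would look like if submitted directly to the ML API instead of being clicked together in the Advanced Job wizard. The job id, bucket span, and time field below are placeholder assumptions you would adjust to your own data:

  # create the anomaly detection job (placeholder id and settings)
  PUT _ml/anomaly_detectors/smtp-unusual-sender-ips
  {
    "description": "unusual senderIps by event volume, split on service",
    "analysis_config": {
      "bucket_span": "15m",
      "detectors": [
        {
          "function": "count",
          "over_field_name": "senderIp",
          "partition_field_name": "service"
        }
      ],
      "influencers": ["senderIp", "service"]
    },
    "data_description": {
      "time_field": "@timestamp"
    }
  }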

Also, there is an assumption that it is okay to search every doc in the index over time. If the search requires filtering, I would suggest creating a Saved Search in Kibana and using that as the basis of the ML job.
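
Equivalently, if you end up defining the job via the API as sketched above, the filter can live directly in the job's datafeed query rather than in a Saved Search. The index pattern and the direction filter below are made-up illustrations of the kind of filter you might apply:

  # attach a filtered datafeed to the job above (names are placeholders)
  PUT _ml/datafeeds/datafeed-smtp-unusual-sender-ips
  {
    "job_id": "smtp-unusual-sender-ips",
    "indices": ["smtp-logs-*"],
    "query": {
      "bool": {
        "filter": [
          { "term": { "direction": "outbound" } }
        ]
      }
    }
  }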

Heya,

Thanks for the great feedback! The cardinality is pretty much as you expressed: a few hundred, up to perhaps two thousand in the future. Indexing every doc, based on the test data I have, has not been a problem, so I'm not too worried about that. Worst case, with regard to the cardinality of service and the data volume, we could partition it.

Stupid question perhaps, but for the examples you posted, do I put them all in a single ML job or create a new job for each detector?

You're welcome - but keep in mind that even my suggestions are just that - suggestions. You certainly should test things out and see if the results that you get are in line with your expectations.

Regarding the multiple-job vs. multiple-detector point, it depends. You can do it either way - but just know that:

  • putting multiple detectors in the same job is more efficient from a data query perspective: the data is only queried once and analyzed N ways, whereas N jobs require N queries against the same raw data (see the sketch after this list).
  • putting multiple detectors in the same job means that the overall score for the job is an aggregate of all detectors, while each detector still operates independently. Thus, the multiple detectors act like a logical "OR" (not an "AND").
  • the results UI may be "busier" with more anomalies as you add more detectors per job.
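
To make the single-job option concrete, ideas 1 and 2 above could share one job simply by listing both detectors in the same analysis_config. As before, the job id and bucket span are placeholder assumptions:

  # one job, two detectors, analyzed from a single pass over the data
  PUT _ml/anomaly_detectors/smtp-traffic-anomalies
  {
    "analysis_config": {
      "bucket_span": "15m",
      "detectors": [
        {
          "function": "count",
          "over_field_name": "senderIp",
          "partition_field_name": "service"
        },
        {
          "function": "count",
          "by_field_name": "stateOfDelivery",
          "partition_field_name": "service"
        }
      ],
      "influencers": ["senderIp", "stateOfDelivery", "service"]
    },
    "data_description": {
      "time_field": "@timestamp"
    }
  }

Both detectors are fed by the same datafeed, which is where the query-once efficiency comes from.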

Hope that helps

Heya,

Well, my plan is to either add webhooks to notify the service owners on anomalies, or to poll the actual results index. That said, I think we should be fine with storing everything in one job and the more efficient query usage.
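
If you do go the polling route, one option (among others) is the get-records API, shown here with the placeholder job id from the earlier sketch and a minimum score threshold:

  # fetch anomaly records scoring 75 or higher, highest first
  GET _ml/anomaly_detectors/smtp-traffic-anomalies/results/records
  {
    "record_score": 75,
    "sort": "record_score",
    "desc": true
  }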

Ok cool - may I suggest the following blogs:
