I'm feeding documents directly from a source index to an ML anomaly detection job, using population analysis for a high-cardinality use case (I don't want to feed aggregated data or use a transform, as it consumes too much time for the cardinality I have).
I use built-in functions in the detector, like sum, avg, count, etc.
Is there any possibility to instruct the ML engine to use a custom function in a detector?
For example, I want to analyze a ratio of two metrics:
per entity, I have metric1 = sum(field1) and metric2 = sum(field2),
and I want to add a new metric3 = ratio(metric1, metric2).
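For context, a minimal sketch of the kind of population job I mean today, with two separate built-in `sum` detectors (the index, field, and entity names are placeholders; as far as I know there is no built-in function that combines the outputs of two detectors into a ratio):

```json
PUT _ml/anomaly_detectors/ratio-question-example
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "sum", "field_name": "field1", "over_field_name": "entity" },
      { "function": "sum", "field_name": "field2", "over_field_name": "entity" }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}
```

Each detector models its metric independently per entity in the population; what I'm asking for is a third detector computed from the other two.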
@richcollier, I perfectly agree with you if I want to feed an aggregation result to an ML job.
But in my case, the term I need to aggregate on is a high-cardinality field, so that approach will not work; that's why I'm feeding raw documents directly and letting the ML engine do the heavy lifting instead of asking a data node to do it.
Imagine I'm aggregating on a 15-minute bucket with millions of entities, like email addresses or IP addresses.
Actually, it is worse for the ML engine to do it, since it is a single process on a single ML node. In contrast, an Elasticsearch aggregation can be broken down and distributed across all data nodes in the cluster.
If you have a high-cardinality field, perhaps the right approach is a two-step process:
1. Use Transforms to do the aggregations (they handle high-cardinality fields) and save the results into a new index. Transforms can run in batch or continuous mode.
2. Have the ML job use the index that is the output of step 1.
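A sketch of step 1, assuming the entity field is `entity` and the two metrics come from `field1`/`field2` (all names here are placeholders). The ratio can even be pre-computed inside the transform with a `bucket_script` pipeline aggregation, so the destination index already contains `metric3` for a simple `metric` job to analyze (returning 0 on a zero denominator is just a placeholder choice):

```json
PUT _transform/entity-ratio-15m
{
  "source": { "index": "source-index" },
  "dest": { "index": "entity-ratio-15m" },
  "pivot": {
    "group_by": {
      "entity": { "terms": { "field": "entity" } },
      "bucket": { "date_histogram": { "field": "@timestamp", "fixed_interval": "15m" } }
    },
    "aggregations": {
      "metric1": { "sum": { "field": "field1" } },
      "metric2": { "sum": { "field": "field2" } },
      "metric3": {
        "bucket_script": {
          "buckets_path": { "m1": "metric1", "m2": "metric2" },
          "script": "params.m2 != 0 ? params.m1 / params.m2 : 0"
        }
      }
    }
  }
}
```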
I don't agree with you, @richcollier, unless you have tried this approach in a use case with more than 10M cardinality.
A continuous transform on high cardinality is the worst case.
Feeding raw documents directly is more performant than going through a transform.
Thank you, @richcollier, for your contribution.
I agree with you if the target is to add a new field to the raw document; I mostly do this at ingest time, and all fields in the raw documents fed to the ML job are indexed for better performance.
But in my case, I instructed the ML engine to do an analysis based on a bucket span of 15 minutes, for example, and the engine computes some metrics using built-in functions like sum, mean, etc. I would like it to also compute a ratio of two of those metrics.
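For completeness, the ingest-time variant looks roughly like the sketch below (field names are placeholders): a script processor adds a per-document `ratio` field that a detector could then analyze with `mean`. But note this is a different quantity from what I'm asking for, since mean(field1/field2) over a bucket is not the same as sum(field1)/sum(field2).

```json
PUT _ingest/pipeline/add-ratio
{
  "processors": [
    {
      "script": {
        "source": "if (ctx.field2 != null && ctx.field2 != 0) { ctx.ratio = (double) ctx.field1 / ctx.field2 }"
      }
    }
  ]
}
```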