[ML] Custom function in anomaly detection job

Dear community,

I'm feeding documents directly from a source index to an ML anomaly detection job, using population analysis for a high-cardinality use case (I don't want to feed aggregated data directly or use a transform, as it consumes too much time for the cardinality I have).

I use built-in functions in the detectors, like sum, avg, count, etc.
Is there any way to instruct the ML engine to use a custom function in a detector?

For example, I want to analyze a ratio of two metrics:
per entity, I have metric1 = sum(field1) and metric2 = sum(field2),
and I want to add a new metric3 = ratio(metric1, metric2).

Thank you.

You should do this with a set of query aggregations in the datafeed of the ML job so that the sums and the ratio are calculated first, then fed to ML.

An older but still relevant example is here: Analyzing a ratio of documents over time with Anomaly Detection

In your case, you'd do two different sum aggregations, then compute the ratio with a bucket_script aggregation.
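A minimal sketch of what the datafeed aggregations could look like, assuming a `@timestamp` time field and the `field1`/`field2` names from your example (the nested max on the time field is required so ML knows the timestamp of each bucket):

```json
{
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "15m"
      },
      "aggregations": {
        "@timestamp": {
          "max": { "field": "@timestamp" }
        },
        "metric1": { "sum": { "field": "field1" } },
        "metric2": { "sum": { "field": "field2" } },
        "metric3": {
          "bucket_script": {
            "buckets_path": { "m1": "metric1", "m2": "metric2" },
            "script": "params.m2 != 0 ? params.m1 / params.m2 : 0"
          }
        }
      }
    }
  }
}
```

For a population analysis you'd additionally need a terms aggregation on the entity field inside the date_histogram, which is where the cost of a high-cardinality field comes in.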

@richcollier I perfectly agree with you if I want to feed an aggregation result to the ML job.
But in my case, the term I'd need to aggregate on is a high-cardinality field, so it will not work for my case. That's why I'm feeding raw documents directly and letting the ML engine do the heavy lifting instead of asking a data node to do it.

Imagine I'm aggregating on 15-minute buckets with millions of entities, like email addresses or IP addresses.

Actually, it is worse for the ML engine to do it since it is a single process on a single ML node. In contrast, an elasticsearch aggregation can be broken down and distributed to all data nodes in the cluster.

If you have a high-cardinality field, perhaps the right approach is a two-step process:

  1. Use Transforms to do the aggregations (they handle high-cardinality fields) and save the results into a new index. They can run in batch or continuous mode. See the sketch after this list.
  2. Have ML use the index that is the output of step 1.
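For illustration, a minimal continuous transform along these lines might look like the following; the index names, the entity field, and the sync delay are assumptions, and bucket_script is one of the pipeline aggregations supported inside a transform's pivot:

```json
PUT _transform/entity_ratio
{
  "source": { "index": "source-index" },
  "dest": { "index": "entity-ratio" },
  "frequency": "5m",
  "sync": { "time": { "field": "@timestamp", "delay": "60s" } },
  "pivot": {
    "group_by": {
      "entity": { "terms": { "field": "entity.keyword" } },
      "time_bucket": {
        "date_histogram": { "field": "@timestamp", "fixed_interval": "15m" }
      }
    },
    "aggregations": {
      "metric1": { "sum": { "field": "field1" } },
      "metric2": { "sum": { "field": "field2" } },
      "metric3": {
        "bucket_script": {
          "buckets_path": { "m1": "metric1", "m2": "metric2" },
          "script": "params.m2 != 0 ? params.m1 / params.m2 : 0"
        }
      }
    }
  }
}
```

The ML job's datafeed would then read from the destination index and a detector could use metric3 directly, e.g. in a population analysis over the entity field.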

I don't agree with you, @richcollier, unless you've tried this approach on a use case with more than 10M cardinality.
A continuous transform on a high-cardinality field is the worst case.
Feeding raw documents directly is more performant than going through a transform.

You could perhaps try using a script_field or a runtime field to calculate the ratio per document during the query for the ML job's datafeed, along the lines of the sketch below.
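As a rough sketch, assuming the `field1`/`field2` names from above and a hypothetical `doc_ratio` field name, a runtime field in the datafeed's runtime_mappings could emit a per-document ratio (note this is a ratio per document, not per bucket, which is the limitation raised in the next reply):

```json
{
  "runtime_mappings": {
    "doc_ratio": {
      "type": "double",
      "script": {
        "source": "if (doc['field1'].size() > 0 && doc['field2'].size() > 0 && doc['field2'].value != 0) { emit((double) doc['field1'].value / doc['field2'].value); }"
      }
    }
  }
}
```

A detector could then apply a built-in function such as mean(doc_ratio) over the entity field.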

Thank you, @richcollier, for your contribution.
I agree with you if the goal is to add a new field to the raw document; I mostly do this at ingest time, and all fields in the raw documents fed to the ML job are indexed for better performance.
But in my case, I've instructed the ML engine to do an analysis over a bucket span of 15 minutes, for example, and the engine computes some metrics using built-in functions like sum, mean, etc. I would like it to also compute a ratio of two of those metrics.
