I'm feeding documents directly from a source index to an ML anomaly detection job, using population analysis for a high-cardinality use case (I don't want to feed aggregated data or use a transform, as it consumes too much time for the cardinality I have).
I use built-in functions in the detector, like sum, avg, count, etc.
Is there any possibility to instruct the ML engine to use a custom function in a detector?
For example, I want to analyze a ratio of two metrics:
per entity, I have metric1 = sum(field1) and metric2 = sum(field2),
and I want to add a new metric3 = ratio(metric1, metric2).
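For context, a minimal sketch of the kind of population job I mean today, with two separate built-in `sum` detectors (the index, field, and entity names are placeholders; as far as I know there is no built-in function that combines the outputs of two detectors into a ratio):

```json
PUT _ml/anomaly_detectors/ratio-question-example
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "sum", "field_name": "field1", "over_field_name": "entity" },
      { "function": "sum", "field_name": "field2", "over_field_name": "entity" }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}
```

Each detector models its metric independently per entity in the population; what I'm asking for is a third detector computed from the other two.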
@richcollier, I perfectly agree with you if I want to feed an aggregation result to an ML job.
But in my case, the term I need to aggregate on is a high-cardinality field, so that approach will not work; that's why I'm feeding raw documents directly and letting the ML engine do the heavy lifting instead of asking a data node to do it.
Imagine I'm aggregating on a 15-minute bucket with millions of entities, like email addresses or IP addresses.
Actually, it is worse for the ML engine to do it, since it is a single process on a single ML node. In contrast, an Elasticsearch aggregation can be broken down and distributed across all data nodes in the cluster.
If you have a high-cardinality field, perhaps the right approach is a two-step process:
1. Use Transforms to do the aggregations (they handle high-cardinality fields) and save the results into a new index. Transforms can run in batch or continuous mode.
2. Have the ML job use the index that is the output of step 1.
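A sketch of step 1, assuming the entity field is `entity` and the two metrics come from `field1`/`field2` (all names here are placeholders). The ratio can even be pre-computed inside the transform with a `bucket_script` pipeline aggregation, so the destination index already contains `metric3` for a simple `metric` job to analyze (returning 0 on a zero denominator is just a placeholder choice):

```json
PUT _transform/entity-ratio-15m
{
  "source": { "index": "source-index" },
  "dest": { "index": "entity-ratio-15m" },
  "pivot": {
    "group_by": {
      "entity": { "terms": { "field": "entity" } },
      "bucket": { "date_histogram": { "field": "@timestamp", "fixed_interval": "15m" } }
    },
    "aggregations": {
      "metric1": { "sum": { "field": "field1" } },
      "metric2": { "sum": { "field": "field2" } },
      "metric3": {
        "bucket_script": {
          "buckets_path": { "m1": "metric1", "m2": "metric2" },
          "script": "params.m2 != 0 ? params.m1 / params.m2 : 0"
        }
      }
    }
  }
}
```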
I don't agree with you, @richcollier, unless you have tried this approach in a use case with more than 10M cardinality.
A continuous transform on high cardinality is the worst case.
Feeding raw documents directly is more performant than going through a transform.
Thank you, @richcollier, for your contribution.
I agree with you if the target is to add a new field to the raw document; I mostly do this at ingest time, and all fields in the raw documents fed to the ML job are indexed for better performance.
But in my case, I instructed the ML engine to do an analysis based on a bucket span of 15 minutes, for example, and the engine computes some metrics using built-in functions like sum, mean, etc. I would like it to also compute a ratio of two of those metrics.
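For completeness, the ingest-time variant looks roughly like the sketch below (field names are placeholders): a script processor adds a per-document `ratio` field that a detector could then analyze with `mean`. But note this is a different quantity from what I'm asking for, since mean(field1/field2) over a bucket is not the same as sum(field1)/sum(field2).

```json
PUT _ingest/pipeline/add-ratio
{
  "processors": [
    {
      "script": {
        "source": "if (ctx.field2 != null && ctx.field2 != 0) { ctx.ratio = (double) ctx.field1 / ctx.field2 }"
      }
    }
  ]
}
```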