Kibana/Elastic version 7.5.2. I recently set up a new ML job in Kibana, and what I want to do is track individual error categories over time, broken down by PCR (similar to host), over the population of host endpoints (BINs). I think I got the configuration correct, but it seems that while the job breaks P…

Hi gatling822, The short answer is "no". The ML job will treat the array as an atomic entity. To get what you want, you'd have to split the category values into their own distinct documents when you index them.
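For illustration only, here is a minimal Python sketch of what "splitting the category values into their own distinct documents" could look like before indexing. The field name `categories` and the helper `split_by_category` are assumptions made for this example; they are not taken from the poster's actual pipeline.

```python
from typing import Any, Dict, Iterator


def split_by_category(doc: Dict[str, Any], field: str = "categories") -> Iterator[Dict[str, Any]]:
    """Yield one copy of `doc` per value in the array stored under `field`.

    `field` is a hypothetical name; adjust it to the real mapping.
    """
    values = doc.get(field)
    if not isinstance(values, list):
        # Nothing to split: pass the document through unchanged.
        yield dict(doc)
        return
    for value in values:
        flat = dict(doc)      # shallow copy of the original document
        flat[field] = value   # replace the array with a single value
        yield flat


# Example: one transaction carrying two category codes becomes two documents.
example = {"acquirerPCR": "X", "categories": [5, 27]}
flattened = list(split_by_category(example))
```

Note that this multiplies the document count by the number of categories per transaction, which is the cost raised in the next reply.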

Ok, that is less than ideal. We get 500k+ transactions per minute, and I wouldn't want to duplicate that data for however many categories there are. Is there any other way to do this? Or am I misunderstanding your statement?

I cannot think of any possible workaround - but I don't know what your raw data looks like. What does one document of this index look like?

There are 180+ fields, but this gives a good idea of what I am working with. "_source": { "approved": false, "timestamp": 1586891700, "transactionTime": 1586891700000, "logicalRecordID": 5006, "vipID": "A", "cpcID": "0", "systemID": 3, "destinationVipID": "0", "sourceVip…

Ok - thanks for that. Selecting categories as a by_field_name indeed doesn't make sense in this use case, because there's no concept of a count per category code anyway. There is no way for you to know, for example, that there were x number of categories:5 and y number of categories:27, and so on. T…

To your first point, you are correct: there is no count per category code, but we would still like to know which specific category code is most influential in the anomaly. Would it make sense to create separate ML jobs that feed only data of a certain category to the job (filter where category : 5 )…

gatling822: Would it make sense to create a separate ML jobs that feed only data of a certain category to the job (filter where category : 5 ) and use the job itself to say what category is potentially causing an issue @richcollier Would this be a valid approach getting over this issu…

Well, that could be a solution, but I do wonder how many unique categories there are and therefore how many distinct ML jobs you'll need to cover the use case. Also, sounds like you are doing the right thing by choosing population analysis if the cardinality of BINS is that high. I hope that all BI…

richcollier: Well, that could be a solution, but I do wonder how many unique categories there are and therefore how many distinct ML jobs you'll need to cover the use case. I would say that only a subset of the categories are actually interesting to monitor for ML but it would result …
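As a rough sketch of the one-job-per-category idea discussed above, the datafeed for each job could carry a `term` query that only passes documents containing that category code. The index pattern `transactions-*`, the job ID scheme, and the field name `categories` are all assumptions for illustration; the category codes 5 and 27 are just the values mentioned earlier in the thread.

```python
import json
from typing import Any, Dict, List, Tuple


def datafeed_for_category(category: int, index: str = "transactions-*") -> Tuple[str, Dict[str, Any]]:
    """Return (datafeed_id, request body) for one category-specific job.

    The IDs, index pattern, and field name are hypothetical.
    """
    job_id = f"pcr-errors-category-{category}"
    body = {
        "job_id": job_id,
        "indices": [index],
        # A term query matches a document if ANY value in the array equals `category`.
        "query": {"term": {"categories": category}},
    }
    return f"datafeed-{job_id}", body


# Only the subset of categories considered worth monitoring.
interesting_categories: List[int] = [5, 27]
datafeeds = [datafeed_for_category(c) for c in interesting_categories]

for datafeed_id, body in datafeeds:
    print(datafeed_id)
    print(json.dumps(body, indent=2))
```

Each body would be sent when creating the corresponding datafeed. The number of jobs grows with the number of categories monitored, which is the scaling concern raised above.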

Did you ever consider simply modeling the PCRs independently, then? That is, just use count with partition_field_name = acquirerPCR. You could still use acquirerBIN as an influencer, but not be burdened by trying to model over it.
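A minimal sketch of what that suggestion might look like as an anomaly detection job body, written here as a Python dict. The `bucket_span` of 15m, the description, and the choice of `timestamp` as the time field are assumptions; only `acquirerPCR` and `acquirerBIN` come from the thread.

```python
import json

# Hypothetical job body: count of events, modeled separately per acquirerPCR.
job_config = {
    "description": "Event count per acquirerPCR (illustrative sketch)",
    "analysis_config": {
        "bucket_span": "15m",  # assumed value, tune to the data
        "detectors": [
            {
                "function": "count",
                # One baseline model per PCR; no over_field_name, so no population analysis.
                "partition_field_name": "acquirerPCR",
            }
        ],
        # acquirerBIN is not modeled, but can still be reported as an influencer.
        "influencers": ["acquirerPCR", "acquirerBIN"],
    },
    "data_description": {"time_field": "timestamp"},  # assumed time field
}

print(json.dumps(job_config, indent=2))
```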

ML doesn't track "array" type data using by_field_name individually


richcollier (rich collier) April 18, 2020, 1:34pm 13

Both partition_field_name and by_field_name cause a "split" in the analysis (each creates separate baseline statistical models). Either can be used, or both can be used together to effectively create a "double-split".

Some nuances of their differences are covered in the topic "ML What is the difference between by_field_name and partition_field_name".
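To make the idea of a "split" concrete, here is a toy Python illustration that counts events per split key. It is not how the ML engine models data internally; it only shows that each distinct partition value (or each partition/by combination in a "double-split") gets its own series. The field `responseCode` and the sample events are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Optional, Tuple

Key = Tuple[Optional[Any], Optional[Any]]


def split_counts(
    events: List[Dict[str, Any]],
    partition_field: Optional[str] = None,
    by_field: Optional[str] = None,
) -> Counter:
    """Count events per (partition value, by value) key; unused fields map to None."""
    counts: Counter = Counter()
    for event in events:
        key: Key = (
            event.get(partition_field) if partition_field else None,
            event.get(by_field) if by_field else None,
        )
        counts[key] += 1
    return counts


# Hypothetical sample events.
events = [
    {"acquirerPCR": "A", "responseCode": "05"},
    {"acquirerPCR": "A", "responseCode": "51"},
    {"acquirerPCR": "B", "responseCode": "05"},
    {"acquirerPCR": "A", "responseCode": "05"},
]

single_split = split_counts(events, partition_field="acquirerPCR")
double_split = split_counts(events, partition_field="acquirerPCR", by_field="responseCode")
```

With the sample above, the single split produces two series (one per PCR), while the double split produces three (one per PCR and response code combination that occurs).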
