ML doesn't track "array" type data individually when used as by_field_name

Kibana/Elastic version 7.5.2

I recently set up a new ML job in Kibana and what I want to do is track individual error categories over time broken down by PCR (similar to host) over the population of Host Endpoints (BINs).

I think I got the configuration correct, but while the job breaks PCRs/BINs down as I would expect, it treats the error category "array" type as a whole keyword (5,27,88) rather than splitting each category in that array into its own bucket. In our data each record can have 1 or more error categories, but it does not seem like I can track those individually in the ML job.

Does anyone know how to do this?

"analysis_config": {
    "bucket_span": "1m",
    "detectors": [
      {
        "detector_description": "High Count by Categories over Acquirer BIN partitioned on Acquirer PCR",
        "function": "high_count",
        "by_field_name": "categories",
        "over_field_name": "acquirerBIN",
        "partition_field_name": "acquirerPCR",
        "detector_index": 0
      }
    ],
    "influencers": [
      "acquirerBIN",
      "acquirerPCR",
      "categories"
    ]
  },

Hi gatling822,

The short answer is "no". The ML job will treat the array as an atomic entity. To get what you want, you'd have to split the category values into their own distinct documents when you index them.
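To illustrate what that re-indexing would mean (a sketch using abridged fields from your data, with a hypothetical singular "category" field), one source document would be split into one document per array element:

Original document (abridged):

    {
      "acquirerPCR": "9002",
      "acquirerBIN": "315787",
      "categories": [5, 27, 81]
    }

Re-indexed as three documents:

    { "acquirerPCR": "9002", "acquirerBIN": "315787", "category": 5 }
    { "acquirerPCR": "9002", "acquirerBIN": "315787", "category": 27 }
    { "acquirerPCR": "9002", "acquirerBIN": "315787", "category": 81 }

With that structure, the singular "category" field could serve as the by_field_name, since each document then represents a single category occurrence that can be counted.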

Ok, that is less than ideal. We get 500k+ transactions per minute, so I wouldn't want to "dup" that data for however many categories there are.

Is there any other way to do this? Or am I misunderstanding your statement?

I cannot think of any possible workaround - but I don't know what your raw data looks like. What does one document of this index look like?

There are 180+ fields, but this gives a good idea of what I am working with.

"_source": {
    "approved": false,
    "timestamp": 1586891700,
    "transactionTime": 1586891700000,
    "logicalRecordID": 5006,
    "vipID": "A",
    "cpcID": "0",
    "systemID": 3,
    "destinationVipID": "0",
    "sourceVipID": "0",
    "networkID": "3",
    "acquirerInstituteID": "315787",
    "transactionID": 300105692991384,
    "retrivalReferenceNumber": "10515878145",
    "gblProdID": "P",
    "acquirerPCR": "9002",
    "acquirerBIN": "315787",
    "authorizingPCR": "9007",
    "authorizingBIN": "407535",
    "sourceStationPCR": "9007",
    "destinationStationPCR": "9002",
    "acquirerBID": "10000056",
    "issuerBID": "10017172",
    "issuerLicenseeBIN": "407535",
    "sourceStationID": "874819",
    "destinationStationID": "762813",
    "isoBin": "407535",
    "messageTypeCode": "0210",
    "responseCode": "51",
    "processingCode": "004000",
    "merchantCategoryCode": "5411",
    "posConditionCode": "0",
    "responseSource": "5",
    "posEntryModeCode": "090",
    "cavvRsltCode": " ",
    "evsCalloutCode": "0",
    "dcvvVipResultCode": " ",
    "fraudTransTypeCode": "4",
    "returnIndicator": false,
    "rejectIndicator": false,
    "atrIndicator": false,
    "ltrIndicator": false,
    "stipIndicator": false,
    "dorIndicator": true,
    "usdAuthnAmt": 72.8,
    "txnClasses": [
      0,
      1
    ],
    "categories": [
      5,
      27,
      81
    ]
  }

Ok - thanks for that. Selecting categories as a by_field_name indeed doesn't make sense in this use case, because there's no concept of a count per category code anyway. There is no way for you to know, for example, that there were x number of categories:5 and y number of categories:27, and so on. That's not possible. You're only able to count documents per acquirerBIN or per acquirerPCR.

You are also using over_field_name, which implies you want to do a population analysis and compare the count of acquirerBINs against each other, and not against themselves over time. Is that your intention?

To your first point, you are correct that there is no count per category code, but we still would like to know which specific category code is most influential in the anomaly. Would it make sense to create separate ML jobs that each feed only data of a certain category to the job (filter where category: 5) and use the job itself to say which category is potentially causing an issue?
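As a sketch of that per-category filtering (assuming one job per monitored category), the datafeed for a "category 5" job could restrict its input with a term query; a term query against an array field matches documents whose array contains that value:

    "datafeed_config": {
      "query": {
        "term": {
          "categories": 5
        }
      }
    }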

I think we do want to use population analysis, as acquirerBIN values come and go somewhat regularly and we want to report outlier BINs for a given acquirerPCR. Also, the number of BINs is rather large (>100,000), so this may also be the most efficient approach.

@richcollier Would this be a valid approach for getting over this issue?

Well, that could be a solution, but I do wonder how many unique categories there are and therefore how many distinct ML jobs you'll need to cover the use case.

Also, it sounds like you are doing the right thing by choosing population analysis if the cardinality of BINs is that high. I hope, however, that all BINs behave mostly homogeneously. If they don't, those that are routinely different from the others will constantly be flagged as anomalous.

I would say that only a subset of the categories are actually interesting to monitor with ML, but it would result in <30 ML jobs if we went that route.

That is a good point. BINs are somewhat independent of each other, that is to say a PCR (host) can use the BINs (an access point) however they like; ideally they would be round-robining messages. But let's assume that they are used independently: would it then make more sense to model them separately and not as a population? Does the high cardinality pose a big issue for that type of use case?

Did you ever consider just modeling the PCRs independently, then? i.e. just count with partition_field_name=acquirerPCR? You could still use acquirerBIN as an influencer, but not be burdened by trying to model over it.
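A minimal sketch of that simpler configuration, reusing the field names from the original job (no by or over field, just a partition):

    "analysis_config": {
      "bucket_span": "1m",
      "detectors": [
        {
          "detector_description": "Count partitioned on Acquirer PCR",
          "function": "count",
          "partition_field_name": "acquirerPCR"
        }
      ],
      "influencers": [
        "acquirerPCR",
        "acquirerBIN",
        "categories"
      ]
    }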

How does simple partitioning without a "by" field affect the job? I thought that you had to have a by field.

This is something we have considered. We have a higher-level concept of Processing Center, and each PCR has a connection to 1 or more processing centers. Processing center has pretty small cardinality (<20), so that could be the partition, and we could then add PCRs and maybe BINs as influencers to help narrow down issues.

Both partition_field_name and by_field_name cause a "split" to the analysis (creates separate baseline statistical models). Either can be used (or both can be used to effectively create a "double-split").

Some nuances to their differences are covered in the forum topic "What is the difference between by_field_name and partition_field_name".
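For reference, a double-split detector using both fields might look like this (a sketch reusing the field names from the original job):

    {
      "function": "high_count",
      "by_field_name": "acquirerBIN",
      "partition_field_name": "acquirerPCR"
    }

This models a separate baseline per acquirerBIN within each acquirerPCR, without the population comparison implied by over_field_name.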


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.