How to get ML Categorization results broken down by users

Hi!

I have an input index with tons of different log messages coming from many different devices (same product type = similar logs, different users). My source index is broken down into: timestamp, serial_number, message_data. When I do the ML categorization, I get 1536 different mlcategories across all message_data. What I would like to achieve is the num_matches per mlcategory per serial_number for a specific time range. What I have right now as output is num_matches per mlcategory, but across all serial_numbers. I can't figure out how to break it down by serial_number.

To sum up, this is what I want to achieve:

  1. Build a categorization ML job on message_data. With that, I get approx. 1500 mlcategories, so each message_data can be mapped to one of the 1500 grok patterns (each mlcategory = one grok pattern).
  2. Get the total count of each mlcategory (the grok pattern) for each SN per time bucket, instead of the count across all SNs.
  3. Finally, get the total count over a specific time range.

Note: I also do not have log format documentation for these devices and am hoping ML categorization can help us. A solution that was recommended was pipeline-to-pipeline communication using the grok patterns generated by the categorization job. Of course, the grok patterns are a bit much and can add to the processing time of every log that goes through the ingest pipeline.

In your detector you need to set partition_field_name to serial_number and by_field_name to mlcategory.

So for example you might create the job like this:

PUT _ml/anomaly_detectors/categories_per_serial_number?pretty
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "15m",
    "detectors": [
      {
        "detector_description": "Count of category partitioned on serial number",
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "serial_number"
      }
    ],
    "influencers": [ "serial_number", "mlcategory" ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "analysis_limits": {
    "model_memory_limit": "1gb"
  },
  "results_index_name": "cat-by-sn"
}

Depending on which UI wizard you're using, you might need to "convert to an advanced job" and edit the JSON directly to achieve this. Or, of course, you can create the job using the Elasticsearch API directly.

Hi !

Thanks for your answer.

But if I do that, each serial_number will have independent categories (no mlcategories in common). I tried it and was left with 3M different categories (1500 categories x 2000 different SNs).
What I would like is to create a "universal" categorization across all serial_numbers' messages (because the serial_numbers have similar log messages, since they come from the same product), and then get the count of occurrences of each mlcategory for each serial_number.

To do this you need to make sure per_partition_categorization.enabled is false. Search for per_partition_categorization in Create anomaly detection jobs API | Elasticsearch Guide [8.11] | Elastic for details. But false is the default, so it's surprising you're getting 3 million categories if you didn't know about that setting.
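
For reference, here is where that setting would sit in the job configuration. This is a minimal sketch based on the example job above, with the documented per_partition_categorization object spelled out explicitly (omitting it gives the same behaviour, since false is the default):

PUT _ml/anomaly_detectors/categories_per_serial_number
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message_data",
    "per_partition_categorization": {
      "enabled": false
    },
    "detectors": [
      {
        "detector_description": "Count of category partitioned on serial number",
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "serial_number"
      }
    ],
    "influencers": [ "serial_number", "mlcategory" ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  }
}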

Thank you very much! I did the ML job and it works. However, I see that some categories (from result_type = category_definition) are not present in the record results (result_type = record). Let's say I take mlcategory = 2; in category_definition I see the regex and num_matches = 5M. When I look for a document with result_type = record and mlcategory = 2, I have no results. I should have at least one, since I have 5M matches with that regex in my data.

My question is: is it normal to have categories defined in category_definition but not present in the records? Do the record results hold all the results, or do I have to look somewhere else to get the number of occurrences of each mlcategory per serial_number?

Thanks

Yes, this is normal. Records are only created when there's an anomaly.

Anomaly detection jobs are for detecting anomalies. It sounds like you want counts per category per time bucket even when there are no anomalies. You might be able to do this using an aggregation that has a categorize_text aggregation to create the categories, a date_histogram aggregation to split into time buckets and a terms aggregation to split by serial_number. Then the doc_count at the lowest level would be the numbers you want.

Try something like:

POST message-logs-*/_search
{
  "size": 0, 
  "aggs" : {
    "daily" : {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1d"
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message_data",
            "similarity_threshold" : 20
          },
          "aggs": {
            "by_sn": {
              "terms" : {
                "field": "serial_number.keyword"
              }
            }
          }
        }
      }
    }
  }
}

Thanks! But it is still not giving me any results... with this query:

POST message-logs-*/_search
{
  "size":0, 
  "aggs" : {
    "daily" : {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d"
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message_data",
            "similarity_threshold" : 20
          },
          "aggs": {
            "by_sn": {
              "terms" : {
                "field": "serial_number.keyword"
              }
            }
          }
        }
      }
    }
  }
}

I have this output :

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 0,
    "successful" : 0,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

The query in my last post works on an index that I have in my cluster, so I know the syntax is not incorrect. There must be something simple that is wrong in your setup (e.g. is your time field not really @timestamp but something else, like timestamp?).
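
Two quick checks can narrow this down (hedged suggestions on my part; the index pattern and field names below are just the ones from your query). The _shards.total of 0 in your response hints that message-logs-* may not match any index at all, and the field caps API will show what your time field is actually called:

GET _cat/indices/message-logs-*?v

GET message-logs-*/_field_caps?fields=@timestamp,timestamp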

Figure out where the problem is by simplifying the query. First do just the date_histogram part:

POST message-logs-*/_search
{
  "size": 0,
  "aggs": {
    "daily": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d"
      }
    }
  }
}

If that works, then add the sub aggregation of categorize_text:

POST message-logs-*/_search
{
  "size": 0, 
  "aggs" : {
    "daily" : {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d"
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message_data",
            "similarity_threshold" : 20
          }
        }
      }
    }
  }
}

...and see if that works, and so on. You'll find out which part is not working for you!
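
If the date_histogram and categorize_text steps both work, the remaining piece to test in isolation is the terms aggregation; for example, serial_number might not have a .keyword sub-field in your mapping. A minimal sketch:

POST message-logs-*/_search
{
  "size": 0,
  "aggs": {
    "by_sn": {
      "terms": {
        "field": "serial_number.keyword"
      }
    }
  }
}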
