How to get ML Categorization results broken down by users

Hi!

I have an input index with tons of different log messages coming from many different devices (same product type = similar logs, different users). My source index is broken down into: timestamp, serial_number, message_data. When I do the ML categorization, I get 1536 different mlcategories across all message_data. What I would like to achieve is the num_matches per mlcategory per serial_number for a specific time range. What I have right now as output is num_matches per mlcategory, but across all serial_numbers. I can't figure out how to break it down by serial_number.

To sum up, this is what I want to achieve:

  1. Build a categorization ML job on message_data. With that, I get approx. 1500 mlcategories, so each message_data can be mapped to one of the 1500 grok patterns (each mlcategory = one grok pattern).
  2. Get the total count of each mlcategory (the grok pattern) for each SN per time bucket, instead of the count across all SNs.
  3. Finally, get the total count over a specific time range.

Note: I also do not have log format documentation for these devices and am hoping ML categorization can help us. A solution that was recommended was pipeline-to-pipeline communication using the grok patterns generated by the categorization job. Of course, the grok patterns are a bit much and can add to the processing time of every log that goes through the ingest pipeline.

In your detector you need to set partition_field_name to serial_number and by_field_name to mlcategory.

So for example you might create the job like this:

PUT _ml/anomaly_detectors/categories_per_serial_number?pretty
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "15m",
    "detectors": [
      {
        "detector_description": "Count of category partitioned on serial number",
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "serial_number"
      }
    ],
    "influencers": [ "serial_number", "mlcategory" ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "analysis_limits": {
    "model_memory_limit": "1gb"
  },
  "results_index_name": "cat-by-sn"
}

Depending on which UI wizard you're using, you might need to "convert to an advanced job" and edit the JSON directly to achieve this. Or, of course, you can create the job using the Elasticsearch API directly.

Hi !

Thanks for your answer.

But if I do that, each serial_number will have independent categories (no mlcategories in common). I tried it and was left with 3M different categories (1500 categories x 2000 different SNs).
What I would like is to create a "universal" categorization across all serial_numbers' messages (because the serial_numbers have similar log messages, since they come from the same product), and then get the count of occurrences of each mlcategory for each serial_number.

To do this you need to make sure per_partition_categorization.enabled is false. Search for per_partition_categorization in Create anomaly detection jobs API | Elasticsearch Guide [8.11] | Elastic for details. But false is the default, so it's surprising you're getting 3 million categories if you didn't know about that setting.
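
For reference, here is where that setting would sit in the job configuration. This is a minimal sketch based on the example job above, with the documented per_partition_categorization object spelled out explicitly (omitting it gives the same behaviour, since false is the default):

PUT _ml/anomaly_detectors/categories_per_serial_number
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message_data",
    "per_partition_categorization": {
      "enabled": false
    },
    "detectors": [
      {
        "detector_description": "Count of category partitioned on serial number",
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "serial_number"
      }
    ],
    "influencers": [ "serial_number", "mlcategory" ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  }
}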

Thank you very much! I did the ML job and it works. However, I see that some categories (from result_type = category_definition) are not present in the record results (result_type = record). Let's say I take mlcategory = 2; in category_definition I see the regex and num_matches = 5M. When I look for a document with result_type = record and mlcategory = 2, I have no results. I should have at least one, since I have 5M matches with that regex in my data.

My question is: is it normal to have categories defined in category_definition but not present in the records? Do the record results hold all the results, or do I have to look somewhere else to get the number of occurrences of each mlcategory per serial_number?

Thanks

Yes, this is normal. Records are only created when there's an anomaly.

Anomaly detection jobs are for detecting anomalies. It sounds like you want counts per category per time bucket even when there are no anomalies. You might be able to do this using an aggregation that has a categorize_text aggregation to create the categories, a date_histogram aggregation to split into time buckets and a terms aggregation to split by serial_number. Then the doc_count at the lowest level would be the numbers you want.

Try something like:

POST message-logs-*/_search
{
  "size": 0, 
  "aggs" : {
    "daily" : {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1d"
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message_data",
            "similarity_threshold" : 20
          },
          "aggs": {
            "by_sn": {
              "terms" : {
                "field": "serial_number.keyword"
              }
            }
          }
        }
      }
    }
  }
}

Thanks! But it is still not giving me any results... with this query:

POST message-logs-*/_search
{
  "size":0, 
  "aggs" : {
    "daily" : {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d"
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message_data",
            "similarity_threshold" : 20
          },
          "aggs": {
            "by_sn": {
              "terms" : {
                "field": "serial_number.keyword"
              }
            }
          }
        }
      }
    }
  }
}

I have this output :

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 0,
    "successful" : 0,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

The query in my last post works on an index that I have in my cluster, so I know the syntax is not incorrect. There must be something simple that is wrong in your setup (e.g. is your time field not really @timestamp but something else, like timestamp?).
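
Two quick checks can narrow this down (hedged suggestions on my part; the index pattern and field names below are just the ones from your query). The _shards.total of 0 in your response hints that message-logs-* may not match any index at all, and the field caps API will show what your time field is actually called:

GET _cat/indices/message-logs-*?v

GET message-logs-*/_field_caps?fields=@timestamp,timestamp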

Figure out where the problem is by simplifying the query. First do just the date_histogram part:

POST message-logs-*/_search
{
  "size": 0,
  "aggs": {
    "daily": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d"
      }
    }
  }
}

If that works, then add the sub aggregation of categorize_text:

POST message-logs-*/_search
{
  "size": 0, 
  "aggs" : {
    "daily" : {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d"
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message_data",
            "similarity_threshold" : 20
          }
        }
      }
    }
  }
}

...and see if that works, and so on. You'll find out which part is not working for you!
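
If the date_histogram and categorize_text steps both work, the remaining piece to test in isolation is the terms aggregation; for example, serial_number might not have a .keyword sub-field in your mapping. A minimal sketch:

POST message-logs-*/_search
{
  "size": 0,
  "aggs": {
    "by_sn": {
      "terms": {
        "field": "serial_number.keyword"
      }
    }
  }
}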
