ML Kibana: difference between by_field_name and partition_field_name

  1. Yes, the numbers are more of an "order of magnitude" estimate. You can certainly run jobs with 100,000+ partitions if you're willing to provide the memory headroom. However, keep in mind that 1 job is tied to 1 ML node, so you'll never get horizontal scalability if you have just 1 massive job instead of many smaller jobs.

  2. In general, the ML job has a concept of when a thing first happens, which I'll call the "dawn of time". When something reaches its dawn of time (i.e. the first time the ML job sees data for host=X or error_code=Y), one of two situations applies:

  • That new entity is seen as "novel" and that, in itself, is notable and potentially worthy of being flagged as anomalous. To do that, you need to have your "dawn of time" be when the job starts.
  • That new entity is just part of the normal "expansion" of the data - perhaps a new server was added to the mix or a new product_id was added to the catalog. In this case, just start modeling that new entity and don't make a fuss about it showing up. To do that, you need to have the "dawn of time" be when that entity first shows up.

When you split the analysis using by_field_name, the dawn of time is when the ML job was started. When you split using partition_field_name, the dawn of time is when that partition first showed up in the data. As such, you will get different results depending on which way you split whenever something "new" comes along.
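
To make the two options concrete, here is a minimal sketch of what the two job bodies could look like, written as Python dicts that serialize to the JSON you would send when creating an anomaly detection job. The split field host comes from the example above. The count function, the 15m bucket_span, the @timestamp time field, and the build_job helper are assumptions chosen for illustration only, not taken from your setup.

```python
import json

# Hypothetical helper: builds a job body with one count detector,
# split either by by_field_name or by partition_field_name.
def build_job(split_key: str, split_field: str) -> dict:
    if split_key not in ("by_field_name", "partition_field_name"):
        raise ValueError(f"unsupported split key: {split_key}")
    return {
        "description": f"count split using {split_key}",
        "analysis_config": {
            "bucket_span": "15m",  # assumed value, tune for your data
            "detectors": [
                # The only difference between the two jobs is this key.
                {"function": "count", split_key: split_field}
            ],
        },
        "data_description": {"time_field": "@timestamp"},  # assumed field name
    }

# Dawn of time = when the job started: a brand new host can itself be flagged.
by_job = build_job("by_field_name", "host")

# Dawn of time = when each host first appears: a new host is simply modeled.
partition_job = build_job("partition_field_name", "host")

print(json.dumps(by_job, indent=2))
print(json.dumps(partition_job, indent=2))
```

Everything else in the two jobs is identical, so any difference in results for a newly appearing host comes from that one detector key.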
