I am confused by the terminology and would like to clarify it, with an example if possible.
I'd like to use the time_of_week function, and I don't understand the difference between the by_field_name and partition_field_name options (according to the documentation, both seem to be possible).
Let's say I have logs of users and want to understand each user's individual time-of-week behavior pattern with respect to that user's own previous history.
Which option should I use: by_field_name or partition_field_name?
And what would be the difference if I used the other option?
Thank you!
P.S.
Version: Kibana 7.2.0, Platinum license.
P.P.S.
Extracts from the documentation follow. To me, splitting and segmentation sound like synonyms. If I split the data by user, wouldn't that be an independent analysis for each individual user? Similarly, if I segment the data by user, wouldn't that give a separate dataset with a separate baseline for each individual user? (I've sketched the two configurations I'm comparing below, after the extracts.)
by_field_name
(string) The field used to split the data. In particular, this property is used for analyzing the splits with respect to their own history. It is used for finding unusual values in the context of the split.
partition_field_name
(string) The field used to segment the analysis. When you use this property, you have completely independent baselines for each value of this field.
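For concreteness, here is roughly what I mean, as two alternative job sketches (just sketches: the job IDs, index fields like user and @timestamp, and the bucket span are my own placeholders, not taken from the docs):

```
# Variant A: split the time_of_week analysis with by_field_name
PUT _ml/anomaly_detectors/user_time_by
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "time_of_week", "by_field_name": "user" }
    ],
    "influencers": [ "user" ]
  },
  "data_description": { "time_field": "@timestamp" }
}

# Variant B: split the time_of_week analysis with partition_field_name
PUT _ml/anomaly_detectors/user_time_partition
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "time_of_week", "partition_field_name": "user" }
    ],
    "influencers": [ "user" ]
  },
  "data_description": { "time_field": "@timestamp" }
}
```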
@richcollier, thank you for your prompt and detailed reply!
Are the numbers provided for distinct values per job "hard" limits or more of an order-of-magnitude estimate? Basically, can we scale the cluster to have 10x more memory and thus be able to analyze 100,000 distinct entities in the "hard split" analysis?
Is there any documentation that clarifies in detail how exactly "scoring considers the history of other by-fields"? In your example, host is an entity (hence, a "hard" split) and error_code is an attribute (hence, a "soft" split). I created a job with two independent detectors: by host and partition=host. Most of the time they return similar results (anomalies). However, that's not always the case: partition=host detects some critical anomalies that the by host detector misses. I'd like to understand why that is.
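For reference, the two-detector job I mentioned looks roughly like this (a sketch only; the job ID, bucket span, and the count function are placeholders I'm using for illustration):

```
PUT _ml/anomaly_detectors/host_behavior
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "count", "by_field_name": "host" },
      { "function": "count", "partition_field_name": "host" }
    ],
    "influencers": [ "host" ]
  },
  "data_description": { "time_field": "@timestamp" }
}
```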
Yes, the numbers are more of an "order of magnitude" estimate. You can certainly get jobs with 100,000+ partitions if you're willing to have the memory headroom. However, keep in mind that one job is tied to one ML node, so you'll never get horizontal scalability if you have one massive job instead of many smaller jobs.
In general, there is a concept in the ML job of when a thing first happens, which I'll call the "dawn of time". When the dawn of time for something occurs (i.e. the first time the ML job sees data for host=X or error_code=Y), there are two possible situations:
That new entity is seen as "novel" and that, in itself, is notable and potentially worthy of being flagged as anomalous. To do that, you need to have your "dawn of time" be when the job starts.
That new entity is just part of the normal "expansion" of the data - perhaps a new server was added to the mix or a new product_id was added to the catalog. In this case, you just start modeling that new entity and don't make a fuss about it showing up. To do that, you need the "dawn of time" to be when that entity first shows up.
When the analysis is split using by_field_name, the dawn of time is when the ML job was started; when it is split using partition_field_name, the dawn of time is when that partition first showed up in the data. As such, you will get different results between the two ways of splitting in a situation where something "new" comes along.
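If you want to see exactly where the two detectors diverge, one way (just a sketch, assuming the two-detector host_behavior job above and an example score threshold) is to pull the anomaly records and compare them by detector:

```
GET _ml/anomaly_detectors/host_behavior/results/records
{
  "record_score": 75,
  "sort": "record_score",
  "desc": true
}
```

Each returned record carries a detector_index, so you can line up which anomalies were flagged only by the by host detector versus only by the partition=host detector.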