ML: difference between partition_field_name and by_field_name in a population job?

shabnam · October 27, 2021, 12:04pm

Hi,
I want to know the difference between these two advanced ML jobs:
a. func(x), by_field(y), over_field(z), partition_field(-)
b. func(x), by_field(-), over_field(z), partition_field(y)

I performed two tests:

1. I created two jobs with the above configurations and let them learn for about 2.5 months with func(x) = 200 (constant). Then I sent them events like before, plus events with new y and z field values and func(x) = 200 (constant). The results show no anomaly for the new values in either job!
(I expected anomalies in the first job when it received a new y field value.)
2. I created two more jobs with the above configurations and let them learn for about 2.5 months with func(x) = 200 (constant). Then I sent them events like before, plus events with new y and z field values and func(x) = 2000 (for these new y and z field values). The results show no anomaly for the new values in either job!
(I expected at least the first job to detect anomalies when it received a new y field value with func(x) = 2000!)

So my next question is: "When does an ML population job detect a new entity as an anomaly? Does it depend on the type of split (soft/hard)?"

And the last question: "What is the priority of these three in running and scoring: by_field, over_field, and partition_field?"

Thanks

richcollier · October 27, 2021, 1:59pm

The role of a population job is NOT to detect new entities in a population; it is to detect entities behaving differently from their peers. If you want something like detecting new entities, then choose a temporal (non-population) job and choose by_field as the split. (Relevant: ML Kibana: difference between by_field_name and partition_field_name and my comment on the notion of "dawn of time".)

Also relevant: Temporal vs. Population Analysis in Elastic Machine Learning | Elastic Blog

Alternatively, if you want to detect something novel, you can consider using the rare function. See: Dec 4th, 2018: [EN][ML] Rarity Analysis with Machine Learning
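
As a rough sketch of what those two alternatives might look like as detector definitions (written here as Python dicts that mirror the detector JSON): the field names x and y are borrowed from the question, and the choice of sum as the function is only an assumption for illustration. Note that rare needs a by_field_name to say which field's values are being tested for rarity.

```python
# Temporal (non-population) detector split by y: no over_field_name is set.
# "sum" on "x" is an assumed stand-in for the func(x) in the question.
temporal_detector = {
    "function": "sum",
    "field_name": "x",
    "by_field_name": "y",
}

# Rarity detector: flags values of y that are rarely seen over time.
# The rare function takes no field_name, only a by_field_name.
rare_detector = {
    "function": "rare",
    "by_field_name": "y",
}

for name, detector in [("temporal", temporal_detector), ("rare", rare_detector)]:
    print(name, detector)
```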

shabnam · October 30, 2021, 5:16am

Thank you

I read the links. My question is:

1. Since the "dawn of time" for new entities is meaningless in population jobs, and population job scoring is based on comparison with the collective model of all peers as witnessed over time, using "by_field" together with "over_field" makes "by_field" act only as a splitter, with no effect on the scoring of the splits.

Is that true?
If so:

2. The result of using {"by_field(x')"+"over_field(y')"} is the same as using {"partition_field(x')" + "over_field(y')"}!

Is that true?

Thanks

richcollier · November 1, 2021, 6:39pm

Fact 1: Splitting a population job (which defines an over_field) ultimately creates sub-populations.
Fact 2: Using a partition_field is more of a "hard" split (meant to separate/isolate values from each other), and in population analysis you may want to isolate sub-populations from each other.
Fact 3: Using by_field is more of a "soft" split (where values of the by_field are more like attributes of an entity), and anomalies of distinct members of the population are aggregated in such a way that the severity of anomalousness for an entity increases with more simultaneously unusual values for the same member of the population.

For example, imagine a data set:

time,user,gender,feature_name,feature_value
0,Bob,male,age,30
0,Bob,male,weight,175
0,Bob,male,height,75
1,Sakura,female,age,44
1,Sakura,female,weight,105
1,Sakura,female,height,59
...

You could set up an analysis like:
max(feature_value) by_field=feature_name partition_field=gender over_field=user

where:

over_field=user - makes sense since we want to model users as members of a population
by_field=feature_name - age, weight, height are attributes of a particular user
partition_field=gender - might make sense to isolate genders from each other because men are generally bigger/heavier than women.
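
To make the roles of the three fields concrete, here is a minimal Python sketch using the sample rows above. The detector dict mirrors the detector JSON for this analysis; the group_rows helper is hypothetical and only illustrates how the data is sliced (partition, then attribute, then population member). It is not how the ML engine models the data internally.

```python
from collections import defaultdict

# Sample rows from the data set above:
# (time, user, gender, feature_name, feature_value)
rows = [
    (0, "Bob", "male", "age", 30),
    (0, "Bob", "male", "weight", 175),
    (0, "Bob", "male", "height", 75),
    (1, "Sakura", "female", "age", 44),
    (1, "Sakura", "female", "weight", 105),
    (1, "Sakura", "female", "height", 59),
]

# max(feature_value) by_field=feature_name partition_field=gender over_field=user
detector = {
    "function": "max",
    "field_name": "feature_value",
    "by_field_name": "feature_name",
    "over_field_name": "user",
    "partition_field_name": "gender",
}

def group_rows(rows):
    """Illustration only: partition value -> by value -> {member: max value}."""
    groups = defaultdict(lambda: defaultdict(dict))
    for _time, user, gender, feature_name, value in rows:
        members = groups[gender][feature_name]
        # Keep the maximum per member, matching the max() function.
        members[user] = max(members.get(user, value), value)
    return groups

groups = group_rows(rows)
for gender, features in groups.items():
    for feature_name, members in features.items():
        # Each printed line is one sub-population: members compared on one attribute.
        print(gender, feature_name, members)
```

Each gender ends up with its own isolated set of sub-populations, and within a gender each user is compared against peers on age, weight, and height.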

shabnam · November 2, 2021, 7:41am

Thanks a lot.

I would also like to know the order of operations (priority) of these fields ("by_field", "over_field", and "partition_field") when they are all set.
I think the "hard split" is applied first, then the "soft split", and finally the "population" analysis runs for splitting and anomaly scoring. Is that true?
As you know, the result would be different if the order changed.

Thanks

richcollier · November 2, 2021, 2:02pm

There's not really an "order of operations" per se, because partition_field and by_field are optional parameters. If an over_field is used, it is population analysis. If it is not used, it is not a population analysis.
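
A minimal sketch of that rule in Python, assuming detectors are written as dicts with the usual detector keys (the analysis_type helper is hypothetical, and the "sum"/"x"/"y"/"z" values stand in for the func(x), y, and z of the two jobs in the first post):

```python
def analysis_type(detector: dict) -> str:
    """Population analysis if an over field is set, otherwise temporal."""
    return "population" if detector.get("over_field_name") else "temporal"

# Job a: func(x), by_field(y), over_field(z)
job_a = {"function": "sum", "field_name": "x",
         "by_field_name": "y", "over_field_name": "z"}

# Job b: func(x), over_field(z), partition_field(y)
job_b = {"function": "sum", "field_name": "x",
         "over_field_name": "z", "partition_field_name": "y"}

# Same as job a, but without the over field
no_over = {"function": "sum", "field_name": "x", "by_field_name": "y"}

for name, det in [("a", job_a), ("b", job_b), ("no over_field", no_over)]:
    print(name, "->", analysis_type(det))
```

Both jobs from the first post are population jobs; they differ only in how the population is split.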

Here's the way I think about the "formula" for a detector:

shabnam · November 7, 2021, 6:04am

Hello

Thanks,
In the second test of my first post, why didn't the ML jobs detect any anomaly? Events with func(x) = 200 were normally seen, and then some events with func(x) = 2000 appeared. That looks like an anomaly.

Thank you

richcollier · November 8, 2021, 12:49pm

I just created a simple CSV with 1000 rows:

I put an anomaly in row 950:

I created a job with sum(x) over z partition_field=y

I found the anomaly no problem:
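
A comparable test could be set up roughly as follows. This is only a sketch: the column names (time, x, y, z), the number of distinct y and z values, the timestamps, and the values 200 and 2000 (taken from the question) are assumptions, not the data used above.

```python
import csv
import io

def build_rows(n_rows=1000, anomaly_row=950):
    """Build n_rows events with constant x, except one outlier row (1-indexed)."""
    rows = []
    for i in range(1, n_rows + 1):
        rows.append({
            "time": 1_600_000_000 + i * 60,           # one event per minute (assumed)
            "x": 2000 if i == anomaly_row else 200,   # single outlier value
            "y": f"y{i % 2}",                         # two partitions (assumed)
            "z": f"z{i % 10}",                        # ten population members (assumed)
        })
    return rows

def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["time", "x", "y", "z"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Detector for: sum(x) over z partition_field=y
detector = {
    "function": "sum",
    "field_name": "x",
    "over_field_name": "z",
    "partition_field_name": "y",
}

rows = build_rows()
csv_text = to_csv(rows)
print(csv_text.splitlines()[0])   # header line
print(rows[949])                  # the outlier event in row 950
```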

shabnam · November 9, 2021, 7:29am

Hi
Thank you
I probably made a mistake somewhere.

Thanks

system · December 7, 2021, 7:30am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
