ML: difference between partition_field_name and by_field_name in a population job?

Hi,
I want to know the difference between these two advanced ML jobs:
a. func(x), by_field(y), over_field(z), partition_field(-)
b. func(x), by_field(-), over_field(z), partition_field(y)

I performed two tests:

  1. I created two jobs with the above configurations and let them learn for about 2.5 months with func(x) = 200 (constant). Then I sent them events like the earlier ones, plus events with new y and z field values and func(x) = 200 (constant). The results show no anomaly for the new values in either job!
    (I expected anomalies in the first job when it received a new y field value.)

  2. I created two other jobs with the above configurations and let them learn for about 2.5 months with func(x) = 200 (constant). Then I sent them events like the earlier ones, plus events with new y and z field values and func(x) = 2000 (for these new y and z field values). The results show no anomaly for the new values in either job!
    (I expected anomalies at least in the first job when it received a new y field value with func(x) = 2000!)

So my next question is: "When does an ML population job flag a new entity as an anomaly? Does it depend on the type of split (soft/hard)?"

And the last question: "What is the order of priority of these three in running and scoring: by_field, over_field, and partition_field?"

Thanks

The role of a population job is NOT to detect new entities in a population; it is to detect entities behaving differently from their peers. If you want to detect new entities, choose a temporal (non-population) job and use by_field as the split. (Relevant: ML Kibana: difference between by_field_name and partition_field_name, and my comment there on the notion of "dawn of time".)

Also relevant: Temporal vs. Population Analysis in Elastic Machine Learning | Elastic Blog

Alternatively, if you want to detect something novel, consider using the rare function. See: Dec 4th, 2018: [EN][ML] Rarity Analysis with Machine Learning
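
To make the two alternatives above concrete, here is a minimal sketch of what the detector definitions might look like, written as Python dicts in the shape of the anomaly detection job's detector config. The field names x and y come from the question; using sum as the function for the temporal detector is only an assumption, since the original func(x) was never specified.

```python
# Temporal (non-population) detector: there is no over_field_name,
# so each value of y is modeled against its own history.
# "sum" is an assumed stand-in for the unspecified func(x).
temporal_detector = {
    "function": "sum",
    "field_name": "x",
    "by_field_name": "y",
}

# Rarity detector: flags values of y that are rarely seen over time.
# The rare function takes a by_field_name and no field_name.
rare_detector = {
    "function": "rare",
    "by_field_name": "y",
}
```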


Thank you

I read the links. My questions are:

1. Since the "dawn of time" is meaningless for new entities in population jobs, and since population job scoring compares each entity with the collective model of all peers as witnessed over time, using by_field together with over_field means that by_field just acts as a splitter and has no effect on how the splits are scored.

Is that true?
If it is, then:

2. The result of using {by_field(x') + over_field(y')} is the same as using {partition_field(x') + over_field(y')}!

Is that true?

Thanks

Fact 1: Splitting a population job (one that defines an over_field) ultimately creates sub-populations.
Fact 2: Using a partition_field is more of a "hard" split (meant to separate/isolate values from one another), and in population analysis you may want to isolate sub-populations from each other.
Fact 3: Using by_field is more of a "soft" split (where values of the by_field are more like attributes of an entity). Anomalies of distinct members of the population are aggregated in such a way that the severity of anomalousness for an entity increases with more simultaneously unusual values for the same member of the population.

For example, imagine a data set:

time,user,gender,feature_name,feature_value
0,Bob,male,age,30
0,Bob,male,weight,175
0,Bob,male,height,75
1,Sakura,female,age,44
1,Sakura,female,weight,105
1,Sakura,female,height,59
...

You could set up an analysis like:
max(feature_value) by_field=feature_name partition_field=gender over_field=user

where:

  • over_field=user - makes sense since we want to model users as members of a population
  • by_field=feature_name - age, weight, height are attributes of a particular user
  • partition_field=gender - might make sense to isolate genders from each other because men are generally bigger/heavier than women.
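
As a rough sketch, the analysis above could be written as a job body like the following (a Python dict that serializes to JSON). The detector fields mirror the example exactly; the bucket_span value and the influencers list are assumptions added only to make the body complete, so adjust them to your data.

```python
import json

# Detector from the example: max(feature_value), soft split by feature_name,
# hard split by gender, population defined over user.
detector = {
    "function": "max",
    "field_name": "feature_value",
    "by_field_name": "feature_name",
    "over_field_name": "user",
    "partition_field_name": "gender",
}

job_body = {
    "analysis_config": {
        "bucket_span": "15m",               # assumption: not given in the example
        "detectors": [detector],
        "influencers": ["user", "gender"],  # assumption: optional, illustrative
    },
    "data_description": {"time_field": "time"},
}

print(json.dumps(job_body, indent=2))
```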

Thanks a lot.

I would also like to know the order (priority) in which these fields (by_field, over_field, and partition_field) are applied when they are all set.
I think the "hard split" is applied first, then the "soft split", and finally the population analysis runs over the splits and scores anomalies. Is that true?
As you know, the result would be different if the order changed.

Thanks

There's not really an "order of operations" per se, because partition_field and by_field are optional parameters. If an over_field is used, it is population analysis. If it is not used, it is not a population analysis.
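
Put as a tiny sketch, the rule is just a check on whether over_field_name is present. The helper name is_population_detector is hypothetical and not part of any Elastic API; it only illustrates the rule stated above.

```python
def is_population_detector(detector: dict) -> bool:
    """Hypothetical helper: a detector is a population analysis
    if and only if it sets over_field_name."""
    return bool(detector.get("over_field_name"))


# by_field_name and partition_field_name are optional in both cases.
job_a = {"function": "sum", "field_name": "x",
         "by_field_name": "y", "over_field_name": "z"}
temporal = {"function": "sum", "field_name": "x", "by_field_name": "y"}

print(is_population_detector(job_a))     # True
print(is_population_detector(temporal))  # False
```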

Here's the way I think about the "formula" for a detector:


Hello

Thanks,
In the first post's second test, why didn't the ML jobs detect any anomaly? Events with func(x) = 200 were seen normally, and then some events with func(x) = 2000 appeared. That looks like an anomaly.

Thank you

I just created a simple CSV with 1000 rows:

I put an anomaly in row 950:

I created a job with sum(x) over z partition_field=y

I found the anomaly no problem:
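
If you want to reproduce a test like this, here is a minimal sketch for generating a comparable CSV. The actual file contents are not shown in this thread, so everything here is an assumption: the column names (time, x, y, z), the 15-minute spacing, the start date, the number of distinct y and z values, and the normal/anomalous values of 200 and 2000 (borrowed from the original question).

```python
import csv
import io
from datetime import datetime, timedelta, timezone


def build_rows(n_rows=1000, anomaly_row=950, normal_x=200, anomalous_x=2000):
    """Build n_rows events with constant x, except one spike at anomaly_row (1-based)."""
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary start time
    rows = []
    for i in range(1, n_rows + 1):
        rows.append({
            "time": (start + timedelta(minutes=15 * (i - 1))).isoformat(),
            "x": anomalous_x if i == anomaly_row else normal_x,
            "y": f"y{i % 3}",   # a few partition values
            "z": f"z{i % 10}",  # members of the population
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["time", "x", "y", "z"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = build_rows()
csv_text = to_csv(rows)

# Detector matching the job described above: sum(x) over z, partitioned by y.
detector = {
    "function": "sum",
    "field_name": "x",
    "over_field_name": "z",
    "partition_field_name": "y",
}
```

Write csv_text to a file and upload it (for example through the file upload feature in Kibana), then create the job with the detector above.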


Hi
Thank you
I probably made a mistake somewhere.

Thanks
