ML: difference between partition_field_name and by_field_name in a population job?

Fact 1: Splitting a population job (one that defines an over_field) ultimately creates sub-populations.
Fact 2: Using a partition_field is a "hard" split, meant to separate/isolate each value from the others. In population analysis, you may want to isolate sub-populations from each other.
Fact 3: Using a by_field is more of a "soft" split, where values of the by_field are more like attributes of an entity. Anomalies for a given member of the population are aggregated across by_field values, so the severity for an entity increases as more values are simultaneously unusual for that same member.

For example, imagine a data set:

time,user,gender,feature_name,feature_value
0,Bob,male,age,30
0,Bob,male,weight,175
0,Bob,male,height,75
1,Sakura,female,age,44
1,Sakura,female,weight,105
1,Sakura,female,height,59
...

You could set up an analysis like:
max(feature_value) by_field=feature_name partition_field=gender over_field=user

where:

  • over_field=user - makes sense since we want to model users as members of a population
  • by_field=feature_name - age, weight, height are attributes of a particular user
  • partition_field=gender - might make sense to isolate genders from each other because men are generally bigger/heavier than women.
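
As a rough sketch, the shorthand above could map to an anomaly detection job body along these lines. The detector field names come from the example; the bucket_span value and the data_description are assumptions added only to make the body complete, so adjust them to your data.

```python
import json

# Sketch of a job body for the detector described above.
# Assumptions (not from the post): bucket_span of 15m, time field named "time".
job_body = {
    "analysis_config": {
        "bucket_span": "15m",  # assumed value, pick one that suits your data
        "detectors": [
            {
                "function": "max",
                "field_name": "feature_value",
                "by_field_name": "feature_name",    # soft split: attributes of a user
                "over_field_name": "user",          # population members
                "partition_field_name": "gender",   # hard split: isolated sub-populations
            }
        ],
    },
    "data_description": {"time_field": "time"},
}

# Print the JSON so it can be pasted into a request body.
print(json.dumps(job_body, indent=2))
```

The printed JSON is the kind of body you would send when creating the job (PUT _ml/anomaly_detectors/<job_id>, with a job id of your choosing).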