country  state  child_count
a        a_s1   302
a        a_s2   310
a        a_s3   308
a        a_s4   21
b        b_s1   14
b        b_s2   16
b        b_s3   17
b        b_s4   218
I have a population anomaly detection job to find anomalies in child_count. The table shown above is the type of data we are processing. I want to find the anomalies in child_count across each state. As you can see, country 'a' has child_count values around 300 and one state with a child_count of 21, which can be treated as an anomaly (compared with the other values for country 'a'). Country 'b' has child_count values in the range 14 to 17 and one state with a child_count of 218, which is also an anomaly. There are no other anomalies in this case. But after processing the data with a population job where the data is split by 'state', it treats all the data of one country as anomalous by comparing it with the first country. I don't want to compare the child_count of one country with another; I just want to compare it with the previous child_count values of the same country. How can I achieve this?
(The actual data contains high-cardinality values, which is why I used a population job here.)
First and foremost - does your data also include a timestamp? If the data isn't really temporal in nature, then you should consider doing an Outlier Detection analysis rather than a Population Analysis.
It depends. Outlier detection is an analysis that is mostly independent of time. Your data can come from a single time period (e.g. house sale prices from 2022).
Population analysis, by contrast, is meant to be a moment-by-moment analysis: it takes every entity seen in a given time window (for example "last hour" or "last day") and compares those entities against a learned "global" model of all entities, built up over time since the Population Analysis job started running.
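To make the "compare only within the same country" idea concrete for non-temporal data like the sample table, here is a minimal Python sketch. It is not Elastic's outlier detection algorithm; it swaps in a simple per-group modified z-score based on the median and the median absolute deviation (MAD). The function name `flag_outliers` and the threshold of 3.5 are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

# Sample rows from the question: (country, state, child_count)
rows = [
    ("a", "a_s1", 302), ("a", "a_s2", 310), ("a", "a_s3", 308), ("a", "a_s4", 21),
    ("b", "b_s1", 14), ("b", "b_s2", 16), ("b", "b_s3", 17), ("b", "b_s4", 218),
]

def flag_outliers(rows, threshold=3.5):
    """Flag rows whose child_count is unusual relative to their own country only.

    Hypothetical helper: uses a modified z-score (median/MAD) per country.
    """
    by_country = defaultdict(list)
    for country, state, count in rows:
        by_country[country].append((state, count))

    flagged = []
    for country, entries in by_country.items():
        counts = [count for _, count in entries]
        med = median(counts)
        mad = median([abs(c - med) for c in counts])
        if mad == 0:
            # All values (nearly) identical: no meaningful spread to score against.
            continue
        for state, count in entries:
            score = 0.6745 * abs(count - med) / mad
            if score > threshold:
                flagged.append((country, state, count))
    return flagged

print(flag_outliers(rows))
```

Because each country is scored against its own median, the large difference between country 'a' (around 300) and country 'b' (around 15) never enters the comparison.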
OK, thanks. Can I use multiple fields inside "by_field_name"? I want to split the data on three field values (state, country, district) and analyze each split with respect to its own history in a population job.
Then choose the runtime field as the field to split on. Word of caution: don't split the data too thin - you might wind up with a very small number of unique combinations and thus have sparse data.
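As a rough sketch of what splitting on a runtime field could look like, the Python snippet below builds the relevant fragments of a datafeed and detector configuration as plain dictionaries. Several things here are assumptions and not taken from this thread: the runtime field name `location_key`, the `|` separator, the `mean` function, the use of `partition_field_name` as the split (a `by_field_name` split is the other option), and that `country`, `state`, and `district` are mapped as keyword fields. Treat it as an illustration of the shape, not a tested job definition.

```python
def composite_key_script(fields, sep="|"):
    """Build a Painless script that emits the given fields joined by `sep`.

    Hypothetical helper; assumes every field is a single-valued keyword field.
    """
    parts = [f"doc['{field}'].value" for field in fields]
    return "emit(" + f" + '{sep}' + ".join(parts) + ")"

# Fragment of a datafeed config: a runtime field combining the three split fields.
runtime_mappings = {
    "location_key": {  # assumed name for the combined field
        "type": "keyword",
        "script": {"source": composite_key_script(["country", "state", "district"])},
    }
}

# Fragment of an analysis config: one detector split on the runtime field,
# so each combination is modeled against its own history.
detector = {
    "function": "mean",                      # assumed function
    "field_name": "child_count",
    "partition_field_name": "location_key",  # assumed split type
}

print(runtime_mappings["location_key"]["script"]["source"])
```

The fewer fields you combine into the key, the more data each split has to learn from, which is the point of the caution above about splitting too thin.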