Now the question is whether splitting the analysis per partition (and then looking at the influencers) is the same as analyzing the data as a whole (without splitting) and hoping that influencers will emerge.
The answer is that it is not necessarily the same. Here's an excerpt from a write-up I did once on the topic of "Influencers in Split vs. Non-Split jobs":
Before I go through an illustration of why splitting the job can be the proper way to analyze things, rather than relying solely on influencers, it is important to recognize the difference between the purpose of influencers and the purpose of splitting the job. ML identifies an entity as an influencer if it has contributed significantly to the existence of an anomaly. This determination is completely independent of whether the job is split. An entity can be deemed influential on an anomaly only if an anomaly happens in the first place; if no anomaly is detected, there is no need to figure out whether there is an influencer. However, whether the job finds something anomalous at all may depend on whether the job is split into multiple time series. When splitting the job, you are building a separate model (a separate analysis) for each value of the field chosen for the split.
To illustrate - let's look at my favorite demo dataset - farequote. This dataset is essentially an access log of the number of times a piece of middleware in a travel portal is called to reach out to 3rd-party airlines for a quote on airline fares. The JSON documents look like this:
{
  "@timestamp": "2017-02-11T23:59:54.000Z",
  "responsetime": 251.573,
  "airline": "FFT"
}
The number of events per unit time corresponds to the number of requests being made, and the "responsetime" field is the response time of each individual request to that airline's fare-quoting web service.
Case 1) Analysis of count over time, NOT split on "airline", but use "airline" as an influencer
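For concreteness, a minimal job configuration for this case might look something like the sketch below (the job name, bucket_span, and detector_description are illustrative assumptions on my part, not taken from the original write-up; the essential parts are the "count" detector with no partition_field_name, plus "airline" declared in the influencers list):

PUT _ml/anomaly_detectors/farequote_count_no_split
{
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "count of all events, no split",
        "function": "count"
      }
    ],
    "influencers": [ "airline" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}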
If we analyze the overall "count" of events (no split), we can see that the prominent anomaly (the spike) in the event volume was determined to be influenced by airline=AAL:
This is quite sensible because the increased volume of requests for AAL very prominently affects the overall event count (of all airlines together).
Case 2) Analysis of count over time, split on "airline", and use "airline" as an influencer
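The only change from the previous sketch is adding partition_field_name to the detector, roughly as follows (again an illustrative sketch, not the exact job definition):

  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "count per airline",
        "function": "count",
        "partition_field_name": "airline"
      }
    ],
    "influencers": [ "airline" ]
  }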
If we do set "partition_field_name=airline" to split the analysis so that each airline's count of documents gets analyzed independently, then, of course, we still properly see that airline=AAL is the most unusual:
So far, so good. But...
Case 3) Analysis of "mean(responsetime)", no split, but use "airline" as an influencer
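Here the detector computes the mean of "responsetime" across all airlines in each bucket; there is no partition_field_name, but "airline" is still listed as an influencer. A rough sketch of the analysis_config:

  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "mean(responsetime), no split",
        "function": "mean",
        "field_name": "responsetime"
      }
    ],
    "influencers": [ "airline" ]
  }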
In this case, the results show the following:
Here, remember that all airlines' response times are averaged together in each bucket_span. In this case, the most prominent anomaly (even though it is a relatively minor deviation above "normal") is shown and is deemed to be influenced by airline=NKS. However, this may be misleading. You see, airline=NKS has a very stable response time during this period, but note that its normal operating range is much higher than the rest of the group:
As such, the contribution of NKS to the total, aggregated response time of all airlines is more significant than that of the others. So, of course, ML identifies NKS as the most prominent influencer.
But this anomaly is not the most significant anomaly of "responsetime" in the data set! That anomaly belongs to airline=AAL - but it isn't visible in the aggregate view because the combined data from all of the airlines drowns out the detail. See Case 4.
Case 4) Analysis of "mean(responsetime)", split on "airline", and use "airline" as an influencer
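As before, the only difference from Case 3 is the partition_field_name on the detector (again, an illustrative sketch rather than the exact config used):

  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "mean(responsetime) per airline",
        "function": "mean",
        "field_name": "responsetime",
        "partition_field_name": "airline"
      }
    ],
    "influencers": [ "airline" ]
  }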
In this case, the most prominent response time anomaly for AAL properly shows itself when we set "partition_field_name=airline" to split the analysis.