ML Kibana: problem with an advanced job using partitionfield

I used the same dataset to train 2 ML Kibana jobs:

  1. count(field_1) over field_2 + detector on field_2
  2. count(field_1) over field_2 partitionfield field_3 + detectors on field_2 and field_3

The dataset has a severe anomaly that job #1 sees, but job #2 doesn't.
Moreover, when switching to the "view by" field_3, I see only a part of all possible values for field_3. (3 to be exact, while the dataset has over 50 different unique values for field_3).

Any suggestions on what could be wrong?

Thank you!

P.S.
Version: 6.2.4, Platinum license.

Your description is a little confusing because technically, the "count(field_1) over field_2" is a detector. And therefore, the "detector on field_2" doesn't make sense. Do you mean "influencer on field_2"?

I'm going to take a stab here and assume that's what you really meant, and that the likely difference between job 1 and job 2 is the partition-based scoring. In versions prior to v6.5, all partitions were scored on the same normalization table; from v6.5 onward, each partition is normalized separately. I feel that the post-v6.5 behavior is closer to what people would expect, giving more independent scoring to each partition.

See: https://www.elastic.co/blog/changes-to-elastic-machine-learning-anomaly-scoring-in-6-5 for more information.

Thank you for your reply, richcollier!

Yes, you are right, I meant "influencer on field_2" and "influencers on field_2 and field_3" respectively.

Thank you for sending the link to the scoring introduced in 6.5. Indeed, it could be one of the possible explanations. However, I think that the issue is not related to that, because I don't see certain values when switching to "view by" field_3, and thus I expect that the jobs internally analyze different data. I could be wrong though, and I can't verify it because I'm unable to upgrade to 6.5. Sorry about that.

When you say "View by" I assume you mean the selector in the Anomaly Explorer.

You should know that the middle section (the heatmap looking area) really shows the "influencer" scores, not the scores of the individual anomalies for the partitions (those are viewable in the bottom table).

More explanation here: https://www.elastic.co/blog/machine-learning-anomaly-scoring-elasticsearch-how-it-works

richcollier, thank you for your reply!

You are right, I meant the "View by" selector in the Anomaly Explorer.

If we use the plots in the document you linked to:


it would be "View by" airline and my problem would be that I see only 4 airlines (say AAL, VRD, AWE, ASA) in the top 10, even though I have much more than 4 (JZA, SWR, ...).

The center section doesn't show values of influencer fields if they're not influential during the time range.

For example, in this job, I have 3 declared influencer fields (clientip, status, and uri), but at this moment in time only 2 out of the 3 influencer fields have any results, and for clientip, for example, only 3 values are really unusual:

Whereas, if I look over a longer range of time (and hence over many more anomalies) I'll see something different:

You can also see that in later versions we allow you to select more values to view than just the top 10.
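To see why the list of values changes with the time range, it can help to query the influencer results directly. The following is only a sketch of such a search body, not the query Kibana itself runs: the function name and the example job name are hypothetical, and the fields used (result_type, influencer_field_name, influencer_field_value, influencer_score, timestamp) are the ones found in result_type:influencer documents.

```python
def top_influencers_query(job_id, field_name, start_ms, end_ms, top_n=10):
    """Build a search body listing the top influencer values in a time range.

    Hypothetical helper: it only assembles query DSL and sends nothing.
    """
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": job_id}},
                    {"term": {"result_type": "influencer"}},
                    {"term": {"influencer_field_name": field_name}},
                    {"range": {"timestamp": {"gte": start_ms, "lte": end_ms}}},
                ]
            }
        },
        "aggs": {
            "by_value": {
                # Only values that actually have influencer results in the
                # time range can appear here, however many values exist in
                # the raw data.
                "terms": {
                    "field": "influencer_field_value",
                    "size": top_n,
                    "order": {"max_score": "desc"},
                },
                "aggs": {"max_score": {"max": {"field": "influencer_score"}}},
            }
        },
    }


# "my_job" is a placeholder job name; timestamps are epoch milliseconds.
body = top_influencers_query("my_job", "clientip", 1511480000000, 1511490000000)
```

Widening the start_ms/end_ms window covers more anomalies, so more distinct values can show up in the buckets.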

Thank you for these clarifications, richcollier!

The question regarding seeing only part of the values in field_3 is solved now.

I still don't understand, though, why I see different anomalies in job #1 and job #2. Is it possible that the introduction of several influencers affects the analysis results? I assumed that anomalies were calculated independently for different influencers.

Now the question is whether or not splitting the analysis per partition (and then showing the influencers) is the same as analyzing the data altogether (without splitting) and then hoping that influencers will emerge.

The answer is that this will not necessarily be the same. Here's an excerpt from a write up I did once on the topic of "Influencers in Split vs. Non-Split jobs":

Before I go through an illustration of why splitting the job can be the proper way to analyze things instead of just relying on the usage of influencers, it is important to recognize the difference between the purpose of influencers and the purpose of splitting the job. An entity is identified by ML as an influencer if it has contributed significantly to the existence of the anomaly. This notion of deciding influential entities is completely independent of whether or not the job is split. An entity can be deemed influential on an anomaly only if an anomaly happens in the first place. If there is no anomaly detected, there is no need to figure out if there is an influencer. However, the job may or may not find something is anomalous, depending on whether or not the job is split into multiple time-series. When splitting the job, you are modeling (creating separate analyses) for each entity of the field chosen for the split.

To illustrate - let's look at my favorite demo dataset - farequote. This data set is essentially an access log of the number of times a piece of middleware is called in a travel portal to reach out to 3rd party airlines for a quote of airline fares. The JSON documents look like this:

{
"@timestamp": "2017-02-11T23:59:54.000Z",
"responsetime": 251.573,
"airline": "FFT"
}

The number of events per unit time corresponds to a number of requests being made and the "responsetime" field is the response time of that individual request to that airline's fare quoting web service.

Case 1) Analysis of count over time, NOT split on "airline", but use airline as an influencer

If we analyze the overall "count" of events (no split), we can see that the prominent anomaly (the spike) in the event volume was determined to be influenced by airline=AAL:

This is quite sensible because the increased occurrence of requests for AAL affects the overall event count (of all airlines together) very prominently.

Case 2) Analysis of count over time, split on "airline", and use "airline" as an influencer

If we do set "partition_field_name=airline" to split the analysis so that each airline's count of documents gets analyzed independently, then of course, we still properly see that "airline=AAL" is still the most unusual:

So far, so good. But...

Case 3) Analysis of "mean(responsetime)", no split, but use "airline" as an influencer

In this case, the results show the following:

Here, remember that all airlines' response times are getting averaged together each bucket_span. In this case, the most prominent anomaly (even though it is a relatively minor variation above "normal") is shown and is deemed to be influenced by airline=NKS. However, this may seem misleading. You see, airline=NKS has a very stable response time during this period, but note that its normal operating range is much higher than the rest of the group:

As such, the contribution of NKS to the total, aggregate response times of all airlines is more significant than the others. So, of course, ML identifies NKS as the most prominent influencer.

But this anomaly is not the most significant anomaly of "responsetime" in the data set! That anomaly belongs to airline=AAL, but it isn't visible in the aggregate because averaging all of the airlines' data together drowns out the detail. See Case 4.
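A small numeric sketch can make the "drowning out" effect concrete. Everything below is an assumption made for illustration: the numbers are invented (they are not farequote values), each airline is assumed to contribute equally to every bucket, and a plain z-score stands in for the probability model that ML actually learns.

```python
from statistics import mean, pstdev

# Hypothetical per-bucket mean response times (ms); not taken from farequote.
history = {
    "AAL": [100, 102, 98, 101, 99],
    "EGF": [200, 198, 202, 201, 199],
    "NKS": [2000, 2030, 1970, 2020, 1980],
}
# Latest bucket: AAL jumps by 50%, the other airlines sit at their usual level.
latest = {"AAL": 150, "EGF": 200, "NKS": 2000}


def z_score(past, value):
    """How many standard deviations `value` lies from the mean of `past`."""
    return (value - mean(past)) / pstdev(past)


# Case 3 analogue: average all airlines together in each bucket.
aggregate_history = [mean(bucket) for bucket in zip(*history.values())]
z_aggregate = z_score(aggregate_history, mean(list(latest.values())))

# Case 4 analogue: score each airline against its own history only.
z_per_airline = {a: z_score(history[a], latest[a]) for a in history}
```

With these made-up numbers the aggregate z-score comes out at roughly 2, while AAL's own z-score is above 30: the ordinary fluctuation of the high-valued NKS series is large enough in absolute terms to hide AAL's jump once everything is averaged together.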

Case 4) Analysis of "mean(responsetime)", split on "airline", and use "airline" as an influencer

In this case, the most prominent response time anomaly for AAL properly shows itself when we set "partition_field_name=airline" to split the analysis.
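For reference, the only configuration difference between Case 3 and Case 4 is the partition_field_name on the detector. The sketch below builds the analysis_config portion of a job body in Python; the helper name is hypothetical and the 15m bucket_span is an assumption, since the thread does not state the value used for these jobs.

```python
def farequote_analysis_config(split_by_airline, bucket_span="15m"):
    """Return an analysis_config dict for a mean(responsetime) job.

    Hypothetical helper; bucket_span defaults to an assumed value.
    """
    detector = {"function": "mean", "field_name": "responsetime"}
    if split_by_airline:
        # Case 4: each airline gets its own independent model.
        detector["partition_field_name"] = "airline"
    return {
        "bucket_span": bucket_span,
        "detectors": [detector],
        # airline is declared as an influencer in both cases.
        "influencers": ["airline"],
    }


case_3 = farequote_analysis_config(split_by_airline=False)
case_4 = farequote_analysis_config(split_by_airline=True)
```

Both configurations declare airline as an influencer; only case_4 models each airline separately.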

richcollier, thank you for your reply and detailed explanations - I now have a better understanding of the difference.

I tried to reproduce your result. Similar to you, in case 4 I see an anomaly for AAL.
What puzzles me now is that I also see other anomalies where I don't expect to see them, specifically, for EGF.

Could you please explain:

  1. why does one see anomalies in EGF? (from the graph one can see that mean value is very close to the actual one at the point where "the anomaly" is spotted)
  2. what do the yellow/red "plus signs" stand for?

I use detector: mean(responsetime) partition_field_name="airline.keyword"
and influencer: airline.keyword

Thank you!

You won't be able to reproduce my original results because those results were created before v6.5 and as mentioned earlier, some changes were implemented in v6.5 that changed the way scoring is done on partitions and for multi-bucket anomalies. Again, I refer you to: https://www.elastic.co/blog/changes-to-elastic-machine-learning-anomaly-scoring-in-6-5

  1. Because EGF is its own partition and thus (after v6.5) has more independent scoring. So, the episode shown is the "worst behavior for EGF", despite the fact that relatively speaking, EGF's behavior isn't anything like AAL's
  2. These are "multi-bucket" anomalies explained in the blog above, plus there is a newer blog that dives into it more: https://www.elastic.co/blog/interpreting-multi-bucket-impact-anomalies-using-elastic-machine-learning-features
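The effect described in point 1 can be illustrated with a deliberately crude toy. This is not how the real normalization works (per the discussion it is quantile-based); simple scaling by the maximum is used here only as a stand-in, and the raw "unusualness" numbers are invented.

```python
def toy_normalize(raw, per_partition):
    """Scale raw unusualness values to 0-100.

    raw maps a partition value to a list of raw values. Toy stand-in for the
    real normalization: it just divides by the largest value in scope.
    """
    if per_partition:
        # Each partition is scaled against its own worst episode.
        return {p: [100 * v / max(vals) for v in vals] for p, vals in raw.items()}
    # All partitions share one scale, set by the single worst episode overall.
    overall = max(v for vals in raw.values() for v in vals)
    return {p: [100 * v / overall for v in vals] for p, vals in raw.items()}


# Invented raw values: AAL has one very unusual episode, EGF only mild ones.
raw = {"AAL": [2.0, 40.0, 5.0], "EGF": [1.0, 4.0, 2.0]}
shared = toy_normalize(raw, per_partition=False)
separate = toy_normalize(raw, per_partition=True)
```

On the shared scale EGF's worst episode tops out at 10, whereas on its own scale the same episode reaches 100, which is the sense in which it is "the worst behavior for EGF".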

richcollier, thank you for your reply!

I looked into the "multi-bucket" anomalies link, thank you.

I zoomed into the EGF anomaly in single metric view and could observe that anomaly score 88 was due to the multi-bucket type of anomaly:
[screenshot: fq-4-EGF-multi]

However, I was surprised to find out that there seems to be no way to tell from within .ml-anomalies-shared index that the specific result_type:record anomaly score (record_score) was affected by the multi-bucket anomaly:
[screenshot: fq-4-EGF-record]

There was a surrounding document of result_type:influencer but it had different values of influencer_score and didn't seem to be referenced in any way to the result_type:record document:
[screenshot: fq-4-EGF-influencer]

I'd like to create a filter in .ml-alerts-* index to:

  1. find alerts generated without the influence of multi-bucket anomalies
  2. find alerts generated with the influence of multi-bucket anomalies, but with a field that clarifies that they were affected by them (like it is shown in the single metric view when one hovers over an alert).

How can I do it?

Thank you,

You missed the detail in the blog and the screenshot of the result_type:record. It is a field called multi_bucket_impact.

From the blog:

Under the hood, we calculate the impact of multi-bucket analysis on the anomaly and rank it from -5 (no multi-bucket contribution) to +5 (all multi-bucket contribution). There is also now text for high, medium or low multi-bucket impact included in the anomaly marker tooltips as well as in the expanded row section of the anomalies table.

When querying the .ml-anomalies-* indices for record results (for alerting or other non-UI purposes), we now report the value of this new multi_bucket_impact field:

{
  "_index" : ".ml-anomalies-shared",
  "_type" : "doc",
  "_id" : "data_low_count_atc_record_1511486100000_900_0_29791_0",
  "_score" : 8.8717575,
  "_source" : {
    "job_id" : "data_low_count_atc",
    "result_type" : "record",
    "probability" : 5.399816125171105E-5,
    "multi_bucket_impact" : 5.0,
    "record_score" : 98.99735135436666,
    "initial_record_score" : 98.99735135436666,
    "bucket_span" : 900,
    "detector_index" : 0,
    "is_interim" : false,
    "timestamp" : 1511486100000,
    "function" : "low_count",
    "function_description" : "count",
    "typical" : [
      510.82320876196434
    ],
    "actual" : [
      497.0
    ]
  }
}
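Given that field, the two filters asked about earlier could be sketched as range queries on multi_bucket_impact. The helper name and the cut-off of 0 are assumptions made for illustration; the blog only states that the scale runs from -5 (no multi-bucket contribution) to +5 (all multi-bucket contribution), so choose whatever threshold suits your alerts.

```python
def records_by_multi_bucket_impact(job_id, multi_bucket, threshold=0.0):
    """Build a search body for record results on one side of a threshold.

    Hypothetical helper. multi_bucket=True selects records whose
    multi_bucket_impact is above `threshold`; False selects the rest.
    """
    comparison = "gt" if multi_bucket else "lte"
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": job_id}},
                    {"term": {"result_type": "record"}},
                    {"range": {"multi_bucket_impact": {comparison: threshold}}},
                ]
            }
        }
    }


# Job name taken from the example document above.
mostly_multi_bucket = records_by_multi_bucket_impact("data_low_count_atc", True)
mostly_single_bucket = records_by_multi_bucket_impact("data_low_count_atc", False)
```

One caveat: a range filter only matches documents that contain the field, so records written without multi_bucket_impact would match neither query.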

@richcollier, got it, thank you!

Let me please attempt to finish this long thread by double-checking the following questions with you:

  1. Regarding the dataset that is taken into account when estimating record_score of a result_type=record document in .ml-anomalies-* index:
    According to my understanding, in the job with a mean(responsetime) detector and influencer=airline, the record is responsetime, and the influencers are airline and bucket_time.
    1.1 The anomaly analysis proceeds as follows: all responsetime values in the newly analyzed time bucket are compared to those of the entire dataset history, subsequently bucketed into quantiles (even though I'm not sure how small the quantiles should be to have a resolution in probability of 1e-308), and then each is individually assigned a probability that one can later observe in the .ml-anomalies-* result_type=record documents, right?
    1.2. And this is the case unless one specifies partition_field_name=airline in which case responsetime values in the new time bucket will be compared only to a subset of the dataset history corresponding to a specific airline, correct? (which will change their probabilities and affect scoring compared to 1.1. case)

Each of the above occurrences has a calculated probability...based upon the observed past behavior which has constructed a baseline probability model for that item. However, this raw probability value, while certainly useful, can lack some contextual information like:
• How does the current anomalous behavior compare to past anomalies? Is it more or less unusual than past anomalies?
• How does this item’s anomalousness compare to other potentially anomalous items (other users, other IP addresses, etc.)?
From https://www.elastic.co/blog/machine-learning-anomaly-scoring-elasticsearch-how-it-works

  2. Regarding the process of estimating influencer_score of a result_type=influencer document in .ml-anomalies-* index:
    I observe in the result_type=influencer document probability, initial_influencer_score and influencer_score fields.
    2.1. Let's start with probability. I guess you again use quantile analysis to estimate it? If so, in what space do you perform this analysis? Is it in the space of the previous influencer (AAL) values over the entire dataset history or in the relative space of all influencer(i.e., AAL compared to all other airlines) values in the current time bucket? Or both?
    2.2. what does the initial_influencer_score get affected by when turning to the influencer_score? multi_bucket_impact? anything else?
    2.3. It seems that bucket analysis is done separately from the influencer analysis. Is that correct? To verify that I created two jobs: one with a mean(responsetime) detector and influencer=airline, and another one with count and mean(responsetime) detectors and influencer=airline. But maybe I'm missing something?

Thank you!

Quantile analysis isn't used to calculate the probability. Instead, quantile analysis is used to take the calculated probability (which is determined by referencing the current observation against the internal probability distribution function that was learned for that data set) and cast it onto a normalized scale from 0-100. So, in other words, the probability is calculated first, then normalized (and can be re-normalized at a later time, which is why we keep track of the "initial score" as well).

Remember, analysis is done in chronological time order. So, when an anomaly is first encountered, the scores calculated at that time take into account the data that has been seen up to that point. As subsequent data is later processed, it may be determined that bigger, more unusual situations have occurred and that prior normalized scores need to be back-edited to keep them in balance with these new observations. Thus, this is why "initial scores" may differ from the current score (i.e. initial_record_score vs. record_score).
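A toy sketch of this back-editing follows. It is not the actual normalizer; it is an invented ranking rule (the share of anomalies seen so far that are no more unusual than this one) used only to show how a score that starts high can later be revised downward as more extreme anomalies arrive. The probabilities are made up.

```python
def toy_normalized_score(probability, seen):
    """Percentage of anomalies seen so far that are no more unusual.

    Toy stand-in for normalization: a lower probability is more unusual.
    """
    no_more_unusual = sum(1 for p in seen if p >= probability)
    return 100.0 * no_more_unusual / len(seen)


# When the anomaly with probability 1e-5 is first encountered, it is the most
# unusual thing seen so far.
seen = [1e-3, 1e-5]
initial_score = toy_normalized_score(1e-5, seen)

# Later data brings far more unusual events, so the same anomaly is re-scored.
seen += [1e-20, 1e-30]
current_score = toy_normalized_score(1e-5, seen)
```

In this toy, initial_score plays the role of initial_record_score and current_score the role of record_score: the probability of the event never changes, only where it ranks among everything seen so far.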

Bucket analysis is indeed done first. Influencer analysis is only done if there's an anomaly. The whole point of influencers is that if there is an anomaly - what influencers had an impact on creating that anomaly?

@richcollier, thank you for your reply!

Let me please take another stab at understanding the anomaly detection process that you perform.

  1. So, you start with clustering to determine the (likely) modality of the distribution and the relative weights of the modes. Yes/No?

  2. Then you fit data up to this time in history with parametric distributions.
    From what I've seen, you use only three parametric distributions (normal, log-normal, gamma), so I guess you fit a linear combination of these three all the time? Yes/No?

  3. Once you have data for more than a day, you start introducing a daily baseline.
    3.1. Is it a non-parametric distribution that you use for a daily distribution representation? Yes/No?
    3.2. Then you introduce residual distribution which you fit again with parametric distributions (normal, lognormal, gamma). Yes/No?
    3.3 And every time you introduce a new baseline (this time daily) you re-initialize the parametric distributions because now they need to fit the residual distribution; the same happens, for example, when you introduce a weekly baseline. Yes/No?

  4. Probabilities to records are being assigned based on the likelihood of the event given the fit of parametric distributions at that specific time in history. Initial_record_score is being assigned to the anomaly in case probability of this specific event is in the top unlikely quantile of all other historic anomalies. This score is modified for the entire historical dataset of anomalies every time step. Thus, what was a critical anomaly before, might become a warning anomaly.

  5. I'm still not sure how the influencers' analysis is being performed. Once you know that something is an anomaly, do you build a histogram with respect to the influencers' bins and see whether one of them stands out with respect to a uniform probability distribution?

Thank you!

Respectfully, I'm not sure this is the best time or place to attempt to explain this level of detail. It is not necessary to know this to effectively use the product. If indeed you are interested in learning more you can either watch these videos [1] [2] or review the source code.

[1] - https://www.elastic.co/elasticon/conf/2017/sf/machine-learning-and-statistical-methods-for-time-series-analysis
[2] - https://www.elastic.co/elasticon/conf/2018/sf/the-math-behind-elastic-machine-learning

@richcollier, thank you for your reply and the links.

I wish I didn't need to go into details, but after trial and error of setting up jobs and looking into the analysis results, I don't understand what's happening and can't fully make sense of the results.

I thought that our conversation could be interesting to the community and thus I asked in the public forum. I can alternatively ask through requests -- as I mentioned above, we have the platinum license.

Please let me know.

Thank you!

If you do file requests through Support, you have not only the same access to our experts, but you also have a guarantee that someone will answer you. We try to monitor activity on these public forums, but there is no guarantee that questions will get answered.

If you're having trouble understanding the results, then ask questions about what you see, and we'll do our best to help you understand.

Again, if you're interested in the mechanics of how the modeling works or the decisions made in the algorithms (beyond a layman's understanding), then your best bet is to inspect the source code.