Dec 7th, 2019 [EN] Looking behind the scenes of anomaly detector models

Ever wondered what's really going on behind the scenes with Elastic's unsupervised machine learning anomaly detection modelling? (And if not, why not?!) Sure, you may think you know what's going on: you've read our extensive and beautifully written documentation, maybe you've even enabled model plot in the job configuration and viewed the results in the single metric viewer (but would you like to know more?). Maybe you've downloaded the backend source code, compiled it, and run the extensive tests. If you have, kudos to you! But do you want to understand what those tests are about? Read on!

For many reasons, the anomaly detector model state is snapshotted periodically. You can view these model snapshots by retrieving the snapshot ID:


... then searching in the .ml-state index in Elasticsearch. However, unless you can easily decode the base64 encoded, compressed model state you probably won't learn very much from looking at it.
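If you were determined to decode one by hand anyway, the shape of the exercise looks something like this. A minimal sketch only: the compression scheme (zlib/deflate) is an assumption here, and the payload is a stand-in rather than a real state document:

```python
import base64
import zlib

def decode_state(doc_body: str) -> bytes:
    """Decode a base64-encoded, compressed state payload.

    NOTE: zlib/deflate is an assumption for illustration; the point is
    simply that the raw document is unreadable without a step like this.
    """
    return zlib.decompress(base64.b64decode(doc_body))

# Round-trip with stand-in data, not a real model state document:
raw = b"<prior><mean>1000</mean></prior>"
encoded = base64.b64encode(zlib.compress(raw)).decode("ascii")
print(decode_state(encoded))
```

Even after a decode like this, you'd still be staring at a large, undocumented serialised state, which is where the tooling below comes in.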


The good news is that you don't need to worry about doing that anymore. Hidden away in the ml-cpp repository is a little tool with big aspirations. The model_extractor source code lives in the devbin directory:


One of the goals of model_extractor is to fill the gap between the unit tests and the extensive integration tests in Elasticsearch. That's not to forget the frankly heroic efforts of the Machine Learning QA team (seriously, these folks are the unsung heroes, slaving away behind the scenes, who make each release what it is).

The design and implementation of model_extractor is simple. Using the existing ml-cpp APIs, it decodes the compressed model state that was generated by the primary anomaly detector executable, autodetect. Once this is done, the model state can be dumped to a file in human-readable format (either XML or JSON) and easily parsed by any number of scripting languages such as Perl or Python. It can even do this at regular periodic intervals. We'll see exactly how to do that soon.
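For a flavour of what that parsing might look like in Python, here's a minimal sketch using the standard library's xml.etree. The tag names below are placeholders chosen for illustration, not the real state document schema:

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment in the spirit of a model_extractor document;
# the real tag names differ, so treat these as placeholders.
doc = """
<prior>
  <normal>
    <mean>999.641</mean>
    <standard_deviation>198.407</standard_deviation>
  </normal>
</prior>
"""

root = ET.fromstring(doc)
mean = float(root.findtext("normal/mean"))
sd = float(root.findtext("normal/standard_deviation"))
print(mean, sd)
```

Once the values are floats in a script, plotting their evolution over time is a short step away.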

If you were really keen, you could even parse all the model state documents and display the evolution of model parameters over time. Which, incidentally, is exactly what I've done and am keen to share some of the more interesting aspects of that exercise with you now.

Perhaps the simplest example of an anomaly detection job is called simple count. This detector... well, I guess the clue is in the name, right? As a first foray into what model_extractor can show us about the evolution of model parameters, let's set up an anomaly detection job on the command line. We'll pass in similarly simple data, consisting of just two columns, contained in a CSV file. The first column is a timestamp and the second column is an integer representing a count of something (I'll leave it up to you to think of what the count might represent, but do be creative. It is the festive season after all!). I said the data set would be simple, so let's make those counts conform to a normal distribution (what is "normal" anyway? Who's the judge? At Elastic we say "you do you and that's ok").

These are the first 10 lines of a CSV file containing a normally distributed timeseries of counts. Note the existence and values of the column headings:


I used Python to generate the CSV file, but you can easily do something similar using your preferred coding language. You are not restricted to using Elastic's source code.

import numpy as np

start_time = 1484006400            # seconds since epoch
bucket_span = 60                   # one bucket per minute
num_buckets = 14 * 24 * 60         # two weeks of one-minute buckets
end_time = start_time + (num_buckets * bucket_span)

mu = 1000                          # mean
sigma = 200                        # standard deviation
samples = np.random.normal(mu, sigma, size=num_buckets)
samples = samples.astype(int)

times = range(start_time, end_time, bucket_span)

# Column headings match the job's --timefield and --summarycountfield
with open("normal.csv", "w") as f:
    f.write("time,count\n")
    for t, c in zip(times, samples):
        f.write(f"{t},{c}\n")

As you can see, this code snippet generates a normal distribution with mu (mean) of 1000 and sigma (standard deviation) of 200. Remember those numbers; they will come in handy later.

Using this approach, you could generate many different kinds of synthetic data sets that exercise different aspects of autodetect's modelling. For example, you could also potentially generate datasets that conform to one of the log-normal, gamma or Poisson distributions, or any combination of these. The possibilities are endless (ok, maybe not endless, statistics never was my strong point!) but let's keep things simple for now.
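To give a flavour, here's how some of those alternative data sets might be generated with NumPy. The parameters are purely illustrative:

```python
import numpy as np

num_buckets = 14 * 24 * 60

# A few alternative synthetic count series; parameters chosen arbitrarily.
lognormal_counts = np.random.lognormal(mean=6.9, sigma=0.2, size=num_buckets).astype(int)
gamma_counts = np.random.gamma(shape=25.0, scale=40.0, size=num_buckets).astype(int)
poisson_counts = np.random.poisson(lam=1000, size=num_buckets)

# A 50/50 mixture of two of them, just to make the modelling work harder:
mask = np.random.rand(num_buckets) < 0.5
mixture = np.where(mask, lognormal_counts, poisson_counts)
```

Each of these could be written out as a time,count CSV exactly as before and fed through the same pipeline.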

Speaking of simple, let's dive right in and look at how to run autodetect in a "pipeline" with model_extractor in order to extract model state after every bucket has been processed. Here's what running that anomaly detection job from the command line might look like:

autodetect --jobid=test --bucketspan=60 --summarycountfield=count --timefield=time --delimiter=, --modelplotconfig=modelplotconfig.conf --fieldconfig=fieldconfig.conf --persist=normal_named_pipe --persistIsPipe --bucketPersistInterval=1 --persistInForeground --input=normal.csv --output=normal.out

where the contents of modelplotconfig.conf are:

boundspercentile = 95.0
terms =

and fieldconfig.conf contains:

detector.0.clause = count

Some explanation might help here. Fortunately, all is explained in the documentation. Incidentally, the modelplotconfig.conf configuration is the same as that used when you select the generate model plot option when creating a job in our super easy-to-use anomaly detection job wizard in Kibana. This also helps explain the mystery of what the model plot bounds actually represent: they indicate that we are 95% confident that a point in "the shaded area" in the single metric plot is not an anomaly. Finally, I think the fieldconfig.conf configuration speaks for itself.
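To make that 95% figure concrete, here's a quick sketch of what a central 95% band looks like for a normal model with our mu and sigma. This is only an illustration of the idea behind boundspercentile, not autodetect's actual bounds calculation:

```python
import numpy as np

np.random.seed(42)
mu, sigma = 1000, 200
z = 1.96  # two-sided 95% quantile of the standard normal

# The central 95% band of the fitted normal: points outside it start to
# look surprising.
lower, upper = mu - z * sigma, mu + z * sigma
print(lower, upper)   # roughly 608 and 1392

# Sanity check: about 95% of samples from the model fall inside the band.
samples = np.random.normal(mu, sigma, size=100_000)
inside = ((samples >= lower) & (samples <= upper)).mean()
print(round(inside, 3))
```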

And here is the corresponding command line for the model_extractor:

model_extractor --input=normal_named_pipe --inputIsPipe --output=normal.xml --outputFormat=XML

Again some explanation of the parameters might help you understand what's going on:

./model_extractor --help
Usage: model_extractor [options]
  --help                     Display this information and exit
  --version                  Display version information and exit
  --logProperties arg        Optional logger properties file
  --input arg                Optional file to read input from - not present
                             means read from STDIN
  --inputIsPipe              Specified input file is a named pipe
  --output arg               Optional file to write output to - not present
                             means write to STDOUT
  --outputIsPipe             Specified output file is a named pipe
  --outputFormat arg (=JSON) Format of output documents [JSON|XML].

Let's pull all that together in a script:


#!/bin/bash

if [ $# != 1 ]; then
    echo "Usage: $0 <autodetect csv input file>"
    exit 1
fi

INPUT=$1
PREFIX=$(basename ${INPUT} .csv)

autodetect --jobid=test --bucketspan=60 --summarycountfield=count \
    --timefield=time --delimiter=, --modelplotconfig=modelplotconfig.conf \
    --fieldconfig=fieldconfig.conf --persist=normal_named_pipe --persistIsPipe \
    --bucketPersistInterval=1 < ${INPUT} > ${PREFIX}.out 2> ${PREFIX}.log &

model_extractor --input normal_named_pipe --inputIsPipe --output ${PREFIX}.xml --outputFormat="XML"

... but those details aren't that important right now.

What is important is the output from model_extractor. We told it to write the decoded model state data to an XML file called normal.xml. We won't look directly at that file; suffice to say that it consists of a number of model state documents in XML format (in reality, the XML strings are then wrapped up as JSON in order to be compatible with Elasticsearch's indexing). One caveat: it's not strictly compliant XML anyway. Let's fix that, tidy up the formatting and home in on something interesting.
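One way to do that tidying in Python: since the file holds a sequence of documents rather than a single root element, wrapping everything in a synthetic root makes it well-formed enough for standard parsers and pretty-printers. The document content below is a stand-in, not real model state:

```python
import xml.dom.minidom

# Concatenated root-level documents, as a stand-in for the extractor output:
raw = "<doc><mean>999.641</mean></doc><doc><mean>999.7</mean></doc>"

# Wrap in a synthetic root so the whole thing is a single well-formed document.
wrapped = "<root>" + raw + "</root>"

pretty = xml.dom.minidom.parseString(wrapped).toprettyxml(indent="  ")
print(pretty)
```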


The highlighted rectangle shows that the normal prior model generated by autodetect from the input data has a mean of 999.641 and standard_deviation of 198.407. Hark back to the generated input data, which had mu (mean) of 1000 and sigma (standard deviation) of 200. This shows that autodetect did eventually learn the correct model parameters after a number of iterations (buckets processed).
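You can get an intuition for that convergence with a few lines of NumPy. Simple running estimates of the mean and standard deviation wobble early on and then settle close to mu and sigma; autodetect's Bayesian update is more sophisticated than this, but it converges in the same spirit:

```python
import numpy as np

np.random.seed(1)
mu, sigma = 1000, 200
samples = np.random.normal(mu, sigma, size=14 * 24 * 60)

# Maximum-likelihood estimates after n buckets: noisy at first, then stable.
for n in (10, 100, 1000, len(samples)):
    seen = samples[:n]
    print(n, round(seen.mean(), 1), round(seen.std(), 1))
```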

The other thing of interest from this screenshot of a model state document is that the normal prior model has a log_weight value of 0 (so an actual weight of 1) while the other possible candidate priors have negative log_weight values (and hence their actual weights are less than 1). This indicates that autodetect has quite some confidence that the normal prior model is indeed the sole or primary candidate for modelling the distribution of the input data.
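The relationship between log_weight and the actual weight is just exponentiation. A small sketch with made-up numbers (the prior names and log weights below are placeholders, except that the normal prior's log_weight of 0 comes from the state document above):

```python
import math

# Hypothetical log weights for the candidate priors; only the normal
# prior's value of 0 is taken from the example, the rest are invented.
log_weights = {
    "normal": 0.0,
    "candidate_b": -35.2,
    "candidate_c": -41.8,
    "candidate_d": -57.3,
}

# weight = exp(log_weight): 0 maps to 1, negative values to less than 1.
weights = {name: math.exp(lw) for name, lw in log_weights.items()}
for name, w in weights.items():
    print(f"{name}: {w:.3g}")
```

Even a modestly negative log weight corresponds to a vanishingly small actual weight, which is why the normal prior so thoroughly dominates here.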

Pretty neat eh? No? Perhaps seeing some plots of the anomaly detection results and the evolution of model parameters over time might convince you?

Here are the results, with model bounds overlaid and anomalies represented by the same colours as they are in the single metric viewer in Kibana. As a reminder, here's the colour key:


Time (represented as seconds since epoch (Jan 1, 1970)) is along the x-axis, while data count is along the y-axis.


As you can see, there are a number of anomalies found, of all severities. Moving on... What? You have questions? The answers are... well, complicated. For those interested, there is a blog on something called normalisation and another about multi-bucket impact anomalies that should answer all of those questions. There are also a number of other resources available, primarily the documentation I referred to earlier, but if you're still stuck just ask here; someone (maybe even me!) will be happy to help.

Back to the results! Here's the evolution of model parameters over time for the normal model prior:


As you can see, the values are "all over the place" (to use a technical term) initially but as time goes on (more buckets are processed) the values stabilise.

Here are those same parameters again but this time we'll show the evolution of the resulting normal distribution:


Again, you can see that after a number of iterations the model parameters have stabilised quite nicely.

And here's the evolution of the prior weights of all the candidate models:


You can see several things from this plot. The most obvious is that there are four potential models present. We consider all four separately and in combination. Also, you can see that the weight of all but the normal model drops to 0 eventually but that, apart from an initial minor blip, the weight for the normal model stays strong and stable at 1. This shows that autodetect has automatically detected that this particular data set is almost certainly best modelled exclusively by a normal distribution.

I could go on (I do tend to get carried away by this stuff) but I feel I should leave it there.

Oh ok, you convinced me. Just one more screenshot. Here's a top hat with baubles on (otherwise known as a transient bi-modal normal distribution, with anomalies shown):


Happy festive season, everyone! I hope this little piece has inspired you to look at anomaly detection in a new light!