Ever wondered what's really going on behind the scenes with Elastic's unsupervised machine learning anomaly detection modelling? (And if not, why not?!) Sure, you may think you know what's going on: you've read our extensive and beautifully written documentation, maybe you've even enabled model plot in the job configuration and viewed the results in the single metric viewer (but would you like to know more?). Maybe you've downloaded the backend source code, compiled it, and run the extensive tests. If you have, kudos to you! But do you want to understand what those tests are about? Read on!
For many reasons, the anomaly detector model state is snapshotted periodically. You can view these model snapshots by retrieving the snapshot ID (via get_model_snapshots) and then searching in the .ml-state index in Elasticsearch. However, unless you can easily decode the base64 encoded, compressed model state, you probably won't learn very much from looking at it.
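For example (a minimal sketch, assuming an unsecured cluster on localhost:9200, a job id of test, and a recent version where the ML APIs live under _ml), fetching the snapshot IDs and then peeking at the raw state documents might look something like this:

import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster
JOB_ID = "test"               # assumption: the job id we'll use later in this post

# List the model snapshots for the job - each one carries a snapshot_id
resp = requests.get(f"{ES}/_ml/anomaly_detectors/{JOB_ID}/model_snapshots").json()
for snapshot in resp.get("model_snapshots", []):
    print(snapshot["snapshot_id"], snapshot["timestamp"])

# Peek at the raw state documents: base64 encoded, compressed and not much fun to read
state = requests.post(f"{ES}/.ml-state/_search", json={"size": 3}).json()
for hit in state["hits"]["hits"]:
    print(hit["_id"])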
The good news is that you don't need to worry about doing that anymore. Hidden away in the ml-cpp repository is a little tool with big aspirations. The model_extractor source code lives in the devbin directory.
One of the goals of model_extractor is to fill the gap between the unit tests and the extensive integration tests in Elasticsearch. Not forgetting the quite frankly heroic efforts of the Machine Learning QA team (seriously, these folks are the unsung heroes, slaving away behind the scenes, who make each release what it is).
The design and implementation of model_extractor is simple. Using the existing ml-cpp APIs, it decodes the compressed model state generated by the primary anomaly detector executable, autodetect. Once this is done, the model state can be dumped to a file in human readable format (either XML or JSON) and easily parsed by any number of scripting languages such as perl or python. It can even do this at regular periodic intervals. We'll see exactly how to do that soon.
If you were really keen, you could even parse all the model state documents and display the evolution of model parameters over time. Which, incidentally, is exactly what I've done and am keen to share some of the more interesting aspects of that exercise with you now.
Perhaps the simplest example of an anomaly detection job is called simple count. This detector... well, I guess the clue is in the name, right? As a first foray into what model_extractor can show us about the evolution of model parameters, let's set up an anomaly detection job on the command line. We'll pass in similarly simple data, comprised of just two columns, contained in a CSV file. The first column is a time stamp and the second column is an integer representing a count of something (I'll leave it up to you to think of what the count might represent, but do be creative. It is the festive season after all!). I said the data set would be simple, so let's make those counts conform to a normal distribution (what is "normal" anyway? Who's the judge? At Elastic we say "you do you and that's ok").
These are the first 10 lines of a CSV file containing a normally distributed timeseries of counts. Note the existence and values of the column headings:
time,count
1484006400,1352
1484006460,1080
1484006520,1195
1484006580,1448
1484006640,1373
1484006700,804
1484006760,1190
1484006820,969
1484006880,979
I used python to generate the CSV file, but you can easily do something similar using your preferred coding language. You are not restricted to using Elastic's source code:
import numpy as np

# Two weeks of one-minute buckets, starting at a fixed epoch timestamp
start_time = 1484006400
np.random.seed(0)
bucket_span = 60
num_buckets = 14 * 24 * 60
end_time = start_time + (num_buckets * bucket_span)
# Normally distributed counts with mean 1000 and standard deviation 200
mu = 1000
sigma = 200
samples = np.random.normal(mu, sigma, size=num_buckets).astype(int)
times = range(start_time, end_time, bucket_span)
# Write out the "time,count" CSV that autodetect will consume
with open("normal.csv", "w") as f:
    f.write("time,count\n")
    for t, count in zip(times, samples):
        f.write(f"{t},{count}\n")
As you can see, this code snippet generates a normal distribution with mu (mean) of 1000 and sigma (standard deviation) of 200. Remember those numbers, they will come in handy later.
Using this approach, you could generate many different kinds of synthetic data sets that exercise different aspects of autodetect's modelling. For example, you could also generate data sets that conform to a log-normal, gamma or poisson distribution, or any combination of these. The possibilities are endless (ok, maybe not endless, statistics never was my strong point!) but let's keep things simple for now.
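If you do want to experiment, here's a rough numpy sketch of those alternative distributions; the parameter values are purely illustrative and not taken from anything in this post:

import numpy as np

num_buckets = 14 * 24 * 60  # the same fortnight of one-minute buckets as before
np.random.seed(0)

# Illustrative parameter choices only - tune these to taste
lognormal_samples = np.random.lognormal(mean=7.0, sigma=0.2, size=num_buckets)
gamma_samples = np.random.gamma(shape=25.0, scale=40.0, size=num_buckets)
poisson_samples = np.random.poisson(lam=1000, size=num_buckets)

# Or a crude mixture: each bucket drawn from one of two distributions at random
mixture = np.where(np.random.rand(num_buckets) < 0.5, lognormal_samples, gamma_samples)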
Speaking of simple, let's simply dive right in and look at how to run autodetect in a "pipeline" with model_extractor in order to extract model state after every bucket has been processed. Here's what running that anomaly detection job from the command line might look like:
autodetect --jobid=test --bucketspan=60 --summarycountfield=count --timefield=time --delimiter=, --modelplotconfig=modelplotconfig.conf --fieldconfig=fieldconfig.conf --persist=normal_named_pipe --persistIsPipe --bucketPersistInterval=1 --persistInForeground --input=normal.csv --output=normal.out
where the contents of modelplotconfig.conf are:
boundspercentile = 95.0
terms =
and fieldconfig.conf contains:
detector.0.clause = count
Some explanation might help here. Fortunately, all is explained in README.md. Incidentally, the modelplotconfig.conf configuration is the same as that used when you select the generate model plot option when creating a job in our super easy-to-use anomaly detector job wizard in Kibana. This also helps explain the mystery of what the model plot bounds actually represent: they indicate that we are 95% confident that a point in "the shaded area" in the single metric plot is not an anomaly. Finally, I think the fieldconfig.conf configuration speaks for itself.
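Incidentally, for a plain normal distribution you can get a rough feel for where those 95% bounds should end up with a couple of lines of scipy. This is only a sanity check; the real model plot bounds come from the full predictive distribution, which also accounts for things like parameter uncertainty:

from scipy.stats import norm

# Central 95% interval of a normal model with the parameters autodetect
# ends up learning later in this post (mean ~1000, standard deviation ~200)
lower, upper = norm.interval(0.95, loc=999.641, scale=198.407)
print(round(lower), round(upper))  # roughly 611 and 1389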
And here is the corresponding command line for model_extractor:
model_extractor --input=normal_named_pipe --inputIsPipe --output=normal.xml --outputFormat=XML
Again, some explanation of the parameters might help you understand what's going on:
./model_extractor --help
Usage: model_extractor [options]
Options::
--help Display this information and exit
--version Display version information and exit
--logProperties arg Optional logger properties file
--input arg Optional file to read input from - not present
means read from STDIN
--inputIsPipe Specified input file is a named pipe
--output arg Optional file to write output to - not present
means write to STDOUT
--outputIsPipe Specified output file is a named pipe
--outputFormat arg (=JSON) Format of output documents [JSON|XML].
Let's pull all that together in a script:
#!/bin/bash
if [ $# != 1 ]; then
echo "Usage: $0 <autodetect csv input file>"
exit 1
fi
INPUT=$1
PREFIX=$(basename ${INPUT} .csv)
# Run autodetect in the background, persisting model state to a named pipe every bucket
autodetect --jobid=test --bucketspan=60 --summarycountfield=count \
    --timefield=time --delimiter=, --modelplotconfig=modelplotconfig.conf \
    --fieldconfig=fieldconfig.conf --persist normal_named_pipe --persistIsPipe \
    --bucketPersistInterval=1 < ${INPUT} > ${PREFIX}.out 2> ${PREFIX}.log &
# Read the state documents from the pipe and decode them to XML
model_extractor --input normal_named_pipe --inputIsPipe --output normal.xml --outputFormat="XML"
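If you save that script as, say, run_job.sh (the name is arbitrary) alongside the two config files and run it with normal.csv as its single argument, autodetect streams its model state to the named pipe in the background while model_extractor drains it, leaving the detector results in normal.out and the decoded state in normal.xml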
... but those details aren't that important right now.
What is important is the output from model_extractor. We told it to write the decoded model state data to an XML file called normal.xml. We won't look directly at that file. Suffice to say that it consists of a number of model state documents in XML format (in reality, the XML strings are wrapped up as JSON in order to be compatible with Elasticsearch's indexing and, one more caveat, it's not strictly compliant XML anyway). Let's fix that, tidy up the formatting and home in on something interesting.
The highlighted rectangle shows that the normal prior model generated by autodetect from the input data has a mean of 999.641 and a standard_deviation of 198.407. Hark back to the generated input data, which had a mu (mean) of 1000 and sigma (standard deviation) of 200. This shows that autodetect did eventually learn the correct model parameters after a number of iterations (buckets processed).
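If you fancy checking that yourself, a rough sketch of the parsing is below. Because the dump isn't strictly compliant XML (see above), I've fallen back on a regular expression; the tag names mirror the fields visible in the screenshot (mean, standard_deviation, log_weight), but the exact names and nesting vary between versions, so treat this as illustrative rather than definitive:

import re

def extract_values(path, tag):
    # Pull every <tag>...</tag> value out of the dumped state documents
    with open(path) as f:
        text = f.read()
    return [float(v) for v in re.findall(rf"<{tag}>([-0-9.eE]+)</{tag}>", text)]

means = extract_values("normal.xml", "mean")
std_devs = extract_values("normal.xml", "standard_deviation")
print(means[-1], std_devs[-1])  # should end up close to 1000 and 200

Swap in log_weight for the tag name and you have the prior weights we'll look at next.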
The other thing of interest from this screenshot of a model state document is that the normal prior model has a log_weight value of 0 (so an actual weight of 1), while the other candidate priors have negative log_weight values (and hence actual weights of less than 1). This indicates that autodetect has quite some confidence that the normal prior model is indeed the sole or primary candidate for modelling the distribution of the input data.
Pretty neat eh? No? Perhaps seeing some plots of the anomaly detection results and the evolution of model parameters over time might convince you?
Here are the results, with model bounds overlaid and anomalies represented by the same colours as they are in the single metric viewer in Kibana. As a reminder, here's the colour key:
[image: kibana_anomaly_colours]
Time (represented as seconds since the epoch, Jan 1, 1970) is along the x-axis, while the data count is along the y-axis.
[image: normal_results_with_model_bounds_and_anomalies]
As you can see, there are a number of anomalies found of all severities. Moving on... What? You have questions? The answers are... well, complicated. For those interested, here is a blog on something called normalisation and another about multi-bucket impact anomalies that should answer all of those questions. There are also a number of other resources available, primarily the documentation I referred to earlier, but if you're still stuck just ask here, someone (maybe even me!) will be happy to help.
Back to the results! Here's the evolution of model parameters over time for the normal model prior:
As you can see, the values are "all over the place" (to use a technical term) initially but as time goes on (more buckets are processed) the values stabilise.
Here are those same parameters again but this time we'll show the evolution of the resulting normal distribution:
[image: normal_params_3d]
Again, you can see that after a number of iterations the model parameters have stabilised quite nicely.
And here's the evolution of the prior weights of all the candidate models:
[image: evolution_prior_weights]
You can see several things from this plot. The most obvious is that there are four potential models present; we consider all four separately and in combination. Also, you can see that the weight of all but the normal model eventually drops to 0, but that, apart from an initial minor blip, the weight of the normal model stays strong and stable at 1. This shows that autodetect has automatically detected that this particular data set is almost certainly best modelled exclusively by a normal distribution.
I could go on (I do tend to get carried away by this stuff) but I feel I should leave it there.
Oh ok, you convinced me. Just one more screenshot. Here's a top hat with baubles on (otherwise known as a transient bi-modal normal distribution, with anomalies shown):
[image: top_hat]
Happy festive season, everyone! I hope this little piece has inspired you to look at anomaly detection in a new light!