This indicates that, because of numerical precision issues, we've failed to calculate a quantity we use to decide whether to create multiple clusters for the data. In this case we will be "cautious" and choose not to create multiple clusters. That may well be the correct decision anyway given the data characteristics, and it shouldn't significantly impair the modelling, i.e. we should still have valid models with which to detect genuine anomalies. With this error message I have a good chance of reproducing the issue and fixing the instability in our code.
Interestingly, the values printed suggest that the input data has a very large range, with values as negative as -1e18. I notice that some of your detectors run metric functions on what are described as hashes of quantities. If these are unsigned 64-bit integer hashes then you may be running into overflow when storing them in Elasticsearch (whose integer types are signed). [Also, I wonder whether you expect these values to be confined to some interval and are interested in when they fall outside it. If the hash is uniform over the whole range, I'm not sure anomaly detection on the mean value is useful. If you are instead interested in, say, the hashes becoming less diverse, a better measure would be the variation in the hash values, using our varp function.]
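To illustrate the overflow scenario: a minimal Python sketch, using the first 8 bytes of MD5 as a stand-in for a 64-bit murmur3 hash (the actual hash function is an assumption here). Any unsigned 64-bit value with the top bit set becomes a large negative number when reinterpreted as a signed 64-bit integer, which is how a signed `long` field would store it:

```python
import hashlib
import struct

def hash64_unsigned(text: str) -> int:
    # Derive a 64-bit unsigned hash from the first 8 bytes of MD5.
    # (Stand-in for murmur3, which is not in the standard library.)
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return struct.unpack(">Q", digest[:8])[0]

def as_signed_64(value: int) -> int:
    # Reinterpret the unsigned 64-bit value as signed 64-bit, i.e.
    # what you get back after storing it in a signed integer field.
    return struct.unpack(">q", struct.pack(">Q", value))[0]

# Any unsigned value >= 2**63 wraps to a negative signed value:
assert as_signed_64(2**63) == -(2**63)          # most negative long
assert as_signed_64(2**64 - 1) == -1            # largest unsigned -> -1

unsigned = hash64_unsigned("example document field")
signed = as_signed_64(unsigned)
# When the top bit is set, `signed` is negative, on the order of -1e18,
# matching the values seen in the error message.
```

Roughly half of uniformly distributed 64-bit hashes have the top bit set, so negative values around -1e18 would appear frequently if this is what's happening.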
You've hit the nail on the head: we're experimenting with anomaly detection on ingested text data by storing the length and a hash (murmur3) of several fields per document.
I would bet that in some cases the hash is overflowing and your point about hashes becoming less or more diverse is correct.
Am I correct in thinking that I would need to use an 'advanced' job to make use of the varp function?