Recently I have been working with the machine learning capabilities of X-Pack, and I have reached the point where understanding how the ML algorithms work, even at a basic level, would be highly useful. However, I haven't found any information on how they work beyond the highest-level overview.
To give a more concrete example: I have been experimenting with learning on some firewall data, and what keeps happening is that the learning is thrown off by highly variable data. I have adjusted the bucket sizes, but that hasn't helped much. I would also like to understand what affects the machine learning "influencers", since the options I have chosen are not revealing anything beyond the obvious.
I have a decent background in math, though none in machine learning. I would be happy to read up on how these algorithms work and how to pick better features. However, looking at the reference pages, the best information I could find was that X-Pack employs "proprietary machine learning algorithms".
Is there more information than that posted somewhere? I assume these algorithms are based on some published papers; could those references be given? I understand that something like the source code can't be handed out, but there must be more information available.
Of note, while the post below was useful for getting started, the recipes didn't give good reasons why one feature was chosen over another. Are highly variable fields useful? Are fields with a large amount of normal data useful?
Good, robust configurations for complex multi-dimensional datasets such as firewall logs can be difficult to get right, and our recipes aim to simplify this process by shrink-wrapping recommended configurations. We are currently putting significant effort into these and we're happy to collaborate. If you can share more details about your use cases and data characteristics, we can provide more insight.
Thank you for that information, Steve; those look like excellent resources. Here is some background and what I ran into. The third issue seems more like a misconfiguration or bug than anything else, so I can open a new topic for that if you want.
From another tool, we receive a score for network flows based on behavior, and that information goes into a daily index in our cluster. Both are still in development, but I am trying to use the ML capabilities to see whether they can provide a good sense of the data and better alerting. We hold 25 days of the score data, so after figuring out how the ML jobs work, I set one up to run from the start of the data and then continue learning from what is streamed in once it has caught up. It calculates the mean of the score with 5m buckets. (I also tried 15m buckets, but the results were worse than what I have now.) One note: the score is known to settle around 5 to 10 with normal traffic, while highly anomalous traffic should get above 100.
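For reference, the job looks roughly like the sketch below (the field names, influencer, job id, and host are placeholders, and I'm leaving out the datafeed; the ML endpoint lives under `_xpack/ml` on 5.x/6.x and `_ml` on newer releases):

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder: local cluster, no auth

# Rough sketch of the job configuration described above: mean(score)
# analysed in 5-minute buckets. Field names ("score", "@timestamp") and
# the influencer are stand-ins for whatever the index actually contains.
job_config = {
    "description": "Mean of flow behavior score, 5m buckets",
    "analysis_config": {
        "bucket_span": "5m",
        "detectors": [{"function": "mean", "field_name": "score"}],
        "influencers": ["source_ip"],  # placeholder influencer field
    },
    "data_description": {
        "time_field": "@timestamp",
        "time_format": "epoch_ms",
    },
}

# On 5.x/6.x the path is _xpack/ml/...; on newer releases it is _ml/...
resp = requests.put(
    f"{ES_URL}/_xpack/ml/anomaly_detectors/flow-score-mean", json=job_config
)
print(resp.json())
```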
Here are some highlights of the information that I ended up getting.
Near the beginning of the data, the ML looked like it was dialing in very well on the expected values. Then it suddenly switched from smooth to jagged lines, right after it came across a detected anomaly. Those two big spikes are some randomized network scanners of ours; they are known anomalous behavior. It seems those spikes threw off the learning algorithm for quite a while. Is there a way to tell the algorithm that certain data is anomalous and that it shouldn't learn from it?
The next picture below shows the status of the job a while later. You can see that the actual data stays pretty close to the expected 10 range, but the ML expected values are all over the place. Why is that? Shouldn't it have narrowed back down by now? This is five days after those first threat-scanner spikes, and the data has been calm with no major spikes.
Finally, the job caught up to the current time and seems to have broken completely. The actual scores are still nice and calm, but the expected values range from 140 down to -10. (Note: there are no negative scores.) Why is this happening? Do the error bounds expand wildly when it starts working on current data? Can I stop that from happening?
Many thanks for the detail; there is a lot to explain and comment on.
Firstly, the zoomed-in charts are not always visually representative of the analysed data. Data at 5m granularity displayed at 30m granularity will have spikes smoothed out, so it is not a true representation of what is actually happening. The analysed granularity is different from the displayed granularity, so I suspect charts 2 and 3 are visually misleading. We currently allow a user to display at a granularity coarser than the analysed one, which allows flexibility but can be misleading if this is not understood.
Secondly, if the scores are positive integers, we should model them as such. This involves modelling them as counts via the summary_count_field.
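For illustration only (not an exact configuration for your data), that could look something like the following in the job's analysis_config; the parameter is spelled `summary_count_field_name`, and the "score" field name is a placeholder for whatever the documents actually contain:

```python
# Illustrative sketch: treat the positive-integer score as a count by
# pointing summary_count_field_name at it and using a count detector.
# Field names are placeholders.
analysis_config = {
    "bucket_span": "5m",
    "summary_count_field_name": "score",
    "detectors": [{"function": "count"}],
}
```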
Finally, our models are constantly evolving and we use techniques such as winsorization to reduce the effect of outliers on models. Seeing the real data at 5m granularity should help explain this.
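For the curious, winsorization in the generic sense just means clamping extreme values towards chosen percentiles so that outliers have limited influence. A toy sketch of the general idea (not our actual implementation, and with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values outside the given percentiles to those percentiles.

    A toy illustration of the general technique only; the production
    models down-weight outliers in a more sophisticated way.
    """
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Simulated bucket means: mostly the usual 5-10 range, plus two scanner spikes.
scores = np.concatenate([rng.uniform(5, 10, size=48), [250.0, 310.0]])
print(round(scores.mean(), 1))             # pulled well above the usual range
print(round(winsorize(scores).mean(), 1))  # back close to the 5-10 range
```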
If you can share this data as a simple time series (e.g. time, value) we can show you an optimal configuration and if there are any bugs we can resolve.
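For example, something along these lines would produce a (time, value) CSV at the analysed granularity; the host, index pattern, and field names below are placeholders, and older versions use "interval" rather than "fixed_interval" in the date_histogram:

```python
import csv
import requests

ES_URL = "http://localhost:9200"   # placeholder: local cluster, no auth
INDEX = "flow-scores-*"            # placeholder index pattern

# Aggregate the raw scores into 5-minute mean values -- the same
# granularity the job analyses -- and dump them as (time, value) rows.
query = {
    "size": 0,
    "aggs": {
        "by_time": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "5m"},
            "aggs": {"mean_score": {"avg": {"field": "score"}}},
        }
    },
}
buckets = requests.post(f"{ES_URL}/{INDEX}/_search", json=query).json()[
    "aggregations"]["by_time"]["buckets"]

with open("scores_5m.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", "value"])
    for b in buckets:
        writer.writerow([b["key_as_string"], b["mean_score"]["value"]])
```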
Just to add some additional information regarding the change from smooth to spiky bounds: this is because we have detected a periodic component in the mean score with a period of around 2 hours. After this the bounds will expand somewhat because of uncertainty in the estimates of the additional model parameters (but also see below). The nearer in time to changes in the modelling, and also to the start of the data set, the more sensitive the modelling is to large outliers. This explains the relatively large impact of the two spikes in the first image.
As Steve says, these images are somewhat confused by the difference between the chart aggregation interval and the analysis bucketing interval. The actual values show the mean of the scores over the 30-minute intervals (which apparently smooths out the periodic component in the scores). The bounds, on the other hand, show the maximum value of the upper bound and the minimum value of the lower bound, so the interval is widened w.r.t. the actual bounds you would see if the chart aggregation interval were 5 minutes.
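As a toy illustration of that effect (made-up numbers), rolling six 5-minute buckets into one 30-minute display bucket smooths the actuals but widens the band:

```python
# Toy illustration (made-up numbers): six 5-minute buckets rolled up into
# one 30-minute display bucket. The displayed actual is the mean of the
# bucket means, while the displayed band takes the max upper and min lower
# bound, so the band looks wider and the actual looks smoother than at 5m.
actuals = [6.0, 9.0, 7.0, 12.0, 8.0, 6.0]
uppers  = [14.0, 18.0, 15.0, 25.0, 16.0, 14.0]
lowers  = [2.0, 3.0, 2.5, 1.0, 2.0, 2.5]

displayed_actual = sum(actuals) / len(actuals)   # 8.0 -- spike smoothed out
displayed_upper  = max(uppers)                   # 25.0 -- widest upper wins
displayed_lower  = min(lowers)                   # 1.0  -- lowest lower wins
print(displayed_actual, displayed_upper, displayed_lower)
```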
I agree that the large change in the bounds at the beginning of Wednesday is odd. Just to confirm, is this exactly the point at which the modelling switched from historical to realtime? There should be nothing unusual about this point from a modelling perspective. One possibility is that the model has identified scores around 100 as a separate mode in their distribution. This is possible if they occur frequently enough. In that case, a confidence interval as we display it is clearly a poor representation of our predicted distribution. (This is something we are considering improving.) If you could share this data set, it would be very useful to help you further.
I am happy to pass along the base version of the data, but I will not be able to get approval until Monday morning. (People are off on vacation.)
As for the large change in bounds, I was slightly off in what I said earlier. Checking the timestamps, the bounds changed at midnight of the job's start day. So it was started Tuesday at 10 a.m., took most of the day to catch up to the current time, but was still normal until it switched to a new index at 12 a.m. Wednesday morning.