I have two data sets: Set X{a,b,c} and Set Y{A,B,C,D}.
I want to take an event (attribute A) from Y, identify a pattern (based on attribute a) in X,
and thus forecast any possible future event in Y.
I have uploaded both data sets to the same index in ELK; however, I am confused about the ML job configuration. I have explored the multi-metric job but am unable to figure out how to configure the above scenario. Please help.
Does your data fall into the category of timeseries? Reading your description of the problem I wonder whether this is a timeseries problem or a classification task.
Can you describe it in more detail, ideally with an example of the input and the expected output?
Forecasting extrapolates a timeseries based on a model of its past, e.g. the rate of events over time. Maybe forecasting isn't what you are looking for; nevertheless, if you can elaborate a bit more, I might be able to help even if ML/Forecast is not for you.
I want to identify patterns in the events of the intrusionDetection set that are associated with leakDetection events, and thus forecast the future leakLocation. So, I was plotting Max(leakLocation) with a bucket size of 1m, taking intrusionLocation as an influencer. I am not sure whether this is the correct way or not.
There are a number of intrusion events and not every intrusion leads to a leak. I am looking for a mechanism through which the system can identify the pattern in intrusion events when a leak occurred, and detect or give a probabilistic analysis of the future leak location.
The max() function is a metric-based function for analyzing the maximum value of a numerical field over time (for example, max(response_time) or max(cpu_pct)). It is not appropriate for determining the probability of occurrence or non-occurrence of events. We have count functions for analyzing occurrence.
However, with that said - our anomaly detection and forecasting of occurrence is currently univariate, meaning it models a single variable (for example: "What's the probability that A occurs in the next X minutes, given the rate at which it has occurred in the past?").
There is NO current mechanism that allows one to answer questions like "Given that A and B just happened, what's the probability that C will occur?" So, if this is the kind of thing you're trying to do, then it is not possible with the current state of our ML product.
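To make the univariate, count-based case concrete, here is a rough sketch of what a job body for "rate of leak events per location" might look like, written as a Python dict so it can be serialised to JSON. The field names leakLocation and @timestamp are assumptions based on this thread, the 15m bucket span is an arbitrary choice, and the datafeed would still need a query that selects only the leak events. Please check the job creation API documentation for your version before relying on this shape.

```python
import json

# Illustrative job body only; field names are assumptions, not taken from a real mapping.
job_config = {
    "description": "Rate of leak events per location (illustrative)",
    "analysis_config": {
        # Sparse events usually need a longer bucket than 1m; 15m is an arbitrary pick.
        "bucket_span": "15m",
        "detectors": [
            {
                # count models how often events occur, unlike max(), which models a metric value.
                "function": "count",
                # One independent model per leak location (assumed field name).
                "partition_field_name": "leakLocation",
            }
        ],
        "influencers": ["leakLocation"],
    },
    # Assumed name of the timestamp field in the index.
    "data_description": {"time_field": "@timestamp"},
}

print(json.dumps(job_config, indent=2))
```

A job like this can only tell you whether the leak rate at a location is unusual, and forecast that rate forward. It does not relate leaks to intrusions.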
I just thought I'd add some additional context since this touches on some areas we're currently investigating.
First of all, to formalise slightly, what you're asking for is the following: you have a training data set with examples (X, Y), where X are some time-stamped examples of intrusion detection plus attributes, and Y is one of 1{leak will occur}, time of leak, location of leak, etc. You want to learn a function f(X) in each case that minimises some error measure between Y and f(X).
This is really just a classification/regression problem, albeit one where the time stamps give it a time series flavour. I would probably address this by throwing in some time-related features, such as the values of things at different time lags from the leak whose attributes you want to predict, and using a discriminative approach.
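As a minimal sketch of the "time lag features" idea, the snippet below turns raw event times into (features, label) pairs: the features are intrusion counts in a few windows before an evaluation time t, and the label says whether a leak occurred within a horizon after t. All names here (build_examples, lags, horizon) and the toy timestamps are hypothetical, and times are plain numbers (say, seconds) for simplicity.

```python
from bisect import bisect_left

def window_count(sorted_times, start, end):
    """Number of events with start <= time < end (input must be sorted)."""
    return bisect_left(sorted_times, end) - bisect_left(sorted_times, start)

def build_examples(intrusion_times, leak_times, eval_times, lags, horizon):
    """Build (features, label) pairs for a discriminative model.

    lags is a list of (lo, hi) offsets; each feature counts intrusions
    in the window [t - hi, t - lo) before evaluation time t.
    """
    intrusion_times = sorted(intrusion_times)
    leak_times = sorted(leak_times)
    examples = []
    for t in eval_times:
        features = [window_count(intrusion_times, t - hi, t - lo) for lo, hi in lags]
        # Label is 1 if any leak falls in [t, t + horizon), else 0.
        label = 1 if window_count(leak_times, t, t + horizon) > 0 else 0
        examples.append((features, label))
    return examples

# Made-up toy data, purely to show the shape of the output.
intrusions = [5, 50, 55, 58, 200]
leaks = [70]
lags = [(0, 10), (10, 60)]
examples = build_examples(intrusions, leaks, eval_times=[60, 210], lags=lags, horizon=30)
print(examples)
```

The resulting pairs can then be fed to whatever classifier you like. Attributes such as location could be added as further features in the same way.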
We don't currently support regression analysis with our time series modelling capability. These sorts of problems typically require a certain amount of exploratory analysis with you (the user) trying out different transformations of the data X and testing accuracy. As I mentioned, developing tooling for these sorts of problems is something we're actively thinking about.
However, as part of the process of building a good set of predictive features, you might consider augmenting X with anomalies in telemetry data, user/machine behaviour, etc., which you could configure jobs to generate.
A general note on feasibility: I don't know how many examples of leaks you have, but it seems likely that it won't be many. In that case, I think it is unlikely that you can learn any function f(X) that generalises well. Ask yourself: "Is it conceivable that the training set contains a representative enough set of leaks to predict things about future ones?" If the answer is no, then it doesn't matter what approach you use to learn f(X); it won't be accurate on most unseen examples. In summary, if you have little data, I'd focus on simple features and models which might be able to learn simple things that do somewhat better than guessing.
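One way to check "somewhat better than guessing" on a small data set is to compare leave-one-out accuracy against the accuracy of always predicting the most common label. The sketch below does this with a 1-nearest-neighbour rule, which is just a stand-in for any simple model. The helper names and the toy data are made up for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def loo_accuracy(features, labels, predict):
    """Leave-one-out accuracy of predict(train_X, train_y, x)."""
    hits = 0
    for i in range(len(labels)):
        # Hold out example i and train on the rest.
        train_X = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += predict(train_X, train_y, features[i]) == labels[i]
    return hits / len(labels)

def nearest_neighbour(train_X, train_y, x):
    """Predict the label of the closest training point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[distances.index(min(distances))]

# Made-up toy data: rows are lagged intrusion counts, labels mark whether a leak followed.
X = [[3, 1], [4, 0], [0, 0], [1, 0], [0, 1], [5, 2]]
y = [1, 1, 0, 0, 0, 1]

print("baseline:", majority_baseline_accuracy(y))
print("leave-one-out:", loo_accuracy(X, y, nearest_neighbour))
```

The toy data here is deliberately easy to separate, so it says nothing about how real intrusion and leak data would behave. With only a handful of leaks, a gap between the two numbers should be treated with caution.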