I have an issue with the packetbeat rare dns question ML job: it generates quite a few anomalies because our hosts frequently contact *.avqs.mcafee.com URLs, which contain a random part. For example:
So I'd like to discuss what the best long-term, flexible solution would be, so that I can exclude certain domains when needed without having to rebuild the ML job.
Some possible solutions:
I could filter out *.avqs.mcafee.com in dns.question.name in the ML datafeed query.
Even better (so I don't have to use an expensive leading wildcard query), I could filter out mcafee.com in dns.question.registered_domain.
But both of the above options would require me to stop the datafeed and the job, then update the datafeed query, which is not really user-friendly.
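To make the datafeed option concrete, here is a minimal sketch of the stop, update, start sequence. It only builds the request bodies and does not send anything. The datafeed ID and the existing query are hypothetical placeholders, and the existing query is wrapped rather than replaced so that whatever the datafeed currently filters on is preserved:

```python
import json

def build_exclusion_query(existing_query, excluded_domains):
    """Wrap the datafeed's current query and drop events for excluded domains."""
    return {
        "bool": {
            # Keep whatever the datafeed was already filtering on.
            "filter": [existing_query],
            # A terms query on the registered domain avoids a leading wildcard.
            "must_not": [
                {"terms": {"dns.question.registered_domain": sorted(excluded_domains)}}
            ],
        }
    }

def datafeed_update_steps(datafeed_id, excluded_domains, existing_query):
    """Return the (method, path, body) calls, in the order they would be issued."""
    base = f"_ml/datafeeds/{datafeed_id}"
    new_query = build_exclusion_query(existing_query, excluded_domains)
    return [
        ("POST", f"{base}/_stop", None),
        ("POST", f"{base}/_update", {"query": new_query}),
        ("POST", f"{base}/_start", None),
    ]

# Hypothetical datafeed ID and placeholder query, for illustration only.
for method, path, body in datafeed_update_steps(
    "datafeed-example", ["mcafee.com"], {"match_all": {}}
):
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

Every new domain means repeating all three calls, which is the part that feels clumsy.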
Ideally I'd love to use a whitelist filter list like this:
But dns.question.registered_domain is not available as a field to scope the rule on. Feedback on how I could dynamically filter on dns.question.registered_domain is welcome.
Or is my only option to update the datafeed query in the ml job?
Have you used the filter lists under Settings in Machine Learning? They might help with what you're trying to do. I haven't used them directly myself, but I hear good things about them from others. Note that they filter those things out before the anomalies are produced, but for a lot of people that's exactly what they're aiming for:
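As a rough sketch of what that could look like through the API rather than the UI: the snippet below builds a filter list request and a job update that attaches a skip_result rule. The job ID, filter ID, and description are made up, and it assumes the detector's by field is dns.question.name, since a rule can only be scoped to a field the detector actually uses. It also relies on filter items accepting a leading wildcard, so treat it as something to verify rather than a confirmed fix:

```python
import json

def filter_request(filter_id, items):
    """Build the request that creates a filter list of safe DNS names."""
    body = {
        "description": "DNS names to ignore (illustrative)",
        "items": sorted(set(items)),
    }
    return ("PUT", f"_ml/filters/{filter_id}", body)

def skip_rule_update(job_id, filter_id, scoped_field, detector_index=0):
    """Build a job update that skips results for values in the filter list."""
    rule = {
        "actions": ["skip_result"],
        # "include" means the rule applies when the value IS in the list.
        "scope": {scoped_field: {"filter_id": filter_id, "filter_type": "include"}},
    }
    body = {"detectors": [{"detector_index": detector_index, "custom_rules": [rule]}]}
    return ("POST", f"_ml/anomaly_detectors/{job_id}/_update", body)

# Hypothetical IDs; scoped_field is assumed to be the detector's by field.
requests = [
    filter_request("safe_dns_names", ["*.avqs.mcafee.com"]),
    skip_rule_update("example_rare_dns_job", "safe_dns_names", "dns.question.name"),
]
for method, path, body in requests:
    print(method, path)
    print(json.dumps(body, indent=2))
```

Once the rule is attached, adding another domain should only mean editing the filter list, with no need to touch the datafeed.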
While working on this, I came up with some additional questions.
Is it possible to configure a rule for an ML job before the job has been started, for example at creation time or while editing? I'm asking because I created a new job from scratch and want to prevent internal URLs and other known domains that should be whitelisted from 'polluting' my ML model.
Afaik this is not possible yet. Is this already on Elastic's to-do list? If not, should I open a GH issue for it?
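For what it's worth, the create job API documents a custom_rules property on each detector, so attaching the rule at creation time may already be possible through the API even if the UI doesn't offer it. A minimal sketch of such a job body follows; the bucket span, detector function, by field, and filter ID are all assumptions for illustration and are not copied from the packetbeat module:

```python
import json

def rare_dns_job_body(filter_id, by_field="dns.question.name"):
    """Build a job body whose detector carries a skip rule from the start."""
    rule = {
        "actions": ["skip_result"],
        "scope": {by_field: {"filter_id": filter_id, "filter_type": "include"}},
    }
    return {
        "description": "Rare DNS question with a safe-domain rule (illustrative)",
        "analysis_config": {
            "bucket_span": "1h",  # assumed value
            "detectors": [
                {"function": "rare", "by_field_name": by_field, "custom_rules": [rule]}
            ],
        },
        "data_description": {"time_field": "@timestamp"},
    }

# Would be sent as the body of PUT _ml/anomaly_detectors/<job_id>.
print(json.dumps(rare_dns_job_body("safe_dns_names"), indent=2))
```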