I've just started with machine learning jobs and I want to create a job that detects port scans.
I want to aggregate data by source.ip, then by destination.ip, and finally count the number of destination.port values.
Could you tell me how to set up an aggregation like this in a machine learning job?
However, you likely have a very high cardinality of IP addresses. May I suggest that you instead use Population Analysis and configure something like the following:
detector: distinct_count(destination.port) over destination.ip
influencers: destination.ip, source.ip
Population analysis effectively eases the burden of the high-cardinality destination IP field, and the source IP, as an influencer, only gets analyzed when there is an anomaly in the distinct count defined by the detector.
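
In case it helps, here is a rough sketch of how the detector and influencers above might look as the body of an anomaly detection job (the kind of body sent to PUT _ml/anomaly_detectors/<job_id>). The helper name build_port_scan_job, the 15m bucket span, and the @timestamp time field are my own assumptions for illustration, not something from your setup, so adjust them to your data. The script only builds and prints the JSON; it does not talk to a cluster.

```python
import json


def build_port_scan_job(bucket_span="15m", time_field="@timestamp"):
    """Build a job body for the population detector described above.

    bucket_span and time_field are assumed defaults, not values taken
    from the original question.
    """
    return {
        "description": "Distinct destination ports per destination IP (population analysis)",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                {
                    # distinct_count(destination.port) over destination.ip
                    "function": "distinct_count",
                    "field_name": "destination.port",
                    "over_field_name": "destination.ip",
                }
            ],
            # Influencers are reported alongside anomalies found by the detector.
            "influencers": ["destination.ip", "source.ip"],
        },
        "data_description": {"time_field": time_field},
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a request to the ML job API.
    print(json.dumps(build_port_scan_job(), indent=2))
```

The over_field_name entry is what makes this a population job: each destination.ip is compared against the behaviour of the other destination IPs rather than being modelled purely on its own history.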