I found that an IP address on the same network is detected as an anomaly.
E.g. let's say 10.180.1.161 is the IP address regularly used by a specific user. When that same user logs in from 10.180.1.162, the ML job reports that as an anomaly as well.
Is there a way to filter the above scenario out of the results? Is it possible to use custom rules to do that filtering? If yes, may I have some sample code showing how to do so?
What if, instead of analyzing the entire IP address, you used a script_field in the ML datafeed to create just a subsection of the IP address (i.e. the first octet)?
So, instead of passing 109.180.1.161 to ML, you'd only be passing 109. That way, detecting a rare first octet per user should be more effective, because addresses that differ only in the later octets no longer look like new values.
(Obviously, the above needs to be adapted to be incorporated into an ML datafeed query.)
Also, note that to get the above to work, you might have to set script.painless.regex.enabled: true in elasticsearch.yml to allow regex matching.
If this idea works effectively, consider extracting the subsection at ingest time to avoid the overhead of calculating the script_field at query time.
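As a rough illustration of the ingest-time idea, here is a minimal Python sketch that reduces an IPv4 address to its first octet before the document is indexed. The function name first_octet is hypothetical, and this is plain client-side pre-processing, not an Elasticsearch ingest pipeline or a datafeed script_field.

```python
import ipaddress


def first_octet(ip: str) -> str:
    """Return the first octet of an IPv4 address as a string.

    Hypothetical helper for pre-processing before indexing.
    Raises ValueError for malformed input or non-IPv4 addresses.
    """
    addr = ipaddress.ip_address(ip)  # validates the address
    if addr.version != 4:
        raise ValueError(f"not an IPv4 address: {ip}")
    # packed is the 4-byte big-endian form; byte 0 is the first octet
    return str(addr.packed[0])


if __name__ == "__main__":
    # Both addresses from the question collapse to the same value.
    print(first_octet("10.180.1.161"))
    print(first_octet("10.180.1.162"))
```

The resulting value could then be stored in its own field and used as the field the ML job analyzes in place of the full address.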
Thanks richcollier, after a few days of research, I finally see your point now. I will pre-process the IP address at ingest time to keep just the first octet for IPv4.
And for IPv6, can I just read in the first 3 blocks (the Global Unicast Address)? I did try to understand the IPv6 structure from here, but I think I still need more research to fully understand it.
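For what it's worth, keeping the first 3 blocks of an IPv6 address means keeping its first 48 bits, which in the common global unicast layout is the routing prefix typically assigned to a site. That prefix length is an assumption here and may not match every network. A minimal Python sketch, with the hypothetical function name ipv6_prefix:

```python
import ipaddress


def ipv6_prefix(ip: str, prefix_len: int = 48) -> str:
    """Return the network prefix of an IPv6 address.

    prefix_len=48 keeps the first three 16-bit blocks; this value is an
    assumption and should be adjusted to how your networks are allocated.
    """
    addr = ipaddress.ip_address(ip)  # validates the address
    if addr.version != 6:
        raise ValueError(f"not an IPv6 address: {ip}")
    # strict=False masks off the host bits instead of raising an error
    net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
    return str(net)


if __name__ == "__main__":
    # Two hosts in the same /48 collapse to the same prefix.
    print(ipv6_prefix("2001:db8:abcd:12::1"))
    print(ipv6_prefix("2001:db8:abcd:ffff::2"))
```

Whether /48 is the right granularity would need to be checked against the actual addresses your users log in from.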