I configured a lat_long job in order to find anomalies in login geolocation.
Scenario:
In one day, I had a user log in from IP1 (which geolocates to location1, Iowa) 4-5 times within a 5-minute interval. After that, the same user logged in from IP2 (which geolocates to location2, Texas). As per my understanding, the attempt from Texas should be detected as anomalous by the lat_long job. But it does NOT detect that as anomalous.
Does the model need data over a period of time in order to learn the geolocation for a user? Is it failing to detect anomalies because the learning data covers less than a day?
The official X-Pack documentation says the lat_long function "detects anomalies where the geographic location of a credit card transaction is unusual for a particular customer’s credit card". If anyone can briefly explain what "unusual" means here with an example scenario, it would be great for me to put the job together.
FYI, the job gives me results; it's just that I am testing the job by simulating the above scenario, which is not being caught.
The lat_long function approximately defines unusual values as those that occur in regions where the density of points is relatively low.
I imagine that in your scenario you were using a detector that was `lat_long(ip_location) partition=user`, or maybe `lat_long(ip_location) by=user`.
In both those cases ML would create a different model for each user. If the user of interest was seen for the first time on the day the anomaly you want to detect occurred, then most probably the problem is that the model has not seen enough data to be confident.
If your use case involves a fixed set of regularly appearing users of small cardinality, you should expect to get good results with what you're doing, provided you backfill the job with enough data.
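For reference, a per-user job along those lines might look something like the sketch below. The field names (`ip_location`, `user`, `@timestamp`) are placeholders based on your description, so adjust them to your actual mapping. Also note that lat_long expects the location as a "latitude,longitude" string, so if you store a geo_point you may need a script field in the datafeed to convert it:

```
PUT _xpack/ml/anomaly_detectors/login-geo-per-user
{
  "description": "Sketch: unusual login location per user (placeholder field names)",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "lat_long(ip_location) partitioned by user",
        "function": "lat_long",
        "field_name": "ip_location",
        "partition_field_name": "user"
      }
    ],
    "influencers": [ "user" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```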
However, if your use case involves a large number of users that appear sparsely, then having a separate model per user is not going to be effective. Instead, we should be interested in doing a population analysis where we look for users that have unusual velocities. We can define a velocity for latitude (vlat) and longitude (vlong) as their respective time derivatives. Then we could have an ML job with 2 detectors: `max(vlat) over user` and `max(vlong) over user`, with user as an influencer to tie those together.
You can read about how to use derivatives in this blog post.
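To make that concrete, the population variant might look something like the sketch below. Here `vlat` and `vlong` are assumed to be fields you derive yourself (for example with the derivative technique from that post, applied to the average latitude and longitude per bucket), and the other field names are again placeholders:

```
PUT _xpack/ml/anomaly_detectors/login-geo-velocity
{
  "description": "Sketch: population analysis of users with unusual location velocity",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "max", "field_name": "vlat",  "over_field_name": "user" },
      { "function": "max", "field_name": "vlong", "over_field_name": "user" }
    ],
    "influencers": [ "user" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```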
I am trying to find anomalies in the login location for the users who log in to our website. Basically, I want to identify when the same user logs in from two different cities within a 24-hour timeframe. As far as the users are concerned, we have a fixed set of regularly appearing users, but a high number of them. I am filtering the data for login code (SA) and for the last 24 hours of data, and applying the lat_long function on this data.
Does this serve the purpose, or do you think I should use the population velocity approach?
What would really help us understand the use case is for you to post an example of a document so that we can see which fields are available. Also, what is the order of magnitude for the number of users: 10K, 100K, 1M, etc.?
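For example, a document along these lines (entirely made-up field names and values, just to illustrate the shape we're asking about) would be enough to reason about:

```
{
  "@timestamp": "2018-06-01T10:15:00Z",
  "user": "jdoe",
  "login_code": "SA",
  "source_ip": "192.0.2.10",
  "ip_location": "41.60,-93.61"
}
```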
Finally, performing time filtering in the datafeed query is not something I would recommend. But understanding the raw data will allow a more useful discussion.