I'm testing X-Pack machine learning, and I'm trying to develop a test case to evaluate its effectiveness and usefulness for my application.
The idea is that I'm tracking posts on a site and I'm watching a few metrics: age, number of comments, score (up+down votes). I want to identify posts that are going to "go viral" early on.
The idea seems very similar to the general case of looking for anomalies - most posts have a growth pattern within a certain range, I'm looking for things that are "anomalous" and exceed that range.
Are there examples like this or do you have any suggestions how I would approach it?
yehosef - I think your best bet in this case is to use population analysis. Here the population would be all posts, so you'd set over_field_name to the field that identifies each post, such as post_id. Then, as the function, you could count the number of documents (assuming that each document represents a re-post). An example detector configuration (requires an Advanced Job) would be something like:
count over post_id
In this way, ML will look for post_ids that have a higher number of documents in the index, per unit time (bucket_span) than the "typical" post.
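For illustration only, here is a minimal sketch of how the "count over post_id" detector might sit inside a full job body, built as a Python dict. The helper name, the time field name "timestamp", and the 5m bucket_span are my assumptions, not something taken from your data.

```python
import json


def build_population_job(over_field="post_id", time_field="timestamp", bucket_span="5m"):
    """Return a job body with a single 'count over <over_field>' detector.

    All default values are assumptions made for this sketch.
    """
    return {
        "description": "Posts with unusually many documents per bucket",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                # Population analysis: each post_id is compared to the others.
                {"function": "count", "over_field_name": over_field}
            ],
            "influencers": [over_field],
        },
        "data_description": {"time_field": time_field},
    }


job = build_population_job()
print(json.dumps(job, indent=2))
```

The resulting JSON is what you would paste into the Advanced Job editor or send to the job creation API, after adjusting the field names to match your index.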
If your data instead has a field that captures the number of posts per unit time (e.g. num_posts), then you would have to use something more like:
sum(num_posts) over post_id
But, without knowing how your data looks or its characteristics, the above is only speculative.
I should have explained better how I'm modeling the data. There is an index for the posts themselves with the post data: title, text, author, etc. It also has the score, number of comments, etc., but I don't think you'd be able to do much with that data alone, since it keeps changing. I'm not sure, though.
I'm going to have another index called "history" that would have the post id and the changes from the last measurement point, e.g. score changed +1, number of comments changed +4, etc. I could include the post age in the history index, and I could include the raw scores there as well if that helped.
The measurement points are relatively frequent (up to 2 per minute), but an entry is only written when something has changed. There should be no history entry without a change. Occasionally the change is to the title or similar, so there might be no score change, but that's rare. So each change will probably be only +1 or sometimes +2. I could also use the raw values of score and num_comments compared to age to get the rate of change.
The way I'm doing it, the number of documents in the change set over time will roughly correspond to the growth, but I think it would be better to use a metric connected to the values/rates.
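To make the history idea concrete, here is a minimal sketch of turning two consecutive snapshots of a post into one history document. The function name, the snapshot layout, and the field names "age_minutes" and "comments_per_minute" are hypothetical. For simplicity it skips entries where neither score nor comment count changed, so title-only changes are not recorded here.

```python
from datetime import datetime


def history_entry(post_id, created_at, prev, curr, measured_at):
    """Build one history document from two snapshots, or None if no metric changed."""
    delta_votes = curr["score"] - prev["score"]
    delta_comments = curr["num_comments"] - prev["num_comments"]
    if delta_votes == 0 and delta_comments == 0:
        return None  # nothing to record (title-only changes are ignored in this sketch)
    age_minutes = (measured_at - created_at).total_seconds() / 60.0
    # Rate based on the raw value compared to age, as suggested above.
    rate = curr["num_comments"] / age_minutes if age_minutes > 0 else 0.0
    return {
        "post_id": post_id,
        "timestamp": measured_at.isoformat(),
        "age_minutes": age_minutes,
        "num_new_votes": delta_votes,
        "num_new_comments": delta_comments,
        "comments_per_minute": rate,
    }


# Hypothetical example values, not real data.
created = datetime(2017, 6, 1, 0, 0, 0)
measured = datetime(2017, 6, 1, 0, 10, 0)
entry = history_entry(
    "abc123", created,
    {"score": 5, "num_comments": 2},
    {"score": 6, "num_comments": 6},
    measured,
)
print(entry)
```

Storing both the deltas and the rate would let you try either as the ML input without re-indexing.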
Seems like the history index is the one to use for ML. I think things would work best if you periodically captured the number of updates/comments. So for example, in a 5 minute window, a particular post_id may have entries that look like this (very simplified):
If the data were bucketed into 5 min bucket_spans, then you could see that the bucket of 00:00:00 to 00:04:59 would have sum(num_new_comments)=200 and sum(num_new_votes)=145. (The data for 00:05:00 will go into the next bucket).
ML could determine if the sum(num_new_comments) per 5 mins for post_id:58dj233 is unusual compared to the typical post_id.
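As a rough sketch of the bucketing described above, the following sums a field per post_id per 5 minute bucket. The entries are made-up values used only to show the mechanics (they are not the example from this thread), and timestamps are plain seconds to keep things short.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # 5 minute bucket_span


def bucket_sums(entries, field):
    """Sum `field` per (post_id, bucket start) for the given history entries."""
    sums = defaultdict(int)
    for e in entries:
        # An entry at exactly t=300 belongs to the next bucket, not the first.
        bucket_start = e["t"] - e["t"] % BUCKET_SECONDS
        sums[(e["post_id"], bucket_start)] += e[field]
    return dict(sums)


# Hypothetical history entries; "t" is seconds since some reference time.
entries = [
    {"post_id": "a", "t": 0, "num_new_comments": 3},
    {"post_id": "a", "t": 120, "num_new_comments": 2},
    {"post_id": "a", "t": 299, "num_new_comments": 1},
    {"post_id": "a", "t": 300, "num_new_comments": 4},
    {"post_id": "b", "t": 60, "num_new_comments": 1},
]
result = bucket_sums(entries, "num_new_comments")
print(result)
```

ML does this summing itself when given the raw history documents. The sketch only shows what the per-bucket values it compares across post_ids would look like.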
Would I be able to get this data via a date_histogram aggregation, or would I need to store it that way? I'm storing it more granularly for other reasons. A date_histogram aggregation on that data would group it into something that looks like what you have. Would that be OK, or would I need an index that holds the data in that form directly?
Also, is it possible to use a scripted metric as an input to ML, e.g. number_comments/age_in_minutes?
Never mind - I misread your post - I can use the aggregations, which is great. (I got thrown off by seeing both sum(num_new_comments)=200 and num_new_comments:200 at the 5 minute mark.)
Does the machine learning look at each metric independently, or does it look at them collectively? E.g., it may be common for a post to get 100 new votes or 20 new comments, but getting 100 new votes AND 20 new comments is special.
The sum() function in ML already acts like an aggregation. That said, you can use an Elasticsearch aggregation instead by modifying the underlying query the Datafeed uses; you'll then have to set summary_count_field_name to doc_count from the output of the aggregation (see: https://www.elastic.co/guide/en/x-pack/5.4/ml-gs-jobs.html). This is a little tricky to configure, since it requires a fair amount of hand-editing of the job.
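A hedged sketch of what the hand-edited pieces might look like, again as Python dicts. The index name "history", the time field "timestamp", the aggregation names, and the helper names are assumptions. Some keys (for example the name of the index list) and the set of supported aggregation types differ between releases, so please check this against the documentation for your version rather than treating it as exact.

```python
def build_aggregated_datafeed(job_id, index="history", time_field="timestamp"):
    """Datafeed body that pre-aggregates history docs into 5 minute buckets.

    Assumed layout: date_histogram -> max(time) + terms(post_id) -> sum.
    """
    return {
        "job_id": job_id,
        "indices": [index],  # key may be spelled differently in older releases
        "query": {"match_all": {}},
        "aggregations": {
            "buckets": {
                "date_histogram": {"field": time_field, "interval": "5m"},
                "aggregations": {
                    time_field: {"max": {"field": time_field}},
                    "post_id": {
                        "terms": {"field": "post_id"},
                        "aggregations": {
                            "num_new_comments": {"sum": {"field": "num_new_comments"}}
                        },
                    },
                },
            }
        },
    }


def analysis_config_for_aggregated_input():
    """Job-side config: tell ML that each input row summarises doc_count documents."""
    return {
        "bucket_span": "5m",
        "summary_count_field_name": "doc_count",
        "detectors": [
            {"function": "sum", "field_name": "num_new_comments", "over_field_name": "post_id"}
        ],
    }


datafeed = build_aggregated_datafeed("viral-posts")
config = analysis_config_for_aggregated_input()
```

The important link between the two pieces is summary_count_field_name: without it, ML would treat each aggregated row as a single document.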
Right now, ML detects anomalies on each field independently, but the aggregated job-level anomaly score reflects the unusualness across all detectors in the job.
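To illustrate the multi-detector point, here is a minimal sketch of a detector list with one sum detector per metric, all over post_id, so that both feed into the same job's overall score. The helper name is hypothetical, and the field names follow the ones used earlier in this thread.

```python
def build_detectors(metrics=("num_new_comments", "num_new_votes"), over_field="post_id"):
    """Return one 'sum(<metric>) over <over_field>' detector per metric."""
    return [
        {"function": "sum", "field_name": metric, "over_field_name": over_field}
        for metric in metrics
    ]


detectors = build_detectors()
for d in detectors:
    print("{}({}) over {}".format(d["function"], d["field_name"], d["over_field_name"]))
```

Each detector is still modelled on its own. Putting them in one job only means a bucket where both are unusual contributes to a single combined score.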