Looking for help setting up a machine learning test to find "virality"

yehosef · August 23, 2017, 11:10am

I'm testing x-pack and the machine learning, and I'm trying to develop a test case to test its effectiveness and usefulness for my application.

The idea is that I'm tracking posts on a site and I'm watching a few metrics: age, number of comments, score (up+down votes). I want to identify posts that are going to "go viral" early on.

The idea seems very similar to the general case of looking for anomalies - most posts have a growth pattern within a certain range, I'm looking for things that are "anomalous" and exceed that range.

Are there examples like this or do you have any suggestions how I would approach it?

richcollier · August 23, 2017, 11:45am

yehosef - I think your best bet in this case is to use population analysis. Here, the population would be all "posts", so you'd set over_field_name to the field that identifies each post, such as a post_id. Then, as a function, you could count the number of documents (assuming the existence of each document represents a re-post). An example configuration (requires use of Advanced Job) would be something like:

count over post_id

In this way, ML will look for post_ids that have a higher number of documents in the index, per unit time (bucket_span) than the "typical" post.

If your data instead has a field that captures the number of posts per unit time (i.e. num_posts) then you would have to use something more like:

sum(num_posts) over post_id

But, without knowing how your data looks or its characteristics, the above is only speculative.
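To make the two detectors above a bit more concrete, here is a minimal sketch of how they might be written out as job configuration, built as Python dicts so they can be dumped to JSON. The keys `function`, `field_name`, `over_field_name`, `bucket_span` and `time_field` follow the ML job API as I understand it. The `5m` bucket span, the `@timestamp` time field and the helper name `population_job` are assumptions for illustration only.

```python
import json


def population_job(detector, bucket_span="5m", time_field="@timestamp"):
    """Build a hypothetical ML job body holding a single population detector."""
    return {
        "analysis_config": {
            "bucket_span": bucket_span,  # assumed value; tune to your data
            "detectors": [detector],
        },
        "data_description": {"time_field": time_field},  # assumed field name
    }


# "count over post_id": each document represents one re-post.
count_job = population_job({"function": "count", "over_field_name": "post_id"})

# "sum(num_posts) over post_id": documents already carry a per-interval total.
sum_job = population_job(
    {"function": "sum", "field_name": "num_posts", "over_field_name": "post_id"}
)

print(json.dumps(count_job, indent=2))
print(json.dumps(sum_job, indent=2))
```

Either body would still need a datafeed pointing at the right index before it does anything useful.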

yehosef · August 23, 2017, 11:58am

Hi Rich,

Thanks for the info and suggestions.

I should have explained better how I'm modeling the data. There is an index for the posts themselves with the post data: title, text, author, etc. It also has the score, num comments, etc., but I don't think you'd be able to do much with that data alone since it keeps changing. I'm not sure, though.

I'm going to have another index called "history" that would have the post id, and the changes from the last measurement point - eg score changed +1, number of comments changed +4, etc. I could include the post age in the history table and I could include the raw scores there also if that helped.

The measurement points are relatively frequent (up to 2/minute), though an entry is only written when there is a change. There should be no history entry without a change. Sometimes the change is to the title, etc., so there might not be a score change, but that's rare. So each change will probably be only +1 or sometimes +2. I could also use the raw values of score and num_comments compared to age to get the rate of change.

The way I'm doing it, the number of documents in the change set over time will roughly correspond to the growth, but I think it would be better to use a metric connected to the values/rates.

Does that help? Thanks for your time.

richcollier · August 23, 2017, 12:13pm

Seems like the history index is the one to use for ML. I think things would work best if you periodically captured the number of updates/comments. So for example, in a 5 minute window, a particular post_id may have entries that look like this (very simplified):

...
{@timestamp: 00:00:00, post_id: 58dj233, num_new_comments:5, num_new_votes:5}
{@timestamp: 00:01:00, post_id: 58dj233, num_new_comments:10, num_new_votes:5}
{@timestamp: 00:02:00, post_id: 58dj233, num_new_comments:10, num_new_votes:10}
{@timestamp: 00:03:00, post_id: 58dj233, num_new_comments:75, num_new_votes:50}
{@timestamp: 00:04:00, post_id: 58dj233, num_new_comments:100, num_new_votes:75}
{@timestamp: 00:05:00, post_id: 58dj233, num_new_comments:200, num_new_votes:100}
...

If the data were bucketed into 5 min bucket_spans, then you could see that the bucket of 00:00:00 to 00:04:59 would have sum(num_new_comments)=200 and sum(num_new_votes)=145. (The data for 00:05:00 will go into the next bucket).

ML could determine if the sum(num_new_comments) per 5 mins for post_id:58dj233 is unusual compared to the typical post_id.
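The bucket arithmetic above can be checked with a few lines of Python. This is only a sketch of the summing step over the simplified sample entries, not of how ML buckets data internally. The helper names `to_seconds` and `bucket_sums` are made up for the example.

```python
from collections import defaultdict

BUCKET_SPAN_SECONDS = 5 * 60  # 5 minute bucket_span

# The simplified sample entries: (time, post_id, num_new_comments, num_new_votes).
entries = [
    ("00:00:00", "58dj233", 5, 5),
    ("00:01:00", "58dj233", 10, 5),
    ("00:02:00", "58dj233", 10, 10),
    ("00:03:00", "58dj233", 75, 50),
    ("00:04:00", "58dj233", 100, 75),
    ("00:05:00", "58dj233", 200, 100),
]


def to_seconds(hhmmss):
    """Convert an HH:MM:SS string into seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def bucket_sums(rows, span=BUCKET_SPAN_SECONDS):
    """Sum both metrics per (post_id, bucket start in seconds)."""
    sums = defaultdict(lambda: {"num_new_comments": 0, "num_new_votes": 0})
    for timestamp, post_id, comments, votes in rows:
        bucket_start = to_seconds(timestamp) // span * span
        sums[(post_id, bucket_start)]["num_new_comments"] += comments
        sums[(post_id, bucket_start)]["num_new_votes"] += votes
    return dict(sums)


sums = bucket_sums(entries)
print(sums[("58dj233", 0)])    # {'num_new_comments': 200, 'num_new_votes': 145}
print(sums[("58dj233", 300)])  # {'num_new_comments': 200, 'num_new_votes': 100}
```

The 00:05:00 entry lands in the second bucket (start 300 seconds), which is why the first bucket's comment sum of 200 only coincidentally matches that entry's own value.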

yehosef · August 23, 2017, 12:24pm

Thanks!

Would I be able to get this data via a date_histogram aggregation, or would I need to store it that way? Meaning, I'm storing it more granularly for other reasons. Doing a date/time aggregation on that data would group it into data that looks like what you have. But would that be OK, or would I need an index that holds that data directly?

Also, is it possible to have a scripted metric as an input to the ML? eg, number_comments/age_in_minutes.

yehosef · August 23, 2017, 12:27pm

NM - I misread your post - I can use the aggregations, which is great. (I got thrown off because sum(num_new_comments)=200 for the bucket happens to match the num_new_comments:200 entry at the 5 minute mark.)

yehosef · August 23, 2017, 12:28pm

Does the machine learning look at each metric independently or at them collectively? E.g., it may be common for something to get 100 new votes or 20 new comments, but to get 100 new votes AND 20 new comments is special.

richcollier · August 23, 2017, 12:50pm

The sum() function in ML already acts like an aggregation. That said, you can use an Elasticsearch aggregation instead by modifying the underlying query the Datafeed uses. You'll then have to set summary_count_field_name to doc_count from the output of the aggregation (see: https://www.elastic.co/guide/en/x-pack/5.4/ml-gs-jobs.html). This is a little tricky to configure since there is a fair amount of hand-editing of the job.
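As a very rough sketch of what that hand-editing might look like, the Python below builds the aggregation part of a datafeed and a matching analysis_config. The nesting (a date_histogram, a max on the time field, then a terms aggregation on post_id) is my reading of the aggregated-datafeed docs and may differ by version, so check it against the docs for your release. The terms `size` of 1000, the `5m` interval and the helper name `datafeed_aggregations` are assumptions, and newer Elasticsearch versions use `fixed_interval` rather than `interval`.

```python
import json


def datafeed_aggregations(interval="5m", time_field="@timestamp"):
    """Hypothetical 'aggregations' section for a datafeed (shape is an assumption)."""
    return {
        "buckets": {
            "date_histogram": {"field": time_field, "interval": interval},
            "aggregations": {
                # Latest timestamp seen in each histogram bucket.
                time_field: {"max": {"field": time_field}},
                # One sub-bucket per post so the population split survives.
                "post_id": {
                    "terms": {"field": "post_id", "size": 1000},  # assumed size
                    "aggregations": {
                        "num_new_comments": {"sum": {"field": "num_new_comments"}}
                    },
                },
            },
        }
    }


# The job side: tell ML that each input row summarises doc_count documents.
analysis_config = {
    "bucket_span": "5m",  # assumed to match the histogram interval
    "summary_count_field_name": "doc_count",
    "detectors": [
        {
            "function": "sum",
            "field_name": "num_new_comments",
            "over_field_name": "post_id",
        }
    ],
}

print(json.dumps(datafeed_aggregations(), indent=2))
```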

Yes, you can also use scripted fields as part of the ML detector.
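For the number_comments/age_in_minutes example from the question, a hedged sketch follows. The Python function shows the intended calculation with a guard for a post of age zero. The dict shows roughly how a matching `script_fields` entry on the datafeed might look. The field names come from the question, the name `comments_per_minute` is made up, the Painless expression is untested, and the `inline` key is the 5.x spelling as far as I know (later versions use `source`).

```python
def comments_per_minute(number_comments, age_in_minutes):
    """Rate of comments per minute, or None when the post has no age yet."""
    if age_in_minutes <= 0:
        return None
    return number_comments / age_in_minutes


# Hypothetical datafeed fragment; the script mirrors the function above
# but does not include the zero-age guard.
script_fields = {
    "comments_per_minute": {
        "script": {
            "lang": "painless",
            "inline": "doc['number_comments'].value / (double) doc['age_in_minutes'].value",
        }
    }
}

print(comments_per_minute(20, 4))  # 5.0
print(comments_per_minute(20, 0))  # None
```

A detector could then reference the scripted field by name, for example a mean or max of comments_per_minute over post_id.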

Right now, the ML will detect anomalies on each field independently, but the aggregated job anomaly score will reflect the unusualness taking into account multiple detectors in the job.

yehosef · August 23, 2017, 12:53pm

great - thanks for the info!

system · September 20, 2017, 12:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
