[Machine Learning] No document detection

fbonningues · June 26, 2019, 8:25am

Hi,

I have activated the trial licence to run some tests with the Machine Learning feature.

I have a continuous data feed with defined templates, each document having a field with a specific value (in this case, a shop_id).
Data typically arrives throughout the day, between 7AM and 9PM.
The data currently spans more than a month, for almost 250M documents.

I want to detect feed failures for a specific time series (with shop_id as query_key), i.e. to detect 'no document received'.

I created a job with the 'Population' wizard and ran it on data in which I knew some documents were missing, but the gap was not flagged as an anomaly.

I didn't find any reference to that kind of detection in the documentation. Can you help me solve this problem?

Thx a lot

François

richcollier · June 26, 2019, 12:31pm

Yes, this type of detection is possible and would employ the count function (or more likely, the low_count function).

So, do you want to detect low document counts for EVERY shop_id (in other words, splitting the analysis for every instance of that field) or are you asking how to filter the data so that only a certain value of shop_id gets considered?

fbonningues · June 26, 2019, 12:46pm

Thx for the answer.

In the current job, I use this detector: low_count over shop_id (so the function you mentioned).

The time series are independent per shop_id (independent data feeds) and I want to know if one of the shops stops sending documents. The goal is to detect connection issues, processing issues, etc. for each shop.

In my current data, one of the shops didn't send documents during a time range that is usually populated, so I expected to see an anomaly, but I did not get one.

richcollier · June 26, 2019, 12:55pm

I'd suggest not using a population job and instead using a multi-metric job split on shop_id (or using partition_field_name in the API).

In a population analysis, you are comparing shop_ids against each other (which may be OK), but I would think that it would be better to compare a particular shop_id against itself over time (which a population analysis will not do).
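
To make the difference concrete, here is a minimal sketch of the two detector configurations, written as Python dicts that mirror the JSON of an anomaly detection job. Only the shop_id field and the low_count function come from this thread; the variable names are illustrative.

```python
# Population analysis: each shop_id is compared against the other shops
# within the same time bucket.
population_detector = {
    "function": "low_count",
    "over_field_name": "shop_id",
}

# Temporal analysis split per shop: each shop_id gets its own baseline
# and is compared against its own history.
partitioned_detector = {
    "function": "low_count",
    "partition_field_name": "shop_id",
}
```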

Also, if the condition is a fixed rule (i.e. any occurrence of zero documents), you pretty much don't need ML: you can accomplish this with a standard Watch.

fbonningues · June 26, 2019, 1:03pm

"Compare a particular shop_id against itself over time": that is exactly what I need, because each shop has its own sales pattern (i.e. pattern of documents sent).

A standard watch wouldn't be sufficient because of the normal downtime in document reception (during the night or at the weekend).
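
A toy illustration of that point, with made-up numbers: a fixed "zero documents" rule fires every night, whereas a rule that takes the usual activity for that hour into account only fires during business hours. This is only a hand-rolled stand-in for the idea of a per-shop baseline, not how Elastic ML actually models the data; the function names and counts are hypothetical.

```python
from statistics import mean

def static_rule(count):
    """Fixed rule: alert whenever a bucket holds zero documents."""
    return count == 0

def baseline_rule(count, history_same_hour, min_expected=1.0):
    """Alert only if the bucket is empty AND this hour is normally populated."""
    return count == 0 and mean(history_same_hour) >= min_expected

# Hypothetical hourly document counts for one shop on previous days.
history = {
    3: [0, 0, 0, 0],           # 3AM: shop closed, no documents expected
    14: [120, 135, 110, 128],  # 2PM: normal business hours
}

# Today the shop sent nothing at 3AM (normal) and nothing at 2PM (outage).
for hour in (3, 14):
    print(hour, static_rule(0), baseline_rule(0, history[hour]))
```

The static rule alerts for both hours, while the baseline-aware rule alerts only for the 2PM gap.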

I'll try that kind of job, and I'll keep you posted !

richcollier · June 26, 2019, 1:17pm

Cool

By the way - here's a blog that compares the different approaches of Population analysis versus Temporal analysis: https://www.elastic.co/blog/temporal-vs-population-analysis-in-elastic-machine-learning

fbonningues · June 26, 2019, 2:30pm

Thx @richcollier !
Using a multi-metric job with the detector low_count partition_field_name=shop_id worked fine!
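
For reference, a minimal sketch of what such a job body could look like when created through the API rather than the wizard. The detector matches the one above; the helper name build_low_count_job, the 15m bucket span and the @timestamp time field are assumptions to adapt to your own data, and a datafeed pointing at your indices is still needed on top of this.

```python
import json

def build_low_count_job(split_field, bucket_span="15m", time_field="@timestamp"):
    """Build the body of an anomaly detection job that flags unusually low
    document counts separately for each value of split_field."""
    return {
        "description": f"Low document count per {split_field}",
        "analysis_config": {
            "bucket_span": bucket_span,  # assumed value, tune to the feed
            "detectors": [
                {"function": "low_count", "partition_field_name": split_field}
            ],
            "influencers": [split_field],
        },
        "data_description": {"time_field": time_field},  # assumed field name
    }

# Body to send with the create anomaly detection job API.
job_body = build_low_count_job("shop_id")
print(json.dumps(job_body, indent=2))
```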
