Question on how to create a simple ML job


I am new to machine learning in the Elastic products; it was only released recently, and we are on version 5.4.

I want to create a machine learning job, but I am not sure on how to do it.

Sample data (obfuscated):

Let's say we have data coming in (Filebeat --> Logstash --> Elasticsearch). Field number one is person_id (it identifies a unique person). Every person either eats an apple or an orange (field "food"; apple means good and orange means bad). I need to see if there are irregularities between the two (either look at both, or just count "orange", though eating too many apples may also turn out to be important/bad to know about).

I want to create a machine learning job that would find anomalies in the top 10 most frequent/common persons eating apples or oranges.

(The top 10 is, for example: Jimmy ate an apple 10 times (he has 10 data points), Alex has 9 data points, and the others have 3-5 points, so Jimmy and Alex are now the top 2.)

The issue: under "fields" I can only see count of "events" and "offset", not any specific fields. I can select them in "Key fields", but that doesn't seem to do anything (or can I do that after creating the job?).

I can choose to split the data, which separates the persons individually. That is good, since I want to track their activity individually and not as a group, but there are more than ten of them (probably many more, since I can only see ten in the preview).

Do I create 10 different jobs, each with a specific "person_id", or can it be done in a single job? Can I separate the persons and have machine learning look at them having either field "food = apple" or field "food = orange"?

An ideal way would be to have an event count for the 10 persons and set the "food" field as the influencer. One main issue is: how do I split the data while only keeping the top 10 persons?

Right now I am going through the advanced editor, since the basic one seems not to be applicable to this scenario... though perhaps I am wrong.

Related a bit: Related topic I found

Thank you!

May I suggest first watching these getting started videos:

8 Min Tutorial #1 - How to create a single metric job:

8 Min Tutorial #2 - How to create a multi-metric job:

8 Min Tutorial #3 - Detect outliers in a population:

And also working through the sample tutorial would be beneficial:

This answers most of my questions! One is left though: how can I "split" the data? Can I define, for example: look for anomalous activity for person_id "john" in his field "food" (food = apple, orange, ...)?

End result:

A machine learning job that looks for anomalous activity in food consumption for 10 specific "top 10" people:

John - apple, orange, apple, orange ...
Jimmy - orange, orange, orange, apple ...
Bob - apple, apple, apple, apple ...

Now I know how to get it to look at the food consumption, but I can only get it to look at everyone as one whole group and not at specific people.

If you want to limit the data to specific people, use a terms query in the datafeed:

  "query": {
    "terms": {
      "person_id": ["John", "Jimmy", "Bob"]
    }
  }

Create an advanced job, then enter the query in the datafeed section.

Perfect! That was what I was looking for!

The last thing I am not sure of is what analysis type to use. Do I just use count (high count) to look for anomalously high occurrences of, let's say, "apple" in field "food"? (There can be three values: "apple", "orange", "pear".) But I want to look for anomalies in all three occurrences, which would seem to require creating three different jobs for a single person (with query filtering on apple, orange, and pear):

"person_id": ["John", "Jimmy", "Bob"...]
"food": ["apple", "orange", "pear"]

Job 1: high_count for "apple" in person_id: "John"
Job 2: high_count for "orange" in person_id: "John"
Job 3: high_count for "pear" in person_id: "John"

Is there any other way of doing this? I realize that if I want to monitor specific people I will have to create separate jobs for them, but do I also have to create separate jobs for every variable I want to monitor?

Perhaps distinct_count, high_distinct_count, or varp? The goal: find anomalies in the number of occurrences of apple, orange, and pear.

Monday 1PM "John" "Apple"
Monday 1PM "John" "Pear"
Monday 2PM "John" "Apple"
Monday 2PM "John" "Pear"
Monday 3PM "John" "Orange"
Monday 3PM "John" "Pear"
Monday 4PM "John" "Orange"
Monday 4PM "John" "Apple"

I would consider this anomalous - a higher-than-usual occurrence of "Orange".

If you want to "split" the data/analysis on a field, set the partition_field_name to person_id (or, in a multi-metric job, make that the Split field). This way you don't need separate jobs per person_id - it can all be done in one job.
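As a minimal sketch of that single-split setup (the bucket span and exact JSON layout here are assumptions for illustration, not from the thread, and the syntax may differ slightly between versions), the analysis_config of the job could look like:

```json
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "function": "high_count",
        "partition_field_name": "person_id"
      }
    ]
  }
}
```

With this, each distinct person_id gets its own baseline, all within one job.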

What Dave suggests with a query filter is only relevant if you want to analyze ONLY "John", "Jimmy", and "Bob" and exclude everyone else in the data. If you don't apply that filter, you'll simply have separate baselines/analyses for every person_id.

In terms of the function to use: yes, you can certainly use high_count to detect more-than-usual occurrences of documents for a particular person_id. However, the count functions DO NOT COUNT FIELDS - they count documents.

So, if in your example you want to count the occurrence of each food per person_id, you need to do a double split, which is only available via the Advanced Job Wizard.

To accomplish a double-split, set partition_field_name=person_id and by_field_name=food
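A hedged sketch of that double-split detector as it might appear in the Advanced Job Wizard's JSON (the split field names follow the advice above; the bucket span and time field name are assumptions):

```json
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "function": "high_count",
        "by_field_name": "food",
        "partition_field_name": "person_id"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```

The partition field gives each person an independent analysis, and the by field additionally models each food value within each person - so "more oranges than usual for John" is detectable without per-food jobs.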

Wow! Great!

Yes! I do want to do specific analysis, since I have many, many persons, but I only care about the most active ones: if something "wrong" happens to them, I have to look at them quickly and solve it, hence the "top 10" thing. People who only come up rarely are not a high priority :slight_smile:

Okay! That looks like it makes sense! Now I will set up that detector. Who should I set as the influencer - person_id, food, or perhaps both? :sweat_smile: Though I believe it won't let me set "food" as the influencer :slight_smile:

What has happened:

Goal: analyse 10 specific people, see their eating habits (apple, orange, pear), and find anomalies. If an anomaly is found, I have to know which person is having it.

Execution: I will limit the input with a query to those specific ten people (or make separate jobs for each person, to know which person is having issues?) and then follow the example given above.

You can put both person_id and food as influencers. As a general rule, any field that you're splitting on is a good candidate to be an influencer.
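In the job JSON this corresponds to the influencers array inside the analysis_config (shown here as a fragment; the rest of the config is omitted):

```json
{
  "analysis_config": {
    "influencers": ["person_id", "food"]
  }
}
```

Influencers don't change the modelling itself - they let the results tell you which person and which food contributed most to each anomaly.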


While trying to save my new configuration I get:

Save failed: [illegal_argument_exception] Can't merge a non object mapping [food] with an object mapping [food]


When using this solution, I get:

[no_shard_available_action_exception] No shard available for [get [][data_counts][]: routing [null]]

I am upgrading my version to above 5.4 and will report back if the issue is solved.

Using a dedicated results index for the ML job avoids this problem, but I also heavily suggest that you get off of v5.4, as that was the "beta" version of ML. Can you upgrade to at least v6.1?
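If the job is created through the API rather than the UI, a dedicated results index can be requested via results_index_name. A sketch under assumptions (the job id "food_habits", the index name, and the bucket span are made up for illustration; the endpoint shown is the 5.x/6.x X-Pack path):

```json
PUT _xpack/ml/anomaly_detectors/food_habits
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "function": "high_count",
        "by_field_name": "food",
        "partition_field_name": "person_id"
      }
    ],
    "influencers": ["person_id", "food"]
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "results_index_name": "food-habits"
}
```

If I recall correctly, the results then go to a dedicated .ml-anomalies-custom-* index for this job instead of the shared results index; in the job wizard, the "dedicated index" checkbox achieves the same thing.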

Sadly only to 5.6 for now; the update is still in progress. Using a dedicated index caused an issue earlier, though I believe it was 5.4-related OR because we ran out of available memory. I will report back once the upgrade is finished.

The update to 5.6 has fixed all of our issues - no more bugs or UI weirdness in general. Data analysis seems to be working perfectly!