How do I aggregate multiple events coming from different logs with slightly different timestamps, when the only common field to combine them is the timestamp?


I have this situation: how do I aggregate multiple events coming from different logs with slightly different timestamps, when the only common field could be the timestamp?

I need to combine all data points into one data set so I can use them in my ML jobs as features.

I know there is an aggregate plugin, but:

  1. I am not sure whether the timestamp can be used as the task_id.
  2. There can be a difference of milliseconds between the timestamps of the events from different logs that we want to aggregate.

Is there some other solution for this?

Do I need to use my own service to aggregate events and then push them back to Elasticsearch?


Not sure if you're using the word "aggregate" (which has a specific meaning with respect to Elasticsearch aggregations) or if you really mean something more like "collate" or "bring together".

If it is the latter, then I suppose that you realize that ingesting events/documents into Elasticsearch causes them to be in an index. There can be many different indices within your instance of Elasticsearch and they can be arbitrarily named.

However, if you have events/logs/documents coming from different sources, you can put them into indices with similar names (e.g. logs-syslog and logs-nginx, or whatever). If you want to "collate" these different events with a single query, you can address the indices with a wildcard (e.g. logs-*) and define a Kibana index pattern to also view the multiple indices as one in the Kibana UI.

An ML job can also leverage defined Kibana index patterns to match multiple distinct indices.

Thanks @richcollier for responding. I agree "aggregation" can have different meanings in different places. I meant combining events into one event.

Let me elaborate on what I want and why; then maybe it will be easier to find the best solution.

I am new to ML jobs. I've created single-metric jobs and have also explored multi-metric ML jobs a little. I know we can add influencers to multi-metric jobs.

Now I need to create a real-world job, where there can be multiple detectors and multiple influencers.

We have microservices; all of their logs and Metricbeat data come into one index.

I have extracted the fields of interest.

This is an outlier detection ML job.

So we are getting data points from different sources, and the data fields are specific to each source, but they may have an impact on each other.

Following is a sample data set. All the fields are available in one index, but each document contains only some of the data fields.

index = 2021IDX

proecessing_time, slow_query, cpu_usage_service1, cpu_usage_service2, data_field2_log1, data_field1_log2, data_field1_metricbeat1, data_field1_metricbeat2, database_fragmentation_count_metricbeat2

My question is:

Do I really need to combine those events somehow so that all the data fields of interest are available on each event/doc?

If not, does that have any impact on the ML job's behavior?


Event 1 looks like this:
2021-01-01T12:00:00.162, data_field1_metricbeat2, database_fragmentation_count_metricbeat2, cpu_usage_service1

Event 2 looks like this:
2021-01-01T12:00:00.180, proecessing_time, data_field1_metricbeat1, data_field2_log1, slow_query

Event 3 looks like this:
2021-01-01T12:00:01.178, cpu_usage_service2, data_field1_log2

Now, if I create an ML job with these fields, some as metrics/detectors and some as influencers:

How would Elasticsearch combine all these fields to consider them as one row?

I know jobs aggregate events over the given time bucket, but to keep things simple I am just talking about one instance of each type of event.

To address this, I was thinking of combining them all into one event.
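As a rough sketch of the kind of combining step I have in mind (pure Python, using the sample events above; the 500 ms merge window and the field values are assumptions, not a recommendation):

```python
from datetime import datetime

# The three sample events from above, as dicts. Field names come from the
# post; the values and the exact merge window (500 ms) are assumptions.
events = [
    {"@timestamp": "2021-01-01T12:00:00.162",
     "data_field1_metricbeat2": 1, "cpu_usage_service1": 40},
    {"@timestamp": "2021-01-01T12:00:00.180",
     "data_field2_log1": 2, "slow_query": 3},
    {"@timestamp": "2021-01-01T12:00:01.178",
     "cpu_usage_service2": 55, "data_field1_log2": 4},
]

def merge_by_window(events, window_ms=500):
    """Merge events whose timestamps fall into the same fixed window of
    window_ms milliseconds into a single combined document."""
    merged = {}
    for ev in events:
        ts = datetime.fromisoformat(ev["@timestamp"])
        epoch_ms = int(ts.timestamp() * 1000)
        bucket = epoch_ms - (epoch_ms % window_ms)  # floor to window start
        doc = merged.setdefault(bucket, {})
        doc.update((k, v) for k, v in ev.items() if k != "@timestamp")
    return merged

combined = merge_by_window(events)
# Events 1 and 2 are 18 ms apart and land in one document; event 3 is
# about a second later and stays on its own, so len(combined) == 2.
```

This sidesteps the "timestamps differ by milliseconds" problem by matching on a time window rather than on exact equality.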

Is that really required in my scenario? Or do I not need to worry about it because the ML job will take care of it?

I know for sure that Data Frame Analytics does not work for this scenario: when I create a data frame analytics job, it only shows some of the fields to include in the job.
To include the others, I need to filter the data by 'log-type'; only then can I see those fields, but in that case the fields from the other logs are not available.

Considering this, I still feel that we need to have the fields available in all indexed documents. What do you say?

That's why I was thinking of combining them.

I would really appreciate any guidance on this; I can't find these scenarios explained anywhere in the documentation or videos.


First of all, metric analysis (using functions like max, min, avg, etc.) can handle data that is sparse within a bucket_span and even sparse across bucket_spans. In other words, if you have a detector with max(proecessing_time) and a 5-minute bucket_span, but data comes every 10 minutes, it will still work.
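To illustrate the bucketing idea (a toy sketch, not Elastic's implementation; the sample timestamps and values are made up):

```python
# Toy illustration (not Elastic's implementation) of why max() tolerates
# sparse data: measurements arriving every 10 minutes are bucketed into
# 5-minute spans, and max() is simply computed per non-empty bucket.
BUCKET_SPAN_S = 300  # 5-minute bucket_span

# Hypothetical (epoch_seconds, proecessing_time) samples, one every 10 min.
samples = [(0, 1.2), (600, 0.9), (1200, 4.7), (1800, 1.1)]

buckets = {}
for ts, value in samples:
    key = ts - (ts % BUCKET_SPAN_S)  # floor timestamp to bucket start
    buckets.setdefault(key, []).append(value)

# Empty buckets (300, 900, 1500) simply produce no result; the analysis
# continues with whatever buckets do contain data.
per_bucket_max = {k: max(v) for k, v in sorted(buckets.items())}
```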

But, given your sample data, I have two major questions:

  1. Do you plan to split the data for each entity (something like …)?
  2. Why do you feel compelled to put these all in the same ML job? Why not have several jobs, one for each data "type"?

Thanks @richcollier for your response.

I have to analyze the root cause of a problem (e.g. a delay in processing), and any one or more of those data points coming from different hosts may cause that delay. So I was thinking of keeping them in one data set instead of splitting it.

Is that correct? Or am I expecting more than what outlier jobs offer?

Do I need a different type of job for root cause analysis?


Anomaly detection jobs are good for root cause analysis. You can use the Anomaly Explorer to view multiple different jobs together and look across a common timescale - see the strength of shared influencers, etc.

The above shows 4 jobs (one on logs, one on a KPI, one on database metrics, and one on network metrics).

Thanks @richcollier, that clarified my confusion. Your example was really helpful in untangling the knot I had in my mind about how these things might work together.

I will set up these jobs separately. Let's see how it goes.

Really appreciate your response.

Hi @richcollier ,

I created separate jobs with specific query filters, and now I am able to see the anomalies for the different jobs in a given time frame.

  1. I need to understand how the overall score for a given time instance is calculated.

I am getting different values for each anomaly, as shown in the picture, but the overall score shows the score of the top one. Is there some calculation being done, or not?

  2. The influencer panel is not showing anything for me.

I have one multi-metric job where I used three influencers, but I don't see any.

The rest are all single-metric jobs, and I didn't see any way to specify influencers in the UI for those, so they don't have any influencers. I guess I would need to convert them into multi-metric jobs; is that correct?

But before trying that, I want to understand why my multi-metric job is not showing any influencer data.
Is it due to insufficient data? What else could be the cause?

Even when I look only at the multi-metric job where I have defined influencers, I don't see any values.

One of the influencers I used for my multi-metric job "multi-pd-queue-sizes" is "slow-queries-n4center", which indicates an anomaly in the picture below at the same time that "multi-pd-queue-sizes" shows an orange anomaly. So why is the value of slow-queries-n4center not considered a top influencer?

Any help would be really appreciated.


  1. First of all, read this blog to understand how scoring works: Machine Learning Anomaly Scoring and Elasticsearch - How it Works | Elastic Blog
  2. Influencers won't show if no particular value of a field was actually influential. Single-metric jobs built with the UI don't have influencers because, behind the scenes, the data is pre-aggregated (not processed raw), so any potential influencer information is lost. You can overcome this by creating the "single metric" job with the Advanced job wizard and manually specifying influencers.
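As a sketch of what such a configuration might look like when defined by hand rather than through the single-metric wizard (the job id, field names, and bucket_span below are hypothetical, not from this thread):

```python
# A sketch of the kind of job body the Advanced wizard produces, with a
# single metric detector plus explicitly declared influencers. The job id,
# field names, and bucket_span here are hypothetical.
job_body = {
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
            {"function": "max", "field_name": "proecessing_time"},
        ],
        # Influencers are declared here; the single-metric wizard offers
        # no way to set them, but the Advanced wizard / API does.
        "influencers": ["host.name", "service.name"],
    },
    "data_description": {"time_field": "@timestamp"},
}

# With the official Python client this could be submitted roughly as:
#   es.ml.put_job(job_id="pd-queue-sizes-advanced", body=job_body)
```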

Thanks @richcollier

I went through the blog. Here is my understanding:

After reading the 'Influencer scoring' section, my understanding is that the fields I have been using as influencers are not the right ones.

In light of this example:
"In an analysis of a population of users’ internet activity, in which the ML job looks at unusual bytes sent and unusual domains visited, you could specify “user” as a possible influencer since that is the entity that is “causing” the anomaly to exist (something has to be sending those bytes to a destination domain)."

It made an influencer sound like a 'category' to me, more than another data point that may itself become anomalous.

I think the fields I am using as influencers are just other data points, like cpu-usage-percentage, free-memory, num_Slow_queries, db-slowness, index_fragmentation_count, etc., whose values vary all the time. The values fluctuate, and we cannot guarantee a specific value.

So, in my opinion, they cannot be used to group the records where the anomalies were found.

Is my understanding correct?
If not, can an influencer be any other numerical data field? And if it can, given that its value may vary all the time, can it still be useful for measuring its impact on the detector's anomaly?

Based on my understanding, my conclusion is that I shouldn't use them as influencers. Is that correct?
If it is, the question becomes: what should I use as my influencer?
I will need to think about it.
In real life they can be my influencers (meaning they are the other factors I can blame for the anomalousness of my detector; they may not be the single root cause of the problem, but collectively they might have contributed to it).

If I didn't sound crazy and can get answers to my questions, then maybe that will help me find the right influencers in my data set.

If I do sound crazy, then I guess I need to do more reading.

Thanks for your time anyway.

Yes, influencers should be categorical fields (like "user", "host", or "location") and should NOT be numerical fields (response_time, cpu_utilization, etc.).

From: Machine Learning with the Elastic Stack, 2nd Edition (Collier, Montonen, Azarmi; ISBN 9781801070034):

Within the anomaly detection job configuration, there is the ability to define fields as an influencer. The concept of an influencer is a field that describes an entity for which you'd like to know whether it is to blame for the existence of the anomaly, or at least whether it had a significant contribution. Note that any field chosen as a candidate to be an influencer doesn't need to be part of the detection logic, although it is natural to pick fields that are used as splits or populations to also be influencers.

If we revisit the example shown in Figure 5.13, we see that both the clientip and the response.keyword fields were declared as influencers for the job (where clientip was part of the detector configuration, but response.keyword was not). The client IP address is identified as a top influencer. This seems a bit of a redundant declaration, because the anomaly was for that client IP – but this is an expected situation when influencers are chosen for the fields that define the population or are the split fields. The other top influencer (response.keyword) has a value of 404. This particular piece of information is extremely relevant in that it gives the user an immediate clue of what the IP address was doing during the anomaly. If we investigate the anomalous IP address at the time of the anomaly, we will see that 100% of the requests made resulted in a response code of 404:

As such, the value of 404 has a high influencer score (50, as shown in Figure 5.13). You may think that because 100% of the requests were 404, the influencer score should
also be 100, but it is not that simple. The influencer scores are normalized against other influencer scores and the influencer score is also expressing how unusual the value of 404 has been over time. In this specific example dataset, there are hundreds more occurrences of 404 over time, but most of those have not been associated with anomalies. As such, the influencer score for this particular anomaly is tempered by that fact. There may be a compelling argument for Elastic ML to separate these two concepts – one score that expresses the unusualness of the entity over time, and another score for how much a field value influences a particular anomaly – but for the time being, those notions are blended into the influencer score.

It is also key to understand that the process of finding potential influencers happens after Elastic ML finds the anomaly. In other words, it does not affect any of the probability calculations that are made as part of the detection. Once the anomaly has been determined, ML will systematically go through all instances of each candidate influencer field and remove that instance's contribution to the data in that time bucket. If, once removed, the remaining data is no longer anomalous, then via counterfactual reasoning, that instance's contribution must have been influential and is scored accordingly (with an influencer_score in the results).
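The counterfactual check described above can be caricatured in a few lines (a deliberately simplified sketch: the fixed threshold and the per-entity counts are invented, and Elastic's real detection is probabilistic, not threshold-based):

```python
# Deliberately simplified: a bucket is "anomalous" when its total exceeds a
# fixed threshold (Elastic's real detection is probabilistic). An entity is
# influential if removing its contribution makes the bucket look normal.
THRESHOLD = 100

# Hypothetical per-client request counts within one anomalous time bucket.
bucket = {"ip_a": 120, "ip_b": 10, "ip_c": 5}

def influencers(bucket, threshold):
    total = sum(bucket.values())
    if total <= threshold:
        return []  # nothing to explain: the bucket is not anomalous
    found = []
    for entity, contribution in bucket.items():
        # Counterfactual: re-test the bucket without this entity.
        if total - contribution <= threshold:
            found.append(entity)
    return found

# Removing ip_a drops the bucket to 15 (normal), so ip_a is influential;
# removing ip_b or ip_c still leaves the bucket over the threshold.
```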
