Event/Resource Correlation Detection

Hi All,

I was wondering if anyone has any ideas on how to achieve correlation detection between events and resources.

Use Case

I have a visualization which shows the top 10 processes by memory usage on a system. It also has annotations that show when new processes start on the system.

(Example)

Today, I use this visualization to manually identify correlations between processes starting and the resource (memory) utilization of other processes.

I'd like to move this to more of a machine learning setup, as trying to manually correlate things over long periods with potentially hundreds of different dimensions isn't really scalable.

What I'd like to see is if something could detect frequent patterns between a set of events and historical data (metrics) to find correlations (and if possible, the strength of the correlations).

Examples:

  1. If process A starts, and then frequently/consistently process B starts to increase memory usage shortly after, then I would like something to say: Process A starting strongly correlates to Process B's memory usage.
  2. If process C starts, but infrequently corresponds to process B memory increase, then I would like something to say: Process C weakly correlates or doesn't correlate at all to Process B's memory usage.

Notes:

  • I don't really care about things like: process A starts and process A's own memory usage increases. That would always be true (direct causation) and would provide little valuable insight.
  • While I'm using process start times in my example, I'd like any sort of event to be usable; example: Query execution starts
  • While I'm using process memory consumption in my example, I'd like any sort of historical metric to be usable. Examples: CPU, Memory (RAM), DiskIO
  • A nice-to-have would be cross-correlation as well: does a memory usage increase of Process A correlate with Process B's DiskIO increasing?
  • Another nice to have would be sub-correlations (not really sure if this is the correct phrase).
    • If consistently Process A starts and is shortly followed by Query Z, and then shortly followed by Process B using more memory, then Process A followed by Query Z strongly correlates to Process B's memory usage.
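To make the kind of output described above concrete, here is a rough sketch (synthetic data; every name and number is made up for illustration) of scoring a single event/metric pair by correlation at a range of lags, with the event series as 0/1 indicators and the metric as per-interval deltas:

```python
import numpy as np

def lagged_correlation(events, metric_deltas, max_lag):
    """Return (best_lag, correlation) between a 0/1 event series and a
    metric delta series, scanning lags 0..max_lag (event leads metric)."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        e = events[: len(events) - lag]
        m = metric_deltas[lag:]
        if np.std(e) == 0 or np.std(m) == 0:
            continue  # constant series have no defined correlation
        r = np.corrcoef(e, m)[0, 1]
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Synthetic data: "process A" starts every 20 buckets; "process B's"
# memory jumps 2 buckets later, on top of small noise.
rng = np.random.default_rng(0)
n = 400
a_starts = np.zeros(n)
a_starts[::20] = 1
b_mem_delta = rng.normal(0, 0.1, n)
b_mem_delta[2::20] += 5.0  # B's memory rises 2 buckets after A starts

lag, r = lagged_correlation(a_starts, b_mem_delta, max_lag=10)
print(f"best lag={lag}, correlation={r:.2f}")
```

Run over every (event, metric) pair, this gives exactly the "strength of correlation" ranking asked for, at the cost of scanning all pairs.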

I recognize that this might be a fairly complex topic/thing to implement, and I wasn't really able to find an existing way to implement this in Elastic, so I'm not really sure if it's even possible.

I think to be able to frame this problem precisely we need a bit more background information:

  1. How many historic occurrences of process A starting are you likely to have?
  2. Do you have multiple processes starting at the same time?
  3. How many occurrences of process A starting can you wait for before you want to be notified about potential correlations?
  4. How long after starting process A do you expect it to be before process B's memory increases? In particular, do you expect this to be i) short or long compared to the time between processes starting, ii) consistent or variable for different possible causal relationships?

I also presume that in general you would want the identification of metric changes to be automated, i.e. you don't have a rule or alert which tells you when resource utilisation has changed. Is this correct?

Hi @Tom_Veasey, thanks for the reply, regarding your questions:

  1. How many historic occurrences of process A starting are you likely to have?

In the context of Process A I'd say it happens every few hours, but in general I would expect an event (process start) with a higher frequency of occurrence to be easier to "accurately" detect correlations with than an event which happens less frequently.

  2. Do you have multiple processes starting at the same time?

Generally, not really, but in the context of events (process start, query start) I could expect multiple events to happen at (or near) the same time. In cases like this, I would expect something like the consistency of the occurrences to dictate correlation strength.

Examples:

  1. If Process A and Query Z consistently start at almost exactly the same time, and Process B's resource usage increases, I wouldn't expect the machine learning to know which event "caused" the increase, but I would expect it to say both Process A and Query Z strongly correlate to Process B's resource increase.
  2. If Process A starts and Process B's resource usage consistently increases, but Query Z only occasionally runs at the same time as Process A starts, I would expect the machine learning to say Process A strongly correlates to Process B's increased resource usage, while Query Z either weakly correlates or doesn't correlate at all.
  3. How many occurrences of process A starting can you wait for before you want to be notified about potential correlations?

This is a challenging question. Framing it in the context of the existing anomaly detection features, I would like to see something that starts off with a lower-strength correlation but learns over time, adjusting the strength depending on the frequency/clarity of the events and historical metrics. You'd then use the "strength" of the correlation to determine whatever you need from it.
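One way to get this start-weak-and-learn behaviour is a Beta-Bernoulli update: each time the event fires, record whether the metric change followed within the window, and keep a posterior over the co-occurrence rate. This is purely an illustrative sketch, not an existing Elastic feature:

```python
class CorrelationStrength:
    """Posterior over P(metric change follows event), as Beta(alpha, beta).
    Starts uninformative, so early estimates sit near 0.5 and only
    sharpen as occurrences accumulate."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # co-occurrences + prior pseudo-count
        self.beta = beta    # non-co-occurrences + prior pseudo-count

    def observe(self, followed_by_change: bool):
        if followed_by_change:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def n(self):
        return self.alpha + self.beta - 2  # observations seen so far

# Hypothetical history: B's memory rose after 9 of A's 10 starts.
s = CorrelationStrength()
for followed in [True] * 9 + [False]:
    s.observe(followed)
print(f"estimated strength {s.mean():.2f} after {s.n():.0f} events")
```

The estimate here lands around 0.83 rather than 0.9 because of the prior; with more consistent observations it converges on the empirical rate, which is the "adjust the strength over time" behaviour described above.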

  4. How long after starting process A do you expect it to be before process B's memory increases? In particular, do you expect this to be i) short or long compared to the time between processes starting, ii) consistent or variable for different possible causal relationships?

This is another tough question to answer, but again, this is where I think the machine learning part would come into play.

i) short or long compared to the time between processes starting

I would expect two main factors to play a role here:

  1. The consistency in the time between the event (process starting) and the resource increasing
    • The more consistent the stronger the correlation
  2. The duration between the event (process starting) and the resource increasing
    • The shorter the duration the stronger the correlation
    • Note: I do think there would be a "cut-off" on the look-back duration; I wouldn't expect a process which started a day ago to cause resource usage of another process a day later. (I might expect that process to run a query that could cause it, but then the query would be correlated, not the initial process.)

Note: I wouldn't really care about when the process start and the resource usage happen in absolute terms, but more that they are consistent relative to each other.

  • Events can be caused by any number of factors, and relying on the consistency of when they are run isn't really helpful.
    • If someone manually executes a query and it causes a resource usage spike it can happen at any time, but if consistently that same query causes the resource usage spike, then I would want the 2 to be correlated.
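The two factors above (lag consistency and lag length) can be folded into a single score: for each event occurrence, take the delay to the nearest subsequent metric change, then reward short mean delays and low spread. A toy sketch, with the weighting and cutoff purely illustrative:

```python
import statistics

def lag_score(event_times, change_times, cutoff=600.0):
    """Score in [0, 1]: higher when metric changes follow events with
    short, consistent delays. Delays beyond `cutoff` seconds are ignored,
    matching the look-ahead cut-off idea above."""
    lags = []
    for t in event_times:
        after = [c - t for c in change_times if 0 <= c - t <= cutoff]
        if after:
            lags.append(min(after))  # nearest subsequent change
    if len(lags) < 2:
        return 0.0  # too little evidence to score consistency
    mean_lag = statistics.mean(lags)
    spread = statistics.pstdev(lags)
    # Short mean lag and low spread both push the score toward 1;
    # the fraction of events with any follow-up penalises gaps.
    coverage = len(lags) / len(event_times)
    return coverage / (1.0 + mean_lag / cutoff + spread / cutoff)

# B's memory rises ~30 s after every A start: high score expected.
a = [0, 1000, 2000, 3000]
b = [30, 1031, 2029, 3030]
consistent = lag_score(a, b)
# C starts are followed erratically or not at all: low score expected.
erratic = lag_score([0, 1000, 2000, 3000], [500, 2900])
print(f"consistent={consistent:.2f}, erratic={erratic:.2f}")
```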

ii) consistent or variable for different possible causal relationships?

If I understand this question correctly, I'd think variable for different possible causal relationships.

Example:

  • If 100% of the time, Process A starts and Process B's resource usage increases, I would expect a strong correlation
  • Of those occurrences, 25% of the time Query Z runs in the middle: Process A -> Query Z -> Process B's resource usage increase
    • For this one, I would expect:
      • Process A to have a strong correlation with Process B's resource usage increase.
      • Query Z to have a weak correlation with Process B's resource usage increase.
      • Process A followed by Query Z to have a moderate correlation with Process B's resource usage increase.
        • In this case, while Process A correlates strongly with Process B, Query Z's involvement might just be a coincidence rather than something caused by Process A.
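Putting illustrative counts on this example shows how conditional probabilities, and their lift over the base rate, would separate the strong correlate from the weak one (all numbers below are hypothetical):

```python
# Hypothetical counts over 1000 time windows (illustrative only):
windows = 1000
b_increase = 110   # windows where B's memory increased
a_starts = 100     # A is always followed by a B increase
a_and_b = 100
z_runs = 75        # Z ran in 25 of A's windows plus 50 others
z_and_b = 30       # B increased in 30 of Z's windows (25 via A)

p_b = b_increase / windows
p_b_given_a = a_and_b / a_starts
p_b_given_z = z_and_b / z_runs

# "Lift" over the base rate: how much more often B increases after
# each event than it does by chance alone.
print(f"P(B)     = {p_b:.2f}")
print(f"P(B | A) = {p_b_given_a:.2f}, lift = {p_b_given_a / p_b:.1f}")
print(f"P(B | Z) = {p_b_given_z:.2f}, lift = {p_b_given_z / p_b:.1f}")
```

With these numbers, A's conditional probability is 1.0 while Z's is 0.4, so A ranks as the strong correlate and Z as the weak one, matching the expectation in the example above.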

I also presume that in general you would want the identification of metric changes to be automated, i.e. you don't have a rule or alert which tells you when resource utilization has changed. Is this correct?

The overall goal I would like to achieve is to be able to provide others with a way to see potential causes (correlations) of resource utilization that takes into context the many different things that could be going on in a system at once. This would in theory help narrow down scopes during troubleshooting/incident response where aspects like resource utilization come into play, while also providing insight into system operations to identify inefficiencies or potential problems.

Ok, I think from your answer to 3 you'd basically be happy with us quantifying the evidence for a "causal" relationship (somehow) and accepting that it is what it is, i.e. you don't have a hard-and-fast "I need to extract relationships within such-and-such a time frame or this isn't useful". And from your answer to 4, you would like to accommodate the fact that there would be a lagged relationship between a change and its cause.

Generally, not really, but in the context of events (process start, query start) I could expect multiple events to happen at (or near) the same time. In cases like this, I would expect something like the consistency of the occurrences to dictate correlation strength.

This may well be the case, but the problem is that other changes happening nearby in time act as confounders: based on just the time window around the change, any one of these processes starting could have caused it. In general, you don't really have the possibility of determining true causality because you don't get to see the counterfactual, i.e. you don't know if process B would have changed had you not started process A. The other times process A starts act as natural experiments, but the variation in all the different system parameters between those times means this isn't like having a gold-standard control experiment.

If I understand this question correctly, I'd think variable for different possible causal relationships.

What I was really driving at was whether you would expect that, for your system, process A starts and 5 mins later process B increases memory usage, versus process C starts and 15 mins later process B starts to increase its memory usage. The key point is that the lags are significantly different in the two cases. Based on your answer to 4, though, it sounds like this is probably the case, or at least that one couldn't assume the effect would come at a fixed time interval.

The overall goal I would like to achieve is to be able to provide others with a way to see potential causes (correlations) of resource utilization that takes into context the many different things that could be going on in a system at once. This would in theory help narrow down scopes during troubleshooting/incident response where aspects like resource utilization come into play, while also providing insight into system operations to identify inefficiencies or potential problems.

Thanks, this context is useful. I asked this question because I was wondering whether you saw the key tool here as exploring relationships, with the method of detecting changes already well defined, or whether it also required us to decide that, say, memory had increased sufficiently at some point to be considered something you would like to find a cause for. In this latter case, ideally the problem isn't decoupled, i.e. whether a change even happened at all is determined in the context of determining possible causes. Practically, though, it is probably necessary to decouple the two problems.

So this sort of problem is definitely something I have considered in the past. There are different degrees of sophistication with which it can be tackled. For example, a rigorous way of tackling it starts from a model of possible causal relationships and then treats observed occurrence frequencies via causal calculus. There is however a tricky part to this: the model is a priori, i.e. not determined from the data, but supplied by the person who wants to understand causality. If we are prepared to lower the bar for accuracy then there are certainly mathematical tools which allow one to find suspiciously co-occurring events. This is, as you allude to, really more correlation than causation, although it can provide evidence for causation. One useful Stack tool coming in 8.4 is frequent item set mining. If you pivot the data into time intervals, you can look for a process starting which frequently occurs together with some other process's memory increasing. Of course, frequency of co-occurrence alone is not sufficient, since it doesn't acknowledge the different frequencies with which different processes start. There are ways of quantifying how much more often events co-occur than they would by chance if they were independent: for example, comparing P(A and B) with P(A) x P(B).
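As a rough illustration of the pivot-and-count idea (a toy sketch, not the actual 8.4 feature): bucket the data into time intervals, record which events and metric increases occurred in each, and compare each pair's joint frequency with the product of its marginals. A ratio near 1 means the pair co-occurs no more often than chance:

```python
from collections import Counter
from itertools import product

def cooccurrence_lift(buckets, events, metrics):
    """buckets: list of sets of labels observed in each time interval.
    Returns {(event, metric): P(event & metric) / (P(event) * P(metric))}."""
    n = len(buckets)
    single = Counter()
    joint = Counter()
    for b in buckets:
        for label in b:
            single[label] += 1
        for e, m in product(events, metrics):
            if e in b and m in b:
                joint[(e, m)] += 1
    lifts = {}
    for e, m in product(events, metrics):
        if single[e] and single[m]:
            lifts[(e, m)] = (joint[(e, m)] / n) / ((single[e] / n) * (single[m] / n))
    return lifts

# Toy buckets: "A_start" always co-occurs with "B_mem_up", while
# "C_start" co-occurs with it only occasionally.
buckets = (
    [{"A_start", "B_mem_up"}] * 8
    + [{"C_start", "B_mem_up"}] * 2
    + [{"C_start"}] * 8
    + [set()] * 2
)
lifts = cooccurrence_lift(buckets, ["A_start", "C_start"], ["B_mem_up"])
for pair, lift in lifts.items():
    print(pair, round(lift, 2))
```

Here A's lift comes out at 2.0 (twice as often as chance) while C's is below 1, which is the "more than by chance if independent" comparison made explicit.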

It is really too much to go into detail on how to solve this problem in a comment; the above gives a flavour of one potential approximate approach. It is something we may consider providing specific tooling around. A key barrier to this being widely useful is how frequently you get "experiments". In your case it sounds like you get them frequently, but this feels somewhat unusual to me. More often, people are in the situation that something changed for the first time and it is unrelated to previous events, say someone pushed a code change. In that case, troubleshooting simply involves understanding which of many, many metrics are changing or recently changed, so you can isolate the effect or trace it back to some component of a complex system. Both anomaly detection and some aggregations we are adding have the capability to surface this sort of information automatically.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.