Ok, I think from your answer to 3 you'd basically be happy with us quantifying the evidence for a "causal" relationship (somehow) and accepting it for what it is, i.e. you don't have a hard and fast requirement that relationships be extracted within such and such a time frame or else this isn't useful; and from your answer to 4, that you would like us to accommodate the fact that there could be a lagged relationship between a change and its cause.
Generally, not really, but in the context of events (process start, query start) I could expect multiple events to happen at (or near) the same time. In cases like this, I would expect something like the consistency of the occurrences to dictate correlation strength.
This may well be the case, but the problem is that other changes happening nearby in time act as confounders: based on just the time window around the change, any one of these processes starting could have caused it. In general, you don't really have the possibility of determining true causality because you don't get to see the counterfactual, i.e. you don't know whether process B would have changed had you not started process A. At other times, the starts of process A act as natural experiments, but the variation in all the different system parameters between those times means this isn't a gold standard like a controlled experiment.
If I understand this question correctly, I'd think the lag would be variable for different possible causal relationships.
What I was really driving at was: would you expect that in your system process A starts and 5 mins later process B increases its memory usage, whereas process C starts and 15 mins later process B starts to increase its memory usage? The key point being that the lags are significantly different in the two cases. Based on your answer to 4, though, it sounds like this is probably the case, or at least that one couldn't really assume the effect would follow at a fixed time interval.
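To make "not assuming a fixed interval" concrete, here is a minimal sketch (entirely hypothetical data and function names, not Stack functionality): scan a range of candidate lags between a process-start indicator series and another process's memory deltas, and report the lag with the strongest correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-minute series: 1 where process A started, plus process B's
# memory deltas with an effect injected 15 minutes after each start.
starts = rng.binomial(1, 0.02, size=2000)
memory_delta = rng.normal(0, 1, size=2000)
memory_delta[15:] += 5 * starts[:-15]

def lagged_correlation(cause, effect, max_lag):
    """Pearson correlation of effect against cause shifted by each candidate lag."""
    return {lag: np.corrcoef(cause[:-lag], effect[lag:])[0, 1]
            for lag in range(1, max_lag + 1)}

corrs = lagged_correlation(starts, memory_delta, max_lag=30)
best = max(corrs, key=corrs.get)
print(f"strongest relationship at lag {best} min (r = {corrs[best]:.2f})")
```

On the data above this recovers a lag of around 15 minutes, which is the point: the lag itself is something you estimate per pair of series rather than fix up front.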
The overall goal I would like to achieve is to be able to provide others with a way to see potential causes (correlations) of resource utilization that takes into context the many different things that could be going on in a system at once. This would in theory help narrow down scopes during troubleshooting/incident response where aspects like resource utilization come into play, while also providing insight into system operations to identify inefficiencies or potential problems.
Thanks, this context is useful. I asked this question because I was wondering whether you saw the key tool here as exploring relationships, with the method of detecting changes already well defined, or whether it also required us to decide that, say, memory had increased sufficiently at some point to be considered something you would like to find a cause for. In this latter case, ideally the problems aren't decoupled, i.e. whether a change happened at all is determined in the context of determining possible causes. Practically, though, it is probably necessary to decouple the two problems.
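If the two problems are decoupled, the "did memory increase sufficiently?" step can be treated as plain change detection on its own. A minimal sketch of one naive way to do that (the window size and threshold here are illustrative assumptions, not recommendations):

```python
import numpy as np

def mean_shift_changes(series, window=30, threshold=3.0):
    """Flag indices where the mean of the next window jumps relative to the
    previous window by more than `threshold` standard errors.

    Naive on purpose: a real detector would de-duplicate nearby hits and
    handle trends, seasonality, etc.
    """
    changes = []
    for i in range(window, len(series) - window):
        before = series[i - window:i]
        after = series[i:i + window]
        pooled_se = np.sqrt(before.var() / window + after.var() / window)
        if pooled_se > 0 and (after.mean() - before.mean()) / pooled_se > threshold:
            changes.append(i)
    return changes
```

Each flagged index would then become a "change to explain" fed into the relationship-exploration step.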
So this sort of problem is definitely something I have considered in the past. There are different degrees of sophistication with which it can be tackled. For example, a rigorous approach starts from a model of possible causal relationships and then treats the observed occurrence frequencies via causal calculus. There is however a tricky part to this: the model is a priori, i.e. not determined from the data, but supplied by the person who wants to understand causality. If we are prepared to lower the bar for accuracy, then there are certainly mathematical tools which allow one to find suspiciously co-occurring events. As you allude to, this is really more correlation than causation, although it can provide evidence for causation. One Stack tool which is useful here, coming in 8.4, is frequent item set mining. If you pivot the data into time intervals, you can look for a process start which frequently co-occurs with some other process's memory increasing. Of course, frequency of co-occurrence alone is not sufficient, since it doesn't acknowledge the difference between the frequencies with which different processes start. There are ways of quantifying how much more often events co-occur than they would by chance if they were independent: for example, comparing P(A ∧ B) with P(A) × P(B), or equivalently P(B | A) with P(B).
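To illustrate that last comparison (this is just a sketch with made-up data and a made-up helper, not the 8.4 frequent item set aggregation itself): pivot event timestamps into fixed time buckets and compute the lift, i.e. how much more often A and B land in the same bucket than independence would predict.

```python
import numpy as np

def lift(a_times, b_times, bucket_s, horizon_s):
    """Lift = P(A and B in same bucket) / (P(A) * P(B)).

    Assumes all timestamps fall in [0, horizon_s) and both events occur at
    least once. Values well above 1 suggest more co-occurrence than chance.
    """
    n = int(horizon_s // bucket_s)
    a = np.zeros(n, dtype=bool)
    b = np.zeros(n, dtype=bool)
    a[(np.asarray(a_times) // bucket_s).astype(int)] = True
    b[(np.asarray(b_times) // bucket_s).astype(int)] = True
    p_a, p_b, p_ab = a.mean(), b.mean(), (a & b).mean()
    return p_ab / (p_a * p_b)

# e.g. process A start times vs. process B memory-increase times, in seconds
print(lift([30, 310, 640, 900], [40, 320, 650, 910], bucket_s=60, horizon_s=1200))
# -> 5.0: the pair co-occurs 5x more often than if the events were independent
```

This is the same normalisation the P(B | A) vs P(B) comparison makes, just written in its symmetric form.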
It is really too much to go into the details of how to solve this problem in a comment. The above gives a flavour of one potential approximate approach. It is something we may consider providing specific tooling around. A key barrier to this being widely useful is how frequently you get "experiments". In your case it sounds like you get them frequently, but this feels somewhat unusual to me. More often, people are in the situation that something changed for the first time and it is unrelated to previous events, say someone pushed a code change. In this case, troubleshooting simply involves understanding which of many, many metrics are changing or recently changed, so you can isolate the effect or trace it back to some component of a complex system. Both anomaly detection and some aggregations we are adding have the capability to surface this sort of information automatically.