Struggling to understand OneOfNPrior::addSamples formulae

Hi guys!

I've been trying to understand how the one-of-N prior (COneOfNPrior::addSamples) deals with model selection.
The code documentation does a neat job of explaining all the formulae:

This means that the joint posterior distribution factorises into the posterior distribution for the model parameters given the data and the posterior weights for each model, i.e. f({p(m), m} | x) = f'(p(m) | x) * P'(m | x)

So I was happy when I got to the code and realized that some sort of BIC and BF were being used:

    TDouble5Vec minusBics;
    TDouble5Vec varianceMismatchPenalties;
    double maxLogBayesFactor{-m * MAXIMUM_LOG_BAYES_FACTOR};

Even though I've read a few papers about BIC and BF, I still can't understand how the code resolves model comparison through weights with statements like these:

maxLogBayesFactor = std::max(maxLogBayesFactor, logLikelihood - std::max(modeLogLikelihood, logLikelihood))
addLogFactor(std::max(minusBics[i] / 2.0, maxMinusBic / 2.0 - maxLogBayesFactor) + varianceMismatchPenalties[i])

Is there a paper backing this up, or could you explain to me the general intuition behind this approach, which deals with BIC and BF at the same time?

Many thanks!

I'm also sharing my findings from the unit test simulation I've run, where it seems the BIC term is selected for the weight contribution every time a sample is evaluated...

Maybe MaxBayesFactor is only used as a "hard limit threshold", as the code comment states:

If none of the models are a good fit for the new data then restrict
the maximum Bayes Factor since the data likelihood given the model
is most useful if one of the models is (mostly) correct.

Finally, I'm also attaching a table summarizing how the variables evolve in the Poisson/Normal test case for 10k Poisson samples, in case it helps anyone...


The extended comment at the start of the method introduces the basic idea of what we're doing here. We want to calculate model weights assuming all data comes from one of the supplied models and that we have some prior odds on the models (typically equal). In particular, we want to use the likelihoods we would get for all data after integrating over the current priors for each model's parameters. (Note as well that any parameter for which we have a full Bayesian treatment doesn't need to be counted when we compute the BIC approximation to the Bayes Factor, i.e. we're actually getting the true BF w.r.t. these. Also note that, from a theoretical perspective, we do abuse BIC in the weight calculation here, since one is generally advised not to treat it like a BF. However, I felt that 1. we needed to acknowledge that some of the models had additional parameters without a prior, and we needed some accounting for their expressive power, and 2. for our use cases we usually have a surfeit of data (the modelling is run for a long time), so the Taylor approximation in BIC should generally be good where the prior is largely non-zero.) The gist of this long comment is to explain why one can update the weights and model parameters independently, and to give the recursion formula for the weights under the assumption we made that all data come from one of the candidate models.
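
As a minimal sketch of that weight recursion (the helper names `updateLogWeights` and `normalisedWeights` are hypothetical and are not taken from ml-cpp): each model's log weight accumulates the log marginal likelihood of every new sample under that model, and the weights are normalised only when they are needed. The sketch assumes the caller has already integrated the likelihood over each model's current parameter prior.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the ml-cpp implementation: add each model's log
// marginal likelihood of the new sample to its running log weight.
void updateLogWeights(std::vector<double>& logWeights,
                      const std::vector<double>& logMarginalLikelihoods) {
    assert(logWeights.size() == logMarginalLikelihoods.size());
    for (std::size_t i = 0; i < logWeights.size(); ++i) {
        logWeights[i] += logMarginalLikelihoods[i];
    }
}

// Convert log weights to weights which sum to one. The maximum log weight
// is subtracted first so the exponentials cannot overflow.
std::vector<double> normalisedWeights(const std::vector<double>& logWeights) {
    assert(!logWeights.empty());
    double maxLogWeight{*std::max_element(logWeights.begin(), logWeights.end())};
    std::vector<double> weights(logWeights.size());
    double Z{0.0};
    for (std::size_t i = 0; i < logWeights.size(); ++i) {
        weights[i] = std::exp(logWeights[i] - maxLogWeight);
        Z += weights[i];
    }
    for (double& weight : weights) {
        weight /= Z;
    }
    return weights;
}
```

Starting from equal log weights (equal prior odds) and calling `updateLogWeights` once per sample reproduces the product of per-sample likelihood ratios, i.e. the Bayes Factor between any two models, as the ratio of their normalised weights.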

As you observe, there are indeed some modifications we made to the vanilla model weights implied by BF/BIC. These are driven by cases where the models we include do not fit the data well and by our needs for anomaly detection.

One is to deal with models which cluster the data. In particular, we allow for the data to cluster and for each mode to use one of our existing distribution models (i.e. we describe it as a mixture of COneOfNPrior models). What one then finds is that, if we don't identify any clusters, the collection of candidate models will collapse onto the best model for the data among the candidates, and we'll end up with essentially a duplicate of the model which best describes the data. The likelihoods this model generates will match the best model's likelihoods, so we will get no pressure to discard it. This is inefficient, since most of the operations we perform are computed for all models with significantly non-zero weights, so we just discard any such model if we choose not to cluster the data.

The comment you pointed to is referencing another modification. We were finding that large outliers were unduly penalising models with light tails, even if they better described most of the data. To address this, we simply cap the maximum BF between two models that we use for a single update. This is a heuristic and was found to work well enough in practice. For a statistical analogue, I feel it is rather reminiscent of Winsorization of outliers for, say, robust mean statistics. There may be a better approach for dealing with this sort of "none of the candidate models describe a subset of samples well" case. For example, if I weren't trying to make everything recursive, there must be some mixture model treatment of a small set of very different values, but this approach worked well enough in practice. (We actually avoid creating clusters for too few values because we don't want to fit repeated events which are useful anomalies to identify.)
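
A minimal sketch of this kind of cap, loosely mirroring the shape of the `std::max(minusBics[i] / 2.0, maxMinusBic / 2.0 - maxLogBayesFactor)` expression quoted in the question: each model's log factor for a single update is floored at the best model's log factor minus the cap. The function name `cappedLogFactors` is hypothetical, and the sketch takes the cap as a fixed argument, whereas the real code also adjusts it based on how well the models fit the new data.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper, not the ml-cpp implementation: limit how far any
// model's log factor for one update can fall below the best model's.
std::vector<double> cappedLogFactors(const std::vector<double>& logFactors,
                                     double maxLogBayesFactor) {
    assert(!logFactors.empty());
    assert(maxLogBayesFactor >= 0.0);
    double best{*std::max_element(logFactors.begin(), logFactors.end())};
    std::vector<double> result(logFactors);
    for (double& logFactor : result) {
        // A single extreme outlier can cost a model at most
        // maxLogBayesFactor relative to the best model.
        logFactor = std::max(logFactor, best - maxLogBayesFactor);
    }
    return result;
}
```

With an illustrative cap of 30, log factors of -2 and -500 become -2 and -32, so a light-tailed model which assigns a vanishing likelihood to one outlier loses a bounded amount of weight rather than being eliminated in a single step. The value 30 is only for illustration; the source does not state the value of MAXIMUM_LOG_BAYES_FACTOR.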

Finally, on a related note, we had an issue where some of our skewed, heavy-tailed models could be selected purely based on BF for data which were mostly a small constant with the occasional much larger value. The fit of these models was also generally pretty bad for our purposes in these cases: they have super heavy tails and say that there is a reasonable chance of seeing values orders of magnitude larger than the largest value we've seen. From an anomaly detection point of view this is bad, since we will assign relatively high probability to values which are clearly unlike any we've seen before. We again use a heuristic here. If the model variance is not very different from the data variance then we just use BF; if the model variance is much larger than the data variance we start to penalise that model, up to our cap on the BF.
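
The general shape of such a penalty can be sketched as below. Everything specific here is an assumption made for illustration: the function name `varianceMismatchPenalty`, the `tolerance` parameter, and the logarithmic growth of the penalty are not taken from ml-cpp. Only the qualitative behaviour comes from the description above: no penalty when the variances are comparable, and a penalty which grows with the mismatch but never exceeds the cap on the log BF.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical sketch, not the ml-cpp implementation. Returns a value <= 0
// which would be added to a model's log factor.
double varianceMismatchPenalty(double modelVariance,
                               double dataVariance,
                               double tolerance,
                               double maxLogBayesFactor) {
    assert(modelVariance > 0.0 && dataVariance > 0.0);
    assert(tolerance >= 1.0 && maxLogBayesFactor >= 0.0);
    double ratio{modelVariance / dataVariance};
    if (ratio <= tolerance) {
        // The model variance is close enough to the data variance:
        // rely on the Bayes Factor alone.
        return 0.0;
    }
    // Assumed functional form: grow with the log of the excess ratio and
    // saturate at the cap on the log Bayes Factor.
    return -std::min(maxLogBayesFactor, std::log(ratio / tolerance));
}
```

Under these assumptions, a model whose variance is 1.2 times the data variance is not penalised when the tolerance is 2, while one whose variance is many orders of magnitude too large is penalised by the full cap.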

Hope this helps!


Thanks for such a deep, passionate and kind explanation!
I wish I'd had a professor like this thirty years ago.
Best regards!

