Beyond Bayes

Context: We want to learn an appropriate function $f$ provided samples from a dataset $D_n = \{(X, Y)\}^n$.

Turns out, you can do better than the naive Bayes update,

$$P(f|D_n) = \frac{P(D_n|f)\ P(f)}{P(D_n)}.$$

Tempered Bayes

Introduce an inverse temperature, $\beta$, to get the tempered Bayes update [1]:

$$P_\beta(f|D_n) = \frac{P(D_n|f)^{\beta}\ P(f)}{P_{\beta}(D_n)}.$$

At first glance, this looks unphysical. Surely $P(A|B)^{\beta}\ P(B) = P(A, B)$ only when $\beta=1$?

If you're one for handwaving, you might accept that this is just a convenient way to shift weight between the prior and the data: $\beta = 0$ returns the prior, and $\beta = 1$ returns the usual posterior. In any case, the tempered posterior is proper (it integrates to one) as long as the untempered posterior is [2].

If you're feeling more thorough, think about the information. Introducing an inverse temperature simply rescales the information content of the likelihood: writing $I(X, Y|f) = -\log P(X, Y|f)$, the tempered likelihood is $P(X, Y|f)^{\beta} = \exp\{-\beta I(X, Y|f)\}$.
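To make the role of $\beta$ concrete, here is a minimal sketch of the tempered update over a finite set of hypotheses. The function name `tempered_posterior`, the coin-flip setup, and the particular prior are illustrative assumptions, not part of the references; the code works in log space and assumes every prior probability is strictly positive.

```python
import math

def tempered_posterior(prior, log_likelihoods, beta):
    """Return P_beta(f | D_n) over a finite set of hypotheses.

    prior: prior probabilities P(f), one per hypothesis (assumed > 0).
    log_likelihoods: log P(D_n | f), one per hypothesis.
    beta: inverse temperature; beta = 1 is the ordinary Bayes update.
    """
    # Unnormalized log weights: beta * log P(D_n | f) + log P(f).
    log_weights = [beta * ll + math.log(p)
                   for ll, p in zip(log_likelihoods, prior)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    z = sum(weights)  # proportional to the normalizer P_beta(D_n)
    return [w / z for w in weights]

# Hypothetical example: three candidate coin biases, 7 heads in 10 flips.
biases = [0.25, 0.5, 0.75]
prior = [0.5, 0.3, 0.2]
heads, tails = 7, 3
log_lik = [heads * math.log(b) + tails * math.log(1 - b) for b in biases]

for beta in (0.0, 0.5, 1.0, 4.0):
    post = tempered_posterior(prior, log_lik, beta)
    print(beta, [round(p, 3) for p in post])
```

At $\beta = 0$ the output is the prior itself, and as $\beta$ grows the mass concentrates on the hypothesis with the highest likelihood.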

TODO: Check out Grünwald's Safe Bayes papers

Generalized Bayes

If you're feeling even bolder, you might replace the likelihood with a general loss term, $\ell_{\beta, n}(f)$, which measures performance on your dataset $D_n$,

$$P_\beta(f|D_n) = \frac{\ell_{\beta,n}(f)\ P(f)}{Z_n},$$

where we write the normalizing constant or partition function as $Z_n$ to emphasize that it isn't really an "evidence" anymore.

The most natural choice for $\ell_{\beta,n}$ is the Gibbs measure:

$$\ell_{\beta, n}(f) = \exp\left\{-\beta\, r_n(f)\right\},$$

where $r_n$ is the empirical risk of classical machine learning. This way, you can turn any risk function into a probability distribution over $f$, as long as $Z_n$ is finite.
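As a rough illustration of the Gibbs construction, the sketch below weights a handful of candidate functions by $\exp\{-\beta\, r_n(f)\}$ times a prior. The choice of mean squared error for $r_n$, the candidate lines through the origin, the toy data, and the names `empirical_risk` and `gibbs_posterior` are all assumptions made for the example.

```python
import math

def empirical_risk(f, data):
    """Mean squared error of predictor f on (x, y) pairs; one choice of r_n."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)

def gibbs_posterior(prior, risks, beta):
    """Return weights proportional to exp(-beta * r_n(f)) * P(f), normalized."""
    # Log of the unnormalized weight for each candidate (prior assumed > 0).
    log_weights = [-beta * r + math.log(p) for r, p in zip(risks, prior)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    z = sum(weights)  # equals Z_n up to the constant factor exp(m)
    return [w / z for w in weights]

# Hypothetical candidates: lines through the origin with different slopes.
slopes = [0.0, 1.0, 2.0, 3.0]
candidates = [lambda x, a=a: a * x for a in slopes]
prior = [0.25] * len(slopes)
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # toy data, roughly y = 2x

risks = [empirical_risk(f, data) for f in candidates]
posterior = gibbs_posterior(prior, risks, beta=5.0)
for a, r, p in zip(slopes, risks, posterior):
    print(f"slope={a:.1f} risk={r:.3f} posterior={p:.3f}")
```

Nothing here requires the risk to come from a likelihood: any loss with a finite normalizer yields a valid distribution over the candidates, which is the point of the generalization.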

Jeffreys Updates