Beyond Bayes

Context: We want to learn an appropriate function $f$ from a dataset of $n$ samples, $D_n = \{(X_i, Y_i)\}_{i=1}^n$.

Turns out, you can do better than the naive Bayes update,

P(f|D_n) = \frac{P(D_n| f)\ P(f)}{P(D_n)}.

Tempered Bayes

Introduce an inverse temperature, $\beta$, to get the tempered Bayes update [1]:

P_\beta(f|D_n) = \frac{P(D_n|f)^{\beta}\ P(f)}{P_{\beta}(D_n)}.

At first glance, this looks unphysical. Surely $P(A|B)^{\beta}\ P(B) = P(A, B)$ holds only when $\beta=1$?

If you're one for handwaving, you might accept that this is just a convenient way to interpolate between putting more weight on the prior ($\beta < 1$) and more weight on the data ($\beta > 1$). In any case, the tempered posterior is proper (integrates to one) as long as the untempered posterior is [2].
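To see the prior-versus-data trade-off numerically, here is a minimal sketch of the tempered update on a grid. The coin-flip setup (a bias `theta` standing in for $f$, 7 heads in 10 flips, a uniform prior) and the function name `tempered_posterior` are illustrative assumptions, not part of the original argument.

```python
import numpy as np

def tempered_posterior(log_lik, prior, beta):
    """Tempered posterior on a grid: proportional to P(D_n|f)^beta * P(f)."""
    log_post = beta * log_lik + np.log(prior)
    log_post -= log_post.max()      # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()        # the sum plays the role of P_beta(D_n)

# Hypothetical example: f is a coin bias theta on a grid, uniform prior.
theta = np.linspace(0.01, 0.99, 99)
prior = np.full(theta.size, 1.0 / theta.size)
heads, flips = 7, 10
log_lik = heads * np.log(theta) + (flips - heads) * np.log(1 - theta)

post_prior = tempered_posterior(log_lik, prior, beta=0.0)  # data ignored
post_bayes = tempered_posterior(log_lik, prior, beta=1.0)  # ordinary Bayes
post_sharp = tempered_posterior(log_lik, prior, beta=5.0)  # data over-weighted
```

With $\beta = 0$ the update returns the prior unchanged, with $\beta = 1$ it is the ordinary Bayes posterior, and larger $\beta$ concentrates the mass more tightly around the maximum-likelihood value.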

If you're feeling more thorough, think in terms of information. Introducing an inverse temperature simply rescales the number of bits contained in the distribution: writing $I(X, Y|f) = -\log P(X, Y|f)$, we have $P(X, Y|f)^{\beta} = \exp\{-\beta\, I(X, Y|f)\}$.
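One way to make the bit-scaling picture concrete, assuming the samples are i.i.d. given $f$ and taking the usual $I = -\log P$ (both assumptions on my part), is to write out the tempered likelihood for the whole dataset:

P(D_n|f)^{\beta} = \prod_{i=1}^{n} P(X_i, Y_i|f)^{\beta} = \exp\left\{-\beta \sum_{i=1}^{n} I(X_i, Y_i|f)\right\}.

Read this way, $\beta < 1$ discounts the information carried by each sample and $\beta > 1$ inflates it, roughly as if the dataset held $\beta n$ samples rather than $n$.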

TODO: Check out Grünwald's Safe Bayes papers

Generalized Bayes

If you're feeling even bolder, you might replace the likelihood with a general loss term, $\ell_{\beta, n}(f)$, which measures performance on your dataset $D_n$,

P_\beta(f|D_n) = \frac{\ell_{\beta,n}(f)\ P(f)}{Z_n},

where we write the normalizing constant or partition function as $Z_n$ to emphasize that it isn't really an "evidence" anymore.

The most natural choice for $\ell_{\beta,n}$ is the Gibbs measure:

\ell_{\beta, n}(f) = \exp\left\{-\beta\, r_n(f)\right\},

where $r_n$ is the empirical risk of classical machine learning. Any risk function can be turned into a probability distribution this way, provided $Z_n$ is finite.
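As a rough illustration of the Gibbs construction, the sketch below builds a generalized posterior over a grid of candidate functions. The setup is hypothetical: linear candidates $f_a(x) = a x$, a mean-squared-error risk, synthetic data with true slope 2, and the helper name `gibbs_posterior` are all assumptions made for the example.

```python
import numpy as np

def gibbs_posterior(risk, prior, beta):
    """Generalized posterior on a grid: proportional to exp(-beta * r_n(f)) * P(f)."""
    log_w = -beta * risk + np.log(prior)
    log_w -= log_w.max()            # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()              # the sum plays the role of Z_n

# Hypothetical data: y is roughly 2 * x plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

# Candidate functions f_a(x) = a * x, with a uniform prior over the slope a.
slopes = np.linspace(0.0, 4.0, 81)
prior = np.full(slopes.size, 1.0 / slopes.size)

# Empirical risk r_n(f_a): mean squared error on the dataset.
risk = np.array([np.mean((y - a * x) ** 2) for a in slopes])

post = gibbs_posterior(risk, prior, beta=50.0)
best_slope = slopes[np.argmax(post)]
```

Under a uniform prior the posterior mode coincides with the empirical risk minimizer, and $\beta$ controls how sharply the mass concentrates around it. If $r_n$ is instead taken to be the negative log-likelihood $-\log P(D_n|f)$, the same formula reduces to the tempered Bayes update.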

Jeffreys Updates

TODO