# Beyond Bayes

Context: We want to *learn* an appropriate function $f$ given samples from a dataset $D_n = \{(X_i, Y_i)\}_{i=1}^n$.

Turns out, you can do better than the naive Bayes update,

$$P(f \mid D_n) = \frac{P(D_n \mid f)\, P(f)}{P(D_n)}.$$

# Tempered Bayes

Introduce an inverse temperature, $\beta$, to get the *tempered Bayes update* [1]:

$$P_\beta(f \mid D_n) = \frac{P(D_n \mid f)^\beta\, P(f)}{\int P(D_n \mid f')^\beta\, P(f')\, \mathrm{d}f'}.$$

At first glance, this looks unphysical. Surely $P(A|B)^{\beta}\ P(B) = P(A, B)$ only when $\beta=1$?

If you're one for handwaving, you might accept that this is just a convenient way to vary how much weight is put on the prior versus the data. In any case, the tempered posterior is *proper* (it integrates to one) as long as the untempered posterior is [2].
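To see that dial explicitly, take logs of the tempered update above (the additive constant absorbs the normalizer and does not depend on $f$):

$$\log P_\beta(f \mid D_n) = \beta \log P(D_n \mid f) + \log P(f) + \mathrm{const},$$

so $\beta < 1$ shrinks the influence of the data relative to the prior, while $\beta > 1$ amplifies it.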

If you're feeling more thorough, think about the *information*. Introducing an inverse temperature simply rescales the number of bits contained in the distribution: writing the information content of a sample as $I(X, Y|f) = -\log P(X, Y|f)$, tempering gives $P(X, Y|f)^\beta = \exp\{-\beta\, I(X, Y|f)\}$.
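To make this concrete, here is a minimal numerical sketch of the tempered update on a toy coin-flipping problem; the grid of hypotheses, the uniform prior, and the simulated flips are all illustrative assumptions rather than anything from the text.

```python
import numpy as np

# Tempered Bayes on a grid: P_beta(f | D_n) ∝ P(D_n | f)^beta * P(f),
# where the "functions" f are candidate biases theta of a Bernoulli coin.
rng = np.random.default_rng(0)

thetas = np.linspace(0.01, 0.99, 99)                    # candidate hypotheses f
log_prior = np.full_like(thetas, -np.log(len(thetas)))  # uniform prior P(f)

flips = rng.binomial(1, 0.7, size=20)                   # D_n: 20 flips of a 0.7-biased coin
heads, tails = flips.sum(), len(flips) - flips.sum()

def tempered_posterior(beta):
    """Return P_beta(f | D_n) on the grid, normalized to sum to one."""
    log_lik = heads * np.log(thetas) + tails * np.log(1.0 - thetas)
    log_post = beta * log_lik + log_prior   # beta rescales the data's information
    log_post -= log_post.max()              # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

for beta in (0.1, 1.0, 10.0):
    post = tempered_posterior(beta)
    print(f"beta = {beta:>4}: posterior mean of theta = {(thetas * post).sum():.3f}")
```

Small $\beta$ leaves the posterior close to the (flat) prior; large $\beta$ concentrates it on the empirical frequency of heads.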

TODO: Check out Grünwald's Safe Bayes papers

# Generalized Bayes

If you're feeling even bolder, you might replace the likelihood with a general loss term, $\ell_{\beta, n}(f)$, which measures performance on your dataset $D_n$,

$$P(f \mid D_n) = \frac{e^{-\ell_{\beta, n}(f)}\, P(f)}{Z_n},$$

where we write the normalizing constant or partition function as $Z_n$ to emphasize that it isn't really an "evidence" anymore.

The most natural choice for $\ell_{\beta,n}$ turns the posterior into a Gibbs measure:

$$\ell_{\beta, n}(f) = n \beta\, r_n(f),$$

where $r_n$ is the empirical risk of classical machine learning. You can turn any function into a probability this way: negate, exponentiate, and normalize.
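As a toy illustration of that last point, here is a sketch of the Gibbs posterior over a grid of constant predictors under squared-error loss; the data, the model class, and the loss are all assumptions made purely for the example.

```python
import numpy as np

# Gibbs posterior: exponentiate the negated, scaled empirical risk and
# normalize, turning an ordinary loss into a probability over models.
rng = np.random.default_rng(0)

Y = 2.0 + 0.5 * rng.normal(size=50)            # toy targets centered at 2.0
candidates = np.linspace(-1.0, 5.0, 121)       # model class: constant predictors f(x) = c
log_prior = np.full_like(candidates, -np.log(len(candidates)))  # uniform P(f)

def empirical_risk(c):
    """r_n(f): mean squared error of the constant predictor f(x) = c."""
    return np.mean((Y - c) ** 2)

def gibbs_posterior(beta):
    """P(f | D_n) ∝ exp(-n * beta * r_n(f)) * P(f)."""
    n = len(Y)
    risks = np.array([empirical_risk(c) for c in candidates])
    log_w = -n * beta * risks + log_prior
    log_w -= log_w.max()                       # subtract the max before exponentiating
    w = np.exp(log_w)
    return w / w.sum()                         # the division is Z_n

post = gibbs_posterior(beta=1.0)
print("Gibbs posterior mean of c:", float((candidates * post).sum()))  # ≈ mean(Y)
```

Any other loss can be dropped into `empirical_risk` without changing the rest, which is the whole appeal of the generalized update.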

# Jeffreys Updates

TODO