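Context: We want to learn an appropriate function $f$ given samples from a dataset $D_n$ of $n$ samples.
Turns out, you can do better than the naive Bayes update,

$$p(f \mid D_n) = \frac{p(D_n \mid f)\, p(f)}{p(D_n)}.$$

Introduce an inverse temperature, $\beta$, to get the tempered Bayes update:

$$p_\beta(f \mid D_n) = \frac{p(D_n \mid f)^\beta\, p(f)}{p_\beta(D_n)}, \qquad p_\beta(D_n) = \int p(D_n \mid f)^\beta\, p(f)\, \mathrm{d}f.$$

At first glance, this looks unphysical. Surely $p(D_n \mid f)^\beta\, p(f) = p(D_n, f)$ only when $\beta = 1$?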
If you're one for handwaving, you might simply accept that this is a convenient way to interpolate between putting more weight on the prior ($\beta < 1$) and more weight on the data ($\beta > 1$). In any case, the tempered posterior is proper (it can be normalized to integrate to one) as long as the untempered posterior is.
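To see the interpolation concretely, here is a minimal sketch of the tempered update on a finite grid of hypotheses. The coin-flip setup, the grid `theta`, the uniform prior, and the function name `tempered_posterior` are all illustrative assumptions, not part of the original argument.

```python
import numpy as np

def tempered_posterior(log_lik, prior, beta):
    """Tempered Bayes update on a finite grid: posterior ∝ likelihood**beta * prior."""
    log_post = beta * log_lik + np.log(prior)
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()    # normalise so the weights sum to one

# Hypothetical toy problem: coin bias on a grid, 7 heads observed in 10 flips.
theta = np.linspace(0.05, 0.95, 19)
prior = np.full(theta.size, 1.0 / theta.size)  # uniform prior (an assumption)
heads, flips = 7, 10
log_lik = heads * np.log(theta) + (flips - heads) * np.log1p(-theta)

for beta in (0.0, 0.5, 1.0, 4.0):
    post = tempered_posterior(log_lik, prior, beta)
    print(f"beta={beta}: total mass={post.sum():.3f}, largest weight={post.max():.3f}")
```

At `beta=0` the data are ignored and the prior is returned unchanged; at `beta=1` this is the ordinary Bayes update; larger values concentrate the mass around the best-fitting hypotheses.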
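If you're feeling more thorough, think about the information. Introducing an inverse temperature simply rescales the information content (the number of bits, or nats with a natural logarithm) that the data carry under $f$. Writing $I(D_n \mid f) = -\log p(D_n \mid f)$, we have $p(D_n \mid f)^\beta = \exp\{-\beta\, I(D_n \mid f)\}$.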
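A short worked equation makes the rescaling explicit. It assumes (this is not stated above) that the $n$ samples $z_1, \dots, z_n$ making up the dataset $D_n$ are i.i.d. given $f$, so the likelihood factorizes, and it writes $I(z \mid f) = -\log p(z \mid f)$ for the surprisal of a single sample:

$$-\log\, p(D_n \mid f)^\beta = -\beta \sum_{i=1}^{n} \log p(z_i \mid f) = \beta \sum_{i=1}^{n} I(z_i \mid f).$$

Under that assumption, every sample's surprisal is counted $\beta$ times, so tempering loosely acts like changing the effective number of samples from $n$ to $\beta n$.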
TODO: Check out Grünwald's Safe Bayes papers
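If you're feeling even bolder, you might replace the likelihood with a general loss term, $\ell_{\beta,n}(f)$, which measures performance on your dataset $D_n$,

$$p_\beta(f \mid D_n) = \frac{\ell_{\beta,n}(f)\, p(f)}{Z_n(\beta)}, \qquad Z_n(\beta) = \int \ell_{\beta,n}(f)\, p(f)\, \mathrm{d}f,$$

where we write the normalizing constant or partition function as $Z_n(\beta)$ to emphasize that it isn't really an "evidence" anymore.
The most natural choice for $\ell_{\beta,n}$ is the Gibbs measure:

$$\ell_{\beta,n}(f) = \exp\{-\beta\, r_n(f)\},$$

where $r_n$ is the empirical risk of classical machine learning. Exponentiating and normalizing like this, you can turn any (suitably integrable) function into a probability.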
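As a rough illustration of turning a risk into a probability, the sketch below builds a Gibbs posterior over a small finite hypothesis class. The squared-error risk, the slope-only linear models, the toy data, and the names `empirical_risk` and `gibbs_posterior` are assumptions made for the example, not choices taken from the text.

```python
import numpy as np

def empirical_risk(predict, xs, ys):
    """Mean squared error on the dataset (one possible choice of empirical risk)."""
    return float(np.mean((predict(xs) - ys) ** 2))

def gibbs_posterior(risks, prior, beta):
    """Weights proportional to exp(-beta * risk) * prior, divided by the partition function."""
    log_w = -beta * np.asarray(risks) + np.log(prior)
    log_w -= log_w.max()  # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()    # w.sum() plays the role of the partition function

# Hypothetical hypothesis class: lines through the origin with different slopes.
slopes = np.array([0.0, 1.0, 2.0, 3.0])
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs  # noiseless toy data generated with slope 2
risks = [empirical_risk(lambda x, a=a: a * x, xs, ys) for a in slopes]
prior = np.full(slopes.size, 1.0 / slopes.size)  # uniform prior (an assumption)

for beta in (0.0, 1.0, 10.0):
    print(beta, gibbs_posterior(risks, prior, beta).round(3))
```

No likelihood appears anywhere here: any risk that can be evaluated on the data yields a distribution over hypotheses, with `beta` controlling how sharply it concentrates on the low-risk ones.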