You can turn any function into a probability

Machine learning has gotten sloppy over the years.

It used to be that we thought carefully about the theoretical underpinnings of our models and proceeded accordingly.

We used the L2 loss in regression because, under maximum likelihood estimation, it follows from the assumption that our samples are distributed according to i.i.d. Gaussian noise around some underlying deterministic function, $f_w$. If we have a likelihood defined as

\begin{align}
p(D_n|w) = \prod_{i=1}^n p(x_{i}, y_{i}|w),\quad&\text{where}\quad p(x_{i}, y_{i}|w) = \mathcal N(y_{i}|f_{w}(x_{i}), \sigma^2),\\
\implies\underset{w}{\text{argmax}}\ p(D_n|w)&=\underset{w}{\text{argmin}}\ \sum\limits_{i=1}^{n} (y_{i} - f_w(x_i))^2,
\end{align}

where $w$ are the weights parametrizing our model.
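As a sanity check, here's a minimal numerical sketch of that equivalence, assuming numpy, toy data, and a made-up one-parameter model $f_w(x) = wx$: the Gaussian log-likelihood is an affine, decreasing function of the L2 loss, so maximizing one means minimizing the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + Gaussian noise, fit with a one-parameter model f_w(x) = w * x.
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + rng.normal(scale=0.3, size=50)
sigma2 = 0.3 ** 2

def l2_loss(w):
    return np.sum((y - w * x) ** 2)

def gaussian_log_likelihood(w):
    # log p(D_n | w) under i.i.d. Gaussian noise with variance sigma^2:
    # an affine, decreasing function of l2_loss(w).
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - w * x) ** 2 / (2 * sigma2))

ws = np.linspace(0.0, 4.0, 1001)
w_mle = ws[np.argmax([gaussian_log_likelihood(w) for w in ws])]
w_l2 = ws[np.argmin([l2_loss(w) for w in ws])]
print(w_mle, w_l2)  # the same grid point: maximizing the likelihood = minimizing the L2 loss
```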

We'd squeeze in some weight decay because, when performing maximum a posteriori estimation, it was equivalent to having a Gaussian prior over our weights, $\varphi(w) = \mathcal N(w\mid 0, \lambda^{-1})$. For the same likelihood as above (with $\sigma^2$ absorbed into $\lambda$),

\underset{w}{\text{argmax}}\ p(D_n|w)\,\varphi(w) = \underset{w}{\text{argmin}}\ \left(\sum\limits_{i=1}^{n}(y_i-f_w(x_{i}))^{2}\right) + \lambda \|w\|^2.
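Here's the same kind of sketch for the MAP case, under the same toy setup (numpy, $f_w(x) = wx$, and unit noise variance so the constants line up exactly): the log-posterior and the L2-plus-weight-decay objective pick out the same $w$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + rng.normal(scale=1.0, size=50)  # unit noise variance, sigma^2 = 1
lam = 0.5  # weight decay strength = precision of the Gaussian prior N(0, 1/lam)

def ridge_objective(w):
    # L2 loss plus weight decay.
    return np.sum((y - w * x) ** 2) + lam * w ** 2

def log_posterior(w):
    # log p(D_n | w) + log phi(w), dropping w-independent constants;
    # with sigma^2 = 1 this is exactly -ridge_objective(w) / 2.
    return -0.5 * np.sum((y - w * x) ** 2) - 0.5 * lam * w ** 2

ws = np.linspace(0.0, 4.0, 1001)
w_map = ws[np.argmax([log_posterior(w) for w in ws])]
w_ridge = ws[np.argmin([ridge_objective(w) for w in ws])]
print(w_map, w_ridge)  # the same grid point: MAP = penalized least squares
```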

Nowadays, you just choose a loss function and twiddle with the settings until it works. Granted, this shift away from Bayesian-grounded techniques has given us a lot of flexibility. And it actually works (unlike much of the Bayesian project, which turns out to be disgustingly intractable).

But when you're a theorist trying to catch up to the empirical results, it turns out the Bayesian frame is rather useful. So we want a principled way of recovering probability distributions from arbitrary choices of loss function. Fortunately, this is possible.

The trick is simple: multiply your loss, $r_n(w)$, by some parameter $\beta$, whose units are such that $\beta\, r_n(w)$ is measured in bits. Now negate, exponentiate, and normalize, and out pops a probability distribution:

p(D_n|w) = \frac{e^{-\beta\, r_{n}(w)}}{Z_n},

where $Z_n$ is the normalizing constant.

This puts a probability distribution over the function space we want to model, and from it we can extract probabilities over outputs for given inputs.
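To make that concrete, here's a rough sketch under some illustrative assumptions (numpy, toy data, a Huber loss standing in for "an arbitrary loss", $\beta = 1$, and a crude grid discretization of the one-parameter function space $f_w(x) = wx$): exponentiating the negated, scaled loss and normalizing gives a distribution over candidate weights, which we can then push through the model to get a distribution over outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + rng.normal(scale=0.3, size=50)

def huber(residual, delta=1.0):
    # Any loss will do; Huber is just an example with no Gaussian story behind it.
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def r_n(w):
    # Empirical risk of the one-parameter model f_w(x) = w * x.
    return np.sum(huber(y - w * x))

beta = 1.0  # inverse temperature: larger beta concentrates mass on low-loss weights

# Crude discretization of "function space": a grid of candidate weights.
ws = np.linspace(-1.0, 5.0, 601)
log_weights = -beta * np.array([r_n(w) for w in ws])
log_weights -= log_weights.max()   # subtract the max before exponentiating, for stability
probs = np.exp(log_weights)
probs /= probs.sum()               # dividing by the sum plays the role of Z_n

print(ws[np.argmax(probs)])        # the mode sits at the empirical risk minimizer
print(probs @ ws, probs @ ws**2)   # ...but we also get moments, credible sets, etc.

# Probabilities over outputs for a given input: push the distribution over w
# through the model to get a distribution over f_w(x_new).
x_new = 0.5
print(probs @ (ws * x_new))        # mean prediction at x_new under that distribution
```

The mode of this distribution is just the usual empirical risk minimizer, but the rest of the distribution is new information: how sharply the loss landscape concentrates around it, and what spread that implies for predictions.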