Review of Path Dependence
Outline
 Recent progress in the theory of neural networks (MacAulay, 2019) (20 mins)
 In the infinite-width limit:
 Neal 1994: With Gaussian-distributed weights, NNs limit to Gaussian processes.
 Novak 2018: Extension to CNNs
 Yang 2019: And other NNs
 Infinite-width limit → Neural Tangent Kernel (like the first-order Taylor expansion of a NN about its initial parameters)
 "The works above on the infinite-width limit explain, to some extent, the success of SGD at optimizing neural nets, because of the approximately linear nature of their parameter space."
 PAC-Bayes bounds. Binarized MNIST, link with compression, and more. Bounds as a way of testing your theories.
 In the infinite-width limit:
 Influence functions (sensitivity to a change in training examples) & adversarial training examples[1]
 We're interested in how the optimal solution depends on $\theta$, so we define $w_* = r(\theta)$ to emphasize the functional dependency. The function $r(\theta)$ is called the response function, or rational reaction function, and the Implicit Function Theorem (IFT) gives conditions under which $r$ is well-defined and differentiable.
 Low path dependence: that one paper that adds adversarial noise, trains on it, then removes the adversarial noise (see Distill thread).
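The "first-order Taylor expansion" picture behind the NTK can be illustrated numerically. The sketch below is my own toy construction (a one-hidden-layer `tanh` net with hand-computed gradients, not an infinite-width model): near initialization, the linearized model closely tracks the full network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer net: f(x; W, a) = a . tanh(W x), for a fixed input x
W0, a0 = rng.normal(size=(64, 4)), rng.normal(size=64) / 8.0
x = rng.normal(size=4)

def f(W, a):
    return a @ np.tanh(W @ x)

# Gradients at initialization, computed by hand for this tiny model
h = np.tanh(W0 @ x)
grad_a = h                              # df/da
grad_W = np.outer(a0 * (1 - h**2), x)   # df/dW

def f_lin(W, a):
    """First-order Taylor expansion of f about (W0, a0)."""
    return f(W0, a0) + grad_a @ (a - a0) + np.sum(grad_W * (W - W0))

# For a small parameter perturbation, the linearized model tracks the net
dW, da = 0.01 * rng.normal(size=W0.shape), 0.01 * rng.normal(size=64)
print(abs(f(W0 + dW, a0 + da) - f_lin(W0 + dW, a0 + da)))  # second-order small
```

The NTK results concern the regime where this approximation stays good throughout training, which the infinite-width limit guarantees; at finite width it only holds near initialization.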
How much do the models we train depend on the paths they follow through weight space?
Should we expect to always get the same models for a given choice of hyperparameters and dataset? Or do the outcomes depend heavily on quirks of the training process, such as the weight initialization and batch schedule?
If models are highly path-dependent, it could make alignment harder: we'd have to keep a closer eye on our models during training for chance forays into deception. Or it could make alignment easier: if alignment is unlikely by default, then increasing the variance in outcomes increases our odds of success.
Vivek Hebbar and Evan Hubinger have already explored the implications of path dependence for alignment here. In this post, I'd like to take a step back and survey the machine learning literature on path dependence in current models. In future posts, I'll present a formalization of Hebbar and Hubinger's definition of "path dependence" that will let us start running experiments.
Evidence Of Low Path Dependence
Symmetries of Neural Networks
At a glance, neural networks appear to be rather path-dependent. Train the same network twice from different initial states or with different choices of hyperparameters, and the final weights will end up looking very different.
But weight space doesn't tell us everything. Neural networks are full of internal symmetries that allow a given function to be implemented by different choices of weights.
For example, you can permute two columns in one layer, and as long as you permute the corresponding rows in the next layer, the overall computation is unaffected. Likewise, with ReLU networks, you can scale up the inputs to a layer as long as you scale the outputs accordingly, i.e.,
$\text{ReLU}(x) = \alpha\, \text{ReLU}\left(\frac{x}{\alpha}\right),$
for any $\alpha > 0$^{1}. There are also a handful of "nongeneric" symmetries, which the model is free to build or break during training. These correspond to, for example, flipping the orientations of the activation boundaries^{2} or interpolating between two degenerate neurons that have the same ingoing weights.
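Both generic symmetries are easy to verify numerically. A minimal sketch (a two-layer ReLU net on a single input; all names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer ReLU net: f(x) = W2 @ relu(W1 @ x)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
x = rng.normal(size=4)
relu = lambda z: np.maximum(z, 0)
f = lambda A, B: B @ relu(A @ x)

# Permutation symmetry: reorder hidden units (rows of W1, columns of W2)
perm = rng.permutation(8)
print(np.allclose(f(W1, W2), f(W1[perm], W2[:, perm])))  # True

# Scaling symmetry: relu(z) = alpha * relu(z / alpha) for any alpha > 0
alpha = 3.7
print(np.allclose(f(W1, W2), f(W1 / alpha, W2 * alpha)))  # True
```

Either transformation changes the weights without changing the computed function.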
Formally, if we treat a neural network as a mapping $f: \mathcal X \times \mathcal W \to \mathcal Y$, parametrized by weights $\mathcal W$, these internal symmetries mean that the mapping $\mathcal W \ni w \mapsto f_{w} \in \mathcal F$ is non-injective, where $f_{w}(x):=f(x, w)$.^{3}
What we really care about is measuring similarity in $\mathcal F$. Unfortunately, this is an infinite-dimensional space, so we can't fully resolve where we are in it from any finite set of samples. So we have a trade-off: we can either perform our measurements in $\mathcal W$, where we have full knowledge but fully correcting for symmetries is intractable, or we can perform our measurements in $\mathcal F$, where we lack full knowledge but the symmetries are no longer a concern.
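A small illustration of the trade-off (my own toy setup): two weight vectors that are far apart in $\mathcal W$ can be indistinguishable in $\mathcal F$, where "distance" must be estimated from samples.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

def net(params, X):
    W1, W2 = params
    return relu(X @ W1.T) @ W2.T

# Two weight settings far apart in weight space that implement the same function
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
perm = np.roll(np.arange(8), 1)            # cyclic hidden-unit permutation
theta_a, theta_b = (W1, W2), (W1[perm], W2[:, perm])

w_dist = np.linalg.norm(W1 - W1[perm])     # large in weight space...

# ...while a Monte Carlo estimate of the function-space distance is ~0
X = rng.normal(size=(512, 4))
f_dist = np.abs(net(theta_a, X) - net(theta_b, X)).max()
print(w_dist, f_dist)
```

The sample-based estimate in $\mathcal F$ is only as trustworthy as the input distribution we sample from, which is the "lack of full knowledge" in the trade-off above.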
Figure 1. A depiction of some of the symmetries of NNs.
Correcting for Symmetries
Permutation-adjusted Symmetries
Ainsworth, Hayase, and Srinivasa [2] find that, after correcting for permutation symmetries, the weights of independently trained networks are connected by a linear path with a close-to-zero loss barrier ("linear mode connectivity"). In other words, you can linearly interpolate between the permutation-corrected weights, and every point along the interpolation has essentially the same loss. They conjecture that, after correcting for these symmetries, there is only one global basin.
In general, correcting for these symmetries is NP-hard, so the authors' argument depends on several approximate schemes for finding the permutations [2].
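One of those approximate schemes pairs up hidden units across the two networks to maximize total weight similarity, which is a linear assignment problem. The sketch below is my own simplification (one layer, with a known ground-truth permutation standing in for a second training run; it assumes SciPy is available), not the full method of [2], which alternates a step like this over all layers:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # assumed dependency

rng = np.random.default_rng(0)

# Stand-ins for one layer of two "independently trained" nets: here net B
# is a hidden-unit permutation of net A plus a little noise, so we know
# the ground truth. A real experiment would train both from scratch.
A = rng.normal(size=(16, 32))
B = A[rng.permutation(16)] + 0.01 * rng.normal(size=A.shape)

# One weight-matching step: pair up hidden units to maximize the total
# inner product between matched units (a linear assignment problem)
cost = -(A @ B.T)              # cost[i, j] = -<unit i of A, unit j of B>
_, cols = linear_sum_assignment(cost)
B_aligned = B[cols]

# After alignment, linear interpolation stays near both endpoints
mid = 0.5 * (A + B_aligned)
print(np.abs(B_aligned - A).max())  # small: the permutation was recovered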
Universal Circuits
Some of the most compelling evidence for the low path-dependence world comes from the circuits-style research of Olah and collaborators. Across a range of computer vision models (AlexNet, InceptionV1, VGG19, ResNetV2-50), the circuits thread [3] finds features common to all of them, such as curve and high-low frequency detectors [4], branch specialization [5], and weight banding [6]. More recently, the transformer circuits thread [7] has found universal features in language models, such as induction heads and bumps [8]. This is path independence at the highest level: regardless of architecture, hyperparameters, and initial weights, different models learn the same things. In fact, low path dependence ("universality") is often taken as a starting point for research on transparency and interpretability [4].
Universal circuits of computer vision models [4].
ML as Bayesian Inference

$P_\beta(f)$ is the probability that $M$ expresses $f$ on $D$ upon a randomly sampled parametrization. This is our "prior"; it's what our network expresses on initialization.

$V_\beta(f)$ is a volume with Gaussian measure that equals $P_\beta(f)$ under Gaussian sampling of network parameters.
 This is a bit confusing: we're not talking about a contiguous region of parameter space, but a collection of variously distributed points and lower-dimensional manifolds. Mingard never explicitly says why we should expect a contiguous volume; then again, maybe contiguity isn't necessary.

$\beta$ denotes the "Bayesian prior"

$P_\text{opt}(f \mid S)$ is the probability of finding $f$ on $E$ under a stochastic optimizer like SGD trained to 100% accuracy on $S$.

$P_\beta(f \mid S) = \frac{P(S \mid f)\, P_\beta(f)}{P_\beta(S)}$ is the probability of finding $f$ on $E$ upon randomly sampling parameters from i.i.d. Gaussians to get 100% accuracy on $S$.
 This is what Mingard et al. call "Bayesian inference"
 $P(S \mid f)=1$ if $f$ is consistent with $S$ and $0$ otherwise
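To make the definitions concrete, here is a rejection-sampling toy of my own construction (not Mingard et al.'s actual experiments, which use real datasets and architectures): estimate the prior $P_\beta(f)$ by sampling i.i.d. Gaussian parameters, then condition on fitting $S$ to get the posterior, using the 0/1 likelihood above.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy setup: binary functions on 3 inputs; S fixes the first two labels
X = np.array([[0., 0.], [0., 1.], [1., 0.]])
S = {0: 0, 1: 1}   # training set S: labels for X[0] and X[1]

def sample_f():
    # f = thresholded output of a tiny net with i.i.d. Gaussian parameters
    W, a = rng.normal(size=(4, 2)), rng.normal(size=4)
    return tuple((a @ np.tanh(W @ X.T) > 0).astype(int))

draws = [sample_f() for _ in range(20_000)]
prior = Counter(draws)   # estimates P_beta(f)

# Bayes rule with P(S|f) in {0, 1}: just condition on fitting S exactly
consistent = [f for f in draws if all(f[i] == y for i, y in S.items())]
posterior = Counter(consistent)   # estimates P_beta(f | S)
```

Mingard et al.'s claim is then that $P_\text{opt}(f \mid S)$, the distribution SGD actually samples from, closely matches this conditioned distribution.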

Double descent & Grokking

Mingard et al.'s work on NNs as Bayesian learners.
Evidence Of High Path Dependence
 Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches (Reimers & Gurevych, 2018)
 Deep Reinforcement Learning Doesn’t Work Yet (Irpan, 2018)
 Lottery Ticket Hypothesis
BERTs of a feather do not generalize together
Footnotes

This breaks down in the presence of regularization if two consecutive layers have different widths. ↩

For a given set of weights which sum to $0$, you can flip the signs of each of the weights, and the sum stays $0$. ↩

The claim of singular learning theory is that this non-injectivity is fundamental to the ability of neural networks to generalize. Roughly speaking, "simple" models that generalize better take up more volume in weight space, which makes them easier to find. ↩