Scaling Laws of Path Dependence

NOTE: Work in Progress

How much do the models we train depend on the path they follow through weight space?

Should we expect to always get the same models for a given choice of hyperparameters and dataset? Or do the outcomes depend highly on quirks of the training process, such as the weight initialization and batch schedule?

If models are highly path-dependent, it could make alignment harder: we'd have to keep a closer eye on our models during training for chance forays into deception. Or it could make alignment easier: if alignment is unlikely by default, then increasing the variance in outcomes increases our odds of success.

Vivek Hebbar and Evan Hubinger have already explored the implications of path-dependence for alignment here. In this post, I'd like to formalize "path-dependence" in the language of dynamical systems, and share the results of experiments inspired by this framing.

From these experiments, we find novel scaling laws relating choices of hyperparameters like model size, learning rate, and momentum to the trajectories of these models during training. We find functional forms that provide a good empirical fit to these trajectories across a wide range of hyperparameter choices, datasets, and architectures.

Finally, we study a set of toy models that match the observed scaling laws, which gives us insight into the possible mechanistic origins of these trends, and a direction towards building a theory of the dynamics of training.

Formalizing Path-dependence

Hebbar and Hubinger define path-dependence as "the sensitivity of a model's behavior to the details of the training process and training dynamics." Let's make that definition more precise. To start, let us formalize the learning process as a dynamical system over weight space.

In the context of machine learning, our goal is (usually) to model some "true" function $f^* : \mathcal X \to \mathcal Y$.[^1] To do this, we construct a neural network, $f: \mathcal X \times \mathcal W \to \mathcal Y$, which induces a model of $f^*$ upon fixing a choice of weights, $w \in \mathcal W \subset \mathbb R^D$. For convenience, we'll denote the resulting model with a subscript, i.e., $\mathcal F \ni f_{w}(x) := f(x, w)$.[^2]

To find the weights, $w_\text{opt}$, such that $f_\text{opt} := f_{w_\text{opt}}$ is as "close" as possible to $f^*$, we specify a loss function, $L_n$, which maps a choice of parameters and a set of training examples, $\mathbf d_\text{train} = \{(x_{i},y_i)\}_{i=1}^n \in \mathcal Z^n$, to a scalar "loss":[^3]

$$L_n: \mathcal W \times \mathcal Z^n \to \mathbb R.$$

For a fixed choice of dataset, we denote the empirical loss $L_\text{train}(w) = L(w, \mathbf d_\text{train})$. Analogously, we define a test loss, $L_\text{test}$, over a corresponding test set, and a batch loss, $L^{(t)}_{\mathbf b}(w)$, for a batch $\mathbf b^{(t)} \subset \mathbf d_\text{train}$.

Next, we choose an optimizer, $\Phi$, which iteratively updates the weights (according to some variant of SGD),

$$\begin{align}
\Phi: \mathcal W \times \mathcal B \times \mathcal H &\to \mathcal W \\
(w^{(t)},\ \mathbf b^{(t)},\ h) &\mapsto w^{(t+1)}.
\end{align}$$

The optimizer depends on some hyperparameters, $h \in \mathcal H$, and a learning schedule, $\mathbf b = \{\mathbf b^{(t)}\}_{t=1}^T$, of $T$ batches, $\mathbf b^{(t)} \in \mathcal B$. Note that this learning schedule covers all $N_\text{epochs}$ epochs: if an epoch consists of $T_\text{epochs}$ batches, then $T / T_\text{epochs} = N_\text{epochs}$. Within any given epoch, the batches are disjoint, but across epochs, repetition is allowed (though typically no batch will be repeated sample for sample).

If we take the learning schedule and hyperparameters as constant, we obtain a discrete-time dynamical system over parameter space, which we denote $\Phi_{\mathbf b, h}: \mathcal W \to \mathcal W$.
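To make this framing concrete, here's a minimal sketch (in PyTorch) of training viewed this way: `sgd_step` plays the role of a single application of $\Phi_{\mathbf b, h}$, and iterating it over a fixed batch schedule traces out the trajectory $w^{(0)}, w^{(1)}, \dots, w^{(T)}$. The model, the cross-entropy loss, and the hyperparameter names are illustrative placeholders, not the exact setup used in the experiments below.

```python
# Minimal sketch: a single SGD update as the map Phi_{b,h} : W -> W.
# The model, loss, and hyperparameter dict are illustrative placeholders.
import torch


def sgd_step(model, batch, h):
    """One application of Phi: map the current weights w^(t) to w^(t+1)."""
    xs, ys = batch
    loss = torch.nn.functional.cross_entropy(model(xs), ys)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= h["lr"] * g  # vanilla SGD; momentum, weight decay, etc. would extend this
    return model


def train(model, batch_schedule, h):
    """Iterate Phi over a fixed batch schedule b = (b^(1), ..., b^(T))."""
    for batch in batch_schedule:
        model = sgd_step(model, batch, h)
    return model
```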

Two Kinds of Path Dependence

Given this dynamical system, $\Phi_{\mathbf b, h}$, over weight space, let us contrast two kinds of path dependence:

  1. Global path dependence. Given some distribution over starting weights, $p(w^{(0)})$, hyperparameters, $p(h)$, or batches, $p(\mathbf b)$, what is the resulting distribution over final weights, $p(w^{(T)})$?
  2. Local path dependence. For some choice of initial weights, $w^{(0)}$, hyperparameters, $h$, or batches, $\mathbf b$, and a small perturbation (of size $\epsilon \ll 1$) to one of these values, how "different" does the perturbed model end up being from the baseline, unperturbed model? In the limit of infinitesimal perturbations, this reduces to computing derivatives of the form $\frac{\partial w^{(T)}}{\partial w^{(0)}}$, $\frac{\partial w^{(T)}}{\partial h}$, and $\frac{\partial w^{(T)}}{\partial \mathbf b^{(t)}}$.

Though we ultimately want to make global statements about path dependence, the distributions involved are generally intractable, which forces us to resort to empirical approximations. In the local view, we rarely care about specific perturbations, so we end up studying the relation between a similarly intractable distribution over perturbations, $p(\delta)$, and $p(w^{(T)})$. Still, the local case is much easier to explore empirically because of its smaller support.

Continuous vs. Discrete Hyperparameters

Below is a table of the various hyperparameters we encounter in our model, weight initializer, optimizer, and dataloader. Here, we're taking "hyperparameter" in the broadest sense, incorporating even the dataset and the choice of model architecture.

Hyperparameters $(\mathbf h)$:

$$\begin{array}{c|c|c|c}
\text{Model } (\mathbf{h}_{m}) & \text{Weight Initializer } (\mathbf{h}_\text{init.}) & \text{Optimizer } (\mathbf h_\Phi) & \text{Dataloader } (\mathbf h_\text{dl}) \\
\hline
\text{"type" (FC, ResNet, GPT-N, etc.)} & \text{"type" (KH normal/uniform, etc.)} & \text{"type" (SGD, Adam, etc.)} & \mathbf{d}, \mathbf{d}_\text{train}, \mathbf{d}_\text{test} \\
\hookrightarrow \text{FC}: \{n^{(l)}_\text{hidden}\}_{l=1}^{n_\text{layers}} & \text{seeds } (w_\text{ref},\ \delta) & \hookrightarrow \beta_{1}, \beta_{2}, \lambda, \epsilon, \dots & \text{shuffle seed} \\
\hookrightarrow \text{ResNet}: n_\text{layers}=18, 34, \dots & |w^{(0)}|,\ \epsilon & \eta & T_\text{batch}
\end{array}$$

To study local perturbations, we're interested in those hyperparameters that we can vary continuously. E.g., we can gradually increase the magnitude $\epsilon$ of the perturbation applied to our baseline model, as well as the momenta, regularization coefficients, and learning rates of the optimizer, but we can't smoothly vary the number of layers in our model or the type of optimizer we're using.

From weights to functions

The problem with studying dynamics over $\mathcal W$ is that we don't actually care about $w \in \mathcal W$. Instead, we care about the evolution in function space, $\mathcal F$, which results from the mapping $m: \mathcal W \ni w \mapsto f_{w} \in \mathcal F$.

We care about this evolution because the mapping to function space is non-injective; the internal symmetries of $m$ ensure that different choices of $w \in \mathcal W$ can map to the same function $f \in \mathcal F$.[1] As a result, distance in $\mathcal W$ can be misleading as to the similarity of the resulting functions.

The difficulty with function space is that it is infinite-dimensional, so studying it is that much more intractable than studying weight space. In practice, then, we have to estimate where we are in function space from a finite number of samples of input-output pairs, for which we typically use the training and test sets.

However, even with full knowledge of $\mathcal W$, we can't resolve exactly where we are in $\mathcal F$ from any finite set of samples. At best, we can resolve the level set in $\mathcal F$ with a certain empirical performance (like training/test loss). This means our sampling procedure acts as a second kind of non-injective map, $\mathcal F \to \mathcal F_\text{obs} = (\mathcal X \times \mathcal Y)^n$.

A note on optimizer state: if we want to be exhaustive, we should recall that optimizers may carry internal state, $s \in \mathcal S$. E.g., for Adam, $s_{t} = (m_t, v_t)$, running averages of the first and second moments of the gradients, respectively. Then, the dynamical system over $\mathcal W$ is more appropriately a dynamical system over the joint space $\mathcal W \times \mathcal S$. Completing this extension will have to wait for another day; we'll ignore it for now for the sake of simplicity and because, in any case, $s_t$ tends to be a deterministic function of $\mathbf b$ and $h$.

Metrics:

$$\begin{array}{c|c}
\mathcal W & \mathcal F \\
\hline
d^{(p)}_{W}(w, w') = \|w - w'\|_{p} & L_{\text{cf.}}(f_{w}, f_{w'}) = \frac{1}{N}\sum\limits_{i=1}^{N} \ell\left(f_{w}(x_i), f_{w'}(x_{i})\right) \\
S_{C}(w, w') = \frac{w \cdot w'}{|w|\,|w'|} &
\end{array}$$
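As a rough sketch of how these metrics are computed in practice, the snippet below measures the weight-space distance, the cosine similarity, and a finite-sample estimate of the comparison loss $L_\text{cf.}$ on a batch of inputs. The choice of cross-entropy against the other model's predicted labels as the comparison loss $\ell$ is an assumption for illustration; the weight vectors are assumed to be flattened parameters (e.g. via `torch.nn.utils.parameters_to_vector`).

```python
# Sketch of the metrics in the table above. `w` and `w_prime` are flattened
# weight vectors; `xs` is a batch of inputs used to estimate the function-space
# comparison loss L_cf. (the specific choice of ell here is illustrative).
import torch


def d_weights(w, w_prime, p=2):
    """Weight-space distance d_W^(p)(w, w') = ||w - w'||_p."""
    return torch.norm(w - w_prime, p=p)


def cosine_similarity(w, w_prime):
    """S_C(w, w') = (w . w') / (|w| |w'|)."""
    return torch.dot(w, w_prime) / (torch.norm(w) * torch.norm(w_prime))


def comparison_loss(f_w, f_w_prime, xs):
    """L_cf.(f_w, f_w'): average loss of f_w with f_w'(x_i) treated as ground truth."""
    with torch.no_grad():
        targets = f_w_prime(xs).argmax(dim=-1)
    return torch.nn.functional.cross_entropy(f_w(xs), targets)
```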

Experiments

In section XX, we defined path dependence in terms of the relation between $(P(B_t), P(W_0), P(H))$ and $(P(W_{T}), P(F_{T}))$. As mentioned, there are two major challenges:

  1. In general, we have to estimate these distributions empirically, and training neural networks is already computationally expensive, so we're restricted to studying smaller networks and datasets.
  2. The mapping $m: \mathcal W \to \mathcal F$ is non-straightforward because of the symmetries of $m$.

To make it easier for ourselves, let us restrict our attention to the narrower case of studying local path-dependence: we apply a (small) perturbation, $\epsilon$, to one of $(\mathbf b^{(t)}, w^{(0)}, h)$. E.g., we'll study probability densities of the kind $p(w^{(0)}) = \mathcal N(w^{(0)} \mid w_\text{baseline}, \epsilon \mathbf 1)$. For discrete variables (like the number of layers, network width, and batch schedule), there's a minimum size we can make $\epsilon$.

Within dynamical systems theory, there are two main ways to view dynamical systems:

  1. The trajectory view studies the evolution of individual points, like $(w_0, w_1, \dots, w_T)$ and $(f_1, f_2, \dots, f_T)$.
  2. The probability view studies the evolution of densities, like $(p_0(w), p_1(w), \dots, p_T(w))$ and $(p_1(f), p_2(f), \dots, p_T(f))$.

Both have something to tell us about path dependence:

  1. In the trajectory view, sensitivity of outcomes becomes a question of calculating Lyapunov exponents, the rate at which nearby trajectories diverge/converge.
  2. In the probability view, sensitivity of outcomes becomes a question of measuring autocorrelations, $\langle w_{i} w_{i+\tau}\rangle$ and $\langle f_{i} f_{i+\tau} \rangle$.

Tracking Trajectories

The experimental set-up involves taking some baseline model, $m_0$, then applying a Gaussian perturbation of norm $\epsilon$ to the initial weights to obtain a set of perturbed models, $\{m_i\}_{i=1}^{n_\text{models}}$. We train these models on MNIST (using identical batch schedules for each model).
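A sketch of that loop is below. `make_model`, `perturb`, and `sgd_step` stand in for pieces described elsewhere in this post (the perturbation scheme is detailed in the appendix); only the bookkeeping is meant literally.

```python
# Sketch of the experimental loop: perturb the baseline initialization, train
# every copy on the *same* batch schedule, and log weight-space divergence.
# `make_model`, `perturb`, and `sgd_step` are placeholders.
import copy
import torch
from torch.nn.utils import parameters_to_vector as to_vec


def run_divergence_experiment(make_model, perturb, sgd_step, batch_schedule, h,
                              n_models=10, eps=1.0):
    baseline = make_model()
    perturbed = [perturb(copy.deepcopy(baseline), eps) for _ in range(n_models)]
    history = []  # history[t][i] = d_w(m_i, m_0) after training step t
    for batch in batch_schedule:          # identical schedule for every model
        baseline = sgd_step(baseline, batch, h)
        perturbed = [sgd_step(m, batch, h) for m in perturbed]
        w0 = to_vec(baseline.parameters())
        history.append([torch.norm(to_vec(m.parameters()) - w0).item()
                        for m in perturbed])
    return history
```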

Over the course of training, we track how these models diverge via several metrics (see the next subsection). For each of these, we study how the rate of divergence varies for different choices of hyperparameters: momentum $\beta$, weight decay $\lambda$, learning rate $\eta$, hidden layer width (for one-hidden-layer models), and number of hidden layers. Moving on from vanilla SGD, we also compare adaptive variants like Adam and RMSProp.

TODO: Non-FC models. Actually calculate the rate of divergence (for the short exponential period at the start). Perform these experiments for several different batch schedules. Perturbations in batch schedule. Other optimizers besides vanilla SGD.

Measuring distance

  • $d_\mathbf{w}^{(p)}(m_i, m_0)$: the $p$-norm between the weight vectors of each perturbed model and the baseline model. We'll drop the $p$ throughout and restrict to the case $p=2$. This is the most straightforward notion of distance, but it's flawed in that it can be a weak proxy for distance in $\mathcal F$ because of the internal symmetries of the model.
  • $\frac{d_{\mathbf w}(m_{i}, m_0)}{|\mathbf w_{0}|}$: the relative $p$-norm between the weight vectors of each perturbed model and the baseline.
  • $L_\text{train,test}(m_i)$: the training/test losses.
  • $\delta L_\text{train, test}(m_{i}, m_0)$: the training/test loss relative to the baseline model $m_0$ (as a difference).
  • $f_\text{train, test}(m_i)$, $\delta f_\text{train, test}(m_i, m_0)$: the training/test set classification accuracy and relative classification accuracy.
  • $L_\text{train cf., test cf.}$ and $f_\text{train cf., test cf.}$: same as above, but taking the predictions of the baseline model as the ground truth (rather than the actual labels in the train/test set).

A few more metrics to include:

  • $d^\text{perm.}_{\mathbf w}$: the $\ell_2$ norm after adjusting for permutation symmetries (as described in Ainsworth et al.).
  • $\int_{\mathbf w_0}^{\mathbf w_{i}^\text{perm.}} L_\text{train (cf.), test (cf.)}(\mathbf w)\, \mathrm d\mathbf w$: the loss integrated over the linear interpolation between two models' weights, after correcting for permutations (as described in Ainsworth et al.).
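As a sketch of this last metric, the snippet below evaluates the loss along the linear interpolation between two weight vectors on a single batch. It omits the permutation-matching step from Ainsworth et al. (so it measures the raw, uncorrected barrier); the loss function and batch are placeholders.

```python
# Sketch: loss along the linear interpolation between two models' weights.
# The permutation-matching step of Ainsworth et al. is omitted here, so this
# measures the raw (uncorrected) barrier. `loss_fn` and `batch` are placeholders.
import torch
from torch.nn.utils import vector_to_parameters


def interpolation_losses(model, w_a, w_b, loss_fn, batch, n_points=21):
    """Evaluate the loss at evenly spaced points on the segment from w_a to w_b.

    `w_a` and `w_b` are flattened weight vectors (e.g. from
    torch.nn.utils.parameters_to_vector); `model` is reused as a container.
    """
    xs, ys = batch
    losses = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        vector_to_parameters((1 - alpha) * w_a + alpha * w_b, model.parameters())
        with torch.no_grad():
            losses.append(loss_fn(model(xs), ys).item())
    return losses  # integrating these (e.g. trapezoid rule) approximates the metric above
```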

Tracking Densities

TODO

Results

One-hidden-layer Models

What's remarkable about one-hidden-layer models is how little they depend on the weight initialization: almost all of the variance seems to be explained by the batch schedule. Even for initial perturbations of size $\epsilon = 10$, the models appear to become almost entirely equivalent in function space across the entire training process. In the figure below, you can see that the differences in $L_\text{train}$ are imperceptible (both for a fixed $\epsilon$ and across averages over different $\epsilon$).

TODO: I haven't checked $\epsilon/|\mathbf w|$. I'm assuming this is $>1$ for $\epsilon=10$ (which is why I find this surprising), but I might be confused about PyTorch's default weight initialization scheme. So take this all with some caution.

[Figure: training loss curves for the baseline and perturbed models.]

The rate of growth of $d_\mathbf{w}$ has a very short exponential period (shorter than a single epoch), followed by a long linear period (up to 25 epochs, long past the point at which the error has stabilized). In some cases, you'll see a slight bend upwards or downwards. I need more time to test over more perturbations (right now it's 10 perturbed models) to clean this up. Maybe these trends change over longer periods; with a bit more time I'll test longer training runs and more perturbations.
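A rough sketch of how one might quantify this, by comparing least-squares fits of linear and exponential forms to a measured $d_\mathbf{w}(t)$ curve (the array of distances is assumed to come from the divergence experiment above):

```python
# Rough sketch: compare linear vs. exponential fits to a measured d_w(t) curve.
# `d_w` is assumed to be a 1-D array of weight-space distances, one per step.
import numpy as np


def compare_fits(d_w):
    d_w = np.asarray(d_w, dtype=float)
    t = np.arange(len(d_w), dtype=float)
    # Linear fit: d_w(t) ~ a * t + b
    a, b = np.polyfit(t, d_w, deg=1)
    linear_mse = np.mean((d_w - (a * t + b)) ** 2)
    # Exponential fit: log d_w(t) ~ lam * t + c (restricted to strictly positive d_w)
    mask = d_w > 0
    lam, c = np.polyfit(t[mask], np.log(d_w[mask]), deg=1)
    exp_mse = np.mean((d_w[mask] - np.exp(lam * t[mask] + c)) ** 2)
    return {"linear": (a, b, linear_mse), "exponential": (lam, c, exp_mse)}
```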


Hyperparameters

  • Momentum: ($\beta=0.1, 0.5, 0.9$) The $d_\mathbf{w}$ curves look slightly more curved upwards/exponential, but maybe that's confirmation bias (I expected momentum to make divergence more exponential back when I expected exponential divergence). Small amounts of momentum appear to increase convergence (relative to other models equivalent up to momentum), while large amounts increase divergence. For very small amounts, the effect appears to be negligible. Prediction: the $d_\mathbf{w}$ curves curve down because in flat basins, momentum slows you down.
  • Learning rate: The same holds for $\eta=10^{-3}, 10^{-2}, 10^{-1}$ (not too surprising). TODO: Compare directly between different learning rates. Can we see $\eta$ reflected in the slope of divergence? Prediction: this should not make much of a difference. I expect that correcting for the learning rate should give nearly identical separation slopes.
  • Weight decay: TODO for $\lambda=10^{-3}, 10^{-2}, 10^{-1}$. Prediction: $d_\mathbf{w}$ will shrink (because $|\mathbf w|$ will shrink). The normalized $d_\mathbf{w}/|\mathbf w|$ will curve downwards because weight decay breaks the ReLU-scaling symmetry, which means the volume of our basins will shrink.
  • Width: TODO. Both this and depth require care: with different sizes of $\mathbf w$, we can't naively compare norms. Is it enough to divide by $\dim \mathbf w$? Prediction: I expect the slope of divergence to increase linearly with the width of a one-hidden-layer network (from modeling each connection as separating of its own linear accord).
  • Depth: TODO. Prediction: I expect the shape of $d_\mathbf{w}$ for each individual layer to become more and more exponential with depth (because the overall change will be the product of layers that each diverge linearly and independently). I expect this to dominate the overall divergence of the networks.
  • Architecture: Prediction: I expect convolutional architectures to decrease the rate of separation (after correcting for the number of parameters).
  • Optimizer:
    • Prediction: I expect adaptive techniques to make the rate of divergence exponential.

Datasets

  • Computer Vision
    • MNIST
    • Fashion-MNIST
    • ImageNet: Prediction: I expect ImageNet to lead to the same observations as MNIST (after correcting for model parameter count).
  • Natural Language
    • IMDb movie review database

Why Linear?

I'm not sure. My hunch going in was that I'd see either exponential divergence (indicating chaos) or square root divergence (i.e., Brownian noise). Linear surprises me, and I don't yet know what to make of it.

Maybe all of this is just a fact about one-hidden-layer networks. If each hidden layer evolves linearly and independently, maybe this combines additively into something Brownian in deeper networks. Or maybe it combines multiplicatively (so exponentially).

Deep Models

TODO

Appendix

Weight Initialization

For ReLU-based networks, we use a modified version of Kaiming (normal) initialization, based on the intuition that Kaiming initialization amounts to sampling weights from a hyperspherical shell of vanishing thickness.

Kaiming initialization is sampling from a hyperspherical shell

Consider a matrix, $\mathbf w^{(l)}$, representing the weights of a particular layer $l$, with shape $(D_\mathrm{in}^{(l)}, D_\mathrm{out}^{(l+1)})$. $D_\mathrm{in}^{(l)}$ is also called the fan-in of the layer, and $D_\mathrm{out}^{(l+1)}$ the fan-out. For ease of presentation, we'll ignore the bias, though the following reasoning applies equally well to it.

We're interested in the vectorized form of this matrix, $\vec w^{(l)} \in \mathbb R^{D^{(l)}}$, where $D^{(l)} = D_\mathrm{in}^{(l)} \times D_\mathrm{out}^{(l+1)}$.

In Kaiming initialization, we sample the components, $w_i^{(l)}$, of this vector i.i.d. from a normal distribution with mean $0$ and variance $\sigma^2$ (where $\sigma^2 = \frac{2}{D_\mathrm{in}^{(l)}}$).

Geometrically, this is equivalent to sampling from a hyperspherical shell, $S^{D-1}$, with radius $\sqrt{D}\sigma$ and (fuzzy) thickness $\delta$.

This follows from some straightforward algebra (dropping the superscript ll for simplicity):

$$\mathbb E[|\mathbf w|^2] = \mathbb E\left[\sum_{i=1}^D w_i^2\right] = \sum_{i=1}^D \mathbb E[w_i^2] = \sum_{i=1}^D \sigma^2 = D\sigma^2,$$

and

$$\begin{align}
\delta^2 \propto \mathrm{var}[|\mathbf w|^2] &= \mathbb E\left[\left(\sum_{i=1}^D w_i^2\right)^2\right] - \mathbb E\left[\sum_{i=1}^D w_i^2\right]^2 \\
&= \sum_{i, j=1}^D \mathbb E[w_i^2 w_j^2] - (D\sigma^2)^2 \\
&= \sum_{i \neq j}^D \mathbb E[w_i^2]\, \mathbb E[w_j^2] + \sum_{i=1}^D \mathbb E[w_i^4] - (D\sigma^2)^2 \\
&= D(D-1)\sigma^4 + D(3\sigma^4) - (D\sigma^2)^2 \\
&= 2D\sigma^4.
\end{align}$$

So the thickness as a fraction of the radius is

$$\frac{\delta}{\sqrt{D}\sigma} = \frac{\sqrt{2D}\sigma^2}{\sqrt{D}\sigma} = \sqrt{2}\sigma = \frac{2}{\sqrt{D_\mathrm{in}^{(l)}}},$$

where the last equality follows from the choice of σ\sigma for Kaiming initialization.

This means that for suitably wide networks ($D_\mathrm{in}^{(l)} \to \infty$), the thickness of this shell goes to $0$.
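Here's a quick numerical sanity check of the two moment computations above ($\mathbb E[|\mathbf w|^2] = D\sigma^2$ and $\mathrm{var}[|\mathbf w|^2] = 2D\sigma^4$); the layer dimensions and sample count are illustrative.

```python
# Quick numerical check of the moments derived above:
# E[|w|^2] = D * sigma^2 and var[|w|^2] = 2 * D * sigma^4.
# The layer dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 128, 64            # fan-in and fan-out of the layer
D = d_in * d_out                 # dimension of the vectorized weight matrix
sigma2 = 2.0 / d_in              # Kaiming-normal variance

w = rng.normal(0.0, np.sqrt(sigma2), size=(2000, D))
sq_norms = (w ** 2).sum(axis=1)

print("E[|w|^2]  :", sq_norms.mean(), "vs predicted", D * sigma2)
print("var[|w|^2]:", sq_norms.var(), "vs predicted", 2 * D * sigma2 ** 2)
```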

Taking the thickness to 0

This suggests an alternative initialization strategy: sample directly from the boundary of a hypersphere with radius $\sqrt{D}\sigma$, i.e., take the shell thickness to be $0$.

This can easily be done by sampling each component from a normal distribution with mean $0$ and variance $1$ and then normalizing the resulting vector to have length $\sqrt{D}\sigma$ (this is known as the Muller method).
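A minimal sketch of this initializer, assuming the Kaiming choice $\sigma^2 = 2/D_\mathrm{in}$ and PyTorch's `(out_features, in_features)` layout for linear layers:

```python
# Sketch of the modified initializer: sample uniformly from the hypersphere of
# radius sqrt(D) * sigma (the Muller method), with sigma^2 = 2 / fan_in assumed.
import torch


def hypersphere_init(d_in, d_out):
    sigma = (2.0 / d_in) ** 0.5
    D = d_in * d_out
    w = torch.randn(D)                 # i.i.d. N(0, 1) components
    w = w / w.norm()                   # project onto the unit sphere
    w = w * (D ** 0.5) * sigma         # rescale to radius sqrt(D) * sigma
    return w.reshape(d_out, d_in)      # match nn.Linear's (out_features, in_features)
```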

Perturbing Weight Initialization

Naïvely, if we're interested in a perturbation analysis of the choice of weight initialization, we prepare some baseline initialization, $\mathbf w_0$, and then apply i.i.d. Gaussian noise, $\boldsymbol \delta$, to each of its elements, $\delta_i \sim \mathcal N(0, \epsilon^2)$.

The problem with this is that the perturbed weights, $\mathbf w = \mathbf w_0 + \boldsymbol\delta$, are no longer sampled from the same distribution as the baseline weights. In terms of the geometric picture from the previous section, we're increasing the thickness of the hyperspherical shell in the vicinity of the baseline weights.

There is nothing wrong with this per se, but it introduces a possible confounder (the thickness).

The modification we made to Kaiming initialization was to sample directly from the boundary of a hypersphere, rather than from a hyperspherical shell. This is a more natural choice when conducting a perturbation analysis, because it makes it easier to ensure that the perturbed weights are sampled from the same distribution as the baseline weights.

Geometrically, the intersection of a hypersphere $S^{D-1}$ of radius $w_0 = |\mathbf w_0|$ with a hypersphere $S^{D-1}$ of radius $\epsilon$ centered at a point on the boundary of the first is a lower-dimensional hypersphere $S^{D-2}$ of a modified radius $\epsilon'$. If we sample uniformly from this lower-dimensional hypersphere, the resulting points are uniformly distributed over the original hypersphere, restricted to points at distance $\epsilon$ from the baseline.

This suggests a procedure to sample from the intersection of the weight initialization hypersphere and the perturbation hypersphere.

First, we sample from a hypersphere of dimension $D-1$ and radius $\epsilon'$ (using the same technique we used to sample the baseline weights). From a bit of trigonometry (see the figure below), we know that the radius of this hypersphere will be $\epsilon' = w_0 \sin \theta$, where $\theta = \cos^{-1}\left(1-\frac{\epsilon^2}{2w_0^2}\right)$.

[Figure: geometry of the intersection sphere (radius $\epsilon'$, angle $\theta$).]

Next, we rotate the sampled vector so it is orthogonal to the baseline vector $\mathbf w_0$. This is done with a Householder reflection, $H$, that maps the current normal vector, $\hat{\mathbf n} = (0, \dots, 0, 1)$, onto $\hat{\mathbf w}_0$ (up to sign):

$$H = \mathbf I - 2\frac{\mathbf c \mathbf c^T}{\mathbf c^T \mathbf c},$$

where

$$\mathbf c = \hat{\mathbf n} + \hat {\mathbf w}_0,$$

and $\hat{\mathbf w}_0 = \frac{\mathbf w_0}{|\mathbf w_0|}$ is the unit vector in the direction of the baseline weights.

Implementation note: for the sake of tractability, we never form $H$ explicitly; we apply the reflection directly via:

$$H\mathbf y = \mathbf y - 2 \frac{\mathbf c^T \mathbf y}{\mathbf c^T\mathbf c} \mathbf c.$$

Finally, we translate the rotated intersection sphere along the baseline vector, so that its boundary passes through the intersection of the two hyperspheres. From the figure above, we find that the translation has magnitude $w_0' = w_0 \cos \theta$.

Because both the intersection sphere and the baseline vector are sampled uniformly, the resulting perturbed vector has the same distribution as the baseline vector, when restricted to the intersection sphere.
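Putting the steps together, here is a minimal NumPy sketch of the sampler, operating on flattened weight vectors (handling of edge cases, e.g. $\hat{\mathbf w}_0 \approx -\hat{\mathbf n}$, is omitted):

```python
# Sketch of the perturbation sampler described above: draw a point uniformly
# from the intersection of the initialization hypersphere (radius |w0|) and a
# hypersphere of radius eps centered at the baseline w0. Flattened weights.
import numpy as np


def perturb_on_hypersphere(w0, eps, rng):
    D = w0.shape[0]
    r = np.linalg.norm(w0)
    w0_hat = w0 / r
    theta = np.arccos(1.0 - eps**2 / (2.0 * r**2))

    # 1. Sample uniformly from a sphere of radius r * sin(theta) lying in the
    #    hyperplane with last coordinate zero (Muller method).
    y = np.zeros(D)
    y[:-1] = rng.normal(size=D - 1)
    y[:-1] *= r * np.sin(theta) / np.linalg.norm(y[:-1])

    # 2. Householder-reflect so the sample lies in the hyperplane orthogonal to w0.
    n_hat = np.zeros(D)
    n_hat[-1] = 1.0
    c = n_hat + w0_hat
    y = y - 2.0 * (c @ y) / (c @ c) * c

    # 3. Translate along the baseline direction by w0' = r * cos(theta).
    return y + r * np.cos(theta) * w0_hat
```

By construction, the output satisfies $|\mathbf w| = |\mathbf w_0|$ and $|\mathbf w - \mathbf w_0| = \epsilon$, which makes for a quick sanity check.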

[Figure: sampling a perturbation on the initialization hypersphere.]

Dynamical Systems

Formally, a dynamical system is a tuple $(\mathcal T, \mathcal M, \Phi)$, where $\mathcal T$ is the "time domain" (some monoid), $\mathcal M$ is the phase space over which the evolution takes place (some manifold), and $\Phi$ is the evolution function, a map,

$$\Phi: \mathcal T \times \mathcal M \to \mathcal M,$$

that satisfies, $\forall x \in \mathcal M,\ \forall t_{1}, t_{2}\in \mathcal T$,[^4]

$$\begin{align}
\Phi(0, x) &= x,\\
\Phi(t_{2}, \Phi(t_{1}, x)) &= \Phi(t_{2}+t_{1}, x).
\end{align}$$


Informally, we're usually interested in one of two perspectives:

  1. Trajectories of individual points in $\mathcal M$, or
  2. Evolution of probability densities over $\mathcal M$.

The former is described in terms of differential equations (for continuous systems) or difference equations (for discrete systems), i.e.,

$$\frac{\partial x}{\partial t} = \phi(x, t)$$

or

$$x_{t+1} - x_{t} = \psi(x_{t}, t).$$

The latter is described in terms of a transfer operator:

$$\nabla_{t}\, p(x) = \mathcal L\, p(x).$$

Both treatments have their advantages: the particle view admits easier empirical analysis (we just simulate a trajectory), while the density view makes it easier to reason about ensembles of trajectories at once.

In terms of the former, path-sensitivity becomes the question of chaos: do nearby trajectories converge or diverge?

Varying the Depth

For a fair comparison across network depths, we want to keep the total number of parameters, $D$, constant as we increase depth.

Given $n_h$ hidden layers (with biases) that each have width $w$, an input dimensionality of $n_\mathrm{in}$, and an output dimensionality of $n_\mathrm{out}$, the total number of parameters is

$$\begin{align}
D(w, n_h) &= (n_{\text{in}}+ 1)\, w + n_{h}\, (w +1)\, w + (w + 1)\, n_{\text{out}} \\
&= n_{h}w^{2} + (n_\text{in} + n_{h} + n_\text{out} + 1)\, w + n_{\text{out}}.
\end{align}$$
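Inverting this relation (the quadratic formula in $w$) gives a small helper for picking the hidden width that approximately matches a target parameter budget at a given depth; rounding to an integer width is a judgment call.

```python
# Sketch: invert D(w, n_h) to choose a hidden width that (approximately)
# matches a target parameter count at a given depth.
import math


def width_for_depth(n_hidden_layers, n_in, n_out, target_params):
    # Solve n_h * w^2 + (n_in + n_h + n_out + 1) * w + n_out = target_params for w.
    a = n_hidden_layers
    b = n_in + n_hidden_layers + n_out + 1
    c = n_out - target_params
    w = (-b + math.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return round(w)


# e.g. width_for_depth(1, 784, 10, 100_000) -> 110 (roughly 100k parameters)
```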

Lyapunov spectrum

The Lyapunov exponent, $\lambda$, quantifies the rate of separation of infinitesimally close trajectories:

$$|\delta \mathbf{Z}(t)| \approx e^{\lambda t}\left|\delta \mathbf{Z}_0\right|.$$

If the exponent is positive, then nearby trajectories will diverge, and the dynamics are chaotic. If the exponent is negative, then nearby trajectories will converge.

For a $D$-dimensional system, there are $D$ Lyapunov exponents. We often focus exclusively on the maximal Lyapunov exponent because it dominates the overall rate of separation between two neighboring trajectories. However, the full spectrum contains valuable additional information, like the rate of entropy production, the fractal dimension, the Hausdorff dimension, and the Lyapunov dimension.

The first major limitation is that the full Lyapunov spectrum is intractable for large $D$. The other major limitation is that the Lyapunov spectrum requires a suitable norm. We can use the $\ell_2$ norm in $\mathcal W$, but as we already saw, what we really care about is measuring distance in $\mathcal F$ with some metric $d : \mathcal F \times \mathcal F \to \mathbb R$.
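In practice, we only estimate the maximal exponent, e.g. by fitting the slope of $\log |\delta\mathbf Z(t)|$ over the initial window where divergence still looks exponential. A sketch (the window length is an assumption):

```python
# Sketch: estimate the maximal Lyapunov exponent from the early divergence of a
# perturbed run by fitting the slope of log d(t) over an initial window.
# The window length is a judgment call, since the exponential regime is short.
import numpy as np


def max_lyapunov_estimate(distances, dt=1.0, window=100):
    d = np.asarray(distances[:window], dtype=float)
    t = np.arange(len(d)) * dt
    mask = d > 0
    slope, _ = np.polyfit(t[mask], np.log(d[mask]), deg=1)
    return slope  # lambda_max > 0 suggests chaotic divergence; < 0, convergence
```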

One option would be to repurpose the loss function, $\ell$:

$$d(f_1, f_2) = \frac{1}{N} \sum_{i=1}^N \ell\left(f_1(x_i), f_2(x_i)\right).$$

Here, we treat $f_2(x_i)$ as the ground truth and use the empirical risk as our distance metric. We could define an analogous distance over either the train or test set. As long as $\ell$ is a suitable metric (or easily converted into one through, e.g., symmetrization), $d$ will be too.

This suffers a major downside: two functions will be seen as identical as long as they have the same performance on the dataset in question, even if they behave totally differently on unseen examples. Devising suitable distance metrics for neural networks requires more work.


Autocorrelations

The autocorrelation of a random process, $\{X_t\}$, is the correlation between two values of that process at different times,

$$R_{XX}(t_1, t_2) = \langle X_{t_1} X_{t_2}\rangle.$$

For stationary processes (where $\langle X_t\rangle$ is independent of $t$), this becomes a function of one variable, $\tau=t_2-t_1$,

$$R_{XX}(\tau) = \langle X_t X_{t+\tau}\rangle.$$

In the case of training, the dynamics aren't stationary: learning rate schedules and the "descent" of gradient descent ensure that these correlations will depend on our choice of starting time. However, we're not typically interested in correlations between later time steps; we care about autocorrelations relative to the starting point, $t_1 = 0$.
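A sketch of that computation, assuming we've stored an ensemble of trajectories as an array of shape `(n_runs, n_steps, dim)` and using the ensemble average of the dot product in place of $\langle \cdot \rangle$:

```python
# Sketch: autocorrelation of a (non-stationary) training trajectory relative to
# the starting point t_1 = 0, averaged over an ensemble of runs.
# `trajectories` is assumed to have shape (n_runs, n_steps, dim).
import numpy as np


def autocorrelation_from_start(trajectories):
    x0 = trajectories[:, 0, :]  # X_0 for each run
    # R(0, t) = < X_0 . X_t >, with the ensemble average standing in for < . >
    return np.einsum("rd,rtd->t", x0, trajectories) / trajectories.shape[0]
```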

In practice, the probabilistic and point-wise views are intimately connected: the relevant autocorrelation timescales can often be directly related to the maximal Lyapunov exponent. These two formulas are only a small sample of the available tooling.

$\mathcal F$ is the space of $p$-integrable functions: $f_w \in \mathcal F$ iff

$$\int_{\mathcal X} |f_{w}(x)|^{p}\, \mathrm dx < \infty,$$

which is equipped with a metric,

$$d_{\mathcal F}(f, g) = \int_{\mathcal X}|f(x)-g(x)|^{p}\, \mathrm d x.$$

So we exchange evolution over a finite-dimensional $\mathcal W$ for evolution over an infinite-dimensional $\mathcal F$.

Footnotes

  1. For regression, $\mathcal Y = \mathbb R^N$; for classification, $\mathcal Y \subset \mathbb N$; for self-supervised tasks, we're often interested in $\mathcal Y = \mathcal X$; etc.

  2. To start, we'll ignore the model hyperparameters. You can view this as either absorbing the hyperparameters into the weights, into the optimizer, or into our choice of $f$. Later, it'll be useful to separate out the hyperparameters (i.e., $f: \mathcal X \times \mathcal W \times \mathcal H \to \mathcal Y$).

  3. A quick note on notation: often, you'll see the dataset denoted with a capital $D$. Here, we're using a lowercase because it'll be useful to treat $\mathbf d$ as an instance of a random variable $\mathbf D$. Similarly for $x$, $y$, and $w$ ($X$, $Y$, and $W$). Even though $x$, $y$, and $w$ are (typically) all vectors, we'll reserve boldface for sets. TODO: Maybe just bite the bullet and bold all of them.

  4. It's possible to generalize this further so that $\Phi : U \subseteq (\mathcal T \times \mathcal M) \to \mathcal M$, but this won't be necessary for us.