How to Perturb Weights

I'm running a series of experiments that involve some variation of: (1) perturb a weight initialization, (2) train the perturbed and baseline models in parallel, and (3) track how the perturbation grows or shrinks over time.

Naively, if we're interested in a perturbation analysis of the choice of weight initialization, we prepare some baseline initialization, $\mathbf w_0$, and then apply i.i.d. Gaussian noise, $\boldsymbol\delta$, to each of its elements, $\delta_i \sim \mathcal N(0, \epsilon^2)$. (If we want, we can let this vary layer-by-layer and let it depend on, for example, the norm of the layer it's being applied to.)
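As a minimal sketch of this naive strategy (in numpy; the function name and the single global noise scale `eps` are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_naive(w0: np.ndarray, eps: float) -> np.ndarray:
    """Add i.i.d. Gaussian noise with standard deviation eps to a baseline init."""
    delta = rng.normal(loc=0.0, scale=eps, size=w0.shape)
    return w0 + delta
```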

The problem with this strategy is that the perturbed weights $\mathbf w = \mathbf w_0 + \boldsymbol\delta$ are, in general, no longer sampled from the same distribution as the baseline weights.

There is nothing wrong with this per se, but if we're interested specifically in the question of how behavior changes with the size of a perturbation, it introduces a possible confounder: the thickness of the shell that the initialization concentrates on, as we'll see below. As responsible experimentalists, we don't like confounders.

Fortunately, there's an easy way to "clean up" Kaiming initialization to make it better suited to this perturbative analysis.

Kaiming initialization lives in a hyperspherical shell

Consider a matrix, $\mathbf w^{(l)}$, representing the weights of a particular layer $l$ with shape $(D_\mathrm{in}^{(l)}, D_\mathrm{out}^{(l+1)})$. $D_\mathrm{in}^{(l)}$ is also called the fan-in of the layer, and $D_\mathrm{out}^{(l+1)}$ the fan-out. For ease of presentation, we'll ignore the bias, though the following reasoning applies equally well to the bias.

We're interested in the vectorized form of this matrix, $\vec w^{(l)} \in \mathbb R^{D^{(l)}}$, where $D^{(l)} = D_\mathrm{in}^{(l)} \times D_\mathrm{out}^{(l+1)}$.

In Kaiming initialization, we sample the components, $w_i^{(l)}$, of this vector i.i.d. from a normal distribution with mean $0$ and variance $\sigma^2$ (where $\sigma^2 = \frac{2}{D_\mathrm{in}^{(l)}}$).
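In numpy, one layer of this might look like the following sketch (PyTorch's `torch.nn.init.kaiming_normal_` implements the same fan-in rule):

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_normal(d_in: int, d_out: int) -> np.ndarray:
    """Sample a (d_in, d_out) weight matrix with entries ~ N(0, 2 / d_in)."""
    sigma = np.sqrt(2.0 / d_in)
    return rng.normal(loc=0.0, scale=sigma, size=(d_in, d_out))
```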

Geometrically, this is equivalent to sampling from a hyperspherical shell around $S^{D-1}$, the hypersphere of radius $\sqrt{D}\sigma$, with (fuzzy) thickness $\delta$. (Ok, so technically, because the radius can vary from layer to layer, it's a hyperellipsoidal shell.)

This follows from some straightforward algebra (dropping the superscript $l$ for simplicity):

$$
\mathbb E[|\mathbf w|^2] = \mathbb E\left[\sum_{i=1}^D w_i^2\right] = \sum_{i=1}^D \mathbb E[w_i^2] = \sum_{i=1}^D \sigma^2 = D\sigma^2,
$$

and

$$
\begin{align}
\delta^2 \propto \mathrm{var}[|\mathbf w|^2] &= \mathbb E\left[\left(\sum_{i=1}^D w_i^2\right)^2\right] - \mathbb E\left[\sum_{i=1}^D w_i^2\right]^2 \\
&= \sum_{i, j=1}^D \mathbb E[w_i^2 w_j^2] - (D\sigma^2)^2 \\
&= \sum_{i \neq j}^D \mathbb E[w_i^2]\, \mathbb E[w_j^2] + \sum_{i=1}^D \mathbb E[w_i^4] - (D\sigma^2)^2 \\
&= D(D-1)\sigma^4 + D(3\sigma^4) - (D\sigma^2)^2 \\
&= 2D\sigma^4,
\end{align}
$$

where the fourth line uses the Gaussian fourth moment, $\mathbb E[w_i^4] = 3\sigma^4$.

So the thickness as a fraction of the radius is

$$
\frac{\delta}{\sqrt{D}\sigma} = \frac{\sqrt{2D}\,\sigma^2}{\sqrt{D}\,\sigma} = \sqrt{2}\,\sigma = \frac{2}{\sqrt{D_\mathrm{in}^{(l)}}},
$$

where the last equality follows from the choice of $\sigma$ for Kaiming initialization.

This means that for suitably wide networks ($D_\mathrm{in}^{(l)} \to \infty$), the thickness of this shell goes to $0$.
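Both moments are easy to check numerically; here's a quick Monte Carlo sanity check (a sketch; the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 256, 64
D = d_in * d_out
sigma = np.sqrt(2.0 / d_in)  # Kaiming: sigma^2 = 2 / d_in

# 1000 independent Kaiming initializations, flattened to vectors in R^D.
w = rng.normal(0.0, sigma, size=(1000, D))
sq_norms = np.sum(w**2, axis=1)

print(sq_norms.mean() / (D * sigma**2))     # ~1.0, i.e. E[|w|^2] = D sigma^2
print(sq_norms.var() / (2 * D * sigma**4))  # ~1.0, i.e. var[|w|^2] = 2 D sigma^4
```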

Taking the thickness to 0

This suggests an alternative initialization strategy. What if we immediately take the limit $D_\mathrm{in}^{(l)} \to \infty$ and sample directly from the boundary of a hypersphere with radius $\sqrt{D}\sigma$, i.e., modify the shell thickness to be $0$?

This can easily be done by sampling each component from a normal distribution with mean $0$ and variance $1$ and then normalizing the resulting vector to have length $\sqrt{D}\sigma$ (this is known as the Muller method).
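A minimal sketch of this spherical variant (the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_hypersphere(d_in: int, d_out: int) -> np.ndarray:
    """Sample weights uniformly from the hypersphere of radius sqrt(D) * sigma."""
    D = d_in * d_out
    radius = np.sqrt(D) * np.sqrt(2.0 / d_in)  # sqrt(D) * sigma
    v = rng.normal(0.0, 1.0, size=D)           # Muller method: v / |v| is uniform on the sphere
    return (radius * v / np.linalg.norm(v)).reshape(d_in, d_out)
```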

Perturbing the weight initialization

The modification we made to Kaiming initialization was to sample directly from the boundary of a hypersphere, rather than from a hyperspherical shell. This is a more natural choice when conducting a perturbation analysis, because it makes it easier to ensure that the perturbed weights are sampled from the same distribution as the baseline weights.

Geometrically, the intersection of the hypersphere $S^{D-1}$ of radius $w_0 = |\mathbf w_0|$ with a hypersphere of radius $\epsilon$ centered at some point on the boundary of the first is a lower-dimensional hypersphere $S^{D-2}$ of a modified radius $\epsilon'$. If we sample uniformly from this lower-dimensional hypersphere, the resulting points follow the same (uniform) distribution over the original hypersphere, conditioned on lying at distance $\epsilon$ from the baseline point.

This suggests a procedure to sample from the intersection of the weight initialization hypersphere and the perturbation hypersphere.

First, we sample from the intersection hypersphere $S^{D-2}$, which lives in a hyperplane of dimension $D-1$, with radius $\epsilon'$ (using the same Muller technique we used to sample the baseline weights). From a bit of trigonometry (see the figure below), we know that the radius of this hypersphere will be $\epsilon' = w_0 \sin\theta$, where $\theta = \cos^{-1}\left(1-\frac{\epsilon^2}{2w_0^2}\right)$.

[Figure: the intersection geometry, showing the angle $\theta$, the intersection radius $\epsilon'$, and the offset $w_0'$ along $\mathbf w_0$.]
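In code, the quantities read off the figure are (a sketch; the names are mine, and the spheres only intersect if $\epsilon \le 2 w_0$):

```python
import numpy as np

def intersection_geometry(w0_norm: float, eps: float) -> tuple[float, float]:
    """Radius of the intersection sphere and its offset along the baseline."""
    theta = np.arccos(1.0 - eps**2 / (2.0 * w0_norm**2))
    eps_prime = w0_norm * np.sin(theta)  # radius of the intersection sphere
    offset = w0_norm * np.cos(theta)     # translation used in the final step
    return eps_prime, offset
```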

Next, we rotate the sampled vector so it is orthogonal to the baseline vector $\mathbf w_0$. This is done with a Householder reflection, $H$, that maps the current normal vector $\hat{\mathbf n} = (0, \dots, 0, 1)$ onto $-\hat{\mathbf w}_0$ (the sign is irrelevant here: all we need is that the hyperplane orthogonal to $\hat{\mathbf n}$ is carried onto the hyperplane orthogonal to $\mathbf w_0$):

$$
H = \mathbf I - 2\frac{\mathbf c \mathbf c^T}{\mathbf c^T \mathbf c},
$$

where

$$
\mathbf c = \hat{\mathbf n} + \hat{\mathbf w}_0,
$$

and $\hat{\mathbf w}_0 = \frac{\mathbf w_0}{|\mathbf w_0|}$ is the unit vector in the direction of the baseline weights.

Implementation note: for the sake of tractability (we never materialize the $D \times D$ matrix $H$), we apply the reflection directly via:

$$
H\mathbf y = \mathbf y - 2\, \frac{\mathbf c^T \mathbf y}{\mathbf c^T \mathbf c}\, \mathbf c.
$$
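In numpy this is a one-liner, $O(D)$ in time and memory rather than the $O(D^2)$ cost of building $H$ (a sketch; the function name is mine):

```python
import numpy as np

def apply_householder(y: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Apply H = I - 2 c c^T / (c^T c) to y without materializing H."""
    return y - 2.0 * (c @ y) / (c @ c) * c
```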

Finally, we translate the rotated intersection sphere along the baseline vector, so that it passes through the intersection of the two hyperspheres. From the figure above, we find that the translation has magnitude $w_0' = w_0 \cos\theta$.

Because both the baseline vector and the intersection sphere are sampled uniformly, the resulting perturbed vector has the same distribution as the baseline vector, conditioned on lying at distance $\epsilon$ from $\mathbf w_0$.
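Putting the three steps together, here is a minimal end-to-end sketch (all names are mine; it assumes $0 < \epsilon \le 2|\mathbf w_0|$):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_on_sphere(w0: np.ndarray, eps: float) -> np.ndarray:
    """Sample w uniformly from {w : |w| = |w0|, |w - w0| = eps}."""
    D = w0.size
    w0_norm = np.linalg.norm(w0)
    w0_hat = w0 / w0_norm
    theta = np.arccos(1.0 - eps**2 / (2.0 * w0_norm**2))

    # Step 1: uniform sample from the intersection sphere (radius w0 sin(theta)),
    # embedded in the hyperplane whose last coordinate is zero.
    y = np.zeros(D)
    v = rng.normal(size=D - 1)
    y[:-1] = (w0_norm * np.sin(theta)) * v / np.linalg.norm(v)

    # Step 2: Householder reflection mapping n_hat onto -w0_hat, which carries
    # the hyperplane orthogonal to n_hat onto the hyperplane orthogonal to w0.
    n_hat = np.zeros(D)
    n_hat[-1] = 1.0
    c = n_hat + w0_hat  # degenerate if w0_hat == -n_hat; not handled here
    y = y - 2.0 * (c @ y) / (c @ c) * c

    # Step 3: translate along the baseline direction onto the intersection.
    return y + (w0_norm * np.cos(theta)) * w0_hat

# Sanity check: the perturbed vector stays on the sphere, at distance eps from w0.
w0 = rng.normal(size=4096)
w = perturb_on_sphere(w0, eps=0.1)
print(np.linalg.norm(w) - np.linalg.norm(w0))  # ~0
print(np.linalg.norm(w - w0))                  # ~0.1
```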

[Figure: sampling the perturbation.]