Here are some dominoes based on [1]. The idea behind this dataset is that there are two "patterns" in the data: the MNIST image and the CIFAR image.

Pasted image 20230316175757.png

Notice that some of the dominoes have only one "pattern" present. By tracking training/test loss on these one-sided dominoes, we can tease apart how quickly the model learns the two different patterns.
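As a minimal sketch of that bookkeeping: given per-sample losses and two flags per domino saying which sides are present, we can average the loss separately over MNIST-only, CIFAR-only, and two-sided dominoes. The function name, the flag names, and the toy numbers are all hypothetical; nothing here is tied to a particular training framework.

```python
from statistics import fmean


def losses_by_domino_type(per_sample_loss, has_mnist, has_cifar):
    """Average loss over MNIST-only, CIFAR-only, and two-sided dominoes.

    All three arguments are equal-length sequences (hypothetical layout):
    one loss value and two booleans per domino.
    """
    groups = {"mnist_only": [], "cifar_only": [], "two_sided": []}
    for loss, m, c in zip(per_sample_loss, has_mnist, has_cifar):
        if m and c:
            groups["two_sided"].append(loss)
        elif m:
            groups["mnist_only"].append(loss)
        elif c:
            groups["cifar_only"].append(loss)
        # Dominoes with neither side present are ignored here.
    # NaN marks a group with no samples in this batch.
    return {k: fmean(v) if v else float("nan") for k, v in groups.items()}


# Toy numbers, purely illustrative.
example = losses_by_domino_type(
    per_sample_loss=[0.25, 0.75, 2.0, 1.0, 0.5],
    has_mnist=[True, True, False, False, True],
    has_cifar=[False, False, True, True, True],
)
```

Logging these three averages at every evaluation step gives one curve per pattern, which is what we want to compare against the toy model.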

We'd like to compare these pattern-learning curves to the curves predicted by the toy model of [2]. In particular, we'd like to compare predictions to the empirical curves as we change the relevant macroscopic parameters (e.g., prevalence, reliability, and simplicity¹).

This means running sweeps over these macroscopic parameters.


What happens as we change the relative incidence of MNIST vs CIFAR images in the dataset? We can accomplish this by varying the frequency of one-sided MNIST dominoes vs. one-sided CIFAR dominoes.

We control two parameters:

  • $p_m$, the probability of a domino containing an MNIST image (either one-sided or two-sided), and
  • $p_c$, the probability of a domino containing a CIFAR image (either one-sided or two-sided).

Two parameters are fixed by our datasets:

  • $N_m$, the number of samples in the MNIST dataset.
  • $N_c$, the number of samples in the CIFAR dataset.

Given these parameters, we have to determine:

  • $r_{m0}$, the fraction of the MNIST dataset that we reject,
  • $r_{m1}$, the fraction of the MNIST dataset that ends up in one-sided dominoes,
  • $r_{m2}$, the fraction of the MNIST dataset that ends up in two-sided dominoes,

and, similarly, $r_{c0}$, $r_{c1}$, and $r_{c2}$ for the CIFAR dataset.

IMG_495F2C6C4A1E-1.jpeg

Here's the corresponding Sankey diagram (in terms of numbers of samples rather than probabilities, but it's totally equivalent).

Six unknowns means we need six constraints.

We get the first two from the requirement that probabilities are normalized,

$$r_{m0} + r_{m1} + r_{m2} = r_{c0} + r_{c1} + r_{c2} = 1,$$

and another from the two-sided dominoes requiring the same number of samples from both datasets,

$$r_{m2} N_m = r_{c2} N_c.$$

Out of convenience, we'll introduce an additional variable, which we immediately constrain,

$$N = r_{c1}N_c + r_{m1}N_m + r_{m2} N_m,$$

the number of samples in the resulting dominoes dataset.

We get the last three constraints from our choices of $p_m$, $p_c$, and $p_1$:

$$N p_m = N_{m1} + N_2 = r_{m1} N_m + r_{m2} N_m,$$
$$N p_c = N_{c1} + N_2 = r_{c1} N_c + r_{c2} N_c,$$
$$N p_1 = N_{m1} + N_{c1} = r_{m1} N_m + r_{c1} N_c.$$
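As a sanity check on these count constraints, here is a minimal sketch that solves them by direct substitution. The constraints only fix the counts relative to $N$, so the sketch adds an assumption that is not part of the derivation above: it picks the largest $N$ that both datasets can supply. It also assumes $p_m + p_c \geq 1$ (no empty dominoes). The function name and the dataset sizes in the example are illustrative.

```python
def domino_fractions(p_m, p_c, n_m, n_c):
    """Fractions r_{m0..m2}, r_{c0..c2} and dataset size N for given p_m, p_c.

    Assumption (not from the text): N is the largest size both datasets
    can supply, i.e. N = min(n_m / p_m, n_c / p_c).
    """
    if not (0 < p_m <= 1 and 0 < p_c <= 1):
        raise ValueError("p_m and p_c must lie in (0, 1]")
    p_2 = p_m + p_c - 1  # probability of a two-sided domino
    if p_2 < -1e-12:
        raise ValueError("p_m + p_c < 1 would require empty dominoes")
    p_2 = max(p_2, 0.0)  # absorb floating-point round-off

    n = min(n_m / p_m, n_c / p_c)  # total number of dominoes
    n_m1 = (1 - p_c) * n  # MNIST-only dominoes
    n_c1 = (1 - p_m) * n  # CIFAR-only dominoes
    n_2 = p_2 * n  # two-sided dominoes

    r_m1, r_m2 = n_m1 / n_m, n_2 / n_m
    r_c1, r_c2 = n_c1 / n_c, n_2 / n_c
    return {
        "r_m0": max(0.0, 1 - r_m1 - r_m2), "r_m1": r_m1, "r_m2": r_m2,
        "r_c0": max(0.0, 1 - r_c1 - r_c2), "r_c1": r_c1, "r_c2": r_c2,
        "N": n,
    }


# Illustrative sizes and probabilities.
example = domino_fractions(p_m=0.8, p_c=0.6, n_m=60_000, n_c=50_000)
```

With these example values, MNIST is the binding dataset: $N$ comes out at roughly 75,000, all of MNIST is used ($r_{m0} \approx 0$), and about 10% of CIFAR is rejected.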

In matrix format,

$$\begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & N_m & 0 & 0 & -N_c & 0 \\ 0 & N_m & N_m & 0 & N_c & 0 & -1 \\ 0 & N_m & N_m & 0 & 0 & 0 & -p_m \\ 0 & 0 & 0 & 0 & N_c & N_c & -p_c \\ 0 & N_m & 0 & 0 & N_c & 0 & -p_1 \end{pmatrix} \cdot \begin{pmatrix} r_{m0} \\ r_{m1} \\ r_{m2} \\ r_{c0} \\ r_{c1} \\ r_{c2} \\ N \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},$$

where $p_1 = 2 - p_c - p_m$.

Unfortunately, this yields trivial answers where $r_{m0}=r_{c0}=1$ and all other values are 0: every constraint other than normalization is homogeneous, so nothing forces $N > 0$. The solution seems to be to just allow there to be empty dominoes.


We can vary the reliability by inserting "wrong" dominoes. That is, with some probability, we make either of the two sides display the incorrect class for the label.
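A minimal sketch of that corruption step, assuming each side is corrupted independently and that a "wrong" class is drawn uniformly from the other classes. Both choices, along with the function and parameter names, are assumptions made for illustration.

```python
import random


def displayed_class(label, num_classes, reliability, rng=random):
    """Class that one side of the domino should display.

    With probability `reliability` the side shows the true label;
    otherwise it shows a uniformly chosen different class (an assumption).
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to show a wrong one")
    if rng.random() < reliability:
        return label
    # Draw from the num_classes - 1 wrong classes by skipping over `label`.
    wrong = rng.randrange(num_classes - 1)
    return wrong if wrong < label else wrong + 1


# Example: a side that is correct about 90% of the time.
rng = random.Random(0)
mnist_side_class = displayed_class(label=3, num_classes=10, reliability=0.9, rng=rng)
```

Calling this once per side, with a separate `reliability` for the MNIST and CIFAR halves, would let us sweep the reliability of each pattern on its own.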


One of the downsides of this task is that we don't have much control over the simplicity of the feature. MNIST is simpler than CIFAR, sure, but by how much? How might we control this?


  1. Axes conceived by Ekdeep Singh Lubana.

0. The shallow reality of 'deep learning theory'

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

Most results under the umbrella of "deep learning theory" are not actually deep, about learning, or even theories.

This is because classical learning theory makes the wrong assumptions, takes the wrong limits, uses the wrong metrics, and aims for the wrong objectives. Learning theorists are stuck in a rut of one-upmanship, vying for vacuous bounds that don't say anything about any systems of actual interest.

Yudkowsky tweeting about statistical learning theorists. (Okay, not really.)

In particular, I'll argue throughout this sequence that:

  • Empirical risk minimization is the wrong framework, and risk is a weak foundation.
  • In approximation theory, the universal approximation results are too general (they do not constrain efficiency) while the "depth separation" results meant to demonstrate the role of depth are too specific (they involve constructing contrived, unphysical target functions).
  • Generalization theory has only two tricks, and they're both limited:
    • Uniform convergence is the wrong approach, and model class complexities (VC dimension, Rademacher complexity, and covering numbers) are the wrong metric. Understanding deep learning requires looking at the microscopic structure within model classes.
    • Robustness to noise is an imperfect proxy for generalization, and techniques that rely on it (margin theory, sharpness/flatness, compression, PAC-Bayes, etc.) are oversold.
  • Optimization theory is a bit better, but training-time guarantees involve questionable assumptions, and the obsession with second-order optimization is delusional. Also, the NTK is bad. Get over it.
  • At a higher level, the obsession with deriving bounds for approximation/generalization/learning behavior is misguided. These bounds serve mainly as political benchmarks rather than a source of theoretical insight. More attention should go towards explaining empirically observed phenomena like double descent (which, to be fair, is starting to happen).

That said, there are new approaches that I'm more optimistic about. In particular, I think that singular learning theory (SLT) is the most likely path to lead to a "theory of deep learning" because it (1) has stronger theoretical foundations, (2) engages with the structure of individual models, and (3) gives us a principled way to bridge between this microscopic structure and the macroscopic properties of the model class.¹ I expect the field of mechanistic interpretability and the eventual formalism of phase transitions and "sharp left turns" to be grounded in the language of SLT.

Why theory?

A mathematical theory of learning and intelligence could form a valuable tool in the alignment arsenal, that helps us:

That's not to say that the right theory of learning is risk-free:

  • A good theory could inspire new capabilities. We didn't need a theory of mechanics to build the first vehicles, but we couldn't have gotten to the moon without it.
  • The wrong theory could mislead us. Just as theory tells us where to look, it also tells us where not to look. The wrong theory could cause us to neglect important parts of the problem.
  • It could be one prolonged nerd-snipe that draws attention and resources away from other critical areas in the field. Brilliant string theorists aren't exactly helping advance living and technology standards by computing the partition functions of black holes in 5D de Sitter spaces.²

All that said, I think the benefits currently outweigh the risks, especially if we put the right infosec policy in place if and when learning theory starts showing signs of any practical utility. It's fortunate, then, that we haven't seen those signs yet.


My aims are:

  • To discourage other alignment researchers from wasting their time.
  • To argue for what makes singular learning theory different and why I think it is the likeliest contender for an eventual grand unified theory of learning.
  • To invoke Cunningham's law — i.e., to get other people to tell me where I'm wrong and what I've been missing in learning theory.

There's also the question of integrity: if I am to criticize an entire field of people smarter than I am, I had better present a strong argument and ample evidence.

Throughout the rest of this sequence, I'll be drawing on notes I compiled from lecture notes by Telgarsky, Moitra, Grosse, Mossel, Ge, and Arora, books by Roberts et al. and Hastie et al., a festschrift of Chervonenkis, and a litany of articles.³

The sequence follows the three-fold division of approximation, generalization, and optimization preferred by the learning theorists. There's an additional preface on why empirical risk minimization is flawed (up next) and an epilogue on why singular learning theory seems different.


  1. This sequence was inspired by my worry that I had focused too singularly on singular learning theory. I went on a journey through the broader sea of "learning theory" hopeful that I would find other signs of useful theory. My search came up mostly empty, which is why I decided to write >10,000 words on the subject.

  2. Though, granted, string theory keeps on popping up in other branches like condensed matter theory, where it can go on to motivate practical results in material science (and singular learning theory, for that matter).

  3. I haven't gone through all of these sources in equal detail, but the content I cover is representative of what you'll learn in a typical course on deep learning theory.