The last half year has been one of the most turbulent periods of my life. It's also been one of the best.

I quit the start-up that was sucking out my soul and rotting my intellect (Okay maybe that's a tad melodramatic). I started working on a problem I care about and reviving my brain. I found the community, mentors, and projects I'd been looking for. I started doing original work and advocating for a neglected area of research (singular learning theory). It's been pretty great.

Which makes it a great time for reflection and looking forward. What's in store for the rest of the year?

The last six months

Six months ago, I got an FTX Future Fund grant to do some upskilling. One of the conditions for receiving that grant was to write a reflection after the grant period (six months) expired. So, yes, that's part of my motivation for writing this post. Even if FTX did implode in the interim, and even if there is likely no one to read this, it's better to be safe than sorry.

A quick summary:

  • Reading: Mathematics for Machine Learning, Bishop, Cracking the Coding Interview, Sutton & Barto, Russell & Norvig, Watanabe, and lots of miscellaneous articles, sequences, etc.
  • Courses: Fast.ai (which I quit early because it was too basic), OpenAI's spinning up (abandoned in favor of other RL material), and ARENA (modeled after MLAB).
  • SERI MATS: An unexpected development was that I ended up participating in SERI MATS. For two months, I was in Berkeley with a cohort of others in a similar position as mine (i.e., transitioning to technical AI safety research).
  • Output: singular learning theory sequence & classical learning theory sequence.

It's been quite a lot more productive than I anticipated, both in terms of input absorbed and output written. I also ended up with a position as a research assistant in David Krueger's lab.

The next six months

But we're not done yet. The next six months are shaping up to be the busiest of my life. Just as I like 'em.


I'm organizing a summit on SLT and alignment. My guess is that, looking back a few years from now, I will have accelerated this field by up to two years (compared to worlds in which I don't exist). The aim will be to foster research applying SLT within AI safety towards developing better interpretability tools, with specific attention given to detecting phase transitions.


So many projects. Unlike some, I think writing publications is actually a pretty decent goal to work toward. You need some kind of legible output to work towards, one that can serve as a finish line.

In order from most finished to least:

  • (SLT) The Shallow Reality of 'Deep Learning Theory': when I'm done writing the sequence on LessWrong, I'm going to work with Zach Furman and Mark Chiu Chong to turn this into something publishable.
  • Pattern-learning model: this is the project I'm currently working on with Lauro Langosco in the Krueger lab. The aim is to devise a simplified toy model of neural network training dynamics akin to Michaud et al.'s quantization model of neural scaling.
  • Neural (network) divergence: a project I'm working on with Samuel Knoche on reviewing and implementing the various ways people have come up with to compare different neural networks.
  • What are inductive biases, really?: a project I'm working on with Alexandra Bates to review all the existing literature on inductive biases and provide some much needed formalization.
  • (SLT) Singularities and dynamics: the aim is to develop toy models of the loss landscape in which to investigate the role of singularities on training dynamics.
  • Path dependence in NNs: this is the project I started working on in SERI MATS. The idea is to study how small perturbations (to the weights or hyperparameters) grow over the course of training. There's a lot here, which is why it's taking quite some time to finish up.
  • (SLT) Phase detectors: a project I recently started during an Apart Hackathon, which explores how to detect "phase transitions" during training.

There's a lot here, which is why some of these projects (the last three) are currently parked.

(And to make it worse I've just accepted a part-time technical writing position.)


What's next? After the summit? After wrapping up a few of these projects? After the research assistant position comes to a close (in the fall)?

Do I…

I'm leaning more and more to the last one (/two).

A job with Anthropic would be great, but I think I could accomplish more by pursuing a slightly different agenda and by having a bit more slack to invest in learning.

Meanwhile, I think a typical PhD is too much lock-in, especially in the US, where they might require me (with a physics background) to do an additional master's degree. As a century fellow, I'd be free to create my own custom PhD-like program. I'd spend some time in Australia with Daniel Murfet, in Boston with the Tegmark group, in New York with the Bowman lab, in London with Conjecture, and in the Bay Area with everyone.

I think it's very likely that I'll end up starting a research organization focused on bringing SLT to alignment. That's going to take a slightly atypical path.

Robustness and Distribution Shifts

Distribution Shifts

I. Introduction

In the world of finance, quants develop trading algorithms to gain an edge in the market. As these algorithms are deployed and begin to interact with each other, they change market dynamics and can end up in different environments from what they were developed for. This leads to continually degrading performance and the ongoing need to develop and refine new trading algorithms. When deployed without guardrails, these dynamics can lead to catastrophic failures such as the flash crash of 2010, in which the stock market temporarily lost more than one trillion dollars.

This is an extreme example of distribution shift, where the data a model is deployed on diverges from the data it was developed on. It is a key concern within the field of AI safety: mass-deployed SOTA models could lead to similar catastrophic outcomes, with impacts not limited to financial markets.

In the more prosaic setting, distribution shift is concerned with questions like: Will a self-driving car trained in sunny daytime environments perform well when deployed in wet or nighttime conditions? Will a model trained to diagnose X-rays transfer to a new machine? Will a sentiment analysis model trained on data from one website work when deployed on a new platform?

In this document, we will explore this concept of distribution shift, discuss its various forms and causes, and explore some strategies for mitigating its effects. We will also define key related terms such as out-of-distribution data, train-test mismatch, robustness, and generalization.

II. The Learning Problem

To understand distribution shift, we must first understand the learning problem.

The dataset. In the classification or regression setting, there is a space of inputs, $X$, and a space of outputs, $Y$, and we would like to learn a function ("hypothesis") $h: X \to Y$. We are given a dataset, $D = \{(x_i, y_i)\}_{i=1}^n$, of $n$ samples of input-output behavior and assume that each sample is sampled independently and identically from some "true" underlying distribution, $P(x, y)$.

The model. The aim of learning is to find some optimal model, $y = f_w(x)$, parametrized by $w \in \mathcal W$, where optimal is defined via a loss function $\ell(\hat y, y)$ that evaluates how different a prediction $\hat y$ is from the true outcome $y$.

Empirical risk minimization. We would like to find the model that minimizes the expected loss over all possible input-output pairs; that is, the population risk:

$$R(h) = \mathbb E[\ell(h(x), y)] = \int \ell(h(x), y)\,\mathrm{d}P(x,y).$$

However, we do not typically have direct access to $P(x, y)$ (and even with knowledge of $P(x, y)$, the integral would almost certainly be intractable). Instead, as a proxy for the population risk, we minimize the loss averaged over the dataset, which is known as the empirical risk:

$$R_D(h)=\frac{1}{n}\sum\limits_{i=1}^n\ell(h(x_i), y_i).$$
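As a minimal sketch (a hypothetical one-dimensional regression problem with squared loss; the model and data here are illustrative, not from the text), the empirical risk is just the dataset-average loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: n i.i.d. samples from an assumed "true" P(x, y),
# here y = 2x + Gaussian noise with standard deviation 0.1.
n = 1_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.1, size=n)

def h(x, w=2.0):
    """A candidate model f_w(x) = w * x."""
    return w * x

def empirical_risk(h, x, y):
    """Average squared loss over the dataset -- the proxy for population risk."""
    return np.mean((h(x) - y) ** 2)

print(empirical_risk(h, x, y))  # ≈ 0.01, the irreducible noise variance
```

Since this candidate already matches the true slope, the empirical risk bottoms out at the label noise; minimizing over $w$ would recover the same model.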

Training and testing. In practice, to avoid overfitting we split the dataset into a training set, $S$, and a test set, $T$. We train the model on $S$ but report performance on $T$. If we want to find the optimal hyperparameters in addition to the optimal parameters, we may further split part of the dataset into additional cross-validation sets. Then, we train on the training set, select hyperparameters via the cross-validation sets, and report performance on a held-out test set.

Deployment. Deployment, rather than testing, is the end goal. At a first pass, distribution shift is simply when performance on the training or test set is no longer predictive of performance during deployment. Most of the difficulty in detecting and mitigating this phenomenon comes down to there being few or no ground-truth labels, $y$, during deployment.

III. Distribution Shift and Its Causes

Distribution shift. Usually, distribution shift refers to when the data in the deployment environment is generated by some distribution, $P_\text{deployment}$, that differs from the distribution, $P$, responsible for generating the dataset. Additionally, it may refer to train-test mismatch, in which the distribution generating the training set, $P_S$, differs from the distribution generating the test set, $P_T$.

Train-test mismatch. Train-test mismatch is easier to spot than distribution shift between training and deployment, as in the latter case there may be no ground truth to compare against. In fact, train-test mismatch is often intentional. For example, to deploy a model on future time-series data, one may split the training and test set around a specific date. If the model translates from historical data in the training set to later data in the test set, it may also extend further into the future.
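The date-based split described above can be sketched in a few lines (the daily time series here is synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily time series with slow drift in the input-output relation.
n_days = 1_000
t = np.arange(n_days)
x = rng.normal(size=n_days)
y = x + 0.01 * t  # the relationship changes gradually over time

# Split around a date rather than at random: train on the past,
# test on the "future" relative to the cutoff.
cutoff = 800
train_idx = t < cutoff
test_idx = ~train_idx

print(train_idx.sum(), test_idx.sum())  # 800 200
```

The mismatch is intentional: if a model fit on the first 800 days still performs on the last 200, that is weak evidence it will extend a bit further into the future.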

Generalization and robustness. Understanding how well models translate to unseen examples from the same distribution (generalization or concentration) is different from understanding how well models translate to examples from a different distribution (robustness). That's because distribution shift is not about unbiased sampling error: given finite sample sizes, individual samples will necessarily differ between training, test, and deployment environments. (If this were not the case, there would be little point to learning.) Authors may sometimes muddy the distinction (e.g., "out-of-distribution generalization"), which is why we find it worth emphasizing their difference.

Out-of-distribution. "Distribution shift" is about bulk statistical differences between distributions. "Out-of-distribution" is about individual differences between specific samples. Often, an "out-of-distribution sample" refers to the more extreme case in which that sample comes from outside the training or testing domain (in which case, "out-of-domain" may be more appropriate). See, for example, the figure below.


The main difference between "distribution shift" and "out-of-distribution" is whether one is talking about bulk properties or individual properties, respectively. On the left-hand side, the distributions differ, but the sample is equally likely for either distribution. On the right-hand side, the distributions differ, and the sample is out-of-domain.

Causes of Distribution Shift

Non-stationarity. Distribution shift can result from the data involving a temporal component and the distribution being non-stationary, such as when one tries to predict future commodity prices based on historical data. Similar effects can occur as a result of non-temporal changes (such as training a model on one geographical area or on a certain source of images before applying it elsewhere).

Interaction effects. A special kind of non-stationarity, of particular concern within AI safety, is the effect that the model itself has, in deployment, on the systems it interacts with. In small-scale deployments, this effect is often negligible, but when models are deployed at massive scale (as in finance, where automated bots can move billions of dollars), the consequences can become substantial.


Stationary vs. non-stationary data.

Sampling bias. Though distribution shift does not refer to unbiased sampling error, it can refer to the consequences of biased sampling error. If one trains a model on the results of an opt-in poll, it may not perform well when deployed to the wider public. These kinds of biases are beyond the reach of conventional generalization theory and fall instead to the study of robustness.

Types of Distribution Shift

The true data-generating distribution can be factored,

$$P(x, y) = P(y|x) P(x) = P(x|y) P(y),$$

which helps to distinguish several different kinds of distribution shift.

Covariate shift is when $P(y|x)$ is held constant while $P(x)$ changes. The actual transformation from inputs to outputs remains the same, but the relative likelihoods of different inputs change. This can result from any of the causes listed above.

Label shift is the reverse of covariate shift, where $P(x|y)$ is held constant while $P(y)$ changes. Through Bayes' rule and marginalization, covariate shift induces a change in $P(y)$ and vice versa, so the two are related, but not exactly the same: assuming that $P(y|x)$ remains constant and assuming that $P(x|y)$ remains constant are not the same assumption.

Concept drift is where $P(x)$ is held constant while $P(y|x)$ changes. The distribution over inputs is unchanged while the transformation from inputs to outputs changes. In practice, it is rarely the case that a distribution shift falls cleanly into one of these three categories. Still, this taxonomy can be useful as a practical approximation.
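To make the distinction concrete, here is a contrived one-dimensional sketch (entirely synthetic; a threshold classifier on Gaussian inputs) contrasting covariate shift with concept drift for a model that has learned the development-time rule exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Development-time truth: y = 1 exactly when x > 0, with x ~ N(0, 1).
def true_label(x):
    return (x > 0).astype(int)

# A model that has perfectly learned the development-time rule P(y|x).
def model(x):
    return (x > 0).astype(int)

# Covariate shift: P(x) moves (mean 0 -> 3) but P(y|x) is unchanged.
x_shift = rng.normal(3.0, 1.0, 10_000)
acc_covariate = (model(x_shift) == true_label(x_shift)).mean()

# Concept drift: P(x) is unchanged but P(y|x) moves (threshold 0 -> 1).
x_same = rng.normal(0.0, 1.0, 10_000)
y_drift = (x_same > 1.0).astype(int)
acc_concept = (model(x_same) == y_drift).mean()

print(acc_covariate)  # 1.0 -- the learned rule is still valid everywhere
print(acc_concept)    # ≈ 0.66 -- the input-output rule itself changed
```

In practice, covariate shift still hurts real models, because they are only accurate where the training data was dense; the sketch just isolates the definitional difference.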

Internal covariate shift is a phenomenon specific to deep neural networks, where sampling error between batches can induce large changes in the distribution of internal activations, especially for activations deeper in the model. That said, this is not a distribution shift in the classical sense, which refers to a change in $P(x, y)$.
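Batch normalization was introduced precisely to counteract this effect by standardizing activations within each batch. A rough numpy sketch, omitting the learnable scale and shift parameters of the full technique:

```python
import numpy as np

def batch_norm(activations, eps=1e-5):
    """Standardize each feature across the batch dimension."""
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    return (activations - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# A batch of 64 activation vectors whose distribution has drifted (shifted, widened).
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 16))
normed = batch_norm(batch)
print(normed.mean(), normed.std())  # ≈ 0, ≈ 1
```

Whatever the incoming batch statistics, downstream layers see activations with roughly zero mean and unit variance per feature.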

IV. Conclusion

Techniques for mitigating distribution shift include data augmentation, adversarial training, regularization techniques like dropout, domain adaptation, model calibration, mechanistic anomaly detection, batch normalization for internal covariate shift, online learning, and even simple interventions like using larger, pre-trained models.

In this document, we discussed the importance of distribution shift in machine learning, its causes, and strategies for mitigating its effects. We also defined key terms such as distribution shift, out-of-distribution, and train-test mismatch. Addressing distribution shift remains an open challenge, and one that only grows in importance as models are deployed at larger scales.


Here are some dominoes based on [1]. The idea behind this dataset is that there are two "patterns" in the data: the MNIST image and the CIFAR image.


Notice that some of the dominoes have only one "pattern" present. By tracking training/test loss on these one-sided dominoes, we can tease apart how quickly the model learns the two different patterns.

We'd like to compare these pattern-learning curves to the curves predicted by the toy model of [2]. In particular, we'd like to compare predictions to the empirical curves as we change the relevant macroscopic parameters (e.g., prevalence, reliability, and simplicity1).

Which means running sweeps over these macroscopic parameters.


What happens as we change the relative incidence of MNIST vs CIFAR images in the dataset? We can accomplish this by varying the frequency of one-sided MNIST dominoes vs. one-sided CIFAR dominoes.

We control two parameters:

  • $p_m$, the probability of a domino containing an MNIST image (either one-sided or two-sided), and
  • $p_c$, the probability of a domino containing a CIFAR image (either one-sided or two-sided).

Two parameters are fixed by our datasets:

  • $N_m$, the number of samples in the MNIST dataset.
  • $N_c$, the number of samples in the CIFAR dataset.

Given these parameters, we have to determine:

  • $r_{m0}$, the fraction of the MNIST dataset that we reject,
  • $r_{m1}$, the fraction of the MNIST dataset that ends up in one-sided dominoes,
  • $r_{m2}$, the fraction of the MNIST dataset that ends up in two-sided dominoes,

and, similarly, $r_{c0}$, $r_{c1}$, and $r_{c2}$ for the CIFAR dataset.

Here's the corresponding Sankey diagram (in terms of numbers of samples rather than probabilities, but it's totally equivalent).

Six unknowns means we need six constraints.

We get the first two from the requirement that probabilities are normalized,

$$r_{m0} + r_{m1} + r_{m2} = r_{c0} + r_{c1} + r_{c2} = 1,$$

and another from the two-sided dominoes requiring the same number of samples from both datasets,

$$r_{m2} N_m = r_{c2} N_c.$$

Out of convenience, we'll introduce an additional variable, which we immediately constrain,

$$N = r_{c1}N_c + r_{m1}N_m + r_{m2} N_m,$$

the number of samples in the resulting dominoes dataset.

We get the last three constraints from our choices of $p_m$, $p_c$, and $p_1$:

$$N p_m = N_{m1} + N_2 = r_{m1} N_m + r_{m2} N_m,$$
$$N p_c = N_{c1} + N_2 = r_{c1} N_c + r_{c2} N_c,$$
$$N p_1 = N_{m1} + N_{c1} = r_{m1} N_m + r_{c1} N_c.$$

In matrix format,

$$\begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & N_m & 0 & 0 & -N_c & 0 \\ 0 & N_m & N_m & 0 & N_c & 0 & 1 \\ 0 & N_m & N_m & 0 & 0 & 0 & -p_m \\ 0 & 0 & 0 & 0 & N_c & N_c & -p_c \\ 0 & N_m & 0 & 0 & N_c & 0 & -p_1 \end{pmatrix} \cdot \begin{pmatrix} r_{m0} \\ r_{m1} \\ r_{m2} \\ r_{c0} \\ r_{c1} \\ r_{c2} \\ N \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},$$

where $p_1 = 2 - p_c - p_m$.

So unfortunately, this yields trivial answers where $r_{m0} = r_{c0} = 1$ and all other values are 0. The solution seems to be to just allow there to be empty dominoes.
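A quick numerical check bears this out. The sketch below builds the system exactly as written above and solves it; the dataset sizes and probabilities are illustrative stand-ins, not values from the text:

```python
import numpy as np

# Illustrative values: N_m, N_c are fixed by the datasets; p_m, p_c are free choices.
N_m, N_c = 60_000, 50_000
p_m, p_c = 0.7, 0.6
p_1 = 2 - p_c - p_m

# Unknowns, in order: (r_m0, r_m1, r_m2, r_c0, r_c1, r_c2, N)
A = np.array([
    [1, 1,   1,   0, 0,   0,    0],     # MNIST fractions sum to 1
    [0, 0,   0,   1, 1,   1,    0],     # CIFAR fractions sum to 1
    [0, 0,   N_m, 0, 0,   -N_c, 0],     # two-sided dominoes use equal counts
    [0, N_m, N_m, 0, N_c, 0,    1],     # definition of N
    [0, N_m, N_m, 0, 0,   0,    -p_m],  # fraction of dominoes with an MNIST side
    [0, 0,   0,   0, N_c, N_c,  -p_c],  # fraction of dominoes with a CIFAR side
    [0, N_m, 0,   0, N_c, 0,    -p_1],  # fraction of one-sided dominoes
], dtype=float)
b = np.array([1, 1, 0, 0, 0, 0, 0], dtype=float)

solution = np.linalg.solve(A, b)
print(solution)  # ≈ [1, 0, 0, 1, 0, 0, 0] -- reject everything, N = 0
```

The same trivial solution comes out for any choice of the parameters, matching the observation above.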


We can vary the reliability by inserting "wrong" dominoes, i.e., with some probability, making one of the two sides display an incorrect class for the label.
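One way this might be implemented (a hypothetical `corrupt_labels` helper, not from the original notes, which resamples a wrong class with probability $1 - \text{reliability}$):

```python
import numpy as np

def corrupt_labels(labels, n_classes, reliability, rng):
    """With probability (1 - reliability), replace a label with a random *wrong* class."""
    labels = labels.copy()
    flip = rng.random(len(labels)) > reliability
    # Adding an offset in [1, n_classes) mod n_classes guarantees a different class.
    offsets = rng.integers(1, n_classes, size=int(flip.sum()))
    labels[flip] = (labels[flip] + offsets) % n_classes
    return labels

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=10_000)
y_noisy = corrupt_labels(y, n_classes=10, reliability=0.9, rng=rng)
print((y_noisy != y).mean())  # ≈ 0.1
```

Applying this to one side of the dominoes (rather than to the labels directly) would make that side an unreliable signal while leaving the other side intact.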


One of the downsides of this task is that we don't have much control over the simplicity of the feature. MNIST is simpler than CIFAR, sure, but how much? How might we control this?


  1. Axes conceived by Ekdeep Singh Lubana.

0. The shallow reality of 'deep learning theory'

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

Most results under the umbrella of "deep learning theory" are not actually deep, about learning, or even theories.

This is because classical learning theory makes the wrong assumptions, takes the wrong limits, uses the wrong metrics, and aims for the wrong objectives. Learning theorists are stuck in a rut of one-upmanship, vying for vacuous bounds that don't say anything about any systems of actual interest.

Yudkowsky tweeting about statistical learning theorists. (Okay, not really.)

In particular, I'll argue throughout this sequence that:

  • Empirical risk minimization is the wrong framework, and risk is a weak foundation.
  • In approximation theory, the universal approximation results are too general (they do not constrain efficiency) while the "depth separation" results meant to demonstrate the role of depth are too specific (they involve constructing contrived, unphysical target functions).
  • Generalization theory has only two tricks, and they're both limited:
    • Uniform convergence is the wrong approach, and model class complexities (VC dimension, Rademacher complexity, and covering numbers) are the wrong metric. Understanding deep learning requires looking at the microscopic structure within model classes.
    • Robustness to noise is an imperfect proxy for generalization, and techniques that rely on it (margin theory, sharpness/flatness, compression, PAC-Bayes, etc.) are oversold.
  • Optimization theory is a bit better, but training-time guarantees involve questionable assumptions, and the obsession with second-order optimization is delusional. Also, the NTK is bad. Get over it.
  • At a higher level, the obsession with deriving bounds for approximation/generalization/learning behavior is misguided. These bounds serve mainly as political benchmarks rather than a source of theoretical insight. More attention should go towards explaining empirically observed phenomena like double descent (which, to be fair, is starting to happen).

That said, there are new approaches that I'm more optimistic about. In particular, I think that singular learning theory (SLT) is the most likely path to lead to a "theory of deep learning" because it (1) has stronger theoretical foundations, (2) engages with the structure of individual models, and (3) gives us a principled way to bridge between this microscopic structure and the macroscopic properties of the model class1. I expect the field of mechanistic interpretability and the eventual formalism of phase transitions and "sharp left turns" to be grounded in the language of SLT.

Why theory?

A mathematical theory of learning and intelligence could form a valuable tool in the alignment arsenal that helps us:

That's not to say that the right theory of learning is risk-free:

  • A good theory could inspire new capabilities. We didn't need a theory of mechanics to build the first vehicles, but we couldn't have gotten to the moon without it.
  • The wrong theory could mislead us. Just as theory tells us where to look, it also tells us where not to look. The wrong theory could cause us to neglect important parts of the problem.
  • It could be one prolonged nerd-snipe that draws attention and resources away from other critical areas in the field. Brilliant string theorists aren't exactly helping advance living and technology standards by computing the partition functions of black holes in 5D de Sitter spaces.2

All that said, I think the benefits currently outweigh the risks, especially if we put the right infosec policy in place if learning theory starts showing signs of any practical utility. It's fortunate, then, that we haven't seen those signs yet.


My aims are:

  • To discourage other alignment researchers from wasting their time.
  • To argue for what makes singular learning theory different and why I think it is the likeliest contender for an eventual grand unified theory of learning.
  • To invoke Cunningham's law — i.e., to get other people to tell me where I'm wrong and what I've been missing in learning theory.

There's also the question of integrity: if I am to criticize an entire field of people smarter than I am, I had better present a strong argument and ample evidence.

Throughout the rest of this sequence, I'll be drawing on notes I compiled from lecture notes by Telgarsky, Moitra, Grosse, Mossel, Ge, and Arora, books by Roberts et al. and Hastie et al., a festschrift of Chervonenkis, and a litany of articles.3

The sequence follows the three-fold division of approximation, generalization, and optimization preferred by the learning theorists. There's an additional preface on why empirical risk minimization is flawed (up next) and an epilogue on why singular learning theory seems different.


  1. This sequence was inspired by my worry that I had focused too singularly on singular learning theory. I went on a journey through the broader sea of "learning theory" hopeful that I would find other signs of useful theory. My search came up mostly empty, which is why I decided to write >10,000 words on the subject.

  2. Though, granted, string theory keeps on popping up in other branches like condensed matter theory, where it can go on to motivate practical results in material science (and singular learning theory, for that matter).

  3. I haven't gone through all of these sources in equal detail, but the content I cover is representative of what you'll learn in a typical course on deep learning theory.

The shallow reality of 'deep learning theory'

Classical learning theory makes the wrong assumptions, takes the wrong limits, uses the wrong metrics, and aims for the wrong objectives.

In this sequence, I review the current state of learning theory and the many ways in which it is broken. I argue that the field as it currently stands is profoundly useless, and that developing a useful theory of deep learning will require turning elsewhere, likely to something that builds on singular learning theory.