Highlights from Q2

  • Launched the developmental interpretability ("devinterp") research agenda with Alexander Gietelink Oldenziel, Stan van Wingerden, and Daniel Murfet.
    • This came out of the 2023 SLT & Alignment Summit, which I co-organized with the same people.
    • I prepared six lectures, contributing to over 20 hours of recorded materials.
  • Worked as a research assistant at the University of Cambridge.
    • Submitted "Unifying Grokking & Double Descent" with Xander Davies, Lauro Langosco, and David Krueger to NeurIPS.
    • Started working on a project on capability unlearning with Jake Mendel, Bilal Chughtai, and Lauro Langosco.
  • Worked as a writer with CAIS on ██ ████████ ████████ ██ ██ ██████ .
  • I'm making progress on the posts I set out to complete in my previous quarterly review: The Shallow Reality of 'Deep Learning Theory', What are inductive biases, really?
    • I dropped/modified some of these: Neural (network) divergence, toy models of loss landscapes, and path dependence.

Plans for the rest of 2023

I have only one priority for the coming six months: to test the basic claims behind the devinterp research agenda.

That means saying no to pretty much everything else and closing off my responsibilities with the Krueger Lab and CAIS ASAP. It's time to grind.

Research. My primary role will be leading the empirical component of the investigations into devinterp. I will manage (and contribute to):

  • Building tooling/libraries for measuring RLCTs, singular fluctuations, and prosaic "progress measures".
  • Building out a "zoo" of models and settings in which to test these tools, from models trained on synthetic data to vision models, simple language models, and full-fledged LLMs.

I will also move to Melbourne for a few months (September to December) to work alongside Daniel Murfet on this agenda.

Organizational. My secondary role is laying the groundwork for devinterp to scale rapidly if the empirical claims survive scrutiny. This means:

  • Organizing a follow-up devinterp summit in November.
  • Managing and coordinating contributors to the empirical branch of devinterp.
  • And ████████ █ ████████ ███.

Now, all that said, I'm not exactly planning to neglect the rest of my life. It's time for a more in-depth reflection.




  • Diet. My diet's been pretty good, but I've been eating many more meals not prepared by myself. That's meant a lot more seed oils/added sugars/etc. than I'd like.
    • Breakfast is usually oatmeal + peanut butter + protein + banana, etc. in smoothie or porridge form.
    • I could use an equally easy default lunch.
    • As long as I'm in the same location with Robin, she's happy to cook dinner.
  • Meat. I've been eating much less meat. Perhaps too little. As much as I'd like vegetarianism to be equivalent in terms of health, it's not.
  • Alcohol. I drink occasionally, usually not more than 1 or 2 drinks per week, but it's time to stop.
  • Protein. I've been supplementing with protein shakes regularly since I'm trying to build muscle. I've gained ~3-5kg over the last half year and would like to gain another ~5kg over the rest of the year.
  • Intermittent fasting. I've fallen out of the habit of 16/8 IF and would like to restart some kind of fasting. 16/8 isn't ideal since I like being able to drink a cappuccino in the morning and because of the bulking. Alternatively, I can try a 5-day fast once a quarter, which might even be better, but I love food too much, so I need someone else to force me to do this.
  • Supplements. I've been taking Athletic Greens, creatine, fish oils, and occasionally vitamin C + zinc (when I have a cold).
  • Caffeine. I drink two to three cups of coffee per day (and stop responsibly at noon).
  • Nicotine. Most days, I take 2mg of nicotine in the afternoon, sometime between 15:00 and 17:00.
  • Melatonin. I was taking 3mg of melatonin per night but fell out of the habit. Not very consistent.


  • Time to cook (esp. because travel).
  • Willpower for fasting.
  • Meat is unethical.


  • Talk to Robin: ask for quick lunch recipes, or plan a meal prep day, or find someone in Melbourne who can do meal prep for me. She also recommends more broccoli sprouts.
  • Get someone to hold me accountable to do longer fasts.
  • Find access to more stimulants: Modafinil for regular use and Vyvanse or Adderall for occasional use. Maybe also LSD for microdosing.
  • Find a somewhat ethical source of meat like venison or kangaroo.


I think of three main components to physical fitness:

  • Endurance: aerobic/cardio, sauna.
  • Strength: anaerobic, weight-lifting, calisthenics.
  • Dexterity (mobility, agility, flexibility, plasticity, elasticity, stability & balance): yoga, pilates, handstand practice.

Target. My ideal schedule would look like:

  • ~30min light cardio/yoga/core to start the day (maybe jump-rope and some sun salutations).
  • ~45m Weight-lifting/calisthenics and HIIT on alternating days, followed by 30min of skills/stretching/sauna.
  • ~15min of restorative yoga before bed.

Weight-lifting/calisthenics. I started weight-lifting again a few months ago but then pulled something in my back, then went traveling for a month, and fell out of the habit again. I'm going to go from StrongLifts to Starting Strength, since the time commitment is lower, and I'll add weight more slowly this time.

Cardio. Historically, I've found cardio the hardest to commit to (and probably the one I need most). I recently discovered a love for hot yoga and pilates, which combine endurance and dexterity reasonably well but aren't easy to do when you're traveling all the time (even with ClassPass). The best option is probably a more HIIT-focused approach, which I enjoy much more than steady-state work on, e.g., an erg.


  • Injury risk with weight-lifting.
  • Access to gyms (due to travel).
  • Inconsistency (due to travel & having three separate times to do exercise).
  • Cardio sucks.


  • Meet with a PT to check weightlifting form. Focus on calisthenics until then.
  • Use ClassPass & Alo Moves (which I already have).
  • Find a gym in Melbourne.
  • Assemble a set of default routines for each of these moments + alternatives that don't require gym access.
    • Either select from Alo Moves or ask a PT.
  • Finally figure out some kind of habit tracking software (maybe just a spreadsheet?). Obsidian isn't good because you shouldn't shit (=track habits) where you eat (=come up with ideas) and Linear doesn't have great tooling for repetitive tasks.

Sleep / Rest / Stress

  • I've been sleeping in rooms that have too much light and sleeping less as a result.
  • I'd also like to wake up a bit earlier (~6:00).


  • Buy a good eye mask.
  • Put my phone away from my bed and get some light immediately after waking.


  • Sitting quietly in a sauna is the closest thing that comes to meditation for me, and it seems to be enough for now.


  • Main problem is I've been getting too many colds (~2-3 this year already). It's too much of a productivity decrease, and I'm not sure what's causing this. Maybe lack of rest? Maybe travel?
  • I'd like to put more effort into skincare and will ask Robin to hold me accountable for that.
  • I am 3 months behind on going to an oral hygienist and need to schedule an appointment to get my wisdom teeth pulled.
  • Need to be more consistent in wearing my differentials when doing near work.


  • Talk to Robin about colds & skincare. See what she recommends.
  • Unregister with current dentist and reregister in Amsterdam. Schedule mouth cleaning.
    • Start flossing again.
  • Add differentials to habit tracker.

Family & friends

  • Family-wise, I'm too far away from everyone, but that's not really to be avoided. I still get to see everyone every few months intensively for a week or two at a time.
  • Friend-wise, the last half year has been great. I've gotten very close to Alexander, maybe Stan soon as well, and formed a bunch more intermediate friendships. Having a community is overpowered.


  • Schedule a weekly reminder to call parents & Elmer.


  • The last few months have been tough because Robin and I have been long-distance so frequently. I think this will go better the next half year, since Robin is planning to come with me to Australia, but it won't be fully resolved (since I'll go back to the UK for November). I'd very much like to settle with her in one place with more than a three-month horizon. Maybe that's in the cards in 2024.


  • Restart biweekly relationship check-ins. Plan in a time.
  • Schedule weekly reminder to plan a date.


  • The SERI MATS extension grant has been the biggest help. Working as an RA pays dirt. Working as a writer with CAIS has been much better, but I haven't had very many hours, so it's not a major addition.
  • I'll be applying to the Century Fellowship, and I think I have reasonable odds of getting it. That would make a major difference to my financial security. Otherwise, I'll manage on an R.A. budget in Australia.


  • Actually budget out this year.
  • Figure out how/where to invest money when I receive grants so it doesn't just sit in my account.

Career & impact

  • Everything here is going well. (See above.)
  • Need to grow my twitter clout. A few months ago, I started strong, but I haven't been posting recently. I'm at O(500) followers, and would like to get to O(5,000) by the end of the year.


  • Schedule weekly reminder to post.

Personal growth & learning

  • I think the past few months have been too focused on execution and not enough on growth. In particular, I haven't been learning in a structured way as much as I'd like to. I've fallen out of my Anki routines, I haven't learned any new languages recently, and I'm not reading enough outside of technical articles (and even there I think I could be doing more).


  • Decide between FluentForever & Anki + iTalki (or some in-between) for learning Japanese (because Watanabe speaks Japanese and we want to honeymoon here). I'd also like to learn Mandarin, but one thing at a time, this now seems pressing.
  • Map out a learning plan for this summer.
  • Remember to call Alexander whenever I need tutoring.

Leisure / play

  • Definitely could be doing more here, though I've had plenty of time for social events and don't particularly feel like I'm falling short.
  • There's room to refactor this with fitness into some kind of team sport or martial arts, but for now it's probably too much of a time commitment or too intermittent (therefore hard to form habits around) or has too much of a risk of brain damage.
  • Also room for having fun in learning a new language (see personal growth).
  • Most of all, there's room for reading more fiction.


  • Polish off my want-to-read list on GoodReads.
  • Charge my kindle & download the top books off the list.
  • Schedule a regular massage?
  • Schedule a trip with Robin in Australia.


  • Too much twitter.
  • Not enough care and maintenance for my Obsidian.


  • Find a day to go through my Obsidian and clean.


  • The main thing is I'd like to be able to settle in one place in 2024 for more than 6 months. This is a problem for Q4.
  • Let's pay for a cleaner.


  • Find a cleaner in Melbourne.
  • Get a CO2 monitor.


The last half year has been one of the most turbulent periods of my life. It's also been one of the best.

I quit the start-up that was sucking out my soul and rotting my intellect (okay, maybe that's a tad melodramatic). I started working on a problem I care about and reviving my brain. I found the community, mentors, and projects I'd been looking for. I started doing original work and advocating for a neglected area of research (singular learning theory). It's been pretty great.

Which makes it a great time for reflection and looking forward. What's in store for the rest of the year?

The last six months

Six months ago, I got an FTX Future Fund grant to do some upskilling. One of the conditions for receiving that grant was to write a reflection after the grant period (six months) expired. So, yes, that's part of my motivation for writing this post. Even if FTX did implode in the interim, and even if there is likely no one to read this, it's better to be safe than sorry.

A quick summary:

  • Reading: Mathematics for Machine Learning, Bishop, Cracking the Coding Interview, Sutton & Barto, Russell & Norvig, Watanabe, and lots of miscellaneous articles, sequences, etc.
  • Courses: Fast.ai (which I quit early because it was too basic), OpenAI's Spinning Up (abandoned in favor of other RL material), and ARENA (modeled after MLAB).
  • SERI MATS: An unexpected development was that I ended up participating in SERI MATS. For two months, I was in Berkeley with a cohort of others in a similar position as mine (i.e., transitioning to technical AI safety research).
  • Output: singular learning theory sequence & classical learning theory sequence.

It's been quite a lot more productive than I anticipated both in terms of input absorbed and output written. I also ended up with a position as a research assistant with David Krueger's lab.

The next six months

But we're not done yet. The next six months are shaping up to be the busiest of my life. As I like 'em.


I'm organizing a summit on SLT and alignment. My guess is that, looking back a few years from now, I will have accelerated this field by up to two years (compared to worlds in which I don't exist). The aim will be to foster research applying SLT within AI safety towards developing better interpretability tools, with specific attention given to detecting phase transitions.


So many projects. Unlike some, I think writing publications is actually a pretty decent goal. You need some kind of legible output to work towards, one that can serve as a finish line.

In the order of most finished to least:

  • (SLT) The Shallow Reality of 'Deep Learning Theory': when I'm done writing the sequence on LessWrong, I'm going to work with Zach Furman and Mark Chiu Chong to turn this into something publishable.
  • Pattern-learning model: this is the project I'm currently working on with Lauro Langosco in the Krueger lab. The aim is to devise a simplified toy model of neural network training dynamics akin to Michaud et al.'s quantization model of neural scaling.
  • Neural (network) divergence: a project I'm working on with Samuel Knoche on reviewing and implementing the various ways people have come up with to compare different neural networks.
  • What are inductive biases, really?: a project I'm working on with Alexandra Bates to review all the existing literature on inductive biases and provide some much needed formalization.
  • (SLT) Singularities and dynamics: the aim is to develop toy models of the loss landscape in which to investigate the role of singularities on training dynamics.
  • Path dependence in NNs: this is the project I started working on in SERI MATS. The idea is to study how small perturbations (to the weights or hyperparameters) grow over the course of training. There's a lot here, which is why it's taking quite some time to finish up.
  • (SLT) Phase detectors: a project I recently started during an Apart Hackathon, which explores how to detect "phase transitions" during training.

There's a lot here, which is why some of these projects (the last three) are currently parked.

(And to make it worse I've just accepted a part-time technical writing position.)


What's next? After the summit? After wrapping up a few of these projects? After the research assistant position comes to a close (in the fall)?

Do I…

I'm leaning more and more to the last one (/two).

A job with Anthropic would be great, but I think I could accomplish more by pursuing a slightly different agenda and with a bit more slack to invest in learning.

Meanwhile, I think a typical PhD is too much lock-in, especially in the US, where they might require me (with a physics background) to do an additional master's degree. As a Century Fellow, I'd be free to create my own custom PhD-like program. I'd spend some time in Australia with Daniel Murfet, in Boston with the Tegmark group, in New York with the Bowman lab, in London with Conjecture, in the Bay Area with everyone.

I think it's very likely that I'll end up starting a research organization focused on bringing SLT to alignment. That's going to take a slightly atypical path.

Robustness and Distribution Shifts

Distribution Shifts

I. Introduction

In the world of finance, quants develop trading algorithms to gain an edge in the market. As these algorithms are deployed and begin to interact with each other, they change market dynamics and can end up in different environments from what they were developed for. This leads to continually degrading performance and the ongoing need to develop and refine new trading algorithms. When deployed without guardrails, these dynamics can lead to catastrophic failures such as the flash crash of 2010, in which the stock market temporarily lost more than one trillion dollars.

This is an extreme example of distribution shift, where the data a model is deployed on diverges from the data it was developed on. It is a key concern within the field of AI safety, where the worry is that mass-deployed SOTA models could lead to similar catastrophic outcomes with impacts not limited to financial markets.

In the more prosaic setting, distribution shift is concerned with questions like: Will a self-driving car trained in sunny daytime environments perform well when deployed in wet or nighttime conditions? Will a model trained to diagnose X-rays transfer to a new machine? Will a sentiment analysis model trained on data from one website work when deployed on a new platform?

In this document, we will explore this concept of distribution shift, discuss its various forms and causes, and explore some strategies for mitigating its effects. We will also define key related terms such as out-of-distribution data, train-test mismatch, robustness, and generalization.

II. The Learning Problem

To understand distribution shift, we must first understand the learning problem.

The dataset. In the classification or regression setting, there is a space of inputs, $X$, and a space of outputs, $Y$, and we would like to learn a function ("hypothesis") $h: X\to Y$. We are given a dataset, $D=\{(x_i, y_i)\}_{i=1}^n$, of $n$ samples of input-output behavior and assume that each sample is drawn independently from the same "true" underlying distribution, $P(x, y)$.

The model. The aim of learning is to find some optimal model, $y = f_w(x)$, parametrized by $w \in \mathcal W$, where optimal is defined via a loss function $\ell(\hat y, y)$ that evaluates how different a prediction $\hat y$ is from the true outcome $y$.

Empirical risk minimization. We would like to find the model that minimizes the expected loss over all possible input-output pairs; that is, the population risk:

$$R(h) = \mathbb E[\ell(h(x), y)] = \int \ell(h(x), y)\,\mathrm{d}P(x,y).$$

However, we do not typically have direct access to $P(x, y)$ (and even with knowledge of $P(x, y)$ the integral would almost certainly be intractable). Instead, as a proxy for the population risk, we minimize the loss averaged over the dataset, which is known as the empirical risk:

$$R_D(h)=\frac{1}{n}\sum\limits_{i=1}^n\ell(h(x_i), y_i).$$
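To make the relationship between the two risks concrete, here is a minimal sketch in Python. The data-generating distribution (uniform inputs, a linear rule with Gaussian noise), the squared loss, and the function names are all assumptions chosen for illustration; a very large sample stands in for the population, since the true integral is not available in general.

```python
import random

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def empirical_risk(h, dataset, loss=squared_loss):
    """Average loss of hypothesis h over a finite dataset of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in dataset) / len(dataset)

def sample_dataset(n, rng):
    """Hypothetical P(x, y): x ~ Uniform(-1, 1), y = 2x + Gaussian noise (std 0.1)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data

def h(x):
    """Hypothesis that matches the noise-free rule of the assumed distribution."""
    return 2.0 * x

rng = random.Random(0)
small = sample_dataset(50, rng)        # a realistic-sized dataset D
large = sample_dataset(100_000, rng)   # stand-in for the population

risk_small = empirical_risk(h, small)
risk_large = empirical_risk(h, large)
print(f"R_D(h) with n=50:     {risk_small:.4f}")
print(f"R_D(h) with n=100000: {risk_large:.4f}")
```

Under these assumptions the population risk of this hypothesis is the noise variance, 0.01, and the large-sample estimate should sit close to it, while the small-sample estimate fluctuates more.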

Training and testing. In practice, to avoid overfitting, we split the dataset into a training set, $S$, and a test set, $T$. We train the model on $S$ but report performance on $T$. If we want to find the optimal hyperparameters in addition to the optimal parameters, we may further split part of the dataset into additional cross-validation sets. Then, we train on the training set, select hyperparameters via the cross-validation sets, and report performance on a held-out test set.
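The train / select / report workflow can be sketched as follows. This is an illustrative example rather than a recommended recipe: it uses a single validation split instead of multiple cross-validation folds, and the toy dataset, the one-parameter ridge model, and the candidate penalty values are all assumptions.

```python
import random

def split_dataset(dataset, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle, then split into (train, validation, test) lists."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def fit_ridge_1d(data, lam):
    """Closed-form fit of y ~ w * x with an L2 penalty lam on w."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Hypothetical dataset: y = 2x + noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
dataset = [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]

train, val, test = split_dataset(dataset)

# Parameters are fit on the training set; the hyperparameter (lam) is
# chosen on the validation set; the test set is touched only once.
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: mse(fit_ridge_1d(train, lam), val))
w = fit_ridge_1d(train, best_lam)
test_mse = mse(w, test)
print(f"selected lam={best_lam}, test MSE={test_mse:.3f}")
```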

Deployment. Deployment, rather than testing, is the end goal. At a first pass, distribution shift is simply when the performance on the training set or test set is no longer predictive of performance during deployment. Most of the difficulty in detecting and mitigating this phenomenon comes down to there being few or no ground-truth labels, $y$, during deployment.

III. Distribution Shift and Its Causes

Distribution shift. Usually, distribution shift refers to when the data in the deployment environment is generated by some distribution, $P_\text{deployment}$, that differs from the distribution, $P$, responsible for generating the dataset. Additionally, it may refer to train-test mismatch, in which the distribution generating the training set, $P_S$, differs from the distribution generating the test set, $P_T$.

Train-test mismatch. Train-test mismatch is easier to spot than distribution shift between training and deployment, as in the latter case there may be no ground truth to compare against. In fact, train-test mismatch is often intentional. For example, to deploy a model on future time-series data, one may split the training and test set around a specific date. If the model translates from historical data in the training set to later data in the test set, it may also extend further into the future.
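A date-based split of this kind might look like the sketch below. The record format (a list of `(date, value)` pairs), the cutoff date, and the `temporal_split` helper are hypothetical and chosen only to illustrate the idea.

```python
from datetime import date, timedelta

def temporal_split(records, cutoff):
    """Train on records dated before the cutoff, test on the rest."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

# Hypothetical daily time series covering 1 to 10 January 2020.
start = date(2020, 1, 1)
records = [(start + timedelta(days=i), float(i)) for i in range(10)]

train, test = temporal_split(records, cutoff=date(2020, 1, 8))
print(len(train), "training days;", len(test), "test days")
```

Unlike a shuffled split, every test record here is strictly later than every training record, so test performance measures how well the model carries forward in time.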

Generalization and robustness. Understanding how well models translate to unseen examples from the same distribution (generalization or concentration) is different from understanding how well models translate to examples from a different distribution (robustness). That's because distribution shift is not about unbiased sampling error; given finite sample sizes, individual samples will necessarily differ between training, test, and deployment environments. (If this were not the case, there would be little point to learning.) Authors may sometimes muddy the distinction (e.g., "out-of-distribution generalization"), which is why we find it worth emphasizing their difference.

Out-of-distribution. "Distribution shift" is about bulk statistical differences between distributions. "Out-of-distribution" is about individual differences between specific samples. Often, an "out-of-distribution sample" refers to the more extreme case in which that sample comes from outside the training or testing domain (in which case, "out-of-domain" may be more appropriate). See, for example, the figure below.


The main difference between "distribution shift" and "out-of-distribution" is whether one is talking about bulk properties or individual properties, respectively. On the left-hand side, the distributions differ, but the sample is equally likely for either distribution. On the right-hand side, the distributions differ, and the sample is out-of-domain.
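The two cases in the caption can be reproduced numerically. The distributions below (two unit-variance Gaussians, and two uniform distributions with disjoint supports) are assumptions chosen to mirror the description, not the ones used in the original figure.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def uniform_pdf(x, low, high):
    return 1.0 / (high - low) if low <= x <= high else 0.0

# Left-hand case: the distributions differ (means -1 and +1), but the
# sample x = 0 has the same density under both.
p_train = gaussian_pdf(0.0, mean=-1.0, std=1.0)
p_deploy = gaussian_pdf(0.0, mean=1.0, std=1.0)

# Right-hand case: the training domain is [0, 1], so the sample x = 2.5
# has zero density under the training distribution (out-of-domain).
q_train = uniform_pdf(2.5, low=0.0, high=1.0)
q_deploy = uniform_pdf(2.5, low=2.0, high=3.0)

print(f"shifted, in-distribution sample: {p_train:.4f} vs {p_deploy:.4f}")
print(f"out-of-domain sample:            {q_train:.4f} vs {q_deploy:.4f}")
```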

Causes of Distribution Shift

Non-stationarity. Distribution shift can result from the data involving a temporal component and the distribution being non-stationary, such as when one tries to predict future commodity prices based on historical data. Similar effects can occur as a result of non-temporal changes (such as training a model on one geographical area or on a certain source of images before applying it elsewhere).

Interaction effects. A special kind of non-stationarity, of particular concern within AI safety, is the effect that the model has in deployment on the systems it is interacting with. In small-scale deployments, this effect is often negligible, but when models are deployed at massive scale (as in finance, where automated bots can move billions of dollars), the consequences can become substantial.


Stationary vs. non-stationary data.

Sampling bias. Though distribution shift does not refer to unbiased sampling error, it can refer to the consequences of biased sampling error. If one trains a model on the results of an opt-in poll, it may not perform well when deployed to the wider public. These kinds of biases are beyond the reach of conventional generalization theory and up to the study of robustness.
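The opt-in poll example can be simulated in a few lines. All numbers here (a 30% support rate, and response probabilities of 0.6 for supporters versus 0.2 for everyone else) are made up for illustration.

```python
import random

rng = random.Random(0)

# Hypothetical population: each person supports a proposal with probability 0.3.
population = [rng.random() < 0.3 for _ in range(100_000)]

# Unbiased sampling: a uniformly random subset of the population.
random_sample = rng.sample(population, 2000)

# Biased sampling: supporters opt in more often than non-supporters.
opt_in_sample = [s for s in population if rng.random() < (0.6 if s else 0.2)]

rate_random = sum(random_sample) / len(random_sample)
rate_opt_in = sum(opt_in_sample) / len(opt_in_sample)
print(f"support in random sample: {rate_random:.3f}")
print(f"support in opt-in sample: {rate_opt_in:.3f}")
```

Collecting more opt-in responses does not close the gap: under these assumptions the opt-in estimate converges to 0.18 / 0.32 ≈ 0.56 rather than 0.3, which is why this kind of error is a robustness problem and not a sample-size problem.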

Types of Distribution Shift

The true data-generating distribution can be factored,

$$P(x, y) = P(y|x) P(x)=P(x|y)P(y),$$

which helps to distinguish several different kinds of distribution shift.

Covariate shift is when $P(y|x)$ is held constant while $P(x)$ changes. The actual transformation from inputs to outputs remains the same, but the relative likelihoods of different inputs change. This can result from any of the causes listed above.
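A small simulation shows why covariate shift matters even though the input-output rule never changes. The quadratic rule, the input ranges, and the choice of a (deliberately misspecified) linear model are assumptions made for this sketch.

```python
import random

def sample(n, x_low, x_high, rng):
    """Same P(y|x) everywhere: y = x^2 + small Gaussian noise. Only P(x) varies."""
    data = []
    for _ in range(n):
        x = rng.uniform(x_low, x_high)
        data.append((x, x * x + rng.gauss(0.0, 0.05)))
    return data

def fit_line(data):
    """Ordinary least squares for y = a * x + b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(params, data):
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = sample(2000, 0.0, 1.0, rng)         # P(x) = Uniform(0, 1)
test_same = sample(2000, 0.0, 1.0, rng)     # same P(x)
test_shifted = sample(2000, 1.0, 2.0, rng)  # shifted P(x) = Uniform(1, 2)

params = fit_line(train)
mse_same = mse(params, test_same)
mse_shifted = mse(params, test_shifted)
print(f"MSE, same P(x):    {mse_same:.4f}")
print(f"MSE, shifted P(x): {mse_shifted:.4f}")
```

The line is a reasonable local approximation where the training inputs live, so the in-distribution error is small; once the inputs move to a region the model never saw, the same fixed rule is approximated much worse.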

Label shift is the reverse of covariate shift, where $P(x|y)$ is held constant while $P(y)$ changes. Through Bayes' rule and marginalization, covariate shift induces a change in $P(y)$ and vice-versa, so the two are related, but not exactly the same: assuming that $P(y|x)$ remains constant and assuming that $P(x|y)$ remains constant are not the same assumption.
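To see that the two assumptions really differ, one can hold the class-conditional likelihoods fixed, change only the label prior, and recompute the posterior with Bayes' rule. The labels and numbers below are hypothetical.

```python
def posterior(likelihoods, prior):
    """Bayes' rule for one fixed input x: P(y|x) from P(x|y) and P(y)."""
    joint = {y: likelihoods[y] * prior[y] for y in prior}
    evidence = sum(joint.values())  # P(x), by marginalizing over y
    return {y: joint[y] / evidence for y in joint}

# P(x|y) for one particular input x, held fixed under label shift.
likelihoods = {"sick": 0.8, "healthy": 0.2}

prior_train = {"sick": 0.5, "healthy": 0.5}    # P(y) during training
prior_deploy = {"sick": 0.1, "healthy": 0.9}   # P(y) after label shift

post_train = posterior(likelihoods, prior_train)
post_deploy = posterior(likelihoods, prior_deploy)
print(f"P(sick|x) in training:   {post_train['sick']:.3f}")
print(f"P(sick|x) in deployment: {post_deploy['sick']:.3f}")
```

With $P(x|y)$ unchanged, the posterior for this input drops from 0.8 to about 0.31, so a shift that satisfies the label-shift assumption violates the covariate-shift assumption that $P(y|x)$ stays constant.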

Concept drift is where $P(x)$ is held constant while $P(y|x)$ changes. The distribution over inputs is unchanged while the transformation from inputs to outputs changes. In practice, it is rarely the case that a distribution shift falls cleanly into one of these three categories. Still, this taxonomy can be useful as a practical approximation.

Internal covariate shift is a phenomenon specific to deep neural networks, where sampling error between batches can induce large changes in the distribution of internal activations, especially for those activations deeper in the model. That said, this is not a distribution shift in the classical sense, which refers to a change in $P(x, y)$.

IV. Conclusion

Techniques for mitigating distribution shift include data augmentation, adversarial training, regularization techniques like dropout, domain adaptation, model calibration, mechanistic anomaly detection, batch normalization for internal covariate shift, online learning, and even simple interventions like using larger, pre-trained models.

In this document, we discussed the importance of distribution shift in machine learning, its causes, and strategies for mitigating its effects. We also defined key terms such as distribution shift, out-of-distribution, and train-test mismatch. Addressing distribution shift