Most results under the umbrella of "deep learning theory" are not actually deep, about learning, or even theories.
This is because classical learning theory makes the wrong assumptions, takes the wrong limits, uses the wrong metrics, and aims for the wrong objectives. Learning theorists are stuck in a rut of one-upmanship, vying for vacuous bounds that don't say anything about any systems of actual interest.
Yudkowsky tweeting about statistical learning theorists. (Okay, not really.)
In particular, I'll argue throughout this sequence that:
Empirical risk minimization is the wrong framework, and risk is a weak foundation.
In approximation theory, the universal approximation results are too general (they do not constrain efficiency) while the "depth separation" results meant to demonstrate the role of depth are too specific (they involve constructing contrived, unphysical target functions).
Generalization theory has only two tricks, and they're both limited:
Uniform convergence is the wrong approach, and model class complexities (VC dimension, Rademacher complexity, and covering numbers) are the wrong metric. Understanding deep learning requires looking at the microscopic structure within model classes.
Robustness to noise is an imperfect proxy for generalization, and techniques that rely on it (margin theory, sharpness/flatness, compression, PAC-Bayes, etc.) are oversold.
Optimization theory is a bit better, but training-time guarantees involve questionable assumptions, and the obsession with second-order optimization is delusional. Also, the NTK is bad. Get over it.
At a higher level, the obsession with deriving bounds for approximation/generalization/learning behavior is misguided. These bounds serve mainly as political benchmarks rather than a source of theoretical insight. More attention should go towards explaining empirically observed phenomena like double descent (which, to be fair, is starting to happen).
That said, there are new approaches that I'm more optimistic about. In particular, I think that singular learning theory (SLT) is the most likely path to lead to a "theory of deep learning" because it (1) has stronger theoretical foundations, (2) engages with the structure of individual models, and (3) gives us a principled way to bridge between this microscopic structure and the macroscopic properties of the model class1. I expect the field of mechanistic interpretability and the eventual formalism of phase transitions and "sharp left turns" to be grounded in the language of SLT.
Why theory?
A mathematical theory of learning and intelligence could form a valuable tool in the alignment arsenal, one that helps us:
The wrong theory could mislead us. Just as theory tells us where to look, it also tells us where not to look. The wrong theory could cause us to neglect important parts of the problem.
It could be one prolonged nerd-snipe that draws attention and resources away from other critical areas in the field. Brilliant string theorists aren't exactly advancing living standards or technology by computing the partition functions of black holes in 5D de Sitter space.2
All that said, I think the benefits currently outweigh the risks, especially if we put the right infosec policies in place if and when learning theory starts showing signs of any practical utility. It's fortunate, then, that we haven't seen those signs yet.
Outline
My aims are:
To discourage other alignment researchers from wasting their time.
To argue for what makes singular learning theory different and why I think it the likeliest contender for an eventual grand unified theory of learning.
To invoke Cunningham's law — i.e., to get other people to tell me where I'm wrong and what I've been missing in learning theory.
There's also the question of integrity: if I am to criticize an entire field of people smarter than I am, I had better present a strong argument and ample evidence.
The sequence follows the three-fold division of approximation, generalization, and optimization preferred by the learning theorists. There's an additional preface on why empirical risk minimization is flawed (up next) and an epilogue on why singular learning theory seems different.
This sequence was inspired by my worry that I had focused too singularly on singular learning theory. I went on a journey through the broader sea of "learning theory" hopeful that I would find other signs of useful theory. My search came up mostly empty, which is why I decided to write >10,000 words on the subject. ↩
Though, granted, string theory keeps on popping up in other branches like condensed matter theory, where it can go on to motivate practical results in material science (and singular learning theory, for that matter). ↩
I haven't gone through all of these sources in equal detail, but the content I cover is representative of what you'll learn in a typical course on deep learning theory. ↩
A big thank you to all of the people who gave me feedback on this post: Edmund Lao, Dan Murfet, Alexander Gietelink Oldenziel, Lucius Bushnaq, Rob Krzyzanowski, Alexandre Variengen, Jiri Hoogland, and Russell Goyder.
Statistical learning theory is lying to you: "overparametrized" models actually aren't overparametrized, and generalization is not just a question of broad basins.
The standard explanation thrown around here for why neural networks generalize well is that gradient descent settles in flat basins of the loss function. On the left, in a sharp minimum, the updates bounce the model around. Performance varies considerably with new examples. On the right, in a flat minimum, the updates settle to zero. Performance is stabler under perturbations.
To first order, that's because loss basins actually aren't basins but valleys, and at the base of these valleys lie "rivers" of constant, minimum loss. The higher the dimension of these minimum sets, the lower the effective dimensionality of your model.1 Generalization is a balance between expressivity (more effective parameters) and simplicity (fewer effective parameters).
Symmetries lower the effective dimensionality of your model. In this example, a line of degenerate points effectively restricts the two-dimensional loss surface to one dimension.
In particular, it is the singularities of these minimum-loss sets — points at which the tangent is ill-defined — that determine generalization performance. The remarkable claim of singular learning theory (the subject of this post) is that "knowledge … to be discovered corresponds to singularities in general" [1]. Complex singularities make for simpler functions that generalize further.
The central claim of singular learning theory is that the singularities of the set of minima of the loss function determine learning behavior and generalization. Models close to more complex singularities generalize further.
Mechanistically, these minimum-loss sets result from the internal symmetries of NNs2: continuous variations of a given network's weights that implement the same calculation. Many of these symmetries are "generic" in that they are predetermined by the architecture and are always present. The more interesting symmetries are non-generic symmetries, which the model can form or break during training.
In terms of these non-generic symmetries, the power of NNs is that they can vary their effective dimensionality. Generality comes from a kind of internal model selection in which the model finds more complex singularities, which use fewer effective parameters, which favor simpler functions that generalize further.
At the risk of being elegance-sniped, SLT seems like a promising route to develop a better understanding of generalization and the dynamics of training. If we're lucky, SLT may even enable us to construct a grand unified theory of scaling.
A lot still needs to be done (in terms of actual calculations, the theorists are still chewing on one-layer tanh models), but, from an initial survey, singular learning theory feels meatier than other explanations of generalization. It's more than just meatiness; there's a sense in which singular learning theory is a non-negotiable prerequisite for any theory of deep learning. Let's dig in.
Back to the Bayes-ics
Singular learning theory begins with four things:
The "truth", q(x), which is some distribution that is generating our samples;
A model, p(x∣w), parametrized by weights w∈W⊂Rd, where W is compact;
A prior over weights, φ(w);
And a dataset of samples Dn={X1,…,Xn}, where each random variable Xi is i.i.d. according to q(x).
The low-level aim of "learning" is to find the optimal weights, w, for the given dataset. For good Bayesians, this has a very specific and constrained meaning:
$$p(w \mid D_n) = \frac{p(D_n \mid w)\,\varphi(w)}{p(D_n)}.$$
The higher-level aim of "learning" is to find the optimal model, p(x∣w), for the given dataset. Rather than try to find the weights that maximize the likelihood or even the posterior, the true aim of a Bayesian is to find the model that maximizes the model evidence,
$$p(D_n) = \int_W p(D_n \mid w)\,\varphi(w)\,dw.$$
The fact that the Bayesian paradigm can integrate out its weights to make statements over entire model classes is one of its main strengths. The fact that this integral is almost always intractable is one of its main weaknesses. So the Bayesians make a concession to the frequentists with a much more tractable Laplace approximation: we find a choice of weights, w^(0), that maximizes the likelihood and then approximate the distribution as Gaussian in the vicinity of that point.
The Laplace Approximation is just a probability theorist's (second-order) Taylor expansion.
This is justified on the grounds that as the dataset grows (n→∞), thanks to the central limit theorem, the posterior becomes asymptotically normal (cf. physicists and their "every potential is a harmonic oscillator if you look closely enough / keep on lowering the temperature").
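To make the approximation explicit (a standard sketch, using only quantities already defined): expand the negative log of the integrand to second order around its minimizer w^(0) (which, for large n, is essentially the maximum-likelihood point),

$$-\log\big[p(D_n \mid w)\,\varphi(w)\big] \approx -\log\big[p(D_n \mid w^{(0)})\,\varphi(w^{(0)})\big] + \tfrac{1}{2}(w - w^{(0)})^\top H\,(w - w^{(0)}),$$

where H is the Hessian of the negative log integrand at w^(0) (the first-order term vanishes at the minimum). The remaining Gaussian integral is exact, giving

$$p(D_n) \approx p(D_n \mid w^{(0)})\,\varphi(w^{(0)})\,(2\pi)^{d/2}\,|\det H|^{-1/2}.$$

Everything here presumes det H ≠ 0; keep that in mind for later.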
From this approximation, a bit more math leads us to the following asymptotic form for the negative log evidence (in the limit n→∞):

$$-\log p(D_n) \approx -\log p(D_n \mid w^{(0)}) + \frac{d}{2}\log n.$$
This formula is known as the Bayesian Information Criterion (BIC), and it (like the related Akaike information criterion) formalizes Occam's razor in the language of Bayesian statistics. We can end up with models that perform worse as long as they compensate by being simpler. (For the algorithmic-complexity-inclined, the BIC has an alternate interpretation as a device for minimizing the description length in an optimal coding context.)
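To see the Occam trade-off in action, here's a minimal sketch (synthetic data, hypothetical setup) comparing a linear fit against a needlessly flexible quintic fit; the Gaussian-noise BIC is the maximized negative log likelihood plus the (d/2) log n penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1, 1, n)
y = 2.0 * x + rng.normal(scale=0.3, size=n)   # the "truth" is linear

def bic(degree):
    # Least-squares polynomial fit with `degree + 1` parameters.
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid ** 2)
    # Maximized Gaussian negative log likelihood + (d/2) log n penalty.
    nll = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return nll + 0.5 * (degree + 1) * np.log(n)

for degree in (1, 5):
    print(f"degree {degree}: BIC = {bic(degree):.1f}")
# The quintic fits the training data better, but its extra parameters
# usually cost more (via the log n penalty) than the fit improvement buys.
```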
Unfortunately, the BIC is wrong. Or at least the BIC doesn't apply for any of the models we actually care to study. Fortunately, singular learning theory can compute the correct asymptotic form and reveal its much broader implications.
Statistical learning theory is built on a lie
The key insight of Watanabe is that when the parameter-function map,
$$W \ni w \mapsto p(\cdot \mid w)$$
is not one-to-one, things get weird. That is, when different choices of weights can implement the same function, the tooling of conventional statistical learning theory breaks down. We call such models "non-identifiable".
When the parameter-function map is not one-to-one, the right object of study is not parameter space but function/distribution space.
Take the example of the Laplace approximation. If there's a local continuous symmetry in weight space, i.e., some direction you can walk that doesn't affect the probability density, then your density isn't locally Gaussian.
The Laplace approximation breaks down when there is a direction of perfect flatness.
Even if the symmetries are non-continuous, the model will not in general be asymptotically normal. In other words, the standard central limit theorem does not hold.
The same problem arises if you're looking at loss landscapes in standard presentations of machine learning. Here, you'll find attempts to measure basin volume by fitting a paraboloid to the Hessian of the loss landscape at the final trained weights. It's the same trick, and it runs into the same problem.
This isn't the kind of thing you can just solve by adding a small ϵ to the Hessian and calling it a day. There are ways to recover "volumes", but they require care. So, as a practical takeaway, if you ever find yourself adding ϵ to make your Hessians invertible, recognize that those zero directions are important to understanding what's really going on in the network. Offer those eigenvalues the respect they deserve.
Adding epsilon to fudge your paraboloids is a dirty, insidious practice.
The consequence of these zeros (and, yes, they really exist in NNs) is that they reduce the effective dimensionality of your model. A step in these directions doesn't change the actual model being implemented, so you have fewer parameters available to "do things" with.
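Here's a minimal sketch of what such a zero direction looks like (a toy two-parameter "model" where only the product of the weights matters, not a real network): the Hessian at a minimum has an exactly zero eigenvalue, and adding ε to it would just disguise the flat direction.

```python
import numpy as np

# Toy loss L(a, b) = (a * b - 1)^2: only the product a*b matters,
# so moving along (a, b) -> (t*a, b/t) leaves the loss unchanged.
def loss(w):
    a, b = w
    return (a * b - 1.0) ** 2

def hessian(f, w, eps=1e-4):
    # Central finite differences, good enough for a 2D toy example.
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            w_pp = np.array(w, float); w_pp[i] += eps; w_pp[j] += eps
            w_pm = np.array(w, float); w_pm[i] += eps; w_pm[j] -= eps
            w_mp = np.array(w, float); w_mp[i] -= eps; w_mp[j] += eps
            w_mm = np.array(w, float); w_mm[i] -= eps; w_mm[j] -= eps
            H[i, j] = (f(w_pp) - f(w_pm) - f(w_mp) + f(w_mm)) / (4 * eps ** 2)
    return H

w_star = np.array([2.0, 0.5])   # a point on the minimum-loss set {a*b = 1}
eigvals = np.linalg.eigvalsh(hessian(loss, w_star))
print(eigvals)  # one eigenvalue ~ 0: a flat direction along the valley
```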
So the basic problem is this: almost all of the models we actually care about (not just neural networks, but Bayesian networks, HMMs, mixture models, Boltzmann machines, etc.) are loaded with symmetries, and this means we can't apply the conventional tooling of statistical learning theory.
Learning is physics with likelihoods
Let's rewrite our beloved Bayes' update as follows,
$$p(w \mid D_n) = \frac{1}{Z_n}\,\varphi(w)\,e^{-n\beta L_n(w)},$$
where Ln(w) is the negative log likelihood,
$$L_n(w) := -\frac{1}{n}\log p(D_n \mid w) = -\frac{1}{n}\sum_{i=1}^n \log p(x_i \mid w),$$
and Zn is the model evidence,
$$Z_n := p(D_n) = \int_W \varphi(w)\,e^{-n\beta L_n(w)}\,dw.$$
Notice that we've also snuck in an inverse "temperature", β>0, so we're now in the tempered Bayes paradigm [4].
The immediate aim of this change is to emphasize the link with physics, where Zn is the preferred notation (and "partition function" the preferred name). The information theoretic analogue of the partition function is the free energy,
$$F_n := -\log Z_n,$$
which will be the central object of our study.
Under the definition of a Hamiltonian (or "energy function"),
$$H_n(w) := n L_n(w) - \frac{1}{\beta}\log\varphi(w),$$
the translation is complete: statistical learning theory is just mathematical physics where the Hamiltonian is the random process given by the log likelihood ratio function. Just as the geometry of the energy landscape determines the behavior of the physical systems we study, the geometry of the log likelihood ends up determining the behavior of the learning systems we study.
In terms of this physical interpretation, the a posteriori distribution is the equilibrium state corresponding to this empirical Hamiltonian. The importance of the free energy is that it is the minimum of the free energy (not of the Hamiltonian) that determines the equilibrium.
Our next step will be to normalize these quantities of interest to make them easier to work with. For the negative log likelihood, this means subtracting its minimum value.3
The empirical Kullback-Leibler divergence is just a rescaled and shifted version of the negative log likelihood. Maximum likelihood estimation is equivalent to minimizing the empirical KL divergence.
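Written out explicitly (this is Watanabe's notation, assuming realizability as in the footnote): the minimum value being subtracted is the empirical entropy

$$S_n := -\frac{1}{n}\sum_{i=1}^n \log q(X_i),$$

and the normalized negative log likelihood is the empirical KL divergence

$$K_n(w) := L_n(w) - S_n = \frac{1}{n}\sum_{i=1}^n \log\frac{q(X_i)}{p(X_i \mid w)},$$

whose population counterpart is the KL divergence K(w) := ∫ q(x) log[q(x)/p(x∣w)] dx, the quantity whose zero set we study below.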
Similarly, we normalize the partition function to get
$$Z_n^0 := \frac{Z_n}{\prod_{i=1}^n q(X_i)^\beta},$$
and the free energy to get
$$F_n^0 := -\log Z_n^0.$$
This lets us rewrite the posterior as
$$p(w \mid D_n) = \frac{1}{Z_n^0}\,\varphi(w)\,e^{-n\beta K_n(w)}.$$
The more important aim of this conversion is that now the minima of the term in the exponent, K(w), are equal to 0. If we manage to express K(w) as a polynomial, this lets us pull in the powerful machinery of algebraic geometry, which studies the zeros of polynomials. We've turned our problem of probability theory and statistics into a problem of algebra and geometry.
Why "singular"?
Singular learning theory is "singular" because the "singularities" (where the tangent is ill-defined) of the set of your loss function's minima,
$$W_0 := \{w_0 \in W \mid K(w_0) = 0\},$$
determine the asymptotic form of the free energy. Mathematically, W0 is an algebraic variety: roughly, a manifold that is allowed to have singularities, points where it need not be locally Euclidean.
Example of the curve y² = x² + x³ (equivalently, the algebraic variety of the polynomial f(x, y) = x² + x³ − y²). There's a singularity at the origin. [Source]
By default, it's difficult to study these varieties close to their singularities. In order to do so anyway, we need to "resolve the singularities": we construct another, well-behaved geometric object whose "shadow" is the original object, in such a way that the new object keeps all the essential features of the original.
It'll help to take a look at the following figure. The idea behind resolution of singularities is to create a new manifold U and a map g : U → W such that K(g(u)) becomes, in the local coordinates of U, a normal crossing (essentially, a simple monomial). We "disentangle" the singularities so that in our new coordinates they cross "normally".
Based on Figure 2.5 of [1]. The lines represent the points that are in W0. The colors are just there to help you keep track of the points.
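For the curve pictured earlier, y² = x² + x³, a single blow-up of the origin already does the job; here's the standard computation in one coordinate chart. Substitute x = u, y = uv:

$$y^2 - x^2 - x^3 = u^2 v^2 - u^2 - u^3 = u^2\,(v^2 - 1 - u).$$

In the (u, v) coordinates the zero set splits into the exceptional divisor {u = 0} and the smooth curve {v² = 1 + u}, which now meet transversally at v = ±1: the self-crossing at the origin has been pulled apart into normal crossings.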
Because this "blow up" creates a new object, we have to be careful that the quantities we end up measuring don't change with the mapping — we want to find the birational invariants.
We are interested in one birational invariant in particular: the real log canonical threshold (RLCT). Roughly, this measures how "bad" a singularity is. More precisely, it measures the "effective dimensionality" near the singularity.
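To make this concrete, one standard way to define the RLCT is through the zeta function ζ(z) = ∫ K(w)^z φ(w) dw: after meromorphic continuation, its largest pole is −λ, and the order of that pole is the multiplicity m. A toy two-parameter illustration (uniform prior on [0,1]², chosen only for illustration):

$$K(w) = w_1^2 + w_2^2:\quad \zeta(z) \propto \int_0^1 r^{2z+1}\,dr = \frac{1}{2z+2} \;\Rightarrow\; \lambda = 1 = \tfrac{d}{2},\; m = 1,$$

$$K(w) = w_1^2 w_2^2:\quad \zeta(z) = \left(\int_0^1 w^{2z}\,dw\right)^{2} = \frac{1}{(2z+1)^2} \;\Rightarrow\; \lambda = \tfrac{1}{2} < \tfrac{d}{2},\; m = 2.$$

Both models have d = 2 parameters, but the singular one behaves as if it had roughly one effective parameter: a "worse" singularity means a smaller λ.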
After fixing the central limit theorem to work in singular models, Watanabe goes on to derive the asymptotic form of the free energy as n→∞,

$$F_n = n\beta S_n + \lambda \log n - (m - 1)\log\log n + F^R(\xi) + o_p(1),$$
where λ is the RLCT, m is the "multiplicity" associated to the RLCT, F^R(ξ) is a random variable, and o_p(1) is a random variable that converges (in probability) to zero.
The important observation here is that the global behavior of your model is dominated by the local behavior of its "worst" singularities.
For regular (=non-singular) models, the RLCT is d/2, and with the right choice of inverse temperature, the formula above simplifies to
$$F_n \approx n S_n + \frac{d}{2}\log n \qquad \text{(for regular models)},$$
which is just the BIC, as expected.
The free energy formula generalizes the BIC from classical learning theory to singular learning theory, which strictly includes regular learning theory as a special case. We see that singularities act as a kind of implicit regularization that penalizes models with higher effective dimensionality.
Phase transitions are singularity manipulations
Minimizing the free energy is maximizing the model evidence, which, as we saw, is the preferred Bayesian way of doing model selection. Other paradigms may disagree4, but at least among us this makes minimizing the free energy the central aim of statistical learning.
As in statistical learning, so in physics.
In physical systems, we distinguish microstates, such as the particular position and speed of every particle in a gas, from macrostates, such as the values of the volume and pressure. The fact that the mapping from microstates to macrostates is not one-to-one is the starting point for statistical physics: uniform distributions over microstates lead to much more interesting distributions over macrostates.
Often, we're interested in how continuously varying our levers (like temperature or the positions of the walls containing our gas) leads to discontinuous changes in the macroscopic parameters. We call these changes phase transitions.
The free energy is the central object of study because its derivatives generate the quantities we care about (like entropy, heat capacity, and pressure). So a phase transition means a discontinuity in one of the free energy's derivatives.
So too, in the setting of Bayesian inference, the free energy generates the quantities we care about, which are now quantities like the expected generalization loss,
$$G_n = \mathbb{E}_{X_{n+1}}[F_{n+1}] - F_n.$$
Except for the fact that the number of samples, n, is discrete, this is just a derivative.5
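Heuristically treating n as continuous and differentiating Watanabe's free energy formula from above gives

$$G_n \approx \frac{\partial F_n}{\partial n} \approx \beta S_n + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),$$

so, past the irreducible entropy term, the expected generalization loss decays like λ/n: a smaller RLCT (lower effective dimensionality, "worse" singularity) means a faster learning curve.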
So too, in learning, we're interested in how continuously changing either the model or the truth leads to discrete changes in the functions we implement and, thereby, to discontinuities in the free energy and its derivatives.
One way to subject this question to investigation is to study how our models change when we restrict them to some subset of parameter space, W^(i) ⊂ W. What happens as we vary this subset?
Recall that the free energy is defined as the negative log of the partition function. When we restrict ourselves to W^(i), we derive a restricted free energy,

$$F_n(W^{(i)}) := -\log \int_{W^{(i)}} \varphi(w)\,e^{-n\beta L_n(w)}\,dw,$$
which has a completely analogous asymptotic form (after swapping out the integrals over all of weight space with integrals over just this subset). The important difference is that the RLCT in this equation is the RLCT associated to the largest singularity in W(i) rather than the largest singularity in W.
What we see, then, is that phase transitions during learning correspond to discrete changes in the geometry of the "local" (=restricted) loss landscape. The expected behavior for models in these sets is determined by the largest nearby singularities.
In a Bayesian learning process, the singularity becomes progressively simpler with more data. In general, learning processes involve trading off a more accurate fit against "regularizing" singularities. Based on Figure 7.6 in [1].
In this light, the link with physics is not just the typical arrogance of physicists asserting themselves on other people's disciplines. The link goes much deeper.
Physicists have known for decades that the macroscopic behavior of the systems we care about is the consequence of critical points in the energy landscape: global behavior is dominated by the local behavior of a small set of singularities. This is true everywhere from statistical physics and condensed matter theory to string theory. Singular learning theory tells us that learning machines are no different: the geometry of singularities is fundamental to the dynamics of learning and generalization.
Neural networks are freaks of symmetries
The trick behind why neural networks generalize so well is something like their ability to exploit symmetry. Many models take advantage of the parameter-function map not being one-to-one. Neural networks take this to the next level.
There are discrete permutation symmetries, where you can flip two columns in one layer as long as you flip the two corresponding rows in the next layer. More generally, for an elementwise nonlinearity σ and any permutation matrix P,

$$W_2\,\sigma(W_1 x) = (W_2 P^\top)\,\sigma(P W_1 x).$$
There are scaling symmetries associated to ReLU activations,
$$\mathrm{ReLU}(x) = \frac{1}{\alpha}\,\mathrm{ReLU}(\alpha x), \qquad \alpha > 0,$$
and associated to layer norm,
$$\mathrm{LayerNorm}(\alpha x) = \mathrm{LayerNorm}(x), \qquad \alpha > 0.$$
(Note: these are often broken by the presence of regularization.)
And there's a GL_n symmetry associated to the residual stream: you can multiply the embedding matrix by any invertible matrix as long as you apply the inverse of that matrix before the attention blocks, the MLP layers, and the unembedding layer, and apply the matrix itself after each attention block and MLP layer.
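To make the generic symmetries above concrete, here's a minimal numerical sketch (a random two-layer ReLU network with made-up sizes, no biases) checking that permuting hidden units and rescaling a ReLU unit's incoming/outgoing weights leave the computed function unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))
x = rng.normal(size=d_in)

relu = lambda z: np.maximum(z, 0.0)
f = lambda A, B: B @ relu(A @ x)   # two-layer ReLU network, no biases

# Permutation symmetry: permute the hidden rows of W1 and the matching columns of W2.
perm = rng.permutation(d_hidden)
print(np.allclose(f(W1, W2), f(W1[perm], W2[:, perm])))   # True

# ReLU scaling symmetry: scale a hidden unit's incoming weights by alpha > 0
# and its outgoing weights by 1/alpha.
alpha = 3.7
W1s, W2s = W1.copy(), W2.copy()
W1s[0] *= alpha
W2s[:, 0] /= alpha
print(np.allclose(f(W1, W2), f(W1s, W2s)))                # True
```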
But these symmetries aren't actually all that interesting. That's because they're generic. They're always present for any choice of w. The more interesting symmetries are non-generic symmetries that depend on w.
It's the changes in these symmetries that correspond to phase transitions in the posterior; this is the mechanism by which neural networks are able to change their effective dimensionality.
These non-generic symmetries include things like a degenerate node symmetry, which is the well-known case in which a weight is equal to 0 and performs no work, and a weight annihilation symmetry in which multiple weights are non-zero but combine to have an effective weight of zero.
The consequence is that even if our optimizers are not performing explicit Bayesian inference, these non-generic symmetries allow the optimizers to perform a kind of internal model selection. There's a trade-off between lower effective dimensionality and higher accuracy that is subject to the same kinds of phase transitions as discussed in the previous section.
The dynamics may not be exactly the same, but it is still the singularities and geometric invariants of the loss landscape that determine the dynamics.
Discussion and limitations
All of the preceding discussion holds in general for any model where the parameter-function mapping is not one-to-one. When this is the case, singular learning theory is less a series of interesting and debate-worthy conjectures than a necessary frame.
The more relevant question is whether this theory actually tells us anything useful in practice. Quantities like the RLCT are exceedingly difficult to calculate for realistic systems, so can we actually put this theory to use?
I'd say the answer is a tentative yes. Results so far suggest that the predictions of SLT hold up to experimental scrutiny — the predicted phase transitions are actually observable for small toy models.
That's not to say there aren't limitations. I'll list a few from here and a few of my own.
Before we get to my real objections, here are a few objections I think aren't actually good objections:
But we care about function-approximation. This whole discussion is couched in a very probabilistic context. In practice, we're working with loss functions and are approximating functions, not densities. I don't think this is much of a problem, as it's usually possible to recover your Bayesian footing in deterministic function approximation. Even when this isn't the case, the general claim — that the geometry of singularities determines dynamics — seems pretty robust.
But we don't even train to completion! (/We're not actually reaching the minimum loss solutions). I expect most of the results to hold for any level set of the loss landscape — we'll just be interested in the dominant singularities of the level sets we end up in (even if they don't perfectly minimize the loss).
But calculating (and even approximating) the RLCT is pretty much intractable. In any case, knowing of something's theoretical existence can often help us out on what may initially seem like unrelated turf. A more optimistic counter would be something like "maybe we can compute this for simple one-layer neural networks, and then find a straightforward iterative scheme to extend it to deeper layers." And that really doesn't seem all too unreasonable — when I see all the stuff physicists can squeeze out of nature, I'm optimistic about what learning theorists can squeeze out of neural networks.
But how do you adapt the results from tanh to realistic activations like swishes? In the same way that many of the universal approximation theorems don't depend on the particulars of your activation function, I don't expect this to be a major objection to the theory.
But ReLU networks are not analytic. Idk man, seems unimportant.
But what do asymptotic limits in n actually tell us about the finite case? I guess it's my background in statistical physics, but I'd say that a few trillion tokens is a heck of a lot closer to infinity than it is to zero. In all seriousness, physics has a long history of success with finite-size scaling and perturbative expansions around well-behaved limits, and I expect these to transfer.
But isn't this all just a fancy way of saying it was broad basins this entire time? Yeah, so I owe you an apology for all the Hessian-shaming and introduction-clickbaiting. In practice, I do expect small eigenvalues to be a useful proxy for how well specific models can generalize — less useful than the zeros, but not nothing. Overall, the question SLT answers seems to be a different question: it's about why we should expect models on average (and up to higher-order moments) to generalize.
But what does Bayesian inference actually have to do with SGD and its variants? This complaint seems rather important especially since I'm not sold on the whole NNs-are-doing-Bayesian-inference thing. I think it's conceivable that we can find a way to relate any process that decreases free energy to the predictions here, but this does remain my overall biggest source of doubt.
But the true distribution is not realizable. For the above presentation, we assumed there is some choice of parameters w0 such that p(x∣w0) is equal to q(x) almost everywhere (this is "realizability" or "grain of truth"). In real-world systems, this is never the case. For renormalizable6 models, extending the results to the non-realizable case turns out to be not too difficult. For non-renormalizable theories, we're in novel territory.
Where Do We Go From Here?
I hope you've enjoyed this taster of singular learning theory and its insights: the sense of learning theory as physics with likelihoods, of learning as the thermodynamics of loss, of generalization as the presence of singularity, and of the deep, universal relation between global behavior and the local geometry of singularities.
The work is far from done, but the possible impact for our understanding of intelligence is profound.
To close, let me share one of the directions I find most exciting — that of singular learning theory as a path towards predicting the scaling laws we see in deep learning models [5].
There's speculation that we might be able to transfer the machinery of the renormalization group, a set of techniques and ideas developed in physics to deal with critical phenomena and scaling, to understand phase transitions in learning machines, and ultimately to compute the scaling coefficients from first principles.
It is truly remarkable that resolution of singularities, one of the deepest results in algebraic geometry, together with the theory of critical phenomena and the renormalisation group, some of the deepest ideas in physics, are both implicated in the emerging mathematical theory of deep learning. This is perhaps a hint of the fundamental structure of intelligence, both artificial and natural. There is much to be done!
The dimensionality of the optimal parameters also depends on the true distribution generating your data, but even if the set of optimal parameters is zero-dimensional, the presence of level sets elsewhere can still affect learning and generalization. ↩
To be precise, this rests on the assumption of realizability — that there is some weight w0 for which p(x∣w0) equals q(x) almost everywhere. In this case, the minimum value of the negative log likelihood is the empirical entropy. ↩
So n is really a kind of inverse temperature, like β. Increasing the number of samples decreases the effective temperature, which brings us closer to the (degenerate) ground state. ↩
A word with a specific technical sense but that is related to renormalization in statistical physics. ↩
No, human brains are not more efficient than computers
Epistemic status: grain of salt. There's lots of uncertainty in how many FLOP/s the brain can perform.
In informal debate, I've regularly heard people say something like, "oh but brains are so much more efficient than computers" (followed by a variant of "so we shouldn't worry about AGI yet"). Putting aside the weakly argued AGI skepticism, brains actually aren't all that much more efficient than computers (at least not in any way that matters).
The first problem is that these people are usually comparing the energy requirements of training large AI models to the power requirements of running the normal waking brain. These two things don't even have the same units.
The only fair comparison is between the trained model and the waking brain or between training the model and training the brain. Training the brain is called evolution, and evolution isn't particularly known for its efficiency.
Let's start with the easier comparison: a trained model vs. a trained brain. Joseph Carlsmith estimates that the brain delivers roughly 1 petaFLOP/s (= 10^15 floating-point operations per second)1. If you eat a normal diet, you're expending roughly 10^-13 J/FLOP.
Meanwhile, the supercomputer Fugaku delivers 450 petaFLOP/s at 30 MW, which comes out to about 10^-10.5 J/FLOP… So I was wrong? Computers require almost 500 times more energy per FLOP than humans?
[Chart: energy per FLOP, human vs. supercomputer]
What this misses is an important practical point: supercomputers can tap pretty much directly into sunshine; human food calories are heavily-processed hand-me-downs. We outsource most of our digestion to mother nature and daddy industry.
Even the most whole-foods-grow-your-own-garden vegan is 2-3 orders of magnitude less efficient at capturing calories from sunlight than your average device2. That's before animal products, industrial processing, or any of the other Joules it takes to run a modern human.
After this correction, humans and computers are about head-to-head in energy/FLOP, and it's only getting worse for us humans. The fact that the brain runs on so little actual juice suggests there's plenty of room left for us to explore specialized architectures, but it isn't the damning case many think it is. (We're already seeing early neuromorphic chips out-perform neurons' efficiency by four orders of magnitude.)
[Chart: sunlight-to-compute efficiency, biological vs. electronic]
But what about training neural networks? Now that we know the energy costs per FLOP are about equal, all we have to do is compare FLOPs required to evolve brains to the FLOPs required to train AI models. Easy, right?
Here's how we'll estimate this:
For a given, state-of-the-art NN (e.g., GPT-3, PaLM), determine how many FLOP/s it performs when running normally.
Find a real-world brain which performs a similar number of FLOP/s.
Determine how long that real-world brain took to evolve.
Compare the number of FLOPs (not FLOP/s) performed during that period to the number of FLOPs required to train the given AI.
Going off Wikipedia, social insects evolved only about 150 million years ago. That translates to between 10^38 and 10^44 FLOPs. GPT-3, meanwhile, took about 10^23.5 FLOPs. That means evolution is 10^15 to 10^22 times less efficient.
[Chart: log10(total FLOPs to evolve bee brains)]
Now, you may have some objections. You may consider bees to be significantly more impressive than GPT-3. You may want to select a reference animal that evolved earlier in time. You may want to compare unadjusted energy needs. You may even point out the fact that the Chinchilla results suggest GPT-3 was "significantly undertrained".
Object all you want, and you still won't be able to explain away the >15 OOM gap between evolution and gradient descent. This is no competition.
Brains are not magic. They're messy wetware, and hardware has caught up.
Postscript: brains actually might be magic. Carlsmith assigns less than 10% (but non-zero) probability that the brain computes more than 10^21 FLOP/s. In this case, brains would currently still be vastly more efficient, and we'd have to update in favor of additional theoretical breakthroughs before AGI.
If we include the uncertainty in brain FLOP/s, the graph looks more like this:
```
brainEnergyPerFlop = {
  humanBrainFlops = 15; // 10 to 23; Median 15; P(>21) < 10%
  humanBrainFracEnergy = 0.2;
  humanEnergyPerDay = 8000 to 10000; // Daily kJ consumption
  humanBrainPower = humanEnergyPerDay / (60 * 60 * 24); // kW
  humanBrainPower * 1000 / (10 ^ humanBrainFlops) // J / FLOP
}
supercomputerEnergyPerFlop = {
  // https://www.top500.org/system/179807/
  power = 25e6 to 30e6; // J
  flops = 450e15 to 550e15;
  power / flops
}
supercomputerEnergyPerFlop / brainEnergyPerFlop
```
```
humanFoodEfficiency = {
  photosynthesisEfficiency = 0.001 to 0.03
  trophicEfficiency = 0.1 to 0.15
  photosynthesisEfficiency * trophicEfficiency
}
computerEfficiency = {
  solarEfficiency = 0.15 to 0.20
  transmissionEfficiency = 1 - (0.08 to .15)
  solarEfficiency * transmissionEfficiency
}
computerEfficiency / humanFoodEfficiency
```
```
evolution = {
  // Based on Ajeya Cotra's "Forecasting TAI with biological anchors"
  // All calculations are in log space.
  secInYear = log10(365 * 24 * 60 * 60);
  // We assume that the average ancestor pop. FLOP per year is ~constant.
  // cf. Humans at 10 to 20 FLOP/s & 7 to 10 population
  ancestorsAveragePop = uniform(19, 23); // Tomasik estimates ~1e21 nematodes
  ancestorsAverageBrainFlops = 2 to 6; // ~ C. elegans
  ancestorsFlopPerYear = ancestorsAveragePop + ancestorsAverageBrainFlops + secInYear;
  years = log10(850e6) // 1 billion years ago to 150 million years ago
  ancestorsFlopPerYear + years
}
```
Watch out for FLOP/s (floating point operations per second) vs. FLOPs (floating point operations). I'm sorry for the source of confusion, but FLOPs usually reads better than FLOP. ↩
LessWrong has gotten big over the years: 31,260 posts, 299 sequences, and more than 120,000 users.1 It has budded offshoots like the alignment and EA forums and earned itself recognition as a "cult". Wonderful!
There is a dark side to this success: as the canon grows, it becomes harder to absorb newcomers (like myself).2 I imagine this was the motivation for the recently launched "highlights from the sequences".
There's built-in support to export notes & definitions to Anki, goodies for tracking your progress through the notes, useful metadata/linking, and pretty visualizations of rationality space…
It's not perfect — I'll be doing a lot of fine-tuning as I work my way through all the content — but there should be enough in place that you can find some value. I'd love to hear your feedback, and if you're interested in contributing, please reach out! I'll also soon be adding support for the AF and the EAF.
More generally, I'd love to hear your suggestions for new aspiring rationalists. For example, there was a round of users proposing alternative reading orders about a decade ago (by Academian, jimrandomh, and XiXiDu), which may be worth revisiting in 2022.
When you start collecting principles, a natural question arises: how to organize these principles? Clear organization is not just useful for quicker access but — when the collecting is crowd-sourced — critical to ensuring that the database of principles grows healthily and sustainably. We need a balance between the extremes of hairballs and orphan principles.
Now, there are books written on this subject, knowledge management (I promise, it's not nearly as dull (or settled) a subject as you might think). That said, one thing at a time. In this post, all I want to do is propose a few dimensions I think might be useful for classifying principles in the future.
Here they are:
Normative vs. Descriptive
Universal vs. Situational (or "First" and "Derived")
Deterministic vs. Stochastic
Normative and Descriptive
There's a big difference between principles that tell you how the world *is* and how it (or you) *should be*. The former are the domain of the traditional sciences. It's what we mean when we talk about principles and postulates in physics. The latter are the domain of decision theory/philosophy/etc.
There's a bridging principle between the two in that accomplishing any normative goals requires you to have an accurate descriptive view of how the world is. Still, in general, we can make a pretty clean break between these categories.
Universal and Situational ("First" and "Derived")
The universe looks different at different length scales: the discrete, quantum atoms at ångström scales give rise to continuous, classical fluids at meter scales, and might yet contain continuous strings at Planck lengths.
Physics gives us a formal way of linking the descriptive principles of one length scale to those of another—the renormalization group. This is a (meta-)principled approach to constructing "coarse-grained", higher-order principles out of base principles. In this way, the postulates of quantum gravity would give rise to those of classical mechanics, but also to those of chemistry, in turn biology, psychology, etc.
In general, the "first principles" in these chains of deduction tend to be more universal (and apply across a wider range of phenomena). Evolution doesn't just apply to biological systems but to any replicators, be they cultures, cancers, or memes.1
Deterministic and Stochastic
One of the main failure modes of a "principles-driven approach" is becoming overly rigid—seeing principles as ironclad laws that never change or break.
I believe one of the main reasons for this error is that we tend to think of principles as deterministic "rules". We tend to omit qualifiers like "usually", "sometimes", "occasionally" from our principles because they sound weaker. But randomness has a perfectly important role to play both in description (the quantum randomness of measurement or the effective randomness of chaotic systems) and in prescription (e.g., divination rituals may have evolved as a randomizing device to improve decision-making).
So we shouldn't shy away from statements like "play tit-for-tat with 5% leakiness", nor from less precise statements like "avoid refined sugars, but, hey, it's okay if you have a cheat day every once in a while because you also deserve to take it easy on yourself."
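For concreteness, here's a minimal sketch (in code, with the 5% figure from the example above) of what a stochastic principle looks like when written down as an explicit rule:

```python
import random

def leaky_tit_for_tat(opponent_last_move, leak=0.05):
    """Cooperate if the opponent cooperated last round; otherwise defect --
    except that 5% of the time we 'leak' and cooperate anyway, which helps
    break out of endless retaliation spirals."""
    if opponent_last_move == "cooperate":
        return "cooperate"
    return "cooperate" if random.random() < leak else "defect"
```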
A Few Examples
Using these classifications, we can make more thorough sense of the initial set of Open Principles divisions:
"Generic"/"situational" principles and "mental models" are descriptive principles that differ in how universal they are. "Values" and "virtues" are universal normative principles with "habits" as their derived counterparts. "Biases" are a specific type of derived descriptive principle reserved to the domain of agents.
A few more examples:
Call to Action
A few things that might help us keep the Open Principles healthy:
Decide what not to include as a principle. Constraints can be wonderfully liberating.
Define and contrast terms like axioms, postulates, laws, hypotheses, heuristics, biases, fallacies, aphorisms, adages, maxims, platitudes, etc.