Articles

0. The shallow reality of 'deep learning theory'

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

Most results under the umbrella of "deep learning theory" are not actually deep, about learning, or even theories.

This is because classical learning theory makes the wrong assumptions, takes the wrong limits, uses the wrong metrics, and aims for the wrong objectives. Learning theorists are stuck in a rut of one-upmanship, vying for vacuous bounds that don't say anything about any systems of actual interest.

*Yudkowsky tweeting about statistical learning theorists. (Okay, not really.)*

In particular, I'll argue throughout this sequence that:

  • Empirical risk minimization is the wrong framework, and risk is a weak foundation. (The standard setup being criticized here and in the generalization bullet is sketched just after this list.)
  • In approximation theory, the universal approximation results are too general (they do not constrain efficiency) while the "depth separation" results meant to demonstrate the role of depth are too specific (they involve constructing contrived, unphysical target functions).
  • Generalization theory has only two tricks, and they're both limited:
    • Uniform convergence is the wrong approach, and model class complexities (VC dimension, Rademacher complexity, and covering numbers) are the wrong metric. Understanding deep learning requires looking at the microscopic structure within model classes.
    • Robustness to noise is an imperfect proxy for generalization, and techniques that rely on it (margin theory, sharpness/flatness, compression, PAC-Bayes, etc.) are oversold.
  • Optimization theory is a bit better, but training-time guarantees involve questionable assumptions, and the obsession with second-order optimization is delusional. Also, the NTK is bad. Get over it.
  • At a higher level, the obsession with deriving bounds for approximation/generalization/learning behavior is misguided. These bounds serve mainly as political benchmarks rather than a source of theoretical insight. More attention should go towards explaining empirically observed phenomena like double descent (which, to be fair, is starting to happen).
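
For readers who haven't met these objects, here is my rough sketch of the standard setup being criticized: a loss $\ell$, a hypothesis class $\mathcal{H}$, samples drawn from a distribution $\mathcal{D}$, and a uniform bound on the gap between true and empirical risk:

$$R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(h(x), y)\big], \qquad \hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i),$$

$$\sup_{h \in \mathcal{H}} \big|R(h) - \hat{R}_n(h)\big| \;\lesssim\; \sqrt{\frac{\mathrm{complexity}(\mathcal{H}) + \log(1/\delta)}{n}} \quad \text{with probability at least } 1-\delta,$$

where the complexity term is a VC dimension, Rademacher complexity, or covering number. The complaint above is that these worst-case, class-level quantities say nothing about the particular networks we actually train.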

That said, there are new approaches that I'm more optimistic about. In particular, I think that singular learning theory (SLT) is the most likely path to lead to a "theory of deep learning" because it (1) has stronger theoretical foundations, (2) engages with the structure of individual models, and (3) gives us a principled way to bridge between this microscopic structure and the macroscopic properties of the model class¹. I expect the field of mechanistic interpretability and the eventual formalism of phase transitions and "sharp left turns" to be grounded in the language of SLT.

Why theory?

A mathematical theory of learning and intelligence could be a valuable tool in the alignment arsenal, one that helps us:

That's not to say that the right theory of learning is risk-free:

  • A good theory could inspire new capabilities. We didn't need a theory of mechanics to build the first vehicles, but we couldn't have gotten to the moon without it.
  • The wrong theory could mislead us. Just as theory tells us where to look, it also tells us where not to look. The wrong theory could cause us to neglect important parts of the problem.
  • It could be one prolonged nerd-snipe that draws attention and resources away from other critical areas in the field. Brilliant string theorists aren't exactly helping advance living standards or technology by computing the partition functions of black holes in 5D de Sitter space.²

All that said, I think the benefits currently outweigh the risks, especially if we put the right infosec policies in place if and when learning theory starts showing signs of any practical utility. It's fortunate, then, that we haven't seen those signs yet.

Outline

My aims are:

  • To discourage other alignment researchers from wasting their time.
  • To argue for what makes singular learning theory different and why I think it is the likeliest contender for an eventual grand unified theory of learning.
  • To invoke Cunningham's law — i.e., to get other people to tell me where I'm wrong and what I've been missing in learning theory.

There's also the question of integrity: if I am to criticize an entire field of people smarter than I am, I had better present a strong argument and ample evidence.

Throughout the rest of this sequence, I'll be drawing on notes I compiled from lecture notes by Telgarsky, Moitra, Grosse, Mossel, Ge, and Arora, books by Roberts et al. and Hastie et al., a festschrift for Chervonenkis, and a litany of articles.³

The sequence follows the three-fold division of approximation, generalization, and optimization preferred by the learning theorists. There's an additional preface on why empirical risk minimization is flawed (up next) and an epilogue on why singular learning theory seems different.


Footnotes

  1. This sequence was inspired by my worry that I had focused too singularly on singular learning theory. I went on a journey through the broader sea of "learning theory" hopeful that I would find other signs of useful theory. My search came up mostly empty, which is why I decided to write >10,000 words on the subject.

  2. Though, granted, string theory keeps on popping up in other branches like condensed matter theory, where it can go on to motivate practical results in materials science (and singular learning theory, for that matter).

  3. I haven't gone through all of these sources in equal detail, but the content I cover is representative of what you'll learn in a typical course on deep learning theory.

No, human brains are not more efficient than computers

Epistemic status: grain of salt. There's lots of uncertainty in how many FLOP/s the brain can perform.

In informal debate, I've regularly heard people say something like, "oh but brains are so much more efficient than computers" (followed by a variant of "so we shouldn't worry about AGI yet"). Putting aside the weakly argued AGI skepticism, brains actually aren't all that much more efficient than computers (at least not in any way that matters).

The first problem is that these people are usually comparing the energy requirements of training large AI models to the power requirements of running the normal waking brain. These two things don't even have the same units (energy is measured in joules; power in watts).

The only fair comparison is between the trained model and the waking brain or between training the model and training the brain. Training the brain is called evolution, and evolution isn't particularly known for its efficiency.

Let's start with the easier comparison: a trained model vs. a trained brain. Joseph Carlsmith estimates that the brain delivers roughly 1 petaFLOP/s ($=10^{15}$ floating-point operations per second)¹. If you eat a normal diet, you're expending roughly $10^{-13}$ J/FLOP.
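
To spell out the arithmetic (using the ~100 W whole-body power draw implied by the appendix's 8,000–10,000 kJ/day figure):

$$\frac{\sim 100\ \text{W}}{10^{15}\ \text{FLOP/s}} \approx 10^{-13}\ \text{J/FLOP}.$$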

Meanwhile, the supercomputer Fugaku delivers 450 petaFLOP/s at 30 MW, which comes out to about $5 \times 10^{-11}$ J/FLOP… So I was wrong? Computers require almost 500 times more energy per FLOP than humans?
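
Spelled out with the same ranges as the appendix snippet:

$$\frac{25\text{–}30\ \text{MW}}{450\text{–}550\ \text{PFLOP/s}} \approx 5\text{–}7 \times 10^{-11}\ \text{J/FLOP},$$

which is indeed roughly 500 times the human figure of $10^{-13}$ J/FLOP.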

$$\frac{\text{Supercomputer J/FLOP}}{\text{Human J/FLOP}}$$

![[Pasted image 20220906142829.png]]

What this misses is an important practical point: supercomputers can tap pretty much directly into sunshine; human food calories are heavily-processed hand-me-downs. We outsource most of our digestion to mother nature and daddy industry.

Even the most whole-foods-grow-your-own-garden vegan is 2-3 orders of magnitude less efficient at capturing calories from sunlight than your average device². That's before animal products, industrial processing, or any of the other Joules it takes to run a modern human.
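
Using the rough efficiencies from footnote 2:

$$\underbrace{\sim 1\%}_{\text{photosynthesis}} \times \underbrace{\sim 10\%}_{\text{trophic step}} \approx 0.1\% \qquad \text{vs.} \qquad \underbrace{\sim 20\%}_{\text{solar panel}} \times \underbrace{\sim 90\%}_{\text{transmission}} \approx 18\%,$$

a gap of a bit more than two orders of magnitude.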

After this correction, humans and computers are about head-to-head in energy/FLOP, and it's only getting worse for us humans. The fact that the brain runs on so little actual juice suggests there's plenty of room left for us to explore specialized architectures, but it isn't the damning case many think it is. (We're already seeing early neuromorphic chips outperform neurons' efficiency by four orders of magnitude.)

$$\frac{\text{Electronic efficiency}}{\text{Biological efficiency}}$$

![[Pasted image 20220906143040.png]]

But what about training neural networks? Now that we know the energy costs per FLOP are about equal, all we have to do is compare FLOPs required to evolve brains to the FLOPs required to train AI models. Easy, right?

Here's how we'll estimate this:

  1. For a given, state-of-the-art NN (e.g., GPT-3, PaLM), determine how many FLOP/s it performs when running normally.
  2. Find a real-world brain which performs a similar number of FLOP/s.
  3. Determine how long that real-world brain took to evolve.
  4. Compare the number of FLOPs (not FLOP/s) performed during that period to the number of FLOPs required to train the given AI.

Fortunately, we can piggyback off the great work done by Ajeya Cotra on forecasting "Transformative" AI. She calculates that GPT-3 performs about $10^{12}$ FLOP/s³, or about as much as a bee.

Going off Wikipedia, social insects evolved only about 150 million years ago. That translates to between $10^{38}$ and $10^{44}$ FLOPs. GPT-3, meanwhile, took about $10^{23.5}$ FLOPs. That means evolution is $10^{15}$ to $10^{22}$ times less efficient.
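
The structure of the estimate (see the evolution snippet in the appendix) is a sum in log space; plugging in roughly central values:

$$\log_{10}(\text{FLOPs}) \approx \underbrace{21}_{\text{avg. population}} + \underbrace{4}_{\text{FLOP/s per brain}} + \underbrace{7.5}_{\text{seconds per year}} + \underbrace{8.9}_{\sim 850\text{M years}} \approx 41,$$

comfortably inside the quoted $10^{38}$–$10^{44}$ range and roughly 18 orders of magnitude above GPT-3's $\sim 10^{23.5}$ FLOPs.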

$$\log_{10}\left(\text{total FLOPs to evolve bee brains}\right)$$

![[Pasted image 20220906143416.png]]

Now, you may have some objections. You may consider bees to be significantly more impressive than GPT-3. You may want to select a reference animal that evolved earlier in time. You may want to compare unadjusted energy needs. You may even point out that the Chinchilla results suggest GPT-3 was "significantly undertrained".

Object all you want, and you still won't be able to explain away the >15 OOM gap between evolution and gradient descent. This is no competition.

What about other metrics besides energy and power? Consider that computers are about 10 million times faster than human brains. Or that if the human brain can store a petabyte of data, S3 can do so for about $20,000 (2022). Even FLOP for FLOP, supercomputers already underprice humans.⁴ There's less and less for us to brag about.
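
Footnote 4's dollar comparison, spelled out with the same numbers as the appendix snippet (a ~$7.5 million statistical life, a ~$1 billion Fugaku, and the FLOP/s figures from above):

$$\frac{\$7.5 \times 10^{6}}{10^{15}\ \text{FLOP/s}} \approx \$7.5 \times 10^{-9} \text{ per FLOP/s} \qquad \text{vs.} \qquad \frac{\$10^{9}}{4.5 \times 10^{17}\ \text{FLOP/s}} \approx \$2.2 \times 10^{-9} \text{ per FLOP/s}.$$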

$$\frac{\$/(\text{Human FLOP/s})}{\$/(\text{Supercomputer FLOP/s})}$$

![[Pasted image 20220906143916.png]]

Brains are not magic. They're messy wetware, and hardware ~~will catch up~~ has caught up.

Postscript: brains actually might be magic. Carlsmith assigns less than 10% (but non-zero) probability that the brain computes more than $10^{21}$ FLOP/s. In this case, brains would currently still be vastly more efficient, and we'd have to update in favor of additional theoretical breakthroughs before AGI.
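
To see why: at $10^{21}$ FLOP/s, the same ~100 W power budget works out to

$$\frac{\sim 100\ \text{W}}{10^{21}\ \text{FLOP/s}} = 10^{-19}\ \text{J/FLOP},$$

eight to nine orders of magnitude below Fugaku's $\sim 5 \times 10^{-11}$ J/FLOP, even before any food-chain corrections.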

If we include the uncertainty in brain FLOP/s, the graph looks more like this:

$$\frac{\text{Supercomputer J/FLOP}}{\text{Human J/FLOP}}$$

![[Pasted image 20220906150914.png]]

(With a mean of ~$10^{19}$ and a median of 830.)

Appendix

Squiggle snippets used to generate the above graphs (used in conjunction with obsidian-squiggle).

brainEnergyPerFlop = {
    humanBrainFlops = 15; // log10(FLOP/s); alternative: 10 to 23 (median 15; P(>21) < 10%)
    humanBrainFracEnergy = 0.2; // brain's share of body energy (not used below)
    humanEnergyPerDay = 8000 to 10000; // daily kJ consumption (whole body)
    humanBrainPower = humanEnergyPerDay / (60 * 60 * 24); // whole-body power, kW
    humanBrainPower * 1000 / (10 ^ humanBrainFlops) // J / FLOP
}

supercomputerEnergyPerFlop = {
    // https://www.top500.org/system/179807/
    power = 25e6 to 30e6; // W
    flops = 450e15 to 550e15; // FLOP/s
    power / flops // J / FLOP
}

supercomputerEnergyPerFlop / brainEnergyPerFlop

humanFoodEfficiency = {
    photosynthesisEfficiency = 0.001 to 0.03
    trophicEfficiency = 0.1 to 0.15
    photosynthesisEfficiency * trophicEfficiency
}

computerEfficiency = {
    solarEfficiency = 0.15 to 0.20
    transmissionEfficiency = 1 - (0.08 to 0.15)
    solarEfficiency * transmissionEfficiency
}

computerEfficiency / humanFoodEfficiency

evolution = {
    // Based on Ajeya Cotra's "Forecasting TAI with biological anchors".
    // All calculations are in log10 space.

    secInYear = log10(365 * 24 * 60 * 60);

    // We assume that the average ancestor-population FLOP per year is ~constant.
    // cf. humans: log10(FLOP/s) ~ 10 to 20, log10(population) ~ 7 to 10
    ancestorsAveragePop = uniform(19, 23); // log10 population; Tomasik estimates ~1e21 nematodes
    ancestorsAverageBrainFlops = 2 to 6; // log10(FLOP/s), ~ C. elegans
    ancestorsFlopPerYear = ancestorsAveragePop + ancestorsAverageBrainFlops + secInYear;

    years = log10(850e6); // ~1 billion years ago to 150 million years ago
    ancestorsFlopPerYear + years // log10(total FLOPs)
}

humanLife$ = 1e6 to 10e6 // value of a statistical life, $
humanBrainFlops = 1e15 // FLOP/s
humanBrain$PerFlops = humanLife$ / humanBrainFlops // $ per (FLOP/s)

supercomputer$ = 1e9 // Fugaku price tag, $
supercomputerFlops = 450e15 // FLOP/s
supercomputer$PerFlops = supercomputer$ / supercomputerFlops // $ per (FLOP/s)

supercomputer$PerFlops / humanBrain$PerFlops

Footnotes

  1. Watch out for FLOP/s (floating point operations per second) vs. FLOPs (floating point operations). I'm sorry for the extra source of confusion, but FLOPs usually reads better than FLOP.

  2. Photosynthesis has an efficiency around 1%, and jumping up a trophic level means another order of magnitude drop. The most efficient solar panels have above 20% efficiency, and electricity transmission loss is around 10%.

  3. Technically, it's FLOP per "subjective second", i.e., a second of equivalent natural thought. This can be faster or slower than thought in real time.

  4. Compare FEMA's value of a statistical life at $7.5 million to the $1 billion price tag of the Fugaku supercomputer, and the supercomputer comes out to about a fourth of the cost per FLOP/s.

Rationalia starter pack

LessWrong has gotten big over the years: 31,260 posts, 299 sequences, and more than 120,000 users.¹ It has budded offshoots like the alignment and EA forums and earned itself recognition as a "cult". Wonderful!

There is a dark side to this success: as the canon grows, it becomes harder to absorb newcomers (like myself).² I imagine this was the motivation for the recently launched "highlights from the sequences".

To make it easier on newcomers (veterans, you're also welcome to join in), I've created an Obsidian starter-kit for taking notes on the LessWrong core curriculum (the Sequences, the Codex, HPMOR, best-of posts, concepts, various jargon, and other odds and ends).

There's built-in support to export notes & definitions to Anki, goodies for tracking your progress through the notes, useful metadata/linking, and pretty visualizations of rationality space…

![[vault-graph.png]]

It's not perfect — I'll be doing a lot of fine-tuning as I work my way through all the content — but there should be enough in place that you can find some value. I'd love to hear your feedback, and if you're interested in contributing, please reach out! I'll also soon be adding support for the AF and the EAF.

More generally, I'd love to hear your suggestions for new aspiring rationalists. For example, there was a round of users proposing alternative reading orders about a decade ago (by Academian, jimrandomh, and XiXiDu), which may be worth revisiting in 2022.

Footnotes

  1. From what I can tell using the GraphQL endpoint.

  2. Already a decade ago, jimrandomh was worrying about LW's intimidation factor — we're now about an order of magnitude ahead.

We need a taxonomy for principles

When you start collecting principles, a natural question arises: how to organize these principles? Clear organization is not just useful for quicker access but — when the collecting is crowd-sourced — critical to ensuring that the database of principles grows healthily and sustainably. We need a balance between the extremes of hairballs and orphan principles.

Now, there are books written on this subject, knowledge management (I promise, it's not nearly as dull (or settled) a subject as you might think). That said, one thing at a time. In this post, all I want to do is propose a few dimensions I think might be useful for classifying principles in the future.

Here they are:

  • Normative vs. Descriptive
  • Universal vs. Situational (or "First" and "Derived")
  • Deterministic vs. Stochastic

Normative and Descriptive

There's a big difference between principles that tell you how the world *is* and how it (or you) *should be*. The former are the domain of the traditional sciences. It's what we mean when we talk about principles and postulates in physics. The latter are the domain of decision theory/philosophy/etc.

There's a bridging principle between the two in that accomplishing any normative goals requires you to have an accurate descriptive view of how the world is. Still, in general, we can make a pretty clean break between these categories.

Universal and Situational ("First" and "Derived")

The universe looks different at different length scales: the discrete, quantum atoms at ångström scales give rise to continuous, classical fluids at meter scales and might yet be built from continuous strings at Planck lengths.

Physics gives us a formal way of linking the descriptive principles of one length scale to those of another: the renormalization group. This is a (meta-)principled approach to constructing "coarse-grained", higher-order principles out of base principles. In this way, the postulates of quantum gravity would give rise to those of classical mechanics, but also to those of chemistry, and in turn biology, psychology, etc.

The same is true on the normative end. "Do no harm" can look very different in different situations, and the Golden Rule has more subtleties and gradations than I can count.

In general, the "first principles" in these chains of deduction tend to be more universal (and apply across a wider range of phenomena). Evolution doesn't just apply to biological systems but to any replicators, be they cultures, cancers, or memes.¹

![[Final Project — Anthropology of Science and Tech through …|700]]

Deterministic and Stochastic

One of the main failure modes of a "principles-driven approach" is becoming overly rigid—seeing principles as ironclad laws that never change or break.

I believe one of the main reasons for this error is that we tend to think of principles as deterministic "rules". We tend to omit qualifiers like "usually", "sometimes", and "occasionally" from our principles because they sound weaker. But randomness has a perfectly legitimate role both in description (the quantum randomness of measurement, the effective randomness of chaotic systems) and in prescription (e.g., divination rituals may have evolved as randomizing devices to improve decision-making).

So we shouldn't shy away from statements like "play tit-for-tat with 5% leakiness", or from less precise ones like "avoid refined sugars, but it's okay to have a cheat day every once in a while because, hey, you also deserve to take it easy on yourself."

A Few Examples

Using these classifications, we can make more thorough sense of the initial set of Open Principles divisions:

"Generic"/"situational" principles and "mental models" are descriptive principles that differ in how universal they are. "Values" and "virtues" are universal normative principles with "habits" as their derived counterparts. "Biases" are a specific type of derived descriptive principle reserved to the domain of agents.

A few more examples:


Call to Action

A few things that might help us keep the Open Principles healthy:

Cheers, Jesse

Footnotes

  1. This isn't always true: the real world is not very quantum mechanical. But it's probably a good enough starting point for now.