A common sentiment I have been hearing from several prominent researchers in machine learning is that the right way to understand the workings of large language models and the like is to emulate physicists and seek to discover the fundamental laws that govern these systems. In part, this reflects the growing recognition that the field of machine learning has regressed to the preparadigmatic stage, where we cannot even agree on what’s going on, why we do certain things and not others, and why certain things work well while others don’t. For instance, what does batch normalization do? Does it reduce “internal covariate shift” (whatever that is), or does it make the optimization landscape more benign? What about transformers? Are they universal approximators of sequence-to-sequence maps? Or are they support vector machines? One can go on and on with examples like these, so it’s small wonder that many of us would seek a more fundamental approach to things. The way of the physicist seems righteous! As Richard Feynman put it very eloquently in his Lectures on Physics,
[w]hat do we mean by “understanding” something? We can imagine that this complicated array of moving things which constitutes “the world” is something like a great chess game being played by the gods, and we are observers of the game. We do not know what the rules of the game are; all we are allowed to do is to watch the playing. Of course, if we watch long enough, we may eventually catch on to a few of the rules. The rules of the game are what we mean by fundamental physics.
Now, given that I constantly emphasize the distinction between physics as the realm of discovery and engineering as the realm of invention, I’d like to argue that thinking of machine learning models in the same way as natural phenomena is a category error. To start with, we need to question the concept of laws in the context of engineered systems and to contrast it with the concept of rules or constraints. The distinction between laws and rules was articulated nicely by Howard Pattee, who wrote the following in his 1978 paper on “The Complementarity Principle in Biological and Social Structures:”
The basic distinction between laws and rules can be made by these criteria: laws are (a) inexorable, (b) incorporeal and (c) universal; rules are (a) arbitrary, (b) structure-dependent and (c) local. In other words, we can never alter or evade laws of nature; we can always evade or change rules. Laws of nature do not need embodiments or structures to execute them; rules must have a real physical structure or constraint if they are to be executed. Finally, laws hold at all times and all places; rules only exist when and where there are physical structures to execute them.
Pattee, who was trained as a physicist but then turned his attention to biology, was mostly speaking about the living world. In that context, laws govern processes like protein folding, which depends on an intricate interplay between covalent and hydrogen bonds, which in turn supervene on atomic and subatomic phenomena, etc.; by contrast, rules and constraints have to do with things like functional organization—e.g., of organelles into cells, cells into tissues, tissues into organs. Why does a certain protein have a certain function in the organism? In fact, why is it that nucleic acids serve as digital controllers in living cells, while proteins act as sensors and as actuators? Pattee correctly recognizes these as frozen historical accidents: “accidents because their antecedent events are unobservable, historical because the crucial events occurred only once, and frozen because the result persists as a coherent, stable, hereditary constraint.” Max Delbrück, another physicist who made an illustrious career in biology, voiced similar ideas in his 1949 paper “A physicist looks at biology:”
On the whole, the successful theories of biology always have been and are still today simple and concrete. Presumably this is not accidental, but is bound up with the fact that every biological phenomenon is essentially an historical one, one unique situation in the infinite total complex of life.
Such a situation from the outset diminishes the hope of understanding any one living thing by itself and the hope of discovering universal laws, the pride and ambition of physicists. The curiosity remains, though, to grasp more clearly how the same matter, which in physics and in chemistry displays orderly and reproducible and relatively simple properties, arranges itself in the most astounding fashions as soon as it is drawn into the orbit of the living organism.
Just like living systems, engineered artifacts are also endowed with histories: histories of research, development, deployment, maintenance, interaction with their environment, and eventual replacement by newer, better artifacts. Herbert Simon recognized this when he wrote the following in The Sciences of the Artificial about viewing computing devices as empirical objects:
Since there are now many such devices in the world, and since the properties that describe them also appear to be shared by the human central nervous system, nothing prevents us from developing a natural history of them. We can study them as we would rabbits or chipmunks and discover how they behave under different patterns of environmental stimulation. Insofar as their behavior reflects largely the broad functional characteristics we have described, and is independent of details of their hardware, we can build a general—but empirical—theory of them.
Some questionable analogies to the human central nervous system notwithstanding, this gives us a more appropriate framing: studying machine learning systems empirically, as engineered artifacts with an evolutionary history. Unlike the observers of the divine game in Feynman’s quote, we know the rules of the game because we designed the game. What we lack is the ability to fully describe or comprehend all the behaviors that emerge as a result of these rules once the game starts unfolding. One could argue that this is a fundamental limitation akin to Gödel’s incompleteness, which ultimately comes down to our inability to describe the behavior of certain machines even with full knowledge of their rules of operation. But Gödel’s incompleteness and its equivalents, such as Turing’s result on the undecidability of the halting problem, are not laws of nature; they are true propositions in certain formal systems.
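Turing’s argument can be sketched in a few lines of code: given any candidate halting-decider, one can construct a program that does the opposite of whatever the decider predicts about it. (This is an illustrative sketch; the decider functions below are hypothetical stand-ins, since no actual decider can exist.)

```python
def make_paradox(halts):
    """Given any candidate halting-decider `halts`, construct a program
    that behaves contrary to the decider's own prediction about it."""
    def paradox():
        if halts(paradox):
            while True:   # decider said "halts", so loop forever
                pass
        # decider said "loops forever", so halt immediately
    return paradox

# A (hypothetical) candidate decider that claims nothing ever halts:
def pessimist(program):
    return False

p = make_paradox(pessimist)
p()  # halts immediately, refuting pessimist's verdict about p

# An optimist claiming everything halts would be refuted the other way:
# make_paradox(optimist) loops forever, so we do not run it here.
# The same construction defeats *any* candidate decider.
```

The construction works with full knowledge of the rules by which every program operates, which is exactly the point: knowing the rules does not grant the ability to predict all behavior.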
The rules of the game, the building blocks, and the design principles of modern machine learning systems are, to a great extent, frozen historical accidents. They are the combined results of technological path dependence, economies and diseconomies of scale, and the emergence of common practices like data sharing and competitive testing (as David Donoho and Ben Recht like to remind us). The widespread use of gradient descent and its variants is not any kind of law of nature; it is a consequence of a whole chain of engineering and business decisions, from the intricacies of disk and memory access and the design of data centers all the way back to the historical contingencies that led to the invention of the transistor and the computing revolution. It’s more or less the same with massive, complex, highly interconnected systems like the Internet, where we can trace some of the key ideas to the early work by people like Robert Kahn and Vinton Cerf on the Interface Message Processor, to Wesley A. Clark’s idea to separate computation and communication (a perfect example of the engineering principles of modularization and abstraction), and to early independent proposals of packet switching by Paul Baran, Leonard Kleinrock, and Donald Davies. The rules underlying TCP/IP are relatively simple to describe, yet the Internet as a system is as complex and inscrutable as ChatGPT, and it makes no sense to speak of any universal laws governing its behavior. Instead, we have to resort to ideas of emergent properties, system effectiveness and coherence, and the like.
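The claim that TCP’s rules are simple to state can be made concrete. Here is a toy sketch of connection establishment (the three-way handshake) as a transition table; it is a drastic simplification of the actual state machine specified in RFC 793, for illustration only.

```python
# Toy sketch of TCP connection establishment (the three-way handshake).
# The real state machine in RFC 793 has many more states and events;
# this only shows that the local rules are simple, explicit, and known.
TRANSITIONS = {
    # (current state, event)   ->  (next state, action to take)
    ("CLOSED",   "active_open"):  ("SYN_SENT",    "send SYN"),
    ("LISTEN",   "recv SYN"):     ("SYN_RCVD",    "send SYN+ACK"),
    ("SYN_SENT", "recv SYN+ACK"): ("ESTABLISHED", "send ACK"),
    ("SYN_RCVD", "recv ACK"):     ("ESTABLISHED", None),
}

def step(state, event):
    """Apply one local rule; raises KeyError on an invalid transition."""
    return TRANSITIONS[(state, event)]

# The client side of a handshake, rule by rule:
state, action = step("CLOSED", "active_open")   # -> SYN_SENT, send SYN
state, action = step(state, "recv SYN+ACK")     # -> ESTABLISHED, send ACK
```

Every endpoint follows a table like this, yet congestion dynamics, routing behavior, and outages at the scale of the whole Internet are emergent properties that no such table describes, let alone any universal law.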
Looking for universal laws underlying LLMs is just as misguided as looking for universal laws that govern the Internet, the stock market, or the brain. We don’t need to give in to physics envy in order to conceive of and develop a sound empirical approach to large-scale machine learning systems: engineering is already the right framework for this.
I like this a lot, but I want you to write a follow-up on how Internet networks and neural networks are different. I agree that the Internet as a whole is complex, but it is designed so that identification of and recovery from failure is paramount at every layer. I don’t think we can even define what a failure means in machine learning.
I need to think about this more, myself...
Questions
+ Are mechanisms as conceived and discussed by _The New Mechanists_ a better foil for rules than laws?
+ Do you have a specific citation in mind for the precise sense in which you use "Verum et factum convertuntur"? (Minimal google search pointed me to Vico as the source, but am curious to know your specific source)
+ Since social systems are built systems, do you think sociology _ought_ to be studied using engineering principles instead of as a (social) science?