Study notes: Why does deep and cheap learning work so well?

I read this paper a few days ago. I could not follow most of it because of my limited maths and physics background, but what I gathered was that

  1. The paper contains some interesting ideas.
  2. Beyond the interesting ideas, the paper is rather hand-wavy.

In the rest of this article, I describe the ideas which I found interesting and could understand. In places I add my own interpretation, which may inadvertently differ from what the authors intended.

The Question

The Universal Approximation Theorem says that almost any function can be approximated by a neural network. However, it says nothing about the complexity or learnability of such networks. So why is it that layered neural networks can compute the functions that matter to us (for computer vision, speech recognition and similar use cases), and why is it that the parameters of these networks can also be learned rather cheaply? This explains the title of the paper: "Why does deep and cheap learning work so well?"
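
A one-hidden-layer network that approximates a simple function can be built in a few lines. This is my own minimal numpy sketch, not something from the paper: the hidden weights are random and only the output weights are fitted, which is enough to illustrate the "an approximating network exists" part while saying nothing about how cheap learning is in general.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)[:, None]       # inputs
y = np.sin(2 * x).ravel()                  # target function to approximate

# One hidden layer of 100 tanh units with random weights and biases.
W = rng.normal(size=(1, 100))
b = rng.normal(size=100)
H = np.tanh(x @ W + b)

# Fit only the output weights by least squares and report the fit error.
a, *_ = np.linalg.lstsq(H, y, rcond=None)
print("max approximation error:", np.max(np.abs(H @ a - y)))
```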

Appreciating the Question

Consider the task of classifying whether an image is of a cat or not. Say the image is 1000 x 1000 pixels in size and grayscale, so each pixel can take 256 values. A classifier is then one of $2^{256^{1{,}000{,}000}}$ possible functions from images to {cat, not cat}, and the neural network you train ends up corresponding to one of them. Of these functions, only a vanishingly small subset are good approximations of a cat vs non-cat classifier. How, then, is a neural net able to pick one of the good approximating functions?
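
To get a feel for the scale, here is a quick back-of-the-envelope computation (my own, not from the paper). The counts are too large to write down directly, so the sketch only computes how many digits they have.

```python
from math import log10

pixels = 1000 * 1000

# Number of distinct grayscale images: 256 ** pixels.
image_count_digits = pixels * log10(256)
print(f"256^{pixels} has about {image_count_digits:,.0f} decimal digits")

# Number of binary classifiers: 2 ** (256 ** pixels). Even the digit count
# of that number, 256**pixels * log10(2), is itself a number with roughly
# 2.4 million digits, so it cannot be printed here.
```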

The Answer

There are two broad reasons why this is possible:

1(a). Laws of nature are simple

The laws of nature are simple: they do not involve large exponents. Consider, for example, some popular laws: Newton's law of gravitation, $F = \frac{G m_1 m_2}{r^2}$; Hooke's law, $F = -kx$; and the kinetic energy formula, $E = \frac{1}{2}mv^2$. All of them are low-degree expressions in their variables.

Simple laws of nature imply that the functions which map images to classifications are also simple and can be computed cheaply by a neural network. The paper says something about low-degree Hamiltonians here, but I do not know what Hamiltonians are.

1(b). Nature exhibits locality

Things only affect what is in their immediate vicinity. So, when you build a physical model, most of the possible interaction terms turn out to be zero.
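
As a rough illustration of what locality buys (my own sketch, not the paper's numbers): compare how many weights a layer needs when each output depends on every input versus only on a small local patch.

```python
# Hypothetical layer over a 1000x1000-pixel input, one output per pixel.
n_in = 1000 * 1000
n_out = 1000 * 1000

dense_params = n_in * n_out        # every output interacts with every input
local_params = n_out * (3 * 3)     # each output sees only its 3x3 neighbourhood

print(f"fully connected: {dense_params:.1e} weights")
print(f"local only:      {local_params:.1e} weights")
```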

1(c). Symmetry

A rotated cat is still a cat, a translated cat is still a cat; a rotated non-cat is still a non-cat, a translated non-cat is still a non-cat. Exploiting such symmetries reduces the parameter count and hence the computation cost.
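
Translation symmetry is exactly what a convolution exploits: the same small kernel is reused at every position, so shifting the input simply shifts the output. A tiny numpy check of this (my own example, not from the paper):

```python
import numpy as np

x = np.zeros(20)
x[5] = 1.0                                       # an impulse at position 5
k = np.array([1.0, 2.0, 1.0])                    # one small shared kernel

y1 = np.convolve(x, k, mode="same")              # response to the original input
y2 = np.convolve(np.roll(x, 3), k, mode="same")  # response to a shifted input

# The output shifts by the same amount as the input (away from the edges).
print(np.allclose(np.roll(y1, 3), y2))           # True
```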

2. Processes of nature are hierarchical

The physical world is organized hierarchically: atoms, molecules, cells, organisms, planets, solar systems, galaxies. The layers of an ANN are able to reverse this hierarchical generative process. Getting an ANN to approximate an arbitrary function would be hard, but approximating a function that undoes a generative process one simple step at a time is feasible.
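
Here is a toy version of that idea (my own illustration, not the paper's model): data is generated label -> coarse features -> pixels, and each step can be undone by a simple operation, much like one layer per step.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(label):
    """Toy hierarchical generative process: label -> coarse features -> pixels."""
    coarse = label * np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=3)
    pixels = np.repeat(coarse, 4) + 0.1 * rng.normal(size=12)
    return pixels

sample = generate(label=1)

# Undo the process one step at a time, as a deep net could do layer by layer:
coarse_estimate = sample.reshape(3, 4).mean(axis=1)   # undo the pixel step
label_estimate = int(np.sign(coarse_estimate[0]))     # undo the feature step
print(label_estimate)
```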

Conclusion

It is the physics of the world that makes ANNs successful at classification tasks.


Thanks to Soumyadeb Mitra for finding an error in the previous version of this post.