Published June 1st, 2021

*Suchismita Padhy is a key voice at Intel Labs. Here, she explores the underlying mechanisms that govern deep neural networks and applies those insights to develop machine learning systems at scale.*

**Highlights:**

- Even though deep neural networks (DNNs) are capable of memorizing large quantities of randomly labeled data, they will first learn the most general features.
- After learning general features, DNNs memorize data by changing parameters primarily in deeper layers.
- Understanding this behavior better can help us know when and where networks will fail when they encounter noisy data.

Deep neural networks typically have many more learnable parameters than training examples in common datasets. As a result, DNNs can simply memorize the training data instead of converging to a better, more general solution (Novak et al., 2018). In many cases, regularization can prevent memorization in common datasets; however, standard methods are insufficient to eliminate memorization in deep networks (Zhang et al., 2016, Neyshabur et al., 2014). Yet even though memorizing solutions exist, they are rarely learned by DNNs in practice (Rolnick et al., 2017). Why might this be?

Recent work has shown that the predominant learning algorithm, stochastic gradient descent, as well as specific layer types, bias the training dynamics towards generalizable solutions (Hardt et al., 2016, Soudry et al., 2018, Brutzkus et al., 2017, Li & Liang, 2018, Saxe et al., 2013, Lampinen & Ganguli, 2018). However, these claims were studied only in simple, shallow networks. Where and when memorization is favored within DNNs remains an open question. Are all layers equally susceptible to memorization or does it concentrate in the initial or deeper layers?

Our latest work, presented recently at the 2021 International Conference on Learning Representations (ICLR), forces a deep network to memorize some of the training examples by randomly changing their labels. Then, we employ a newly developed geometric probe (Chung et al., 2018, Stephenson et al., 2019), based on replica mean-field theory from statistical physics, to analyze the training dynamics and the resulting structure of memorization. The probe measures the layer capacity and the geometric properties of the object manifolds, explicitly linked by the theory.
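The label-corruption step can be illustrated with a minimal sketch. It is not the code used in the paper: the function name `permute_labels`, the 20% fraction, and the choice to draw each new label uniformly from the *other* classes are assumptions made for illustration.

```python
import random


def permute_labels(labels, fraction, num_classes, seed=0):
    """Return (noisy_labels, permuted_indices).

    A `fraction` of the examples is chosen at random, and each chosen
    example receives a label drawn uniformly from the other classes
    (an assumption; the source only says labels are randomly changed).
    """
    rng = random.Random(seed)
    n_permuted = int(round(fraction * len(labels)))
    permuted_indices = sorted(rng.sample(range(len(labels)), n_permuted))

    noisy_labels = list(labels)  # copy so the true labels stay intact
    for i in permuted_indices:
        other_classes = [c for c in range(num_classes) if c != labels[i]]
        noisy_labels[i] = rng.choice(other_classes)
    return noisy_labels, permuted_indices


# Hypothetical dataset: 1000 examples spread evenly over 10 classes.
true_labels = [i % 10 for i in range(1000)]
noisy_labels, permuted_indices = permute_labels(
    true_labels, fraction=0.2, num_classes=10
)
print(len(permuted_indices), "of", len(true_labels), "labels were changed")
```

Keeping `permuted_indices` around is what makes it possible to analyze the randomly labeled examples separately from the rest later on.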

**Figure 1:** 2D visualization of the model outputs for different images in the dataset

We find that DNNs ignore randomly labeled data in the early layers and epochs and learn generalizing features instead. Memorization occurs abruptly in the deeper layers, later in training. Notably, this phenomenon cannot be attributed to gradients vanishing with depth. Instead, early in training, the gradients from the noisy examples contribute minimally to the total gradient. Networks can ignore the noise and focus on the shared features of the correctly labeled examples.
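One way to make "contribute minimally to the total gradient" concrete is to project each group's summed gradient onto the total gradient. The sketch below is an illustrative measure, not necessarily the metric used in the paper; `gradient_shares` and the toy gradients are hypothetical, and the toy data assumes that gradients from randomly labeled examples point in inconsistent directions and largely cancel.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def gradient_shares(per_example_grads, is_noisy):
    """Return (clean_share, noisy_share) of the total gradient.

    Each share is the projection of that group's summed gradient onto
    the total gradient, divided by the squared norm of the total, so
    the two shares always add up to 1.
    """
    dim = len(per_example_grads[0])
    g_clean = [0.0] * dim
    g_noisy = [0.0] * dim
    for grad, noisy in zip(per_example_grads, is_noisy):
        target = g_noisy if noisy else g_clean
        for k, value in enumerate(grad):
            target[k] += value

    g_total = [a + b for a, b in zip(g_clean, g_noisy)]
    norm_sq = dot(g_total, g_total)
    if norm_sq == 0.0:
        raise ValueError("total gradient is zero; shares are undefined")
    return dot(g_clean, g_total) / norm_sq, dot(g_noisy, g_total) / norm_sq


# Toy per-example gradients (hypothetical): the three clean examples agree,
# while the two randomly labeled examples pull in opposite directions.
grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
noisy_mask = [False, False, False, True, True]

clean_share, noisy_share = gradient_shares(grads, noisy_mask)
print(clean_share, noisy_share)  # 1.0 0.0
```

In this toy case the noisy gradients are individually as large as the clean ones, yet they account for none of the update direction, which matches the intuition that the network can follow the shared signal and ignore the noise.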

Figure 1 shows a 2D visualization of the model outputs for different images in the dataset. The color of each dot shows which label is used for the image. Because randomly labeled images were used in the training data, this visualization can be viewed using three different types of labeling:

- images given to the network with *correct* labels (“unpermuted”)
- images with *incorrect* labels, but visualized with their color determined by their random label (“permuted”)
- images with *incorrect* labels, but visualized with the color determined by their *correct* label (“restored”).
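The three labelings can be derived from two label lists, as in the following sketch. The function name `label_views` and the toy labels are hypothetical, and the sketch assumes an example counts as randomly labeled only when its training label differs from its true label.

```python
def label_views(true_labels, train_labels):
    """Group examples into the three labelings used for coloring.

    Each view is a list of (example_index, color_label) pairs.
    """
    views = {"unpermuted": [], "permuted": [], "restored": []}
    for index, (true, train) in enumerate(zip(true_labels, train_labels)):
        if true == train:
            # Trained with the correct label; colored by that label.
            views["unpermuted"].append((index, true))
        else:
            # Trained with a wrong label: the same image appears in two
            # views, colored by the random label and by the true label.
            views["permuted"].append((index, train))
            views["restored"].append((index, true))
    return views


# Toy example (hypothetical): examples 1 and 3 were given random labels.
true_labels = [0, 1, 2, 3]
train_labels = [0, 2, 2, 1]
views = label_views(true_labels, train_labels)
print(views["permuted"])  # [(1, 2), (3, 1)]
print(views["restored"])  # [(1, 1), (3, 3)]
```

The "permuted" and "restored" views contain exactly the same images; only the label used for coloring differs.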

In this visualization, you can see that the network first learns the features shared by correctly labeled examples. As a result, images separate into distinct clusters by their correct class, regardless of whether the label used in training was correct. Only later in training does the network learn the random, “permuted” labels and begin to show clusters based on them. The visualization thus reveals a difference in when the model learns to classify the two kinds of examples.

The paper, “On the geometry of generalization and memorization in deep neural networks,” explains in more detail how geometric analysis accounts for these clusters. Interestingly, because memorization occurs mainly in deeper layers later in training, it can be undone, and generalization regained, by selectively rolling back the parameters of the final layers of the network to an earlier epoch.
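The rollback idea can be sketched with plain dictionaries that map layer names to parameter values. This is a simplified illustration, not the paper's procedure: the function `rewind_final_layers`, the layer names, and the checkpoint values are all hypothetical, and a real framework would operate on its own checkpoint format.

```python
import copy


def rewind_final_layers(final_params, early_params, layers_to_rewind):
    """Return a new parameter dict for a partially rewound network.

    Layers listed in `layers_to_rewind` take their values from the
    early-epoch checkpoint; all other layers keep their final values.
    """
    missing = [name for name in layers_to_rewind if name not in early_params]
    if missing:
        raise KeyError(f"layers not found in early checkpoint: {missing}")

    rewound = copy.deepcopy(final_params)
    for name in layers_to_rewind:
        rewound[name] = copy.deepcopy(early_params[name])
    return rewound


# Hypothetical checkpoints: layer name -> flat list of weights.
early_checkpoint = {"conv1": [0.1, 0.2], "conv2": [0.3, 0.4], "fc": [0.5, 0.6]}
final_checkpoint = {"conv1": [0.9, 0.8], "conv2": [0.7, 0.6], "fc": [5.0, 6.0]}

# Rewind only the last layer, where memorization is assumed to concentrate.
rewound = rewind_final_layers(final_checkpoint, early_checkpoint, ["fc"])
print(rewound)
```

Which layers to rewind, and to which epoch, would have to be chosen empirically, for example by checking accuracy against the correct labels after each candidate rollback.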

**Conclusion**

Our work helps to explain why DNNs are biased against solutions that simply memorize the training data. It also raises some interesting follow-up questions. For example, can the observed differences in gradients on real versus randomly labeled examples be used to detect mislabeled examples during training? Such a technique may aid in training models on noisy data or in the curation of large datasets. Moreover, the cause of the different behaviors observed in early compared to later layers remains to be understood.

We speculate that the features learned by the early layers must be simpler, i.e., close to linear, owing to the smaller number of transformations of the input at that stage of the network. Randomly labeled data may share fewer of these simple features, so memorization may not occur in these layers. Further experimental or theoretical work on this topic can improve our understanding of modern DNNs and hopefully lead to more efficient or more accurate models.

This work was recently presented at ICLR 2021. You can find the full manuscript here. In Intel’s AI Lab, we are extremely interested in how a model’s compute requirements shape its ability to generalize to new data. This work is just one thread within this broader research effort.

**Citations**

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Andrew K Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374, 2018.

SueYeon Chung, Daniel D Lee, and Haim Sompolinsky. Linear readout of object manifolds. Physical Review E, 93(6):060301, 2016.

Cory Stephenson, Jenelle Feather, Suchismita Padhy, Oguz Elibol, Hanlin Tang, Josh McDermott, and SueYeon Chung. Untangling in invariant speech recognition. In Advances in Neural Information Processing Systems, pages 14368–14378, 2019.
