My current research focuses on understanding the training dynamics of neural network models and their connection with generalization performance. I explore the topic from two perspectives: (1) the implicit bias of optimization algorithms; (2) the continuous viewpoint of neural network models and their training dynamics.

The implicit bias of optimization algorithms

Modern deep neural networks are usually over-parameterized, meaning they have more parameters than training data points. For over-parameterized models, the empirical loss function has many global minima. These global minima all have zero training loss, but their generalization performance can vary drastically. In this case, the generalization of the model depends heavily on the implicit bias of the training dynamics, the phenomenon whereby the optimization algorithm favors some global minima over others. My research in this direction employs linear stability theory to explain why SGD selects flat minima[1]. Moreover, the good generalization performance of flat minima is a result of the special multiplicative structure of most neural network models[2]. The implicit bias therefore results from the interaction between the optimization algorithm and the neural network architecture. Similar flat minima selection results are also obtained for the Adam optimizer[3].
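The stability argument can be illustrated in its simplest deterministic form. The sketch below is not the SGD analysis of [1]; it assumes plain gradient descent on a one-dimensional quadratic loss f(x) = 0.5 * a * x^2, where a plays the role of the sharpness of a minimum. The function names and the numerical values are chosen here for illustration only. In this setting each step multiplies x by (1 - lr * a), so the minimum is stable only if lr * a <= 2, and a fixed learning rate can only settle in sufficiently flat minima.

```python
def gd_on_quadratic(sharpness, lr, x0=1e-3, steps=200):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        x = x - lr * sharpness * x  # the gradient of f at x is sharpness * x
    return abs(x)


def is_linearly_stable(sharpness, lr):
    """GD multiplies x by (1 - lr * sharpness); stability needs |factor| <= 1."""
    return abs(1.0 - lr * sharpness) <= 1.0


lr = 0.1  # illustrative value; the stability threshold is 2 / lr = 20

# A flat minimum (sharpness below the threshold): a small perturbation decays.
flat_distance = gd_on_quadratic(sharpness=5.0, lr=lr)

# A sharp minimum (sharpness above the threshold): the same perturbation grows.
sharp_distance = gd_on_quadratic(sharpness=25.0, lr=lr)

print("flat minimum, final distance: ", flat_distance)
print("sharp minimum, final distance:", sharp_distance)
```

With stochastic gradients the stability condition involves more than the sharpness alone, so this sketch should be read only as the deterministic starting point of the argument.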

In the over-parameterized case, the minima of the empirical loss function are not isolated; instead, they form manifolds. In a recent work, my collaborator and I studied the behavior of GD around the minima manifold[4]. We studied how GD bounces off the walls of the valley while moving slowly along the minima manifold at its bottom. The resulting theory is used to explain the edge of stability phenomenon.

[1] (with Lei Wu and Weinan E) How SGD Selects the Global Minima in Over-parameterized Learning

[2] (with Lexing Ying) On Linear Stability of SGD and Input-Smoothness of Neural Networks

[3] (with Pan Zhou et al.) Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

[4] (with Daniel Kunin et al.) Beyond the Quadratic Approximation: The Multiscale Structure of Neural Network Loss Landscape

The continuous viewpoint for neural networks

Viewing machine learning models as discretizations of continuous models can bring new mathematical tools to the understanding and analysis of the original models. I have explored the continuous formulation for a series of models. In [1], a quasistatic algorithm was proposed for mean-field two-player games. The algorithm can be applied to train mixed GANs, and its convergence is proven. In [2], we conducted a mean-field analysis for two-layer neural networks with batch normalization (BN). The GD dynamics for the mean-field formulation is a Wasserstein gradient flow on a Riemannian manifold. From our perspective, what BN does is change the metric. In [3], we derived and analyzed a continuous formulation for ResNets. We also have a paper on general methodology [4].
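As a minimal illustration of the discretization idea, the sketch below treats a stack of residual layers x <- x + (1/L) * f(x) as forward Euler steps for the ODE dx/dt = f(x) on the time interval [0, 1]. This is only the classical layers-as-time-steps picture, not the mean-field formulation of [3]. The scalar layer map f(x) = -x and all names are assumptions made for the example; this choice has the exact solution x(1) = exp(-1) for x(0) = 1, so the gap to the continuous model can be measured directly.

```python
import math


def residual_forward(x0, depth, f):
    """Apply `depth` residual layers x <- x + (1/depth) * f(x)."""
    x = x0
    step = 1.0 / depth  # each layer advances "time" by 1/depth
    for _ in range(depth):
        x = x + step * f(x)
    return x


def layer_map(x):
    """Hypothetical layer map; the continuous model is dx/dt = -x."""
    return -x


exact = math.exp(-1.0)  # solution of dx/dt = -x at t = 1 with x(0) = 1

# Gap between the depth-L residual network and the continuous model.
errors = {
    depth: abs(residual_forward(1.0, depth, layer_map) - exact)
    for depth in (4, 16, 64, 256)
}

for depth, err in errors.items():
    print(f"depth={depth:4d}  |network output - ODE solution| = {err:.6f}")
```

In this toy setting the gap shrinks as the depth grows, which is the sense in which the deep network is a discretization of the continuous model.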

[1] (with Lexing Ying) Provably Convergent Quasistatic Dynamics for Mean-Field Two-Player Zero-Sum Games

[2] (with Lexing Ying) A Riemannian Mean Field Formulation for Two-layer Neural Networks with Batch Normalization

[3] (with Yiping Lu et al.) A Mean-Field Analysis of Deep ResNet and Beyond: Towards Provable Optimization via Overparameterization from Depth

[4] (with Weinan E et al.) Machine Learning from a Continuous Viewpoint, I

Previously, I have done research on the following topics:

  • Mathematical theory for neural network models regarding their approximation and generalization capacity.

  • Machine learning based methods for scientific computing applications, such as fluid dynamics.

  • Nonconvex optimization algorithms for signal and image processing problems.