On the training of shallow neural networks with the gradient method

(2024)

Files

Brisbois_52641800_2024.pdf
  • Open access
  • Adobe PDF
  • 6.34 MB

Details

Abstract
The rise of wide and complex neural networks creates a growing need for fast and efficient learning algorithms. Currently, neural networks are trained with first-order optimization algorithms such as the (Stochastic) Gradient Method, made possible by the discovery of backpropagation. This thesis aims to better understand why and how first-order optimization methods, such as the Gradient Method, perform well when minimizing nonconvex loss functions. The Gradient Method is known to work effectively for convex optimization problems, but the loss functions used in deep learning are typically highly nonconvex, even locally. To address this, we first review the literature on this question and summarize the findings of Boursier et al. [5], which detail the convergence of Gradient Flow with small initialization and the training dynamics of a one-hidden-layer ReLU network on orthogonal input data. After reproducing their numerical experiments, we conduct our own experiments to explore how neurons behave in groups. Notably, we find experimentally that, for small initializations, the training dynamics of one-hidden-layer Sigmoid and Tanh networks exhibit properties similar to those described in Boursier et al. [5] for ReLU networks. Finally, we examine the Polyak-Łojasiewicz inequality for a one-hidden-layer network and prove an important lemma from Boursier et al. [5] concerning the balancedness of the iterates generated by the Gradient Method, a proof the authors do not provide.
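
For reference, the Polyak-Łojasiewicz inequality lower-bounds the squared gradient norm by the optimality gap, ‖∇f(x)‖² ≥ 2μ (f(x) − f*), and is a standard route to linear convergence rates for the Gradient Method without convexity. The sketch below is a minimal, self-contained NumPy illustration of the setting described in the abstract: full-batch gradient descent on a one-hidden-layer ReLU network with orthogonal inputs and a small initialization scale. The dimensions, step size, and initialization scale are arbitrary choices made for this sketch rather than values taken from the thesis, and the balancedness quantity computed at the end, |a_j|² − ‖w_j‖², is one common definition rather than necessarily the exact statement of the lemma.

    import numpy as np

    # Illustrative setup: sample count, width, step size and init scale are
    # arbitrary choices for this sketch, not values taken from the thesis.
    rng = np.random.default_rng(0)
    n, d, m = 20, 20, 50                          # samples, input dimension, hidden width
    X = np.linalg.qr(rng.standard_normal((d, n)))[0].T   # orthonormal input rows (requires n <= d)
    y = rng.standard_normal(n)

    scale = 1e-4                                  # small-initialization regime
    W = scale * rng.standard_normal((m, d))       # hidden-layer weights
    a = scale * rng.standard_normal(m)            # output-layer weights

    lr = 0.05
    for step in range(5000):
        H = np.maximum(X @ W.T, 0.0)              # ReLU activations, shape (n, m)
        r = H @ a - y                             # residuals of the network output
        loss = 0.5 * np.mean(r ** 2)
        # Gradients of the mean squared loss (subgradient 0 taken at the ReLU kink)
        grad_a = H.T @ r / n
        grad_W = ((H > 0) * np.outer(r, a)).T @ X / n
        a -= lr * grad_a
        W -= lr * grad_W

    # One common notion of balancedness between the two layers:
    # |a_j|^2 - ||w_j||^2 for each hidden neuron j.
    balance = a ** 2 - np.sum(W ** 2, axis=1)

Under continuous-time Gradient Flow, this per-neuron quantity is conserved for a two-layer ReLU network; the lemma proved in the thesis concerns an analogous balancedness property for the discrete iterates generated by the Gradient Method.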