Papers by Behrooz Ghorbani

Neural Information Processing Systems, 2019
We study the supervised learning problem under either of the following two models: (1) Feature vectors x_i are d-dimensional Gaussians and responses are y_i = f_*(x_i) for f_* an unknown quadratic function; (2) Feature vectors x_i are distributed as a mixture of two d-dimensional centered Gaussians, and the y_i's are the corresponding class labels. We use two-layer neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime, in which we only train the second-layer weights; the neural tangent (NT) regime, in which we train a linearization of the neural network around its initialization; and the fully trained neural network (NN) regime, in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.
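To make the three regimes concrete, here is a minimal sketch of the feature maps they imply for a two-layer network with quadratic activation, f(x) = sum_i a_i <w_i, x>^2. The sizes, the Gaussian initialization, and the uniform second-layer weights are assumptions chosen for illustration, not the paper's setup.

```python
import random


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def network(x, W, a):
    """Two-layer network with quadratic activation: f(x) = sum_i a_i * <w_i, x>^2."""
    return sum(a_i * dot(w_i, x) ** 2 for w_i, a_i in zip(W, a))


def rf_features(x, W):
    """Random-features map: the first layer W stays frozen and only the
    second-layer weights are fit, so there is one feature <w_i, x>^2 per neuron."""
    return [dot(w_i, x) ** 2 for w_i in W]


def nt_features(x, W, a):
    """Neural-tangent map: gradient of f with respect to the first-layer weights,
    df/dw_i = 2 * a_i * <w_i, x> * x, flattened into a vector of length N * d."""
    feats = []
    for w_i, a_i in zip(W, a):
        scale = 2.0 * a_i * dot(w_i, x)
        feats.extend(scale * x_j for x_j in x)
    return feats


rng = random.Random(0)
d, N = 5, 3  # hypothetical sizes, with fewer neurons than dimensions
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
a = [1.0 / N] * N
x = [rng.gauss(0, 1) for _ in range(d)]

phi_rf = rf_features(x, W)     # N features
phi_nt = nt_features(x, W, a)  # N * d features
```

The RF model is linear in `phi_rf` and the NT model is linear in `phi_nt`, while the fully trained network updates `W` and `a` themselves.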

Journal of Statistical Mechanics: Theory and Experiment, 2021
For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NNs) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layer NNs are known to encode richer smoothness classes than RKHS, and we know of special examples for which SGD-trained NNs provably outperform RKHS methods. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and w...

It is well known that deeper neural networks are harder to train than shallower ones. In this short paper, we use the (full) eigenvalue spectrum of the Hessian to explore how the loss landscape changes as the network gets deeper, and as residual connections are added to the architecture. Computing a series of quantitative measures on the Hessian spectrum, we show that the Hessian eigenvalue distribution in deeper networks has substantially heavier tails (equivalently, more outlier eigenvalues), which makes the network harder to optimize with first-order methods. We show that adding residual connections mitigates this effect substantially, suggesting a mechanism by which residual connections improve training.

Topic models are Bayesian models that are frequently used to capture the latent structure of certain corpora of documents or images. Each data element in such a corpus (for instance, each item in a collection of scientific articles) is regarded as a convex combination of a small number of vectors corresponding to 'topics' or 'components'. The weights are assumed to have a Dirichlet prior distribution. The standard approach to approximating the posterior is to use variational inference algorithms, and in particular a mean field approximation. We show that this approach suffers from an instability that can produce misleading conclusions. Namely, for certain regimes of the model parameters, variational inference outputs a non-trivial decomposition into topics. However, for the same parameter values, the data contain no actual information about the true decomposition, and hence the output of the algorithm is uncorrelated with the true topic decomposition. Among other conse...

We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and the $y_i$'s are the corresponding class labels. We use two-layer neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime, in which we only train the second-layer weights; the neural tangent (NT) regime, in which we train a linearization of the neural network around its initialization; and the fully trained neural network (NN) regime, in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of n...

To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a number of hypotheses concerning smoothness, curvature, and sharpness in the deep learning literature. We then thoroughly analyze a crucial structural feature of the spectra: in non-batch normalized networks, we observe the rapid appearance of large isolated eigenvalues in the spectrum, along with a surprising concentration of the gradient in the corresponding eigenspaces. In batch normalized networks, these two effects are almost absent. We characterize these effects, and explain how they affect optimization speed through both theory and experiments. As part of this work, we adapt advanced tools from numerical linear algebra that allow scalable and accurate estimation of the entire Hessian spectrum of ImageNet-scale neural networks; this technique may be of independent interest in other...
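The abstract does not say which numerical tools are used to estimate the full spectrum, so the sketch below is a much simpler stand-in: power iteration, which finds only the dominant eigenvalue but, like scalable spectrum estimators, touches the Hessian solely through Hessian-vector products. The 3x3 matrix `H` is a hypothetical toy Hessian with one large isolated eigenvalue.

```python
import math
import random


def top_eigenvalue(matvec, dim, iters=200, seed=0):
    """Estimate the eigenvalue of largest magnitude of a symmetric operator
    using only matrix-vector products (power iteration)."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    estimate = 0.0
    for _ in range(iters):
        hv = matvec(v)
        # Rayleigh quotient v^T H v for the current unit vector v.
        estimate = sum(p * q for p, q in zip(v, hv))
        norm = math.sqrt(sum(c * c for c in hv))
        if norm == 0.0:
            break
        v = [c / norm for c in hv]
    return estimate


# Hypothetical toy Hessian: one isolated eigenvalue (10) and a small bulk.
H = [
    [10.0, 0.0, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 0.2, 0.5],
]


def hessian_vector_product(v):
    return [sum(h * c for h, c in zip(row, v)) for row in H]


lam_max = top_eigenvalue(hessian_vector_product, dim=3)
```

For a real network, `hessian_vector_product` would be supplied by automatic differentiation instead of an explicit matrix, since the Hessian itself is too large to store.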

arXiv: Statistics Theory, 2018
We study estimation of the covariance matrix under relative condition number loss $\kappa(\Sigma^{-1/2} \hat{\Sigma} \Sigma^{-1/2})$, where $\kappa(\Delta)$ is the condition number of matrix $\Delta$, and $\hat{\Sigma}$ and $\Sigma$ are the estimated and theoretical covariance matrices. Optimality in $\kappa$-loss provides optimal guarantees in two stylized applications: Multi-User Covariance Estimation and Multi-Task Linear Discriminant Analysis. We assume the so-called spiked covariance model for $\Sigma$, and exploit recent advances in understanding that model, to derive a nonlinear shrinker which is asymptotically optimal among orthogonally-equivariant procedures. In our asymptotic study, the number of variables $p$ is comparable to the number of observations $n$. The form of the optimal nonlinearity depends on the aspect ratio $\gamma=p/n$ of the data matrix and on the top eigenvalue of $\Sigma$. For $\gamma > 0.618...$, even dependence on the top eigenvalue can be avoided. ...
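As a small illustration of the loss itself, the sketch below evaluates $\kappa(\Sigma^{-1/2} \hat{\Sigma} \Sigma^{-1/2})$ in the special case where both matrices are diagonal, which is an assumption made here purely to keep the example dependency-free. In that case the whitened matrix is diagonal with entries $\hat{\sigma}_i / \sigma_i$, and its condition number is the largest ratio divided by the smallest.

```python
def kappa_loss_diagonal(sigma_diag, sigma_hat_diag):
    """Relative condition number loss for diagonal, positive-definite
    Sigma and Sigma_hat, given as lists of their diagonal entries."""
    if len(sigma_diag) != len(sigma_hat_diag):
        raise ValueError("diagonals must have the same length")
    if any(s <= 0 for s in sigma_diag) or any(s <= 0 for s in sigma_hat_diag):
        raise ValueError("diagonal entries must be strictly positive")
    ratios = [s_hat / s for s, s_hat in zip(sigma_diag, sigma_hat_diag)]
    return max(ratios) / min(ratios)


sigma = [4.0, 1.0, 1.0]  # hypothetical covariance with one spike

# An estimate that is off by a global factor of 1.5 incurs the minimal loss of 1.
loss_rescaled = kappa_loss_diagonal(sigma, [6.0, 1.5, 1.5])

# An estimate that inflates only the spike by a factor of 2 has loss 2.
loss_inflated = kappa_loss_diagonal(sigma, [8.0, 1.0, 1.0])
```

The first case shows that the loss is invariant to rescaling the estimate, so it penalizes errors in the shape of the covariance and not in its overall scale.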

The Annals of Statistics, 2021
We consider the problem of learning an unknown function $f$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i, x_i)\}_{i \le n}$, where $x_i$ is a feature vector uniformly distributed on the sphere and $y_i = f(x_i) + \varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layer neural networks around a random initialization: the random features model of Rahimi-Recht (RF) and the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n = \infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime, in which $N = \infty$ while $d$ and $n$ are large but finite. In the first regime, we prove that if $d^{\ell+\delta} \le N \le d^{\ell+1-\delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell+\delta} \le n \le d^{\ell+1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.

Annals of Statistics, 2020
First we would like to congratulate Professor Johannes Schmidt-Hieber for his excellent paper, which shows the surprising result that deep neural networks can achieve good rates of convergence even in the case of nonsmooth activation functions. In the following we divide our discussion into three parts: 1. The importance of compository assumptions. 2. The necessity of the sparsity of the networks. 3. The theoretical difference between ReLU and sigmoidal functions. 1. The importance of compository assumptions. In the sequel we use the following definition of (p, C)-smoothness.

In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. In particular, we change the following: (1) Architecture and task setup: we compare to a transformer-LSTM hybrid, and a decoder-only transformer with a language modeling loss. (2) Noise level in the training distribution: we experiment with filtering, and with adding i.i.d. synthetic noise. In all the above cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data. Lastly, we find that using back-translated data instead of parallel data, can si...
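A power law in the number of training samples becomes a straight line on log-log axes, so its exponent can be read off with ordinary least squares. The sketch below assumes the simplest possible form, L(D) = beta * D^(-alpha), and fits it to synthetic losses. The paper's fitted form may carry additional terms, such as the model-size dependence the abstract mentions, which are not reproduced here, and the numbers are made up for illustration.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log(beta) - alpha * log(D).

    Returns the pair (alpha, beta)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic, noise-free data generated from a known power law (hypothetical values).
sizes = [10 ** k for k in range(4, 9)]
losses = [5.0 * D ** -0.3 for D in sizes]

alpha, beta = fit_power_law(sizes, losses)
```

Comparing the fitted `alpha` across training setups is one way to check a claim such as "the data scaling exponents are minimally impacted".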

Natural language understanding and generation models follow one of the two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but with few studies investigating the performance of LMs. In this work, we thoroughly examine the role of several architectural design choices on the performance of LMs on bilingual, (massively) multilingual and zero-shot translation tasks, under systematic variations of data conditions and model sizes. Our results show that: (i) Different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases, (ii) Several design choices, including causal masking and language-modelin...

ArXiv, 2021
We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we show that the total number of parameters alone is not sufficient for such purposes. (ii) We observe different power law exponents when scaling the decoder vs scaling the encoder, and provide recommendations for optimal allocation of encoder/decoder capacity based on this observation. (iii) We also report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (either via machine generated or human translated ...

ArXiv, 2021
In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid, or navigate out of, regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve tr...
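The link between curvature and tolerable learning rate can be seen on a one-dimensional quadratic loss 0.5 * c * x^2: each gradient descent step multiplies x by (1 - lr * c), so the iterates shrink only when lr < 2 / c. The sketch below illustrates that textbook fact together with a linear warmup schedule. The linear ramp and all constants are assumptions for illustration, since the abstract does not specify the schedule used.

```python
def gradient_descent_on_quadratic(curvature, lr, steps, x0=1.0):
    """Run gradient descent on 0.5 * curvature * x**2 and return |x| at the end."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # the gradient of the loss is curvature * x
    return abs(x)


def warmup_learning_rate(step, peak_lr, warmup_steps):
    """Linear warmup (assumed form): ramp up to peak_lr over warmup_steps steps,
    then hold it constant."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


curvature = 100.0  # hypothetical sharp region; the stability threshold is 2 / 100 = 0.02

stable = gradient_descent_on_quadratic(curvature, lr=0.015, steps=20)    # below threshold
unstable = gradient_descent_on_quadratic(curvature, lr=0.03, steps=20)   # above threshold

# The first few warmup steps stay below the threshold even though the peak does not.
early_rates = [warmup_learning_rate(t, peak_lr=0.03, warmup_steps=10) for t in range(3)]
```

Under this reading, warmup keeps the early steps stable in a sharp region, and the full learning rate is reached only later, once training may have moved to flatter regions.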
We consider a linear regression $y=X\beta+u$ where $X\in\mathbb{R}^{n\times p}$, $p\gg n$, and $\beta$ is $s$-sparse. Motivated by examples in financial and economic data, we consider the situation where $X$ has highly correlated and clustered columns. To perform sparse recovery in this setting, we introduce the \emph{clustering removal algorithm} (CRA), which seeks to decrease the correlation in $X$ by removing the cluster structure without changing the parameter vector $\beta$. We show that as long as certain assumptions hold about $X$, the decorrelated matrix will satisfy the restricted isometry property (RIP) with high probability. We also provide examples of the empirical performance of CRA and compare it with other sparse recovery techniques.