Multiple‐Output Quantile Regression through Optimal Quantization

Jérôme Saracco

doi:10.1111/SJOS.12426

Outline

Multiple‐Output Quantile Regression through Optimal Quantization

Jérôme Saracco

Scandinavian Journal of Statistics

https://doi.org/10.1111/SJOS.12426

visibility

…

description

29 pages

Abstract

Charlier et al. (2015a,b) developed a new nonparametric quantile regression method based on the concept of optimal quantization and showed that the resulting estimators often dominate their classical, kernel-type, competitors. The construction, however, remains limited to single-output quantile regression. In the present work, we therefore extend the quantization-based quantile regression method to the multiple-output context. We show how quantization allows to approximate the population multiple-output regression quantiles introduced in Hallin et al. (2015), which are conditional versions of the location multivariate quantiles from Hallin et al. (2010). We prove that this approximation becomes arbitrarily accurate as the size of the quantization grid goes to infinity. We also consider a sample version of the proposed quantization-based quantiles and establish their weak consistency for their population version. Through simulations, we compare the performances of the proposed quantization-based estimators with their local constant and local bilinear kernel competitors from Hallin et al. (2015). We also compare the corresponding sample quantile regions. The results reveal that the proposed quantization-based estimators, which are local constant in nature, outperform their kernel counterparts and even often dominate their local bilinear kernel competitors.

Multiple-Output Regression through Optimal Quantization Isabelle Charlier SBS-EM, ECARES and Département de Mathématique Université libre de Bruxelles and Université de Bordeaux Davy Paindaveine SBS-EM, ECARES and Département de Mathématique, Université libre de Bruxelles Jérôme Saracco Université de Bordeaux April 2016 ECARES working paper 2016-18 ECARES ULB - CP 114/04 50, F.D. Roosevelt Ave., B-1050 Brussels BELGIUM www.ecares.org ECARES ULB - CP 114/04 50, F.D. Roosevelt Ave., B-1050 Brussels BELGIUM www.ecares.org Mutiple-Output Quantile Regression through Optimal Quantization Isabelle Charlier(1,2)∗ Davy Paindaveine(1)† Jérôme Saracco(2) April 3, 2016 (1) Université Libre de Bruxelles, ECARES and Département de Mathématique, 50, Avenue F.D. Roosevelt, CP114/04, B-1050, Brussels, Belgium. [email protected], [email protected] (2) Université de Bordeaux, Institut de Mathématiques de Bordeaux, UMR CNRS 5251 and INRIA Bordeaux Sud-Ouest, CQFD team, 351 Cours de la Libération, 33405 Talence, France. [email protected] Abstract Charlier et al. (2015a,b) developed a new nonparametric quantile regression method based on the concept of optimal quantization and showed that the resulting estimators often dominate their classical, kernel-type, competitors. The construction, however, remains limited to single-output quantile regression. In the present work, we therefore extend the quantization-based quantile regression method to the multiple-output context. We show how quantization allows to approximate the population multiple-output regression quantiles introduced in Hallin et al. (2015), which are conditional versions of the location multivariate quantiles from Hallin et al. (2010). We prove that this approximation becomes arbitrarily accurate as the size of the quantization grid goes to infinity. We also consider a sample version of the proposed quantization-based quantiles and establish their weak consistency for their population version. Through simulations, we compare the performances of the proposed quantization-based estimators with their local constant and local bilinear kernel competitors from Hallin et al. (2015). We also compare the corresponding sample quantile regions. The results reveal that the proposed quantization-based estimators, which are local constant in nature, outperform their kernel counterparts and even often dominate their local bilinear kernel competitors. ∗ Research is supported by a Bourse F.R.I.A. of the Fonds National de la Recherche Scientifique, Communauté française de Belgique. † Research is supported by the IAP research network grant nr. P7/06 of the Belgian government (Belgian Science Policy), the Crédit de Recherche J.0113.16 of the FNRS (Fonds National pour la Recherche Scientifique), Communauté Française de Belgique, and a grant from the National Bank of Belgium. 1 1 Introduction Since its introduction in the seminal paper Koenker and Bassett (1978), quantile regression has met a tremendous success. Unlike standard least squares regression that focuses on the mean of a scalar response Y conditional on a d-dimensional covariate X, quantile regression is after the corresponding conditional quantile of any order α ∈ (0, 1), hence provides a thorough description of the conditional distribution of the response. It is well-known that quantile regression, as an L1 -type method, dominates least squares regression methods in terms of robustness, while remaining very light on the computational side since it relies on linear programming methods. Quantile regression was first defined for linear regression but was of course later extended to nonlinear/nonparametric regression. For a modern account of quantile regression, we refer to the monograph Koenker (2005). Now, quantile regression for long has been restricted to the single-output context for which the response is univariate. The reason is of course that the lack of a canonical ordering in Rm , m > 1, makes unclear how to define a suitable concept of quantile in the multivariate setting; we refer to Serfling (2002) for a nice review on multivariate quantiles. More recently, Hallin et al. (2010) introduced a multivariate quantile that enjoys all properties usually expected from a quantile and that provides, as in the univariate case, a strong connection with the concept of halfspace depth. The same paper also based on this quantile a concept of multiple-output linear regression quantiles, extending to the multiple-output setup the linear regression quantiles from Koenker and Bassett (1978). Hallin et al. (2015) — hereafter, HLPS15 — then extended this construction to nonparametric regression by introducing local constant and local bilinear versions of these multiple-output regression quantiles. The resulting nonparametric multiple- output regression quantiles make use of kernel smoothing, hence actually extend to the multiple- output setup the single-output local constant and local linear kernel quantile regression methods from Yu and Jones (1998). Kernel methods, however, are not the only smoothing techniques that can be used to perform nonparametric quantile regression. In the single-output context, Charlier et al. (2015a) indeed showed recently that nonparametric quantile regression can alternatively be based on optimal e N of size N of the quantization, which is a method that provides a discretized approximation X d-dimensional (typically continuous) covariate X (this discretization is obtained by projecting X onto an optimal quantization grid, that is onto a collection of N points minimizing the Lp -norm e N − X). Charlier et al. (2015a) introduced a nonparametric regression quantile method of X based on optimal quantization and derived some of its theoretical properties. Charlier et al. (2015b) then showed through simulations that the resulting sample regression quantiles often dominate their kernel competitors in terms of integrated square errors. 2 This dominance of quantization-based regression quantiles over their kernel competitors pro- vides a natural motivation to try and define quantization-based analogs of the kernel multiple- output regression quantiles from HLPS15. This is the objective of the present paper, that is organized as follows. Section 2 describes the population multiple-output regression quantiles considered in this work, namely the conditional version of the multivariate quantiles from Hallin et al. (2010), and explains how these can be approximated through optimal quantization. The approximation is shown to be arbitrarily accurate as the number N of grid points used in the quantization goes to infinity. Section 3 defines the corresponding sample quantization-based re- gression quantiles and establishes their consistency (for the fixed-N approximation of multiple- output regression quantiles). Section 4 is devoted to numerical results : first, a data-driven method to select the smoothing parameter N is described (Section 4.1). Then, a comparison with the kernel-based competitors from HLPS15 is performed, based on empirical integrated square errors and on visual inspection of the resulting conditional quantile regions (Section 4.2). Finally, Section 5 concludes and an appendix collects the proofs. 2 Quantization-based multiple-output regression quantiles As mentioned above, the main objective of this paper is to estimate through optimal quantization the population multiple-output regression quantiles from Hallin et al. (2015). We start by describing these quantiles. 2.1 The population multiple-output regression quantiles considered Let X be a vector of covariates of dimension d and Y be a vector of response variables of dimension m. The multivariate quantiles from HLPS15 are indexed by a vector α ranging over B m := {y ∈ Rm : 0 < |y| < 1}, the open unit ball of Rm deprived of the origin (throughout, | · | denotes the Euclidean norm). This index α factorizes into α = αu, with α = |α| ∈ (0, 1) and u ∈ S m−1 := {y ∈ Rm : |y| = 1}. Letting Γu be an arbitrary m × (m − 1) matrix whose columns form, jointly with u, an orthonormal basis of Rm , the population regression α-quantiles we consider are the following. Definition 2.1 (HLPS15). The α-quantile of Y given X = x is any element of the collection of hyperplanes π α,x := {(x, y) ∈ Rd × Rm : u0 y = c0α,x Γ0u y + aα,x } such that aα,x q α,x := = arg min(a,c0 )0 ∈Rm E[ρα (Yu − c0 Yu⊥ − a)|X = x], (2.1) cα,x where α =: αu ∈ B m , Yu := u0 Y , Yu⊥ := Γ0u Y and z 7→ ρα (z) := z(α − I[z<0] ) is the usual check function (throughout, IA is the indicator function of A). 3 Like their original location version introduced in Hallin et al. (2010), the α-quantiles from Definition 2.1 are directional quantiles. The vector u is the direction of the quantile, whereas the scalar quantity α refers to the order of the quantile. The y-part of the α-quantile from Definition 2.1, that is, {y ∈ Rm : u0 y = c0α,x Γ0u y + aα,x }, (2.2) actually provides the Hallin et al. (2010) (location) multivariate α-quantile associated with the conditional distribution P Y |X=x of Y given X = x. We refer to Hallin et al. (2010) for the exact interpretation of this quantile. It then follows directly from Hallin et al. (2010) that, under mild assumptions, the regression α-quantile region Rα,x := ∩u∈S m−1 y ∈ Rm : u0 y ≥ c0αu,x Γ0u y + aαu,x , α ∈ 0, 21 , (2.3) coincides with the level-α halfspace depth region of P Y |X=x (that is, with the set of y’s whose halfspace depth with respect to P Y |X=x is larger than or equal to α). In the single-output case (m = 1), the quantile in (2.2), for u = 1 (resp., for u = −1), reduces to the usual α-quantile (resp., (1 − α)-quantile) of P Y |X=x , and Rα,x coincides with the interval whose end points are these two quantiles. If P Y |X=x is absolutely continuous with respect to the Lebesgue measure on Rm , with a density that has a connected support and admits finite first-order moments, the minimiza- tion problem in (2.1) admits a unique solution; see Hallin et al. (2010). Moreover, (a, c0 )0 7→ Ga,c (x) = E[ρα (Yu − c0 Yu⊥ − a)|X = x] is then convex and continuously differentiable on Rm . Therefore, under these asumptions, q α,x = (aα,x , c0α,x )0 is alternatively characterized as the unique solution of the system of equations ∂a Ga,c (x) = P [u0 Y < a + c0 Γ0u Y |X = x] − α = 0 (2.4) ∇c Ga,c (x) = E Γ0u Y α − I[u0 Y <a+c0 Γ0u Y ] X = x = 0; (2.5) see Lemma A.4 in Appendix A. As we will see in the sequel, twice differentiability of (a, c0 )0 7→ Ga,c (x) actually requires slightly stronger assumptions. 2.2 Quantization-based multiple-output regression quantiles We now construct an approximation of the above regression quantiles based on optimal quan- tization. For any fixed N ∈ N0 (:= {1, 2, . . . }), quantization replaces the random d-vector X N e γ := Projγ N (X) obtained by projecting X onto the N -quantization by a discrete version X grid γ N (∈ (Rd )N ). The quantization grid is optimal if it minimizes the quantization error N e γ − Xkp , where kZkp := (E[|Z|p ])1/p denotes the Lp -norm of Z. Existence (but not kX 4 unicity) of such an optimal grid is guaranteed if the distribution of X does not charge any e N will denote the projection of X onto an hyperplane; see, e.g., Pagès (1998). In the sequel, X arbitrary optimal N -grid. This approximation becomes more and more precise as N increases e N − Xkp = O(N −1/d ) as N → ∞; see, e.g., Graf and Luschgy (2000). More details since kX on optimal quantization can be found in Pagès (1998), Pagès and Printems (2003) or Graf and Luschgy (2000). Let p ≥ 1 such that kXkp < ∞ and let γ N be an optimal quantization grid. Replacing X e N onto γ N leads to considering in (2.1) by its projection X aN eN α,x e N = x̃], = arg min(a,c0 )0 ∈Rm E[ρα (Yu − c0 Yu⊥ − a)|X e q α,x = (2.6) cN eα,x where x̃ denotes the projection of x onto γ N . A quantization-based approximation of the multiple-output regression quantile from Definition 2.1 above is thus any hyperplane of the form 0 0 0 π̃ N d m cN aN α,x := (x, y) ∈ R × R : u y = (eα,x ) Γu y + eα,x . eN This quantization-based quantile being entirely characterized by q α,x , we will investigate the N N quality of this approximation through q e − X goes to zero in Lp -norm as N goes eα,x . Since X eN to infinity, we may expect that q α,X − q α,X also converges to zero in an appropriate sense. To formalize this, the following assumptions are needed. Assumption (A) (i) The random vector (X, Y ) is generated through Y = M (X, ε), where the d-dimensional covariate vector X and the m-dimensional error vector ε are mutu- ally independent; (ii) the support SX of the distribution PX of X is compact; (iii) denoting by GLm (R) the set of m × m invertible real matrices, the link function M : SX × Rm → Rm is of the form (x, z) 7→ M (x, z) = Ma,x + Mb,x z, where the functions Ma,· : SX → Rm and Mb,· : SX → GLm (R) are Lipschitz with respect to the Euclidean norm and operator norm, respectively (see below); (iv) the distribution PX of X does not charge any hyperplane; (v) kXkp < ∞ and kεkp < ∞; For the sake of clarity, we make precise that the Lipschitz properties of Ma,· and Mb,· in Assumption (A)(iii) mean that there exist constants C1 , C2 > 0 such that ∀x1 , x2 ∈ Rd , |Ma,x1 − Ma,x2 | ≤ C1 |x1 − x2 |, (2.7) ∀x1 , x2 ∈ Rd , kMb,x1 − Mb,x2 k ≤ C2 |x1 − x2 |, (2.8) where kAk = supu∈S m−1 |Au| denotes the operator norm of A. The smallest constant C1 (resp., C2 ) that satisfies (2.7) (resp., (2.8)) will be denoted as [Ma,· ]Lip (resp., [Mb,· ]Lip ). 5 Assumption (B) The distribution of ε is absolutely continuous with respect to the Lebesgue measure on Rm , with a density f ε : Rm → R+ 0 that is bounded, has a connected support, admits finite second-order moments, and satisfies, for some constants C > 0, r > m − 1 and s > 0, f (z 1 ) − f ε (z 2 ) ≤ C |z 1 − z 2 |s 1 + 1 |z 1 + z 2 |2 −(3+r+s)/2 , ε 2 (2.9) for all z 1 , z 2 ∈ Rm . We refer to Hallin et al. (2010) for more details on Assumption (B) and more particularly on (2.9). For technical reasons, the condition r > m − 2 from Hallin et al. (2010) had to be slightly reinforced into r > m − 1 above. As shown in Lemma A.4, Assumption (B) ensures twice continuous differentiability of (a, c0 )0 7→ Ga,c (x). Under these assumptions, we have the following result (see Appendix A for the proof). Theorem 2.1. Let Assumptions (A) and (B) hold. Then, for any α ∈ B m , qN sup |e α,x − q α,x | → 0, x∈SX as N → ∞. This result confirms that, as the size N of the optimal quantization grid goes to infinity, eN the quantization-based approximation q α,x of q α,x becomes arbitrarily precise. Clearly, the approximation is actually uniform in x. This makes it natural to try and define, whenever eN observations are available, a sample version of q α,x that will then be an estimator of q α,x from which one will be able to obtain in particular sample versions of the regression quantile regions Rα,x in (2.3). 3 Sample quantization-based multiple-output quantiles We now consider the problem of defining, from independent copies (X1 , Y 1 ), . . . , (Xn , Y n ) bN,n of (X, Y ), a sample version q eN α,x of the quantization-based regression quantile coefficients q α,x in (2.6). 3.1 Definition of the estimator No closed form is available for an optimal quantization grid, except in some very particular cases. bN,n The definition of q α,x thus first requires constructing an (approximate) optimal grid. This may be done through a stochastic gradient algorithm, which proceeds as follows to quantize a d- dimensional random vector X. 6 Let (ξ t )t∈N0 be a sequence of independent copies of X, and let (δt )t∈N0 be a deterministic sequence in (0, 1) satisfying ∞ P P∞ 2 t=1 δt = +∞ and t=1 δt < +∞. Starting from an initial N - grid γ̂ N,0 with pairwise distinct components, the algorithm recursively defines the grid γ̂ N,t , t ∈ N0 , as δt γ̂ N,t = γ̂ N,t−1 − ∇x dpN (γ̂ N,t−1 , ξ t ), p where ∇x dpN (x, ξ) is the gradient with respect to the x-component of the so-called local quanti- zation error dpN (x, ξ) = mini=1,...,N |xi − ξ|p , with x = (x1 , . . . , xN ) ∈ (Rd )N and ξ ∈ Rd . Since (∇x dpN (x, ξ))i = p|xi − ξ|p−2 (xi − ξ)I[xi =Projx (ξ)] , i = 1, . . . , N , two consecutive grids γ̂ N,t−1 and γ̂ N,t differ by one point only, namely the point corresponding to the non-zero component of this gradient. The reader can refer to Pagès (1998), Pagès and Printems (2003) or Graf and Luschgy (2000) for more details on this algorithm, which, for p = 2, is known as the Competitive Learning Vector Quantization (CLVQ) algorithm. bN,n The construction of q α,x then proceeds in two steps. (S1) An “optimal” quantization grid is obtained from the algorithm above. First, an initial grid γ̂ N,0 is selected by sampling randomly without replacement among the Xi ’s, under the constraint that the same value cannot be picked more than once (a constraint that is relevant only if there are ties in the Xi ’s). Second, n iterations of the algorithm are performed, based on ξ t = Xt , for t = 1, . . . , n. The resulting optimal grid is denoted as γ̂ N,n = (x̂N,n N,n 1 , . . . , x̂n ). eN 0 0 0 eN (S2) The approximation q α,x = arg min(a,c0 )0 E[ρα (u Y − c Γu Y − a)| X = x̃] in (2.6) is then estimated by aN,n n α,x X bN,n ρα (u0 Y i − c0 Γ0u Y i − a)I[X̂ N = x̂] , b q α,x = = arg min(a,c0 )0 ∈Rm (3.1) cN,n bα,x i=1 i bN = X where X b N,n = Proj N,n (Xi ) and x̂ = x̂N,n = Proj N,n (x). i i γ̂ γ̂ An estimator of the multiple-output regression quantiles from Definition 2.1 is then any hyperplane of the form 0 0 0 π̂ N,n d m cN,n aN,n α,x := (x, y) ∈ R × R : u y = (bα,x ) Γu y + bα,x . bN,n Since this estimator is entirely characterized by q bN,n α,x , we may focus on q α,x when investigating the properties of these sample quantiles. We will show that, for fixed N (∈ N0 ) and x(∈ SX ), bN,n q eN α,x is a weakly consistent estimator for q α,x . The result requires restricting to p = 2 and reinforcing Assumption (A)(iv) into the following assumption. 7 Assumption (A)0 Same as Assumption (A), but with Assumption (A)(iv) replaced by the following : PX is absolutely continuous with respect to the Lebesgue measure on Rd . We then have the following result (see Appendix B for the proof). Theorem 3.1. Let Assumption (A) 0 hold. Then, for any α ∈ B m , x ∈ SX and N ∈ N0 , N,n eN bα,x − q q α,x → 0 as n → ∞ in probability, provided that quantization is based on p = 2. In principle, Theorem 2.1 and Theorem 3.1 could be combined to provide an asymptotic N,n bα,x − q α,x | → 0 as n → ∞ in probability, with N = Nn going to infinity at result stating that q an appropriate rate. However, obtaining such a result is extremely delicate, since all convergence results for the CLVQ algorithm are as n → ∞ with N fixed. 3.2 A bootstrap modification For small sample sizes, the stochastic gradient algorithm above is likely to provide a grid that is far from being optimal, which may have a negative impact on the proposed sample quantiles. To improve on this, we propose the same bootstrap approach as the one adopted in the single-output context by Charlier et al. (2015a,b) : (S1) For some integer B, we first generate B samples of size n with replacement from the initial sample X 1 , . . . , X n , that we denote as {ξ tb , t = 1, . . . , n}, b = 1, . . . , B. We also generate initial grids γ̂ N,0 b as above, by sampling randomly among the correspond- ing {ξ tb , t = 1, . . . , n} under the constraints that the N values are pairwise distinct. We then perform B times the CLVQ algorithm with iterations based on {ξ tb , t = 1, . . . , n} and with initial grid γ̂ N,0 N,n b . This provides B optimal grids γ̂ b , b = 1, . . . , B (each of size N ). (S2) Each of these grids is then used to estimate multiple-output regression quantiles. Working again with the original sample (X i , Y i ), i = 1, . . . , n, we project the X-part onto the (b),N,n grids γ̂ N,n b , b = 1, . . . , B. Therefore, (3.1) provides B estimates of q α,x , denoted as q̂ α,x , b = 1, . . . , B. This leads to the bootstrap estimator B 1 X (b),N,n q̄ N,n α,x = q̂ α,x , (3.2) B b=1 obtained by averaging these B estimates. Denoting by R̂α,x the resulting sample quantile regions (see Section 4.2.3 for more details), the parameter B should be chosen large enough to smooth the mappings x 7→ R̂α,x , but not too 8 large to keep the computational burden under control. We use B = 50 or B = 100 in the sequel. The choice of N , that plays the role of the smoothing parameter in the nonparametric regression method considered, has an important impact on the proposed estimators and is discussed in the next section. 4 Numerical results In this section, we explore the numerical performances of the proposed estimators. We first introduce in Section 4.1 a data-driven method for selecting the size N of the quantization grid. In Section 4.2, we then compare the proposed (bootstrap) quantization-based estimators with their kernel-type competitors from HLPS15. 4.1 Data-driven selection of N In this section, we extend the N -selection criterion developed in Charlier et al. (2015b) to the present multiple-output context. This criterion is based on the minimization of an empirical integrated square error (ISE) quantity that is essentially convex in N , which allows to identify an optimal value Nopt of N . Let x1 , . . . , xJ be values of interest in SX and u1 , . . . , uK be directions of interest in S m−1 , with J, K finite. The procedure to select N works as follows. For any combination of xj and uk , (b),N,n we first compute q̄ N,n αuk ,xj = B 1 PB b=1 q̂ αuk ,xj from B bootstrap samples as above. We then generate Be further samples of size n with replacement from the initial sample X1 , . . . , Xn , and we perform B e times the CLVQ algorithm with iterations based on these samples. This provides B e optimal quantization grids. Working again with the original sample (Xi , Y i ), i = 1, . . . , n and (B+b̃),N,n using the b̃th grid, (3.1) provides B e new estimations, denoted q̂ αu k ,xj , b̃ = 1, . . . , B. e We then consider J K B e 1 X 1 X 1 X N,n (B+b̃),N,n 2 ISEα,B,B,J,K d (N ) = q̄ αuk ,xj − q̂ αuk ,xj . J K e j=1 B e k=1 b̃=1 To make the notation lighter, we simply denote these integrated square errors as ISEd α (N ) (throughout, our numerical results will be based on m = 2, B e = 30 and K equispaced directions in S 1 ; the values of B, K and x1 , . . . , xJ will be made precise in each case). These sample ISEs are to be minimized in N . Since not all values of N can be considered in practice, we rather consider N̂α;opt = arg min ISE d α (N ), (4.1) N ∈N where the cardinality of N (⊂ N0 ) is finite and may be chosen as a function of n. 9 0.2 0.4 0.25 0.2 0.4 0.20 0.15 ISEα(N) ^ 0.10 0.05 0.00 10 15 20 25 30 N Figure 1: Plot of the mappings N 7→ ISE d α (N ) (α = 0.2, 0.4) with B = 50, B e = 30 and K = 40, averaged over 100 mutually independent replications of Model (M1) with sample size n = 999. For illustration purposes, we simulated random samples of size n = 999 according to the model (M1) (Y1 , Y2 ) = (X, X 2 ) + (1 + X 2 )ε, where X ∼ U ([−2, 2]), ε has independent N (0, 1/4) marginals, and X and ε are independent. Figure 1 plots, for α = 0.2 and α = 0.4, the graphs of N 7→ ISE d α (N ), where the ISEs are based on B = 50, K = 40 and x1 = −1.89, x2 = −1.83, x3 = −1.77, . . . , xJ = 1.89 (more precisely, the figure shows the average of the corresponding graphs, computed from 100 mutually independent replications). It is seen that ISE curves are indeed essentially convex in N and allow to select N equal to 10 for both values of α. 4.2 Comparison with competitors In this section, we investigate the numerical performances of our estimator q̄ N,n α,x . In Section 4.2.1, we first define the competitors that will be considered. Then we compare the respective ISEs through simulations (Section 4.2.2) and show how the estimated quantile regions compare on a given sample (Section 4.2.3). 4.2.1 The competitors considered The main competitors are the local constant and local bilinear estimators from HLPS15, that extend to the multiple-output setting the local constant and local linear estimators of Yu and Jones (1998), respectively. To describe these estimators, fix a kernel function K : Rd → R+ and 10 ⊥ a bandwidth h. Writing Yiu := u0 Y i and Yiu := Γ0u Y i , the local constant estimator is then the c c minimizer qbα,x acα,x , (b = (b cα,x )0 )0 of n X − x X i 0 c c 1 q 7→ K ρα Yiu − q Xiu , with Xiu := ⊥ . (4.2) h Yiu i=1 ` ` ` As for the local (bi)linear estimator qbα,x a`α,x , (b = (b cα,x )0 )0 , its transpose vector (qbα,x )0 is given by the first row of the (d + 1) × m matrix Q̂ that minimizes n X − x X i 1 1 ρα Yiu − (vec Q)0 Xiu ` ` Q 7→ K , with Xiu := ⊥ ⊗ . (4.3) h Yiu Xi − x i=1 As explained in HLPS15, the local bilinear approach is more informative than the local constant one and should be more reliable close to the boundary of the covariate support. However, the c is of dimension m, whereas X ` price to pay is an increase of the covariate space dimension (Xiu iu is of dimension m(d + 1)). We refer to HLPS15 for more details on these approaches. In the sequel, we consider d = 1 and m = 2 in order to provide graphical representations of the corresponding quantile regions. The kernel K will be the density of the bivariate standard Gaussian distribution and we choose, as in most applications in HLPS15, 3sx h= , (4.4) n1/5 where sx stands for the empirical standard deviation of X1 , . . . , Xn . 4.2.2 Comparison of ISEs We now compare our bootstrap estimators with the competitors above in terms of ISEs. To do so, we generated 500 independent samples of size n = 999 from (M1) (Y1 , Y2 ) = (X, X 2 ) + (1 + X 2 )ε1 , (M2) (Y1 , Y2 ) = (X, X 2 ) + ε1 , 3 π 2 (M3) (Y1 , Y2 ) = (X, X 2 ) + 1 + 2 sin 2X ε2 , where X ∼ U ([−2, 2]), ε1 has independent N (0, 1/4) marginals, ε2 has independent N (0, 1) marginals, and X is independent of ε1 and ε2 (these models were already considered in HLPS15). Both the proposed quantization-based quantiles and their competitors are indexed by a scalar order α ∈ (0, 1) and a direction u ∈ S 1 . In this section, we compare efficiencies when estimating a given conditional quantile q αu,x . In the sequel, we still work with α = 0.2, 0.4 and we fix u = (0, 1)0 . For each replication in each model, the various quantile estimators were computed, based on the bandwidth h in (4.4) for the HLPS15 estimators and based on B = 100 and the N - selection procedure described in Section 4.1 (with x1 = −1.89, x2 = −1.83, . . . , xJ = 1.89, N = 11 {10, 15, 20} and K = 1 direction, namely the direction u = (0, 1)0 above) for the quantization- based estimators. For each estimator, we then evaluated J J X 2 X 2 ISEaα = aαu,xj − aαu,xj b and ISEcα = cαu,xj − cαu,xj b , j=1 j=1 aαu,xj stands for āN,n still for x1 = −1.89, x2 = −1.83, . . . , xJ = 1.89; here, b acαu,xj or b αu,xj , b a`αu,xj cαu,xj for c̄N,n and b αu,xj , b c cαu,x j or b ` cαu,x j . Figure 2 reports, for each model and each estimator, the boxplots of ISEaα and ISEcα obtained from the 500 replications considered. Results reveal that the proposed estimator q̄ N,n α,x and the local bilinear estimator q ` bα,x perform c significantly better than the local constant estimator qbα,x , particularly for the estimation of the first component aα,x of q α,x . In most cases, the proposed estimator q̄ N,n α,x actually also dominates ` the local bilinear one qbα,x (the only cases where the opposite holds relate to the estimation of cα,x and the difference of performance is then really small). It should be noted that the quantization- based estimator q̄ N,n α,x is local constant in nature, which makes it remarkable that it behaves well in terms of ISE compared to its local bilinear kernel competitor. 4.2.3 Comparison of sample quantile regions As explained in HLPS15, the regression quantile regions Rα,x in (2.3) are extremely informative about the conditional distribution of the response, which makes it desirable to obtain well- behaved estimations of these regions. That is why we now compare the sample regions obtained from the proposed quantization-based quantile estimators with the kernel ones from HLPS15. Irrespective of the quantile coefficient estimators q̂ αu,x = (âαu,x , ĉ0αu,x )0 used, the corresponding sample regions are obtained as R̂α,x := ∩u∈S m−1 y ∈ Rm : u0 y ≥ ĉ0αu,x Γ0u y + âαu,x , F where SFm−1 is a finite subset of S m−1 ; compare with (2.3). We considered a random sample of size n = 999 from Model (M1) and computed, for the various estimation methods, R̂α,x for α = 0.2, 0.4 and for x = −1.89, −1.83, −1.77, . . . , 1.89; in each case, SFm−1 = SF1 is made of 360 equispaced directions in S 1 . In each model, we did not select h following the data-driven procedure mentioned in Section 4.2.1, but chose it equal to 0.37, as proposed in HLPS15, Figure 3. For the quantization-based estimators, N was selected according to the data-driven method from Section 4.1 (with B = 100, K = 360, N = {5, 10, 15, 20}, and still x1 = −1.89, x2 = −1.83, . . . , xJ = 1.89), which led to the optimal value N = 10. The resulting sample quantile regions, obtained from the quantization-based method and from the local constant and local bilinear kernel ones, are plotted in Figure 3. 12 1.2 0.12 1.0 0.10 0.8 0.08 ISEaα ISEαc 0.6 0.06 0.4 0.04 0.02 0.2 0.00 0.0 α = 0.2 α = 0.4 α = 0.2 α = 0.4 (a) (b) 2.5 0.6 2.0 1.5 0.4 ISEaα ISEαc 1.0 0.2 0.5 0.0 0.0 α = 0.2 α = 0.4 α = 0.2 α = 0.4 (c) (d) 1.5 0.06 1.0 0.04 ISEaα ISEαc 0.5 0.02 0.00 0.0 α = 0.2 α = 0.4 α = 0.2 α = 0.4 (e) (f) Figure 2: Boxplots, for α = 0.2, 0.4 and u = (0, 1)0 , of ISEaα (left) and of ISEcα (right) for various conditional quantile estimators obtained from 500 independent random samples according to Models (M1) (top), (M2) (middle) and (M3) (bottom), with size n = 999. The estimators considered are the quantization-based estima- ` c tor q̄ N,n α,x (in blue), the local bilinear estimator q bα,x (in purple) and the local constant estimator qbα,x (in red). 13 For comparison purposes, the population quantile regions Rα,x are also reported there. We observe that the quantization-based and local bilinear methods provide quantile regions that are nice and close to the population ones. They succeed in particular in catching the underlying heteroscedasticity. Clearly, they perform better than the local constant methods close to the boundary of the covariate range. While the local (bi)linear methods, as already mentioned, are known to exhibit good boundary behaviour, it is surprising that the quantization-based method also behaves well in this respect, since this method is of a local constant nature. Finally, it should be noted that, unlike the smoothing parameter of the local constant/bilinear methods (namely, h), that of the quantization-based method (namely, N ) was chosen in a fully data-driven way. 5 Summary In this paper, we defined new nonparametric estimators of the multiple-output regression quan- tiles proposed in HLPS15. The main idea, that was already used in the single-output context in Charlier et al. (2015a,b), is to perform localization through the concept of optimal quantization rather than through standard kernel methods. We derived consistency results that generalize to the multiple-output context those obtained in Charlier et al. (2015a). Moreover, the good empirical efficiency properties of quantization-based quantiles showed in Charlier et al. (2015b) extend to the multiple-output context. In particular, the proposed quantization-based sample quantiles, that are local constant in nature, outperform their kernel-based counterparts, both in terms of integrated square errors and in terms of visual inspection of the corresponding quan- tile regions. The proposed quantiles actually perform as well as (and sometimes even strictly dominate) the local bilinear kernel estimators from HLPS15. The data-driven selection procedure we proposed for the smoothing parameter N involved in the quantization-based method allows to make the estimation fully automatic. Our estimation procedure was actually implemented in R and the code is available from the authors on simple request. A Proof of Theorem 2.1 The proof requires several lemmas. First, recall that Ga,c (x) = E[ρα (Yu − c0 Yu⊥ − a)|X = x] and consider the corresponding quantized quantity G e N = x̃]. e a,c (x̃) = E[ρα (Yu − c0 Y ⊥ − a)|X u Since q α,x = (aα,x , c0α,x )0 and q eNα,x = (eaN cN α,x , (e 0 0 α,x ) ) are defined as the vectors achieving the minimum of Ga,c (x) and G e a,c (x̃) respectively, we naturally start controlling the distance between G e a,c (x̃) and Ga,c (x). This is achieved below in Lemma A.5, whose proof requires the following 14 5 5 Y2 Y2 0 0 -5 -5 -5 0 5 -5 0 5 Y1 Y1 (a) (b) 5 5 Y2 Y2 0 0 -5 -5 -5 0 5 -5 0 5 Y1 Y1 (c) (d) Figure 3: (a)-(c) The sample quantile regions R̂α,x , for α = 0.2, 0.4 and x = −1.89, −1.83, −1.77, . . . , 1.89, computed from a random sample of size n = 999 from Model (M1) by using (a) the quantization-based method, (b) the local constant kernel method, and (c) the local bilinear kernel one. (d) The corresponding population quantile regions Rα,x . 15 preliminary lemmas. Throughout this appendix, C is a constant that may vary from line to line. Lemma A.1. Let Assumption (A) hold and fix α = αu ∈ B m , a ∈ R and c ∈ Rm−1 . Then for p x1 , x2 ∈ Rd , |Ga,c (x1 ) − Ga,c (x2 )| ≤ max(α, 1 − α) 1 + |c|2 ([Ma,· ]Lip + [Mb,· ]Lip kεk1 )|x1 − x2 |. Proof. For x1 , x2 ∈ Rd , we have |Ga,c (x1 ) − Ga,c (x2 )| = E[ρα (Yu − c0 Yu⊥ − a)|X = x1 ] − E[ρα (Yu − c0 Yu⊥ − a)|X = x2 ] = E[ρα ((u − Γu c)0 M (X, ε) − a)|X = x1 ] − E[ρα ((u − Γu c)0 M (X, ε) − a)|X = x2 ] = E[ρα ((u − Γu c)0 M (x1 , ε) − a) − ρα ((u − Γu c)0 M (x2 , ε) − a)], where we used the independence of X and ε. Using the fact that ρα is a Lipschitz function with Lipschitz constant [ρα ]Lip := max(α, 1 − α), then the Cauchy-Schwarz inequality, this yields |Ga,c (x1 ) − Ga,c (x2 )| ≤ [ρα ]Lip E[|(u − Γu c)0 (M (x1 , ε) − M (x2 , ε))|], ≤ [ρα ]Lip |u − Γu c| E[|M (x1 , ε) − M (x2 , ε)|] ≤ [ρα ]Lip |(u, Γu )(1, −c0 )0 | E[|Ma,x1 − Ma,x2 + (Mb,x1 − Mb,x2 )ε|] p ≤ [ρα ]Lip 1 + |c|2 ([Ma,· ]Lip + [Mb,· ]Lip kεk1 )|x1 − x2 |, where we used Assumptions (A)(iii)-(v). The following lemma shows that, under the assumptions considered, the regularity prop- erty (2.9) extends from the error density f ε (·) to the conditional density f Y |X=x (·). Lemma A.2. Let Assumptions (A) and (B) hold and fix x ∈ SX . Then, for some constants C > 0, r > m − 1 and s > 0, we have Y |X=x −(3+r+s)/2 (y 1 ) − f Y |X=x (y 2 ) ≤ C|y 1 − y 2 |s 1 + 12 |y 1 + y 2 |2 f , (A.1) for all y 1 , y 2 ∈ Rm . Proof. Assumption (A) allows to rewrite the conditional density of Y given X = x as 1 −1 f Y |X=x (y) = f ε Mb,x (y − Ma,x ) , |det(Mb,x )| for all y ∈ Rm . Hence, we have |f Y |X=x (y 1 ) − f Y |X=x (y 2 )| 1 ε −1 −1 (y 1 − Ma,x ) − f ε Mb,x = f Mb,x (y 2 − Ma,x ) . |det(Mb,x )| 16 Now, Assumption (B) entails −1 s Y |X=x Y |X=x C Mb,x (y 1 − y 2 ) |f (y 1 ) − f (y 2 )| ≤ −1 2 (3+r+s)/2 |det(Mb,x )| 1 + 21 Mb,x (y 1 + y 2 − 2Ma,x ) −1 2 −(3+r+s)/2 ≤ C|(y 1 − y 2 )|s 1 + 21 Mb,x (y 1 + y 2 − 2Ma,x ) , where the second inequality comes from the compactness of SX and the continuity of the mapping x 7→ Mb,x . The result then follows from the fact that 2 1 + 21 y 1 + y 2 2 1 + 1 M −1 (y 1 + y 2 − 2Ma,x ) 2 b,x −1 2 1 + 21 Mb,x {Mb,x (y 1 + y 2 − 2Ma,x )} + 2Ma,x = 2 1 + 1 M −1 (y 1 + y 2 − 2Ma,x ) 2 b,x −1 2 C + C Mb,x (y 1 + y 2 − 2Ma,x ) ≤ −1 2 ≤ C, 1 + 12 Mb,x (y 1 + y 2 − 2Ma,x ) where we used again the continuity of x 7→ Ma,x and x 7→ Mb,x , and the compactness of SX . We will also need the following result belonging to linear algebra. Lemma A.3. For p > q ≥ 1, let V = (v 1 . . . v q ) be a p × q full-rank matrix and H be a q-dimensional vector subspace of Rp . Then, there exists a p × q matrix U = (u1 . . . uq ) whose columns form an orthonormal basis of H and such that Iq + U 0 V is invertible. Proof. We fix p ≥ 2 and we prove the result by induction on q between q = 1 and q = p − 1. We start with the case q = 1 and take U = (u1 ), where u1 is an arbitrary unit p-vector in H. If det(1 + U 0 V ) = 1 + u01 v 1 = 0, then we may alternatively take U ∗ = (−u1 ), which provides det(1 + U 0∗ V ) = 1 − u01 v 1 = 2 6= 0. Assume then that the result holds for q (with q < p − 1) and let us prove it for q + 1. Pick an arbitrary p × (q + 1) matrix U = (u1 . . . uq+1 ) whose columns form an orthonormal basis of the (q + 1)-dimensional vector subspace H of Rp . Assume that det(I q+1 + U 0 V ) = 0, where V = (v 1 . . . v q+1 ) is the given p × (q + 1) full-rank matrix. Letting U−1 = (u2 . . . uq+1 ) and V −1 = (v 2 . . . v q+1 ), an expansion of the determinant along the first row provides Pq+1 0 = det(Iq+1 + U 0 V ) = (u01 v 1 + 1) det(Iq + U−1 0 V −1 ) + i=2 (−1) i+1 u0 v det(W ), 1 i i for some q × q matrices W 2 , . . . , W m . With U∗ = (−u1 , u2 . . . uq+1 ), we then have Pq+1 det(Iq+1 + U∗0 V ) = (−u01 v 1 + 1) det(Iq + U−1 0 V −1 ) − i=2 (−1) i+1 u0 v det(W ) 1 i i 0 = 2 det(Iq + U−1 V −1 ) − det(Iq+1 + U 0 V ) 0 = 2 det(Iq + U−1 V −1 ). 17 0 The induction hypothesis guarantees that U−1 can be chosen such that det(Iq + U−1 V −1 ) is non-zero, which establishes the result. We can now calculate explicitly the gradient and the Hessian matrix of the function (a, c) 7→ Ga,c (x) for any x in the support SX of X, and derive some important properties of this Hessian matrix. Lemma A.4. Let Assumptions (A) and (B) hold. Then (i) (a, c) 7→ Ga,c (x) is twice differen- tiable at any x ∈ SX , with gradient vector P [u0 Y < a + c0 Γ0u Y |X = x] − α ! ! ∇a Ga,c (x) ∇Ga,c (x) = = (A.2) E Γ0u Y (I[u0 Y <a+c0 Γ0u Y ] − α)|X = x ∇c Ga,c (x) and Hessian matrix ! 1 t0 Z f Y |X=x (a + c0 t)u + Γu t dt; Ha,c (x) = Rm−1 t tt0 (ii) for any (a, c, x) ∈ R × Rm−1 × SX , Ha,c (x) is positive definite; (iii) (a, c, x) 7→ Ha,c (x) is continuous over R × Rm−1 × SX . Proof. (i) Let 1 ηα (a, c) = I[u0 Y −c0 Γ0u Y −a<0] − α 0 . Γu Y For any (a, c0 )0 , (a0 , c00 )0 ∈ Rm , we then have ρα (u0 Y − c0 Γ0u Y − a) − ρα (u0 Y − c00 Γ0u Y − a0 ) − (a − a0 , c0 − c00 )ηα (a0 , c0 ) n o = (u0 Y − c0 Γ0u Y − a) I[u0 Y −c00 Γ0u Y −a0 <0] − I[u0 Y −c0 Γ0u Y −a<0] ≥ 0, (A.3) so that ηα (a, c) is a subgradient for (a, c) 7→ ρα (u0 Y − c0 Γ0u Y − a). Hence, ∇Ga,c (x) = ∇a,c E[ρα (u0 Y − c0 Γ0u Y − a)|X = x] = E[ηα (a, c)|X = x], (A.4) which readily provides (A.2). Let us now show that |∇Ga+∆a ,c+∆c (x) − ∇Ga,c (x) − Ha,c (x)(∆a , ∆0c )0 | = o |(∆a , ∆0c )0 | as (∆a , ∆0c )0 → 0. From (A.4) and the identity ! ! ! Z (a+∆a )+(c+∆c )0 t 1 1 t0 ∆a dz = , a+c0 t t t tt0 ∆c 18 we obtain ∇Ga+∆a ,c+∆c (x) − ∇Ga,c (x) − Ha,c (x)(∆a , ∆0c )0 = E[ηα (a + ∆a , c + ∆c ) − ηα (a, c)|X = x] ! ! 1 t0 ∆a Z f Y |X=x (a + c0 t)u + Γu t dt − Rm−1 t tt0 ∆c ! 1 Z Z f Y |X=x (zu + Γu t) dzdt = I[z−(c+∆c )0 t−(a+∆a )<0] − I[z−c0 t−a<0] Rm−1 R t ! Z Z (a+∆a )+(c+∆c )0 t 1 f Y |X=x (a + c0 t)u + Γu t dzdt − Rm−1 a+c0 t t ! Z Z (a+∆a )+(c+∆c )0 t 1 f Y |X=x (zu + Γu t) − f Y |X=x (a + c0 t)u + Γu t dzdt. = Rm−1 a+c0 t t Now, by Lemma A.2, one has, for any z between a + c0 t and (a + ∆a ) + (c + ∆c )0 t, |f Y |X=x (zu + Γu t) − f Y |X=x (a + c0 t)u + Γu t | C|z − a − c0 t|s C|∆a + ∆0c t|s ≤ ≤ · 1 + 12 |(z + a + c0 t)u + 2Γu t|2 (3+r+s)/2 |(1, t0 )0 |3+r+s This entails |∆a + ∆0c t|1+s Z |∇Ga+∆a ,c+∆c (x) − ∇Ga,c (x) − Ha,c (x)(∆a , ∆0c )0 | ≤C dt Rm−1 |(1, t0 )0 |2+r+s Z 1 ≤ C|(∆a , ∆0c )0 |1+s dt = o(|(∆a , ∆0c )0 |), Rm−1 |(1, t0 )0 |1+r as (∆a , ∆0c )0 → 0. Therefore, (a, c) 7→ Ga,c (x) is twice continuously differentiable at any x ∈ SX , with Hessian matrix H Ga,c (x) . Eventually, Assumption (A)(iii) implies that ! 1 t0 Z 1 −1 f ε Mb,x (a + c0 t)u + Γu t − Ma,x Ha,c (x) = dt. (A.5) |det(Mb,x )| Rm−1 t tt0 (ii) Positive definiteness then readily follows from (A.5) and Assumption (B). (iii) Every entry of Ha,c (x) is an integral involving an integrand of the form (see (A.5)) δ tδi i tjj −1 f ε Mb,x ((a + c0 t)u + Γu t − Ma,x ) − f ε ((1, t0 )0 ) gi,j (a, c, x, t) = | det(Mb,x )| δ tδi i tjj + f ε ((1, t0 )0 ) =: gi,j I II (a, c, x, t) + gi,j (a, c, x, t), | det(Mb,x )| 19 I (a, c, x, t) and (a, c, x) 7→ g II (a, c, x, t) where δi , δj ∈ {0, 1}. Clearly, for any t, (a, c, x) 7→ gi,j i,j are continuous. Therefore, in view of Theorem 8.5 in Briane and Pagès (2012), it is sufficient to prove that there exist integrable functions hIi,j , hII i,j : R m−1 → R+ such that I |gi,j (a, c, x, t)| ≤ hIi,j (t) and |gi,j II (a, c, x, t)| ≤ hII i,j (t) for any (a, c, x, t). Since Assumptions (A)(ii)-(iii) ensure that det(Mb,x ) stays away from 0 for any x ∈ SX , we δi j ε δ 0 0 can take t 7→ hII i,j (t) := ti tj f ((1, t ) )/(inf x∈SX | det(Mb,x )|), whose integrability follows from the fact that f ε (·) is bounded and ε has finite second-order moments. Now, Lemma A.2 and Assumptions (A)(ii)-(iii) readily entail that there exist r > m − 1 and s > 0 such that δ I (a, c, x, t)| = |tδi i tjj | × f Y |X=x (a + c0 t)u + Γu t − f Y |X=x Mb,x (1, t0 )0 + Ma,x |gi,j δ C|(a + c0 t)u + Γu t − Mb,x (1, t0 )0 − Ma,x |s ≤ |tδi i tjj | 2 (3+r+s)/2 1 + 21 (a + c0 t)u + Γu t + Mb,x (1, t0 )0 + Ma,x 2 −(3+r+s)/2 ≤ C|t|δi +δj (1 + |t|s ) 1 + 12 t + Γ0u Mb,x (1, t0 )0 + Γ0u Ma,x 2 −(3+r+s)/2 ≤ C|t|δi +δj (1 + |t|s ) 1 + 21 (Im−1 + Γ0u Ax )t + Γ0u B x ) , (A.6) where the matrices Ax := (Mb,x ).2 and B x := (Mb,x ).1 − Ma,x are based on the parti- tion Mb,x = ((Mb,x ).1 (Mb,x ).2 ) into an m × 1 matrix (Mb,x ).1 and an m × (m − 1) ma- trix (Mb,x ).2 . Lemma A.3 implies that it is always possible to choose Γu in such a way that Im−1 + Γ0u Ax is invertible. Consequently, one may proceed as in the proof of Lemma A.2 and write 1 + |t|2 2 1 + 12 (Im−1 + Γ0u Ax )t + Γ0u B x 2 1 + (Im−1 + Γ0u Ax )−1 [(Im−1 + Γ0u Ax )t + Γ0u B x ] − (Im−1 + Γ0u Ax )−1 Γ0u B x = 2 1 + 12 (Im−1 + Γ0u Ax )t + Γ0u B x ) 2 C + C (Im−1 + Γ0u Ax )t + Γ0u B x ) ≤ 2 ≤ C, 1 + 1 (Im−1 + Γ0u Ax )t + Γ0u B x ) 2 where we used the fact that x 7→ Ax and x 7→ B x are continuous functions defined over the com- I (a, c, x, t)| ≤ C|t|δi +δj (1 + |t|s ) 1 + |t|2 −(3+r+s)/2 =: pact set SX . Therefore, (A.6) provides |gi,j hIi,j (t), where hIi,j (·) is integrable over Rm−1 (since r > m − 1). The proof of Theorem 2.1 still requires the following lemma. Lemma A.5. Let Assumptions (A) and (B) hold, fix α ∈ B m , and write x̃ = x̃N = Projγ N (x) for any x. Then, (i) for any compact set K(⊂ Rm−1 ), supx∈SX supa∈R,c∈K |G e a,c (x̃) − Ga,c (x)| → 0 as N → ∞; (ii) supx∈S | min(a,c0 )0 ∈Rm G X e a,c (x̃) − min(a,c0 )0 ∈Rm Ga,c (x)| → 0 as N → ∞. 20 e N = x̃] is equivalent to [X ∈ Cx ], where we Proof. (i) Fix a ∈ R and c ∈ K. First note that [X let C x = C N x = {z ∈ SX : Projγ N (z) = x̃}. Hence, one has e N = x̃] − E[ρα (Yu − c0 Y ⊥ − a)|X = x̃] E[ρα (Yu − c0 Y ⊥ − a)|X u u ≤ sup E[ρα (Yu − c0 Yu⊥ − a)|X = z] − E[ρα (Yu − c0 Yu⊥ − a)|X = x̃], z∈Cx which provides |G e a,c (x̃) − Ga,c (x)| e N = x̃] − E[ρα (Yu − c0 Y ⊥ − a)|X = x̃]| ≤ |E[ρα (Yu − c0 Yu⊥ − a)|X u + |E[ρα (Yu − c0 Yu⊥ − a)|X = x̃] − E[ρα (Yu − c0 Yu⊥ − a)|X = x]| ≤ 2 sup |E[ρα (Yu − c0 Yu⊥ − a)|X = z] − E[ρα (Yu − c0 Yu⊥ − a)|X = x̃]| z∈Cx ≤ 2 sup Ga,c (z) − Ga,c (x̃) z∈Cx p ≤ 2 max(α, 1 − α) 1 + |c|2 ([Ma,· ]Lip + [Mb,· ]Lip kεk1 ) sup |z − x̃|, z∈Cx where we used Lemma A.1. It directly follows that, for some C that does not depend on N , sup sup |G e a,c (x̃) − Ga,c (x)| ≤ C sup sup |z − x̃| =: C sup R(Cx ); x∈SX a∈R,c∈K x∈SX z∈Cx x∈SX the quantity R(Cx ) is the “radius” of the cell Cx . The result then follows from the fact that supx∈SX R(Cx ) → 0 as N → ∞; see Lemma A.2(ii) in Charlier et al. (2015a). aN (ii) For simplicity of notations, we write ã = e cN α,x and c̃ = eα,x . From Lemma A.4(ii), v 0 Haα,x ,cα,x (x)v > 0 for any x ∈ SX and any v ∈ S m−1 = {x ∈ Rm : |x| = 1}. The compactness assumption on SX and the continuity of x 7→ Ha,c (x) (Lemma A.4(iii)) yield that inf inf v 0 Haα,x ,cα,x (x)v > 0. x∈SX v∈S m−1 This, jointly with Part (i) of the result, implies that there exists a positive integer N0 and a eN compact set, Kα (⊂ Rm ) say, such that, for all N ≥ N0 and for all x ∈ SX , q α,x and q α,x belong cN to Kα . In particular, for all N ≥ N0 and for all x ∈ SX , eα,x and cα,x belong to a compact set Kαc (⊂ Rm−1 )). Then, with I+ = I[min 0 0 m Gea,c (x̃)≥min 0 0 m Ga,c (x)] , we have (a,c ) ∈R (a,c ) ∈R min G e a,c (x̃) − min Ga,c (x) I+ = (G e ã,c̃ (x̃) − Gaα,x ,cα,x (x))I+ (a,c0 )0 ∈Rm (a,c0 )0 ∈Rm ≤ G e aα,x ,cα,x (x̃) − Gaα,x ,cα,x (x) I+ ≤ sup |G e a,c (x̃) − Ga,c (x)|I+ . (A.7) a∈R,c∈Kαc 21 Similarly, with I− := 1 − I+ , we have min e a,c (x̃) − min G Ga,c (x) I− = Gaα,x ,cα,x (x) − G e ã,c̃ (x̃) I− 0 0 m (a,c ) ∈R 0 0(a,c ) ∈Rm ≤ Gã,c̃ (x) − G e ã,c̃ (x̃) I− ≤ sup |G e a,c (x̃) − Ga,c (x)|I− . (A.8) a∈R,c∈Kαc From (A.7)-(A.8), we readily obtain min G e a,c (x̃) − min Ga,c (x) ≤ sup |G e a,c (x̃) − Ga,c (x)|. (a,c0 )0 ∈Rm (a,c0 )0 ∈Rm a∈R,c∈Kαc The result then directly follows from Part (i) of the result. We can now prove Theorem 2.1. Proof of Theorem 2.1. Write again x̃ = x̃N = Projγ N (x) and fix the same integer N0 and the same compact sets Kα and Kαc as in the proof of Lemma A.5. Then, for x ∈ SX and N ≥ N0 , one has |Gã,c̃ (x) − Gaα,x ,cα,x (x)| ≤ |Gã,c̃ (x) − G e ã,c̃ (x̃)| + |G e ã,c̃ (x̃) − Gaα,x ,cα,x (x)| ≤ sup |Ga,c (x) − G e a,c (x̃) − min Ga,c (x) e a,c (x̃)| + min G a∈R,c∈Kαc a,c a,c ≤ sup sup |Ga,c (x) − G e a,c (x̃) − min Ga,c (x). e a,c (x̃)| + sup min G x∈SX a∈R,c∈Kαc x∈SX a,c a,c Therefore, Lemma A.5 implies that, as N → ∞, sup |Gã,c̃ (x) − Gaα,x ,cα,x (x)| → 0. (A.9) x∈SX Performing a second-order expansion about q α,x = (aα,x , c0α,x )0 provides 1 N 0 eα,x − q α,x H N eN Gã,c̃ (x) − Gaα,x ,cα,x = q ∗ (x) q α,x − q α,x , 2 with H N ∗x := HaN ∗x ,c∗x N N N 0 0 qN N (x), where q ∗x = (a∗x , (c∗x ) ) = θq α,x + (1 − θ)e α,x , for some θ ∈ N 0 (0, 1). Write H N N ∗x = O x Λx O x , where O x is an m × m orthogonal matrix and where Λx = N diag(λN N 1,x , . . . , λm,x ) collects the eigenvalues of H ∗x in decreasing order. We then have 1 N 0 eα,x − q α,x H N eN Gã,c̃ (x) − Ga,c (x) = q ∗x q α,x − q α,x 2 m m 1X N N 2 λN m,x X 2 = eα,x − q α,x j ≥ λj,x O x q eN Ox q α,x − q α,x j 2 2 j=1 i=1 λN m,x 2 λN 2 eN = m,x q N = Ox q α,x − q α,x eα,x − q α,x . 2 2 22 Hence, N 2 −1 eα,x − q α,x ≤ 2 inf sup q inf λN̄ m,x sup |Gã,c̃ (x) − Ga,c (x)|. (A.10) x∈SX N̄ ≥N0 x∈SX x∈SX The result then follows from (A.9) and from the fact inf λN̄ v 0 Ha,c (x)(a,c)=(aN̄ ,cN̄ ) v inf m,x = inf inf inf N̄ ≥N0 x∈SX N̄ ≥N0 x∈SX v∈S m−1 ∗x ∗x ≥ inf inf inf v 0 Ha,c (x)v > 0, x∈SX v∈S m−1 (a,c0 )0 ∈Kα which results from Lemma A.4(ii)-(iii) and the compactness of SX , S m−1 and Kα . B Proof of Theorem 3.1 Let (X1 , Y 1 ), . . . , (Xn , Y n ) be independent copies of (X, Y ). Recall that γ N denotes an optimal quantization grid of size N for the random d-vector X and that γ̂ N,n stands for the grid provided by the CLVQ algorithm on the basis of X1 , . . . , Xn . Below, we will write (x̃N N 1 , . . . , x̃N ) and (x̂N,n N,n 1 , . . . , x̂N ) for the grid points of γ N and γ̂ N,n , respectively. Throughout this section, we assume that the empirical quantization of X, based on X1 , . . . , Xn converges almost surely towards the population one, i.e., X b N,n = Proj N,n (X) → X eN = γ̂ Projγ N (X) almost surely as n → ∞. This is justified by classical results in quantization about the convergence in n of the CLVQ algorithm when N is fixed; see Pagès (1998). The proof of Theorem 3.1 requires Lemmas B.1-B.2 below. Lemma B.1. Let Assumption (A) 0 hold. Fix N ∈ N0 and x ∈ SX . Write x̃ = x̃N = Projγ N (x) bN = X and x̂ = x̂N,n = Projγ̂ N,n (x). Then, with X b N,n = Proj N,n (X i ), i = 1, . . . , n, we have i i γ̂ 1 Pn a.s. e N = x̃]; (i) n i=1 I[X̂ N = x̂N ] −−−→ n→∞ P [X i N,n a.s. N,n a.s. (ii) after possibly reordering the x̃N i ’s, x̂i −−−→ x̃N i , i = 1, . . . , N (hence, γ̂ −−−→ γ N ). n→∞ n→∞ A proof is given in Charlier et al. (2015a). Lemma B.2. Let Assumptions (A) and (B) hold. Fix α = αu ∈ B m , x ∈ SX and N ∈ N0 . Let K (⊂ Rm ) be compact and define Pn 1 n 0 i=1 ρα (u Y i − c0 Γ0u Y i − a) I[X̂ N =x̂] G b a,c (x̂) = b N,n (x̂) G := i . a,c 1 Pn n i=1 I[X̂ N =x̂] i Then (i) sup(a,c0 )0 ∈K |G b a,c (x̂) − G e a,c (x̃)| = oP (1) as n → ∞; 23 (ii) | min(a,c0 )0 ∈Rm G b a,c (x̂) − min(a,c0 )0 ∈Rm G e a,c (x̃)| = oP (1) as n → ∞; (iii) |G e N,n N,n (x̃) − G e N a cN (x̃)| = oP (1) as n → ∞. ba ,b α,xc α,x e α,x ,eα,x Proof. (i) Since N E[ρα (u0 Y − c0 Γ0u Y − a)I[X̃ N =x̃] ] 0 0 e a,c (x̃) = E[ρα (u Y − c G Γ0u Y − a)|X e = x̃] = N , P [X e = x̃] it is sufficient, in view of Lemma B.1(i), to show that n 1 X 0 0 0 0 0 0 sup ρ (u Y − c Γ Y − a) − E ρ (u Y − c Γ Y − a)I = oP (1), α i u i I N α u N [X̃ =x̃] (a,c0 )0 ∈K n [X̂ i =x̂] i=1 as n → ∞. It is natural to decompose it as n 1 X ρα (u0 Y i − c0 Γ0u Y i − a) I[X̂ N =x̂] − E ρα (u0 Y − c0 Γ0u Y − a)I[X̃ N =x̃] sup (a,c0 )0 ∈K n i=1 i ≤ sup |Tac1 | + sup |Tac2 |, (a,c0 )0 ∈K (a,c0 )0 ∈K with n 1X ρα (u0 Y i − c0 Γ0u Y i − a) I[X̂ N =x̂] − I[X̃ N =x̃] , Tac1 := n i i i=1 and n 1X ρα (u0 Y i − c0 Γ0u Y i − a)I[X̃ N =x̃] − E ρα (u0 Y − c0 Γ0u Y − a)I[X̃ N = x̃] , Tac2 := n i i=1 eN with X = Projγ N (X i ), i = 1, . . . , n. i We start by considering Tac2 . Since x 7→ M a,x and x 7→ M b,x are continuous functions defined over the compact set SX , one has that, for all (a, c0 )0 ∈ K, ρα (u0 Y − c0 Γ0u Y − a)I[X̃ N = x̃] ≤ max(α, 1 − α)|u0 Y − c0 Γ0u Y − a| ≤ max(α, 1 − α)|(u − Γu c)0 Y − a| " # ≤ max(α, 1 − α) |u − Γu c| sup |M a,x | + |ε| sup kM b,x k + |a| x∈SX x∈SX ≤ C1 |ε| + C2 , (B.1) for some constants C1 , C2 that do not depend on (a, c0 )0 . Since Assumption (A)(v) ensures that E[|ε|] < +∞ (recall that p = 2 here), the uniform law of large numbers (see, e.g., Theorem 16(a) in Ferguson, 1996) then implies that sup |Tac2 | = oP (1), as n → ∞. (B.2) (a,c0 )0 ∈K 24 It remains to treat Tac1 . Let `n := {i = 1, . . . , n : I[X̂ N =x̂] 6= I[X̃ N =x̃] } be the set collecting i i the indices of observations that are projected on the same point as x for γ N but not for γ̂ N,n , or on the same point as x for γ̂ N,n but not for γ N . Proceeding as in (B.1) then shows that, for any (a, c0 )0 ∈ K, 1X #`n 1 X |Tac1 | ≤ |ρα (u0 Y i − c0 Γ0u Y i − a)| ≤ × (C1 + C2 |εi |) =: S1 × S2 , n n #`n i∈`n i∈`n say. Lemma B.1(ii) implies that S1 = oP (1) as n → ∞. Regarding S2 , the independence between #`n and the εi ’s (which follows from the fact that #`n is measurable with respect to the X i ’s) entails that E[S2 ] = O(1) as n → ∞, hence that S2 = OP (1) as n → ∞. Therefore, sup |Tac1 | ≤ S1 S2 = oP (1) as n → ∞, (a,c0 )0 ∈K which, jointly with (B.2), establishes the result. b = (â, ĉ0 )0 instead of q (ii) For simplicity, we write q̃ = (ã, c̃0 )0 and q eN aN α,x = (e cN α,x , (e 0 0 α,x ) ) bN,n and q aN,n α,x = (b cN,n α,x , (b 0 0 α,x ) ) , respectively. First fix δ > 0 and η > 0, and choose n1 and R large enough to have |q̃| ≤ R and P [|b q| > R] < η/2 for any n ≥ n1 . This is possible since q b is nothing but the sample Hallin et al. (2010) quantile of a number of Y i ’s that increases to infinity (so that, with arbitrary large probability for n large, |b q| cannot exceed 2 supx∈SX |q α,x |). Define KR := {y ∈ Rm : |y| ≤ R}. Then, with I+ := I[min 0 0 m Gba,c (x̂)≥min 0 0 m Gea,c (x̃)] , we have (a,c ) ∈R (a,c ) ∈R min b a,c (x̂) − min G G b â,ĉ (x̂) − G e a,c (x̃)I+ = (G e ã,c̃ (x̃))I+ 0 0 m (a,c ) ∈R 0 0 m (a,c ) ∈R ≤ G b ã,c̃ (x̂) − G e ã,c̃ (x̃) I+ ≤ sup |G b a,c (x̂) − G e a,c (x̃)|I+ , (B.3) (a,c0 )0 ∈KR for n ≥ n1 . Similarly, with I− := 1 − I+ , we have that, under |b q| ≤ R, min G b a,c (x̂) − min G e ã,c̃ (x̃) − G e a,c (x̃)I− = G b â,ĉ (x̂) I− (a,c0 )0 ∈Rm (a,c0 )0 ∈Rm ≤ G e â,ĉ (x̃) − G b â,ĉ (x̂) I− ≤ sup |G b a,c (x̂) − G e a,c (x̃)|I− , (B.4) (a,c0 )0 ∈KR still for n ≥ n1 . By combining (B.3) and (B.4), we obtain that, under |b q| ≤ R, min b a,c (x̂) − min G G e a,c (x̃) ≤ sup |Gb a,c (x̂) − G e a,c (x̃)|, 0 0 m (a,c ) ∈R 0 0 m (a,c ) ∈R (a,c0 )0 ∈KR for n ≥ n1 . Therefore, for any such n, we get P min Ga,c (x̂) − min Ga,c (x̃) > δ b e (a,c0 )0 ∈Rm (a,c0 )0 ∈Rm ≤ P min Ga,c (x̂) − min Ga,c (x̃) > δ, |b q| ≤ R + P |b q| > R b e a,c a,c e a,c (x̃)| > δ + η · h i ≤P sup |G b a,c (x̂) − G (a,c0 )0 ∈KR 2 25 From Part (i) of the lemma, we conclude that, for n large enough, P min G b a,c (x̂) − min G e a,c (x̃) > δ < η, 0 0 m (a,c ) ∈R 0 0 m (a,c ) ∈R as was to be shown. (iii) This proof proceeds in the same way as for (ii). We start with picking N1 and R large enough so that P [|bq| > R] < η/2 for any N ≥ N1 , with η fixed. This yields h i h i η P |Ge â,ĉ (x̃) − G e ã,c̃ (x̃)| > δ ≤ P |G e â,ĉ (x̃) − G e ã,c̃ (x̃)| > δ, |b q| ≤ M + . (B.5) 2 Note then that h i P |Ge â,ĉ (x̃) − G e ã,c̃ (x̃)| > δ, |b q| ≤ M h i h i ≤ P |G e â,ĉ (x̃) − G b â,ĉ (x̂)| > δ/2, |b q| ≤ M + P |G b â,ĉ (x̂) − G e ã,c̃ (x̃)| > δ/2, |b q| ≤ M h i h i ≤P sup |G e a,c (x̃) − G b a,c (x̂)| > δ/2 + P | min G b a,c (x̂) − min G e a,c (x̃)| > δ/2 (a,c0 )0 ∈KR a,c a,c (n) (n) =: p1 + p2 , (n) (n) say. Parts (i) and (ii) of the lemma imply that p1 and p2 can be made arbitrarily small for n large enough. Combining this with (B.5) yields the result. We can now prove Theorem 3.1. Proof of Theorem 3.1. Under the assumptions considered, the function (a, c0 )0 7→ G e a,c (x̃) has a unique minimizer (that is the Hallin et al. (2010) α-quantile of the distribution of Y conditional e N = x̃). Therefore, the convergence in probability of G on X e N,n N,n (x̃) towards Ge N N (x̃) aα,x ,b b cα,x aα,x ,e e cα,x (Lemma B.2(iii)) implies the convergence in probability of the corresponding arguments (note indeed that the function (a, c0 )0 7→ G e a,c (x̃) does not depend on n). References Briane, M. and G. Pagès (2012). Analyse - Théorie de l’intégration: Convolution et transformée de Fourier. Vuibert. Charlier, I., D. Paindaveine, and J. Saracco (2015a). Conditional quantile estimators through optimal quantization. J. Statist. Plann. Inference 156, 14–30. Charlier, I., D. Paindaveine, and J. Saracco (2015b). Conditional quantile estimation based on optimal quantization: from theory to practice. Comput. Statist. Data Anal. 91, 20–39. Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman & Hall/CRC. 26 Graf, S. and H. Luschgy (2000). Foundations of quantization for probability distributions, Volume 1730 of Lecture Notes in Mathematics. Berlin: Springer-Verlag. Hallin, M., Z. Lu, D. Paindaveine, and M. Šiman (2015). Local constant and local bilinear multiple-output quantile regression. Bernoulli 3 (21), 1435–1466. Hallin, M., D. Paindaveine, and M. Šiman (2010). Multivariate quantiles and multiple-output regression quantiles: from L1 optimization to halfspace depth. Ann. Statist. 38 (2), 635–669. Koenker, R. (2005). Quantile Regression. Number 38 in Econometric Society Monographs. Cambridge: Cambridge Univ. Press. Koenker, R. and G. Bassett, Jr. (1978). Regression quantiles. Econometrica 46 (1), 33–50. Pagès, G. (1998). A space quantization method for numerical integration. J. Comput. Appl. Math. 89 (1), 1–38. Pagès, G. and J. Printems (2003). Optimal quadratic quantization for numerics: the Gaussian case. Monte Carlo Methods Appl. 9 (2), 135–165. Serfling, R. (2002). Quantile functions for multivariate analysis: approaches and applications. Statist. Neerlandica 56 (2), 214–232. Special issue: Frontier research in theoretical statistics, 2000 (Eindhoven). Yu, K. and M. C. Jones (1998). Local linear quantile regression. J. Amer. Statist. Assoc. 93 (441), 228–237. 27

References (13)

Briane, M. and G. Pagès (2012). Analyse -Théorie de l'intégration: Convolution et transformée de Fourier. Vuibert.
Charlier, I., D. Paindaveine, and J. Saracco (2015a). Conditional quantile estimators through optimal quantization. J. Statist. Plann. Inference 156, 14-30.
Charlier, I., D. Paindaveine, and J. Saracco (2015b). Conditional quantile estimation based on optimal quantization: from theory to practice. Comput. Statist. Data Anal. 91, 20-39.
Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman & Hall/CRC.
Graf, S. and H. Luschgy (2000). Foundations of quantization for probability distributions, Volume 1730 of Lecture Notes in Mathematics. Berlin: Springer-Verlag.
Hallin, M., Z. Lu, D. Paindaveine, and M. Šiman (2015). Local constant and local bilinear multiple-output quantile regression. Bernoulli 3 (21), 1435-1466.
Hallin, M., D. Paindaveine, and M. Šiman (2010). Multivariate quantiles and multiple-output regression quantiles: from L 1 optimization to halfspace depth. Ann. Statist. 38 (2), 635-669.
Koenker, R. (2005). Quantile Regression. Number 38 in Econometric Society Monographs. Cambridge: Cambridge Univ. Press.
Koenker, R. and G. Bassett, Jr. (1978). Regression quantiles. Econometrica 46 (1), 33-50.
Pagès, G. (1998). A space quantization method for numerical integration. J. Comput. Appl. Math. 89 (1), 1-38.
Pagès, G. and J. Printems (2003). Optimal quadratic quantization for numerics: the Gaussian case. Monte Carlo Methods Appl. 9 (2), 135-165.
Serfling, R. (2002). Quantile functions for multivariate analysis: approaches and applications. Statist. Neerlandica 56 (2), 214-232. Special issue: Frontier research in theoretical statistics, 2000 (Eindhoven).
Yu, K. and M. C. Jones (1998). Local linear quantile regression. J. Amer. Statist. Assoc. 93 (441), 228-237.

About the author

Jérôme Saracco

BORDEAUX INP, Faculty Member

Papers

224

Followers

View all papers from Jérôme Saraccoarrow_forward

Multiple‐Output Quantile Regression through Optimal Quantization

Sign up for access to the world's latest research

Abstract

Related papers

References (13)

Related papers

Related topics

Cited by