{"title": "Affine-Invariant Online Optimization and the Low-rank Experts Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 4747, "page_last": 4755, "abstract": "We present a new affine-invariant optimization algorithm called Online Lazy Newton. The regret of Online Lazy Newton is independent of conditioning: the algorithm's performance depends on the best possible preconditioning of the problem in retrospect and on its \\emph{intrinsic} dimensionality. As an application, we show how Online Lazy Newton can be used to achieve an optimal regret of order $\\sqrt{rT}$ for the low-rank experts problem, improving by a $\\sqrt{r}$ factor over the previously best known bound and resolving an open problem posed by Hazan et al (2016).", "full_text": "A\ufb03ne-Invariant Online Optimization\nand the Low-rank Experts Problem\n\nTomer Koren\nGoogle Brain\n\n1600 Amphitheatre Pkwy\nMountain View, CA 94043\n\nRoi Livni\n\nPrinceton University\n\n35 Olden St.\n\nPrinceton, NJ 08540\n\ntkoren@google.com\n\nrlivni@cs.princeton.edu\n\nAbstract\n\nWe present a new a\ufb03ne-invariant optimization algorithm called Online Lazy Newton.\nThe regret of Online Lazy Newton is independent of conditioning: the algorithm\u2019s\nperformance depends on the best possible preconditioning of the problem in\n\u221a\nretrospect and on its intrinsic dimensionality. As an application, we show how\n\u221a\nOnline Lazy Newton can be used to achieve an optimal regret of order\nrT for the\nlow-rank experts problem, improving by a\nr factor over the previously best known\nbound and resolving an open problem posed by Hazan et al. [15].\n\n1\n\nIntroduction\n\nIn the online convex optimization setting, a learner is faced with a stream of T convex functions\nover a bounded convex domain (cid:88) \u2286 (cid:82)d. At each round t the learner gets to observe a single convex\nfunction ft and has to choose a point xt \u2208 (cid:88). 
The aim of the learner is to minimize the cumulative $T$-round regret, defined as

$$\\sum_{t=1}^{T} f_t(x_t) - \\min_{x \\in \\mathcal{X}} \\sum_{t=1}^{T} f_t(x).$$

In this very general setting, Online Gradient Descent [25] achieves an optimal regret rate of $\\Theta(GD\\sqrt{T})$, where $G$ is a bound on the Lipschitz constants of the $f_t$ and $D$ is a bound on the diameter of the domain, both with respect to the Euclidean norm. For simplicity, let us restrict the exposition to linear losses $f_t(x_t) = g_t^T x_t$, in which case $G$ bounds the maximal Euclidean norm $\\|g_t\\|$; it is well known that the general convex case can be easily reduced to this case.

One often useful way to obtain faster convergence in optimization is to employ preconditioning, namely, to apply a linear transformation $P$ to the gradients before using them to make update steps. In an online optimization setting, if we had access to the best preconditioner in hindsight, we could have achieved a regret rate of the form $\\Theta(\\inf_P G_P D_P \\sqrt{T})$, where $D_P$ is the diameter of the set $P \\cdot \\mathcal{X}$ and $G_P$ is a bound on the norm of the conditioned gradients $\\|P^{-1} g_t\\|$. We shall thus refer to the quantity $G_P D_P$ as the conditioning of the problem when a preconditioner $P$ is used.

In many cases, however, it is more natural to directly assume a bound $|g_t^T x_t| \\le C$ on the magnitude of the losses rather than to assume the bounds $D$ and $G$. In this case, the condition number need not be bounded, and typical guarantees of gradient-based methods such as online gradient descent do not directly apply. In principle, it is possible to find a preconditioner $P$ such that $G_P D_P = O(C\\sqrt{d})$, and if one further assumes that the intrinsic dimensionality of the problem (i.e., the rank of the loss vectors $g_1, \\ldots, g_T$) is $r \\ll d$, the conditioning of the optimization problem can be improved to $O(C\\sqrt{r})$.
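To make the Euclidean baseline concrete, here is a minimal online gradient descent loop for linear losses over a Euclidean ball. This is an illustrative sketch of our own; the loss sequence, domain radius $D/2$, and step size $\\eta = D/(G\\sqrt{T})$ are standard textbook choices, not taken from this paper.

```python
import numpy as np

def ogd_linear(gs, D=2.0):
    # Online gradient descent for linear losses g_t^T x over the Euclidean
    # ball of diameter D, with the standard step size eta = D / (G * sqrt(T)).
    T, d = gs.shape
    G = max(np.linalg.norm(g) for g in gs)
    eta = D / (G * np.sqrt(T))
    x = np.zeros(d)
    total_loss = 0.0
    for g in gs:
        total_loss += g @ x
        x = x - eta * g                  # gradient step
        n = np.linalg.norm(x)
        if n > D / 2:                    # project back onto the ball of radius D/2
            x = x * (D / 2) / n
    # best fixed decision in hindsight is -(D/2) * sum(g) / ||sum(g)||
    best_loss = -(D / 2) * np.linalg.norm(gs.sum(axis=0))
    return total_loss - best_loss        # regret; bounded by G * D * sqrt(T)
```

The returned regret obeys the $O(GD\\sqrt{T})$ guarantee quoted above, but the constants $G$ and $D$ (and hence the bound) are sensitive to a poor linear re-representation of the data, which is the failure mode the affine-invariant algorithm of this paper avoids.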
However, this approach requires one to have access to the transformation $P$, which is typically data-dependent and known only in retrospect.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper we address the following natural question: can one achieve a regret rate of $O(C\\sqrt{rT})$ without explicit prior knowledge of a good preconditioner $P$? We answer this question in the affirmative and present a new algorithm that achieves this rate, called Online Lazy Newton. Our algorithm is a variant of the Online Newton Step algorithm due to Hazan et al. [14] that employs a lazy projection step. While the Online Newton Step algorithm was designed to exploit curvature in the loss functions (specifically, a property called exp-concavity), our adaptation is aimed at general (possibly even linear) online convex optimization and exploits latent low-dimensional structure. It turns out that this adaptation of the algorithm is able to achieve $O(C\\sqrt{rT})$ regret up to a small logarithmic factor, without any prior knowledge of the optimal preconditioner. A crucial property of our algorithm is its affine-invariance: Online Lazy Newton is invariant to any affine transformation of the gradients $g_t$, in the sense that running the algorithm on the gradients $g_t' = P^{-1} g_t$ and applying the inverse transformation $P$ to the produced decisions results in the same decisions that would have been obtained by applying the algorithm directly to the original vectors $g_t$.

As our main application, we establish a new regret rate for the low rank experts problem, introduced by Hazan et al. [15]. The low rank experts setting is a variant of the classical prediction with expert advice problem, where one further assumes that the experts are linearly dependent and their losses span a low dimensional space of rank $r$.
The challenge in this setting is to achieve a regret rate which is independent of the number of experts and depends only on their rank $r$. In this setting, Hazan et al. [15] proved a lower bound of $\\Omega(\\sqrt{rT})$ on the regret, but fell short of providing a matching upper bound and only gave an algorithm achieving a suboptimal $O(r\\sqrt{T})$ regret bound. Applying the Online Lazy Newton algorithm to this problem, we are able to improve upon the latter bound and achieve an $O(\\sqrt{rT \\log T})$ regret bound, which is optimal up to a $\\sqrt{\\log T}$ factor and improves upon the prior bound unless $T$ is exponential in the rank $r$.

1.1 Related work

Adaptive regularization is an important topic in online optimization that has received considerable attention in recent years. The AdaGrad algorithm presented in [9] (as well as the closely related algorithm analyzed in [22]) dynamically adapts to the geometry of the data. In a sense, AdaGrad learns the best preconditioner from a trace-bounded family of Mahalanobis norms. (See Section 2.2 for a detailed discussion and comparison of guarantees.) The MetaGrad algorithm of van Erven and Koolen [23] uses a similar dynamic regularization technique in order to obliviously adapt to possible curvature in the loss functions. Lower bounds for preconditioning when the domain is unbounded have been presented in [7]. These lower bounds are inapplicable, however, once losses are bounded (as assumed in this paper). More generally, going beyond worst case analysis and exploiting latent structure in the data is a very active line of research within online learning. Work in this direction includes adaptation to stochastic i.i.d. data (e.g., [11, 12, 20, 8]), as well as the exploration of various structural assumptions that can be leveraged for better guarantees [4, 12, 13, 5, 19].

Our Online Lazy Newton algorithm is part of a wide family of algorithms named Follow the Regularized Leader (FTRL).
FTRL methods choose at each iteration the minimizer of the past observed losses with an additional regularization term [16, 12, 21]. Our algorithm is closely related to the Follow The Approximate Leader (FTAL) algorithm presented in [14]. The FTAL algorithm is designed to achieve a logarithmic regret rate for exp-concave problems, exploiting the curvature information of such functions. In contrast, our algorithm is aimed at optimizing general convex functions with possibly no curvature; while FTAL performs FTL over the second-order approximation of the functions, Online Lazy Newton instead utilizes a first-order approximation with an additional rank-one quadratic regularizer. Another algorithm closely related to ours is the Second-order Perceptron algorithm of Cesa-Bianchi et al. [3] (which in turn is closely related to the Vovk-Azoury-Warmuth forecaster [24, 1]), a variant of the classical Perceptron algorithm adapted to the case where the data is skewed, or ill-conditioned in the sense used above. Similarly to our algorithm, the Second-order Perceptron employs adaptive whitening of the data to address its skewness. Finally, the SON algorithm, proposed in [18], is an enhanced version of Online Newton Step which utilizes sketching to improve over previous second order online learning algorithms. Similarly to our paper, they propose a version that is completely invariant to linear transformations. Their regret bound (for our setting) depends linearly on the dimension of the ambient space and quadratically on the rank of the loss matrix. In contrast, our regret bounds do not depend on the dimension of the ambient space and are linear in the rank of the loss matrix; these two properties are necessary in order to achieve an optimal regret bound for the low rank experts problem.

This work is highly inspired and motivated by the problem of low rank experts, to which we give an optimal algorithm. The problem was first introduced in [15], where the authors established a regret rate of $\\tilde{O}(r\\sqrt{T})$, where $r$ is the rank of the experts' losses; this was the first regret bound in this setting that did not depend on the total number of experts. The problem has been further studied and investigated by Cohen and Mannor [6], Barman et al. [2] and Foster et al. [10]. Here we establish the first tight upper bound (up to a logarithmic factor) that is independent of the total number of experts $N$.

2 Setup and Main Results

We begin by recalling the standard framework of Online Convex Optimization. At each round $t = 1, \\ldots, T$ a learner chooses a decision $x_t$ from a bounded convex subset $\\mathcal{X} \\subseteq \\mathbb{R}^d$ in $d$-dimensional space. An adversary then chooses a convex cost function $f_t$, and the learner suffers a loss of $f_t(x_t)$. We measure the performance of the learner in terms of the regret, which is defined as the difference between the accumulated loss incurred by the learner and the loss of the best decision in hindsight. Namely, the $T$-round regret of the learner is given by

$$\\mathrm{Regret}_T = \\sum_{t=1}^{T} f_t(x_t) - \\min_{x \\in \\mathcal{X}} \\sum_{t=1}^{T} f_t(x).$$

One typically assumes that the diameter of the set $\\mathcal{X}$ is bounded, and that the convex functions $f_1, \\ldots, f_T$ are all Lipschitz continuous, both with respect to certain norms on $\\mathbb{R}^d$ (typically, the norms are taken as dual to each other). However, a main point of this paper is to refrain from making explicit assumptions on the geometry of the optimization problem, and to design algorithms that are, in a sense, oblivious to it.

Notation. Given a positive definite matrix $A \\succ 0$ we will denote by $\\|\\cdot\\|_A$ the norm induced by $A$, namely, $\\|x\\|_A = \\sqrt{x^T A x}$. The dual norm to $\\|\\cdot\\|_A$ is defined by $\\|g\\|_A^* = \\sup_{\\|x\\|_A \\le 1} |x^T g|$ and can be shown to be equal to $\\|g\\|_A^* = \\|g\\|_{A^{-1}}$.
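As a quick numerical sanity check of the duality $\\|g\\|_A^* = \\|g\\|_{A^{-1}}$, the following small numpy sketch (ours, not part of the paper) verifies it using the standard fact that the supremum is attained at $x = A^{-1}g / \\|A^{-1}g\\|_A$; the matrix and vector below are arbitrary.

```python
import numpy as np

# Check numerically that sup_{||x||_A <= 1} |x^T g| equals ||g||_{A^{-1}}.
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)                          # a positive definite matrix
g = rng.normal(size=4)

dual_norm = np.sqrt(g @ np.linalg.solve(A, g))   # ||g||_{A^{-1}}

x_star = np.linalg.solve(A, g)                   # direction of the maximizer
x_star /= np.sqrt(x_star @ A @ x_star)           # normalize so that ||x||_A = 1
attained = abs(x_star @ g)                       # value of |x^T g| at the maximizer

assert np.isclose(dual_norm, attained)
```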
Finally, for a non-invertible matrix $A$, we denote by $A^{\\dagger}$ its Moore-Penrose pseudo-inverse.

2.1 Main Results

Our main results are affine-invariant regret bounds for the Online Lazy Newton algorithm, which we present below in Section 3. We begin with a bound for linear losses that controls the regret in terms of the intrinsic dimensionality of the problem and a bound on the losses.

Theorem 1. Consider the online convex optimization setting with linear losses $f_t(x) = g_t^T x$, and assume that $|g_t^T x| \\le C$ for all $t$ and $x \\in \\mathcal{X}$. If Algorithm 1 is run with $\\eta < 1/C$, then for every $H \\succ 0$ the regret is bounded as

$$\\mathrm{Regret}_T \\le \\frac{4r}{\\eta} \\log\\Big(1 + \\frac{(D_H G_H T)^2}{r}\\Big) + 3\\eta\\Big(1 + \\sum_{t=1}^{T} |g_t^T x^\\star|^2\\Big), \\qquad (1)$$

where $r = \\mathrm{rank}(\\sum_{t=1}^{T} g_t g_t^T) \\le d$ and

$$D_H = \\max_{x,y \\in \\mathcal{X}} \\|x - y\\|_H, \\qquad G_H = \\max_{1 \\le t \\le T} \\|g_t\\|_H^*.$$

By a standard reduction, the analogous statement for convex losses holds, as long as we assume that the dot-products between gradients and decision vectors are bounded.

Corollary 2. Let $f_1, \\ldots, f_T$ be an arbitrary sequence of convex functions over $\\mathcal{X}$.
Suppose Algorithm 1 is run with $1/\\eta > \\max_t \\max_{x \\in \\mathcal{X}} |\\nabla_t^T x|$. Then, for every $H \\succ 0$ the regret is bounded as

$$\\mathrm{Regret}_T \\le \\frac{4r}{\\eta} \\log\\Big(1 + \\frac{(D_H G_H T)^2}{r}\\Big) + 3\\eta\\Big(1 + \\sum_{t=1}^{T} |\\nabla_t^T x^\\star|^2\\Big), \\qquad (2)$$

where $r = \\mathrm{rank}(\\sum_{t=1}^{T} \\nabla_t \\nabla_t^T) \\le d$ and

$$D_H = \\max_{x,y \\in \\mathcal{X}} \\|x - y\\|_H, \\qquad G_H = \\max_{1 \\le t \\le T} \\|\\nabla_t\\|_H^*.$$

In particular, we can use the theorem to show that as long as we bound $|\\nabla f(x_t)^T x_t|$ by a constant (a significantly weaker requirement than assuming bounds on the diameter of $\\mathcal{X}$ and on the norms of the gradients), one can find a norm $\\|\\cdot\\|_H$ for which the quantities $D_H$ and $G_H$ are properly bounded. We stress again that, importantly, Algorithm 1 need not know the matrix $H$ in order to achieve the corresponding bound.

Theorem 3. Assume that

$$\\max_{1 \\le t \\le T} \\max_{x \\in \\mathcal{X}} |\\nabla_t^T x| \\le C.$$

Let $r = \\mathrm{rank}(\\sum_{t=1}^{T} \\nabla_t \\nabla_t^T) \\le d$, and run Algorithm 1 with $\\eta = \\Theta(\\sqrt{r \\log(T)/T})$. The regret of the algorithm is then at most $O(C\\sqrt{rT \\log T})$.

2.2 Discussion

It is worth comparing our result to previously studied adaptive regularization techniques. Perhaps the most popular gradient method that employs adaptive regularization is the AdaGrad algorithm introduced in [9]. The AdaGrad algorithm enjoys the regret bound depicted in Eq. (3).
It is competitive with any fixed regularization matrix $S \\succ 0$ such that $\\mathrm{Tr}(S) \\le d$:

$$\\mathrm{Regret}_T(\\mathrm{AdaGrad}) = O\\bigg(\\sqrt{d} \\inf_{S \\succ 0,\\, \\mathrm{Tr}(S) \\le d} \\sqrt{\\sum_{t=1}^{T} \\|x^\\star\\|_2^2 \\, \\|\\nabla_t\\|_{S^*}^2}\\bigg), \\qquad (3)$$

$$\\mathrm{Regret}_T(\\mathrm{OLN}) = \\tilde{O}\\bigg(\\sqrt{r} \\inf_{S \\succ 0} \\sqrt{\\sum_{t=1}^{T} \\|x^\\star\\|_S^2 \\, \\|\\nabla_t\\|_{S^*}^2}\\bigg). \\qquad (4)$$

On the other hand, for every matrix $S \\succ 0$, by the generalized Cauchy-Schwarz inequality we have $|\\nabla_t^T x^\\star| \\le \\|\\nabla_t\\|_S^* \\|x^\\star\\|_S$. Plugging this into Eq. (2) and properly tuning $\\eta$ gives a bound which is competitive with any fixed regularization matrix $S \\succ 0$, depicted in Eq. (4).

Our bound improves on AdaGrad's regret bound in two ways. First, the bound in Eq. (4) scales with the intrinsic dimension of the problem: when the true underlying dimensionality of the problem is smaller than the dimension of the ambient space, Online Lazy Newton enjoys a superior regret bound. Furthermore, as demonstrated in [15], the dependence of AdaGrad's regret on the ambient dimension is not an artifact of the analysis, and there are cases where the actual regret grows polynomially with $d$ rather than with the true rank $r \\ll d$.

The second case where the Online Lazy Newton bound can be superior to AdaGrad's is when there exists a conditioning matrix that improves not only the norm of the gradients with respect to the Euclidean norm, but also the norm of $x^\\star$, which is smaller with respect to the optimal norm induced by $S$. More generally, whenever $\\sum_{t=1}^{T} (\\nabla_t^T x^\\star)^2 < \\sum_{t=1}^{T} \\|\\nabla_t\\|_{S^*}^2 \\|x^\\star\\|_2^2$, and in particular when $\\|x^\\star\\|_S < \\|x^\\star\\|_2$, Eq. (4) will produce a tighter bound than the one in Eq. (3).

3 The Online Lazy Newton Algorithm

We next present the main focus of this paper: the affine-invariant algorithm Online Lazy Newton (OLN), given in Algorithm 1. The algorithm maintains two vectors, $x_t$ and $y_t$. The vector $y_t$ is updated at each iteration using the gradient of $f_t$ at $x_t$, via $y_t = y_{t-1} - \\nabla_t$ where $\\nabla_t = \\nabla f_t(x_t)$. The vector $y_t$ is not guaranteed to be in $\\mathcal{X}$, hence the actual prediction of OLN is determined via a projection onto the set $\\mathcal{X}$, resulting in the vector $x_{t+1} \\in \\mathcal{X}$. However, similarly to ONS, the algorithm first transforms $y_t$ via the (pseudo-)inverse of the matrix $A_t$ given by the sum of the outer products $\\sum_{s=1}^{t} \\nabla_s \\nabla_s^T$, and projections are taken with respect to $A_t$. In this context, we use the notation

$$\\Pi_{\\mathcal{X}}^{A}(y) = \\arg\\min_{x \\in \\mathcal{X}} (x - y)^T A (x - y)$$

to denote the projection onto a set $\\mathcal{X}$ with respect to the (semi-)norm $\\|\\cdot\\|_A$ induced by a positive semidefinite matrix $A \\succeq 0$.

Algorithm 1 OLN: Online Lazy Newton

Parameters: initial point $x_1 \\in \\mathcal{X}$, step size $\\eta > 0$
Initialize $y_0 = 0$ and $A_0 = 0$
for $t = 1, 2, \\ldots, T$ do
  Play $x_t$, incur cost $f_t(x_t)$, observe gradient $\\nabla_t = \\nabla f_t(x_t)$
  Rank-1 update: $A_t = A_{t-1} + \\eta \\nabla_t \\nabla_t^T$
  Online Newton step and projection:
    $y_t = y_{t-1} - \\nabla_t$
    $x_{t+1} = \\Pi_{\\mathcal{X}}^{A_t}(A_t^{\\dagger} y_t)$
end for

The main motivation behind working with the $A_t$ as preconditioners is that, as demonstrated in our analysis, the algorithm becomes invariant to linear transformations of the gradient vectors $\\nabla_t$. Indeed, if $P$ is some linear transformation, one can observe that if we run the algorithm on $P\\nabla_t$ instead of $\\nabla_t$, this will transform the solution at step $t$ from $x_t$ to $P^{-1} x_t$.
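The loop above translates almost directly into code. The following is a minimal numpy sketch of Algorithm 1 (our illustration, not a reference implementation): for simplicity the projection step is left as the identity, i.e. $\\mathcal{X} = \\mathbb{R}^d$, and we check the invariance property for an orthogonal transformation $P$, for which the pseudo-inverse identity $(PAP^T)^{\\dagger} = P A^{\\dagger} P^T$ holds exactly.

```python
import numpy as np

def oln(grads, eta, project=lambda z, A: z):
    # Online Lazy Newton (Algorithm 1), sketched for a pre-recorded gradient
    # sequence. project(z, A) should return the A-norm projection of z onto X;
    # the default (identity) corresponds to the unconstrained case X = R^d.
    T, d = grads.shape
    x, y, A = np.zeros(d), np.zeros(d), np.zeros((d, d))
    xs = []
    for t in range(T):
        xs.append(x)                             # play x_t
        g = grads[t]                             # observe gradient
        A = A + eta * np.outer(g, g)             # rank-one update A_t
        y = y - g                                # lazy, unprojected dual step
        x = project(np.linalg.pinv(A) @ y, A)    # x_{t+1} = Pi^{A_t}_X(A_t^dagger y_t)
    return np.array(xs)

# Invariance check: transforming the gradients by an orthogonal P leaves the
# incurred losses g_t^T x_t unchanged (decisions transform as x_t -> P x_t).
rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))
P, _ = np.linalg.qr(rng.normal(size=(6, 6)))     # an orthogonal transformation
xs1 = oln(grads, eta=0.1)
xs2 = oln(grads @ P.T, eta=0.1)                  # run on P g_t instead of g_t
losses1 = np.sum(grads * xs1, axis=1)
losses2 = np.sum((grads @ P.T) * xs2, axis=1)
assert np.allclose(losses1, losses2, atol=1e-5)
```

A full affine-invariant run additionally requires the projection onto $\\mathcal{X}$ in the $A_t$-seminorm, which is a convex quadratic program in general.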
In turn, the cumulative regret is invariant to such transformations. As seen in Theorem 1, this invariance indeed leads to an algorithm with an improved regret bound when the input representation of the data is poorly conditioned.

While the algorithm is very similar to ONS, it differs in several important aspects. First, unlike ONS, our lazy version maintains at each step a vector $y_t$ which is updated without any projections. A projection is then applied only when we need to calculate $x_{t+1}$. In that sense, it can be thought of as a gradient descent method with lazy projections (analogous to dual-averaging methods), while ONS is similar to gradient descent methods with a greedy projection step (reminiscent of mirror-descent type algorithms). The effect of this is a decoupling between past and future conditioning and projections: if the transformation matrix $A_t$ changes between rounds, the lazy approach allows us to condition and project the problem at each iteration independently.

Second, ONS uses an initialization of $A_0 = \\epsilon I_d$ (while OLN uses $A_0 = 0$) and, as a result, it is not invariant to affine transformations. While this difference might seem negligible as $\\epsilon$ is typically chosen to be very small, recall that the matrices $A_t$ are used as preconditioners and their small eigenvalues can be very meaningful, especially when the problem at hand is poorly conditioned.

4 Application: Low Rank Experts

In this section we consider the Low-rank Experts problem and show how the Online Lazy Newton algorithm can be used to obtain a nearly optimal regret in that setting. In the Low-rank Experts problem, which is a variant of the classic problem of prediction with expert advice, a learner has to choose at each round $t = 1, \\ldots, T$ between following the advice of one of $N$ experts.
On round $t$, the learner chooses a distribution over the experts in the form of a probability vector $x_t \\in \\Delta_N$ (here $\\Delta_N$ denotes the $N$-dimensional simplex); thereafter, an adversary chooses a cost vector $g_t \\in [-1, 1]^N$ assigning losses to experts, and the player suffers a loss of $x_t^T g_t \\in [-1, 1]$. In contrast with the standard experts setting, here we assume that in hindsight the experts share a common low rank structure, namely that $\\mathrm{rank}(g_1, \\ldots, g_T) \\le r$ for some $r < N$.

It is known that in the stochastic setting (i.e., when the $g_t$ are drawn i.i.d. from some fixed distribution) a follow-the-leader algorithm will enjoy a regret bound of $\\min\\{\\sqrt{rT}, \\sqrt{T \\log N}\\}$. In [15] the authors asked whether one can achieve the same regret bound in the online setting. Here we answer this question in the affirmative.

Theorem 4 (Low Rank Experts). Consider the low rank experts setting, where $\\mathrm{rank}(g_1, \\ldots, g_T) \\le r$. Set $\\eta = \\sqrt{r \\log(T)/T}$, and run Algorithm 1 with $\\mathcal{X} = \\Delta_N$ and $f_t(x) = g_t^T x$. Then, the obtained regret satisfies

$$\\mathrm{Regret}_T = O(\\sqrt{rT \\log T}).$$

This bound matches the $\\Omega(\\sqrt{rT})$ lower bound of [15] up to a $\\sqrt{\\log T}$ factor, and improves upon their $O(r\\sqrt{T})$ upper bound so long as $T$ is not exponential in $r$. Put differently, if one aims at ensuring an average regret of at most $\\epsilon$, the OLN algorithm would need $O((r/\\epsilon^2) \\log(1/\\epsilon))$ iterations, as opposed to the $O(r^2/\\epsilon^2)$ iterations required by the algorithm of [15].
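Instantiating Algorithm 1 on the simplex requires computing the projection $\\Pi_{\\Delta_N}^{A}$, which is a convex quadratic program. One simple way to approximate it, sketched below under our own choices (the paper does not prescribe a particular solver), is the Frank-Wolfe method, whose linear subproblem over the simplex reduces to a coordinate argmin.

```python
import numpy as np

def project_simplex_A(z, A, iters=500):
    # Approximate Pi^A_Delta(z) = argmin_{x in simplex} (x - z)^T A (x - z)
    # by Frank-Wolfe with exact line search; A is positive semidefinite.
    n = len(z)
    x = np.full(n, 1.0 / n)                  # start at the simplex center
    for _ in range(iters):
        grad = 2 * A @ (x - z)
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0             # linear minimizer over the simplex
        d = s - x
        denom = d @ A @ d
        if denom <= 1e-12:
            break
        gamma = np.clip(-(d @ A @ (x - z)) / denom, 0.0, 1.0)  # exact line search
        x = x + gamma * d                    # stays inside the simplex
    return x
```

Since every iterate is a convex combination of simplex vertices, feasibility is maintained exactly; the objective converges at the usual $O(1/\\mathrm{iters})$ Frank-Wolfe rate.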
We also remark that, since the Hedge algorithm can be used to obtain a regret rate of $O(\\sqrt{T \\log N})$, we can obtain an algorithm with a regret bound of the form $O(\\min\\{\\sqrt{rT \\log T}, \\sqrt{T \\log N}\\})$ by treating Hedge and OLN as meta-experts and applying Hedge over them.

5 Analysis

For the proofs of our main theorems we will rely on the following two technical lemmas.

Lemma 5 ([17], Lemma 5). Let $\\Phi_1, \\Phi_2 : \\mathcal{X} \\mapsto \\mathbb{R}$ be two convex functions defined over a closed and convex domain $\\mathcal{X} \\subseteq \\mathbb{R}^d$, and let $x_1 \\in \\arg\\min_{x \\in \\mathcal{X}} \\Phi_1(x)$ and $x_2 \\in \\arg\\min_{x \\in \\mathcal{X}} \\Phi_2(x)$. Assume that $\\Phi_2$ is $\\sigma$-strongly-convex with respect to a norm $\\|\\cdot\\|$. Then, for $\\phi = \\Phi_2 - \\Phi_1$ we have

$$\\|x_2 - x_1\\| \\le \\frac{1}{\\sigma} \\|\\nabla\\phi(x_1)\\|^*.$$

Furthermore, if $\\phi$ is convex then

$$0 \\le \\phi(x_1) - \\phi(x_2) \\le \\frac{1}{\\sigma} \\big(\\|\\nabla\\phi(x_1)\\|^*\\big)^2.$$

The following lemma is a slight strengthening of a result given in [14].

Lemma 6. Let $g_1, \\ldots, g_T \\in \\mathbb{R}^d$ be a sequence of vectors, and define $G_t = H + \\sum_{s=1}^{t} g_s g_s^T$, where $H$ is a positive definite matrix such that $\\|g_t\\|_H^* \\le \\gamma$ for all $t$. Then

$$\\sum_{t=1}^{T} g_t^T G_t^{-1} g_t \\le r \\log\\Big(1 + \\frac{\\gamma^2 T}{r}\\Big),$$

where $r$ is the rank of the matrix $\\sum_{s=1}^{T} g_s g_s^T$.

Proof. Following [14], we first prove that

$$\\sum_{t=1}^{T} g_t^T G_t^{-1} g_t \\le \\log\\frac{\\det G_T}{\\det H} = \\log\\det\\big(H^{-1/2} G_T H^{-1/2}\\big). \\qquad (5)$$

To this end, let $G_0 = H$, so that we have $G_t = G_{t-1} + g_t g_t^T$ for all $t \\ge 1$.
The well-known matrix determinant lemma, which states that $\\det(A - uu^T) = (1 - u^T A^{-1} u)\\det(A)$, gives

$$g_t^T G_t^{-1} g_t = 1 - \\frac{\\det(G_t - g_t g_t^T)}{\\det G_t} = 1 - \\frac{\\det(G_{t-1})}{\\det G_t}.$$

Using the inequality $1 - x \\le \\log(1/x)$ and summing over $t = 1, \\ldots, T$, we obtain

$$\\sum_{t=1}^{T} g_t^T G_t^{-1} g_t \\le \\sum_{t=1}^{T} \\log\\frac{\\det G_t}{\\det G_{t-1}} = \\log\\frac{\\det G_T}{\\det H},$$

which yields Eq. (5).

Next, observe that $H^{-1/2} G_T H^{-1/2} = I + \\sum_{s=1}^{T} H^{-1/2} g_s g_s^T H^{-1/2}$ and

$$\\mathrm{Tr}\\Big(\\sum_{s=1}^{T} H^{-1/2} g_s g_s^T H^{-1/2}\\Big) = \\sum_{s=1}^{T} \\mathrm{Tr}\\big(g_s^T H^{-1} g_s\\big) = \\sum_{s=1}^{T} \\big(\\|g_s\\|_H^*\\big)^2 \\le \\gamma^2 T.$$

Also, the rank of the matrix $\\sum_{s=1}^{T} H^{-1/2} g_s g_s^T H^{-1/2} = H^{-1/2}(\\sum_{s=1}^{T} g_s g_s^T) H^{-1/2}$ is at most $r$. Hence, all the eigenvalues of the matrix $H^{-1/2} G_T H^{-1/2}$ are equal to 1, except for $r$ of them whose sum is at most $r + \\gamma^2 T$. Denote the latter by $\\lambda_1, \\ldots, \\lambda_r$; using the concavity of $\\log(\\cdot)$ and Jensen's inequality, we conclude that

$$\\log\\det\\big(H^{-1/2} G_T H^{-1/2}\\big) = \\sum_{i=1}^{r} \\log \\lambda_i \\le r \\log\\Big(\\frac{1}{r}\\sum_{i=1}^{r} \\lambda_i\\Big) \\le r \\log\\Big(1 + \\frac{\\gamma^2 T}{r}\\Big),$$

which together with Eq. (5) gives the lemma. □

We can now prove our main results. We begin by proving Theorem 1.

Proof of Theorem 1.
For all $t$, let

$$\\tilde{f}_t(x) = g_t^T x + \\frac{\\eta}{2}(g_t^T x)^2 \\qquad \\text{and set} \\qquad \\tilde{F}_t(x) = \\sum_{s=1}^{t} \\tilde{f}_s(x).$$

Observe that $x_{t+1}$, which is the choice of Algorithm 1 at iteration $t+1$, is the minimizer of $\\tilde{F}_t$; indeed, since $y_t$ is in the column span of $A_t$, we can write, up to a constant,

$$\\tilde{F}_t(x) = -y_t^T x + \\frac{1}{2} x^T A_t x = \\frac{1}{2}\\big(x - A_t^{\\dagger} y_t\\big)^T A_t \\big(x - A_t^{\\dagger} y_t\\big) + \\mathrm{const}.$$

In other words, Algorithm 1 is equivalent to a follow-the-leader algorithm on the functions $\\tilde{f}_t$. Next, fix some positive definite matrix $H \\succ 0$ and let $D_H = \\max_{x,y \\in \\mathcal{X}} \\|x - y\\|_H$ and $G_H = \\max_{1 \\le t \\le T} \\|g_t\\|_H^*$. We have

$$\\tilde{F}_t(x) + \\frac{\\eta}{2}\\|x - x_{t+1}\\|_H^2 = \\frac{1}{2} x^T A_t x - y_t^T x + \\frac{\\eta}{2}\\|x - x_{t+1}\\|_H^2 = \\frac{\\eta}{2}\\|x\\|_{G_t}^2 - y_t^T x - \\eta x_{t+1}^T H x + \\frac{\\eta}{2}\\|x_{t+1}\\|_H^2,$$

where $G_t = H + \\sum_{s=1}^{t} g_s g_s^T$ (recall that $A_t = \\eta \\sum_{s=1}^{t} g_s g_s^T$). In turn, we have that this function is $\\eta$-strongly convex with respect to the norm $\\|\\cdot\\|_{G_t}$, and is minimized at $x = x_{t+1}$.
Then, applying Lemma 5 with $\\Phi_1(x) = \\tilde{F}_{t-1}(x)$ and $\\Phi_2(x) = \\tilde{F}_t(x) + \\frac{\\eta}{2}\\|x - x_{t+1}\\|_H^2$, so that $\\phi(x) = \\tilde{f}_t(x) + \\frac{\\eta}{2}\\|x - x_{t+1}\\|_H^2$, we have

$$\\tilde{f}_t(x_t) - \\tilde{f}_t(x_{t+1}) + \\frac{\\eta}{2}\\|x_t - x_{t+1}\\|_H^2 \\le \\frac{1}{\\eta}\\big(\\|g_t + \\eta g_t g_t^T x_t + \\eta H(x_t - x_{t+1})\\|_{G_t}^*\\big)^2$$
$$\\le \\frac{2}{\\eta}(1 + \\eta g_t^T x_t)^2 \\big(\\|g_t\\|_{G_t}^*\\big)^2 + 2\\eta\\big(\\|H(x_t - x_{t+1})\\|_{G_t}^*\\big)^2 \\qquad (\\text{since } \\|v + u\\|^2 \\le 2\\|v\\|^2 + 2\\|u\\|^2)$$
$$\\le \\frac{8}{\\eta}\\big(\\|g_t\\|_{G_t}^*\\big)^2 + 2\\eta\\big(\\|H(x_t - x_{t+1})\\|_{G_t}^*\\big)^2 \\qquad (\\text{since } 1/\\eta \\ge \\max_{x \\in \\mathcal{X}} |g_t^T x|)$$
$$\\le \\frac{8}{\\eta}\\big(\\|g_t\\|_{G_t}^*\\big)^2 + 2\\eta\\big(\\|H(x_t - x_{t+1})\\|_{H}^*\\big)^2 \\qquad (\\text{since } H \\prec G_t \\Rightarrow H^{-1} \\succ G_t^{-1})$$
$$= \\frac{8}{\\eta}\\big(\\|g_t\\|_{G_t}^*\\big)^2 + 2\\eta\\|x_t - x_{t+1}\\|_H^2.$$

Overall, we obtain

$$\\tilde{f}_t(x_t) - \\tilde{f}_t(x_{t+1}) \\le \\frac{8}{\\eta}\\, g_t^T G_t^{-1} g_t + \\frac{3\\eta}{2}\\|x_t - x_{t+1}\\|_H^2.$$

By the FTL-BTL Lemma (e.g., [16]), we have that $\\sum_{t=1}^{T} \\tilde{f}_t(x_t) - \\tilde{f}_t(x^\\star) \\le \\sum_{t=1}^{T} \\tilde{f}_t(x_t) - \\tilde{f}_t(x_{t+1})$. Hence, we obtain that

$$\\sum_{t=1}^{T} \\tilde{f}_t(x_t) - \\tilde{f}_t(x^\\star) \\le \\frac{8}{\\eta} \\sum_{t=1}^{T} g_t^T G_t^{-1} g_t + \\frac{3\\eta}{2} \\sum_{t=1}^{T} \\|x_t - x_{t+1}\\|_H^2.$$

Plugging in $\\tilde{f}_t(x) = g_t^T x + \\frac{\\eta}{2}(g_t^T x)^2$ and rearranging, we obtain

$$\\sum_{t=1}^{T} g_t^T (x_t - x^\\star) \\le \\frac{8}{\\eta} \\sum_{t=1}^{T} g_t^T G_t^{-1} g_t + \\frac{3\\eta}{2} \\sum_{t=1}^{T} \\|x_t - x_{t+1}\\|_H^2 + \\frac{\\eta}{2} \\sum_{t=1}^{T} (g_t^T x^\\star)^2$$
$$\\le \\frac{8}{\\eta} \\sum_{t=1}^{T} g_t^T G_t^{-1} g_t + \\frac{3\\eta}{2} T D_H^2 + \\frac{\\eta}{2} \\sum_{t=1}^{T} (g_t^T x^\\star)^2$$
$$\\le \\frac{8r}{\\eta} \\log\\Big(1 + \\frac{G_H^2 T}{r}\\Big) + \\frac{3\\eta}{2} T D_H^2 + \\frac{\\eta}{2} \\sum_{t=1}^{T} (g_t^T x^\\star)^2,$$

where the last inequality is by Lemma 6. Finally, note that we have obtained the last inequality for every matrix $H \\succ 0$. By rescaling, that is, re-parametrizing $H \\to H/(T D_H^2)$, we obtain a matrix for which $D_H \\to 1/\\sqrt{T}$ and $G_H \\to \\sqrt{T} D_H G_H$. Plugging these into the last inequality yields the result. □

Proof of Theorem 3. To simplify notation, let us assume that $|\\nabla_t^T x^\\star| \\le 1$. We get from Corollary 2 that for every $\\eta$,

$$\\mathrm{Regret}_T \\le \\frac{2r}{\\eta} \\log\\Big(1 + \\frac{D_H^2 G_H^2 T^2}{r}\\Big) + 3\\eta(1 + T).$$

For each $G_H$ and $D_H$ we can set $\\eta = \\sqrt{(2r/T) \\log(1 + G_H^2 D_H^2 / r)}$ and obtain the regret bound

$$\\mathrm{Regret}_T \\le \\sqrt{rT \\log\\Big(1 + \\frac{D_H^2 G_H^2 T}{r}\\Big)}.$$

Hence, we only need to show that there exists a matrix $H \\succ 0$ such that $D_H^2 G_H^2 = O(r)$. Indeed, set $S = \\mathrm{span}(\\nabla_1, \\ldots, \\nabla_T)$, and denote by $\\mathcal{X}_S$ the projection of $\\mathcal{X}$ onto $S$ (i.e., $\\mathcal{X}_S = P\\mathcal{X}$ where $P$ is the projection onto $S$). Define

$$\\mathcal{B} = \\{\\nabla \\in S : |\\nabla^T x| \\le 1, \\; \\forall x \\in \\mathcal{X}_S\\}.$$

Note that by definition we have that $\\{\\nabla_t\\}_{t=1}^{T} \\subseteq \\mathcal{B}$.
Further, $\\mathcal{B}$ is a symmetric convex set, hence by an ellipsoid approximation we obtain a positive semidefinite matrix $B \\succeq 0$, with positive eigenvalues restricted to $S$, such that

$$\\mathcal{B} \\subseteq \\{\\nabla \\in S : \\nabla^T B \\nabla \\le 1\\} \\subseteq \\sqrt{r}\\, \\mathcal{B}.$$

By duality we have that

$$\\frac{1}{\\sqrt{r}}\\, \\mathcal{X}_S \\subseteq \\frac{1}{\\sqrt{r}}\\, \\mathcal{B}^* \\subseteq \\{v \\in S : v^T B^{\\dagger} v \\le 1\\}.$$

Thus, if $P_S$ is the projection onto $S$, we have for every $x \\in \\mathcal{X}$ that $x^T P_S B^{\\dagger} P_S x \\le r$. On the other hand, for every $\\nabla_t$ we have $\\nabla_t^T B \\nabla_t \\le 1$. We can now choose $H = B^{\\dagger} + \\epsilon I_d$, where $\\epsilon$ is arbitrarily small, and have

$$\\nabla_t^T H^{-1} \\nabla_t = \\nabla_t^T (B^{\\dagger} + \\epsilon I_d)^{-1} \\nabla_t \\le 2 \\qquad \\text{and} \\qquad x^T H x = x^T P_S^T B^{\\dagger} P_S x + \\epsilon\\|x\\|^2 \\le 2r. \\qquad \\Box$$

Acknowledgements

The authors would like to thank Elad Hazan for helpful discussions. RL is supported by the Eric and Wendy Schmidt Fund for Strategic Innovations.

References

[1] K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211-246, 2001.

[2] S. Barman, A. Gopalan, and A. Saha. Online learning for structured loss spaces. arXiv preprint arXiv:1706.04125, 2017.

[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640-668, 2005.

[4] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321-352, 2007.

[5] C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In COLT, pages 6.1-6.20, 2012.

[6] A. Cohen and S. Mannor. Online learning with many experts. arXiv preprint arXiv:1702.07870, 2017.

[7] A. Cutkosky and K.
Boahen. Online learning without prior information. arXiv preprint arXiv:1703.02629, 2017.

[8] S. De Rooij, T. van Erven, P. D. Gr\u00fcnwald, and W. M. Koolen. Follow the leader if you can, hedge if you must. The Journal of Machine Learning Research, 15(1):1281-1316, 2014.

[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.

[10] D. J. Foster, A. Rakhlin, and K. Sridharan. ZigZag: A new approach to adaptive online learning. arXiv preprint arXiv:1704.04010, 2017.

[11] E. Hazan and S. Kale. On stochastic and worst-case models for investing. In Advances in Neural Information Processing Systems, pages 709-717, 2009.

[12] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165-188, 2010.

[13] E. Hazan and S. Kale. Better algorithms for benign bandits. The Journal of Machine Learning Research, 12:1287-1311, 2011.

[14] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In International Conference on Computational Learning Theory, pages 499-513. Springer, 2006.

[15] E. Hazan, T. Koren, R. Livni, and Y. Mansour. Online learning with low rank experts. In 29th Annual Conference on Learning Theory, pages 1096-1114, 2016.

[16] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.

[17] T. Koren and K. Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems, pages 1477-1485, 2015.

[18] H. Luo, A. Agarwal, N. Cesa-Bianchi, and J. Langford. Efficient second order online learning by sketching.
In Advances in Neural Information Processing Systems, pages 902-910, 2016.

[19] A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993-1019, 2013.

[20] A. Rakhlin, O. Shamir, and K. Sridharan. Localization and adaptation in online learning. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 516-526, 2013.

[21] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends\u00ae in Machine Learning, 4(2):107-194, 2012.

[22] M. Streeter and H. B. McMahan. Less regret via online conditioning. Technical report, 2010.

[23] T. van Erven and W. M. Koolen. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems, pages 3666-3674, 2016.

[24] V. Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213-248, 2001.

[25] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
", "award": [], "sourceid": 2490, "authors": [{"given_name": "Tomer", "family_name": "Koren", "institution": "Google"}, {"given_name": "Roi", "family_name": "Livni", "institution": "Princeton"}]}