Accelerating ADMM for efficient simulation and optimization

The alternating direction method of multipliers (ADMM) is a popular approach for solving optimization problems that are potentially non-smooth and with hard constraints. It has been applied to various computer graphics applications, including physical simulation, geometry processing, and image processing. However, ADMM can take a long time to converge to a solution of high accuracy. Moreover, many computer graphics tasks involve non-convex optimization, and there is often no convergence guarantee for ADMM on such problems since it was originally designed for convex optimization. In this paper, we propose a method to speed up ADMM using Anderson acceleration, an established technique for accelerating fixed-point iterations. We show that in the general case, ADMM is a fixed-point iteration of the second primal variable and the dual variable, and Anderson acceleration can be directly applied. Additionally, when the problem has a separable target function and satisfies certain conditions, ADMM becomes a fixed-point iteration of only one variable, which further reduces the computational overhead of Anderson acceleration. Moreover, we analyze a particular non-convex problem structure that is common in computer graphics, and prove the convergence of ADMM on such problems under mild assumptions. We apply our acceleration technique on a variety of optimization problems in computer graphics, with notable improvement on their convergence speed.

. We apply our accelerated ADMM solver to optimize a quad mesh, subject to hard constraints of face planarity and soft constraints of closeness to a reference surface. Our solver leads to a faster decrease of combined residual than the original ADMM, achieving better satisfaction of hard constraints within the same computational time (highlighted in the plot in bottom right).
The alternating direction method of multipliers (ADMM) is a popular approach for solving optimization problems that are potentially non-smooth and with hard constraints. It has been applied to various computer graphics applications, including physical simulation, geometry processing, and image processing. However, ADMM can take a long time to converge to a solution of high accuracy. Moreover, many computer graphics tasks involve non-convex optimization, and there is often no convergence guarantee for ADMM on such problems since it was originally designed for convex optimization. In this paper, we propose a method to speed up ADMM using Anderson acceleration, an established technique for accelerating fixed-point iterations. We show that in the general case, ADMM is a fixed-point iteration of the second primal variable and the dual variable, and Anderson acceleration can be directly applied. Additionally, when the problem has a separable

INTRODUCTION
Many tasks in computer graphics involve solving optimization problems. For example, a geometry processing task may compute the vertex positions of a deformed mesh by minimizing its deformation energy [Sorkine and Alexa 2007], whereas a physical simulation task may optimize the node positions of a system to enforce physics laws that govern its behavior [Martin et al. 2011;Schumacher et al. 2012]. Such tasks are often formulated as unconstrained optimization, where the target function penalizes the violation of certain conditions so that they are satisfied as much as possible by the solution. It has been an active research topic to develop fast numerical solvers for such problems, with various methods proposed in the past [Sorkine and Alexa 2007;Liu et al. 2008;Bouaziz et al. 2012;Wang 2015;Kovalsky et al. 2016;Liu et al. 2017;Shtengel et al. 2017;Rabinovich et al. 2017].
On the other hand, some applications involve optimization with hard constraints, i.e., conditions that need to be enforced strictly. Such constrained optimization problems are often more difficult to solve [Nocedal and Wright 2006]. One possible solution strategy is to introduce a quadratic penalty term for the hard constraints with a large weight, thereby converting it into an unconstrained problem that is easier to handle. However, to strictly enforce the hard constraints, their penalty weight needs to approach infinity [Nocedal and Wright 2006], which can cause instability for numerical solvers. More sophisticated techniques, such as sequential quadratic programming or the interior-point method, can enforce constraints without stability issues. However, these solvers often incur high computational costs and may not meet the performance requirements for graphics applications. It becomes even more challenging for non-smooth problems where the target function is not everywhere differentiable, as many constrained optimization solvers require gradient information and may not be applicable for such cases.
In recent years, the alternating direction method of multipliers (ADMM) [Boyd et al. 2011] has become a popular approach for solving optimization problems that are potentially non-smooth and with hard constraints. The key idea is to introduce auxiliary variables and derive an equivalent problem with a separable target function, subject to a linear compatibility constraint between the original variables and the auxiliary variables [Combettes and Pesquet 2011]. ADMM searches for a solution to this converted problem by alternately updating the original variables, the auxiliary variables, and the dual variables. With properly chosen auxiliary variables, each update step can reduce to simple sub-problems that can be solved efficiently, often in parallel with closed-form solutions. In addition, ADMM does not rely on the smoothness of the problem, and converges quickly to a solution of moderate accuracy [Boyd et al. 2011]. Such properties make ADMM an attractive choice for solving large-scale optimization problems in various applications such as signal processing [Chartrand and Wohlberg 2013;Simonetto and Leus 2014], image processing [Figueiredo and Bioucas-Dias 2010;Almeida and Figueiredo 2013], and computer vision . Recently, ADMM has also been applied for computer graphics problems such as geometry processing [Bouaziz et al. 2013;Neumann et al. 2013;Xiong et al. 2014;Neumann et al. 2014], physics simulation [Gregson et al. 2014;Pan and Manocha 2017;Overby et al. 2017], and computational photography [Heide et al. 2016;Xiong et al. 2017;Wang et al. 2018].
Despite the effectiveness and versatility of ADMM, there are still two major limitations for its use in computer graphics. First, although ADMM converges quickly in initial iterations, its final convergence might be slow [Boyd et al. 2011]. This makes it impractical for problems with a strong demand for solution accuracy, such as those with strict requirements on the satisfaction of hard constraints. Recent attempts to accelerate ADMM such as [Goldstein et al. 2014;Kadkhodaie et al. 2015;Zhang and White 2018] are only designed for convex problems, which limits their applications in computer graphics. Second, ADMM was originally designed for convex problems, whereas many computer graphics tasks involve non-convex optimization. Although ADMM turns out to be effective for many non-convex problems in practice, its convergence for general nonconvex optimization remains an open research question. Recent convergence results such as [Li and Pong 2015;Magnússon et al. 2016;Wang et al. 2019] rely on strong assumptions that are not satisfied by many computer graphics problems.
This paper addresses these two issues of ADMM. First, we propose a method to accelerate ADMM for non-convex optimization problems. Our approach is based on Anderson acceleration [Anderson 1965;Walker and Ni 2011], a well-established technique for accelerating fixed-point iterations. Previously, Anderson acceleration has been applied to local-global solvers for unconstrained optimization problems in computer graphics [Peng et al. 2018]. Our approach expands its applicability to many constrained optimization problems as well as other unconstrained problems where local-solver solvers are not feasible. To this end, we need to solve two problems: (i) we must find a way to interpret ADMM as a fixed-point iteration; (ii) as Anderson acceleration can become unstable, we should define criteria to accept the accelerated iterate and a fall-back strategy when it is not accepted, similar to [Peng et al. 2018]. We show that in the general case ADMM is a fixed-point iteration of the second primal variable and the dual variable, and we can evaluate the effectiveness of an accelerated iterate via its combined residual which is known to vanish when the solver converges. Moreover, when the problem structure satisfies some mild conditions, one of these two variables can be determined from the other one; in this case ADMM becomes a fixed-point iteration of only one variable with less computational overhead, and we can accept an accelerated iterate based on a more simple condition. We apply this method to a variety of ADMM solvers for computer graphics problems, and observe a notable improvement in their convergence rates.
Additionally, we provide a new convergence proof of ADMM on non-convex problems, under weaker assumptions than the convergence results in [Li and Pong 2015;Magnússon et al. 2016;Wang et al. 2019]. For a particular problem structure that is common in computer graphics, we also provide sufficient conditions for the global linear convergence of ADMM. Our proofs shed new light on the convergence properties of non-convex ADMM solvers.

RELATED WORK
Optimization solvers in computer graphics. The development of efficient optimization solvers has been an active research topic in computer graphics. One particular type of method, called localglobal solvers, has been widely used for unconstrained optimization in geometry processing and physical simulation. For geometry processing, Sorkine and Alexa [2007] proposed a local-global approach to minimize deformation energy for as-rigid-as-possible mesh surface modeling. Liu et al. [2008] developed a similar method to perform conformal and isometric parameterization for triangle meshes. Bouaziz et al. [2012] extended the approach to a unified framework for optimizing discrete shapes. For physical simulation,  proposed a local-global solver for optimization-based simulation of mass-spring systems.  extended this approach to the projective dynamics framework for implicit time integration of physical systems via energy minimization.
Local-global solvers often converge quickly to an approximate solution, but may be slow for final convergence. Other methods have been proposed to achieve improved convergence rates. For geometry processing, Kovalsky et al. [2016] achieved a fast convergence of geometric optimization by iteratively minimizing a local quadratic proxy function. Rabinovich et.al. [2017] proposed a scalable approach to compute locally injective mappings, via local-global minimization of a reweighted proxy function. Claici et al. [2017] proposed a preconditioner for fast minimization of distortion energies. Shtengel et al. [2017] applied the idea of majorizationminimization [Lange 2004] to iteratively update and minimize a convex majorizer of the target energy in geometric optimization. Zhu et al. [2018] proposed a fast solver for distortion energy minimization, using a blended quadratic energy proxy together with improved line-search strategy and termination criteria. For physical simulation, Wang [2015] proposed a Chebyshev semi-iterative acceleration technique for projective dynamics. Later, Wang and Yang [2016] developed a GPU-friendly gradient descent method for elastic body simulation, using Jacobi preconditioning and Chebyshev acceleration. Liu et al. [2017] proposed an L-BFGS solver for physical simulation, with faster convergence than the projective dynamics solver from . Brandt et al. [2018] performed projective dynamics simulation in a reduced subspace, to compute fast approximate solutions for high-resolution meshes.
ADMM was originally designed for convex optimization [Gabay and Mercier 1976;Fortin and Glowinski 1983;Eckstein and Bertsekas 1992]. For such problems, its global linear convergence has been established in [Lin et al. 2015;Deng and Yin 2016;Giselsson and Boyd 2017], but these proofs require both terms in the target function to be convex. In comparison, our proof of global linear convergence allows for non-convex terms in the target function, which is better aligned with computer graphics problems. In practice, ADMM works well for many non-convex problems as well [Wen et al. 2012;Chartrand 2012;Chartrand and Wohlberg 2013;Miksik et al. 2014;Lai and Osher 2014;Liavas and Sidiropoulos 2015], but it is more challenging to establish its convergence for general non-convex problems. Only very recently have such convergence proofs been given under strong assumptions [Li and Pong 2015;Magnússon et al. 2016;Wang et al. 2019]. We provide in this paper a general proof of convergence for non-convex problems under weaker assumptions.
It is well known that ADMM converges quickly to an approximate solution, but may take a long time to convergence to a solution of high accuracy [Boyd et al. 2011]. This has motivated researchers to explore acceleration techniques for ADMM. Goldstein et al. [2014] and Kadkhodaie et al. [2015] applied Nesterov's acceleration [Nesterov 1983], whereas Zhang and White [2018] applied GMRES acceleration to a special class of problems where the ADMM iterates become linear. All these methods are designed for convex problems only, which limits their applicability in computer graphics.
Anderson acceleration. Anderson acceleration [Walker and Ni 2011] is an established technique to speed up the convergence of a fixed-point iteration. It was first proposed in [Anderson 1965] for solving nonlinear integral equations, and independently rediscovered later by Pulay [1980;1982] for accelerating the selfconsistent field method in quantum chemistry. Its key idea is to utilize the m previous iterates to compute a new iterate that converges faster to the fixed point. It is indeed a quasi-Newton method for finding a root of the residual function, by approximating its inverse Jacobian using previous iterates [Eyert 1996;Fang and Saad 2009;Rohwedder and Schneider 2011]. Recently, a renewed interest in this method has led to the analysis of its convergence [Toth and Kelley 2015;Toth et al. 2017], as well as its application in various numerical problems [Sterck 2012;Lipnikov et al. 2013;Pratapa et al. 2016;Suryanarayana et al. 2019;Ho et al. 2017]. Peng et al. [Peng et al. 2018] noted that local-global solvers in computer graphics can be treated as fixed-point iteration, and applied Anderson acceleration to improve their convergence. Additionally, to address the stability issue of classical Anderson acceleration [Walker and Ni 2011;Potra and Engler 2013], they utilize the monotonic energy decrease of local-global solvers and only accept an accelerated iterate when it decreases the target energy. Fang and Saad [2009] called classical Anderson acceleration the Type-II method in an Anderson family of multi-secant methods. Another member of the family, called the type-I method, uses quasi-Newton to approximate the Jacobian of the fixed-point residual function instead [Walker and Ni 2011], and has been analyzed recently in . In this paper, we focus our discussion on the type-II method.

OUR METHOD 3.1 Preliminary
ADMM. Let us consider an optimization problem Here x can be the vertex positions of a discrete geometric shape, or the node positions of a physical system at a particular time instance. The quantity Dx + h encodes a transformation of the positions x relevant for the optimization problem, such as the deformation gradient of each tetrahedron element in an elastic object. The notation Φ(x, Dx + h) signifies that the target function contains a term that directly depends on Dx + h, such as elastic energy dependent on the deformation gradient. In some applications, the optimization enforces hard constraints on x or Dx + h, i.e., conditions that need to be strictly satisfied by the solution. Such hard constraints can be encoded using an indicator function term within the target function. Specifically, suppose we want to enforce a condition y ∈ C where y is a subset from the components of x or Dx + h, and C is the feasible set. Then we include the following term into Φ: By definition, if x * is a solution, then the corresponding components y * must satisfy y * ∈ C; otherwise it will result in a target function value +∞ instead of the minimum. Examples of such an approach to modeling hard constraints can be found in [Deng et al. 2015]. In many applications, the optimization problem (1) can be nonlinear, non-convex, and potentially non-smooth. It is challenging to solve such a problem numerically, especially when hard constraints are involved. One common technique is to introduce an auxiliary variable z = Dx + h to derive an equivalent problem where W is a diagonal matrix with positive diagonal elements. W can be the identity matrix in the trivial case, or a diagonal scaling matrix that improves conditioning [Giselsson and Boyd 2017;Overby et al. 2017]. ADMM [Boyd et al. 2011] is widely used to solve such problems. For ease of discussion, let us consider the problem Its solution corresponds to a stationary point of the augmented Lagrangian function Here u is the dual variable and µ > 0 is the penalty parameter. Following [Boyd et al. 2011], we also call x and z the primal variables. ADMM searches for a stationary point by alternately updating x, z and u, resulting in the following iteration scheme [Boyd et al. 2011]: We can also update z before x, resulting in an alternative scheme: In this paper, we refer to the scheme (5) as x-z-u iteration, and the scheme (6) as z-x-u iteration. In both cases, the updates for z and x often reduce to simple subproblems that can potentially be solved in parallel. According to [Boyd et al. 2011], the optimality condition of ADMM is that both its primal residual and dual residual vanish. For both iteration schemes above, the primal residual is defined as As for the dual residual: for the x-z-u iteration it is defined as whereas for the z-x-u iteration it is defined as Intuitively, the primal residual measures the violation of the linear side constraint, whereas the dual residual measures the violation of the dual feasibility condition [Boyd et al. 2011]. Accordingly, ADMM is terminated when both ∥r k +1 p ∥ and ∥r k +1 d ∥ are small enough. Anderson acceleration. ADMM is easy to parallelize and convergences quickly to an approximate solution. However, it can take a long time to converge to a solution of high accuracy [Boyd et al. 2011]. In the following subsections, we will discuss how to apply Anderson acceleration [Walker and Ni 2011] to improve its convergence. Anderson acceleration is a technique to speed up the convergence of a fixed-point iteration G : R n → R n , by utilizing the current iterate as well as m previous iterates. Let q k −m , q k −m+1 , . . . , q k be the latest m + 1 iterates, and denote their residuals under map- . Then the accelerated iterate is computed as where (θ * 1 , . . . , θ * m ) is the solution to a linear least-squares problem: In Eq. (9), β ∈ (0, 1] is a mixing parameter, and is typically set to 1 [Walker and Ni 2011]. We follow this convention throughout this paper. Previously, Anderson acceleration has been applied to speed up local-global solvers in computer graphics [Peng et al. 2018].

Anderson acceleration of ADMM: the general approach
To speed up ADMM with Anderson acceleration, we must first define its iteration scheme as a fixed-point iteration. For the x-z-u iteration, we note that x k +1 is dependent only on z k and u k . Therefore, by treating x k +1 as a function of (z k , u k ), we can rewrite z k +1 , and subsequently u k +1 , as a function of (z k , u k ) as well. In this way, the x-z-u iteration can be treated as a fixed-point iteration of (z, u): Similarly, we can treat the z-x-u scheme as a fixed-point iteration of (x, u). In addition, to ensure stability for Anderson acceleration, we should define criteria to evaluate the effectiveness of an accelerated iterate, as well as a fall-back strategy when the criteria are not met. Goldstein et al. [2014] pointed out that if the problem is convex, For the x-z-u iteration, the combined residual is defined as Here the first term is a measure of the primal residual, whereas the second term is related to the dual residual (7) but without the matrix A T . The combined residual for the z-x-u iteration is defined as Although [Goldstein et al. 2014] only proved the monotonic decrease of the combined residual for convex problems, our experiments show that the combined residual is decreased by the majority of iterates from the non-convex ADMM solvers considered in this paper. Indeed, if ADMM converges to a solution, then both the primal residual Ax k +1 − Bz k +1 − c and the variable changes z k +1 − z k and x k +1 − x k must converge to zero, so the combined residual must converge to zero as well. Therefore, we evaluate the effectiveness of an accelerated iterate by checking whether it decreases the combined residual compared with the previous iteration, and revert to the un-accelerated ADMM iterate if this is not the case.
Algorithm 2: Anderson acceleration for ADMM with z-x-u iteration.
Algorithm 1 summarizes our Anderson acceleration approach for the x-z-u iteration. Note that the evaluation of combined residual requires computing the change of z in one un-accelerated ADMM iteration. However, given an accelerated iterate (z AA , u AA ), it is often difficult to find a pair (z † , u † ) that leads to (z AA , u AA ) after one ADMM iteration (i.e., (z AA , u AA ) = G(z † , u † )). Therefore, we run one ADMM iteration on (z AA , u AA ) instead, and use the resulting values (z ⋆ , u ⋆ ) = G(z AA , u AA ) to evaluate the combined residual. If the accelerated iterate is accepted, then the computation of (z ⋆ , u ⋆ ) can be reused in the next step of the algorithm and incurs no overhead. We can derive an acceleration method for the x-z-u iteration in a similar way, by swapping x and z and adopting Eq. (8) for the computation of combined residual, as summarized in Algorithm 2.
Remark 3.1. If the target function Φ contains an indicator function for a hard constraint on the primal variable updated in the second step of an ADMM iteration (i.e., z in the x-z-u iteration, or x in the zx-u iteration), then after each iteration this variable must satisfy the hard constraint. However, as Anderson acceleration computes the accelerated iterate via an affine combination of previous iterates, the accelerated z AA or x AA may violate the constraint unless its feasible set is an affine space. In other words, the accelerated iterate may not correspond to a valid ADMM iteration, and may cause issues if it is used as a solution. Therefore, to apply Anderson acceleration, we should ensure that Φ contains no indicator function associated with the primal variable updated in the second step of the original ADMM iteration. This does not limit the applicability of our method, because it can always be achieved by introducing auxiliary variables and choosing an appropriate iteration scheme. The simulation in Fig. 4 is an example of changing the iteration scheme to allow acceleration.

ADMM with a separable target function
The general approach in Section 3.2 does not assume any special structure of the target function. When the target function terms for x and z are separable, it is possible to improve the efficiency of acceleration further. To this end, we consider the following problem Moreover, we assume this problem satisfies the following properties: wherex is a constant and G is a symmetric positive definite matrix.
One example of such optimization is the implicit time integration of elastic bodies in [Overby et al. 2017], wherex is the predicted values of node positions x without internal forces, G = M/∆t 2 where M is the mass matrix and ∆t is the integration time step, the auxiliary variable z stacks the deformation gradient of each element, and д(z) sums the elastic potential energy for all elements. For the problem (13), the x-z-u iteration of ADMM becomes And the z-x-u iteration becomes Similar to Remark 3.1, we assume that the target function contains no indicator function for the primal variable updated in the second step. The general acceleration algorithms in Section 3.2 treat ADMM as a fixed-point iteration of (z, u) or (x, u). Next, we will show that if the problem satisfies certain conditions, then ADMM becomes a fixed-point iteration of only one variable, allowing us to reduce the overhead of Anderson acceleration and improve its effectiveness.
Remark 3.2. Without assuming the convexity of function д(·), there may be multiple solutions for the minimization problems in (16) and (18). Throughout this paper, we assume the solver adopts a deterministic algorithm for (16) and (18), so that given the same values of x and u it always returns the same value of z. (15)- (17), under certain conditions u k +1 can be represented as a function of z k +1 : Proposition 3.1. If the optimization problem (13) satisfies Assumptions 3.1 and 3.2, and the function д(z) is differentiable, then the x-z-u iteration (15)-(17) satisfies A proof is given in Appendix A. Proposition 3.1 shows that u k +1 can be recovered from z k +1 . Therefore, we can treat the x-z-u iteration (15)-(17) as a fixed-point iteration of z instead of (z, u), and Algorithm 3: Anderson acceleration for ADMM with x-z-u iteration, on a problem (13) that satisfies Assumptions 3.1, 3.2 and with a differentiable д.
if reset == FALSE AND r ≥ r prev then // Check residual 6 z k = z default ; // Revert to un-accelerated z // Re-compute u and x with (17) and (15) 20 end while apply Anderson acceleration to z alone. From the accelerated z AA , we recover its corresponding dual variable u AA via Eq. (21). This approach brings two major benefits. First, the main computational overhead for Anderson acceleration in each iteration is to update the normal equation system for the problem (10), which involves inner products of time complexity O(mn) where n is the dimension of variables that undergo fixed-point iteration [Peng et al. 2018].
Since B is invertible, u and z are of the same dimension; thus this new approach reduces the computational cost of inner products by half. Another benefit is a more simple criterion for the effectiveness of an accelerated iterate, based on the following property: Proposition 3.2. Suppose the problem (13) satisfies Assumptions 3.1 and 3.2, and the function A proof is given in Appendix B. Note that the left-hand side of (22) has a similar form as the primal residual, but involves the value of x in the next iteration. Accordingly, we evaluate the effectiveness of an accelerated iterate z AA and its corresponding dual variable u AA by first computing a new value x ⋆ according to the x-update step (15), then evaluating a residualr x-z-u = Ax ⋆ − Bz AA − c. We only accept z AA if it leads to a smaller norm of this residual compared to the previous iteration; otherwise, we revert to the last un-accelerated iterate. If z AA is accepted, then x ⋆ can be reused in the next step. The main benefit here is that we do not need to run an additional ADMM iteration to verify the effectiveness of z AA , which incurs less computational overhead when the accelerated iterate is rejected. This acceleration strategy is summarized in Algorithm 3. Fig. 2 shows an example where accelerating z alone leads to a faster decrease of combined residual than accelerating z, u together.
3.3.2 z-x-u iteration. Similar to the previous discussion, when the problem satisfies certain conditions, the z-x-u scheme is a fixedpoint iteration of only one variable. In particular, we have: Proposition 3.3. If the optimization problem (13) satisfies Assumptions 3.1 and 3.2, then the z-x-u iteration (18)-(20) satisfies A proof is given in Appendix C. This property implies that x k +1 can be recovered from u k +1 ; thus we can treat the z-x-u scheme (18)-(20) as a fixed-point iteration of u instead of (x, u). In theory, we can apply Anderson acceleration to the history of u to obtain an accelerated iterate u AA , and recover the corresponding x AA from Eq. (23). However, this would require solving a linear system with matrix G, and can be computationally expensive. Instead, we note that x k +1 and u k +1 are related to by an affine map, and this relation is satisfied by any previous pair of x and u values. Then since u AA is an affine combination of previous u values, we can apply the same affine combination coefficients to the corresponding previous x values to obtain x AA , which is guaranteed to satisfy Eq. (23) with u AA . As the affine combination coefficients are computed from u only, this still reduces the computational cost compared to applying Anderson acceleration to (x, u). Similar to the x-z-u case, we can verify the convergence of the z-x-u iteration by comparing x in the current iteration with the value of z in the next iteration: Proposition 3.4. Suppose the problem (13) satisfies Assumptions 3.1 and 3.2. Let u k +1 = G zxu (u k ) denote the fixed-point iteration of u induced by the x-z-u iteration (18)-(20). Then u k +1 is a fixed point of mapping G zxu (·) if and only if Accordingly, we evaluate the effectiveness of u AA and x AA by computing from them a z ⋆ using Eq. (18), and evaluating the residual We accept u AA if the norm of this residual is smaller than the previous iteration, and revert to the last unaccelerated iterate otherwise. If u AA is accepted, then z ⋆ is reused in the next step. Algorithm 4 summarizes our approach.
Remark 3.3. We have shown that ADMM can be reduced to a fixedpoint iteration of the second primal variable or the dual variable based on Assumptions 3.1 and 3.2, and (for the x-z-u iteration) the smoothness of д. In fact, these assumptions can be further relaxed.
Algorithm 4: Anderson acceleration for ADMM with z-x-u iteration, on a problem (13) that satisfies Assumptions 3.1 and 3.2. 1 r prev = +∞; j = 0; reset = TRUE; k = 0; 2 while TRUE do // Update z with (18) and compute residual with (24) // Check whether the residual increases 5 if reset == FALSE AND r ≥ r prev then // Revert to un-accelerated x, u // Use history of u to compute affine coeffients // Compute accelerated x and u with the coefficients We refer the reader to Appendix E for more details. Fig. 11 is an example of using such relaxed conditions to reduce the fixed-point iteration to one variable.

Convergence analysis
For Anderson acceleration to be applicable, an ADMM solver must be convergent already. However, many ADMM solvers used in computer graphics lack a convergence guarantee due to the nonconvexity of the problems they solve. Although ADMM works well for many non-convex problems in practice, convergence proofs on such problems rely on strong assumptions that are often not satisfied by graphics problems [Li and Pong 2015;Magnússon et al. 2016;Wang et al. 2019]. In this subsection, we discuss the convergence of ADMM on the problem (13) where the term д in the target function can be non-convex. We first provide a set of conditions for linear convergence of ADMM on such problems, and then give more general convergence proofs using weaker assumptions than existing results in the literature. As the problem structure (13) is common in computer graphics, our new results can potentially expand the applicability of ADMM for graphics problems.
To ease the presentation, we first introduce some notation. To account for the fact that the target function may be unbounded from above due to an indicator function, we suppose all the functions are mappings to R {+∞}. Following [Rockafellar 1997], for a function F we define its effective domain and level set as: α is a bounded set for any α ∈ R. Given a set S, let I S and B S denote the interior and the boundary of S, respectively. A function F is continuous on R n if: (i) it is continuous within I dom(F ) in the conventional sense; and (ii) We say a function is Lipschitz differentiable if it is differentiable and its gradient is Lipschitz continuous. Unless specified otherwise, I denotes the identity matrix and the identity map. The symbol conv(S) denotes the convex hull of a set S, and ∂F denotes the set of all sub-differentials for a function F (see [Rockafellar and Wets 2009, Definition 8.3(b)]). For matrix Q, we use ρ(Q) to represent its spectral radius. We will discuss conditions for the ADMM iterates {(x k , z k , u k )} to converge to a stationary point (x * , z * , u * ) of the augmented Lagrangian for problem (13), which is defined by the conditions [Boyd et al. 2011]: Linear convergence. Our discussion involves the following definitions related to the problem (13) and Assumptions 3.1 and 3.2: We denote by ρ(K) the spectral radius of matrix K. To prove linear convergence of ADMM for the problem (13) regardless of its initial value, we need the following assumption: Assumption 3.3. ∇д is Lipschitz differentiable on R n with a Lipschitz constant L, i.e. ∥∇д(z 1 ) − ∇д(z 2 )∥ ≤ L∥z 1 − z 2 ∥ ∀z 1 , z 1 ∈ R n .
Then we have: Theorem 3.1. If Assumptions 3.1-3.3 are satisfied and ρ(K) < 1 2L , then for a sufficiently large µ the x-z-u iteration (15)-(17) converges to a stationary point defined in Eq. (25). Moreover, Proofs are provided in Appendix F. The theorems above rely on Assumption 3.3 which requires the function д to be globally Lipschitz differentiable. This may not be the case for some graphics problems. For example, the StVK energy used for simulation of hyperelastic materials is a quartic function of the deformation gradient, and is locally Lipschitz differentiable but not globally so. For such problems, we can still prove linear convergence with additional conditions on its initial value and penalty parameter. In the following, we use T (x, z) to denote the target function (13). We make the following relaxed assumption aboutд: Assumption 3.4.
For linear convergence of the x-z-u iteration, we assume the following for the initial value (x 0 , z 0 , u 0 ) and penalty parameter µ: Assumption 3.5. ( (2) µ is large enough such that c 1 ≤ 1, where ) ⊂ dom(д) and let L c be a Lipschitz constant of ∇д over this set.
For the z-x-u iteration, we need a different assumption that relies on the following proposition which is proved in Appendix F.3: Proposition 3.5. Let R(A) be the range of matrix A. Then for any x ∈ R(A), ∥Kx∥ ≥ η∥x∥ where η > 0 is a constant depending on K.
The proofs for these two theorems are given in Appendix F. ACM Trans. Graph., Vol. 38, No. 6, Article 163. Publication date: November 2019.

(z, u)-Acceleration z-Acceleration z-Acceleration
x-z-u (Rubber) Fig. 2. Comparison between the ADMM solver in [Overby et al. 2017] and our method according to Algorithm 3, for computing the same frame of a simulation sequence with three elastic bars. Two material stiffness settings ("soft rubber" and "rubber") are used for testing. In both case, our method leads to faster decrease of residuals and accelerates the convergence. For the case with rubber, we also test Algorithm 1 that applies Anderson acceleration to (z, u), which also speeds up the convergence but is less effective than accelerating z alone.
Remark 3.4. Unlike existing linear convergence proofs such as [Lin et al. 2015;Deng and Yin 2016;Giselsson and Boyd 2017], we do not require both f and д to be convex. This makes our proofs applicable to some graphics problems with a non-convex д, such as the elastic body simulation problem in [Overby et al. 2017] where д is an elastic potential energy. In the supplementary material we provide numerical verification of linear convergence on such a problem.
General convergence under weak assumptions. If a linear convergence rate is not needed, the assumptions above can be further relaxed to prove the convergence of ADMM on problem (13): instead of the relation between the matrix K and the Lipschitz constant L, we require the following weak assumption on function д.
A function F : R n → R is called semi-algebraic if its graph { y, F (y) | y ∈ R n } ⊂ R n+1 is a union of finitely many sets each defined by a finite number of polynomial equalities and strict inequalities [Li and Pong 2015]. This assumption covers a large range of functions used in computer graphics. For example, polynomials (such as StVK energy) and rational functions (such as NURBS) are both semi-algebraic. Then we have: Theorem 3.5. Suppose Assumptions 3.1, 3.2, 3.4, 3.5 and 3.7 are satisfied, and Then for a sufficiently large µ the x-z-u iteration (15)-(17) converges to a stationary point defined in Eq. (25), and +∞ n=1 ∥z k +1 − z k ∥ < ∞.
Proofs are given in Appendix F.1 and F.3.  [Kadkhodaie et al. 2015], which are designed for convex problems, are ineffective for this problem instance. Over-relaxation is effective in accelerating the convergence, but not as much as our approach.
Lipschitz differentiable, and for the z-x-u iteration we do not require the matrix A to be of full row rank. This makes our results applicable to a wider range of problems in computer graphics. In particular, for geometry optimization, the reduction matrix A that relates vertex positions to auxiliary variables may not be of full row rank, potentially due to the presence of auxiliary variables that are derived in the same way from vertex positions but involved in different constraints. Although for the z-x-u iteration our assumptions on д are more restrictive than those in [Li and Pong 2015;Wang et al. 2019], such assumptions are still general enough to be satisfied by many graphics problems.

RESULTS
We apply our methods to a variety of ADMM solvers in graphics. We implement Anderson acceleration following the source code released by the authors of [Peng et al. 2018 Fig. 4. For the simulation of a discretized flag with hard constraints that limit its strain, our accelerated solver convergences faster than an ADMM solver.
Here the color-coding shows the deviation from the deformation gradient singular values from their prescribed range. Using the same computational budget to compute a frame, the results with our solver satisfy the strain limiting constraints better.
normalized combined residual R c and normalized forward residual R f to measure convergence: where r c is the combined residual computed from Eq. (11) or (12), r f is the squared norm of the residual of Eq. (22) or (24), N z is the dimension of z, and a > 0 is a scalar that indicates the typical variable range. In the following, for all physical simulation and geometry optimization problems, we set a to the average edge length of the initial discretized model. For image processing problems, we simply set a = 1. For the choice of parameter m, similar to [Peng et al. 2018] we observe that a large m tends to improve the reduction of iteration count but increases the computational overhead per iteration (see Fig. 2). We choose m = 6 by default. Overby et al. [2017] performed physical simulation via the following optimization problem:

Physical simulation
Here x is the node positions of the discretized object, f (x) is a momentum energy of the form (14) with G being a scaled mass matrix, Dx collects the deformation gradient of each element, д(z) is the elastic potential energy, and W is a diagonal scaling matrix that improves conditioning. This problem is solved in (28) using ADMM with the x-z-u iteration. As it satisfies the assumptions in Proposition 3.1, we apply Anderson acceleration to variable z according to Algorithm 3. Our method is implemented based on the source code released by the authors of [Overby et al. 2017] 2 . Fig. 2 compares the simulation performance on three elastic bars subject to horizontal external forces on their two ends. We use the same material stiffness for all bars, and a different elastic potential energy model for each bar (corotational, StVK and neo-Hookean, respectively). We apply the original solver and our solver with different m values to the same problem for a particular frame, and plot their normalized combined residuals and normalized forward residuals through the iterations.
2 https://github.com/mattoverby/admm-elastic The methods are compared on two types of material stiffness ("soft rubber" and "rubber" as defined in the code from [Overby et al. 2017], with the latter one being stiffer). Our method decreases both residuals much faster than the original ADMM solver for each stiffness settings. Moreover, these two residuals are highly correlated, which demonstrates the effectiveness of using the forward residual to verify accelerated iterates according to Proposition 3.2. On the rubber models, we also evaluate the performance of the general approach in Algorithm 1 that accelerates z and u together. We can see that accelerating z alone leads to a faster decrease of the combined residual. One possible reason is that Algorithm 3 explicitly enforces the compatibility condition (21), so that the accelerated z and the recovered u always correspond to a valid intermediate value for a certain ADMM iterate sequence. This property does not hold for the general approach, since it only performs affine combination to obtain the accelerated z and u, which is more akin to finding a new initial value for an ADMM sequence. In Fig. 3, we use the same soft rubber simulation problem to compare our method with existing ADMM acceleration techniques, including [Goldstein et al. 2014] and [Kadkhodaie et al. 2015] which combined Nesterov's acceleration scheme with a restarting rule based on combined residual, as well as over-relaxation [Eckstein and Bertsekas 1992] with a relaxation parameter α ∈ [1.5, 1.8] as explained in [Boyd et al. 2011, §3.4.3]. As [Goldstein et al. 2014;Kadkhodaie et al. 2015] rely on the convexity of the problem, they are ineffective for this non-convex problem and in fact increases the computational time. Although over-relaxation speeds up the decrease of residual, it achieves less acceleration than our method. The solver in [Overby et al. 2017] allows enforcing hard constraints on node positions. Our method can be applied in such cases as well. In Fig. 4, we simulate the movement of a triangulated flag under the wind force. Within д(z), the elastic potential energy for each triangle is defined as the squared Euclidean distance from its deformation gradient to the closest rotation matrix. In addition, д(z) contains an indicator function term for the strain limit of each triangle that requires all singular values of the deformation gradient to be within the range [0.95, 1.05]. Due to such hard constraints for z, we cannot apply our method to the x-z-u iteration (see Remark 3.1). Instead, we adopt the z-x-u iteration and apply Algorithm 4 to accelerate u alone, because the iteration satisfies the assumptions in Proposition 3.3. We compare the original ADMM solver with our accelerated solver with m = 6. To this end, we first apply our solver to compute a simulation sequence, and then re-solve the optimization problem using the original ADMM solver. Fig. 4 plots the normalized forward residual from each solver on three frames, where we see a faster decrease of the residual using our solver. In addition, for these three frames we take the results from both solvers within the same computational time, and use color-coding to illustrate the maximum deviation of its deformation gradient singular values from the prescribed range on each triangle. We can see that our solver leads to better satisfaction of the strain limiting constraints. Hard constraints are also used in [Overby et al. 2017] to handle collision between objects. In Figs. 5 and 6, we apply our method in such scenarios. Here an elastic solid horse model falls under gravity and collides with static objects in the scene. In [Overby et al. 2017], this is handled by enforcing hard constraints on x that prevent the nodes from penetrating the static objects. As this would reduce the x-update step to a time-consuming quadratic programming problem, [Overby et al. 2017] linearizes the constraints and solve the resulting linear system. However, with such modification it is no longer an ADMM algorithm. Therefore, we apply the constraints on z instead and solve the problem using z-x-u iteration, with acceleration according to Algorithm 4. Figs. 5 and 6 plot the normalized forward residual for computing certain frames in the simulation sequence, showing a faster decrease of the residual with our method.

Geometry processing
We also apply our method to an ADMM solver for mesh optimization subject to both soft and hard constraints based on [Deng et al. 2015]. The input is a mesh with vertex positions x, soft constraints A i x ∈ C i (i ∈ S), and hard constraints A j x ∈ C j (j ∈ H ). Here each reduction matrix A i and A j selects vertex positions relevant to the constraint and (where appropriate) compute their differential coordinates with respect to either their mean position or one of the vertices. [Deng et al. 2015] introduce auxiliary variables z i ∈ C i (i ∈ S) and z j ∈ C j (j ∈ H ) to derive an optimization problem min x,z Here ∥L(x −x)∥ 2 is an optional Laplacian fairing energy for the vertex positions and/or for their displacement from initial positions, whereas ∥A i x − z i ∥ 2 penalizes the violation of a soft constraint with a user-specified weight w i . This problem is solved in [Deng et al. 2015] using the augmented Lagrangian method (ALM), where each iteration performs multiple alternate updates of z and x followed by a single update of u, using the same formulas as (6). Wu et al. [2011] pointed out that it is more efficient to perform only one alternate update of primal variables per iteration, in which case ALM reduces to ADMM. Therefore, we solve the problem using ADMM with the z-x-u iteration, and apply the general approach in Algorithm 2 for acceleration because the target function is not separable.

ADMM
Our method Initial Mesh #V: 230400 #F: 229440 Fig. 7. Our method accelerates an ADMM solver for wire mesh optimization, as shown by the normalized combined residual plots. We also show two results computed using ADMM and our accelerated solver within the same computational time (indicated in the bottom-right plot), and evaluate their violation of the angle constraints and edge length constraints using the error metrics in Eq. (30). Our result satisfies these constraints better.  Fig. 8. Comparison of wire mesh optimzation results using our accelerated ADMM solver and an accelerated quadratic penalty method as described in [Peng et al. 2018]. The error metric E is the sum of squared distances from the mesh vertices to the reference shape, and the color-coding illustrates the distance for each vertex. Although the quadratic penalty method can improve satisfaction of the angle and edge length constraints with a larger penalty weight, this leads to greater deviation from the reference shape.
In Fig. 7, we apply our method with m = 6 to the wire mesh optimization problem from [Garg et al. 2014]. The input is a regular quad mesh subject to the following constraints: • Hard constraints: all edges have the same length l; within a face, each angle formed by two incident edges is in the range [ π 4 , 3π 4 ]. • Soft constraint: each vertex lies on a given reference surface.
The mesh is optimized without the Laplacian fairing term, i.e., L = 0. Our method leads to a faster decrease of the combined residual with respect to both the iteration count and the computational time. We also evaluate the violation of hard constraints using the following error metrics for angle α and edge length e: The data and color-coding in Fig. 7 show that within the same computational time, the result from our method satisfies the hard constraints better than the original ADMM. Besides ADMM, another popular approach for enforcing hard constraints is the quadratic penalty method, which replaces the original constrained problem by an unconstrained problem with quadratic terms in the target function to penalize the violation of hard constraints [Nocedal and Wright 2006]. Fig. 8 compares the effectiveness of these two approaches in enforcing hard constraints while decreasing the original target function. For the quadratic penalty method, we use ShapeUp [Bouaziz et al. 2012] with Anderson acceleration as described in [Peng et al. 2018], and solve three problem instances with different penalty weights for hard constraints and fixed weights for the other terms. Each solver is run to full convergence for comparison. We can see that although a larger penalty weight for hard constraints improves their satisfaction, it also leads to relatively less penalty and greater violation of the soft constraints. In particular, with a large penalty weight to satisfy the hard constraints to a similar level as ADMM, the result from the quadratic penalty method deviates much more from the reference surface than ADMM. It shows that ADMM is more effective in satisfying hard constraints without compromising the minimization of the target function, and our method further improves its efficiency.
In Figs. 1 and 9, we apply our method to planar quad mesh optimization, a classical problem in architectural geometry [Liu et al. 2006]. The input is a quad mesh subject to the following constraints: • Hard constraint: vertices within each face lie on a common plane. • Soft constraint: each vertex lies on a given reference surface.
Following [Bouaziz et al. 2012], the reduction matrix for each hard constraint represents the mean-centering operator for the vertices on a common face. The target function includes a Laplacian fairness energy and a relative fairness energy for the vertex positions, as described in [Liu et al. 2011]. We measure the planarity error for each face F of a given mesh using the metric d max (F )/e, where d max (F ) is the maximum distance from a vertex of F to the best fitting plane of its vertices, and e is the average edge length of the mesh. In both Fig. 1 and Fig. 9, our method accelerates the decrease of the combined residual, producing a result with lower planarity error than the original ADMM within the same computational time.

Image processing
In Fig. 10, we apply our method to the ADMM solver from the ProxImaL image optimization framework [Heide et al. 2016]. We compare our method with the original solver on the following problem that computes a deconvoluted image x from an observation image f with Gaussian noise and a convolution operator K: where (∇x) i, j is the image gradient of x at pixel (i, j). This is solved in [Heide et al. 2016] using ADMM with the x-z-u iteration, and we accelerate it using Algorithm 1 with m = 6. We modify the source code of the ProxImaL library 3 to implement our accelerated solver, and use conjugate gradient to solve the linear systems in the update steps. Fig. 10 shows that our method requires less computational time and lower iteration count to achieve the same residual value.
Finally, in Fig. 11, we accelerate the ADMM solver used by the Coded Wavefront Sensor in [Wang et al. 2018] for computing the observed wavefront from a captured image. The wavefront x is computed by solving an optimization problem min x,z where z is an auxiliary variable for image gradient, and д(z) is a quadratic term that measures the consistency between the wavefront and the captured image. From the general condition presented in Appendix E.1, we know that in each iteration the dual variable u k can be represented as a function of z k via Eq. (49). Therefore, we apply Anderson acceleration to z alone. Moreover, as д(z) is quadratic, Eq. (49) implies that z k and u k are related by a linear map. Thus we use the history z to compute the affine combination coefficients for Anderson acceleration, and apply them to both z and u to derive the accelerated z and its compatible u, similar to Algorithm 4. We modify the source code released by the authors of [Wang et al.  2018] 4 to implement our accelerated solver. Fig. 11 compares the normalized combined residual plots between the two solvers, using a test example provided in the released source code. Compared to the original ADMM, our method leads to a significant reduction of computational time and iteration count for the same accuracy. Also included in the comparison is the GMRES acceleration for ADMM proposed in [Zhang and White 2018], which is designed specifically for strongly convex quadratic problems. Following [Zhang and White 2018], we restart GMRES every 10 iterations to reduce computational cost. As a general method, our approach is outperformed by GMRES acceleration, but only by a small margin.

CONCLUSION AND FUTURE WORK
In this paper, we apply Anderson acceleration to improve the convergence of ADMM on computer graphics problems. We show that ADMM can be interpreted as a fixed-point iteration of the second primal variable and the dual variable in the general case, and of only one of them when the problem has a separable target function and satisfies certain conditions. Such interpretation allows us to directly apply Anderson acceleration in the former case, and further reduce its computational overhead in the latter case. Moreover, for each case we propose a simple residual for measuring the convergence, and use it to determine whether to accept an accelerated iterate. We apply this method to a variety of ADMM solvers in graphics, with applications ranging from physics simulation, geometry processing, to image processing. Our method shows its effectiveness on all these problems, with a notable reduction of iteration account and computational time required to reach the same accuracy. On the theoretical front, we also prove the convergence of ADMM for a common non-convex problem structure in computer graphics under weak assumptions. Our work addresses two main limitations of ADMM especially on non-convex problems, which will help to expand its applicability in computer graphics as a versatile solver for 4 https://github.com/vccimaging/MegapixelAO optimization problems that are potentially non-smooth, non-convex, and with hard constraints. One limitation of our method is that it can be less effective for ADMM solvers with very low computational cost per iteration. In this case, the overhead of Anderson acceleration can cause a large relative increase of computational time, which partly cancels out the speedup gained from the reduction of iteration count. One such example is Fig. 12, where we apply our method to the ADMM solver in [Tao et al. 2019] for correcting a vector field into an integrable gradient field of geodesic distance. Although our method reduces the number of iterations, its large relative overhead actually increases the computational time for achieving the same residual.
Our experiments show that Anderson acceleration is effective in reducing the number of iterations, but we do not have a theoretical guarantee for such property. This is still an open research problem, and the only existing result we are aware of is [Evans et al. 2018], which proves that Anderson acceleration improves the convergence rate for linearly converging fixed-point methods if a set of strong assumptions is satisfied. Further theoretical analysis of our method is needed to understand and guarantee its performance.
Currently we follow the convention and set the mixing parameter β = 1 for Anderson acceleration. Although it is effective in our experiments, other values of β = 1 can potentially improve the performance [Eyert 1996]. The optimal choice of mixing parameter remains an open research problem, and should be explored further.
The convergence of ADMM can also be affected by the choice of the penalty parameter and the conditioning of linear side constraints. Recently, researchers have started to analyze the optimal choice of penalty parameter and conditioning for ADMM, but only on simple convex problems [Ghadimi et al. 2015;Giselsson and Boyd 2017]. Overby et al. [2017] proposed a heuristic for choosing such parameters for non-convex physical simulation problems, but there is still no theoretical guarantee for its effectiveness. A potential future research is to perform such analysis on non-convex problems, Fig. 12. We apply our method to the ADMM solver in [Tao et al. 2019] for correcting a vector field into an integrable gradient field. Due to the very low computational cost per iteration in the original solver, Anderson acceleration incurs a large relative overhead. As a result, although our method reduces the number of iterations, it actually increases the computational time.
as well as how they can be used in conjunction with Anderson acceleration to further improve convergence of ADMM.
Finally, as ADMM is a popular solver across different problem domains, we can apply our method to problems outside computer graphics. In this paper we have focused on a problem structure common for graphics tasks. Applications in other domains may involve other problem structures and require different analyses and strategies, which will be an interesting future work.

B PROOF FOR PROPOSITION 3.2
For the first part, suppose z k+1 is the fixed-point of the x-z-u iteration, which means that Then we have For the second part, if Ax k +2 − Bz k+1 − c = 0 then from (17): And from (16) and Remark 3.2 which completes the proof. □
For the second part, suppose Then we have (20) and (41) (20), which completes the proof. □ E FURTHER DISCUSSION FOR PROPOSITIONS 3.1-3.4 We now consider the general condition such that between the second updated primal variable and the dual variable, one of them is a function of the other. We consider the most general case: Unlike Section 3.3, we do not assume any specific form of f and д. We then only need to discuss the following x-z-u iteration because the conclusion for z-x-u iteration is similar: We first need the subproblem (43) and (44) to be well-defined, for which the next condition is sufficient : (C1) f and д are bounded from below and lower-semi continuous.
Then we rewrite the ADMM iteration as: E.1 u as a function of z By (47) and (48): Now we can see that u k +1 is a function z k+1 if and only if: (C2) B is invertible. (C3) ∂д(z) contains exactly one element ∀z ∈ dom(∂д).
From [Rockafellar and Wets 2009, Theorem 9.18] we know that the next condition is sufficient: Moreover, we need additional conditions in order to use Anderson acceleration on z. Note that Anderson acceleration generates z AA by affine combination. So if we want to use (49) to compute u AA from z AA , the following condition is needed: (C4) The domain of ∂д, defined as {z | ∂д(z) ∅}, is affine.

E.2 z as a function of u
From (49) we know that z is a function of u if and only if: (C5) The inverse mapping of set-valued mapping ∂д(z) is a singlevalued mapping.
The next condition is sufficient to ensure (C5) but not necessary: Also, similar to the argument in Appendix E.1, in order to apply Anderson acceleration on u we need the following condition: (C6) The range of ∂д, defined as z∈R n ∂д(z), is affine.

F PROOFS FOR CONVERGENCE THEOREMS
This section proves the linear convergence theorems when д is locally Lipschitz differentiable (Theorems 3.3 and 3.4) and the general convergence theorems (Theorems 3.5 and 3.6). The proofs for Theorems 3.1 and 3.2 are similar to those for Theorems 3.3 and 3.4, so we will not give their complete proofs but only summarize the main steps. Because of the order in which some lemmas are used in the proofs, we will prove Theorem 3.5 and 3.3 first. Without loss of generality, we assume c = 0 in Eq. (13) to simplify notation.
F.1 Proof for Theorem 3.5 Recall that Theorem 3.5 is about general convergence of the x-z-u iteration. We first note that: These two equations will be used frequently in the following. Note that from Assumption 3.4(2) we can derive (33) from (16). Moreover, based on the definition of L c in Assumption 3.5 we have: Proposition F.1. Suppose the Lipschitz constant of ∇д(z) over conv(Lд α ) is L 1 , then ∀ Bz 1 , Bz 2 ∈ Lд α , we have Moreover, if µ > L 1 , and z 2 ∈ argmin z (д(z) + µ 2 ∥Bz − q∥ 2 ), then: The proof is standard so we omit it. Also see [Nesterov 2013 Proof. By the definition of z k +1 in (16): And notice the definition of x k +1 in (15): (54) Combine the two equations above: By (17) and (34): (55) Notice that L(x k , z k , u k ) ≤ T (x 0 , z 0 ) and by the definition of c 1 : д(z k +1 ) ≤ T (x 0 , z 0 ) + c 1 .
Thus we have proved the first part. For the second part, we have: Here (56) is derived from (54), (57) is derived from Assumption 3.5(2) and Proposition F.1, and (58) is trivial. Add them together, and then use (34) and the fact that Which completes the proof. Then the x-z-u iteration satisfies д(z k ) ≤ T (x 0 , z 0 ) + c 1 , L(x k , z k , u k ) ≤ T (x 0 , z 0 ).
By Proposition F.2, Assumption 3.4(3) has the same effect as the Lipschitz differentiability assumption. The next step is similar to the convergence proof in [Wang et al. 2019], which requires the following properties for the sequence (x k , z k , u k ): (P1) Boundedness: the generated sequence (x k , z k , u k ) is bounded, and L(x k , z k , u k ) is lower bounded. (P2) Sufficient descent: there is a constant C 1 (µ) > 0 such that for sufficiently large k, we have: L(x k , z k , u k ) − L(x k +1 , z k +1 , u k +1 ) ≥ C 1 (µ)(∥B(z k +1 − z k )∥ 2 + ∥A(x k +1 − x k )∥ 2 ).
(P4) Limiting continuity: if (x * , z * , u * ) is the limit point of the sub-sequence (x k s , z k s , u k s ) for s ∈ N, then we have: lim s→∞ L(x k s , z k s , u k s ) = L(x * , z * , u * ).
Note that although the x-z-u iteration is not same as the one defined in [Li and Pong 2015], the proof for [Li and Pong 2015, Theorem 3] is not affected by the difference. Combining it with [Wang et al. 2019, Proposition 2], we can prove Theorem 3.5: Proof for Theorem 3.5. From [Wang et al. 2019, Proposition 2], [Li and Pong 2015, Theorem 3], and Proposition F.2 in our paper, we only need to show (P1)-(P4) hold for (x k , z k , u k ).
From Assumption 3.4(1) д(z) is level-bounded and G is invertible so f (x) is also level-bounded, thus (x k , z k ) is bounded. The boundedness of u k can be derived from (33). The lower boundedness of L(x k , z k , u k ) comes from Assumption 3.5(2) and the fact that T (x, z) ≥ 0. In fact we have: L(x k , z k , u k ) ≥ −c 1 .
Similar to (72) we have: Then let µ > max For the proof of Theorem 3.1, the derivation can start from (71) without assumptions on the initial values. The rest of the proofs is the same as the proof of Theorem 3.3. For the proof of Theorem 3.2, the derivation of (89) does not rely on the initial value. The rest is the same.