Anderson Acceleration for Nonconvex ADMM Based on Douglas-Rachford Splitting

The alternating direction method of multipliers (ADMM) is widely used in computer graphics for solving optimization problems that can be nonsmooth and nonconvex. It converges quickly to an approximate solution, but can take a long time to converge to a high-accuracy solution. Previously, Anderson acceleration has been applied to ADMM by treating it as a fixed-point iteration for the concatenation of the dual variables and a subset of the primal variables. In this paper, we note that the equivalence between ADMM and Douglas-Rachford splitting reveals that ADMM is in fact a fixed-point iteration in a lower-dimensional space. By applying Anderson acceleration to this lower-dimensional fixed-point iteration, we obtain a more effective approach for accelerating ADMM. We analyze the convergence of the proposed acceleration method on nonconvex problems, and verify its effectiveness on a variety of computer graphics problems including geometry processing and physical simulation.


Introduction
Numerical optimization is commonly used in computer graphics, and finding a suitable solver is often instrumental to the performance of the algorithm. For an unconstrained problem with a simple smooth target function, gradient-based solvers such as gradient descent or the Newton method are popular choices [NW06]. On the other hand, for more complex problems, such as those with a nonsmooth target function or with nonlinear hard constraints, it is often necessary to employ more sophisticated optimization solvers to achieve the desired performance. For example, proximal splitting methods [CP11] are often used to handle nonsmooth optimization problems with or without constraints. The basic idea is to introduce auxiliary variables to replace some of the original variables in the target function, while enforcing consistency between the original variables and the auxiliary variables with a soft or hard constraint. This often allows the problem to be solved via alternating updates of the variables, which reduce to simple sub-problems that can be solved efficiently. One example of such proximal splitting methods is the local-global solvers commonly used for geometry processing and physical simulation [SA07, LZX*08, BDS*12, LBOK13, BML*14].
Another popular type of proximal splitting method, the alternating direction method of multipliers (ADMM) [BPC*11], is designed for the following form of optimization: where y is the dual variable, and β ∈ R+ is a penalty parameter. This formulation is general enough to represent a large variety of optimization problems. For example, any additional hard constraint can be incorporated into the target function using an indicator function that vanishes if the constraint is satisfied and has value +∞ otherwise. The above iteration often has a low computational cost, where each sub-problem can be solved in parallel and/or in a closed form. The solver can handle nonsmooth problems, and typically converges to an approximate solution in a small number of iterations [BPC*11]. Moreover, although ADMM was initially designed for convex problems, it has proved to be also effective for many nonconvex problems [WYZ19]. Such properties make it a popular solver for large-scale optimization in computer graphics [NVW*13, NVT*14, OBLN17], computer vision [LFYL18, WG19], and image processing [FB10, AF13, HDN*16].
Despite its popularity, a major drawback of ADMM is that it can take a long time to converge to a solution of high accuracy. This limitation has motivated various work on accelerating ADMM, with a focus on convex problems [GOSB14, KCSB15, ZW18]. For nonconvex ADMM, an acceleration technique was proposed recently in [ZPOD19]. By treating the steps (2)-(4) as a fixed-point iteration of the variables (x, y), it speeds up the convergence using Anderson acceleration [And65], a well-known acceleration technique for fixed-point iterations. It is also shown in [ZPOD19] that for problems with a separable target function that satisfies certain assumptions, ADMM can be treated as a fixed-point iteration on a reduced set of variables, which further reduces the overhead of Anderson acceleration.
In this paper, we propose a novel acceleration technique for nonconvex ADMM from a different perspective. We note that if the target function is separable in x and z, then ADMM is equivalent to Douglas-Rachford (DR) splitting [DR56], a classical proximal splitting method. Such equivalence enables us to interpret ADMM through its equivalent DR splitting form, which turns out to be a fixed-point iteration for a linear transformation of the ADMM variables, with the same dimensionality as the dual variable y. As a result, we can apply Anderson acceleration to this alternative form of fixed-point iteration, which often has a much lower dimensionality than the fixed-point iteration of (x, y) utilized in [ZPOD19] for the general case, and hence a lower computational overhead. Moreover, compared to the other acceleration technique in [ZPOD19] based on reduced variables, our new approach has the same dimensionality for the fixed-point iteration but requires a much weaker assumption on the optimization problem. To achieve stability of the Anderson acceleration, we propose two merit functions for determining whether an accelerated iterate can be accepted: 1) the DR envelope, with a strong guarantee for global convergence of the accelerated solver, and 2) the primal residual norm, which provides fewer theoretical guarantees but incurs lower computational overhead. To the best of our knowledge, this is the first global convergence proof for Anderson acceleration on nonconvex ADMM. We evaluate our method on a variety of nonconvex ADMM solvers used in computer graphics and other domains. Thanks to its low dimensionality and strong theoretical guarantee, our method achieves more effective acceleration than [ZPOD19] in many of the experiments.
To summarize, our main contributions include:
• We propose an acceleration technique for nonconvex ADMM solvers, by utilizing their equivalence to DR splitting and applying Anderson acceleration to the fixed-point iteration form of DR splitting. We also propose two types of merit functions that can be used to verify the effectiveness of an accelerated iterate, as well as acceptance criteria for the iterate based on the merit functions.
• We prove the convergence of our accelerated solver under appropriate assumptions on the problem and the algorithm parameters.

Related Works
ADMM. ADMM is a variant of the augmented Lagrangian scheme that uses partial updates for the dual variables, and is commonly used for optimization problems with separable target functions and linear constraints. It is well known that ADMM suffers from slow convergence to a high-accuracy solution, and different strategies have been proposed in the past to speed up its convergence, e.g., using Nesterov's acceleration [GOSB14, KCSB15] or GMRES [ZW18]. However, these acceleration methods focus on convex problems, while many problems in computer graphics are nonconvex. Similar to ADMM, DR splitting also needs a large number of iterations to converge to a solution of high accuracy [FZB19]. This has motivated research on acceleration techniques for DR splitting, such as adaptive synchronization [BKW*19] and momentum acceleration [ZUMJ19]. Anderson acceleration and similar adaptive acceleration strategies have also been used to accelerate DR splitting [FZB19, PL19]. However, these works consider convex problems only, and their convergence proofs rely heavily on the convexity. Thus they are not applicable to the nonconvex problems considered in this paper.
The equivalence between ADMM and DR splitting is well known for convex problems [Glo83]. Some existing methods utilize this connection to accelerate ADMM [PJ16, PL19], but they are only applicable to convex problems. Our method is based on the equivalence between ADMM and DR splitting for nonconvex problems, which has only been established very recently [BK15, YY16, TP20].

Algorithm
In this section, we first introduce the background for ADMM, DR splitting, and Anderson acceleration. Then we discuss the equivalence between ADMM and DR splitting on nonconvex problems, and derive an Anderson acceleration technique for ADMM based on its equivalent DR splitting form.

Preliminary
ADMM. In this paper, we focus on ADMM for the following optimization problem with a separable target function: with the ADMM steps given by: Throughout this paper, we assume that the solutions to sub-problems (6) and (8) always exist. Note that for each sub-problem, it is possible that there exist multiple solutions. Like [ZPOD19], we assume that the solver for each sub-problem is deterministic and always returns the same solution given the same input, so that the argmin operator is single-valued. Although the order of the steps here appears different from the standard scheme in Eqs. (3)-(4), they are actually equivalent since they have the same relative order between the steps. We adopt this notation instead of the standard scheme because it facilitates our discussion of the equivalence with DR splitting. A commonly used convergence criterion for ADMM is that both the primal residual r_p^k and the dual residual r_d^k vanish [BPC*11]: The primal and dual residuals measure the violation of the linear side constraint and the dual feasibility condition of problem (5), respectively [BPC*11]. An alternative criterion is a vanishing combined residual [GOSB14]: which is a sufficient condition for vanishing primal and dual residuals. Moreover, the combined residual decreases monotonically for convex problems [GOSB14].
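To make the scheme concrete, the ADMM steps and the primal/dual stopping test can be sketched in Python on a toy instance of (5), with f(x) = 0.5‖x − b‖², g(z) = λ‖z‖₁, and the constraint x − z = 0 (i.e., A = B = I, c = 0). The function names, the update order, and the closed-form sub-problem solutions below are our own illustration, not the solvers used in our experiments:

```python
import numpy as np

def soft_threshold(t, tau):
    # proximal mapping of tau*||.||_1 (elementwise soft-thresholding)
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def admm_toy(b, lam=0.5, beta=1.0, tol=1e-8, k_max=500):
    """ADMM for min 0.5||x - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    x = np.zeros_like(b)
    z = np.zeros_like(b)
    y = np.zeros_like(b)   # dual variable
    for _ in range(k_max):
        # x-update: argmin 0.5||x - b||^2 + y.(x - z) + (beta/2)||x - z||^2
        x = (b + beta * z - y) / (1.0 + beta)
        z_old = z
        # z-update: prox of (lam/beta)*||.||_1 at x + y/beta
        z = soft_threshold(x + y / beta, lam / beta)
        # dual update
        y = y + beta * (x - z)
        r_p = np.linalg.norm(x - z)             # primal residual
        r_d = beta * np.linalg.norm(z - z_old)  # dual residual
        if max(r_p, r_d) < tol:
            break
    return z

b = np.array([2.0, -0.3, 1.0])
print(admm_toy(b, lam=0.5))  # approaches soft_threshold(b, 0.5)
```

For this toy problem the minimizer is the soft-thresholding of b, which the iteration approaches as both residuals vanish.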
DR splitting. DR splitting has been used to solve optimization problems of the following form: with an iteration scheme: where γ ∈ R+ is a constant and prox_h denotes the proximal mapping of a function h, i.e., prox_h(x) := argmin_y h(y) + ‖y − x‖²/2. Similar to our treatment of ADMM, we assume that there always exists a solution to the minimization problem above, and that its solver always returns the same result given the same input, so that the proximal operator is single-valued. Although DR splitting has been primarily used for convex optimization, recent results show that it is also effective for nonconvex problems [LP16]. Later in Section 3.2, we will show that the ADMM steps (6)-(8) are equivalent to the DR splitting scheme (12)-(14) for two functions ϕ1, ϕ2 derived from the target function and the linear constraint in Eq. (5).
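For the same toy splitting as above (f(x) = 0.5‖x − b‖², g = λ‖·‖₁), the DR steps (12)-(14) can be sketched as below. Here the two proximal mappings are simple closed-form stand-ins of our own choosing; for the actual problem (5), ϕ1 and ϕ2 have no closed form and their proximal operators must be evaluated through sub-problems, as discussed later:

```python
import numpy as np

def soft_threshold(t, tau):
    # proximal mapping of tau*||.||_1
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def dr_splitting(b, lam=0.5, gamma=1.0, k_max=500, tol=1e-10):
    """DR splitting with phi1 = 0.5||. - b||^2 and phi2 = lam*||.||_1."""
    s = np.zeros_like(b)
    for _ in range(k_max):
        u = (s + gamma * b) / (1.0 + gamma)           # u = prox_{gamma*phi1}(s)
        v = soft_threshold(2.0 * u - s, gamma * lam)  # v = prox_{gamma*phi2}(2u - s)
        s = s + v - u                                 # s-update
        if np.linalg.norm(v - u) < tol:               # fixed-point residual
            break
    return u

b = np.array([2.0, -0.3, 1.0])
print(dr_splitting(b))  # approaches the minimizer soft_threshold(b, 0.5)
```

At a fixed point of the s-update, u is a minimizer of ϕ1 + ϕ2, which for this toy problem is again the soft-thresholding of b.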
Anderson Acceleration. Given a fixed-point iteration x_{k+1} = G(x_k), Anderson acceleration [And65, WN11] aims at speeding up its convergence to a fixed point x* where the residual G(x) − x vanishes. Its main idea is to use the residuals of the latest iterate x_k and its previous m iterates x_{k−1}, ..., x_{k−m} to find a new iterate x_{k+1}^{AA} with a small residual. This is achieved via an affine combination of the images of x_k, x_{k−1}, ..., x_{k−m} under the fixed-point mapping G: where the coefficients are found by solving a least-squares problem:
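The least-squares problem for the affine-combination coefficients is commonly solved in an equivalent unconstrained form over residual differences [WN11]. A minimal numerical sketch on a toy fixed-point map (all names are ours; no safeguarding or regularization is included):

```python
import numpy as np

def anderson_step(S, GS):
    """One Anderson extrapolation from histories S = [s_{k-m}, ..., s_k] and
    GS = [G(s_{k-m}), ..., G(s_k)], stacked as rows of two arrays."""
    R = GS - S                    # residuals r_i = G(s_i) - s_i
    dR = np.diff(R, axis=0).T     # columns: r_{i+1} - r_i
    dG = np.diff(GS, axis=0).T    # columns: G(s_{i+1}) - G(s_i)
    # unconstrained least-squares form of the affine-combination problem
    theta, *_ = np.linalg.lstsq(dR, R[-1], rcond=None)
    return GS[-1] - dG @ theta

def G(x):
    # a contractive toy fixed-point map (our own example)
    return 0.5 * np.cos(x) + 0.3

x_prev = np.array([0.0])
x = G(x_prev)
S, GS = [x_prev, x], [G(x_prev), G(x)]
for _ in range(10):
    x = anderson_step(np.array(S[-6:]), np.array(GS[-6:]))  # window m = 5
    S.append(x)
    GS.append(G(x))
print(np.linalg.norm(G(x) - x))  # the fixed-point residual drops rapidly
```

Compared with the plain iteration x_{k+1} = G(x_k), which contracts by roughly a factor 0.5 per step on this example, the extrapolated iterates reach a much smaller residual within the same number of steps.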

Anderson Acceleration Based on DR Splitting
The derivation of our acceleration method relies on the equivalence between ADMM and DR splitting from [TP20], which we review in the following. To facilitate the presentation, we first introduce a notation from [TP20]: Definition 3.1. Given f : R^n → R ∪ {+∞} and A ∈ R^{p×n}, the image function f_A : R^p → [−∞, +∞] is defined as Note that we adopt a different symbol for the image function than the one used in [TP20] to improve readability. The equivalence between ADMM and DR splitting is given as follows: Proposition 3.2. ([TP20, Theorem 5.5]) Suppose (x, y, z) ∈ R^m × R^n × R^p, and let (x+, y+, z+) be generated by the ADMM iteration (6)-(8) from (x, y, z). Define Then we have: where γ = 1/β, and Proposition 3.2 shows that for the optimization problem (5), we can find the functions ϕ1 and ϕ2 in the problem (11) such that the DR splitting steps (12)-(14) are related to the ADMM steps (6)-(8) via the transformation defined in Eq. (16).
According to the DR splitting steps (13) and (14), both u_k and v_k are functions of s_k. Then the step (12) indicates that s_{k+1} can be written as a function of s_k only: where I denotes the identity operator. In other words, the DR splitting steps can be considered as a fixed-point iteration of s, which is a transformation of the variables x and y of its equivalent ADMM solver. Therefore, we can apply Anderson acceleration to the s variable in DR splitting to speed up the convergence. One tempting approach is to compute the value of s according to Eq. (16) after each ADMM iteration and apply Anderson acceleration. This would not work in general, however, because from an accelerated value of s we cannot recover the values of x and y to carry on the subsequent ADMM steps. Instead, we perform Anderson acceleration on DR splitting, and derive the ADMM solution x, y, z based on the final values of the DR splitting variables s, u, v. To implement this idea, we still need to resolve a few problems. First, we need to determine the specific forms of the proximal operators prox_γϕ1 and prox_γϕ2 used in DR splitting. Second, similar to [ZPOD19], we need to define criteria for the acceptance of an accelerated iterate, to improve the stability of Anderson acceleration. Finally, we need to find a way to recover the ADMM variables x, y, z after the termination of DR splitting. These problems will be discussed in the following.

Proximal Operators for γϕ1 and γϕ2
In general, given the functions f and g from the optimization problem (5), it is difficult to find an explicit formula for the image functions ϕ1 and ϕ2 given in Eq. (20). On the other hand, the proximal operators prox_γϕ1 and prox_γϕ2 have rather simple forms, as we will show below. Here and in the remaining parts of the paper, we will make frequent use of the following proposition from [TP20]: Then from Proposition 3.3, it is easy to derive the following: Proposition 3.4. The proximal operators prox_γϕ1, prox_γϕ2 defined in Eqs. (18) and (19) can be evaluated as follows: where We choose s_AA as the new iterate if it meets a certain acceptance criterion, and revert to the un-accelerated iterate G(s_{k−1}) otherwise.
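Although ϕ1 itself has no closed form, evaluating prox_γϕ1 only requires solving an x-subproblem and mapping the result through A, following the structure of Proposition 3.4. A hedged sketch for a quadratic f of our own choosing, where the sub-problem reduces to a linear solve:

```python
import numpy as np

def prox_image_quadratic(s, A, b, gamma):
    """Evaluate prox_{gamma*phi1}(s) where phi1 is the image function of
    f(x) = 0.5*||x - b||^2 under A: solve min_x f(x) + ||Ax - s||^2/(2*gamma),
    then map the minimizer through A. Returns (prox value, intermediate x)."""
    n = A.shape[1]
    # normal equations of the x-subproblem: (I + A^T A / gamma) x = b + A^T s / gamma
    x = np.linalg.solve(np.eye(n) + A.T @ A / gamma, b + A.T @ s / gamma)
    return A @ x, x   # x is kept for the later recovery of the ADMM variables

A = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])
s = np.zeros(2)
u, x = prox_image_quadratic(s, A, b, gamma=0.5)
# optimality of the x-subproblem: (x - b) + A^T (Ax - s)/gamma = 0
print(np.abs((x - b) + A.T @ (A @ x - s) / 0.5).max())
```

For a general nonconvex f the same pattern applies, except that the x-subproblem is solved by whatever (deterministic) sub-solver the ADMM formulation already provides.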
One choice of the merit function is the following, where u(s) and v(s) denote the u and v values produced by the DR splitting steps (13) and (14) from s, i.e., Note that according to Eq. (12), v(s) − u(s) measures the change in the variable s between two consecutive iterations. Therefore, if s converges to a value s*, then ψ_P(s) must converge to zero. Moreover, Proposition 3.2 indicates that ‖v − u‖ = ‖Ax − Bz − c‖, which is the norm of the primal residual for the equivalent ADMM problem (5) [BPC*11]. We call ψ_P(s) the primal residual norm, and accept an accelerated iterate if its primal residual norm is no larger than that of the previous iterate. Thus the decrease criterion is: An alternative merit function is the DR envelope: (28) where u(s) is defined in Eq. (26). It is shown in [TP20, Theorem 4.1] that ψ_E(s) decreases monotonically during the DR splitting iterations under the following assumptions: Here a function F is said to be L-smooth if it is differentiable and its gradient is Lipschitz continuous with constant L. (29) where x, z are defined in (23) and (24) respectively, and u(s), v(s) are defined in (26).
A proof is given in Appendix B. Note that the values x, z, u(s), v(s) are already evaluated during the DR splitting iteration. Therefore, the actual cost of computing ψ_E(s) is the evaluation of the functions f and g as well as two inner products, which only incurs a small overhead in many cases. Using the DR envelope as the merit function, we can enforce a more sophisticated decrease criterion that provides a stronger guarantee of convergence. Specifically, we require that s_AA decreases the DR envelope sufficiently compared to s_{k−1}: where ν1, ν2 are nonnegative constants. The convergence of our solver using such an acceptance criterion is discussed in Theorems 4.4 and 4.6 in Section 4.
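The acceptance logic can be sketched as follows, using the primal residual norm ψ_P(s) = ‖v(s) − u(s)‖ as the merit function with a simple no-increase test; the DR-envelope variant would substitute ψ_E and the sufficient-decrease condition (30). The toy proximal mappings below are our own stand-ins:

```python
import numpy as np

def soft_threshold(t, tau):
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def dr_operators(s, b, lam, gamma):
    """u(s) and v(s) from the DR steps (13) and (14), with toy proximal
    mappings: phi1 = 0.5||. - b||^2, phi2 = lam*||.||_1."""
    u = (s + gamma * b) / (1.0 + gamma)
    v = soft_threshold(2.0 * u - s, gamma * lam)
    return u, v

def merit_primal(s, b, lam, gamma):
    # psi_P(s) = ||v(s) - u(s)||, the primal residual norm
    u, v = dr_operators(s, b, lam, gamma)
    return np.linalg.norm(v - u)

def accept_or_revert(s_aa, s_prev, b, lam, gamma):
    """Accept the accelerated iterate only if the merit does not increase;
    otherwise fall back to the un-accelerated DR step G(s_prev)."""
    if merit_primal(s_aa, b, lam, gamma) <= merit_primal(s_prev, b, lam, gamma):
        return s_aa
    u, v = dr_operators(s_prev, b, lam, gamma)
    return s_prev + v - u   # plain DR step (12)

b = np.array([2.0, -0.3, 1.0])
# a wildly extrapolated candidate is rejected in favor of the plain DR step
print(accept_or_revert(100.0 * np.ones(3), np.zeros(3), b, lam=0.5, gamma=1.0))
```

Because the fallback is an ordinary DR step, a rejected extrapolation never stalls the solver; it merely costs one extra merit evaluation.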
In this paper, unless stated otherwise, we use the DR envelope as the merit function to benefit from its convergence guarantee if the optimization problem satisfies the conditions given in Theorems 4.4 or 4.6, and use the primal residual norm otherwise, as it is an effective heuristic with lower overhead according to our experiments.

Recovery of x, y, z
After the variable s converges to a fixed point s* of the mapping G, it is easy to recover the corresponding stationary point (x*, y*, z*) of the ADMM problem. Before presenting the method, we first introduce the definition of stationary points. Definition 3.6. (x*, y*, z*) is said to be a stationary point of (5) if where ∂f and ∂g denote the generalized subdifferentials of f and g [RW09, Definition 8.3], respectively. Our method for recovering (x*, y*, z*) is based on the following: Proposition 3.7. Let s* be a fixed point of G. Define Then (x*, y*, z*) is a stationary point of the problem (5).
A proof is given in Appendix C. Note that the evaluation of x*, z* has the same form as the intermediate values x, z in Proposition 3.4 for evaluating the proximal operators in DR splitting. Therefore, during the DR splitting, we store the values of x and z when evaluating the proximal operators. When the variable s converges, we simply return the latest values of x, z as the solution to the ADMM problem. Algorithm 1 summarizes our acceleration method.
Algorithm 1: Anderson Acceleration for ADMM based on DR splitting.
Data: x_0, y_0, z_0: initial values; m ∈ N: number of previous iterates used for acceleration; k_max: maximum number of iterations; ε: convergence threshold.
1 x_default = x_0; z_default = z_0;
2 s_0 = Ax_0 − y_0; u_0 = v_0 = 0; s_default = s_0;
3 k = 0; ψ_prev = r = +∞; reset = TRUE;
4 while TRUE do
// Perform one iteration of DR splitting to evaluate the merit function for s_k
Compute ψ using Eq. (25) (or Eq. (28));

As pointed out in [FS09], Anderson acceleration can be considered as a quasi-Newton method to find the root of the residual function, utilizing the m previous iterates to approximate the inverse Jacobian. Similar to other Anderson acceleration based methods such as [HS16, PDZ*18b, ZPOD19], we observe that a larger m leads to more reduction in the number of iterations required for convergence, but also increases the overhead per iteration. We empirically set m = 6 in all our experiments.
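The overall structure of Algorithm 1 can be sketched as below for a toy splitting (f(x) = 0.5‖x − b‖², g = λ‖·‖₁, x = z). The real solver instead evaluates prox_γϕ1 and prox_γϕ2 via the sub-problems of Proposition 3.4, stores the intermediate x, z for the recovery step, and can use the DR envelope with condition (30) in place of the simple merit test; all names here are our own:

```python
import numpy as np

def soft_threshold(t, tau):
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def aa_dr(b, lam=0.5, gamma=1.0, m=6, k_max=200, eps=1e-10):
    """Anderson-accelerated DR splitting for a toy problem with
    phi1 = 0.5||. - b||^2 and phi2 = lam*||.||_1 (stand-ins)."""
    prox1 = lambda s: (s + gamma * b) / (1.0 + gamma)  # prox_{gamma*phi1}
    prox2 = lambda t: soft_threshold(t, gamma * lam)   # prox_{gamma*phi2}

    def dr_step(s):
        u = prox1(s)
        v = prox2(2.0 * u - s)
        return s + v - u, np.linalg.norm(v - u)        # G(s), psi_P(s)

    s = np.zeros_like(b)
    psi_prev = np.inf
    S, GS = [], []
    s_default = None
    for _ in range(k_max):
        Gs, psi = dr_step(s)
        if psi > psi_prev and s_default is not None:
            # reject the accelerated candidate: revert to the plain DR
            # iterate and clear the acceleration history
            s = s_default
            S, GS = [], []
            Gs, psi = dr_step(s)
        if psi < eps:
            break
        psi_prev = psi
        S.append(s)
        GS.append(Gs)
        S, GS = S[-(m + 1):], GS[-(m + 1):]            # keep m previous iterates
        s_default = Gs
        if len(S) == 1:
            s = Gs                                     # plain DR step
        else:                                          # Anderson extrapolation
            R = np.array(GS) - np.array(S)
            theta, *_ = np.linalg.lstsq(np.diff(R, axis=0).T, R[-1], rcond=None)
            s = GS[-1] - np.diff(np.array(GS), axis=0).T @ theta
    return prox1(s)   # u(s*) recovers the solution of the toy problem

print(aa_dr(np.array([2.0, -0.3, 1.0])))  # approaches [1.5, 0, 0.5]
```

The revert-and-reset branch mirrors Algorithm 1: a rejected extrapolation falls back to the un-accelerated iterate computed from the last accepted point, so progress is never lost.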

Comparison with [ZPOD19]
[ZPOD19] also proposed an Anderson acceleration approach for ADMM. In the general case, they treat the ADMM iteration (6)-(8) as a fixed-point iteration of (x, y). In comparison, Proposition 3.2 shows that our approach is based on a fixed-point iteration of s = Ax − y, with a dimensionality up to 50% lower than that of (x, y). A main computational overhead of Anderson acceleration is 2m inner products between vectors with the same dimensionality as the fixed-point iteration variables [PDZ*18a]. Therefore, our approach incurs a lower overhead per iteration. The lower dimensionality of our formulation also indicates that it captures the inherent structure of ADMM in a more essential way, and we observe in experiments that such a lower-dimensional representation can be more effective in reducing the number of iterations required for convergence. Together with the lower overhead per iteration, this often leads to faster convergence than the general approach from [ZPOD19].
It is also shown in [ZPOD19] that if there is a special structure in the problem (5), ADMM can be represented as a fixed-point iteration of x or y alone, which would have the same dimensionality as the fixed-point mapping we use in this paper. In this case, besides the general approach mentioned in the previous paragraph, Anderson acceleration can also be applied to x or y alone, often with similar performance to our approach. However, this formulation requires one of the two target function terms in (5) to be a strongly convex quadratic function, which is a strong assumption that limits its applicability. In comparison, our method imposes no special requirements on the functions f and g, making it a more versatile approach for effective acceleration.

Convergence Analysis
If we utilize the DR envelope as the merit function in Algorithm 1, and use condition (30) to determine acceptance of an accelerated iterate, then it can be shown that Algorithm 1 converges to a stationary point of the optimization problem. In the following, we discuss the conditions for such convergence. Unless stated otherwise, we assume that all the functions are lower semicontinuous and proper. In contrast to Section 3, we will write ∈ instead of = for the evaluation of proximal mappings and minimization sub-problems, to indicate that our results are still applicable when these operators are multi-valued. We first introduce some definitions: Definition 4.1. A point s* is said to be a fixed point of the mapping G if s* ∈ G(s*). Definition 4.2. A point u* is said to be a stationary point of (11). Our first convergence result requires the following assumptions: Then {z^{k_i}} is bounded. Let z* be a cluster point of {z^{k_i}}, and define Then (x*, y*, z*) is a stationary point of (5).
A proof is given in Appendix D. Remark 4.5. Given a fixed point s* of G, we can also compute a stationary point of (5) without the assumptions used in Theorem 4.4. The reader is referred to Appendix E for further discussion. Theorem 4.4 shows the subsequence convergence of {(s_k, u_k, v_k)} to a value corresponding to a stationary point. Next, we consider the global convergence of the whole sequence. We define Our global convergence results rely on the following assumptions: A proof is given in Appendix F. Remark 4.7. A sufficient condition for Assumption (C.2) is that f and g are both semi-algebraic functions. In this case, ϕ1 and ϕ2 will both be semi-algebraic [TP20], and thus D_γ is also semi-algebraic. Since a semi-algebraic function is also sub-analytic [XY13], D_γ will be a sub-analytic function. As noted in [ZPOD19], a large variety of functions used in computer graphics are semi-algebraic. Interested readers are referred to [ZPOD19] and [LP15] for further discussion. Remark 4.8. If the functions f and g satisfy some further conditions, it can be shown that the convergence rate of (s_k, u_k, v_k) is r-linear. The discussion relies on the KL property [ABS13] and is rather technical, so we leave it to Appendix F. Remark 4.9. Assumption (A.1) requires the function f in (5) to be globally Lipschitz differentiable. When f is only locally Lipschitz differentiable, it is still possible to prove the convergence of Algorithm 1. One such example is given in Appendix I.

Assumptions on f and g
Assumptions (A.1), (A.2) and (B.2) impose conditions on the functions ϕ1 and ϕ2 in (11). As there is no closed-form expression for ϕ1 and ϕ2 in general, these conditions can be difficult to verify. For practical purposes, we provide some conditions on the functions f and g that can ensure Assumptions (A.1), (A.2) and (B.2). A proof is given in Appendix G.

Numerical Experiments
We apply our method to a variety of problems to validate its effectiveness, focusing mainly on nonconvex problems in computer graphics. We describe each problem using the same variable names as in (5), so that its ADMM solver can be described by the steps (6)-(8). Different solvers are run using the same initialization.
For each problem, we compare the convergence speed between the original ADMM solver, the accelerated solver (AA-ADMM) proposed in [ZPOD19], and our method. For each method we plot the combined residual (10) with respect to the iteration count and the computational time respectively, where a faster decrease of the combined residual indicates faster convergence. For ADMM and AA-ADMM, the combined residual is evaluated according to Eq. (10). For DR splitting, it can be evaluated using the values of s, u, v without recovering their corresponding ADMM variables.
Using the notation and results from Proposition 3.2, we have Therefore, given a DR splitting iterate (s_k, u_k, v_k), we evaluate the combined residual r_c^k by performing a partial iteration and computing Similar to [ZPOD19], we normalize all combined residual values as follows to factor out the influence of the dimensionality and the value range of the variables:

Figure 1: Comparison between ADMM, AA-ADMM, and our method with different merit functions, using the ℓq-regularized logistic regression problem (32). The two variants of our method have similar performance. Both accelerate the convergence of ADMM and perform better than AA-ADMM.
where N_A is the number of rows of the matrix A, and a is a scalar that indicates the typical range of variable values. For both AA-ADMM and our method, we use m = 6 previous iterates for Anderson acceleration.

ℓq-Regularized Logistic Regression. We adopt an existing implementation of the ℓq-regularized logistic regression problem (32). Here x = (w, v) are the parameters to be optimized, with w ∈ R^n and v ∈ R. z = (z_1, z_2) is an auxiliary variable, with z_1 ∈ R^n and z_2 ∈ R. {(a_i, b_i) | i = 1, ..., p} is a set of input data pairs, each consisting of a feature vector a_i ∈ R^n and a label b_i ∈ {−1, 1}. Ω(z_1) = Σ_{i=1}^n |z_{1,i}|^{1/2} is an ℓq sparsity regularization term with q = 1/2. To test the performance, we use the data generator in the code to randomly generate p = 1000 pairs of data with feature vector dimension n = 1000. We test the problem with a weight parameter λ = 10^{−4} and a penalty parameter β = 10^5. It can be verified that problem (32) satisfies the assumptions for Theorem 4.6 (see Appendix H). Thus we use the DR envelope as the merit function for Algorithm 1, with parameters ν_1 = ν_2 = 10^{−3} for the acceptance condition (30). For comparison, we also run the algorithm using the primal residual norm as the merit function. We run AA-ADMM using the general approach in [ZPOD19] that accelerates x and the dual variable y simultaneously, since the problem does not meet their requirement for reduced-variable acceleration. Fig. 1 shows the comparison between the four solvers. We can see that both AA-ADMM and our methods accelerate the convergence, while our methods achieve better performance thanks to the lower dimensionality of the accelerated variables. In addition, there is no significant difference between the performance of the two variants of our method, which verifies the effectiveness of the primal residual norm as the merit function despite its lack of theoretical convergence guarantees.

Figure 2: Comparison using (33) for computing a frame in the physical simulation of a stretched elastic bar with 6171 vertices and 25000 tetrahedrons, using three types of hyperelastic energy and a high stiffness parameter ('rubber' in the source code of [OBLN17]). The normalized combined residual plots (the top two rows) show that both variants of our method achieve similar acceleration results as the reduced-variable scheme of AA-ADMM. All three approaches perform better than the general scheme of AA-ADMM. The bottom two rows plot the relative energy (35) and include a Newton solver [SB12] and an L-BFGS solver [LBK17] for comparison.

Physical Simulation. Next, we consider the ADMM solver used in [OBLN17] for the following optimization for physical simulation: where z is the node positions to be optimized, x is an auxiliary variable that represents the absolute or relative node positions for the elements according to the selection matrix D, W is a diagonal weight matrix, f is an elastic potential energy, and g is a quadratic momentum energy. In Appendix I, we use the StVK model as an example to prove the convergence of Algorithm 1 on problem (33). AA-ADMM can be applied to this problem to accelerate the variable x alone [ZPOD19], and we include both the general approach and the reduced-variable approach in the comparison. For our method, we include the implementation using each merit function in the comparison, and choose parameters ν_1 = ν_2 = 0 for the acceptance condition (30). Fig. 2 shows the performance of the five solver

Figure 3: Computation of compressed manifold basis via problem (36). Our method achieves similar reduction of iterations as AA-ADMM, but outperforms AA-ADMM in computational time thanks to its lower overhead.
variants on the simulation of a stretched hyperelastic bar with a high stiffness parameter, using three types of hyperelastic energy. We adapt the source codes from [OBLN17] (https://github.com/mattoverby/admm-elastic) and [ZPOD19] (https://github.com/bldeng/AA-ADMM) for the implementation of ADMM and AA-ADMM, respectively. The normalized combined residual plots (the top two rows) show that all accelerated variants achieve better performance than the ADMM solver. Overall, the general AA-ADMM takes a longer time than the other accelerated variants to fully converge, potentially due to the larger number of variables involved in the fixed-point iteration and the higher overhead they induce. For a more complete evaluation, we also compare the solvers with a Newton method [SB12] and an L-BFGS method [LBK17], neither of which suffers from slow final convergence. Specifically, we use them to minimize the following energy equivalent to the target function of (33): In the bottom two rows of Fig. 2, we compare all methods by plotting their relative energy with respect to the iteration count and the computational time, where F_0 and F_* are the initial value and the minimum of the energy F, respectively. We can see that although the Newton method requires the fewest iterations to converge, it is one of the slowest methods in terms of actual computational time, due to its high computational cost per iteration. L-BFGS achieves the best performance in terms of computational time, followed by the accelerated ADMM solvers. Note, however, that classical Newton and L-BFGS are intended for smooth unconstrained optimization problems, and they are often not applicable if the problem is nonsmooth or constrained - the type of problems for which ADMM is popular.
Geometry Processing.Nonconvex ADMM solvers have also been used in geometry processing.In Fig. 3, we compare the performance between different methods on the following optimization problem from [NVT * 14] for compressed manifold modes on a triangle mesh with N vertices: min where Z ∈ R N×K denotes a set of basis functions to be optimized, X 1 , X 2 ∈ R N×K are auxiliary variables, L ∈ R N×N is a Laplacian matrix, and ι is an indicator function of Z for enforcing the orthogonality condition if Z T DZ = I with respect to a mass matrix D. We apply our method with the primal residual norm as the merit function.We use the source code released by the authors for the ADMM solver, and modify it to implement AA-ADMM and our method.We use the general approach of AA-ADMM that accelerates X together with the dual variable, as the problem does not meet the requirement for reduced-variable acceleration.Fig. 3 shows the combined residual plots for the three methods on two models as well as the parameter settings for each problem instance.Our method achieves a similar effect in reducing the number of iterations as AA-ADMM, but outperforms AA-ADMM in terms of computational time thanks to its lower computational overhead.
We also apply our method to a problem proposed in [DBD * 15] for optimizing the vertex positions x ∈ R 3n of a mesh model subject to a set of soft constraints A i x ∈ C i (i ∈ S) and hard constraints A j x ∈ C j ( j ∈ H), where matrices A i and A j select the relevant https://github.com/tneumann/cmmvertices for the constraints and compute their differential coordinates where appropriate, and C i and C j represent the feasible sets.This is formulated in [DBD * 15] as the following optimization: where z i (i ∈ S) and z j ( j ∈ H) are auxiliary variables, σ C i and σ C j are indicator functions for the feasible sets, and w i are user-specified weights.The first term of the target function is an optional Laplacian smoothness energy, whereas the second term measures the violation of the soft constraints using the squared Euclidean distance to the feasible sets.This problem is solved using ADMM and AA-ADMM in [ZPOD19].However, since its target function is not separable, our accelerated ADMM solver is not applicable.To apply our method, we reformulate the problem as follows: where D C i (•) denotes the Euclidean distance to C i .This problem has a separable target function, and we derive its ADMM solver in Appendix A. We compare the performance of ADMM, AA-ADMM and our method on problem (38) for wire mesh optimization [GSD * 14]: we optimize a regular quad mesh subject to the soft constraints that each vertex lies on a target shape, and the hard constraints that (1) each edge has the same length l and (2) all angles of each quad face are within the range [π/4, 3π/4].In Fig. 
4, we solve the problem on a mesh with 230K vertices, using L = 0, w_i = 1, and penalty parameter β = 10000. The combined residual plots show that both AA-ADMM and our method achieve faster convergence than ADMM, with slightly better performance from our method. To illustrate the benefit of the acceleration, we take the results generated by each method within the same computational time, and use color-coding to visualize the edge-length error

ξ(e) = |e − l| / l,    (39)

where e is the actual length of each edge. The two accelerated solvers lead to notably smaller edge-length errors than ADMM within the same computational time. The acceleration brings significant savings in the computational time needed for a high-accuracy solution, which is required for the physical fabrication of the design [GSD*14].
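The edge-length error of Eq. (39) used for the color-coding can be computed directly from the mesh data; a small helper (with illustrative names and interface) might look like:

```python
import numpy as np

def edge_length_error(vertices, edges, target_len):
    """Normalized edge-length error xi(e) = |e - l| / l from Eq. (39).

    vertices   : (n, 3) array of vertex positions
    edges      : (m, 2) integer array of edge endpoint indices
    target_len : the common target edge length l
    """
    p, q = vertices[edges[:, 0]], vertices[edges[:, 1]]
    lengths = np.linalg.norm(p - q, axis=1)   # actual edge lengths e
    return np.abs(lengths - target_len) / target_len
```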

Image Processing. In Fig. 5, we test our method on the nonconvex ADMM solver for the following image segmentation problem [WG19], where x ∈ R^n represents the pixel-wise labels to be optimized, L is a Laplacian matrix based on the similarity between adjacent pixels, d is a unary cost vector, and z = (z_1, z_2) is an auxiliary variable with z_1, z_2 ∈ R^n. ι_1 and ι_2 are indicator functions for the feasible sets S_1 = [0, 1]^n and S_2 = {p ∈ R^n | Σ_{i=1}^n (p_i − 1/2)^2 = n/4}, respectively, which together with the linear constraint between x and z induce a binary constraint for the labels x. Fig. 5 uses the cameraman image to compare ADMM, AA-ADMM, and our method with the primal residual norm as the merit function, using the same random initialization. We use the Python source code released by the authors (https://github.com/wubaoyuan/Lpbox-ADMM) for the ADMM implementation, and modify it to implement AA-ADMM and our method. We use the general approach of AA-ADMM, since the problem does not meet the reduced-variable conditions. The released code gradually changes the penalty parameter β, starting with β = 5 and increasing it by 3% every five iterations until it reaches the upper bound 1000. Since a different value of β leads to a different fixed-point iteration, for both AA-ADMM and our method we reset the history of Anderson acceleration whenever β changes. We observe an interesting behavior of the ADMM solver: initially it maintains a relatively high value of the combined residual norm until the variable z converges to its value z* in the solution; afterwards,
z remains close to z*, and the ADMM iteration effectively reduces to an affine transformation of the variables x and y, with a rapid decrease of the combined residual norm. In comparison, our method shows more oscillation of the combined residual norm in the initial stage, but accelerates the convergence of z towards z*, followed by a similar rapid decrease of the combined residual norm, thus outperforming ADMM in both iteration count and computational time. AA-ADMM, on the other hand, fails to achieve acceleration.
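The history reset mentioned above, i.e., clearing the Anderson state whenever the penalty parameter changes, can be sketched as follows. The 3%-every-five-iterations schedule matches the released code, while the function name and interface are hypothetical:

```python
def update_penalty(beta, k, history, beta_max=1000.0, factor=1.03, period=5):
    """Increase beta by 3% every `period` iterations up to beta_max, and clear
    the Anderson-acceleration history whenever beta changes, since each value
    of beta defines a different fixed-point map G. Names are illustrative.
    """
    if k > 0 and k % period == 0 and beta < beta_max:
        beta = min(beta * factor, beta_max)
        # the stored residual differences belong to the old fixed-point map
        history.clear()
    return beta
```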
Convex Problems. Although our method is designed with nonconvex problems in mind, it can be naturally applied to convex problems. In Fig. 6, we apply our method to the ADMM solver in [NVW*13] for computing mesh deformation components given a mesh animation sequence and component weights, where the matrix Z represents the deformation components to be optimized, V is the input mesh sequence, W represents the given weights for the components, X is an auxiliary variable, and Ω_1(X) is a weighted ℓ1/ℓ2-norm that induces local support for the deformation components. In Fig. 7, we accelerate the ADMM solver in [HDN*16] for image deconvolution, where z represents the image to be recovered, x = (x_1, x_2) are auxiliary variables, the matrix K represents the convolution operator, G is the image gradient matrix, and Ω_2 is the ℓ1/ℓ2-norm for regularizing the image gradients. Both problems (41) and (42) are convex, and AA-ADMM can only be applied using the general approach due to the problem structures. For both problems, we apply our method using the primal residual norm as the merit function, and use the source code released by the authors to implement the ADMM solvers and their accelerated versions. For both problems, our method accelerates the convergence of ADMM and outperforms AA-ADMM in computational time.
Limitation. Similar to [ZPOD19], our method may not be effective for ADMM solvers with very low computational cost per iteration. Fig. 8 shows the performance of our method and AA-ADMM on the ADMM solver from [TZD*19] for recovering a geodesic distance function on a mesh surface from a unit tangent vector field. The two methods achieve almost the same reduction in the number of iterations required for convergence. Our method requires a shorter computational time than AA-ADMM to achieve the same value of the combined residual, because only the general approach of AA-ADMM can be applied to this problem, and its overhead is higher than that of our method. On the other hand, both approaches take a longer time than the original ADMM solver to achieve convergence, because the very low computational cost per iteration of the original solver implies a high relative overhead for both acceleration techniques.
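The trade-off described above can be made concrete with a back-of-the-envelope model (our own, not from the paper): acceleration reduces the total time only if the time saved by fewer iterations exceeds the accumulated per-iteration overhead of the acceleration scheme.

```python
def acceleration_pays_off(t_iter, overhead, iters_plain, iters_accel):
    """Crude cost model (illustrative, not from the paper): compare the total
    time of the accelerated solver (fewer iterations, each more expensive)
    against the plain solver.

    t_iter      : cost of one un-accelerated iteration
    overhead    : extra per-iteration cost of the acceleration scheme
    iters_plain : iterations needed by the plain solver
    iters_accel : iterations needed by the accelerated solver
    """
    return iters_accel * (t_iter + overhead) < iters_plain * t_iter
```

When t_iter is very small, as in the geodesic distance solver above, the relative overhead dominates and the accelerated variants lose despite needing fewer iterations.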

Concluding Remarks
In this paper, we propose an acceleration method for ADMM by applying Anderson acceleration to its equivalent DR splitting formulation. Based on a fixed-point interpretation of DR splitting, we accelerate one of its variables that is not explicitly available in ADMM, but can be derived from a linear transformation of the ADMM variables. Our strategy consistently outperforms the general Anderson acceleration approach in [ZPOD19]. Compared with the
reduced-variable approach in [ZPOD19], our method has the same dimensionality for the accelerated variable and achieves similar performance, but imposes no special requirements on the problem except for the separability of its target function. This makes our approach applicable to a much wider range of problems. In addition, we analyze the convergence of the proposed algorithm, and show that it converges to a stationary point of the ADMM problem under appropriate assumptions. We verify the effectiveness and efficiency of our algorithm on a variety of ADMM solvers from computer graphics and other domains.
There are still some limitations to our approach. First, the equivalence between ADMM and DR splitting relies on a separable target function for the ADMM problem. As a result, our method is not applicable to problems whose target function is not separable. However, as far as we are aware, the majority of ADMM problems in computer graphics, computer vision, and image processing have a separable target function. Moreover, as shown in the geometry optimization example in Section 5, it is sometimes possible to reformulate a problem to make its target function separable. Therefore, this issue does not hinder the practical application of our method. Another limitation is that there is no theoretical guarantee that the method can always accelerate convergence, even locally. Recently, [EPRX20] provided theoretical results showing that Anderson acceleration can improve the convergence rate, but their proofs require the original iteration to be contractive or to converge q-linearly. For nonconvex DR splitting, to the best of our knowledge, local q-linear convergence can only be shown in very special cases that are too restrictive in practice. Further investigation of the theoretical properties of Anderson acceleration for nonconvex DR splitting is needed to provide a theoretical guarantee for acceleration.

Here, the matrix A stacks all matrices {A_i | i ∈ S} and {A_j | j ∈ H}. In the following, y denotes the dual variable consisting of {y_i | i ∈ S} and {y_j | j ∈ H}, corresponding to the soft constraints S and the hard constraints H, respectively. We use superscripts to indicate iteration counts, to avoid conflict with the subscripts that indicate the constraints. The step (6) then reduces to a minimization problem that can be solved via a linear system. The step (7) is written explicitly, and the step (8) reduces to separable subproblems. The solution to (47) is given by the projection operator P_{C_j}(·) onto C_j, and the solution to (46) is ϕ(u^+) = h(x^*). Then we have: But by the definition of u
Then we are able to prove the general transition theorem. Theorem E.1. Suppose s* is a fixed point of G, and u* ∈ prox_{γϕ1}(s*) ∩ prox_{γϕ2}(2u* − s*). Define x* ∈ argmin. Similarly, we have βB^T y* ∈ ∂g(z*).
The last condition follows from u * = Ax * = Bz * + c.
we can assume that k′ is sufficiently large such that for any k ≥ k′ we have, where a_1 is some constant. Similar to the previous proof, we can show, where a_2 = (√5/γ) min{ν_1, c/(1+γL)^2}. Summing (56) from k′ to ∞, therefore, similarly, we can show that for any l ≥ k′ we have, which means that {H^k} converges q-linearly; since H^k ≥ ‖v^k − u^k‖, we obtain the r-linear convergence of ‖v^k − u^k‖. Then
and F ≢ +∞. Under Assumptions (A.1)–(A.3), the DR envelope has a simpler form: Proposition 3.5. If Assumptions (A.1)–(A.3) hold, then

where L and σ are defined in Assumption (A.1). (B.4) The function g̃(z) := g(z) + β‖Bz + c − s‖^2 is level-bounded and bounded from below for any given s. Our first convergence result is then given as follows. Theorem 4.4. Suppose Assumptions (A.1)–(A.3) and (B.1)–(B.3) hold. Let {(s^k, u^k, v^k)} be the sequence generated by Algorithm 1 using Eq. (30) as the acceptance condition. Then: (a) {ψ_E(s^k)} is monotonically decreasing and ‖v^k − u^k‖ → 0. (b) The sequence {(s^k, u^k, v^k)} is bounded. If any subsequence {s^{k_i}} converges to a point s*, then s* is a fixed point of G and u* = prox_{γϕ1}(s*) is a stationary point of (11). Moreover, such a convergent subsequence must exist. (c) Suppose Assumption (B.4) is also satisfied. For any convergent subsequence {s^{k_i}} in (b), let {z^{k_i}} be the corresponding subsequence generated by Algorithm 1, i.e.,
2). These conditions are based on the results in [TP20, Section 5.4]. Proposition 4.10. Suppose the problem (5) and the ADMM subproblems in (6) and (8) have a solution. Then the following conditions are sufficient for Assumptions (A.1), (A.2), and (B.2):
(D.1) f and g are proper and lower semicontinuous.
(D.2) One of the functions f and g is level-bounded, and the other is bounded from below.
(D.3) A is surjective.
(D.4) f satisfies one of the following conditions: 1. f is Lipschitz differentiable, and argmin_x {f(x) | Ax = s} is single-valued and Lipschitz continuous; 2. f is Lipschitz differentiable and convex; 3. f is differentiable, and ‖∇f(x) − ∇f(y)‖ ≤ L‖A(x − y)‖_2 for any x, y such that ∇f(x) and ∇f(y) are in the range of A^T.
(D.5) The function Z(s) := argmin_z {g(z) | Bz + c = s} is locally bounded on the set S = {Bz + c | g(z) < +∞}, i.e., for any s ∈ S there exists a neighborhood O such that Z is bounded on O.

Figure 4: Comparison between ADMM and the accelerated methods on the wire mesh optimization problem (38). The normalized combined residual plots show faster convergence using the accelerated solvers, with better performance from our method. The color-coding visualizes the edge-length error ξ defined in (39) on meshes computed by the three methods within the same computational time (see the bottom-right plot).

Figure 5: Comparison on the image segmentation problem (40) with a reformulated binary constraint. Our method reduces the iteration count and the computational time required for convergence, while AA-ADMM fails to achieve acceleration.

Figure 6: Comparison on the convex problem (41) with λ = 2, for computing local mesh deformation components from an input mesh sequence and given weights. The methods are tested using two mesh sequences constructed from the facial expression dataset of [RBSB18], with 100 frames and 250 frames, respectively. We set the penalty parameter to β = 10 for both problem instances. Our method has similar acceleration performance to AA-ADMM in reducing the number of iterations, and outperforms AA-ADMM in actual computational time.

Figure 7: Comparison on the convex problem (42) for image deconvolution. We choose λ = 400 in problem (42), and set the penalty parameter to β = 100. ADMM and AA-ADMM have fairly similar performance; both are outperformed by our method.

Figure 8: Comparison on the ADMM solver in [TZD*19] for recovering geodesic distance on meshes. Both AA-ADMM and our method reduce the number of iterations required for convergence, but their actual computational time is higher due to the very low computational cost per iteration of the ADMM solver. Our method takes a shorter time than AA-ADMM thanks to its lower overhead.

© 2020 The Author(s). Computer Graphics Forum © 2020 The Eurographics Association and John Wiley & Sons Ltd.
Criteria for Accepting Accelerated Iterates
Classical Anderson acceleration can be unstable, with slow convergence, or can stagnate at a wrong solution [WN11, PE13, PDZ*18b]. To improve stability, in [ZPOD19] an accelerated iterate is accepted only if it decreases a certain quantity that converges to zero under effective iterations, such as the combined residual. Adopting a similar approach, we define a merit function ψ whose decrease indicates the effectiveness of an iteration. At the k-th iteration, we evaluate the un-accelerated iterate G(s^{k−1}) as well as the accelerated iterate s_AA, and evaluate the decrease of the merit function from s^{k−1} to s_AA:

    // Use s_AA for next acceptance check
18:     s^{k+1} = s_AA; k = k + 1;
19: else // Revert to last accepted iterate
20:     s^k = s_default; u^k = u_default; v^k = v_default;
21:     x^k = x_default; z^k = z_default; reset = TRUE;
23: if k ≥ k_max OR r < ε then
24:     return x_default, z_default;

We use the Anderson acceleration implementation from [PDZ*18a]. All experiments are run on a desktop PC with a hexa-core CPU at 3.7 GHz and 16 GB of RAM. The source code for the examples is available at https://github.com/YuePengUSTC/AADR.
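The acceptance logic above amounts to a simple guarded update: take the accelerated iterate only if it decreases the merit function relative to the last accepted iterate, and otherwise fall back to the plain fixed-point step. A schematic version (illustrative names, not the paper's exact interface):

```python
def guarded_step(G, merit, s_prev, s_aa):
    """One acceptance check in the spirit of Algorithm 1: accept the
    accelerated iterate s_aa only if it decreases the merit function;
    otherwise revert to the un-accelerated update G(s_prev).

    G      : fixed-point map of the iteration
    merit  : merit function psi whose decrease marks an effective iteration
    s_prev : last accepted iterate
    s_aa   : candidate iterate produced by Anderson acceleration
    """
    if merit(s_aa) < merit(s_prev):
        return s_aa, True             # accelerated iterate accepted
    return G(s_prev), False           # revert to the plain fixed-point step
```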