Description

To appear Annals of Applied Probability 2005Vol. 0, No. 0, 1 8 ON THE ERGODICITY PROPERTIES OF SOME ADAPTIVE MCMC ALGORITHMS By Christophe Andrieu, and Éric Moulines University of Bristol and École Nationale

Information

Category:
## Literature

Publish on:

Views: 79 | Pages: 40

Extension: PDF | Download: 0

Share

Transcript

To appear Annals of Applied Probability 2005Vol. 0, No. 0, 1 8 ON THE ERGODICITY PROPERTIES OF SOME ADAPTIVE MCMC ALGORITHMS By Christophe Andrieu, and Éric Moulines University of Bristol and École Nationale Supérieure des Télécommunications In this paper we study the ergodicity properties of some adaptive Monte Carlo Markov chain algorithms MCMC that have been recently proposed in the literature. We prove that under a set of verifiable conditions, ergodic averages calculated from the output of a so-called adaptive MCMC sampler converge to the required value and can even, under more stringent assumptions, satisfy a central limit theorem. We prove that the conditions required are satisfied for the Independent Metropolis-Hastings algorithm and the Random Walk Metropolis algorithm with symmetric increments. Finally we propose an application of these results to the case where the proposal distribution of the Metropolis-Hastings update is a mixture of distributions from a curved exponential family. 1. Introduction. Markov chain Monte Carlo MCMC, introduced by Metropolis et al. 1953, is a popular computational method for generating samples from virtually any distribution π. In particular there is no need for the normalising constant to be known and the space R nx for some integer n x on which it is defined can be high dimensional. We will hereafter denote B the associated countably generated σ-field. The method consists of simulating an ergodic Markov chain { k, k 0} on with transition probability P such that π is a stationary distribution for this chain, i.e πp = π. Such samples can be used e.g. to compute integrals π f def = for some π-integrable function f : R, S n f = 1 n f x π dx, n f k. 1 In general the transition probability P of the Markov chain depends on some tuning parameter, say θ defined on some space Θ R n θ for some integer n θ, and the convergence properties of the Monte Carlo averages in Eq. 1 might highly depend on a proper choice for these parameters. The authors would like to thank the Royal Society/CNRS partnership and the Royal Society International Exchange visit program, An initial version of this work was presented at the First Cape Cod Monte Carlo Workshop, September AMS 2000 subject classifications: Primary 65C05, 65C40, 60J27, 60J35; secondary 93E35 Keywords and phrases: Adaptive Markov chain Monte Carlo, Self-tuning algorithm, Metropolis-Hastings algorithm, Stochastic approximation, State-dependent noise, Randomly varying truncation, Martingale, Poisson method 1 2 C. ANDRIEU & É. MOULINES We illustrate this here with the Metropolis-Hastings MH update, but it should be stressed at this point that the results presented in this paper apply to much more general settings including in particular hybrid samplers, sequential or population Monte Carlo samplers. The MH algorithm requires the choice of a proposal distribution q. In order to simplify the discussion, we will here assume that π and q admit densities with respect to the Lebesgue measure λ Leb, denoted with an abuse of notation π and q hereafter. The rôle of the distribution q consists of proposing potential transitions for the Markov chain { k }. Given that the chain is currently at x, a candidate y is accepted with probability αx, y defined as { 1 πy qy,x πx qx,y if πxqx, y 0 αx, y = 1 otherwise, where a b def = mina, b. Otherwise it is rejected and the Markov chain stays at its current location x. The transition kernel P of this Markov chain takes the form for x, A B P x, A = αx, x+zqx, x+zλ Leb dz+1 A x 1 αx, x+zqx, x+zλ Leb dz, A x x 2 where A x def = {z, x + z A}. The Markov chain P is reversible with respect to π, and therefore admits π as invariant distribution. Conditions on the proposal distribution q that guarantee irreducibility and positive recurrence are mild and many satisfactory choices are possible; for the purpose of illustration, we concentrate in this introduction on the symmetric increments random-walk MH algorithm hereafter SRWM, in which qx, y = qy x for some symmetric probability density q on R nx, referred to as the increment distribution. The transition kernel Pq SRW of the Metropolis algorithm is then given for x, A B by Pq SRW x, A = 1 A x πx + z πx qz λ Leb dz + 1 A x x 1 1 πx + z qz λ Leb dz. 3 πx A classical choice for q is the multivariate normal distribution with zero-mean and covariance matrix Γ, N 0, Γ. We will later on refer to this algorithm as the N-SRWM. It is well known that either too small or too large a covariance matrix will result in highly positively correlated Markov chains, and therefore estimators S n f with a large variance Gelman et al have shown that the optimal covariance matrix under restrictive technical conditions not given here for the N-SRWM is /n x Γ π, where Γ π is the true covariance matrix of the target distribution. In practice this covariance matrix Γ is determined by trial and error, using several realisations of the Markov chain. This hand-tuning requires some expertise and can be time-consuming. In order to circumvent this problem, Haario et al have proposed to learn Γ on the fly. The Haario et al. 2001 algorithm can be summarized as follows, ERGODICITY OF ADAPTIVE MCMC 3 µ k+1 = µ k + γ k+1 k+1 µ k k 0 4 Γ k+1 = Γ k + γ k+1 k+1 µ k k+1 µ k T Γ k, where, denoting C nx + the cone of positive n x n x matrices, k+1 is drawn from P θk k,, where for θ = µ, Γ Θ = R nx C+ nx, P θ = PN SRW 0,λΓ is here the kernel of a symmetric random walk MH with a Gaussian increment distribution N 0, λγ, where λ 0 is a constant scaling factor depending only on the dimension of the state-space n x and kept constant across the iterations, {γ k } is a non-increasing sequence of positive stepsizes such that γ k = and γ1+δ k for some δ 0 Haario et al have suggested the choice γ k = 1/k. It was realised in Andrieu and Robert 2001 that such a scheme is a particular case of a more general framework. More precisely, for θ = µ, Γ Θ, define H : Θ R nx R nx nx as Hθ, x def = x µ, x µx µ T Γ T. 5 With this notation, the recursion in 4 may be written as θ k+1 = θ k + γ k+1 Hθ k, k+1, k 0, 6 with k+1 P θk k,. This recursion is at the core of most of classical stochastic approximation algorithms see e.g. Benveniste et al. 1990, Duflo 1997, Kushner and Yin 1997 and the references therein. This algorithm is designed to solve the equations hθ = 0 where θ hθ is the so-called mean field defined as hθ def = Hθ, xπdx. 7 For the present example, assuming that x 2 πdx , one can easily check that hθ = Hθ, xπdx = µ π µ, µ π µµ π µ T + Γ π Γ T, 8 with µ π and Γ π the mean and covariance of the target distribution, µ π = x πdx and Γ π = x µ π x µ π T πdx. 9 One can rewrite 6 as θ k+1 = θ k + γ k+1 hθ k + γ k+1 ξ k+1, where {ξ k = Hθ k 1, k hθ k 1, k 1} is generally referred to as the noise. The general theory of stochastic approximation SA provides us with conditions under which this recursion eventually converges to the set {θ Θ, hθ = 0}. These issues are discussed in Sections 3 and 5. 4 C. ANDRIEU & É. MOULINES In the context of adaptive MCMC, the parameter convergence is not the central issue; the focus is rather on the approximation of πf by the sample mean S n f. However there is here a difficulty with the adaptive approach: as the parameter estimate θ k = θ k 0,..., k depends on the whole past, the successive draws { k } do not define an homogeneous Markov chain and standard arguments for the consistency and asymptotic normality of S n f do not apply in this framework. Note that this is despite the fact that for any θ Θ, πp θ = π. This is illustrated by the following example. Let = {1, 2} and consider for θ, θ1, θ2 Θ the following Markov transition probability matrices [ ] [ ] 1 θ θ 1 θ1 θ1 P θ = P =. θ 1 θ θ2 1 θ2 One can check that for any θ Θ, π = 1/2, 1/2 satisfies πp θ = π. However if we let θ k be a given function θ : 0, 1 of the current state, i.e. θ k = θ k, one defines a new Markov chain with transition probability P with now [θ2/θ1+θ2, θ1/θ1+θ2] as invariant distribution. One recovers π when the dependence on the current state k is removed or vanishes with the iterations. With this example in mind, the problems that we address in the present paper and our main general results are in words the following: 1. In situations where θ k+1 θ k 0 as k + w.p. 1, we prove a strong law of large numbers for S n f see Theorem 8 under mild additional conditions. Such a consistency result may arise even in situations where the parameter sequence {θ k } does not converge. 2. In situations where θ k converges w.p. 1, we prove an invariance principle for ns n f πf; the limiting distribution is in general a mixture of Gaussian distributions see Theorem 9. Note that Haario et al have proved the consistency of Monte Carlo averages for the algorithm described by 4. Our result applies to more general settings and rely on assumptions which are less restrictive than those used in Haario et al The second point above, the invariance principle, has to the best of our knowledge not been addressed for adaptive MCMC algorithms. We point out that Atchadé and Rosenthal 2003 have independently extended the consistency result of Haario et al to the case where is unbounded, using Haario et al s mixingale technique. Our technique of proof is different and our algorithm allows for an unbounded parameter θ to be considered, as opposed to Atchadé and Rosenthal The paper is organized as follows. In Section 2 we detail our general procedure and introduce some notation. In Section 3, we establish the consistency i.e. a strong law of large numbers for S n f Theorem 8. In Section 4 we strengthen the conditions required to ensure the law of large numbers LLN for S n f and establish an invariance principle Theorem 9. In Section 5 we focus on the classical Robbins-Monro implementation of our procedure and introduce further conditions that allow us to prove that {θ k } converges w.p. 1 Theorem 11. In Section 6 we establish general properties of the generic SRWM required to ensure a LLN and an invariance principle. For pædagocical purposes we show how to apply these results to the N-SRWM of Haario et al Theorem 15. In Section 7 we present another application of our theory. We focus on the Independent Metropolis- Hastings algorithm IMH and establish general properties required for the LLN and the ERGODICITY OF ADAPTIVE MCMC 5 invariance principle. We then go on to propose and analyse an algorithm that matches the so-called proposal distribution of the IMH to the target distribution π, in the case where the proposal distribution is a mixture of distributions from the exponential family. The main result of this section is Theorem 21. We conclude with the remark that this latter result equally applies to a generalisation of the N-SRWM, where the proposal is again a mixture of distributions. Application to samplers which consist of a mixture of SRWM and IMH is straightforward. 2. Algorithm description and main definitions. Before describing the procedure under study, it is necessary to introduce some notation and definitions. Let T be a separable space and let BT be a countably generated σ-field on T. For a Markov chain with transition probability P : T BT [0, 1] and any non-negative measurable function f : T [0, +, we denote by P ft = P t, f def = T P t, dt ft and for any integer k, P k the k-th iterate of the kernel. For a probability measure µ, we define, for any A BT, µp A def = T µdtp t, A. A Markov chain on a state space T is said to be µ-irreducible if there exists a measure µ on BT such that, whenever µa 0, k=0 P k t, A 0 for all t T. Denote by µ a maximal irreducibility measure for P see Meyn and Tweedie 1993 Chapter 4 for the definition and the construction of such a measure. If P is µ-irreducible, aperiodic and has an invariant probability measure π, then π is unique and is a maximal irreducibility measure. Two main ingredients are required for the definition of our adaptive MCMC algorithms: 1. A family of Markov transition kernels on, {P θ, θ Θ} indexed by a finitedimensional parameter θ Θ R n θ an open set. For each θ in Θ, it is assumed that P θ is π-irreducible and that πp θ = π, i.e. π is the invariant distribution for P θ. 2. A family of update functions {Hθ, x : Θ R n θ}, which are used to adapt the value of the tuning parameter. The adaptive algorithm studied in this paper which corresponds to the process {Z k } defined below requires for both its definition and study the introduction of an intermediate stopped process, which we now define. First, in order to take into account potential jumps outside the space Θ, we extend def the parameter space with a cemetery point, θ c Θ and define Θ = Θ {θ c }. It is convenient to introduce the family of transition kernels {Q γ, γ 0} such that for any γ 0, x, θ Θ, A B and B B Θ, Q γ x, θ; A B = A P θ x, dy1{θ + γhθ, y B} + δ θc B P θ x, dy1{θ + γhθ, y / Θ}, 10 where δ θ denotes the Dirac delta function at θ Θ. In its general form the basic version of the adaptive MCMC algorithms considered here may be written as follows. Set θ 0 = θ Θ, 0 = x, and for k 0 define recursively the sequence { k, θ k, k 0}: if θ k = θ c, then set θ k+1 = θ c and k+1 = x. Otherwise k+1, θ k+1 Q ρk+1 k, θ k ;, where ρ = {ρ k } A 6 C. ANDRIEU & É. MOULINES is a sequence of stepsizes. The sequence { k, θ k } is a non-homogeneous Markov chain on the product space Θ. This non-homogeneous Markov chain defines a probability measure on the canonical state space Θ N equipped with the canonical product σ- algebra. We denote F = {F k, k 0} the natural filtration of this Markov chain and P ρ x,θ and E ρ x,θ the probability and the expectation associated to this Markov chain starting from x, θ Θ. Because of the interaction with feedback between k and θ k, the stability of this inhomogeneous Markov chain is often difficult to establish. This is a long-standing problem in the field of stochastic optimization: known practical cures to this problem include the reprojections on a fixed set see Kushner and Yin 1997 or the more recent reprojection on random varying boundaries proposed in Chen and Zhu 1986, Chen et al and generalized in Andrieu et al More precisely, we first define the notion of compact coverage of Θ. A family of compact subsets {K q, q 0} of Θ is said to be a compact coverage if, K q = Θ, and K q intk q+1, q 0, 11 q 0 where inta denotes the interior of set A. Let γ def = {γ k } be a monotone non-increasing sequence of positive numbers and let K be a subset of. For a sequence a = {a k } and an integer l, we define the shifted sequence a l as follows: for any k 1, a l def k = a k+l. Let Π : Θ K K 0 be a measurable function. Define the homogeneous Markov chain Z k = { k, θ k, κ k, ν k, k 0} on the product space Z def = Θ N N with transition probability R : Z BZ : [0, 1] algorithmically defined as follows note that in order to alleviate notation, the dependence of R on both γ and {K q, q 0} is implicit throughout the paper: for any x, θ, κ, ν Z 1. If ν = 0, then draw, θ Q γκ Πx, θ; ; otherwise, draw, θ Q γκ+ν x, θ;. 2. If θ K κ, then set: κ = κ and ν = ν + 1; otherwise, set κ = κ + 1, and ν = 0. In words, κ and ν are counters: κ is the index of the current active truncation set; ν counts the number of iterations since the last reinitialisation. The event {ν k = 0} indicates that a reinitialization occurs: the algorithm is restarted at iteration k from a point in K K 0 with the smaller sequence of stepsizes γ κ k. Note the important fact, at the heart of our analysis, that between reinitializations this process coincides with the basic version of the algorithm described earlier, with ρ = γ κ k. This is formalized in Lemma 1 below. This algorithm is reminiscent of the projection on random varying boundaries proposed in Chen and Zhu 1986, Chen et al. 1988: whenever the current iterate wanders outside the active truncation set the algorithm is reinitialised with a smaller initial value of the stepsize and a larger truncation set. The homogeneous Markov chain {Z k, k 0} defines a probability measure on the canonical state space Z N equipped with the canonical product σ-algebra. We denote G = {G k, k 0}, P x,θ,k,l and Ēx,θ,k,l the filtration, probability and expectation associated to this Markov chain started from x, θ, k, l Z. For simplicity we will use the shorthand notation Ē and P for ERGODICITY OF ADAPTIVE MCMC 7 Ēx,θ def = Ēx,θ,0,0 and P x,θ def = P x,θ,0,0 for all x, θ Θ. These probability measures depend upon the deterministic sequence γ : the dependence will be implicit hereafter. We define recursively {T n, n 0} the sequence of successive reinitialisation times T n+1 = inf {k T n + 1, ν k = 0}, with T 0 = 0, 12 where by convention inf{ } =. It may be shown that under mild conditions on {P θ, θ Θ}, {Hθ, x, θ, x Θ } and the sequence γ, P sup κ n = P {T n = } = 1, n 0 i.e., the number of reinitialisations of the procedure described above is finite P -a.s.. We postpone the presentation and the discussion of simple sufficient conditions that ensure that this holds in concrete situations to Sections 5, 6 and 7. We will however assume this property to hold in Sections 3 and 4. Again, we stress on the fact that our analysis of the homogeneous Markov chain {Z k } the algorithm for a given sequence γ relies on the study of the inhomogeneous Markov chain defined earlier the stopped process, def for the sequences {ρ k = γ κ k} of stepsizes. It is therefore important to precisely and probabilistically relate these two processes. This is the aim of the lemma below adapted from Andrieu et al., 2005, Lemma 4.1. Define for K Θ, n=0 σk = inf{k 1, θ k K}. 13 Lemma 1. For any m 1, for any non-negative measurable function Ψ m : Θ m R +, for any integer k 0 and x, θ Θ, Ē x,θ,k,0 [Ψ m 1, θ 1,..., m, θ m 1{T 1 m}] = E γ k Πx,θ [Ψ m 1, θ 1,..., m, θ m 1{σK k m}]. 3. Law of large numbers. Hereafter, for a probability distribution P, the various kinds of convergence in probability, almost-sure and weak in distribution are denoted respectively prob. a.s P, P and D P Assumptions. As pointed out in the introduction, a LLN has been obtained for a particular adaptive MCMC algorithm by Haario et al. 2001, using mixingale theory see McLeish Our approach is more in line with the martingale proof of the LLN for Markov chains, and is based on the existence and regularity of the solutions of Poisson s equation and martingale limit theory. The existence and appropriate properties of those solutions can be easily established under a uniform in the parameter θ version of the V - uniform ergodicity of the transition kernels P θ see condition A1 below and Proposition 3. 8 C. ANDRIEU & É. MOULINES We will use the following notation throughout the paper. For W : [1, and f : R a measurable function define fx f W = sup x W x and L W = {f : f W }. 14 We will also consider functions f : Θ R. We will often use the short-hand notation f θ x = fθ, x for all θ, x Θ in order to avoid ambiguities. We will assume that f θ 0 whenever θ / Θ except when f θ does not depend on θ, i.e. f θ f θ for any θ, θ Θ Θ. Let W : [1,. We say that the family of functions {f θ : R, θ Θ} is W - Lipschitz if for any compact subset K Θ, sup f θ W and sup θ θ 1 f θ f θ W . 15 θ K θ,θ K K,θ θ A1 For any θ Θ, P θ has π as stationary distribution. In addition there exists a function V : [1, such that sup x K V x with K defined in Section 2 and such that for any compact subset K Θ, i Minorization condition. There exist C B, ɛ 0 and a probability measure ϕ all three depending on K such that ϕc

Related Search

Similar documents

We Need Your Support

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks