Summary of statComp, written by Haodong

This is a brief summary for DATA130004, Statistical Computing, in the School of Data Science, Fudan University, fall 2021. The summary contains only the logic flow and the essential parts of the course, and the "importance" is judged by how well I remember. Some details might be omitted. You are supposed to refer to this summary when previewing or reviewing the course or similar material. You are not supposed to use it as a textbook or as a word-by-word record of the lecture notes. The course focuses on applying statistical methods on a computer, especially with R.

1. generate random variables
  1.1. inverse transform
  1.2. acceptance-rejection method
2. Monte Carlo integration and variance reduction
  2.1. Simple MC integration
  2.2. Variance Reduction
    2.2.1. antithetic variables
    2.2.2. control variate
    2.2.3. Antithetic variables as a control variate
    2.2.4. Control variate and linear regression
    2.2.5. importance sampling
    2.2.6. stratified sampling
    2.2.7. Stratified importance sampling
3. MC in statistical inference
  3.1. point estimation
  3.2. confidence interval
  3.3. hypothesis testing
    3.3.1. empirical Type I error rate
    3.3.2. Power of a test
4. Bootstrap
  4.1. bootstrap estimate of distribution
  4.2. point estimation
  4.3. confidence interval
    4.3.1. standard normal distribution
    4.3.2. percentile CI
    4.3.3. Basic bootstrap CI
    4.3.4. Bootstrap t CI
5. Jackknife
  5.1. Bias
  5.2. Standard Error
6. Bayesian statistics and MCMC
  6.1. Bayesian problem set-up
  6.2. Markov chain Monte Carlo
    6.2.1. Metropolis-Hastings sampler
    6.2.2. Metropolis sampler
    6.2.3. Random walk sampler
    6.2.4. independence sampler
    6.2.5. Gibbs sampler
  6.3. Monitoring the convergence (Gelman-Rubin method)
7. EM algorithm
8. Variational inference
  8.1. KL divergence
  8.2. Evidence lower bound (ELBO)
  8.3. The mean-field variational family

1. generate random variables

In this part, we talk about methods for generating random numbers. For certain distributions, like the normal distribution, we need to think about how a computer could generate numbers from the distribution we want. This is something you should bear in mind: one of the key ideas of this course is to think from the point of view of a computer programmer.

1.1. inverse transform

You might have learned in your probability course one interesting theorem whose purpose was unclear for a long time, that is,
Theorem 1 (Probability Integral Transformation). If X is a continuous random variable with cdf F_X(x), then U = F_X(X) ∼ Uniform(0,1).
Thus, given a uniform random number generator and a cdf, we can get the random number using the inverse of the cdf:
1. Derive the inverse function F_X^{-1}(u).
2. Generate u ∼ Unif(0,1).
3. Set x = F_X^{-1}(u).
This method requires that F_X^{-1} is easy to compute.
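For instance, a minimal R sketch (my own example, not from the lecture) that generates Exp(Ī») by inverting F(x) = 1 - e^{-Ī»x}:

```r
# Inverse transform: generate Exp(lambda) from Unif(0,1).
# F(x) = 1 - exp(-lambda * x)  =>  F^{-1}(u) = -log(1 - u) / lambda
n <- 10000
lambda <- 2
u <- runif(n)
x <- -log(1 - u) / lambda

# sanity check against R's built-in generator
summary(x)
summary(rexp(n, rate = lambda))
```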
1.2. acceptance-rejection method

In this method, given a generator with pdf g, Y ∼ g(t), we can generate from our target X ∼ f(t). We propose the algorithm, then we prove it. First we assume that c satisfies f(t)/g(t) ≤ c for all t ∈ R. Then:
1. Generate y ∼ g(t) and u ∼ Unif(0,1).
2. If u < f(y)/(c g(y)), accept y and set x = y; otherwise reject and repeat.
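A small R sketch of the algorithm, with my own choice of target: Beta(2,2) with density f(t) = 6t(1-t), envelope g = Unif(0,1), so f(t)/g(t) ≤ c = 1.5.

```r
# Acceptance-rejection: target Beta(2,2) with f(t) = 6*t*(1-t), envelope Unif(0,1).
# f(t)/g(t) = 6*t*(1-t) <= 1.5 = c, attained at t = 0.5.
n <- 10000
c_const <- 1.5
samples <- numeric(n)
i <- 0
while (i < n) {
  y <- runif(1)                               # proposal from g
  u <- runif(1)
  if (u < 6 * y * (1 - y) / (c_const * 1)) {  # g(y) = 1 on (0,1)
    i <- i + 1
    samples[i] <- y                           # accept
  }
}
mean(samples)   # should be close to 0.5, the mean of Beta(2,2)
```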
Now we prove x is actually from target f:
P(X=y | accept) = P(Y=y | accept) = P(accept | Y=y) P(Y=y) / P(accept),
where
P(accept | Y=y) = P(u < f(y)/(c g(y)) | Y=y) = f(y)/(c g(y)),
P(accept) = āˆ‘_y P(accept | Y=y) P(Y=y) = āˆ‘_y [f(y)/(c g(y))] ā‹… g(y) = 1/c.
Thus P(X=y | accept) = [f(y)/(c g(y))] ā‹… g(y) / (1/c) = f(y).
The continuous case is shown in the homework.

2. Monte Carlo integration and variance reduction

In this part, we solve the integration problem šœƒ = ∫ g(x) dx. For x_1,⋯,x_m iid ∼ X, the empirical average E_m[g(x)] = (1/m) āˆ‘ g(x_i) estimates the population mean, i.e., the expectation E g(X) = ∫ g(x) f(x) dx, where X ∼ f. Thus we introduce Monte Carlo integration.
The Monte Carlo method is widely used. When we say MC, we mean simulating many times to estimate the target. For example, in this part, to estimate the integral, we simulate many random numbers to approximate the result.

2.1. Simple MC integration

To estimate šœƒ = ∫_0^1 g(x) dx with X ∼ Unif(0,1):
1. Generate x_1,⋯,x_m iid ∼ X.
2. šœƒĢ‚ = (1/m) āˆ‘_{i=1}^m g(x_i).
Notice that the integration is limited to the range 0 to 1. We can generalize to the range [a,b] by the change of variable y = (x-a)/(b-a), or just generate X ∼ Unif(a,b) and use šœƒĢ‚ = (b-a) (1/m) āˆ‘_{i=1}^m g(x_i).
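A minimal R sketch (my own toy integrand): estimate šœƒ = ∫_0^1 e^{-x} dx, whose true value is 1 - e^{-1}.

```r
# Simple MC: estimate theta = integral_0^1 exp(-x) dx = 1 - exp(-1)
m <- 10000
x <- runif(m)
theta_hat <- mean(exp(-x))
c(estimate = theta_hat, truth = 1 - exp(-1))

# generalizing to [a, b]: theta_hat = (b - a) * mean(g(runif(m, a, b)))
```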
2.2. Variance Reduction

Although simple MC integration gives an unbiased estimator of šœƒ, we can use better estimators with smaller variance.

2.2.1. antithetic variables

In simple MC, we use independent random variables. However, by using dependent variables, we might reduce the variance. Suppose Y and Z have the same distribution as X but are dependent. Then
Var((Y+Z)/2) = (1/4){Var(Y) + Var(Z) + 2Cov(Y,Z)},
so if Cov(Y,Z) < 0, the variance can be reduced.
We know that if U ∼ Unif(0,1), then 1-U ∼ Unif(0,1), and U and 1-U are negatively correlated. And we can expect that

Corollary 1. If g(X) = g(X_1,⋯,X_n) is monotone, then Y = g(F_X^{-1}(u_1),⋯,F_X^{-1}(u_n)) and Y' = g(F_X^{-1}(1-u_1),⋯,F_X^{-1}(1-u_n)) are negatively correlated.

The proof is omitted here. Then, instead of generating m Unif(0,1) random variables, we only need m/2 generations, and for j = 1,⋯,m/2 we define Y_j = g(F_X^{-1}(u_1^{(j)}),⋯,F_X^{-1}(u_n^{(j)})) and Y_j' = g(F_X^{-1}(1-u_1^{(j)}),⋯,F_X^{-1}(1-u_n^{(j)})); then šœƒĢ‚ = (1/m) āˆ‘_{j=1}^{m/2} (Y_j + Y_j').
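A short R sketch of the antithetic estimator for the same toy integral ∫_0^1 e^{-x} dx (my own example):

```r
# Antithetic variables for theta = integral_0^1 exp(-x) dx
m <- 10000
u  <- runif(m / 2)
y1 <- exp(-u)          # g(F^{-1}(u)) with X ~ Unif(0,1)
y2 <- exp(-(1 - u))    # antithetic counterpart
theta_anti <- mean((y1 + y2) / 2)

# rough comparison of the estimator variances
var_simple <- var(exp(-runif(m))) / m
var_anti   <- var((y1 + y2) / 2) / (m / 2)
c(theta_anti = theta_anti, var_simple = var_simple, var_anti = var_anti)
```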
2.2.2. control variate

In this part, we still try to use the benefits of correlation. Suppose there is a function f with šœ‡ = E f(x) known, and f is correlated with g. Then define šœƒĢ‚_c = g(x) + c(f(x) - šœ‡). šœƒĢ‚_c is still an unbiased estimator of šœƒ, and
Var(šœƒĢ‚_c) = Var(g(x)) + 2cā‹…Cov(g(x), f(x)) + c^2 Var(f(x)).
Letting c* = -Cov(g(x), f(x)) / Var(f(x)), we minimize the variance:
Var(šœƒĢ‚_{c*}) = Var(g(x)) - Cov^2(g(x), f(x)) / Var(f(x)).
The percentage of reduction is
{Var(g(x)) - Var(šœƒĢ‚_{c*})} / Var(g(x)) = Cor^2(g(x), f(x)) Ɨ 100%.
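A small R sketch (my own example): estimate šœƒ = E[e^U] with U ∼ Unif(0,1), using the control variate f(U) = U, whose mean šœ‡ = 1/2 is known.

```r
# Control variate: estimate theta = E[exp(U)], U ~ Unif(0,1),
# with control variate f(U) = U and known mu = 1/2.
m  <- 10000
u  <- runif(m)
gx <- exp(u)
fx <- u
c_star   <- -cov(gx, fx) / var(fx)            # estimated optimal c*
theta_cv <- mean(gx + c_star * (fx - 0.5))
c(theta_cv = theta_cv, truth = exp(1) - 1,
  reduction = cor(gx, fx)^2)                  # approximate fraction of variance removed
```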
2.2.3. Antithetic variables as a control variate

We combine the two methods and formulate the control variate as a linear combination of two unbiased estimators: šœƒĢ‚_c = cšœƒĢ‚_1 + (1-c)šœƒĢ‚_2. Suppose šœƒĢ‚_1 and šœƒĢ‚_2 have identical distributions and r = Cor(šœƒĢ‚_1, šœƒĢ‚_2) < 0. Then
Var(šœƒĢ‚_c) = c^2 Var(šœƒĢ‚_1) + 2c(1-c) Cov(šœƒĢ‚_1, šœƒĢ‚_2) + (1-c)^2 Var(šœƒĢ‚_2)
         = Var(šœƒĢ‚_1) {c^2 + 2c(1-c)r + (1-c)^2}
         = Var(šœƒĢ‚_1) {(2-2r)c^2 - (2-2r)c + 1},
which is minimized at c* = 1/2.
2.2.4. Control variate and linear regression

In the control variate method, suppose we have n samples (f(x_1), g(x_1)),⋯,(f(x_n), g(x_n)). When applying the linear regression g(x) = š›¼ + š›½ f(x) + šœ€, we have the following four important properties.
1. The OLS estimators are
š›¼Ģ‚ = gĢ„(x) - š›½Ģ‚ fĢ„(x),  š›½Ģ‚ = Cov(f(x), g(x)) / Var(f(x)) = -c*.
2. The predicted value at šœ‡ = E f(x) is the control variate estimator of the target integral:
š›¼Ģ‚ + š›½Ģ‚šœ‡ = gĢ„(x) - š›½Ģ‚(fĢ„(x) - šœ‡) = gĢ„(x) + c*(fĢ„(x) - šœ‡) = šœƒĢ‚_{c*}.
3. The variance of the control variate estimator is the residual mean squared error (MSE):
Var(gĢ„(x) + c*(fĢ„(x) - šœ‡)) = (1/n) Var(g(x) + c*(f(x) - šœ‡)) = (1/n) Var(g(x) - š›½f(x) - š›¼) = šœŽ_šœ€^2 / n.
4. The percentage of improvement, Cor^2(g(x), f(x)) Ɨ 100%, is the coefficient of determination.

2.2.5. importance sampling

Simple MC integration, ((b-a)/m) āˆ‘_{i=1}^m g(X_i), weights the interval [a,b] uniformly: the replicates X_1,⋯,X_m are uniformly distributed on [a,b]. Now we consider other weight functions.
Algorithm 1 (Importance sampling)
1. Decide an "envelope" f(x).
2. For i = 1,⋯,m:
  (a) generate x_i ∼ f;
  (b) record g(x_i)/f(x_i).
3. šœƒĢ‚_{IS} = (1/m) āˆ‘_{i=1}^m g(x_i)/f(x_i).

Let X be a r.v. with density f such that f(x) > 0 on the support of g. Set Y = g(X)/f(X). Then
šœƒ = ∫ g(x) dx = ∫ [g(x)/f(x)] f(x) dx = E_f[g(X)/f(X)].
Thus we can use šœƒĢ‚_{IS} = (1/m) āˆ‘_{i=1}^m g(x_i)/f(x_i) to estimate šœƒ.
Now we analyze the variance:
Var(šœƒĢ‚_{IS}) = Var((1/m) āˆ‘_{i=1}^m Y_i) = (1/m) Var(Y)
           = (1/m) Var(g(x)/f(x)) = (1/m) {E[(g(x)/f(x))^2] - (E[g(x)/f(x)])^2},
where E[g(x)/f(x)] = ∫ [g(x)/f(x)] f(x) dx = šœƒ and E[(g(x)/f(x))^2] = ∫ g^2(x)/f(x) dx. Hence
Var(šœƒĢ‚_{IS}) = (1/m) {∫ g^2(x)/f(x) dx - šœƒ^2},
where, by the Cauchy-Schwarz inequality,
∫ g^2(x)/f(x) dx = {∫ g^2(x)/f(x) dx}{∫ f(x) dx} ≄ (∫ |g(x)| dx)^2.
Equality holds iff f(x) āˆ |g(x)|.
2.2.6. stratified sampling

Algorithm 2 (Stratified sampling)
1. Divide [0,1] into k strata, where the j-th stratum is I_j = ((j-1)/k, j/k).
2. On each stratum, for i = 1,⋯,m_j:
  (a) generate x_i^{(j)} ∼ Unif(I_j), with density f_j(x) = kā‹…1(x ∈ I_j);
  (b) šœƒĢ‚_j = (1/m_j) āˆ‘_{i=1}^{m_j} g(x_i^{(j)}).
3. šœƒĢ‚_S = (1/k) āˆ‘_{j=1}^k šœƒĢ‚_j.

Note that E šœƒĢ‚_j = E g(x^{(j)}) = ∫ g(x) kā‹…1(x ∈ I_j) dx = k ∫_{I_j} g(x) dx. That is why we need the factor 1/k.
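A minimal R sketch of Algorithm 2 for the toy integral ∫_0^1 e^{-x} dx (my own example), with k = 5 strata:

```r
# Stratified sampling for theta = integral_0^1 exp(-x) dx with k strata
k <- 5
m_j <- 2000
g <- function(x) exp(-x)
theta_j <- numeric(k)
for (j in 1:k) {
  xj <- runif(m_j, (j - 1) / k, j / k)   # Unif on stratum I_j
  theta_j[j] <- mean(g(xj))
}
theta_strat <- mean(theta_j)             # (1/k) * sum of stratum estimates
c(estimate = theta_strat, truth = 1 - exp(-1))
```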
Now we show that the variance of stratified sampling is smaller than that of simple MC integration. Denote by šœƒĢ‚_M the simple MC estimator. For simplicity, suppose in stratified sampling each stratum has an equal number, m, of replicates, and the total number is M = mk. Denote šœƒ_j = E{g(U) | U ∈ I_j} and šœŽ_j^2 = Var{g(U) | U ∈ I_j}. Then
Var(šœƒĢ‚_S) = Var((1/k) āˆ‘_{j=1}^k šœƒĢ‚_j) = (1/k^2) āˆ‘_{j=1}^k Var(šœƒĢ‚_j)
         = (1/k^2) āˆ‘_{j=1}^k šœŽ_j^2 / m = (1/(Mk)) āˆ‘_{j=1}^k šœŽ_j^2.
Consider a two-step experiment, where J is discrete uniform on {1,⋯,k}: for i = 1,⋯,M,
1. draw J;
2. generate U from I_J.
Then
Var(šœƒĢ‚_M) = (1/M) Var(g(U))
         = (1/M) [Var{E(g(U)|J)} + E{Var(g(U)|J)}]
         = (1/M) [Var(šœƒ_J) + (1/k) āˆ‘_{j=1}^k šœŽ_j^2]
         = (1/(Mk)) āˆ‘_{j=1}^k šœŽ_j^2 + (1/M) Var(šœƒ_J) ≄ Var(šœƒĢ‚_S).
Equality holds iff Var(šœƒ_J) = 0, i.e., šœƒ_1 = ⋯ = šœƒ_k.

2.2.7. Stratified importance sampling
Algorithm 3 (Stratified importance sampling)
1. Choose an importance function f.
2. Divide the real line into k strata, where the j-th stratum is I_j = (a_{j-1}, a_j), with a_0 = -āˆž, a_j = F^{-1}(j/k) and a_k = +āˆž.
3. On stratum j, define g_j(x) = g(x) 1(x ∈ I_j) and f_j(x) = f_{X|I_j}(x | I_j) = f(x) 1(x ∈ I_j) / ∫ f(x) 1(x ∈ I_j) dx = k f(x) 1(x ∈ I_j). For i = 1,⋯,m:
  (a) generate x_i^{(j)} ∼ f_j(x);
  (b) šœƒĢ‚_j = (1/m) āˆ‘_{i=1}^m g(x_i^{(j)}) / f_j(x_i^{(j)}).
4. šœƒĢ‚_{SIS} = āˆ‘_{j=1}^k šœƒĢ‚_j.

Note that šœƒ_j = ∫ g_j(x) dx = ∫_{I_j} g(x) dx, thus šœƒ = āˆ‘_j šœƒ_j. Now we prove that Var(šœƒĢ‚_{SIS}) ≤ Var(šœƒĢ‚_{IS}). On I_j, denote šœƒ_j = ∫_{I_j} g(x) dx = E{g_j(X)/f_j(X)} and šœŽ_j^2 = Var{g_j(X)/f_j(X)}, where X ∼ f_j. Then
Var(šœƒĢ‚_{SIS}) = Var(āˆ‘_j šœƒĢ‚_j) = āˆ‘_j Var(šœƒĢ‚_j) = āˆ‘_j šœŽ_j^2 / m = (k/M) āˆ‘_j šœŽ_j^2,
Var(šœƒĢ‚_{IS}) = (1/M) Var(Y) = šœŽ^2 / M, where šœŽ^2 = Var(g(X)/f(X)) with X ∼ f and M = mk.
Next we show that šœŽ^2 - k āˆ‘_j šœŽ_j^2 ≄ 0. Consider the two-stage experiment: given J = j, we generate x* from f_j and set Y* = g_j(x*)/f_j(x*) = g(x*)/(k f(x*)). Then x* has the same distribution as X, and kY* has the same distribution as Y, and
Var(Y*) = E{Var(Y*|J)} + Var{E(Y*|J)},
where E{Var(Y*|J)} = E(šœŽ_J^2) = (1/k) āˆ‘_j šœŽ_j^2 and Var{E(Y*|J)} = Var(šœƒ_J).
Then šœŽ2=Var(Y)=k2Var(Y*)=k2Var(šœƒJ)+kāˆ‘šœŽ2j≄kāˆ‘šœŽ2j3. MC in statistical inference 3.1. point estimationšœƒ=1
m
āˆ‘šœƒ(j)
se(āØx)=a
1
n
{āˆ‘(xi-āØx)2}1
2
1
n
{1
n-1
āˆ‘(xi-āØx)2}1
2
mse=1
m
āˆ‘(šœƒ(j)-šœƒ)2
3.2. confidence interval
Algorithm 4 (Monte Carlo confidence interval)
1. For each replicate j = 1,⋯,m:
  (a) generate the j-th random sample X_1^{(j)},⋯,X_n^{(j)};
  (b) compute the confidence interval C_j for the j-th sample;
  (c) compute y_j = 1(šœƒ ∈ C_j) for the j-th sample.
2. Compute the empirical confidence level ȳ = (1/m) āˆ‘ y_j.
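A small R sketch of Algorithm 4 (my own example): the empirical coverage of the usual t interval for a normal mean.

```r
# Empirical coverage of the t confidence interval for a normal mean
m <- 5000
n <- 20
mu <- 0
covered <- replicate(m, {
  x  <- rnorm(n, mean = mu, sd = 2)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)   # empirical confidence level, should be near 0.95
```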
3.3. hypothesis testing

3.3.1. empirical Type I error rate
Algorithm 5 (MC Type I error rate)
1. For each replicate, indexed by j = 1,⋯,m:
  (a) generate the j-th random sample x_1^{(j)},⋯,x_n^{(j)} from the null distribution;
  (b) compute the test statistic T_j from the j-th sample;
  (c) record the test decision I_j = 1 if H_0 is rejected at significance level š›¼, and otherwise I_j = 0.
2. Compute the proportion of significant tests, (1/m) āˆ‘ I_j. This proportion is the observed Type I error rate.
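A minimal sketch of Algorithm 5 (my own example: the one-sample t-test under a true null):

```r
# Empirical Type I error of the one-sample t-test, H0: mu = 0 (true here)
m <- 5000
n <- 20
alpha <- 0.05
reject <- replicate(m, t.test(rnorm(n, mean = 0), mu = 0)$p.value < alpha)
mean(reject)   # observed Type I error rate, should be near alpha
```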
3.3.2. Power of a test
Algorithm 6 (MC power of a test)
1. Select a particular šœƒ_1 āˆˆ š›©_1.
2. For each replicate, indexed by j = 1,⋯,m:
  (a) generate the j-th random sample x_1^{(j)},⋯,x_n^{(j)} under šœƒ_1;
  (b) compute the test statistic T_j from the j-th sample;
  (c) record the test decision I_j = 1 if H_0 is rejected at significance level š›¼, and otherwise I_j = 0.
3. Compute the proportion of significant tests, šœ‹Ģ‚(šœƒ_1) = (1/m) āˆ‘ I_j.
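A sketch of Algorithm 6 for the one-sample t-test (my own example); power.t.test gives a theoretical value to compare against.

```r
# Empirical power of the one-sample t-test at a specific alternative mu1
m <- 5000
n <- 20
alpha <- 0.05
mu1 <- 0.5
reject <- replicate(m, t.test(rnorm(n, mean = mu1, sd = 1), mu = 0)$p.value < alpha)
mean(reject)   # estimated power pi(mu1)

# rough theoretical benchmark
power.t.test(n = n, delta = mu1, sd = 1, sig.level = alpha, type = "one.sample")$power
```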
4. Bootstrap

4.1. bootstrap estimate of distribution
Algorithm 7 (bootstrap estimate of distribution)
1. For each bootstrap replicate b = 1,⋯,B:
  (a) generate a sample x^{*(b)} = (x_1^{*(b)},⋯,x_n^{*(b)}) by sampling with replacement from the observations x_1,⋯,x_n;
  (b) compute the b-th replicate šœƒĢ‚^{(b)} using x^{*(b)}.
2. The bootstrap estimate of F_šœƒĢ‚ is the empirical distribution of šœƒĢ‚^{(1)},⋯,šœƒĢ‚^{(B)}.
4.2. point estimation

1. se of šœƒĢ‚: se_B ā‰œ se(šœƒĢ‚*) = {(1/(B-1)) āˆ‘_{b=1}^B (šœƒĢ‚^{(b)} - šœƒĢ„*)^2}^{1/2}, where šœƒĢ„* = (1/B) āˆ‘_{b=1}^B šœƒĢ‚^{(b)}.
2. bias of šœƒĢ‚: bias(šœƒĢ‚) = (1/B) āˆ‘_{b=1}^B šœƒĢ‚^{(b)} - šœƒĢ‚ = šœƒĢ„* - šœƒĢ‚.
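A minimal R sketch (toy data of my own) of the bootstrap se and bias of the sample median:

```r
# Bootstrap estimates of the standard error and bias of the sample median
set.seed(1)
x <- rexp(50, rate = 1)            # toy observed data
B <- 2000
theta_hat  <- median(x)
theta_boot <- replicate(B, median(sample(x, replace = TRUE)))
se_boot   <- sd(theta_boot)                 # sd() uses the 1/(B-1) version
bias_boot <- mean(theta_boot) - theta_hat
c(theta_hat = theta_hat, se = se_boot, bias = bias_boot)
```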
4.3. confidence interval

Now we use the bootstrap to estimate confidence intervals.

4.3.1. standard normal distribution
Use the approximate normality: [šœƒĢ‚ ± 1.96 se_B(šœƒĢ‚)].

4.3.2. percentile CI
Use the sample quantiles of the bootstrap replicates: [šœƒĢ‚*_{š›¼/2}, šœƒĢ‚*_{1-š›¼/2}].

4.3.3. Basic bootstrap CI
Suppose (L,U) is the confidence interval, i.e., P(L ≄ šœƒ) = P(U ≤ šœƒ) = š›¼/2. Then
š›¼/2 = P(L ≄ šœƒ) = P(L - šœƒĢ‚ ≄ šœƒ - šœƒĢ‚) = P(šœƒĢ‚ - šœƒ ≄ šœƒĢ‚ - L).
Thus šœƒĢ‚ - L is the 1-š›¼/2 quantile of šœƒĢ‚ - šœƒ. We can estimate the 1-š›¼/2 quantile of šœƒĢ‚ using the bootstrap replicate šœƒĢ‚*_{1-š›¼/2}; then šœƒĢ‚*_{1-š›¼/2} - šœƒĢ‚ is approximately equal to the 1-š›¼/2 quantile of šœƒĢ‚ - šœƒ. Setting šœƒĢ‚ - L = šœƒĢ‚*_{1-š›¼/2} - šœƒĢ‚, we have L = 2šœƒĢ‚ - šœƒĢ‚*_{1-š›¼/2}.
Similarly, we have U = 2šœƒĢ‚ - šœƒĢ‚*_{š›¼/2}. Thus the CI is [2šœƒĢ‚ - šœƒĢ‚*_{1-š›¼/2}, 2šœƒĢ‚ - šœƒĢ‚*_{š›¼/2}].

4.3.4. Bootstrap t CI
Algorithm 8 (Bootstrap t confidence interval)
1. Compute šœƒĢ‚ from the observed data.
2. For each bootstrap replicate b = 1,⋯,B:
  (a) sample with replacement x^{(b)} = (x_1^{(b)},⋯,x_n^{(b)});
  (b) compute šœƒĢ‚^{(b)} from x^{(b)};
  (c) estimate se(šœƒĢ‚^{(b)}) (another layer of bootstrap: resample from x^{(b)}, not from x);
  (d) compute the b-th replicate of the t* distribution, t^{(b)} = (šœƒĢ‚^{(b)} - šœƒĢ‚) / se(šœƒĢ‚^{(b)}).
3. Find the sample quantiles t*_{š›¼/2} and t*_{1-š›¼/2} from {t^{(b)}}_{b=1}^B.
4. Compute se(šœƒĢ‚) from {šœƒĢ‚^{(b)}}_{b=1}^B.
5. The confidence interval is [šœƒĢ‚ - t*_{1-š›¼/2} se(šœƒĢ‚), šœƒĢ‚ - t*_{š›¼/2} se(šœƒĢ‚)].
5. Jackknife

The jackknife is similar to the leave-one-out method. In each sample, the jackknife leaves out one observation, i.e., the i-th jackknife sample is x_{(i)} = (x_1,⋯,x_{i-1},x_{i+1},⋯,x_n). The i-th jackknife estimate is šœƒĢ‚_{(i)} = šœƒĢ‚(x_{(i)}).

5.1. Bias

The jackknife bias estimate is
bias_jack = (n-1)(šœƒĢ„_{(ā‹…)} - šœƒĢ‚), where šœƒĢ„_{(ā‹…)} = (1/n) āˆ‘_{i=1}^n šœƒĢ‚_{(i)}.
Why n-1? For example, for šœƒĢ‚ = (1/n) āˆ‘_{i=1}^n (x_i - xĢ„)^2 we have bias(šœƒĢ‚) = E(šœƒĢ‚) - šœŽ^2 = ((n-1)/n) šœŽ^2 - šœŽ^2 = -šœŽ^2/n, while
E(šœƒĢ‚_{(i)} - šœƒĢ‚) = E(šœƒĢ‚_{(i)} - šœƒ) - E(šœƒĢ‚ - šœƒ) = -šœŽ^2/(n-1) + šœŽ^2/n = -šœŽ^2/(n(n-1)) = bias(šœƒĢ‚)/(n-1).
5.2. Standard Error

The jackknife standard error is
se_jack = {((n-1)/n) āˆ‘_{i=1}^n (šœƒĢ‚_{(i)} - šœƒĢ„_{(ā‹…)})^2}^{1/2}.
Why (n-1)/n? For example, if šœƒĢ‚ = xĢ„, then šœƒĢ‚_{(i)} = (n xĢ„ - x_i)/(n-1) and šœƒĢ„_{(ā‹…)} = xĢ„, so
āˆ‘_{i=1}^n (šœƒĢ‚_{(i)} - šœƒĢ„_{(ā‹…)})^2 = āˆ‘_{i=1}^n ((n xĢ„ - x_i)/(n-1) - xĢ„)^2 = (1/(n-1)^2) āˆ‘_{i=1}^n (xĢ„ - x_i)^2 = s^2/(n-1),
and therefore se_jack = {((n-1)/n) ā‹… s^2/(n-1)}^{1/2} = s/√n, the usual standard error of xĢ„.
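A small R sketch (my own example, the sample mean), where the jackknife se can be checked against s/√n as derived above:

```r
# Jackknife estimates of bias and standard error (toy example: the sample mean)
set.seed(3)
x <- rnorm(30, mean = 1)
n <- length(x)
theta_hat <- mean(x)
theta_i   <- sapply(1:n, function(i) mean(x[-i]))   # leave-one-out estimates
bias_jack <- (n - 1) * (mean(theta_i) - theta_hat)
se_jack   <- sqrt((n - 1) / n * sum((theta_i - mean(theta_i))^2))
c(bias_jack = bias_jack, se_jack = se_jack, se_usual = sd(x) / sqrt(n))
```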
6. Bayesian statistics and MCMC

Bayesian statistics looks at problems in a different way from frequentist statistics. For a frequentist, the parameter is fixed but unknown, and the experiment is repeatable. In the Bayesian view, the parameter is random and unknown, and the experiment is fixed; in other words, the target of Bayesian inference is the posterior p(šœƒ|data), rather than the likelihood p(data|šœƒ).

6.1. Bayesian problem set-up

parameter: šœƒ
data: X
prior distribution of šœƒ: šœ‹(šœƒ)
sampling model for X: f(x|šœƒ)
posterior distribution of šœƒ|X: p(šœƒ|X) = šœ‹(šœƒ) f(X|šœƒ) / ∫ šœ‹(šœƒ) f(X|šœƒ) dšœƒ āˆ prior Ɨ likelihood
6.2. Markov chain Monte Carlo

Construct a Markov chain {X_t : t = 0,1,⋯} whose stationary distribution is the target distribution.

6.2.1. Metropolis-Hastings sampler

Now suppose our target distribution is f. To move from one state to another, we use a proposal Y ∼ g(y|X_t).
Algorithm 9 (MH sampler)
1. Choose a proper proposal g(ā‹…|X_t).
2. Initialize X_0 and repeat until the chain converges. At time t,
  (a) generate Y from g(ā‹…|X_t);
  (b) compute the acceptance rate r(X_t, Y) = f(Y) g(X_t|Y) / (f(X_t) g(Y|X_t));
  (c) let š›¼(X_t, Y) = min{r(X_t, Y), 1} be the acceptance ratio, and set X_{t+1} = Y with probability š›¼(X_t, Y), and X_{t+1} = X_t with probability 1 - š›¼(X_t, Y).
Now we show how the MH sampler works. Suppose X_t ∼ f, and let r, s be two different states. Without loss of generality, assume that f(s)g(r|s) ≄ f(r)g(s|r). Then š›¼(r,s) = 1 and š›¼(s,r) = f(r)g(s|r) / (f(s)g(r|s)), and
P(X_t = s, X_{t+1} = r) = P(X_t = s) P(X_{t+1} = r | X_t = s) = f(s) g(r|s) š›¼(s,r) = f(r) g(s|r),
P(X_t = r, X_{t+1} = s) = P(X_t = r) P(X_{t+1} = s | X_t = r) = f(r) g(s|r) š›¼(r,s) = f(r) g(s|r),
⟹ P(X_t = s, X_{t+1} = r) = P(X_t = r, X_{t+1} = s).
Thus
P(X_t = r) = āˆ‘_s P(X_t = r, X_{t+1} = s) = āˆ‘_s P(X_t = s, X_{t+1} = r) = P(X_{t+1} = r).
The Markov chain is stationary at f. ā–”
We can also prove it from a kernel point of view. Define K(r,s) = g(s|r) š›¼(r,s) to be the transition kernel from r to s. We call f the stationary distribution if the balance condition ∫ f(r) K(r,s) dr = f(s) holds. Under the MH sampler,
f(r) K(r,s) = f(r) g(s|r) š›¼(r,s)
            = f(r) g(s|r) min{1, f(s)g(r|s) / (f(r)g(s|r))}
            = min{f(r)g(s|r), f(s)g(r|s)}
            = f(s) g(r|s) š›¼(s,r)
            = f(s) K(s,r).
We have the detailed balance condition f(r)K(r,s) = f(s)K(s,r). Then ∫ f(r)K(r,s) dr = ∫ f(s)K(s,r) dr = f(s) ∫ K(s,r) dr = f(s).

6.2.2. Metropolis sampler
With a symmetric proposal g(X|Y) = g(Y|X), r(X_t, Y) = f(Y)/f(X_t).
6.2.3. Random walk sampler
Proposal g(Y|X_t) = g(|Y - X_t|), for example Y = X_t + N(0, šœŽ^2). Then r(X_t, Y) = f(Y)/f(X_t).
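A minimal R sketch of a random walk Metropolis sampler; the standard Laplace target f(x) āˆ e^{-|x|} and the N(0, šœŽ^2) increment are my own choices.

```r
# Random walk Metropolis sampler for the standard Laplace target f(x) ~ exp(-|x|)
set.seed(4)
N <- 10000
sigma <- 2
x <- numeric(N)
x[1] <- 0
for (t in 2:N) {
  y <- x[t - 1] + rnorm(1, 0, sigma)          # random walk proposal
  r <- exp(-abs(y)) / exp(-abs(x[t - 1]))     # f(Y) / f(X_t); constants cancel
  x[t] <- ifelse(runif(1) < r, y, x[t - 1])   # accept, or stay at the current state
}
burn <- 1000
mean(abs(x[-(1:burn)]))   # E|X| = 1 for the standard Laplace
```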
6.2.4. independence sampler
Proposal g(Y|X_t) = g(Y). Then r(X_t, Y) = f(Y) g(X_t) / (f(X_t) g(Y)).
6.2.5. Gibbs sampler

Now suppose we generate samples from a multivariate f(x), where x āˆˆ šœ’ āŠ‚ R^d. We partition the d-dimensional vector x into K disjoint blocks, denoted x = (x_1,⋯,x_K)^T with K ≤ d, and write the full conditionals as f_k(x_k | x_{-k}) = f_k(x_k | x_1,⋯,x_{k-1},x_{k+1},⋯,x_K), k = 1,⋯,K.
Algorithm 10 (Gibbs sampler)
1. Start with an arbitrary point x^{(0)} āˆˆ šœ’ with f(x^{(0)}) > 0.
2. At time t,
  (1) generate x_1^{(t)} ∼ f_1(x_1 | x_2^{(t-1)},⋯,x_K^{(t-1)});
  ā‹®
  (k) generate x_k^{(t)} ∼ f_k(x_k | x_1^{(t)},⋯,x_{k-1}^{(t)}, x_{k+1}^{(t-1)},⋯,x_K^{(t-1)});
  ā‹®
  (K) generate x_K^{(t)} ∼ f_K(x_K | x_1^{(t)},⋯,x_{K-1}^{(t)}).
3. Set x^{(t)} = (x_1^{(t)},⋯,x_K^{(t)})^T.
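A small R sketch of a Gibbs sampler for a standard bivariate normal with correlation ρ (a toy target of my own; its full conditionals are N(ρx_2, 1-ρ^2) and N(ρx_1, 1-ρ^2)):

```r
# Gibbs sampler for a standard bivariate normal with correlation rho
set.seed(5)
N <- 10000
rho <- 0.8
x <- matrix(0, N, 2)
for (t in 2:N) {
  # draw each coordinate from its full conditional, using the freshest values
  x[t, 1] <- rnorm(1, mean = rho * x[t - 1, 2], sd = sqrt(1 - rho^2))
  x[t, 2] <- rnorm(1, mean = rho * x[t, 1],     sd = sqrt(1 - rho^2))
}
cor(x[-(1:1000), 1], x[-(1:1000), 2])   # should be close to rho
```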
The Gibbs sampler is a special case of the MH sampler, with acceptance ratio 1. The proof can be found in homework 7.

6.3. Monitoring the convergence (Gelman-Rubin method)

In MCMC, the chain may be trapped in some local mode, so we can run multiple chains from different initial points to check the convergence. Recall ANOVA in DATA130046 Statistics II. Suppose there are J chains, and n is the number of draws in each chain after discarding the burn-in period. šœ“ is a function of the data. Write
šœ“_i^{(j)} = šœ“(X_1^{(j)},⋯,X_i^{(j)}), i = 1,⋯,n; j = 1,⋯,J,
šœ“Ģ„^{(j)} = (1/n) āˆ‘_{i=1}^n šœ“_i^{(j)},   šœ“Ģ„ = (1/J) āˆ‘_{j=1}^J šœ“Ģ„^{(j)}.
Then the between-sequence variance is
B = n/(J-1) āˆ‘_{j=1}^J (šœ“Ģ„^{(j)} - šœ“Ģ„)^2,
and the within-sequence variance is
W = (1/J) āˆ‘_{j=1}^J s_j^2, where s_j^2 = (1/(n-1)) āˆ‘_{i=1}^n (šœ“_i^{(j)} - šœ“Ģ„^{(j)})^2.
The Gelman-Rubin statistic is
RĢ‚ = {((n-1)/n) W + (1/n) B} / W,
which should decrease and converge to 1 if the chain converges well. A recommended threshold is 1.1.
C(šœƒ,šœƒ*)-C(šœƒ*,šœƒ*)=E{logf(U|Y,šœƒ)
f(U|Y,šœƒ*)
|Y=y,šœƒ*}
≤logE{f(U|Y,šœƒ)
f(U|Y,šœƒ*)
|Y=y,šœƒ*}
=log{f(u|y,šœƒ)
f(u|y,šœƒ*)
f(u|y,šœƒ*)du}
=0
Thus ā„“(šœƒ)-ā„“(šœƒ*)=Q(šœƒ,šœƒ*)-Q(šœƒ*,šœƒ*). Given šœƒ*, if we find šœƒ that Q(šœƒ,šœƒ*)≄Q(šœƒ*,šœƒ*), then ā„“(šœƒ)≄ℓ(šœƒ*).
Algorithm 11 (The EM algorithm)
1. E-step: compute Q(šœƒ, šœƒ^{(t)}) = E_U{log f(Y,U|šœƒ) | Y=y, šœƒ^{(t)}}.
2. M-step: šœƒ^{(t+1)} = argmax_šœƒ Q(šœƒ, šœƒ^{(t)}).
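A minimal R sketch of EM on a toy two-component normal mixture with unit variances (my own example; the unobserved U are the component labels, and the E- and M-steps have closed forms here):

```r
# EM for a two-component normal mixture:
# f(y | theta) = p * dnorm(y, mu1, 1) + (1 - p) * dnorm(y, mu2, 1)
set.seed(6)
y <- c(rnorm(150, 0), rnorm(150, 3))          # simulated data
p <- 0.5; mu1 <- -1; mu2 <- 1                 # initial values
for (t in 1:200) {
  # E-step: posterior probability that each observation comes from component 1
  w <- p * dnorm(y, mu1) / (p * dnorm(y, mu1) + (1 - p) * dnorm(y, mu2))
  # M-step: maximize Q(theta, theta_t); closed-form updates in this model
  p   <- mean(w)
  mu1 <- sum(w * y) / sum(w)
  mu2 <- sum((1 - w) * y) / sum(1 - w)
}
c(p = p, mu1 = mu1, mu2 = mu2)   # should be near 0.5, 0 and 3
```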
8. Variational inference

To approximate the posterior p(z|x) = šœ‹(z) f(x|z) / f(x) in Bayesian statistics, rather than generating random samples using MCMC, we can solve the problem by optimization, i.e., find the density closest to p(z|x). To judge the distance between our density q and the target p, we can use the Kullback-Leibler divergence.
8.1. KL divergence

KL(f‖g) = E_f{log[f(x)/g(x)]} = ∫ log[f(x)/g(x)] f(x) dx.
Properties:
1. KL divergence is non-negative, and KL = 0 iff f = g.
2. KL is not a distance, i.e., KL(f‖g) ≠ KL(g‖f).
Proof of Property 1: by Jensen's inequality,
-KL(f‖g) = E_f{log[g(x)/f(x)]} ≤ log{E_f[g(x)/f(x)]} = log ∫ [g(x)/f(x)] f(x) dx = log ∫ g(x) dx = 0.
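A quick numeric check of the two properties on a pair of discrete distributions (my own toy values):

```r
# Numeric check of the KL properties on two discrete distributions
f <- c(0.2, 0.5, 0.3)
g <- c(0.4, 0.4, 0.2)
kl <- function(f, g) sum(f * log(f / g))
c(KL_fg = kl(f, g), KL_gf = kl(g, f), KL_ff = kl(f, f))
# KL_fg and KL_gf are both >= 0 but unequal (not symmetric); KL_ff = 0
```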
8.2. Evidence lower bound (ELBO)

Suppose we have a family Q of densities over the latent variables; then we want
q*(z) = argmin_{q∈Q} KL(q(z) ‖ p(z|x)).
KL(q(z) ‖ p(z|x)) = E_q{log[q(z)/p(z|x)]}
                  = E_q{log q(z)} - E_q{log p(z|x)}
                  = E_q{log q(z)} - E_q{log f(x,z)} + E_q{log f(x)}
                  = log f(x) - ELBO(q).
Define the evidence lower bound (ELBO) as ELBO(q) ā‰œ E_q{log f(x,z)} - E_q{log q(z)}. Then
log f(x) = KL(q(z) ‖ p(z|x)) + ELBO(q) ≄ ELBO(q).
To minimize KL(q(z) ‖ p(z|x)), we maximize ELBO(q).
ELBO(q) = E_q{log f(x,z)} - E_q{log q(z)}
        = E_q{log f(x|z)} + E_q{log šœ‹(z)} - E_q{log q(z)}
        = E_q{log f(x|z)} - E_q{log[q(z)/šœ‹(z)]}
        = E_q{log f(x|z)} - KL(q(z) ‖ šœ‹(z)).
To maximize ELBO(q), on one hand we maximize E_q{log f(x|z)}, i.e., find a q that fits the data; on the other hand we minimize KL(q(z) ‖ šœ‹(z)), i.e., find a q that stays close to the prior.

8.3. The mean-field variational family

Now we consider the special case where the family is Q ā‰œ {q : q(z) = āˆ_{j=1}^m q_j(z_j)}, i.e., the components of z are treated as "independent" under q. Then
ELBO(q) = E_q{log f(x,z)} - E_q{log q(z)} = ∫ āˆ_{j=1}^m q_j(z_j) {log f(x,z) - āˆ‘_{j=1}^m log q_j(z_j)} dz.
We use a method called coordinate ascent to maximize ELBO(q), i.e., fix all the other factors and climb on q_k. Then
ELBO(q) = ∫ q_k {∫ log f(x,z) āˆ_{j≠k} q_j dz_{-k}} dz_k - ∫ q_k log q_k dz_k + c_1,
where c_1 is a constant that does not depend on q_k. Define log pĢƒ(x,z_k) = ∫ log f(x,z) āˆ_{j≠k} q_j dz_{-k} + c_2 = E_{-k}{log f(x,z)} + c_2, where E_{-k} denotes the expectation over āˆ_{j≠k} q_j; thus
ELBO(q) = ∫ q_k log pĢƒ(x,z_k) dz_k - ∫ q_k log q_k dz_k + c_1 = -KL(q_k ‖ pĢƒ(x,z_k)) + c_3,
and the optimal factor is q_k* = pĢƒ āˆ exp{E_{-k} log f(x,z)}.