Summary of statComp, written by Haodong

This is a brief summary for DATA130004, Statistical Computing, in the School of Data Science, Fudan University, fall 2021. The summary contains only the logic flow and the essential parts of the course, and the "importance" is judged by how well I remember. Some details might be omitted. You are supposed to refer to this summary when previewing or reviewing the course or similar material. You are not supposed to use it as a textbook or as a word-by-word record of the lecture notes. The course focuses on applying statistical methods on a computer, especially with R.

1. generate random variables
  1.1. inverse transform
  1.2. acceptance-rejection method
2. Monte Carlo integration and variance reduction
  2.1. Simple MC integration
  2.2. Variance Reduction
    2.2.1. antithetic variables
    2.2.2. control variate
    2.2.3. Antithetic variables as a control variate
    2.2.4. Control variate and linear regression
    2.2.5. importance sampling
    2.2.6. stratified sampling
    2.2.7. Stratified importance sampling
3. MC in statistical inference
  3.1. point estimation
  3.2. confidence interval
  3.3. hypothesis testing
    3.3.1. empirical Type I error rate
    3.3.2. Power of a test
4. Bootstrap
  4.1. bootstrap estimate of distribution
  4.2. point estimation
  4.3. confidence interval
    4.3.1. standard normal distribution
    4.3.2. percentile CI
    4.3.3. Basic bootstrap CI
    4.3.4. Bootstrap t CI
5. Jackknife
  5.1. Bias
  5.2. Standard Error
6. Bayesian statistics and MCMC
  6.1. Bayesian problem set-up
  6.2. Markov chain Monte Carlo
    6.2.1. Metropolis-Hastings sampler
    6.2.2. Metropolis sampler
    6.2.3. Random walk sampler
    6.2.4. independence sampler
    6.2.5. Gibbs sampler
  6.3. Monitoring the convergence (Gelman-Rubin method)
7. EM algorithm
8. Variational inference
  8.1. KL divergence
  8.2. Evidence lower bound (ELBO)
  8.3. The mean-field variational family

1. generate random variables

In this part, we talk about methods for generating random numbers. For certain distributions, like the normal distribution, we need to think about how a computer could generate numbers from the distribution we want. This is something you should bear in mind: one of the key ideas of this course is to think from the point of view of a computer programmer.

1.1. inverse transform

You might have learned in your probability course one interesting theorem whose purpose was unclear for a long time, that is,
Theorem 1 (Probability Integral Transformation). If X is a continuous random variable with cdf F_X(x), then U = F_X(X) ∼ Uniform(0,1).
Thus, given a uniform random number generator and a cdf, we can get the random number using the inverse of the cdf:
1. Derive the inverse function F_X^{-1}(u).
2. Generate u ∼ Unif(0,1).
3. Set x = F_X^{-1}(u).
This method requires that F_X^{-1} is easy to compute.
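For instance, a minimal R sketch (my own example, not from the lecture) that generates Exp(Ī») by inverting F(x) = 1 - e^{-Ī»x}:

```r
# Inverse transform: generate Exp(lambda) from Unif(0,1).
# F(x) = 1 - exp(-lambda * x)  =>  F^{-1}(u) = -log(1 - u) / lambda
n <- 10000
lambda <- 2
u <- runif(n)
x <- -log(1 - u) / lambda

# sanity check against R's built-in generator
summary(x)
summary(rexp(n, rate = lambda))
```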
1.2. acceptance-rejection method

In this method, given a generator with pdf g, Y ∼ g(t), we can generate from our target X ∼ f(t). We propose the algorithm, then we prove it. First we assume that c satisfies f(t)/g(t) ≤ c for all t ∈ R. Then:
1. Generate y ∼ g(t) and u ∼ Unif(0,1).
2. If u < f(y)/(c g(y)), accept y and set x = y; otherwise reject and repeat.
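A small R sketch of the algorithm, with my own choice of target: Beta(2,2) with density f(t) = 6t(1-t), envelope g = Unif(0,1), so f(t)/g(t) ≤ c = 1.5.

```r
# Acceptance-rejection: target Beta(2,2) with f(t) = 6*t*(1-t), envelope Unif(0,1).
# f(t)/g(t) = 6*t*(1-t) <= 1.5 = c, attained at t = 0.5.
n <- 10000
c_const <- 1.5
samples <- numeric(n)
i <- 0
while (i < n) {
  y <- runif(1)                               # proposal from g
  u <- runif(1)
  if (u < 6 * y * (1 - y) / (c_const * 1)) {  # g(y) = 1 on (0,1)
    i <- i + 1
    samples[i] <- y                           # accept
  }
}
mean(samples)   # should be close to 0.5, the mean of Beta(2,2)
```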
Now we prove x is actually from target f:
P(X=y | accept) = P(Y=y | accept) = P(accept | Y=y) P(Y=y) / P(accept),
where
P(accept | Y=y) = P(u < f(y)/(c g(y)) | Y=y) = f(y)/(c g(y)),
P(accept) = āˆ‘_y P(accept | Y=y) P(Y=y) = āˆ‘_y [f(y)/(c g(y))] ā‹… g(y) = 1/c.
Thus P(X=y | accept) = [f(y)/(c g(y))] ā‹… g(y) / (1/c) = f(y).
The continuous case is shown in the homework.

2. Monte Carlo integration and variance reduction

In this part, we solve the integration problem šœƒ = ∫ g(x) dx. For x_1,⋯,x_m iid ∼ X, the empirical average E_m[g(x)] = (1/m) āˆ‘ g(x_i) estimates the population mean, i.e., the expectation E g(X) = ∫ g(x) f(x) dx, where X ∼ f. Thus we introduce Monte Carlo integration.
The Monte Carlo method is widely used. When we say MC, we mean simulating many times to estimate the target. For example, in this part, to estimate the integral, we simulate many random numbers to approximate the result.

2.1. Simple MC integration

To estimate šœƒ = ∫_0^1 g(x) dx with X ∼ Unif(0,1):
1. Generate x_1,⋯,x_m iid ∼ X.
2. šœƒĢ‚ = (1/m) āˆ‘_{i=1}^m g(x_i).
Notice that the integration is limited to the range 0 to 1. We can generalize to the range [a,b] by the change of variable y = (x-a)/(b-a), or just generate X ∼ Unif(a,b) and use šœƒĢ‚ = (b-a) (1/m) āˆ‘_{i=1}^m g(x_i).
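A minimal R sketch (my own toy integrand): estimate šœƒ = ∫_0^1 e^{-x} dx, whose true value is 1 - e^{-1}.

```r
# Simple MC: estimate theta = integral_0^1 exp(-x) dx = 1 - exp(-1)
m <- 10000
x <- runif(m)
theta_hat <- mean(exp(-x))
c(estimate = theta_hat, truth = 1 - exp(-1))

# generalizing to [a, b]: theta_hat = (b - a) * mean(g(runif(m, a, b)))
```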
2.2. Variance Reduction

Although simple MC integration gives an unbiased estimator of šœƒ, we can use better estimators with smaller variance.

2.2.1. antithetic variables

In simple MC, we use independent random variables. However, by using dependent variables, we might reduce the variance. Suppose Y and Z have the same distribution as X but are dependent. Then
Var((Y+Z)/2) = (1/4){Var(Y) + Var(Z) + 2Cov(Y,Z)},
so if Cov(Y,Z) < 0, the variance can be reduced.
We know that if U ∼ Unif(0,1), then 1-U ∼ Unif(0,1), and U and 1-U are negatively correlated. And we can expect that

Corollary 1. If g(X) = g(X_1,⋯,X_n) is monotone, then Y = g(F_X^{-1}(u_1),⋯,F_X^{-1}(u_n)) and Y' = g(F_X^{-1}(1-u_1),⋯,F_X^{-1}(1-u_n)) are negatively correlated.

The proof is omitted here. Then, instead of generating m Unif(0,1) random variables, we only need m/2 generations, and for j = 1,⋯,m/2 we define Y_j = g(F_X^{-1}(u_1^{(j)}),⋯,F_X^{-1}(u_n^{(j)})) and Y_j' = g(F_X^{-1}(1-u_1^{(j)}),⋯,F_X^{-1}(1-u_n^{(j)})); then šœƒĢ‚ = (1/m) āˆ‘_{j=1}^{m/2} (Y_j + Y_j').
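A short R sketch of the antithetic estimator for the same toy integral ∫_0^1 e^{-x} dx (my own example):

```r
# Antithetic variables for theta = integral_0^1 exp(-x) dx
m <- 10000
u  <- runif(m / 2)
y1 <- exp(-u)          # g(F^{-1}(u)) with X ~ Unif(0,1)
y2 <- exp(-(1 - u))    # antithetic counterpart
theta_anti <- mean((y1 + y2) / 2)

# rough comparison of the estimator variances
var_simple <- var(exp(-runif(m))) / m
var_anti   <- var((y1 + y2) / 2) / (m / 2)
c(theta_anti = theta_anti, var_simple = var_simple, var_anti = var_anti)
```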
2.2.2. control variate

In this part, we still try to use the benefits of correlation. Suppose there is a function f with šœ‡ = E f(x) known, and f is correlated with g. Then define šœƒĢ‚_c = g(x) + c(f(x) - šœ‡). šœƒĢ‚_c is still an unbiased estimator of šœƒ, and
Var(šœƒĢ‚_c) = Var(g(x)) + 2cā‹…Cov(g(x), f(x)) + c^2 Var(f(x)).
Letting c* = -Cov(g(x), f(x)) / Var(f(x)), we minimize the variance:
Var(šœƒĢ‚_{c*}) = Var(g(x)) - Cov^2(g(x), f(x)) / Var(f(x)).
The percentage of reduction is
{Var(g(x)) - Var(šœƒĢ‚_{c*})} / Var(g(x)) = Cor^2(g(x), f(x)) Ɨ 100%.
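A small R sketch (my own example): estimate šœƒ = E[e^U] with U ∼ Unif(0,1), using the control variate f(U) = U, whose mean šœ‡ = 1/2 is known.

```r
# Control variate: estimate theta = E[exp(U)], U ~ Unif(0,1),
# with control variate f(U) = U and known mu = 1/2.
m  <- 10000
u  <- runif(m)
gx <- exp(u)
fx <- u
c_star   <- -cov(gx, fx) / var(fx)            # estimated optimal c*
theta_cv <- mean(gx + c_star * (fx - 0.5))
c(theta_cv = theta_cv, truth = exp(1) - 1,
  reduction = cor(gx, fx)^2)                  # approximate fraction of variance removed
```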
2.2.3. Antithetic variables as a control variate

We combine the two methods and formulate the control variate as a linear combination of two unbiased estimators: šœƒĢ‚_c = cšœƒĢ‚_1 + (1-c)šœƒĢ‚_2. Suppose šœƒĢ‚_1 and šœƒĢ‚_2 have identical distributions and r = Cor(šœƒĢ‚_1, šœƒĢ‚_2) < 0. Then
Var(šœƒĢ‚_c) = c^2 Var(šœƒĢ‚_1) + 2c(1-c) Cov(šœƒĢ‚_1, šœƒĢ‚_2) + (1-c)^2 Var(šœƒĢ‚_2)
         = Var(šœƒĢ‚_1) {c^2 + 2c(1-c)r + (1-c)^2}
         = Var(šœƒĢ‚_1) {(2-2r)c^2 - (2-2r)c + 1},
which is minimized at c* = 1/2.
2.2.4. Control variate and linear regression

In the control variate method, suppose we have n samples (f(x_1), g(x_1)),⋯,(f(x_n), g(x_n)). When applying the linear regression g(x) = š›¼ + š›½ f(x) + šœ€, we have the following four important properties.
1. The OLS estimators are
š›¼Ģ‚ = gĢ„(x) - š›½Ģ‚ fĢ„(x),  š›½Ģ‚ = Cov(f(x), g(x)) / Var(f(x)) = -c*.
2. The predicted value at šœ‡ = E f(x) is the control variate estimator of the target integral:
š›¼Ģ‚ + š›½Ģ‚šœ‡ = gĢ„(x) - š›½Ģ‚(fĢ„(x) - šœ‡) = gĢ„(x) + c*(fĢ„(x) - šœ‡) = šœƒĢ‚_{c*}.
3. The variance of the control variate estimator is the residual mean squared error (MSE):
Var(gĢ„(x) + c*(fĢ„(x) - šœ‡)) = (1/n) Var(g(x) + c*(f(x) - šœ‡)) = (1/n) Var(g(x) - š›½f(x) - š›¼) = šœŽ_šœ€^2 / n.
4. The percentage of improvement, Cor^2(g(x), f(x)) Ɨ 100%, is the coefficient of determination.

2.2.5. importance sampling

Simple MC integration, ((b-a)/m) āˆ‘_{i=1}^m g(X_i), weights the interval [a,b] uniformly: the replicates X_1,⋯,X_m are uniformly distributed on [a,b]. Now we consider other weight functions.
Algorithm 1 (Importance sampling)
1. Decide an "envelope" f(x).
2. For i = 1,⋯,m:
  (a) generate x_i ∼ f;
  (b) record g(x_i)/f(x_i).
3. šœƒĢ‚_{IS} = (1/m) āˆ‘_{i=1}^m g(x_i)/f(x_i).

Let X be a r.v. with density f such that f(x) > 0 on the support of g. Set Y = g(X)/f(X). Then
šœƒ = ∫ g(x) dx = ∫ [g(x)/f(x)] f(x) dx = E_f[g(X)/f(X)].
Thus we can use šœƒĢ‚_{IS} = (1/m) āˆ‘_{i=1}^m g(x_i)/f(x_i) to estimate šœƒ.
Now we analyze the variance:
Var(šœƒĢ‚_{IS}) = Var((1/m) āˆ‘_{i=1}^m Y_i) = (1/m) Var(Y)
           = (1/m) Var(g(x)/f(x)) = (1/m) {E[(g(x)/f(x))^2] - (E[g(x)/f(x)])^2},
where E[g(x)/f(x)] = ∫ [g(x)/f(x)] f(x) dx = šœƒ and E[(g(x)/f(x))^2] = ∫ g^2(x)/f(x) dx. Hence
Var(šœƒĢ‚_{IS}) = (1/m) {∫ g^2(x)/f(x) dx - šœƒ^2},
where, by the Cauchy-Schwarz inequality,
∫ g^2(x)/f(x) dx = {∫ g^2(x)/f(x) dx}{∫ f(x) dx} ≄ (∫ |g(x)| dx)^2.
Equality holds iff f(x) āˆ |g(x)|.
2.2.6. stratified sampling

Algorithm 2 (Stratified sampling)
1. Divide [0,1] into k strata, where the j-th stratum is I_j = ((j-1)/k, j/k).
2. On each stratum, for i = 1,⋯,m_j:
  (a) generate x_i^{(j)} ∼ Unif(I_j), with density f_j(x) = kā‹…1(x ∈ I_j);
  (b) šœƒĢ‚_j = (1/m_j) āˆ‘_{i=1}^{m_j} g(x_i^{(j)}).
3. šœƒĢ‚_S = (1/k) āˆ‘_{j=1}^k šœƒĢ‚_j.

Note that E šœƒĢ‚_j = E g(x^{(j)}) = ∫ g(x) kā‹…1(x ∈ I_j) dx = k ∫_{I_j} g(x) dx. That is why we need the factor 1/k.
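A minimal R sketch of Algorithm 2 for the toy integral ∫_0^1 e^{-x} dx (my own example), with k = 5 strata:

```r
# Stratified sampling for theta = integral_0^1 exp(-x) dx with k strata
k <- 5
m_j <- 2000
g <- function(x) exp(-x)
theta_j <- numeric(k)
for (j in 1:k) {
  xj <- runif(m_j, (j - 1) / k, j / k)   # Unif on stratum I_j
  theta_j[j] <- mean(g(xj))
}
theta_strat <- mean(theta_j)             # (1/k) * sum of stratum estimates
c(estimate = theta_strat, truth = 1 - exp(-1))
```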
Now we show that the variance of stratified sampling is smaller than that of simple MC integration. Denote by šœƒĢ‚_M the simple MC estimator. For simplicity, suppose in stratified sampling each stratum has an equal number, m, of replicates, and the total number is M = mk. Denote šœƒ_j = E{g(U) | U ∈ I_j} and šœŽ_j^2 = Var{g(U) | U ∈ I_j}. Then
Var(šœƒĢ‚_S) = Var((1/k) āˆ‘_{j=1}^k šœƒĢ‚_j) = (1/k^2) āˆ‘_{j=1}^k Var(šœƒĢ‚_j)
         = (1/k^2) āˆ‘_{j=1}^k šœŽ_j^2 / m = (1/(Mk)) āˆ‘_{j=1}^k šœŽ_j^2.
Consider a two-step experiment, where J is discrete uniform on {1,⋯,k}: for i = 1,⋯,M,
1. draw J;
2. generate U from I_J.
Then
Var(šœƒĢ‚_M) = (1/M) Var(g(U))
         = (1/M) [Var{E(g(U)|J)} + E{Var(g(U)|J)}]
         = (1/M) [Var(šœƒ_J) + (1/k) āˆ‘_{j=1}^k šœŽ_j^2]
         = (1/(Mk)) āˆ‘_{j=1}^k šœŽ_j^2 + (1/M) Var(šœƒ_J) ≄ Var(šœƒĢ‚_S).
Equality holds iff Var(šœƒ_J) = 0, i.e., šœƒ_1 = ⋯ = šœƒ_k.

2.2.7. Stratified importance sampling
Algorithm 3 (Stratified importance sampling)
1. Choose an importance function f.
2. Divide the real line into k strata, where the j-th stratum is I_j = (a_{j-1}, a_j), with a_0 = -āˆž, a_j = F^{-1}(j/k) and a_k = +āˆž.
3. On stratum j, define g_j(x) = g(x) 1(x ∈ I_j) and f_j(x) = f_{X|I_j}(x | I_j) = f(x) 1(x ∈ I_j) / ∫ f(x) 1(x ∈ I_j) dx = k f(x) 1(x ∈ I_j). For i = 1,⋯,m:
  (a) generate x_i^{(j)} ∼ f_j(x);
  (b) šœƒĢ‚_j = (1/m) āˆ‘_{i=1}^m g(x_i^{(j)}) / f_j(x_i^{(j)}).
4. šœƒĢ‚_{SIS} = āˆ‘_{j=1}^k šœƒĢ‚_j.

Note that šœƒ_j = ∫ g_j(x) dx = ∫_{I_j} g(x) dx, thus šœƒ = āˆ‘_j šœƒ_j. Now we prove that Var(šœƒĢ‚_{SIS}) ≤ Var(šœƒĢ‚_{IS}). On I_j, denote šœƒ_j = ∫_{I_j} g(x) dx = E{g_j(X)/f_j(X)} and šœŽ_j^2 = Var{g_j(X)/f_j(X)}, where X ∼ f_j. Then
Var(šœƒĢ‚_{SIS}) = Var(āˆ‘_j šœƒĢ‚_j) = āˆ‘_j Var(šœƒĢ‚_j) = āˆ‘_j šœŽ_j^2 / m = (k/M) āˆ‘_j šœŽ_j^2,
Var(šœƒĢ‚_{IS}) = (1/M) Var(Y) = šœŽ^2 / M, where šœŽ^2 = Var(g(X)/f(X)) with X ∼ f and M = mk.
Next we show that šœŽ^2 - k āˆ‘_j šœŽ_j^2 ≄ 0. Consider the two-stage experiment: given J = j, we generate x* from f_j and set Y* = g_j(x*)/f_j(x*) = g(x*)/(k f(x*)). Then x* has the same distribution as X, and kY* has the same distribution as Y, and
Var(Y*) = E{Var(Y*|J)} + Var{E(Y*|J)},
where E{Var(Y*|J)} = E(šœŽ_J^2) = (1/k) āˆ‘_j šœŽ_j^2 and Var{E(Y*|J)} = Var(šœƒ_J).
Then šœŽ2=Var(Y)=k2Var(Y*)=k2Var(šœƒJ)+kāˆ‘šœŽ2j≄kāˆ‘šœŽ2j3. MC in statistical inference 3.1. point estimationšœƒ=1
m
āˆ‘šœƒ(j)
se(āØx)=a
1
n
{āˆ‘(xi-āØx)2}1
2
1
n
{1
n-1
āˆ‘(xi-āØx)2}1
2
mse=1
m
āˆ‘(šœƒ(j)-šœƒ)2
3.2. confidence interval
Algorithm 4 (Monte Carlo confidence interval)
1. For each replicate j = 1,⋯,m:
  (a) generate the j-th random sample X_1^{(j)},⋯,X_n^{(j)};
  (b) compute the confidence interval C_j for the j-th sample;
  (c) compute y_j = 1(šœƒ ∈ C_j) for the j-th sample.
2. Compute the empirical confidence level ȳ = (1/m) āˆ‘ y_j.
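A small R sketch of Algorithm 4 (my own example): the empirical coverage of the usual t interval for a normal mean.

```r
# Empirical coverage of the t confidence interval for a normal mean
m <- 5000
n <- 20
mu <- 0
covered <- replicate(m, {
  x  <- rnorm(n, mean = mu, sd = 2)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)   # empirical confidence level, should be near 0.95
```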
3.3. hypothesis testing

3.3.1. empirical Type I error rate
Algorithm 5 (MC Type I error rate)
1. For each replicate, indexed by j = 1,⋯,m:
  (a) generate the j-th random sample x_1^{(j)},⋯,x_n^{(j)} from the null distribution;
  (b) compute the test statistic T_j from the j-th sample;
  (c) record the test decision I_j = 1 if H_0 is rejected at significance level š›¼, and otherwise I_j = 0.
2. Compute the proportion of significant tests, (1/m) āˆ‘ I_j. This proportion is the observed Type I error rate.
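A minimal sketch of Algorithm 5 (my own example: the one-sample t-test under a true null):

```r
# Empirical Type I error of the one-sample t-test, H0: mu = 0 (true here)
m <- 5000
n <- 20
alpha <- 0.05
reject <- replicate(m, t.test(rnorm(n, mean = 0), mu = 0)$p.value < alpha)
mean(reject)   # observed Type I error rate, should be near alpha
```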
3.3.2. Power of a test
Algorithm 6 (MC power of a test)
1. Select a particular šœƒ_1 āˆˆ š›©_1.
2. For each replicate, indexed by j = 1,⋯,m:
  (a) generate the j-th random sample x_1^{(j)},⋯,x_n^{(j)} under šœƒ_1;
  (b) compute the test statistic T_j from the j-th sample;
  (c) record the test decision I_j = 1 if H_0 is rejected at significance level š›¼, and otherwise I_j = 0.
3. Compute the proportion of significant tests, šœ‹Ģ‚(šœƒ_1) = (1/m) āˆ‘ I_j.
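A sketch of Algorithm 6 for the one-sample t-test (my own example); power.t.test gives a theoretical value to compare against.

```r
# Empirical power of the one-sample t-test at a specific alternative mu1
m <- 5000
n <- 20
alpha <- 0.05
mu1 <- 0.5
reject <- replicate(m, t.test(rnorm(n, mean = mu1, sd = 1), mu = 0)$p.value < alpha)
mean(reject)   # estimated power pi(mu1)

# rough theoretical benchmark
power.t.test(n = n, delta = mu1, sd = 1, sig.level = alpha, type = "one.sample")$power
```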
4. Bootstrap

4.1. bootstrap estimate of distribution
Algorithm 7 (bootstrap estimate of distribution)
1. For each bootstrap replicate b = 1,⋯,B:
  (a) generate a sample x^{*(b)} = (x_1^{*(b)},⋯,x_n^{*(b)}) by sampling with replacement from the observations x_1,⋯,x_n;
  (b) compute the b-th replicate šœƒĢ‚^{(b)} using x^{*(b)}.
2. The bootstrap estimate of F_šœƒĢ‚ is the empirical distribution of šœƒĢ‚^{(1)},⋯,šœƒĢ‚^{(B)}.
4.2. point estimation

1. se of šœƒĢ‚: se_B ā‰œ se(šœƒĢ‚*) = {(1/(B-1)) āˆ‘_{b=1}^B (šœƒĢ‚^{(b)} - šœƒĢ„*)^2}^{1/2}, where šœƒĢ„* = (1/B) āˆ‘_{b=1}^B šœƒĢ‚^{(b)}.
2. bias of šœƒĢ‚: bias(šœƒĢ‚) = (1/B) āˆ‘_{b=1}^B šœƒĢ‚^{(b)} - šœƒĢ‚ = šœƒĢ„* - šœƒĢ‚.
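A minimal R sketch (toy data of my own) of the bootstrap se and bias of the sample median:

```r
# Bootstrap estimates of the standard error and bias of the sample median
set.seed(1)
x <- rexp(50, rate = 1)            # toy observed data
B <- 2000
theta_hat  <- median(x)
theta_boot <- replicate(B, median(sample(x, replace = TRUE)))
se_boot   <- sd(theta_boot)                 # sd() uses the 1/(B-1) version
bias_boot <- mean(theta_boot) - theta_hat
c(theta_hat = theta_hat, se = se_boot, bias = bias_boot)
```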
4.3. confidence interval

Now we use the bootstrap to estimate confidence intervals.

4.3.1. standard normal distribution
Use the approximate normality: [šœƒĢ‚ ± 1.96 se_B(šœƒĢ‚)].

4.3.2. percentile CI
Use the sample quantiles of the bootstrap replicates: [šœƒĢ‚*_{š›¼/2}, šœƒĢ‚*_{1-š›¼/2}].

4.3.3. Basic bootstrap CI
Suppose (L,U) is the confidence interval, i.e., P(L ≄ šœƒ) = P(U ≤ šœƒ) = š›¼/2. Then
š›¼/2 = P(L ≄ šœƒ) = P(L - šœƒĢ‚ ≄ šœƒ - šœƒĢ‚) = P(šœƒĢ‚ - šœƒ ≄ šœƒĢ‚ - L).
Thus šœƒĢ‚ - L is the 1-š›¼/2 quantile of šœƒĢ‚ - šœƒ. We can estimate the 1-š›¼/2 quantile of šœƒĢ‚ using the bootstrap replicate šœƒĢ‚*_{1-š›¼/2}; then šœƒĢ‚*_{1-š›¼/2} - šœƒĢ‚ is approximately equal to the 1-š›¼/2 quantile of šœƒĢ‚ - šœƒ. Setting šœƒĢ‚ - L = šœƒĢ‚*_{1-š›¼/2} - šœƒĢ‚, we have L = 2šœƒĢ‚ - šœƒĢ‚*_{1-š›¼/2}.
Similarly, we have U = 2šœƒĢ‚ - šœƒĢ‚*_{š›¼/2}. Thus the CI is [2šœƒĢ‚ - šœƒĢ‚*_{1-š›¼/2}, 2šœƒĢ‚ - šœƒĢ‚*_{š›¼/2}].

4.3.4. Bootstrap t CI
Algorithm 8 (Bootstrap t confidence interval)
1. Compute šœƒĢ‚ from the observed data.
2. For each bootstrap replicate b = 1,⋯,B:
  (a) sample with replacement x^{(b)} = (x_1^{(b)},⋯,x_n^{(b)});
  (b) compute šœƒĢ‚^{(b)} from x^{(b)};
  (c) estimate se(šœƒĢ‚^{(b)}) (another layer of bootstrap: resample from x^{(b)}, not from x);
  (d) compute the b-th replicate of the t* distribution, t^{(b)} = (šœƒĢ‚^{(b)} - šœƒĢ‚) / se(šœƒĢ‚^{(b)}).
3. Find the sample quantiles t*_{š›¼/2} and t*_{1-š›¼/2} from {t^{(b)}}_{b=1}^B.
4. Compute se(šœƒĢ‚) from {šœƒĢ‚^{(b)}}_{b=1}^B.
5. The confidence interval is [šœƒĢ‚ - t*_{1-š›¼/2} se(šœƒĢ‚), šœƒĢ‚ - t*_{š›¼/2} se(šœƒĢ‚)].
5. Jackknife

The jackknife is similar to the leave-one-out method. In each sample, the jackknife leaves out one observation, i.e., the i-th jackknife sample is x_{(i)} = (x_1,⋯,x_{i-1},x_{i+1},⋯,x_n). The i-th jackknife estimate is šœƒĢ‚_{(i)} = šœƒĢ‚(x_{(i)}).

5.1. Bias

The jackknife bias estimate is
bias_jack = (n-1)(šœƒĢ„_{(ā‹…)} - šœƒĢ‚), where šœƒĢ„_{(ā‹…)} = (1/n) āˆ‘_{i=1}^n šœƒĢ‚_{(i)}.
Why n-1? For example, for šœƒĢ‚ = (1/n) āˆ‘_{i=1}^n (x_i - xĢ„)^2 we have bias(šœƒĢ‚) = E(šœƒĢ‚) - šœŽ^2 = ((n-1)/n) šœŽ^2 - šœŽ^2 = -šœŽ^2/n, while
E(šœƒĢ‚_{(i)} - šœƒĢ‚) = E(šœƒĢ‚_{(i)} - šœƒ) - E(šœƒĢ‚ - šœƒ) = -šœŽ^2/(n-1) + šœŽ^2/n = -šœŽ^2/(n(n-1)) = bias(šœƒĢ‚)/(n-1).
5.2. Standard Error

The jackknife standard error is
se_jack = {((n-1)/n) āˆ‘_{i=1}^n (šœƒĢ‚_{(i)} - šœƒĢ„_{(ā‹…)})^2}^{1/2}.
Why (n-1)/n? For example, if šœƒĢ‚ = xĢ„, then šœƒĢ‚_{(i)} = (n xĢ„ - x_i)/(n-1) and šœƒĢ„_{(ā‹…)} = xĢ„, so
āˆ‘_{i=1}^n (šœƒĢ‚_{(i)} - šœƒĢ„_{(ā‹…)})^2 = āˆ‘_{i=1}^n ((n xĢ„ - x_i)/(n-1) - xĢ„)^2 = (1/(n-1)^2) āˆ‘_{i=1}^n (xĢ„ - x_i)^2 = s^2/(n-1),
and therefore se_jack = {((n-1)/n) ā‹… s^2/(n-1)}^{1/2} = s/√n, the usual standard error of xĢ„.
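A small R sketch (my own example, the sample mean), where the jackknife se can be checked against s/√n as derived above:

```r
# Jackknife estimates of bias and standard error (toy example: the sample mean)
set.seed(3)
x <- rnorm(30, mean = 1)
n <- length(x)
theta_hat <- mean(x)
theta_i   <- sapply(1:n, function(i) mean(x[-i]))   # leave-one-out estimates
bias_jack <- (n - 1) * (mean(theta_i) - theta_hat)
se_jack   <- sqrt((n - 1) / n * sum((theta_i - mean(theta_i))^2))
c(bias_jack = bias_jack, se_jack = se_jack, se_usual = sd(x) / sqrt(n))
```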
6. Bayesian statistics and MCMC

Bayesian statistics looks at problems in a different way from frequentist statistics. For a frequentist, the parameter is fixed but unknown, and the experiment is repeatable. In the Bayesian view, the parameter is random and unknown, and the experiment is fixed; in other words, the target of Bayesian inference is the posterior p(šœƒ|data), rather than the likelihood p(data|šœƒ).

6.1. Bayesian problem set-up

parameter: šœƒ
data: X
prior distribution of šœƒ: šœ‹(šœƒ)
sampling model for X: f(x|šœƒ)
posterior distribution of šœƒ|X: p(šœƒ|X) = šœ‹(šœƒ) f(X|šœƒ) / ∫ šœ‹(šœƒ) f(X|šœƒ) dšœƒ āˆ prior Ɨ likelihood
6.2. Markov chain Monte Carlo

Construct a Markov chain {X_t : t = 0,1,⋯} whose stationary distribution is the target distribution.

6.2.1. Metropolis-Hastings sampler

Now suppose our target distribution is f. To move from one state to another, we use a proposal Y ∼ g(y|X_t).
Algorithm 9 (MH sampler)
1. Choose a proper proposal g(ā‹…|X_t).
2. Initialize X_0 and repeat until the chain converges. At time t,
  (a) generate Y from g(ā‹…|X_t);
  (b) compute the acceptance rate r(X_t, Y) = f(Y) g(X_t|Y) / (f(X_t) g(Y|X_t));
  (c) let š›¼(X_t, Y) = min{r(X_t, Y), 1} be the acceptance ratio, and set X_{t+1} = Y with probability š›¼(X_t, Y), and X_{t+1} = X_t with probability 1 - š›¼(X_t, Y).
Now we show how the MH sampler works. Suppose X_t ∼ f, and let r, s be two different states. Without loss of generality, assume that f(s)g(r|s) ≄ f(r)g(s|r). Then š›¼(r,s) = 1 and š›¼(s,r) = f(r)g(s|r) / (f(s)g(r|s)), and
P(X_t = s, X_{t+1} = r) = P(X_t = s) P(X_{t+1} = r | X_t = s) = f(s) g(r|s) š›¼(s,r) = f(r) g(s|r),
P(X_t = r, X_{t+1} = s) = P(X_t = r) P(X_{t+1} = s | X_t = r) = f(r) g(s|r) š›¼(r,s) = f(r) g(s|r),
⟹ P(X_t = s, X_{t+1} = r) = P(X_t = r, X_{t+1} = s).
Thus
P(X_t = r) = āˆ‘_s P(X_t = r, X_{t+1} = s) = āˆ‘_s P(X_t = s, X_{t+1} = r) = P(X_{t+1} = r).
The Markov chain is stationary at f. ā–”
We can also prove it from a kernel point of view. Define K(r,s) = g(s|r) š›¼(r,s) to be the transition kernel from r to s. We call f the stationary distribution if the balance condition ∫ f(r) K(r,s) dr = f(s) holds. Under the MH sampler,
f(r) K(r,s) = f(r) g(s|r) š›¼(r,s)
            = f(r) g(s|r) min{1, f(s)g(r|s) / (f(r)g(s|r))}
            = min{f(r)g(s|r), f(s)g(r|s)}
            = f(s) g(r|s) š›¼(s,r)
            = f(s) K(s,r).
We have the detailed balance condition f(r)K(r,s) = f(s)K(s,r). Then ∫ f(r)K(r,s) dr = ∫ f(s)K(s,r) dr = f(s) ∫ K(s,r) dr = f(s).

6.2.2. Metropolis sampler
With a symmetric proposal g(X|Y) = g(Y|X), r(X_t, Y) = f(Y)/f(X_t).
6.2.3. Random walk sampler
Proposal g(Y|X_t) = g(|Y - X_t|), for example Y = X_t + N(0, šœŽ^2). Then r(X_t, Y) = f(Y)/f(X_t).
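A minimal R sketch of a random walk Metropolis sampler; the standard Laplace target f(x) āˆ e^{-|x|} and the N(0, šœŽ^2) increment are my own choices.

```r
# Random walk Metropolis sampler for the standard Laplace target f(x) ~ exp(-|x|)
set.seed(4)
N <- 10000
sigma <- 2
x <- numeric(N)
x[1] <- 0
for (t in 2:N) {
  y <- x[t - 1] + rnorm(1, 0, sigma)          # random walk proposal
  r <- exp(-abs(y)) / exp(-abs(x[t - 1]))     # f(Y) / f(X_t); constants cancel
  x[t] <- ifelse(runif(1) < r, y, x[t - 1])   # accept, or stay at the current state
}
burn <- 1000
mean(abs(x[-(1:burn)]))   # E|X| = 1 for the standard Laplace
```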
6.2.4. independence sampler
Proposal g(Y|X_t) = g(Y). Then r(X_t, Y) = f(Y) g(X_t) / (f(X_t) g(Y)).
6.2.5. Gibbs sampler

Now suppose we generate samples from a multivariate f(x), where x āˆˆ šœ’ āŠ‚ R^d. We partition the d-dimensional vector x into K disjoint blocks, denoted x = (x_1,⋯,x_K)^T with K ≤ d, and write the full conditionals as f_k(x_k | x_{-k}) = f_k(x_k | x_1,⋯,x_{k-1},x_{k+1},⋯,x_K), k = 1,⋯,K.
Algorithm 10 (Gibbs sampler)
1. Start with an arbitrary point x^{(0)} āˆˆ šœ’ with f(x^{(0)}) > 0.
2. At time t,
  (1) generate x_1^{(t)} ∼ f_1(x_1 | x_2^{(t-1)},⋯,x_K^{(t-1)});
  ā‹®
  (k) generate x_k^{(t)} ∼ f_k(x_k | x_1^{(t)},⋯,x_{k-1}^{(t)}, x_{k+1}^{(t-1)},⋯,x_K^{(t-1)});
  ā‹®
  (K) generate x_K^{(t)} ∼ f_K(x_K | x_1^{(t)},⋯,x_{K-1}^{(t)}).
3. Set x^{(t)} = (x_1^{(t)},⋯,x_K^{(t)})^T.
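A small R sketch of a Gibbs sampler for a standard bivariate normal with correlation ρ (a toy target of my own; its full conditionals are N(ρx_2, 1-ρ^2) and N(ρx_1, 1-ρ^2)):

```r
# Gibbs sampler for a standard bivariate normal with correlation rho
set.seed(5)
N <- 10000
rho <- 0.8
x <- matrix(0, N, 2)
for (t in 2:N) {
  # draw each coordinate from its full conditional, using the freshest values
  x[t, 1] <- rnorm(1, mean = rho * x[t - 1, 2], sd = sqrt(1 - rho^2))
  x[t, 2] <- rnorm(1, mean = rho * x[t, 1],     sd = sqrt(1 - rho^2))
}
cor(x[-(1:1000), 1], x[-(1:1000), 2])   # should be close to rho
```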
The Gibbs sampler is a special case of the MH sampler, with acceptance ratio 1. The proof can be found in homework 7.

6.3. Monitoring the convergence (Gelman-Rubin method)

In MCMC, the chain may be trapped in some local mode, so we can run multiple chains from different initial points to check the convergence. Recall ANOVA in DATA130046 Statistics II. Suppose there are J chains, and n is the number of draws in each chain after discarding the burn-in period. šœ“ is a function of the data. Write
šœ“_i^{(j)} = šœ“(X_1^{(j)},⋯,X_i^{(j)}), i = 1,⋯,n; j = 1,⋯,J,
šœ“Ģ„^{(j)} = (1/n) āˆ‘_{i=1}^n šœ“_i^{(j)},   šœ“Ģ„ = (1/J) āˆ‘_{j=1}^J šœ“Ģ„^{(j)}.
Then the between-sequence variance is
B = n/(J-1) āˆ‘_{j=1}^J (šœ“Ģ„^{(j)} - šœ“Ģ„)^2,
and the within-sequence variance is
W = (1/J) āˆ‘_{j=1}^J s_j^2, where s_j^2 = (1/(n-1)) āˆ‘_{i=1}^n (šœ“_i^{(j)} - šœ“Ģ„^{(j)})^2.
The Gelman-Rubin statistic is
RĢ‚ = {((n-1)/n) W + (1/n) B} / W,
which should decrease and converge to 1 if the chain converges well. A recommended threshold is 1.1.
C(šœƒ,šœƒ*)-C(šœƒ*,šœƒ*)=E{logf(U|Y,šœƒ)
f(U|Y,šœƒ*)
|Y=y,šœƒ*}
≤logE{f(U|Y,šœƒ)
f(U|Y,šœƒ*)
|Y=y,šœƒ*}
=log{f(u|y,šœƒ)
f(u|y,šœƒ*)
f(u|y,šœƒ*)du}
=0
Thus ā„“(šœƒ)-ā„“(šœƒ*)=Q(šœƒ,šœƒ*)-Q(šœƒ*,šœƒ*). Given šœƒ*, if we find šœƒ that Q(šœƒ,šœƒ*)≄Q(šœƒ*,šœƒ*), then ā„“(šœƒ)≄ℓ(šœƒ*).
Algorithm 11 (The EM algorithm)
1. E-step: compute Q(šœƒ, šœƒ^{(t)}) = E_U{log f(Y,U|šœƒ) | Y=y, šœƒ^{(t)}}.
2. M-step: šœƒ^{(t+1)} = argmax_šœƒ Q(šœƒ, šœƒ^{(t)}).
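A minimal R sketch of EM on a toy two-component normal mixture with unit variances (my own example; the unobserved U are the component labels, and the E- and M-steps have closed forms here):

```r
# EM for a two-component normal mixture:
# f(y | theta) = p * dnorm(y, mu1, 1) + (1 - p) * dnorm(y, mu2, 1)
set.seed(6)
y <- c(rnorm(150, 0), rnorm(150, 3))          # simulated data
p <- 0.5; mu1 <- -1; mu2 <- 1                 # initial values
for (t in 1:200) {
  # E-step: posterior probability that each observation comes from component 1
  w <- p * dnorm(y, mu1) / (p * dnorm(y, mu1) + (1 - p) * dnorm(y, mu2))
  # M-step: maximize Q(theta, theta_t); closed-form updates in this model
  p   <- mean(w)
  mu1 <- sum(w * y) / sum(w)
  mu2 <- sum((1 - w) * y) / sum(1 - w)
}
c(p = p, mu1 = mu1, mu2 = mu2)   # should be near 0.5, 0 and 3
```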
8. Variational inference

To approximate the posterior p(z|x) = šœ‹(z) f(x|z) / f(x) in Bayesian statistics, rather than generating random samples using MCMC, we can solve the problem by optimization, i.e., find the density closest to p(z|x). To judge the distance between our density q and the target p, we can use the Kullback-Leibler divergence.
8.1. KL divergence

KL(f‖g) = E_f{log[f(x)/g(x)]} = ∫ log[f(x)/g(x)] f(x) dx.
Properties:
1. KL divergence is non-negative, and KL = 0 iff f = g.
2. KL is not a distance, i.e., KL(f‖g) ≠ KL(g‖f).
Proof of Property 1: by Jensen's inequality,
-KL(f‖g) = E_f{log[g(x)/f(x)]} ≤ log{E_f[g(x)/f(x)]} = log ∫ [g(x)/f(x)] f(x) dx = log ∫ g(x) dx = 0.
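A quick numeric check of the two properties on a pair of discrete distributions (my own toy values):

```r
# Numeric check of the KL properties on two discrete distributions
f <- c(0.2, 0.5, 0.3)
g <- c(0.4, 0.4, 0.2)
kl <- function(f, g) sum(f * log(f / g))
c(KL_fg = kl(f, g), KL_gf = kl(g, f), KL_ff = kl(f, f))
# KL_fg and KL_gf are both >= 0 but unequal (not symmetric); KL_ff = 0
```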
8.2. Evidence lower bound (ELBO)

Suppose we have a family Q of densities over the latent variables; then we want
q*(z) = argmin_{q∈Q} KL(q(z) ‖ p(z|x)).
KL(q(z) ‖ p(z|x)) = E_q{log[q(z)/p(z|x)]}
                  = E_q{log q(z)} - E_q{log p(z|x)}
                  = E_q{log q(z)} - E_q{log f(x,z)} + E_q{log f(x)}
                  = log f(x) - ELBO(q).
Define the evidence lower bound (ELBO) as ELBO(q) ā‰œ E_q{log f(x,z)} - E_q{log q(z)}. Then
log f(x) = KL(q(z) ‖ p(z|x)) + ELBO(q) ≄ ELBO(q).
To minimize KL(q(z) ‖ p(z|x)), we maximize ELBO(q).
ELBO(q) = E_q{log f(x,z)} - E_q{log q(z)}
        = E_q{log f(x|z)} + E_q{log šœ‹(z)} - E_q{log q(z)}
        = E_q{log f(x|z)} - E_q{log[q(z)/šœ‹(z)]}
        = E_q{log f(x|z)} - KL(q(z) ‖ šœ‹(z)).
To maximize ELBO(q), on one hand we maximize E_q{log f(x|z)}, i.e., find a q that fits the data; on the other hand we minimize KL(q(z) ‖ šœ‹(z)), i.e., find a q that stays close to the prior.

8.3. The mean-field variational family

Now we consider the special case where the family is Q ā‰œ {q : q(z) = āˆ_{j=1}^m q_j(z_j)}, i.e., the components of z are treated as "independent" under q. Then
ELBO(q) = E_q{log f(x,z)} - E_q{log q(z)} = ∫ āˆ_{j=1}^m q_j(z_j) {log f(x,z) - āˆ‘_{j=1}^m log q_j(z_j)} dz.
We use a method called coordinate ascent to maximize ELBO(q), i.e., fix all the other factors and climb on q_k. Then
ELBO(q) = ∫ q_k {∫ log f(x,z) āˆ_{j≠k} q_j dz_{-k}} dz_k - ∫ q_k log q_k dz_k + c_1,
where c_1 is a constant that does not depend on q_k. Define log pĢƒ(x,z_k) = ∫ log f(x,z) āˆ_{j≠k} q_j dz_{-k} + c_2 = E_{-k}{log f(x,z)} + c_2, where E_{-k} denotes the expectation over āˆ_{j≠k} q_j; thus
ELBO(q) = ∫ q_k log pĢƒ(x,z_k) dz_k - ∫ q_k log q_k dz_k + c_1 = -KL(q_k ‖ pĢƒ(x,z_k)) + c_3,
and the optimal factor is q_k* = pĢƒ āˆ exp{E_{-k} log f(x,z)}.