Theorem 1 (Probability Integral Transformation). If $X$ is a continuous random variable with cdf $F_X(x)$, then $U = F_X(X) \sim \mathrm{Uniform}(0,1)$.
Thus, given a uniform random number generator and the cdf, we can generate from $X$ by inverting the cdf (see the sketch below):
1. Derive the inverse function $F_X^{-1}(u)$.
2. Generate $u \sim \mathrm{Unif}(0,1)$.
3. Set $x = F_X^{-1}(u)$.
This method requires that $F_X^{-1}$ be easy to compute.
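As an illustration (mine, not part of the original notes; numpy assumed), a minimal Python sketch of the inverse transform method for the exponential distribution, whose cdf $F(x) = 1 - e^{-\lambda x}$ inverts to $F^{-1}(u) = -\log(1-u)/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

def rexp_inverse_transform(m, lam=1.0):
    """Generate m Exponential(lam) variates by inverting the cdf."""
    u = rng.uniform(size=m)           # step 2: u ~ Unif(0, 1)
    return -np.log(1.0 - u) / lam     # step 3: x = F^{-1}(u)

x = rexp_inverse_transform(10_000, lam=2.0)
print(x.mean())   # should be close to 1/lam = 0.5
```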
1.2. Acceptance-rejection method
In this method, given a generator for a pdf $g$ (i.e., $Y \sim g$), we can generate from our target $X \sim f$. We first state the algorithm and then prove it. Assume that $c$ satisfies $f(t)/g(t) \le c$ for all $t \in \mathbb{R}$. Then:
1. Generate $y \sim g$ and $u \sim \mathrm{Unif}(0,1)$.
2. If $u < \dfrac{f(y)}{c\,g(y)}$, accept $y$ and set $x = y$; otherwise reject and return to step 1.
Now we prove that $x$ is actually from the target $f$ (discrete case; a code sketch follows):
$P(X = y \mid \text{accept}) = P(Y = y \mid \text{accept}) = \dfrac{P(\text{accept} \mid Y = y)\,P(Y = y)}{P(\text{accept})}$,
where
$P(\text{accept} \mid Y = y) = P\!\left(u < \dfrac{f(y)}{c\,g(y)} \,\Big|\, Y = y\right) = \dfrac{f(y)}{c\,g(y)}$,
$P(\text{accept}) = \sum_y P(\text{accept} \mid Y = y)\,P(Y = y) = \sum_y \dfrac{f(y)}{c\,g(y)}\, g(y) = \dfrac{1}{c}$.
Thus $P(X = y \mid \text{accept}) = \dfrac{f(y)}{c\,g(y)}\, g(y) \Big/ \dfrac{1}{c} = f(y)$. The continuous case is shown in the homework.
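A minimal Python sketch (my own illustration, not from the notes) that samples from the Beta(2,2) density $f(t) = 6t(1-t)$ on $(0,1)$ using a $\mathrm{Unif}(0,1)$ envelope $g(t) = 1$ and bound $c = 1.5 = \max_t f(t)/g(t)$:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbeta22_rejection(m):
    """Sample m draws from Beta(2,2), f(t) = 6 t (1 - t), by acceptance-rejection
    with envelope g = Unif(0,1) and bound c = 1.5 >= f(t)/g(t)."""
    c = 1.5
    out = []
    while len(out) < m:
        y = rng.uniform()                 # proposal y ~ g
        u = rng.uniform()
        if u < 6 * y * (1 - y) / c:       # accept with probability f(y) / (c g(y))
            out.append(y)
    return np.array(out)

x = rbeta22_rejection(10_000)
print(x.mean(), x.var())   # theoretical mean 0.5 and variance 0.05
```

On average $1/c = 2/3$ of the proposals are accepted, matching the $P(\text{accept}) = 1/c$ computed in the proof.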
2. Monte Carlo integration and variance reduction
In this part we solve the integration problem $\theta = \int g(x)\,dx$. For $x_1,\dots,x_m \overset{\mathrm{iid}}{\sim} X$, the empirical average $\frac{1}{m}\sum_{i=1}^m g(x_i)$ estimates the population mean, i.e., the expectation $E\,g(X) = \int g(x) f(x)\,dx$ with $X \sim f$. This is the idea behind Monte Carlo (MC) integration. The Monte Carlo method is widely used; by "MC" we mean simulating many replicates to estimate a target. Here, to estimate an integral, we simulate many random numbers and average.
2.1. Simple MC integration
To estimate $\theta = \int_0^1 g(x)\,dx$ with $X \sim \mathrm{Unif}(0,1)$:
1. Generate $x_1,\dots,x_m \overset{\mathrm{iid}}{\sim} X$.
2. Set $\hat\theta = \frac{1}{m}\sum_{i=1}^m g(x_i)$.
Notice that this integral is restricted to the range $[0,1]$. We can generalize to a range $[a,b]$ by the change of variable $y = \frac{x-a}{b-a}$, or simply generate $X \sim \mathrm{Unif}(a,b)$ and set $\hat\theta = (b-a)\,\frac{1}{m}\sum_{i=1}^m g(x_i)$, as in the sketch below.
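A minimal Python sketch (my own example) of simple MC integration on $[a,b]$, estimating $\int_0^\pi \sin(x)\,dx = 2$:

```python
import numpy as np

rng = np.random.default_rng(2)

def simple_mc(g, a, b, m=100_000):
    """Simple Monte Carlo estimate of the integral of g over [a, b]."""
    x = rng.uniform(a, b, size=m)       # x_i ~ Unif(a, b)
    return (b - a) * np.mean(g(x))      # (b - a) times the average of g(x_i)

print(simple_mc(np.sin, 0.0, np.pi))    # close to the exact value 2
```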
2.2. Variance reduction
Although simple MC integration is an unbiased estimator of $\theta$, we can use better estimators with smaller variance.
2.2.1. Antithetic variables
In simple MC we use independent random variables. However, by using dependent variables we may reduce the variance. Suppose $Y$ and $Z$ have the same distribution as $X$ but are dependent. Then
$\mathrm{Var}\!\left(\tfrac{1}{2}(Y+Z)\right) = \tfrac{1}{4}\left[\mathrm{Var}(Y) + \mathrm{Var}(Z) + 2\,\mathrm{Cov}(Y,Z)\right]$,
so if $\mathrm{Cov}(Y,Z) < 0$ the variance can be reduced. We know that if $U \sim \mathrm{Unif}(0,1)$, then $1-U \sim \mathrm{Unif}(0,1)$ and $U$ and $1-U$ are negatively correlated, and we can expect the following.
Corollary 1. If $g(X) = g(X_1,\dots,X_n)$ is monotone, then $Y = g(F_X^{-1}(u_1),\dots,F_X^{-1}(u_n))$ and $Y' = g(F_X^{-1}(1-u_1),\dots,F_X^{-1}(1-u_n))$ are negatively correlated.
The proof is omitted here. Then, instead of generating $m$ $\mathrm{Unif}(0,1)$ variables, we need only $m/2$ of them: for $j = 1,\dots,m/2$ define $Y_j = g(F_X^{-1}(u_1^{(j)}),\dots,F_X^{-1}(u_n^{(j)}))$ and $Y_j' = g(F_X^{-1}(1-u_1^{(j)}),\dots,F_X^{-1}(1-u_n^{(j)}))$, and set $\hat\theta = \frac{1}{m}\sum_{j=1}^{m/2}(Y_j + Y_j')$, as in the sketch below.
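A minimal Python sketch (mine) comparing the simple and antithetic estimators of $\theta = \int_0^1 e^x\,dx = e - 1$; $g$ is monotone, so $g(U)$ and $g(1-U)$ are negatively correlated:

```python
import numpy as np

rng = np.random.default_rng(3)

def mc_antithetic(g, m=10_000):
    """Antithetic estimate of int_0^1 g(x) dx using only m/2 uniforms."""
    u = rng.uniform(size=m // 2)
    return np.mean((g(u) + g(1.0 - u)) / 2.0)

def mc_simple(g, m=10_000):
    return np.mean(g(rng.uniform(size=m)))

g = np.exp   # true value: e - 1 ~= 1.71828
reps_anti = [mc_antithetic(g) for _ in range(1000)]
reps_simp = [mc_simple(g) for _ in range(1000)]
print(np.var(reps_simp) / np.var(reps_anti))   # variance-reduction factor, well above 1
```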
2.2.2. Control variates
In this part we again exploit correlation. Suppose there is a function $f$ with $\mu = E f(X)$ known and $f(X)$ correlated with $g(X)$. Define $\hat\theta_c = g(X) + c\,(f(X) - \mu)$. Then $\hat\theta_c$ is still an unbiased estimator of $\theta$, and
$\mathrm{Var}(\hat\theta_c) = \mathrm{Var}(g(X)) + 2c\,\mathrm{Cov}(g(X), f(X)) + c^2\,\mathrm{Var}(f(X))$.
Letting $c^* = -\dfrac{\mathrm{Cov}(g(X), f(X))}{\mathrm{Var}(f(X))}$ minimizes the variance:
$\mathrm{Var}(\hat\theta_{c^*}) = \mathrm{Var}(g(X)) - \dfrac{\mathrm{Cov}^2(g(X), f(X))}{\mathrm{Var}(f(X))}$.
The percentage of reduction is
$\dfrac{\mathrm{Var}(g(X)) - \mathrm{Var}(\hat\theta_{c^*})}{\mathrm{Var}(g(X))} = \mathrm{Cor}^2(g(X), f(X)) \times 100\%$.
A sketch follows below.
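A minimal Python sketch (mine) estimating $\theta = \int_0^1 e^x\,dx$ with control variate $f(x) = x$, whose mean $\mu = 1/2$ is known; here $c^*$ is estimated from the same sample, a common plug-in choice:

```python
import numpy as np

rng = np.random.default_rng(4)

m = 10_000
x = rng.uniform(size=m)
gx = np.exp(x)     # g(x); target theta = e - 1
fx = x             # control variate f(x) = x with known mean mu = 0.5
mu = 0.5

c_star = -np.cov(gx, fx, ddof=1)[0, 1] / np.var(fx, ddof=1)   # plug-in c*

theta_simple = gx.mean()
theta_cv = np.mean(gx + c_star * (fx - mu))
print(theta_simple, theta_cv)            # both near e - 1 ~= 1.71828
print(np.corrcoef(gx, fx)[0, 1] ** 2)    # approximate fraction of variance removed
```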
2.2.3. Antithetic variables as a control variate
We can combine the two methods by writing the control variate estimator as a linear combination of two unbiased estimators, $\hat\theta_c = c\,\hat\theta_1 + (1-c)\,\hat\theta_2$. Suppose $\hat\theta_1$ and $\hat\theta_2$ have identical distributions and $r = \mathrm{Cor}(\hat\theta_1, \hat\theta_2) < 0$. Then
$\mathrm{Var}(\hat\theta_c) = c^2\,\mathrm{Var}(\hat\theta_1) + 2c(1-c)\,\mathrm{Cov}(\hat\theta_1,\hat\theta_2) + (1-c)^2\,\mathrm{Var}(\hat\theta_2) = \mathrm{Var}(\hat\theta_1)\left[c^2 + 2c(1-c)r + (1-c)^2\right] = \mathrm{Var}(\hat\theta_1)\left[(2-2r)c^2 - (2-2r)c + 1\right]$,
which is minimized at $c^* = \tfrac{1}{2}$.
2.2.4. Control variates and linear regression
In the control variate method, suppose we have $n$ samples $(f(x_1), g(x_1)),\dots,(f(x_n), g(x_n))$. Fitting the linear regression $g(x) = \alpha + \beta f(x) + \varepsilon$ gives the following four important properties (see the sketch below).
1. The OLS estimators are $\hat\alpha = \bar g - \hat\beta\,\bar f$ and $\hat\beta = \dfrac{\widehat{\mathrm{Cov}}(f(X), g(X))}{\widehat{\mathrm{Var}}(f(X))} = -\hat c^*$, where $\bar f$ and $\bar g$ are the sample means of the $f(x_i)$ and $g(x_i)$.
2. The predicted value at $\mu = E f(X)$ is the control variate estimate of the target integral:
$\hat\alpha + \hat\beta \mu = \bar g - \hat\beta(\bar f - \mu) = \bar g + \hat c^*(\bar f - \mu) = \hat\theta_{c^*}$.
3. The variance of the control variate estimator is the residual mean squared error (MSE) divided by $n$:
$\mathrm{Var}\!\left(\bar g + c^*(\bar f - \mu)\right) = \tfrac{1}{n}\mathrm{Var}\!\left(g(X) + c^*(f(X) - \mu)\right) = \tfrac{1}{n}\mathrm{Var}\!\left(g(X) - \beta f(X) - \alpha\right) = \sigma_\varepsilon^2 / n$.
4. The percentage of improvement, $\{\mathrm{Cor}(g(X), f(X))\}^2 \times 100\%$, is the coefficient of determination $R^2$.
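A minimal Python sketch (mine) checking property 2 numerically: the OLS prediction at $\mu$ coincides with the control variate estimate. The example reuses $g(x) = e^x$, $f(x) = x$, $\mu = 0.5$:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 10_000
x = rng.uniform(size=n)
gx, fx, mu = np.exp(x), x, 0.5    # target: int_0^1 e^x dx = e - 1

# Control variate estimate with plug-in c*
c_star = -np.cov(gx, fx, ddof=1)[0, 1] / np.var(fx, ddof=1)
theta_cv = gx.mean() + c_star * (fx.mean() - mu)

# OLS fit g(x) ~ alpha + beta f(x); predict at mu
beta, alpha = np.polyfit(fx, gx, deg=1)   # np.polyfit returns (slope, intercept)
theta_reg = alpha + beta * mu

print(theta_cv, theta_reg)   # identical up to floating point (property 2)
```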
2.2.5. Importance sampling
Simple MC integration, $\frac{b-a}{m}\sum_{i=1}^m g(X_i)$, weights the interval $[a,b]$ uniformly: the replicates $X_1,\dots,X_m$ are uniformly distributed on $[a,b]$. We now consider other weight functions.
Algorithm 1 (Importance sampling).
1. Choose an "envelope" density $f(x)$.
2. For $i = 1,\dots,m$: (a) generate $x_i \sim f$; (b) record $g(x_i)/f(x_i)$.
3. $\hat\theta_{IS} = \dfrac{1}{m}\sum_{i=1}^m \dfrac{g(x_i)}{f(x_i)}$.
Let $X$ be a random variable with density $f$ such that $f(x) > 0$ on the support of $g$, and set $Y = g(X)/f(X)$. Then
$\theta = \int g(x)\,dx = \int \dfrac{g(x)}{f(x)}\, f(x)\,dx = E_f\!\left[\dfrac{g(X)}{f(X)}\right]$,
so $\hat\theta_{IS} = \frac{1}{m}\sum_{i=1}^m g(x_i)/f(x_i)$ can be used to estimate $\theta$ (a sketch follows below). Now we analyze the variance:
$\mathrm{Var}(\hat\theta_{IS}) = \mathrm{Var}\!\left(\dfrac{1}{m}\sum_{i=1}^m Y_i\right) = \dfrac{1}{m}\mathrm{Var}(Y) = \dfrac{1}{m}\mathrm{Var}\!\left(\dfrac{g(X)}{f(X)}\right) = \dfrac{1}{m}\left\{E\!\left(\dfrac{g(X)}{f(X)}\right)^{\!2} - \left(E\,\dfrac{g(X)}{f(X)}\right)^{\!2}\right\}$,
where $E\,\dfrac{g(X)}{f(X)} = \int \dfrac{g(x)}{f(x)}\, f(x)\,dx = \theta$ and $E\!\left(\dfrac{g(X)}{f(X)}\right)^{\!2} = \int \dfrac{g^2(x)}{f(x)}\,dx$. Hence
$\mathrm{Var}(\hat\theta_{IS}) = \dfrac{1}{m}\left\{\int \dfrac{g^2(x)}{f(x)}\,dx - \theta^2\right\}$,
where, by the Cauchy-Schwarz inequality,
$\int \dfrac{g^2(x)}{f(x)}\,dx = \left\{\int \dfrac{g^2(x)}{f(x)}\,dx\right\}\left\{\int f(x)\,dx\right\} \ge \left(\int |g(x)|\,dx\right)^{2}$,
with equality iff $f(x) \propto |g(x)|$.
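A minimal Python sketch (my own example, not from the notes) estimating $\theta = \int_0^1 \frac{e^{-x}}{1+x^2}\,dx$ with the envelope $f(x) = e^{-x}/(1-e^{-1})$ on $(0,1)$, sampled by the inverse transform:

```python
import numpy as np

rng = np.random.default_rng(6)

m = 10_000
g = lambda x: np.exp(-x) / (1.0 + x**2)        # integrand on (0, 1)

# Envelope f(x) = e^{-x} / (1 - e^{-1}) on (0, 1); sample it by inverse transform
u = rng.uniform(size=m)
x = -np.log(1.0 - u * (1.0 - np.exp(-1.0)))    # x ~ f
fx = np.exp(-x) / (1.0 - np.exp(-1.0))

theta_is = np.mean(g(x) / fx)                  # importance sampling estimate
theta_mc = np.mean(g(rng.uniform(size=m)))     # simple MC for comparison
print(theta_is, theta_mc)                      # both around 0.5248, theta_is less variable
```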
2.2.6. Stratified sampling
Algorithm 2 (Stratified sampling).
1. Divide $[0,1]$ into $k$ strata, where the $j$-th stratum is $I_j = \left(\frac{j-1}{k}, \frac{j}{k}\right)$.
2. On each stratum, for $i = 1,\dots,m_j$: (a) generate $x_i^{(j)} \sim \mathrm{Unif}(I_j)$, i.e., with density $f_j(x) = k\cdot 1(x \in I_j)$; (b) set $\hat\theta_j = \frac{1}{m_j}\sum_{i=1}^{m_j} g(x_i^{(j)})$.
3. $\hat\theta_S = \frac{1}{k}\sum_{j=1}^k \hat\theta_j$.
Note that $E\hat\theta_j = E\,g(x^{(j)}) = \int g(x)\, k\cdot 1(x \in I_j)\,dx = k\int_{I_j} g(x)\,dx$; that is why we need the factor $1/k$. A code sketch follows below.
Now we show that the variance of stratified sampling is smaller than that of simple MC integration. Denote by $\hat\theta_M$ the simple MC estimator. For simplicity, suppose each stratum has an equal number $m$ of replicates, with total $M = mk$. Denote $\mu_j = E\{g(U) \mid U \in I_j\}$ and $\sigma_j^2 = \mathrm{Var}\{g(U) \mid U \in I_j\}$. Then
$\mathrm{Var}(\hat\theta_S) = \mathrm{Var}\!\left(\frac{1}{k}\sum_{j=1}^k \hat\theta_j\right) = \frac{1}{k^2}\sum_{j=1}^k \mathrm{Var}(\hat\theta_j) = \frac{1}{k^2}\sum_{j=1}^k \frac{\sigma_j^2}{m} = \frac{1}{Mk}\sum_{j=1}^k \sigma_j^2$.
Consider a two-step experiment, with $J$ discrete uniform on $\{1,\dots,k\}$: for $i = 1,\dots,M$, (1) draw $J$, (2) generate $U \sim \mathrm{Unif}(I_J)$; then $U \sim \mathrm{Unif}(0,1)$. Hence
$\mathrm{Var}(\hat\theta_M) = \frac{1}{M}\mathrm{Var}(g(U)) = \frac{1}{M}\left[E\{\mathrm{Var}(g(U)\mid J)\} + \mathrm{Var}\{E(g(U)\mid J)\}\right] = \frac{1}{M}\left[\frac{1}{k}\sum_{j=1}^k \sigma_j^2 + \mathrm{Var}(\mu_J)\right] \ge \frac{1}{Mk}\sum_{j=1}^k \sigma_j^2 = \mathrm{Var}(\hat\theta_S)$.
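A minimal Python sketch (mine) comparing simple and stratified MC for $\int_0^1 e^x\,dx$ with the same total sample size:

```python
import numpy as np

rng = np.random.default_rng(7)

def mc_stratified(g, k=10, m=100):
    """Stratified estimate of int_0^1 g(x) dx with k equal strata, m draws each."""
    est = 0.0
    for j in range(k):
        x = rng.uniform(j / k, (j + 1) / k, size=m)   # x ~ Unif(I_j)
        est += g(x).mean() / k                         # (1/k) * theta_hat_j
    return est

def mc_simple(g, M=1000):
    return g(rng.uniform(size=M)).mean()

g = np.exp
reps_strat = [mc_stratified(g) for _ in range(500)]
reps_simp = [mc_simple(g) for _ in range(500)]     # same total M = k * m = 1000
print(np.var(reps_simp) / np.var(reps_strat))      # stratification reduces the variance
```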
Algorithm 3 (Stratified importance sampling).
1. Choose an importance function $f$.
2. Divide the real line into $k$ strata, where the $j$-th stratum is $I_j = (a_{j-1}, a_j)$ with $a_0 = -\infty$, $a_j = F^{-1}(j/k)$, $a_k = +\infty$.
3. On stratum $j$ define $g_j(x) = g(x)\,1(x \in I_j)$ and $f_j(x) = f_{X\mid I_j}(x \mid I_j) = \dfrac{f(x)\,1(x \in I_j)}{\int f(x)\,1(x \in I_j)\,dx} = k f(x)\,1(x \in I_j)$; for $i = 1,\dots,m$: (a) generate $x_i^{(j)} \sim f_j$; (b) set $\hat\theta_j = \dfrac{1}{m}\sum_{i=1}^m \dfrac{g(x_i^{(j)})}{f_j(x_i^{(j)})}$.
4. $\hat\theta_{SIS} = \sum_{j=1}^k \hat\theta_j$.
Note that $\theta_j = \int g_j(x)\,dx = \int_{I_j} g(x)\,dx$, so $\theta = \sum_j \theta_j$.
Now we prove that $\mathrm{Var}(\hat\theta_{SIS}) \le \mathrm{Var}(\hat\theta_{IS})$. On $I_j$, write $\theta_j = \int_{I_j} g(x)\,dx = E\{g_j(X)/f_j(X)\}$ and $\sigma_j^2 = \mathrm{Var}\{g_j(X)/f_j(X)\}$, where $X \sim f_j$. Then, with total sample size $M = mk$,
$\mathrm{Var}(\hat\theta_{SIS}) = \mathrm{Var}\Big(\sum_j \hat\theta_j\Big) = \sum_j \mathrm{Var}(\hat\theta_j) = \sum_j \frac{\sigma_j^2}{m} = \frac{k}{M}\sum_j \sigma_j^2$,
$\mathrm{Var}(\hat\theta_{IS}) = \frac{1}{M}\mathrm{Var}(Y) = \frac{\sigma^2}{M}$, where $\sigma^2 = \mathrm{Var}\!\left(\frac{g(X)}{f(X)}\right)$, $X \sim f$.
Next we show that $\sigma^2 - k\sum_j \sigma_j^2 \ge 0$. Consider the two-stage experiment: given $J = j$, generate $x^*$ from $f_j$ and set $Y^* = \dfrac{g_J(x^*)}{f_J(x^*)} = \dfrac{g(x^*)}{k f(x^*)}$. Then $x^* \overset{d}{=} X$ and $kY^* \overset{d}{=} Y$. Now
$\mathrm{Var}(Y^*) = E\{\mathrm{Var}(Y^* \mid J)\} + \mathrm{Var}\{E(Y^* \mid J)\}$,
where $E\{\mathrm{Var}(Y^* \mid J)\} = E(\sigma_J^2) = \frac{1}{k}\sum_j \sigma_j^2$ and $\mathrm{Var}\{E(Y^* \mid J)\} = \mathrm{Var}(\theta_J)$. Then
$\sigma^2 = \mathrm{Var}(Y) = k^2\,\mathrm{Var}(Y^*) = k^2\,\mathrm{Var}(\theta_J) + k\sum_j \sigma_j^2 \ge k\sum_j \sigma_j^2$.
3. MC in statistical inference
3.1. Point estimation
Generate $m$ independent samples and compute the estimator on each, giving replicates $\hat\theta^{(1)},\dots,\hat\theta^{(m)}$. The MC estimate of $E\hat\theta$ is $\hat\theta = \frac{1}{m}\sum_{j=1}^m \hat\theta^{(j)}$. For the sample mean, the estimated standard error is
$\widehat{se}(\bar x) = \frac{1}{n}\Big\{\sum_{i=1}^n (x_i - \bar x)^2\Big\}^{1/2}$ (plug-in variance) or $\widehat{se}(\bar x) = \frac{1}{\sqrt{n}}\Big\{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2\Big\}^{1/2}$ (unbiased sample variance),
and the MC estimate of the mean squared error is $\widehat{mse} = \frac{1}{m}\sum_{j=1}^m (\hat\theta^{(j)} - \theta)^2$, as in the sketch below.
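A minimal Python sketch (my own example) using MC replicates to estimate the mse of the sample median of a normal sample:

```python
import numpy as np

rng = np.random.default_rng(8)

m, n, theta = 1000, 20, 0.0   # m replicates, sample size n, true mean theta

medians = np.array([np.median(rng.normal(theta, 1.0, size=n)) for _ in range(m)])

theta_hat = medians.mean()                    # MC estimate of E[median]
mse_hat = np.mean((medians - theta) ** 2)     # MC estimate of the mse
print(theta_hat, mse_hat)                     # mse roughly pi / (2n) ~= 0.079 here
```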
3.2. Confidence interval
Algorithm 4 (Monte Carlo confidence interval).
1. For each replicate $j = 1,\dots,m$: (a) generate the $j$-th random sample $X_1^{(j)},\dots,X_n^{(j)}$; (b) compute the confidence interval $C_j$ for the $j$-th sample; (c) compute $y_j = 1(\theta \in C_j)$ for the $j$-th sample.
2. Compute the empirical confidence level $\bar y = \frac{1}{m}\sum_{j=1}^m y_j$.
3.3. Hypothesis testing
3.3.1. Empirical Type I error rate
Algorithm 5 (MC Type I error rate).
1. For each replicate, indexed by $j = 1,\dots,m$: (a) generate the $j$-th random sample $x_1^{(j)},\dots,x_n^{(j)}$ from the null distribution; (b) compute the test statistic $T_j$ from the $j$-th sample; (c) record the test decision $I_j = 1$ if $H_0$ is rejected at significance level $\alpha$ and $I_j = 0$ otherwise.
2. Compute the proportion of significant tests, $\frac{1}{m}\sum_{j=1}^m I_j$. This proportion is the observed Type I error rate (see the sketch below).
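A minimal Python sketch (mine) estimating the empirical Type I error rate of the one-sample t-test under a standard normal null; scipy is assumed available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)

m, n, alpha = 10_000, 20, 0.05
reject = 0
for _ in range(m):
    x = rng.normal(0.0, 1.0, size=n)              # sample from the null N(0, 1)
    t_stat, pval = stats.ttest_1samp(x, popmean=0.0)
    reject += (pval < alpha)                      # I_j = 1 if H_0 is rejected

print(reject / m)   # observed Type I error rate, close to alpha = 0.05
```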
4. Bootstrap
4.1. Bootstrap estimate of a distribution
Algorithm 7 (Bootstrap estimate of the distribution of $\hat\theta$).
1. For each bootstrap replicate $b = 1,\dots,B$: (a) generate the sample $x^{*(b)} = (x_1^{*(b)},\dots,x_n^{*(b)})$ by sampling with replacement from the observations $x_1,\dots,x_n$; (b) compute the $b$-th replicate $\hat\theta^{(b)}$ from $x^{*(b)}$.
2. The bootstrap estimate of $F_{\hat\theta}$ is the empirical distribution of $\hat\theta^{(1)},\dots,\hat\theta^{(B)}$.
4.2. Point estimation
1. Standard error of $\hat\theta$: $se_B \approx se(\hat\theta^*) = \Big\{\frac{1}{B-1}\sum_{b=1}^B \big(\hat\theta^{(b)} - \bar{\hat\theta}^*\big)^2\Big\}^{1/2}$, where $\bar{\hat\theta}^* = \frac{1}{B}\sum_{b=1}^B \hat\theta^{(b)}$.
2. Bias of $\hat\theta$: $\widehat{\mathrm{bias}}(\hat\theta) = \frac{1}{B}\sum_{b=1}^B \hat\theta^{(b)} - \hat\theta$. A sketch follows below.
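A minimal Python sketch (mine) of the bootstrap standard error and bias of the sample median on simulated data:

```python
import numpy as np

rng = np.random.default_rng(9)

x = rng.exponential(scale=2.0, size=50)    # observed data (illustrative)
theta_hat = np.median(x)                   # statistic of interest
B = 2000

boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

se_boot = boot.std(ddof=1)                 # bootstrap standard error
bias_boot = boot.mean() - theta_hat        # bootstrap bias estimate
print(theta_hat, se_boot, bias_boot)
```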
4.3. Confidence intervals
Now we use the bootstrap to construct confidence intervals.
4.3.1. Standard normal CI
Using the approximate normality of $\hat\theta$: $[\hat\theta \pm 1.96\, se_B(\hat\theta)]$ (for a 95% interval).
4.3.2. Percentile CI
Use the sample quantiles of the bootstrap replicates: $[\hat\theta^*_{\alpha/2},\ \hat\theta^*_{1-\alpha/2}]$.
4.3.3. Basic bootstrap CI
Suppose $(L, U)$ is the confidence interval, i.e., $P(L \ge \theta) = P(U \le \theta) = \frac{\alpha}{2}$. Then
$\frac{\alpha}{2} = P(L \ge \theta) = P(L - \hat\theta \ge \theta - \hat\theta) = P(\hat\theta - \theta \ge \hat\theta - L)$,
so $\hat\theta - L$ is the $1-\alpha/2$ quantile of $\hat\theta - \theta$. We can estimate the $1-\alpha/2$ quantile of $\hat\theta$ by the bootstrap replicate quantile $\hat\theta^*_{1-\alpha/2}$, so $\hat\theta^*_{1-\alpha/2} - \hat\theta$ is approximately the $1-\alpha/2$ quantile of $\hat\theta - \theta$. Setting $\hat\theta - L = \hat\theta^*_{1-\alpha/2} - \hat\theta$ gives $L = 2\hat\theta - \hat\theta^*_{1-\alpha/2}$. Similarly, $U = 2\hat\theta - \hat\theta^*_{\alpha/2}$. Thus the CI is $[2\hat\theta - \hat\theta^*_{1-\alpha/2},\ 2\hat\theta - \hat\theta^*_{\alpha/2}]$. A sketch of the percentile and basic intervals follows below.
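A minimal Python sketch (mine) computing the percentile and basic bootstrap intervals for a sample mean:

```python
import numpy as np

rng = np.random.default_rng(10)

x = rng.exponential(scale=2.0, size=50)
theta_hat = x.mean()
B, alpha = 2000, 0.05

boot = np.array([rng.choice(x, size=x.size, replace=True).mean()
                 for _ in range(B)])
q_lo, q_hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])

ci_percentile = (q_lo, q_hi)
ci_basic = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)   # L = 2*theta_hat - q_hi, etc.
print(ci_percentile, ci_basic)
```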
4.3.4. Bootstrap t CI
Algorithm 8 (Bootstrap t confidence interval).
1. Compute $\hat\theta$ from the observed data.
2. For each bootstrap replicate $b = 1,\dots,B$: (a) sample with replacement to obtain $x^{(b)} = (x_1^{(b)},\dots,x_n^{(b)})$; (b) compute $\hat\theta^{(b)}$ from $x^{(b)}$; (c) estimate $se(\hat\theta^{(b)})$ (another layer of bootstrap: resample from $x^{(b)}$, not from $x$); (d) compute the $b$-th replicate of the "t" distribution, $t^{(b)} = \dfrac{\hat\theta^{(b)} - \hat\theta}{se(\hat\theta^{(b)})}$.
3. Find the sample quantiles $t^*_{\alpha/2}$ and $t^*_{1-\alpha/2}$ from $\{t^{(b)}\}_{b=1}^B$.
4. Compute $se(\hat\theta)$ from $\{\hat\theta^{(b)}\}_{b=1}^B$.
5. The confidence interval is $\big[\hat\theta - t^*_{1-\alpha/2}\, se(\hat\theta),\ \hat\theta - t^*_{\alpha/2}\, se(\hat\theta)\big]$.
5. Jackknife
The jackknife is similar to the leave-one-out method: each jackknife sample omits one observation, i.e., the $i$-th jackknife sample is $x_{(i)} = (x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$, and the $i$-th jackknife estimate is $\hat\theta_{(i)} = f(x_{(i)})$.
5.1. Bias
The jackknife estimate of bias is $\widehat{\mathrm{bias}}_{jack} = (n-1)\big(\bar{\hat\theta}_{(\cdot)} - \hat\theta\big)$, where $\bar{\hat\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n \hat\theta_{(i)}$. A sketch follows below.
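A minimal Python sketch (mine) of the jackknife bias estimate for the plug-in (biased) variance estimator:

```python
import numpy as np

rng = np.random.default_rng(11)

x = rng.exponential(scale=2.0, size=30)
n = x.size

def stat(sample):
    return sample.var(ddof=0)          # plug-in variance, a biased statistic

theta_hat = stat(x)
theta_jack = np.array([stat(np.delete(x, i)) for i in range(n)])

bias_jack = (n - 1) * (theta_jack.mean() - theta_hat)
print(theta_hat, bias_jack, theta_hat - bias_jack)   # bias-corrected estimate
```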
6. Bayesian statistics and MCMC
Bayesian statistics looks at problems in a different way from frequentist statistics. For a frequentist, the parameter is fixed but unknown and the experiment is repeatable; for a Bayesian, the parameter is random and unknown and the experiment (the data) is fixed. In other words, the target of Bayesian inference is the posterior $p(\theta \mid \text{data})$ rather than the likelihood $p(\text{data} \mid \theta)$.
6.1. Bayesian problem set-up
6.2. Markov chain Monte Carlo
Construct a Markov chain $\{X_t : t = 0,1,\dots\}$ whose stationary distribution is the target distribution.
6.2.1. Metropolis-Hastings sampler
Now suppose our target distribution is $f$. To move from one state to another, we use a proposal $Y \sim g(y \mid X_t)$.
Algorithm 9 (MH sampler).
1. Choose a proper proposal $g(\cdot \mid X_t)$.
2. Initialize $X_0$ and repeat until the chain converges. At time $t$: (a) generate $Y$ from $g(\cdot \mid X_t)$; (b) compute the ratio $r(X_t, Y) = \dfrac{f(Y)\, g(X_t \mid Y)}{f(X_t)\, g(Y \mid X_t)}$; (c) let $\alpha(X_t, Y) = \min\{r(X_t, Y), 1\}$ be the acceptance probability and set $X_{t+1} = Y$ with probability $\alpha(X_t, Y)$, $X_{t+1} = X_t$ with probability $1 - \alpha(X_t, Y)$.
A code sketch of a random-walk version appears below.
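A minimal Python sketch (mine) of an MH sampler with a symmetric normal random-walk proposal targeting the standard Cauchy density:

```python
import numpy as np

rng = np.random.default_rng(12)

def f(x):
    return 1.0 / (np.pi * (1.0 + x * x))    # standard Cauchy target density

def mh_sampler(n_iter=50_000, sigma=2.0, x0=0.0):
    """Metropolis-Hastings with the symmetric proposal Y ~ N(X_t, sigma^2),
    so the ratio r(X_t, Y) reduces to f(Y) / f(X_t)."""
    x = np.empty(n_iter)
    x[0] = x0
    for t in range(1, n_iter):
        y = rng.normal(x[t - 1], sigma)               # proposal Y ~ g(. | X_t)
        r = f(y) / f(x[t - 1])                        # g terms cancel (symmetric)
        x[t] = y if rng.uniform() < min(r, 1.0) else x[t - 1]
    return x

chain = mh_sampler()[10_000:]                          # discard burn-in
print(np.quantile(chain, [0.25, 0.5, 0.75]))           # Cauchy quartiles: -1, 0, 1
```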
Now we show why the MH sampler works. Let $r$ and $s$ be two different states. Without loss of generality, assume $f(s)g(r \mid s) \ge f(r)g(s \mid r)$. Then $\alpha(r,s) = 1$ and $\alpha(s,r) = \dfrac{f(r)g(s \mid r)}{f(s)g(r \mid s)}$, and, assuming $X_t \sim f$,
$P(X_t = s, X_{t+1} = r) = P(X_t = s)\,P(X_{t+1} = r \mid X_t = s) = f(s)\,g(r \mid s)\,\alpha(s,r) = f(r)\,g(s \mid r)$,
$P(X_t = r, X_{t+1} = s) = P(X_t = r)\,P(X_{t+1} = s \mid X_t = r) = f(r)\,g(s \mid r)\,\alpha(r,s) = f(r)\,g(s \mid r)$,
so $P(X_t = s, X_{t+1} = r) = P(X_t = r, X_{t+1} = s)$. Thus
$P(X_t = r) = \sum_s P(X_t = r, X_{t+1} = s) = \sum_s P(X_t = s, X_{t+1} = r) = P(X_{t+1} = r)$,
i.e., the Markov chain is stationary at $f$. $\square$
We can also prove it from a kernel point of view. Define $K(r,s) = g(s \mid r)\,\alpha(r,s)$ as the transition kernel from $r$ to $s$. We say $f$ is the stationary distribution if the balance condition $\int f(r)K(r,s)\,dr = f(s)$ holds. Under the MH sampler,
$f(r)K(r,s) = f(r)\,g(s \mid r)\,\alpha(r,s) = f(r)\,g(s \mid r)\min\left\{1, \dfrac{f(s)g(r \mid s)}{f(r)g(s \mid r)}\right\} = \min\{f(r)g(s \mid r),\ f(s)g(r \mid s)\} = f(s)\,g(r \mid s)\,\alpha(s,r) = f(s)K(s,r)$.
This is the detailed balance condition $f(r)K(r,s) = f(s)K(s,r)$. Then
$\int f(r)K(r,s)\,dr = \int f(s)K(s,r)\,dr = f(s)\int K(s,r)\,dr = f(s)$.
6.2.2. Metropolis sampler
With a symmetric proposal, $g(X \mid Y) = g(Y \mid X)$, the ratio simplifies to $r(X_t, Y) = \dfrac{f(Y)}{f(X_t)}$.
6.2.3. Random walk sampler
The proposal depends only on the distance, $g(Y \mid X_t) = g(|Y - X_t|)$, for example $Y = X_t + N(0, \sigma^2)$. Since the proposal is symmetric, $r(X_t, Y) = \dfrac{f(Y)}{f(X_t)}$.
6.2.5. Gibbs sampler
Now suppose we generate samples from a multivariate density $f(x)$, where $x \in \mathcal{X} \subseteq \mathbb{R}^d$. Partition the $d$-dimensional vector $x$ into $K$ disjoint blocks, $x = (x_1,\dots,x_K)^T$ with $K \le d$, and let
$f_k(x_k \mid x_{-k}) = f_k(x_k \mid x_1,\dots,x_{k-1},x_{k+1},\dots,x_K)$, $k = 1,\dots,K$,
denote the full conditional distributions.
Algorithm 10 (Gibbs sampler).
1. Start from an arbitrary point $x^{(0)} \in \mathcal{X}$ with $f(x^{(0)}) > 0$.
2. At time $t$:
(1) generate $x_1^{(t)} \sim f_1(x_1 \mid x_2^{(t-1)},\dots,x_K^{(t-1)})$;
$\vdots$
(k) generate $x_k^{(t)} \sim f_k(x_k \mid x_1^{(t)},\dots,x_{k-1}^{(t)}, x_{k+1}^{(t-1)},\dots,x_K^{(t-1)})$;
$\vdots$
(K) generate $x_K^{(t)} \sim f_K(x_K \mid x_1^{(t)},\dots,x_{K-1}^{(t)})$.
3. Set $x^{(t)} = (x_1^{(t)},\dots,x_K^{(t)})^T$.
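A minimal Python sketch (mine) of a Gibbs sampler for a bivariate normal target, where both full conditionals are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(13)

def gibbs_bvn(n_iter=20_000, rho=0.8):
    """Gibbs sampler for a bivariate normal with zero means, unit variances and
    correlation rho; the full conditionals are X1|X2 ~ N(rho*X2, 1 - rho^2)
    and X2|X1 ~ N(rho*X1, 1 - rho^2)."""
    x = np.zeros((n_iter, 2))
    s = np.sqrt(1.0 - rho**2)
    for t in range(1, n_iter):
        x1 = rng.normal(rho * x[t - 1, 1], s)   # draw x1 | x2^(t-1)
        x2 = rng.normal(rho * x1, s)            # draw x2 | x1^(t)
        x[t] = (x1, x2)
    return x

chain = gibbs_bvn()[5_000:]                      # drop burn-in
print(chain.mean(axis=0), np.corrcoef(chain.T)[0, 1])   # near (0, 0) and 0.8
```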
The Gibbs sampler is a special case of the MH sampler, with acceptance ratio 1; the proof can be found in homework 7.
6.3. Monitoring convergence (Gelman-Rubin method)
In MCMC the chain may be trapped in a local mode, so we can run multiple chains from different initial points to check convergence (recall ANOVA in DATA130046 Statistics II). Suppose there are $J$ chains and $n$ is the number of draws in each chain after discarding the burn-in period, and let $\psi$ be a function of the data. Write
$\psi_i^{(j)} = \psi(X_1^{(j)},\dots,X_i^{(j)})$, $i = 1,\dots,n$; $j = 1,\dots,J$,
$\bar\psi^{(j)} = \frac{1}{n}\sum_{i=1}^n \psi_i^{(j)}$, $\quad \bar\psi = \frac{1}{J}\sum_{j=1}^J \bar\psi^{(j)}$.
Then the between-sequence variance is $B = \frac{n}{J-1}\sum_{j=1}^J (\bar\psi^{(j)} - \bar\psi)^2$ and the within-sequence variance is $W = \frac{1}{J}\sum_{j=1}^J s_j^2$, where $s_j^2 = \frac{1}{n-1}\sum_{i=1}^n (\psi_i^{(j)} - \bar\psi^{(j)})^2$. The Gelman-Rubin statistic is
$\hat R = \dfrac{\frac{n-1}{n} W + \frac{1}{n} B}{W}$,
which should decrease and converge to 1 if the chain converges well; a recommended threshold is 1.1. A sketch follows below.
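A minimal Python sketch (mine) of the Gelman-Rubin statistic for $J$ chains of a scalar summary:

```python
import numpy as np

def gelman_rubin(psi):
    """Gelman-Rubin statistic for psi of shape (J, n): J chains, n kept draws each."""
    J, n = psi.shape
    chain_means = psi.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-sequence variance
    W = psi.var(axis=1, ddof=1).mean()       # within-sequence variance
    return ((n - 1) / n * W + B / n) / W

# Toy check: independent N(0, 1) "chains" should give a value close to 1.
rng = np.random.default_rng(14)
print(gelman_rubin(rng.normal(size=(4, 5000))))
```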
7. EM algorithm
The expectation-maximization (EM) algorithm is used to overcome difficulties in maximizing likelihoods. Suppose that besides the observed data $Y$ there are unobserved data $U$, and that $f(Y \mid \theta)$ is not easy to evaluate while $f(Y, U \mid \theta)$ is easy to compute. Write the complete-data log-likelihood as
$\log f(Y, U \mid \theta) = \log f(Y \mid \theta) + \log f(U \mid Y, \theta)$.
Taking the expectation with respect to $U$, given $Y = y$ and $\theta = \theta^*$, we have
$E_U\big[\log f(Y, U \mid \theta) \mid Y = y, \theta^*\big] = \ell(\theta) + E_U\big[\log f(U \mid Y, \theta) \mid Y = y, \theta^*\big]$,
which we write as
$Q(\theta, \theta^*) = \ell(\theta) + C(\theta, \theta^*)$.  (1)
Letting $\theta = \theta^*$ gives
$Q(\theta^*, \theta^*) = \ell(\theta^*) + C(\theta^*, \theta^*)$.  (2)
Subtracting (2) from (1), $\ell(\theta) - \ell(\theta^*) = Q(\theta, \theta^*) - Q(\theta^*, \theta^*) - \{C(\theta, \theta^*) - C(\theta^*, \theta^*)\}$. For the last term,
$C(\theta, \theta^*) - C(\theta^*, \theta^*) = E\Big\{\log \dfrac{f(U \mid Y, \theta)}{f(U \mid Y, \theta^*)} \,\Big|\, Y = y, \theta^*\Big\} \le \log E\Big\{\dfrac{f(U \mid Y, \theta)}{f(U \mid Y, \theta^*)} \,\Big|\, Y = y, \theta^*\Big\} = \log\Big\{\int \dfrac{f(u \mid y, \theta)}{f(u \mid y, \theta^*)}\, f(u \mid y, \theta^*)\,du\Big\} = \log 1 = 0$
by Jensen's inequality. Thus $\ell(\theta) - \ell(\theta^*) \ge Q(\theta, \theta^*) - Q(\theta^*, \theta^*)$: given $\theta^*$, if we find $\theta$ such that $Q(\theta, \theta^*) \ge Q(\theta^*, \theta^*)$, then $\ell(\theta) \ge \ell(\theta^*)$.
Algorithm 11 (The EM algorithm).
1. E-step: compute $Q(\theta, \theta^{(t)}) = E_U\big[\log f(Y, U \mid \theta) \mid Y = y, \theta^{(t)}\big]$.
2. M-step: $\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$.
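As an illustration (mine, not from the notes; numpy and scipy assumed), EM for a two-component normal mixture with known unit variances, where the unobserved data $U$ are the component labels; both steps have closed forms:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(15)

# Simulate data: p*N(-2, 1) + (1-p)*N(2, 1) with p = 0.3
n = 1000
labels = rng.uniform(size=n) < 0.3
y = np.where(labels, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

p, mu1, mu2 = 0.5, -1.0, 1.0        # initial values theta^(0)
for _ in range(100):
    # E-step: posterior responsibility of component 1 for each observation
    w1 = p * norm.pdf(y, mu1, 1.0)
    w2 = (1 - p) * norm.pdf(y, mu2, 1.0)
    r = w1 / (w1 + w2)
    # M-step: maximize Q(theta, theta^(t)) in closed form
    p = r.mean()
    mu1 = np.sum(r * y) / np.sum(r)
    mu2 = np.sum((1 - r) * y) / np.sum(1 - r)

print(p, mu1, mu2)   # close to the simulated values 0.3, -2, 2
```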
8. Variational inference
To approximate the posterior $p(z \mid x) = \dfrac{\pi(z) f(x \mid z)}{f(x)}$ in Bayesian statistics, rather than generating random samples by MCMC, we can solve the problem by optimization, i.e., find the density closest to $p(z \mid x)$ within some family. To judge the distance between our density $q$ and the target $p$, we can use the Kullback-Leibler (KL) divergence.
8.1. KL divergence
$\mathrm{KL}(f \,\|\, g) = E_f\Big\{\log \dfrac{f(X)}{g(X)}\Big\} = \int_{-\infty}^{+\infty} \log \dfrac{f(x)}{g(x)}\, f(x)\,dx$.
Properties:
1. The KL divergence is non-negative, and $\mathrm{KL}(f \,\|\, g) = 0$ iff $f = g$.
2. KL is not a distance (metric); in particular $\mathrm{KL}(f \,\|\, g) \ne \mathrm{KL}(g \,\|\, f)$ in general.
Proof of Property 1: by Jensen's inequality,
$-\mathrm{KL}(f \,\|\, g) = E_f\Big\{\log \dfrac{g(X)}{f(X)}\Big\} \le \log\Big\{E_f \dfrac{g(X)}{f(X)}\Big\} = \log \int_{-\infty}^{+\infty} \dfrac{g(x)}{f(x)}\, f(x)\,dx = \log 1 = 0$.
A numerical sketch follows below.
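A minimal Python sketch (mine) comparing the closed-form KL divergence between two normal densities with a Monte Carlo estimate of $E_f\{\log f(X) - \log g(X)\}$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(16)

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0    # f = N(0, 1), g = N(1, 4)

# Closed-form KL(f || g) for two univariate normals
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# Monte Carlo estimate: average of log f(X) - log g(X) with X ~ f
x = rng.normal(mu1, s1, size=200_000)
kl_mc = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))

print(kl_exact, kl_mc)   # the two agree; KL(g || f) would give a different value
```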
8.2. Evidence lower bound (ELBO)
Suppose we have a family $\mathcal{Q}$ of densities over the latent variables. We want
$q^*(z) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$.
Now
$\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) = E_q\Big\{\log \dfrac{q(z)}{p(z \mid x)}\Big\} = E_q\{\log q(z)\} - E_q\{\log p(z \mid x)\} = E_q\{\log q(z)\} - E_q\{\log f(x,z)\} + E_q\{\log f(x)\} = \log f(x) - \mathrm{ELBO}(q)$,
where the evidence lower bound (ELBO) is defined as
$\mathrm{ELBO}(q) \triangleq E_q\{\log f(x,z)\} - E_q\{\log q(z)\}$.
Then $\log f(x) = \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) + \mathrm{ELBO}(q) \ge \mathrm{ELBO}(q)$, so to minimize $\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$ we maximize $\mathrm{ELBO}(q)$. Furthermore,
$\mathrm{ELBO}(q) = E_q\{\log f(x,z)\} - E_q\{\log q(z)\} = E_q\{\log f(x \mid z)\} + E_q\{\log \pi(z)\} - E_q\{\log q(z)\} = E_q\{\log f(x \mid z)\} - E_q\Big\{\log \dfrac{q(z)}{\pi(z)}\Big\} = E_q\{\log f(x \mid z)\} - \mathrm{KL}\big(q(z) \,\|\, \pi(z)\big)$.
To maximize $\mathrm{ELBO}(q)$ we therefore, on one hand, maximize $E_q\{\log f(x \mid z)\}$, i.e., find a $q$ that fits the data; on the other hand, we minimize $\mathrm{KL}(q(z) \,\|\, \pi(z))$, i.e., find a $q$ that stays close to the prior.
8.3. The mean-field variational family
Now consider the special case where the family is $\mathcal{Q} = \big\{q : q(z) = \prod_{j=1}^m q_j(z_j)\big\}$, i.e., the components of $z$ are treated as "independent" under $q$. Then
$\mathrm{ELBO}(q) = E_q\{\log f(x,z)\} - E_q\{\log q(z)\} = \int \prod_{j=1}^m q_j \Big(\log f(x,z) - \sum_{j=1}^m \log q_j\Big)\,dz$.
We maximize $\mathrm{ELBO}(q)$ by coordinate ascent: fix all other factors and climb on $q_k$. Then
$\mathrm{ELBO}(q) = \int q_k \Big\{\int \log f(x,z) \prod_{j \ne k} q_j\,dz_j\Big\}\,dz_k - \int q_k \log q_k\,dz_k + c_1$,
where $c_1$ is a constant not involving $q_k$. Define $\log \tilde p(x, z_k) = \int \log f(x,z) \prod_{j \ne k} q_j\,dz_j + c_2 = E_{-k}\{\log f(x,z)\} + c_2$, with $c_2$ a normalizing constant. Then
$\mathrm{ELBO}(q) = \int q_k \log \tilde p(x, z_k)\,dz_k - \int q_k \log q_k\,dz_k + c_1 = -\mathrm{KL}\big(q_k \,\|\, \tilde p(x, z_k)\big) + c_3$,
which is maximized by
$q_k^* = \tilde p(x, z_k) \propto \exp\big\{E_{-k}\{\log f(x,z)\}\big\}$.