Statistics: Bootstrapping

Bootstrapping is the use of simulation to approximate the value of the plug-in estimator of a statistical functional T which is expressed in terms of independent observations from the input distribution \nu. The key point is that drawing k observations from the empirical distribution \widehat{\nu} is the same as drawing k times from the list of observations.

Example
Consider the statistical functional T(\nu) = the expected difference between the greatest and least of 10 independent observations from \nu. Suppose that 50 observations X_1, \ldots , X_{50} from \nu are observed, and that \widehat{\nu} is the associated empirical CDF. Explain how T(\widehat{\nu}) may be estimated with arbitrarily small error.

Solution. The value of T(\widehat{\nu}) is defined as the expectation of a distribution that we know how to sample from: drawing from \widehat{\nu} amounts to drawing from the list of observations. So we sample 10 times with replacement from X_1, \ldots , X_{50}, identify the largest and smallest of the 10 draws, and record their difference. We repeat this B times for some large integer B and return the sample mean of the B recorded values.

By the law of large numbers, the result can be made arbitrarily close to T(\widehat{\nu}) with arbitrarily high probability by choosing B sufficiently large.
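The procedure in the solution can be sketched in Julia as follows. This is only an illustration under stated assumptions: the data X are simulated from the uniform distribution as a stand-in for the 50 real observations, and the helper names `range_of_resample` and `bootstrap_range` are made up for this example.

```julia
using Random, Statistics, StatsBase

Random.seed!(1)   # fixed seed, only so that runs are repeatable
X = rand(50)      # assumption: placeholder for the 50 observations

# One bootstrap replicate: draw k values with replacement from X
# and return the difference between the largest and the smallest.
function range_of_resample(X, k)
    draws = sample(X, k; replace = true)
    return maximum(draws) - minimum(draws)
end

# Average B replicates to approximate T(ν̂).
bootstrap_range(X; k = 10, B = 10^5) =
    mean(range_of_resample(X, k) for _ in 1:B)

estimate = bootstrap_range(X)
println(estimate)
```

Increasing B reduces the simulation error in the approximation of T(\widehat{\nu}), but it does not change T(\widehat{\nu}) itself, which is determined by the 50 observations.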

Although this example might seem a bit contrived, bootstrapping is useful in practice because of a common source of statistical functionals that fit the bootstrap form: standard errors.

Example
Suppose that we estimate the median \theta of a distribution using the plug-in estimator \widehat{\theta} for 75 observations, and we want to produce a confidence interval for \theta. Show how to use bootstrapping to estimate the standard error of the estimator.

Solution. By definition, the standard error of \widehat{\theta} is the square root of the variance of the median of 75 independent draws from \nu. Therefore, the plug-in estimator of the standard error is the square root of the variance of the median of 75 independent draws from \widehat{\nu}. This can be readily simulated. If the observations are stored in a vector X, then

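# Julia
using Statistics, StatsBase
X = rand(75)  # placeholder data; use your own observations here
std(median(sample(X, 75)) for _ in 1:10^5)  # sample draws with replacement by default

# R equivalent (assumes X already holds the observations)
sd(sapply(1:10^5, function(n) {median(sample(X, 75, replace=TRUE))}))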

returns a very accurate approximation of T(\widehat{\nu}).
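Since the example asks for a confidence interval, here is one possible way to turn the bootstrap standard error into an interval. It is a sketch, not the only option: it assumes that \widehat{\theta} is approximately normally distributed, so that roughly 95% of its mass lies within 1.96 standard errors of its mean. The data and the names `theta_hat`, `se`, and `ci` are illustrative.

```julia
using Random, Statistics, StatsBase

Random.seed!(1)   # fixed seed, only so that runs are repeatable
X = rand(75)      # assumption: placeholder for the 75 observations

theta_hat = median(X)   # plug-in estimate of the median

# Bootstrap estimate of the standard error of the sample median.
se = std(median(sample(X, 75; replace = true)) for _ in 1:10^4)

# Assumption: normal approximation, giving a nominal 95% interval.
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
println(ci)
```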

Perhaps the most important caution regarding bootstrapping is that the bootstrap only approximates T(\widehat{\nu}). It only approximates T(\nu) (where \nu is the underlying true distribution from which the observations are sampled) insofar as we have enough observations for T(\widehat{\nu}) to approximate T(\nu) well.

Exercise
Suppose that \nu is the uniform distribution on [0,1]. Generate 75 observations from \nu, store them in a vector X, and compute the bootstrap estimate of T(\widehat{\nu}), where T(\nu) is the standard deviation of the median of 75 independent observations from \nu. Use Monte Carlo simulation to directly estimate T(\nu). Can the gap between your approximations of T(\widehat{\nu}) and T(\nu) be made arbitrarily small by using more bootstrap samples?

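Solution. The gap cannot be made arbitrarily small. More bootstrap samples only improve the approximation of T(\widehat{\nu}); for T(\widehat{\nu}) itself to get closer to the exact value of T(\operatorname{Unif}([0,1])), we would need more than 75 observations from the distribution.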

using Statistics, StatsBase
X = rand(75)
std(median(sample(X, 75)) for _ in 1:10^6) # estimate T(ν̂) by bootstrapping
std(median(rand(75)) for _ in 1:10^6) # estimate T(ν) by direct Monte Carlo
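To see the role of the number of observations, one can keep the functional fixed (the standard deviation of the median of 75 draws) and build the empirical distribution from more observations. The sketch below is an illustration only: the values of n are arbitrary choices, and individual runs will vary, although the bootstrap value should tend to move toward the direct estimate as n grows.

```julia
using Random, Statistics, StatsBase

Random.seed!(1)   # fixed seed, only so that runs are repeatable

# Direct Monte Carlo estimate of T(ν) for ν = Unif([0,1]).
direct = std(median(rand(75)) for _ in 1:10^5)

# Bootstrap estimate of T(ν̂) when ν̂ is built from n observations.
# Each resample still has size 75, so the functional T is unchanged.
boot_se(n; B = 10^5) = begin
    X = rand(n)
    std(median(sample(X, 75; replace = true)) for _ in 1:B)
end

for n in (75, 750, 7500)   # arbitrary illustrative sample sizes
    println("n = $n: bootstrap = $(boot_se(n)), direct = $direct")
end
```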