- Hands-On Ensemble Learning with R
- Prabhanjan Narayanachar Tattar
The jackknife method for mean and variance
Suppose the probability distribution is unknown, and the histogram and other visualization techniques suggest that the assumption of normal distribution is not appropriate. However, we don't have rich information either to formulate a reasonable probability model for the problem at hand. Here, we can put the jackknife technique to good use.
We define the mean and variance estimators as follows:

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\]

The pseudovalues associated with \(\bar{x}\) and \(s^2\) are respectively given in the following expressions:

\[
pv_i(\bar{x}) = n\bar{x} - (n-1)\,\bar{x}_{(-i)} = x_i
\]
\[
pv_i(s^2) = n s^2 - (n-1)\, s^2_{(-i)}
          = \frac{n}{n-2}\left(x_i - \bar{x}\right)^2
          - \frac{1}{(n-1)(n-2)} \sum_{j=1}^{n} \left(x_j - \bar{x}\right)^2
\]

where \(\bar{x}_{(-i)}\) and \(s^2_{(-i)}\) denote the mean and variance computed after deleting the i-th observation.
The mean of the pseudovalues of \(\bar{x}\) will be the sample mean, and the mean of the pseudovalues of \(s^2\) will be the sample variance. However, the real payoff of the jackknife method lies in the details. Based on the estimated mean alone, we would not be able to infer about the population mean, and based on the sample variance alone, we would not be able to carry out exact inference about the population variance. To see what is happening with these formulas of pseudovalues and how their variances will be useful, we will set up an elegant R program next.
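Before turning to the program, note that the pseudovalue construction is not specific to the mean or variance: for any statistic, the i-th pseudovalue is n times the full-sample estimate minus (n-1) times the leave-one-out estimate. The following is a minimal sketch of that general recipe; the helper name jackknife_pv is mine, not the book's:

```r
# Generic jackknife pseudovalues for an arbitrary statistic theta
# (illustrative helper; jackknife_pv is a hypothetical name, not from the book)
jackknife_pv <- function(x, theta) {
  n <- length(x)
  sapply(seq_len(n), function(i) n * theta(x) - (n - 1) * theta(x[-i]))
}

x <- c(2, 4, 6, 8)
pv <- jackknife_pv(x, mean)
# For the mean, each pseudovalue reduces to the observation itself
```

Applying it with `theta = mean` recovers the observations, which is exactly the identity \(pv_i(\bar{x}) = x_i\) derived above.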
We will simulate n = 1000 observations from the Weibull distribution with given scale and shape parameters. In the standard literature, we can find estimators of these two parameters. However, a practitioner is seldom interested in the parameters themselves and would prefer to infer about the mean and variance of the lifetimes. The density function has a complex form. Furthermore, the theoretical mean and variance of a Weibull random variable, expressed in terms of the scale and shape parameters, involve Gamma integrals, and these expressions do not help the case any further. If the reader searches for the string statistical inference for the mean of Weibull distribution in a search engine, the results will not be satisfactory, and it won't be easy to proceed any further, except for individuals who are mathematically adept. In this complex scenario, we will look at how the jackknife method saves the day for us.
A note is in order before we proceed. The reader might wonder: who cares about the Weibull distribution in this era of brawny super-computational machines? However, any reliability engineer will vouch for the usefulness of lifetime distributions, and the Weibull is an important member of this class. A second point might be that the normal approximation will hold well for large samples. However, when we have only moderate samples to carry out the inference, the normal approximation for a highly skewed distribution such as the Weibull might lose out on the power and confidence of the tests. Besides, even if we firmly believe that the underlying distribution is Weibull (with unknown parameters), it remains a monumental mathematical task to obtain the exact distributions of the sample mean and variance.
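Although the Gamma-integral expressions mentioned above are unwieldy for inference, they can at least be evaluated numerically as a sanity check. For a Weibull random variable with scale \(\lambda\) and shape \(k\), the mean is \(\lambda\,\Gamma(1+1/k)\) and the variance is \(\lambda^2[\Gamma(1+2/k) - \Gamma(1+1/k)^2]\). A quick sketch, assuming the shape 0.5 and scale 15 used in the simulation below:

```r
# Theoretical mean and variance of a Weibull(shape = 0.5, scale = 15)
# random variable via the Gamma function (sanity check only, not inference)
shape <- 0.5; scale <- 15
w_mean <- scale * gamma(1 + 1/shape)                           # 15 * gamma(3) = 30
w_var  <- scale^2 * (gamma(1 + 2/shape) - gamma(1 + 1/shape)^2)
# 225 * (gamma(5) - gamma(3)^2) = 225 * (24 - 4) = 4500
```

These values, 30 and 4500, are in the same ballpark as the sample summaries obtained below, but they tell us nothing about the sampling distribution of the estimators, which is where the jackknife comes in.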
The R program will implement the jackknife technique for the mean and variance for given raw data:
> # Simulating observations from Weibull distribution
> set.seed(123)
> sr <- rweibull(1000,0.5,15)
> mean(sr); sd(sr); var(sr)
[1] 30.41584
[1] 69.35311
[1] 4809.854
As mentioned in earlier simulation scenarios, we plant the seed for the sake of reproducible results. The rweibull function enacts the task of simulating observations from the Weibull distribution. We calculate the mean, standard deviation, and variance of the sample. Next, we compute the vector pv_mean, which holds the pseudovalues of the mean:
> # Calculating the pseudovalues for the mean
> pv_mean <- NULL; n <- length(sr)
> for(i in 1:n)
+   pv_mean[i] <- sum(sr) - (n-1)*mean(sr[-i])
> head(sr,20)
 [1]  23.29756524   0.84873231  11.99112962   0.23216910   0.05650965
 [6] 143.11046494   6.11445277   0.19432310   5.31450418   9.21784734
[11]   0.02920662   9.38819985   2.27263386   4.66225355  77.54961762
[16]   0.16712791  29.48688494 150.60696742  18.64782005   0.03252283
> head(pv_mean,20)
 [1]  23.29756524   0.84873231  11.99112962   0.23216910   0.05650965
 [6] 143.11046494   6.11445277   0.19432310   5.31450418   9.21784734
[11]   0.02920662   9.38819985   2.27263386   4.66225355  77.54961762
[16]   0.16712791  29.48688494 150.60696742  18.64782005   0.03252283
> mean(pv_mean); sd(pv_mean)
[1] 30.41584
[1] 69.35311
Note that the pseudovalues of the mean and the values of the observations are the same for all observations. In fact, this is anticipated: the statistic we are looking at is the mean, which is simply the average, and subtracting (n-1) times the average of the remaining observations from the overall total returns the observation itself. Consequently, the mean of the pseudovalues and the sample mean are the same too. However, that does not imply that the efforts are futile. We will continue with the computations for the variance term as follows:
> # Calculating the pseudovalues for the variance
> pseudo_var <- function(x,i){
+   n <- length(x)
+   psv <- (n/(n-2))*(x[i]-mean(x))^2 - (1/((n-1)*(n-2)))*sum((x-mean(x))^2)
+   return(psv)
+ }
> pv_var <- NULL
> for(i in 1:n)
+   pv_var[i] <- pseudo_var(sr,i)
> head(pv_var)
[1]    45.95188   871.14625   335.33073   908.06021   918.71647 12720.71024
> var(sr); mean(pv_var)
[1] 4809.854
[1] 4809.854
> sd(pv_var)
[1] 35838.59
Unlike with the mean, the pseudovalues of the variance have no counterpart among the observations in the actual data. Here, the mean of the pseudovalues will approximately equal the sample variance. It is the standard deviation of the pseudovalues, sd(pv_var), that will help in carrying out the inference related to the variance or standard deviation.
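To make that last point concrete, here is a minimal sketch of how sd(pv_var) can be turned into a confidence interval. This is my illustration rather than the book's code: it treats the pseudovalues as approximately independent and forms a t-interval of the form mean(pv) plus or minus a t-quantile times sd(pv)/sqrt(n), with the pseudovalue formula written inline:

```r
# Sketch: jackknife t-interval for the variance of Weibull lifetimes
# (illustrative, assuming the simulated data from this section)
set.seed(123)
sr <- rweibull(1000, 0.5, 15)
n  <- length(sr)

# Pseudovalues of the sample variance
pv_var <- sapply(seq_len(n), function(i) {
  (n/(n-2))*(sr[i]-mean(sr))^2 - (1/((n-1)*(n-2)))*sum((sr-mean(sr))^2)
})

est <- mean(pv_var)                             # jackknife estimate of the variance
se  <- sd(pv_var)/sqrt(n)                       # jackknife standard error
ci  <- est + c(-1, 1) * qt(0.975, n - 1) * se   # approximate 95% t-interval
```

The interval is wide because the Weibull with shape 0.5 is heavily skewed, but it is obtained without any distributional derivation for the sampling distribution of the variance.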
We have seen how the jackknife is useful for inference about the mean and variance. In the next part of this section, we will see how the pseudovalues can be useful in the context of a survival regression problem.