The jackknife method for mean and variance

Suppose the probability distribution is unknown, and the histogram and other visualization techniques suggest that the assumption of a normal distribution is not appropriate. However, we do not have rich enough information either to formulate a reasonable probability model for the problem at hand. Here, we can put the jackknife technique to good use.

We define mean and variance estimators as follows:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

The pseudovalues associated with $\bar{X}$ and $S^2$ are respectively given in the following expressions:

$$ps_i\left(\bar{X}\right) = n\bar{X} - (n-1)\bar{X}_{-i} = X_i$$
$$ps_i\left(S^2\right) = \frac{n}{n-2}\left(X_i - \bar{X}\right)^2 - \frac{1}{(n-1)(n-2)}\sum_{j=1}^{n}\left(X_j - \bar{X}\right)^2$$

Here, $\bar{X}_{-i}$ denotes the sample mean computed after deleting the $i$-th observation. The mean of the pseudovalues $ps_i(\bar{X})$ will be the sample mean, and the mean of the pseudovalues $ps_i(S^2)$ will be the sample variance. However, the real application of the jackknife method lies in the details. Based on the estimated mean alone, we would not be able to infer about the population mean, and based on the sample variance alone, we would not be able to carry out exact inference about the population variance. To see what is happening with these formulas of pseudovalues and how their variances will be useful, we will first recall the standard jackknife recipe and then set up an elegant R program.
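To anticipate how the pseudovalues will be put to work, recall the Tukey-style jackknife recipe: the pseudovalues $ps_i$ of a statistic are treated as though they were approximately independent observations, so that the jackknife point estimate and an approximate $100(1-\alpha)\%$ confidence interval are

$$\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n} ps_i, \qquad \tilde{\theta} \pm t_{n-1,\,1-\alpha/2}\,\frac{s_{ps}}{\sqrt{n}},$$

where $s_{ps}$ denotes the standard deviation of the pseudovalues and $t_{n-1,\,1-\alpha/2}$ is the usual Student's t quantile; the notation $\tilde{\theta}$ and $s_{ps}$ is introduced here only for this sketch. This is a heuristic recipe rather than an exact result, but it is precisely the role that the variance of the pseudovalues plays in what follows.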

We will simulate n = 1000 observations from the Weibull distribution with a given scale and shape parameter. In the standard literature, we can find estimators of these two parameters. However, a practitioner is seldom interested in the parameters themselves and would rather draw inference about the mean and variance of the lifetimes. The density function has a complex form. Furthermore, the theoretical mean and variance of a Weibull random variable, written in terms of the scale and shape parameters, involve Gamma integrals (reproduced below for reference) and do not help the case any further. If the reader searches for the string statistical inference for the mean of Weibull distribution in a search engine, the results will not be satisfactory, and it will not be easy to proceed any further, except for individuals who are mathematically adept. In this complex scenario, we will look at how the jackknife method saves the day for us.
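For reference, the expressions alluded to above are the standard ones: if $\kappa$ denotes the shape and $\lambda$ the scale, then

$$E(X) = \lambda\,\Gamma\!\left(1 + \frac{1}{\kappa}\right), \qquad \mathrm{Var}(X) = \lambda^2\left[\Gamma\!\left(1 + \frac{2}{\kappa}\right) - \Gamma^{2}\!\left(1 + \frac{1}{\kappa}\right)\right].$$

With the shape 0.5 and scale 15 used in the simulation below, these evaluate to a mean of 30 and a variance of 4500, which gives a rough benchmark against which to read the jackknife output; the sampling distributions of the corresponding estimators, however, remain intractable in closed form.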

A note is in order before we proceed. The reader might wonder who cares about the Weibull distribution in this era of brawny super-computational machines. However, any reliability engineer will vouch for the usefulness of lifetime distributions, and the Weibull is an important member of this class. A second objection might be that the normal approximation will hold well for large samples. However, when we only have moderate samples to carry out the inference, the normal approximation for a highly skewed distribution such as the Weibull might lose out on the power and confidence of the tests. Besides, even if we firmly believe that the underlying distribution is Weibull (without knowing its parameters), it remains a monumental mathematical task to obtain the exact sampling distributions of the sample mean and variance.

The following R program implements the jackknife technique for the mean and variance of given raw data:

> # Simulating observations from Weibull distribution
> set.seed(123)
> sr <- rweibull(1000,0.5,15)
> mean(sr); sd(sr); var(sr)
[1] 30.41584
[1] 69.35311
[1] 4809.854

As mentioned in earlier simulation scenarios, we set the seed for the sake of reproducible results. The rweibull function simulates observations from the Weibull distribution. We calculate the mean, standard deviation, and variance of the sample. Next, we compute the pseudovalues of the mean and store them in the vector pv_mean:

> # Calculating the pseudovalues for the mean
> pv_mean <- NULL; n <- length(sr)
> for(i in 1:n)
+   pv_mean[i] <- sum(sr)- (n-1)*mean(sr[-i])
> head(sr,20)
 [1]  23.29756524   0.84873231  11.99112962   0.23216910   0.05650965
 [6] 143.11046494   6.11445277   0.19432310   5.31450418   9.21784734
[11]   0.02920662   9.38819985   2.27263386   4.66225355  77.54961762
[16]   0.16712791  29.48688494 150.60696742  18.64782005   0.03252283
> head(pv_mean,20)
 [1]  23.29756524   0.84873231  11.99112962   0.23216910   0.05650965
 [6] 143.11046494   6.11445277   0.19432310   5.31450418   9.21784734
[11]   0.02920662   9.38819985   2.27263386   4.66225355  77.54961762
[16]   0.16712791  29.48688494 150.60696742  18.64782005   0.03252283
> mean(pv_mean); sd(pv_mean)
[1] 30.41584
[1] 69.35311

Note that the pseudovalues of the mean are identical to the corresponding observations. In fact, this is anticipated: the statistic we are looking at is the mean, which is simply the average, and subtracting (n-1) times the leave-one-out average from the total returns the observation itself. Consequently, the mean of the pseudovalues and the sample mean are the same too. However, that does not imply that the efforts are futile; the standard deviation of the pseudovalues is what carries the inferential information, as sketched below.
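As a quick illustration of the recipe stated earlier, the following minimal sketch builds an approximate 95% confidence interval for the population mean from the pseudovalues; the object name se_mean is introduced here purely for illustration and is not part of the preceding session:

> # Sketch: approximate 95% jackknife interval for the population mean
> # using the pseudovalues pv_mean and the sample size n from above
> se_mean <- sd(pv_mean)/sqrt(n)
> mean(pv_mean) + c(-1,1)*qt(0.975,n-1)*se_mean

Since the pseudovalues of the mean coincide with the observations, this reduces to the familiar t-interval; the real payoff appears when the same device is applied to the variance, which we compute next: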

> # Calculating the pseudovalues for the variance
> pv_var <- NULL
> pseudo_var <- function(x,i){
+   n <- length(x)
+   # pseudovalue of the sample variance after deleting observation i
+   psv <- (n/(n-2))*(x[i]-mean(x))^2 - (1/((n-1)*(n-2)))*sum((x-mean(x))^2)
+   return(psv)
+ }
> for(i in 1:n)
+   pv_var[i] <- pseudo_var(sr,i)
> head(pv_var)
[1]    45.95188   871.14625   335.33073   908.06021   918.71647 12720.71024
> var(sr); mean(pv_var)
[1] 4809.854
[1] 4809.854
> sd(pv_var)
[1] 35838.59

Unlike the case of the mean, the pseudovalues of the variance have no counterpart among the actual observations. Here, the mean of the pseudovalues equals the sample variance. It is the standard deviation sd(pv_var) of the pseudovalues that will help in carrying out inference related to the variance or standard deviation, as sketched below.
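To close the loop, here is the analogous minimal sketch for the population variance; again, the name se_var is introduced only for this illustration:

> # Sketch: approximate 95% jackknife interval for the population variance
> # using the pseudovalues pv_var and the sample size n from above
> se_var <- sd(pv_var)/sqrt(n)
> mean(pv_var) + c(-1,1)*qt(0.975,n-1)*se_var

The width of the resulting interval reflects the heavy skewness of the simulated lifetimes, and it is the kind of statement about the population variance that the sample variance alone could not deliver.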

We have seen how the jackknife is useful for inference about the mean and variance. In the next part of this section, we will see how the pseudovalues can be put to use in a survival regression problem.