Finite Population Correction for Variance of Sample Mean

We wish to calculate the variance of the sample mean $\bar X$ when sampling from a finite population. We know that
for a random sample $X_1, \ldots, X_n$ of an infinite population, the sample mean $\bar X = \frac{1}{n} \sum_{i=1}^{n} X_i$
is unbiased with mean $\mathbb{E}\left[ \bar X \right] = \mu$ and variance $\operatorname{Var} \left( \bar X \right)= \frac{\sigma^2}{n}$.
Let us now switch to a finite population of size $N$ with population units $y_1, \ldots, y_N$ and population mean
$\bar y = \frac{1}{N}\sum_{i=1}^{N} y_i$. Consider taking a sample of size $n$ without replacement. If $\frac{n}{N} \approx 0$, the assumption of an infinite population might still be reasonable, but observe what happens when the proportion
of sampled units gets large: in the extreme, we take a sample of all $N$ units, eliminating any uncertainty and obtaining a perfect estimate
of the population mean. In general, the higher the proportion of sampled units, the lower the variance will be compared to its value under
independent random sampling from an infinite population. This suggests that we might be able to correct the
estimate based on the sample variance $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar x)^2$, which is no longer unbiased
in the finite-population setting but overestimates the true variance. However, in order to find this finite population correction factor, we
first have to evaluate the variance of $\bar X$.
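
Before deriving the exact correction, here is a minimal simulation sketch of the phenomenon (Python/NumPy, with a made-up population, arbitrary seed, and arbitrary replication count): for several sampling fractions $n/N$ it compares the empirical variance of $\bar X$ under sampling without replacement with the infinite-population value $\sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population of N = 1000 units (values chosen arbitrarily).
N = 1000
y = rng.normal(loc=50.0, scale=10.0, size=N)
sigma2 = y.var()  # population variance (divisor N)

# Empirical variance of the sample mean for growing sampling fractions n/N,
# sampling without replacement each time.
for n in (10, 100, 500, 1000):
    means = [rng.choice(y, size=n, replace=False).mean() for _ in range(20_000)]
    print(f"n/N = {n / N:4.2f}: empirical Var = {np.var(means):9.5f}, "
          f"sigma^2 / n = {sigma2 / n:9.5f}")
```

The empirical variance falls increasingly below $\sigma^2/n$ as $n/N$ grows, and vanishes entirely at $n = N$.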
Let us introduce indicator random variables $Z_i$ which take the value one if the $i$-th unit of the population is in the
sample and zero otherwise. Then the sample mean can be written as


$$ \bar X = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n} \sum_{j=1}^{N} y_j Z_j $$

Clearly, the sample mean is unbiased for the population mean as


$$ \mathbb{E} \left[ \bar X \right] = \mathbb{E} \left[ \frac{1}{n} \sum_{j=1}^{N} y_j Z_j \right] = \frac{1}{n} \sum_{j=1}^{N} y_j \mathbb{E} \left[ Z_j \right] = \frac{1}{N} \sum_{j=1}^{N} y_j, $$

where we used linearity of expectation (recall that this property of expected values holds even for dependent random variables such
as the $Z_i$) and the fact that $\mathbb{E} \left[ Z_i \right] = \frac{n}{N}$ (the expected value of a hypergeometric random variable with $K = 1$).
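
As a quick sanity check, the following sketch (hypothetical population values, sizes, and seed) estimates $\mathbb{E}\left[ Z_0 \right]$ and $\mathbb{E}\left[ \bar X \right]$ by repeated sampling without replacement; they should match $\frac{n}{N}$ and $\bar y$ up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of N = 12 units, samples of size n = 5.
N, n, reps = 12, 5, 200_000
y = rng.uniform(0, 10, size=N)

hits = 0          # how often unit j = 0 lands in the sample -> E[Z_0]
mean_sum = 0.0    # running sum of sample means -> E[X_bar]
for _ in range(reps):
    idx = rng.choice(N, size=n, replace=False)
    hits += 0 in idx
    mean_sum += y[idx].mean()

print(f"E[Z_0]:   empirical {hits / reps:.4f}, theory {n / N:.4f}")
print(f"E[X_bar]: empirical {mean_sum / reps:.4f}, population mean {y.mean():.4f}")
```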

Let us proceed and calculate the variance of $\bar X$.
By definition,
$$ \operatorname{Var} \left( \bar X \right) = \operatorname{Var} \left( \frac{1}{n} \sum_{j=1}^{N} y_j Z_j \right), $$ which, using the formula for the variance of a weighted sum of dependent random variables, evaluates to
$$ \frac{1}{n^2} \left[ \sum_{j=1}^{N} y_j^2 \operatorname{Var} \left( Z_j \right) + 2 \sum_{i \lt j} y_i y_j \operatorname{Cov} \left( Z_i, Z_j \right)\right] $$

Now, $ \operatorname{Var} \left( Z_j \right) = {n \over N}{N-n \over N}$, which is the variance of a hypergeometric random variable with $K = 1$. Perhaps more intuitively, $Z_j$ takes only the values zero and one, and the probability of unit $j$ being in the sample equals $n \over N$. Hence, we can model each $Z_j$ marginally as a Bernoulli random variable with mean $\mathbb{E} \left[ Z_j \right] = {n \over N}$ and variance $\operatorname{Var} \left( Z_j \right) = {n \over N} \left( 1 - {n \over N } \right)$.

To calculate the covariance between two indicators $Z_i$ and $Z_j$, $i \neq j$, note that unit $i$ is sampled with probability $\frac{n}{N}$ and that, conditional on this, unit $j$ occupies one of the remaining $n - 1$ sample slots out of the other $N - 1$ population units. Hence $$ \mathbb{E} \left[ Z_i Z_j \right] = P(Z_i = 1, Z_j = 1) = \frac{n}{N} \frac{n-1}{N-1}, $$

which gives us $$ \operatorname{Cov} \left( Z_i, Z_j \right) = \mathbb{E} \left[ Z_i Z_j \right] - \mathbb{E} \left[ Z_i \right] \mathbb{E} \left[ Z_j \right] = \frac{n}{N} \frac{n-1}{N-1} - \frac{n^2}{N^2} = - \frac{n(1 - \frac{n}{N})}{N(N-1)} $$
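
Both moments can be checked empirically. The sketch below (again with arbitrary $N$, $n$, seed, and replication count) draws many samples without replacement, records the indicators, and compares the empirical variance and covariance of the $Z_j$ with the formulas above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, reps = 12, 5, 200_000  # hypothetical sizes

# Z[r, j] = 1 if unit j is in the r-th sample (drawn without replacement).
Z = np.zeros((reps, N))
for r in range(reps):
    Z[r, rng.choice(N, size=n, replace=False)] = 1

p = n / N
print("Var(Z_0):     empirical", round(Z[:, 0].var(), 5),
      " theory", round(p * (1 - p), 5))
print("Cov(Z_0,Z_1): empirical", round(np.cov(Z[:, 0], Z[:, 1])[0, 1], 5),
      " theory", round(-n * (1 - p) / (N * (N - 1)), 5))
```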

Plugging these results into the formula for $\operatorname{Var} \left( \bar X \right)$, we get

$$ \begin{align} \operatorname{Var} \left( \bar X \right) &= \frac{1}{n^2} \left[ \sum_{j=1}^{N} y_j^2 {n \over N}{N-n \over N} - 2 \sum_{i \lt j} y_i y_j \frac{n(1 - \frac{n}{N})}{N(N-1)} \right] \\ &= \frac{N-n}{n N^2} \sum_{j=1}^{N} y_j^2 - 2 \frac{1}{n} \frac{(1 - \frac{n}{N})}{N(N-1)} \sum_{i \lt j} y_i y_j \\ &= \frac{N-n}{n N^2} \sum_{j=1}^{N} y_j^2 - 2 \frac{1}{n} \frac{(N - n)}{N^2(N-1)} \sum_{i \lt j} y_i y_j \\ &= \frac{(N-n)}{n N^2} \left[ \sum_{j=1}^{N} y_j^2 - 2 \frac{1}{(N-1)} \sum_{i \lt j} y_i y_j \right] \end{align} $$

Recall that the population variance is
$$ \tilde \sigma^2 = \frac{1}{N} \sum_{j=1}^{N} \left ( y_j - \bar y \right)^2 $$

Furthermore, observe that
$$ \begin{equation} \sum_{j=1}^{N} \left ( y_j - \bar y \right)^2 = \sum_{j=1}^{N} y_j^2 - N \bar y ^2, \end{equation} $$ with
$$ \begin{equation} \bar y^2 = \left( \frac{1}{N} \sum_{j=1}^{N} y_j \right)^2 = \frac{1}{N^2} \left[ \sum_{j=1}^{N} y_j^2 + 2 \sum_{i \lt j} y_i y_j \right] \end{equation} $$

Inserting Equation $(6)$ into Equation $(5)$, we get
$$ \begin{align} \sum_{j=1}^{N} \left ( y_j - \bar y \right)^2 &= \sum_{j=1}^{ N} y_j^2 - \frac{1}{N} \left[ \sum_{j=1}^{N} y_j^2 + 2 \sum_{i \lt j} y_i y_j \right] \\ &= \frac{N-1}{N} \sum_{j=1}^{ N} y_j^2 - \frac{2}{N}\sum_{i \lt j} y_i y_j \end{align} $$

Multiplying both sides of Equation $(8)$ by $N$ and dividing by $N-1$, we get
$$ \frac{N}{N-1} \sum_{j=1}^{N} \left( y_j - \bar y \right)^2 = \sum_{j=1}^{N} y_j^2 - 2 \frac{1}{N-1}\sum_{i \lt j} y_i y_j $$
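
Since this identity does all the remaining work, here is a short numerical check (with an arbitrary made-up vector of population values):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=8)  # any hypothetical set of population values
N = len(y)

# Left-hand side: N/(N-1) times the sum of squared deviations from the mean.
lhs = N / (N - 1) * np.sum((y - y.mean()) ** 2)

# Right-hand side: sum of squares minus 2/(N-1) times the cross terms.
cross = sum(y[i] * y[j] for i in range(N) for j in range(i + 1, N))
rhs = np.sum(y ** 2) - 2 / (N - 1) * cross

print(np.isclose(lhs, rhs))  # True
```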

We are now almost done, as the expression on the right-hand side is equal to the expression in square brackets in Equation $(4)$. Plugging in and simplifying, we get

$$ \begin{align} \operatorname{Var} \left( \bar X \right) &= \frac{(N-n)}{n N^2} \left[ \frac{N}{N-1}\sum_{j=1}^{N} \left( y_j - \bar y \right)^2 \right] \\ &= \frac{(N-n)}{n N^2} \cdot \frac{N}{N-1} \cdot N \tilde \sigma^2 \\ &= \frac{N-n}{N-1} \frac{\tilde \sigma^2}{n} \end{align} $$

Hence, when estimating the variance of $\bar X$, we should scale our estimate by multiplying it by the finite population correction factor $\frac{N-n}{N-1}$.
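
Finally, the whole result can be verified by simulation. The sketch below (hypothetical population, sample size, and seed) compares the empirical variance of $\bar X$ over repeated samples without replacement against $\frac{N-n}{N-1} \frac{\tilde \sigma^2}{n}$; the two agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical finite population and sample size (values chosen arbitrarily).
N, n, reps = 200, 50, 100_000
y = rng.exponential(scale=3.0, size=N)
sigma2 = y.var()  # population variance with divisor N, i.e. tilde-sigma^2

# Repeatedly draw samples without replacement and record the sample mean.
means = [rng.choice(y, size=n, replace=False).mean() for _ in range(reps)]

empirical = np.var(means)
theory = (N - n) / (N - 1) * sigma2 / n
print(f"empirical Var = {empirical:.5f}, (N-n)/(N-1) * sigma^2/n = {theory:.5f}")
```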