(From StatCan website: https://www.statcan.gc.ca/eng/survey/household/3701)
The Labour Force Survey (LFS) is a household survey carried out monthly by Statistics Canada. It is the only source of current, monthly estimates of total employment and unemployment…The survey is conducted in 54,000 households across Canada…to determine the characteristics of an entire population by using the answers of a much smaller, randomly chosen sample…
For simplicity, assume there are 10 million eligible households, that the survey randomly selects 50,000 of them with equal probability, and that each household has one employable person. Because the sample is small relative to the population, we will treat the selection of each household as approximately independent.
Suppose we estimate the total unemployment rate using the unemployment rate among the selected sample:
$$\frac{\text{# unemployed in the sample}}{\text{total sample size}}.$$
- Population (the set of all people, objects, or events of interest): all eligible households in Canada
- Parameter (a population quantity): the total unemployment rate
- Sample (a subset of the population): the households selected for the survey
- Estimator (a quantity based on a sample that estimates a parameter): the sample unemployment rate
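The single-sample setup can be sketched in code. The following is a minimal simulation (in Python rather than the R used for the figures) assuming a hypothetical true unemployment rate of 0.115, the rate used for the example population below:

```python
import random

random.seed(1)

# Hypothetical numbers: 10 million households with a true unemployment
# rate of 0.115; the survey samples 50,000 households without replacement.
N, n, p = 10_000_000, 50_000, 0.115
n_unemployed = int(N * p)  # 1,150,000 unemployed households

# Households with index below n_unemployed stand in for the unemployed ones;
# a simple random sample of indices mimics the survey's selection.
sample = random.sample(range(N), n)
estimate = sum(i < n_unemployed for i in sample) / n

print(round(estimate, 3))
```

The printed sample unemployment rate (the estimator) lands close to, but generally not exactly at, the parameter 0.115.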
Suppose Figure 1 represents the employed (grey) vs. unemployed (black) households in a particular population.
Note that while the unemployment rate in the population (parameter) is fixed at 0.115, the estimator depends on the random sample and can take different values. Figure 3 shows estimates from 100 different samples and their distribution. Because the sampling process is random, the estimator is a random variable. An estimator’s distribution is called a sampling distribution.
Let \(T=h\left(X_1,X_2,\ldots,X_n\right)\) be an estimator based on a random sample \(X_1\), \(X_2\), …, \(X_n\). The probability distribution of \(T\) is called the sampling distribution of \(T\).
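The definition can be made concrete by simulating repeated samples. This sketch (with an assumed smaller population of 1 million households and samples of 5,000, so that 100 repetitions run quickly) computes one realization of the estimator per sample; together the 100 values trace out the sampling distribution:

```python
import random
import statistics

random.seed(2)

# Assumed smaller population so repeated sampling is fast; the true
# unemployment rate is again taken to be 0.115.
N, n, p = 1_000_000, 5_000, 0.115
n_unemployed = int(N * p)

# Each pass draws a fresh random sample and records the estimator's value.
estimates = []
for _ in range(100):
    sample = random.sample(range(N), n)
    estimates.append(sum(i < n_unemployed for i in sample) / n)

print(round(statistics.mean(estimates), 4))
print(round(statistics.stdev(estimates), 4))
```

The mean of the 100 estimates sits near the parameter, and their standard deviation gives a direct look at the spread of the sampling distribution.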
For the labour force survey example, we can write the estimator mathematically using an indicator function. Let
$$X_i=\begin{cases} 1 & \text{when the }i\text{th survey participant is unemployed}\\ 0 & \text{otherwise} \end{cases}$$
for \(i=1,2,\ldots,n\) and
$$T_n=\overline{X}_n=\frac{\sum_{i=1}^n X_i}{n}.$$
\(T_n\) is our estimator, and its distribution is an example of a sampling distribution. Note that \(X_i\) is a random variable since selecting an unemployed household as the \(i\)th participant is random. Consequently, the estimator \(T_n\) is also a random variable.
Suppose \(X_i\) for \(i=1,2,\ldots\) are independent random variables with finite expectation \(\mu\) and finite variance \(\sigma^2\). Recall the result from our discussion on the average of normal random variables: it holds for the average of any independent random variables with equal and finite expectation and variance. That is,
$$E\left[\overline{X}_n\right]=E\left[X_1\right]=\mu$$
and
$$\text{Var}\left(\overline{X}_n\right)=\frac{\text{Var}\left(X_1\right)}{n}=\frac{\sigma^2}{n}.$$
This implies that the variance of the sample mean, \(\text{Var}\left(\overline{X}_n\right)\), decreases as the sample size increases, while the sampling distribution remains centred at the population mean \(\mu\). This result is a consequence of the fact that, for a random variable with finite variance, most of the probability mass lies within a few standard deviations of the expectation. Chebyshev's inequality captures this property.
Any random variable \(Y\) with \(\text{Var}\left(Y\right)<\infty\) and any number \(a>0\) satisfy
$$P\left(\left|Y-E\left[Y\right]\right|\ge a\right)\le\frac{\text{Var}\left(Y\right)}{a^2}.$$
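A quick empirical check shows how the inequality behaves. The sketch below (an illustrative choice, not from the text) takes \(Y\) standard normal and \(a=2\): Chebyshev's bound is \(1/4\), while the actual tail probability \(P(|Y|\ge 2)\) is roughly 0.046, so the bound is valid but loose:

```python
import random

random.seed(3)

# Compare Chebyshev's bound Var(Y)/a^2 = 0.25 against the simulated
# tail frequency P(|Y| >= 2) for Y ~ N(0, 1).
a = 2.0
draws = [random.gauss(0, 1) for _ in range(100_000)]
freq = sum(abs(y) >= a for y in draws) / len(draws)
bound = 1.0 / a**2

print(round(freq, 3), bound)
```

The looseness is expected: Chebyshev's inequality assumes nothing about the distribution beyond a finite variance.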
Applying Chebyshev's inequality to the sample average estimator \(\overline{X}_n\), we have
$$P\left(\left|\overline{X}_n-\mu\right|>\varepsilon\right)\le\frac{\sigma^2}{n\varepsilon^2}$$
for any \(\varepsilon > 0\). As the sample size \(n\) grows to \(\infty\), the bound, and hence the probability, approaches 0.
Suppose \(X_1\), \(X_2\), …, \(X_n\) are independent random variables with expectation \(\mu<\infty\) and variance \(\sigma^2<\infty\). Then for any \(\varepsilon>0\),
$$\lim_{n\to\infty}P\left(\left|\overline{X}_n-\mu\right|>\varepsilon\right)=0,$$
where \(\overline{X}_n=\left.\sum_{i=1}^n X_i\right/n\). We say \(\overline{X}_n\) converges in probability to \(\mu\) and write
$$\overline{X}_n\overset{p}{\longrightarrow}\mu.$$
Roughly speaking, the sample mean converges to the population mean as the sample size increases. The effect is illustrated with \(\overline{X}_n\) based on simulated normal random variables in Figure 4.
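The same kind of convergence can be reproduced numerically. This sketch tracks running sample means of simulated normal draws (the values \(\mu=1\), \(\sigma=2\) are assumed for illustration):

```python
import random

random.seed(4)

# Running sample means of simulated N(mu, sigma^2) draws; the mean is
# recorded at a few checkpoint sample sizes to show convergence to mu.
mu, sigma = 1.0, 2.0
total = 0.0
checkpoints = {}
for n in range(1, 100_001):
    total += random.gauss(mu, sigma)
    if n in (10, 1_000, 100_000):
        checkpoints[n] = total / n

for n, xbar in checkpoints.items():
    print(n, round(xbar, 3))
```

As \(n\) grows, the running means cluster ever more tightly around \(\mu\), mirroring the behaviour in Figure 4.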
Suppose we are interested in
$$p=P\left(a<X\le b\right)$$
for some random variable \(X\). Assume that while you don't know the distribution of \(X\), you can obtain \(n\) random samples of \(X\): \(X_1\), \(X_2\), …, \(X_n\).
We have been working with such samples using simulations. Using the notion of probability as a long-term relative frequency, we counted the number of times the simulated samples satisfied the event of interest and divided by the simulation size.
In other words, we computed \(\overline{Y}_n=\left.\sum_{i=1}^n Y_i\right/n\), where
$$Y_i=\begin{cases} 1 & \text{when } a < X_i \le b \\ 0 & \text{otherwise.} \end{cases}$$
We can mathematically justify the simulation approach based on the law of large numbers. Applying the law of large numbers, we have
$$\overline{Y}_n\overset{p}{\longrightarrow}E\left[Y_1\right].$$
Using the definition of the expectation of a discrete random variable, we have
$$E\left[Y_1\right] = 1\cdot P\left(X_1\in\left(a,b\right]\right) + 0\cdot P\left(X_1\notin\left(a,b\right]\right) =P\left(X_1\in\left(a,b\right]\right)=p.$$
Therefore, \(\overline{Y}_n\) converges to \(p\) as we increase the simulation size.
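The whole argument can be checked end to end when the distribution of \(X\) is actually known. The following sketch (assuming the illustrative choice \(X\sim N(0,1)\) with \(a=0\), \(b=1\)) computes the indicator average \(\overline{Y}_n\) and compares it with the exact probability from the normal CDF:

```python
import math
import random

random.seed(5)

# Monte Carlo estimate of p = P(0 < X <= 1) for X ~ N(0, 1) via the
# indicator average Y-bar_n from the text.
a, b, n = 0.0, 1.0, 200_000
y_bar = sum(a < random.gauss(0, 1) <= b for _ in range(n)) / n

# Exact value Phi(b) - Phi(a) from the standard normal CDF, written
# in terms of the error function.
exact = 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

print(round(y_bar, 4), round(exact, 4))
```

The simulated \(\overline{Y}_n\) agrees with the exact value (about 0.3413) to a few decimal places, exactly as the law of large numbers promises.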