Lecture 10: Law of Large Numbers

STA237: Probability, Statistics, and Data Analysis I

Michael Jongho Moon

PhD Student, DoSS, University of Toronto

June 15, 2022

Example: Canadian Student Tobacco, Alcohol and Drugs Survey

Canadian Student Tobacco, Alcohol and Drugs Survey (CSTADS)

From the Government of Canada website at https://www.canada.ca/en/health-canada/services/canadian-student-tobacco-alcohol-drugs-survey/2018-2019-summary.html

A total sample of 65,850 students in grades 7 to 12 (secondary I through V in Quebec) completed the survey … The weighted results represent over 2 million Canadian students…

In 2018-19, 3% of students in grade 7 to 12 were current cigarette smokers …

Estimating the prevalence of student smoking

For simplicity, assume

  • there are 2 million Canadian students eligible to participate in the survey; and
  • the survey randomly selects 50,000 with equal probabilities.

Note that we also treat each student's selection as an independent event; because the sample is only a small fraction of the population, selecting with equal probabilities makes this a reasonable approximation.

Suppose we estimate the total prevalence - or proportion - of student smoking in Canada, \(\theta\), using the proportion among the selected sample:

\[\frac{\text{# students who currently smoke in the sample}}{\text{total sample size}}\]

The population of interest is all Canadian students between grade 7 and 12 at the time of the survey.

The sample is those students who participated in the survey.

The parameter of interest is the prevalence of smoking among the population.

The estimator is the function that computes the proportion of smokers in the sample.

Population, sample, parameter, and estimator

A population is the entire group of interest; it can consist of people, things, events, etc.

A sample is a subgroup of a population used for estimation. In particular, a (simple) random sample consists of observations that are independent and identically distributed.

A parameter is a quantity of interest of a population.

An estimator is a function of a sample that provides an estimate of a parameter.

Estimating the prevalence of student smoking

Suppose we estimate the total prevalence - or proportion - of student smoking in Canada using the proportion among the selected sample:

\[\frac{\text{# students who currently smoke in the sample}}{\text{total sample size}}\]

[Diagram: the population and its parameter \(\theta=0.0608\); a random sample of 100 students gives the estimate \(T_{100}=0.06\).]

Note that the estimator is a random variable since the sampling process is random.

The distribution of the random variable is an example of a sampling distribution.

Sampling distribution of an estimator

Sampling distribution

Let \(T=h\left(X_1,X_2,\ldots,X_n\right)\) be an estimator based on a random sample \(X_1\), \(X_2\), \(X_3\), …, \(X_n\). The probability distribution of \(T\) is called the sampling distribution of \(T\).

Example: CSTADS

Indicator function

\[X_i=\begin{cases}1 & \text{when }i\text{th student is a smoker}\\0 & \text{otherwise.}\end{cases}\]

Estimator

\[T_n=\overline{X}_n=\frac{\sum_{i=1}^nX_i}{n}\]

  • \(T_n\)’s distribution is an example of a sampling distribution
  • Note that we call a realized instance of the estimator an estimate - e.g., 0.06 from the first sample
  • An estimator is a random variable; an estimate is a fixed number
  • How do we know if it’s a good estimator?

Estimator

\[\frac{\text{# smokers in the sample}}{\text{total sample size}}\]

Checking the estimator’s expectation …

Recall …

\[E\left(T_n\right)=E\left(\overline{X}_n\right)=E\left(X_1\right)\]

where \(\overline{X}_n=\left.\sum_{i=1}^nX_i\right/n\).

  • That is, the expected value of the estimator is the same as the parameter
  • This desired result holds true whenever we use the sample mean of a random sample to estimate a population mean

\(E\left(T_n\right)-E\left(X_1\right)\) is an example of a bias. It is the difference between the estimator's expected value and the parameter of interest.

When it’s 0, we say the estimator is an unbiased estimator.
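Below is a minimal simulation sketch of this property. The prevalence \(\theta=0.03\) and sample size \(n=500\) are illustrative assumptions, not the actual survey design; the average of many simulated estimates should land close to \(\theta\), so the estimated bias is near 0.

```r
# A minimal sketch (not the survey design): theta = 0.03 and n = 500 are
# illustrative values; each X_i is a Bernoulli(theta) smoking indicator
set.seed(237)
theta <- 0.03
n <- 500
estimates <- replicate(10000, mean(rbinom(n, size = 1, prob = theta)))
mean(estimates)          # approximately theta
mean(estimates) - theta  # simulated bias, approximately 0
```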

Checking the estimator’s variance …

Recall …

\[\text{Var}\left(T_n\right)=\text{Var}\left(\overline{X}_n\right)=\frac{\text{Var}\left(X_1\right)}{n}\]

where \(\overline{X}_n=\left.\sum_{i=1}^nX_i\right/n\).

  • We want the estimator’s variance to be small so that the sampling distribution is tightly concentrated around its expectation
  • We see that with a larger sample, the variance of the sample mean decreases
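A minimal simulation sketch of this effect, again assuming an illustrative prevalence of \(\theta=0.03\): the simulated variance of \(\overline{X}_n\) should track \(\text{Var}\left(X_1\right)/n=\theta\left(1-\theta\right)/n\) as \(n\) grows.

```r
# A minimal sketch: simulated variance of the sample mean versus
# Var(X_1)/n = theta * (1 - theta) / n (theta = 0.03 is an assumed value)
set.seed(237)
theta <- 0.03
for (n in c(100, 1000, 10000)) {
  xbar <- replicate(5000, mean(rbinom(n, size = 1, prob = theta)))
  cat("n =", n, "simulated:", var(xbar),
      "theoretical:", theta * (1 - theta) / n, "\n")
}
```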


Law of large numbers

Convergence in probability

Let \(Y_1\), \(Y_2\), \(Y_3\), … be an infinite sequence of random variables, and let \(W\) be another random variable. We say the sequence \(\left\{Y_n\right\}\) converges in probability to \(W\) if

\[\lim_{n\to\infty}P\left(\left|Y_n-W\right|\ge\varepsilon\right)=0\]

for all \(\varepsilon >0\), and we write

\[Y_n\overset{p}{\to}W.\]

  • The definition works when \(W\) is a constant - i.e., \(P\left(W=w\right)=1\) for some fixed value \(w\)

Example: 4.2.2 from Evans & Rosenthal

Suppose \(Z_n\sim\text{Exp}\left(n\right)\) and \(y=0\). Let \(\varepsilon\) be any positive value (\(\varepsilon>0\)).

  • \(P\left(\left|Z_n - y\right|\ge \varepsilon\right)\)
  • \(=P\left(Z_n\ge \varepsilon\right)\)
  • \(=\int_\varepsilon^\infty ne^{-nu}du= e^{-n\varepsilon}\)
  • \(e^{-n\varepsilon}\to0\) as \(n\to\infty\)

\[\implies Z_n\overset{p}{\to}y\]
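As a numerical check of this example, the tail probability \(P\left(Z_n\ge\varepsilon\right)\) can be computed with R's pexp() and compared with the closed form \(e^{-n\varepsilon}\); the value \(\varepsilon=0.01\) below is an arbitrary choice for illustration.

```r
# Exact tail probability P(Z_n >= eps) for Z_n ~ Exp(n), compared with
# the closed form exp(-n * eps); eps = 0.01 is an arbitrary choice
eps <- 0.01
n <- c(10, 100, 1000, 10000)
cbind(n,
      exact   = pexp(eps, rate = n, lower.tail = FALSE),
      formula = exp(-n * eps))  # both columns shrink toward 0
```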

Chebyshev’s inequality

Any random variable \(Y\) with \(E\left(Y\right)<\infty\) and any \(a>0\) satisfy

\[P\left(\left|Y-E\left(Y\right)\right|\ge a\right)\le \frac{\text{Var}\left(Y\right)}{a^2}.\]

  • We won’t prove the inequality in class
  • See Section 13.2 of Dekking et al. or Section 3.6 of Evans & Rosenthal if interested

Example: Quick exercise 13.2 from Dekking et al.

Calculate \(P\left(\left|Y-\mu\right|<k\sigma\right)\) for \(k=1,2,3\) when \(Y\sim\text{Exp}(1)\), \(\mu=E\left(Y\right)\), and \(\sigma^2=\text{Var}\left(Y\right)\).

Compare the computed values with the bounds from Chebyshev’s inequality.

\[P\left(\left|Y-\mu\right|<k\sigma\right)=1-P\left(\left|Y-\mu\right|\ge k\sigma\right)\ge1-\frac{\sigma^2}{k^2\sigma^2}=1-\frac{1}{k^2}\]

  • Chebyshev’s inequality provides a lower (upper) bound on the probability that a random variable falls within (at least) a certain distance of its expectation
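A minimal R sketch of the exercise above: for \(Y\sim\text{Exp}(1)\) we have \(\mu=\sigma=1\), so the exact probabilities can be computed with pexp() and compared with the Chebyshev lower bound \(1-1/k^2\).

```r
# Exact P(|Y - mu| < k * sigma) for Y ~ Exp(1) (mu = sigma = 1) versus
# the Chebyshev lower bound 1 - 1/k^2
k <- 1:3
mu <- 1
sigma <- 1
exact <- pexp(mu + k * sigma) - pexp(pmax(mu - k * sigma, 0))
cbind(k, exact, chebyshev = 1 - 1 / k^2)
```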

Example: Sample mean

Apply Chebyshev’s inequality to \(\overline{X}_n=\left.\sum_{i=1}^nX_i\right/n\) where \(X_1\), \(X_2\), …, \(X_n\) form a random sample from a population. Let \(\mu\) and \(\sigma^2\) be the population mean and variance.

For any \(\varepsilon>0\),

\[P\left(\left|\overline{X}_n-\mu\right|>\varepsilon\right)\le \frac{\sigma^2}{n\varepsilon^2}\]


What happens as \(n\to\infty\)?
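Before taking the limit, here is a minimal simulation sketch comparing \(P\left(\left|\overline{X}_n-\mu\right|>\varepsilon\right)\) with the bound \(\sigma^2/\left(n\varepsilon^2\right)\). The standard normal population (\(\mu=0\), \(\sigma^2=1\)) and \(\varepsilon=0.1\) are assumed values chosen for illustration.

```r
# Simulated P(|Xbar_n - mu| > eps) versus the Chebyshev bound
# sigma^2 / (n * eps^2), using N(0, 1) draws (mu = 0, sigma^2 = 1)
set.seed(237)
eps <- 0.1
for (n in c(25, 100, 400)) {
  xbar <- replicate(5000, mean(rnorm(n)))
  cat("n =", n, "simulated:", mean(abs(xbar) > eps),
      "bound:", 1 / (n * eps^2), "\n")
}
```

Both the simulated probability and the bound shrink as \(n\) grows, although the Chebyshev bound is typically much looser.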

(Weak) Law of large numbers

Suppose \(X_1\), \(X_2\), \(X_3\), … are independent random variables, each with expectation \(\mu\) and variance \(\sigma^2\). Then for any \(\varepsilon > 0\),

\[\lim_{n\to\infty}P\left(\left|\overline{X}_n-\mu\right|>\varepsilon\right)=0,\]

where \(\overline{X}_n=\left.\sum_{i=1}^n X_i\right/n\).

  • That is, \(\overline{X}_n\) converges in probability to \(\mu=E\left(X_1\right)\)
  • The proof shown in class requires a finite variance, but the law can be proved without that assumption
  • FYI, there is also the strong law of large numbers, which states \[P\left(\lim_{n\to\infty}\overline{X}_n=\mu\right)=1.\]

  • We will focus on the WLLN in this course

Example: Sample means from a normal distribution

Roughly speaking, the law states that a sample mean converges to the population mean as we increase the sample size.

For example, simulating

\[X_i\sim N(0,1)\]

for \(i=1,2,3,\dots,1000\) and computing \(\overline{X}_n\) for \(n=1,2,3,\dots,1000\) shows the running sample means settling near the population mean \(0\) (see the sketch below).
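A minimal R sketch of this simulation follows; the seed and plotting details are arbitrary choices.

```r
# Running sample means of 1,000 draws from N(0, 1); the path settles
# near the population mean 0 as n grows
set.seed(237)
x <- rnorm(1000)
xbar <- cumsum(x) / seq_along(x)  # Xbar_n for n = 1, ..., 1000
plot(xbar, type = "l", xlab = "n", ylab = "sample mean")
abline(h = 0, lty = 2)            # population mean
```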

Example: Sample means from a Cauchy distribution

The law does not hold when the population mean doesn’t exist or is not finite.

Cauchy is an example of a distribution without an expectation.

Simulating \(X_i\) from a Cauchy distribution for \(i=1,2,3,\dots,1000\) and computing \(\overline{X}_n\) for \(n=1,2,3,\dots,1000\) shows the running sample means failing to settle (see the sketch below).
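A minimal R sketch of the Cauchy simulation, mirroring the normal example above:

```r
# Running sample means of 1,000 Cauchy draws; with no expectation,
# the path keeps jumping and never settles
set.seed(237)
x <- rcauchy(1000)
xbar <- cumsum(x) / seq_along(x)  # Xbar_n for n = 1, ..., 1000
plot(xbar, type = "l", xlab = "n", ylab = "sample mean")
```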

We have already seen the law in action!

Estimating a probability using LLN

Suppose we are interested in

\[\theta=P\left(X\in \mathcal{K}\right),\]

where \(X\) is some random variable and \(\mathcal{K}\) is a subinterval of \(\mathbb{R}\).

Assume that while you don’t know the distribution of \(X\), you can obtain \(n\) random samples of \(X\) - \(X_1\), \(X_2\), …, \(X_n\).

  • We have been estimating probabilities with \(X_1\), \(X_2\), …, \(X_n\) using simulation - e.g., we did not have the full distribution for the probability of winning \(m\) consecutive blackjack rounds
  • Using the notion of the probability as a long-term frequency, we counted the number of times \(X_i\in\mathcal{K}\) and divided by \(n\)

This is equivalent to using \(T_n\) as the estimator for parameter \(\theta\) where …

\[T_n=\frac{\sum_{i=1}^n \mathcal{I}_{X_i\in\mathcal{K}}}{n}\] and \(\mathcal{I}_{X_i\in\mathcal{K}}=1\) when \(X_i\in\mathcal{K}\) and \(0\) otherwise.

With a large \(n\) (e.g., \(1\ 000\), \(100\ 000\), …), we are using the property that

\[T_n\overset{p}{\to}\theta\]

since \(\mathcal{I}_{X_i\in\mathcal{K}}\sim \text{Ber}\left(\theta\right)\).
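A minimal sketch of this Monte Carlo estimator, using an assumed example where \(X\sim N(0,1)\) and \(\mathcal{K}=(1,2)\) so that the estimate can be checked against the exact value from pnorm():

```r
# Estimating theta = P(X in (1, 2)) for X ~ N(0, 1) by the sample
# proportion of draws falling in the interval; this assumed example
# lets us check the estimate against the exact value from pnorm()
set.seed(237)
n <- 100000
x <- rnorm(n)
t_n <- mean(x > 1 & x < 2)  # T_n = (1/n) * sum of indicators
c(estimate = t_n, exact = pnorm(2) - pnorm(1))
```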

Practice questions

  • Exercises from Dekking et al. Chapter 13: 13.1 to 13.11
  • Simulate the probability in Exercise 13.2 b from Dekking et al. and compare it with the Chebyshev bound


Simulation in R worksheet