STA237: Probability, Statistics, and Data Analysis I
PhD Student, DoSS, University of Toronto
Wednesday, June 14, 2023
From the Government of Canada page, Summary of results for the Canadian Student Tobacco, Alcohol and Drugs Survey 2021-22
A total sample of \(61\ 096\) students in grades 7 to 12 (secondary I through V in Quebec) completed the survey … The weighted results represent over 2 million Canadian students…
[In 2021-22], 2% of students in grades 7 to 12 (\(43\ 000\)) reported current cigarette smoking …
For simplicity, assume that
\[\theta = \frac{\text{Number of smokers}}{\text{Number of all Canadian students}}\]
While \(\theta\) is unknown, we use the survey to estimate the quantity based on \(n=50\ 000\) survey responses.
\[T_{50\ 000} = \frac{\text{Number of participants who smoke}}{\text{Number of all participants}}\]
In studying data, we call the collection of objects being studied the population of interest and the quantity of interest the parameter.
The subset of the objects collected in the data is the sample, and an estimator is a rule based on the sample that estimates a parameter. The resulting value is an estimate of the parameter.
The estimator \(T_{50\ 000}\) is a random variable since the sampling process is random.
Thus, it has a distribution.
An estimator is a statistic but a statistic isn’t necessarily an estimator.
A statistic is a function of a sample, as well as the value computed from that function.
Let \(T=h\left(X_1,X_2,\ldots,X_n\right)\) be a statistic based on a random sample \(X_1\), \(X_2\), …, \(X_n\). The probability distribution of \(T\) is called the sampling distribution of \(T\).
\[T_{50\ 000} = \frac{\text{Number of participants who smoke}}{\text{Number of all participants}}\]
Let \(n=50\ 000\) and
\[T_{n}=\frac{1}{n}\sum_{i=1}^n X_i\]
where
\[X_i=\begin{cases} 1 & i\text{th survey participant} \\ & \quad\text{is a smoker}\\ 0 & \text{otherwise}\end{cases}\]
With a very large \(n\), let’s assume that each survey participant is selected independently and that the probability of selecting a smoker remains identical across selections.
The probability that a randomly selected student is a smoker is the proportion of smokers in the student population, \(\theta\).
\[T_{n}=\frac{1}{n}\sum_{i=1}^n X_i\]
where
\[X_i\sim\text{Ber}(\theta)\] for all \(i\in\{1,2,\ldots,n\}\) independently.
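As a quick illustration, not based on the actual survey data, we can mimic this setup in R under an assumed true proportion of \(\theta=0.02\) (the hypothetical value and seed are choices made for this sketch):

```r
set.seed(237)                            # arbitrary seed for reproducibility
theta <- 0.02                            # hypothetical true proportion of smokers
n <- 50000
x <- rbinom(n, size = 1, prob = theta)   # X_i ~ Ber(theta), i = 1, ..., n
mean(x)                                  # T_n, an estimate of theta
```

The estimate should land close to 0.02, although it varies from run to run.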
\(T_n\) is an example of an unbiased estimator of \(\theta\).
In general, the sample mean of a random sample is an unbiased estimator of the population mean.
Let \(T\) be an estimator of \(\theta\). The bias of \(T\) is
\[E\left[T\right]-\theta\]
and we say \(T\) is unbiased when the bias is \(0\).
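To see the definition in action, here is a small simulation sketch that approximates \(E\left[T_n\right]-\theta\) by averaging many replications of \(T_n\) (again assuming a hypothetical \(\theta=0.02\), with a smaller \(n\) for speed):

```r
set.seed(237)
theta <- 0.02
n <- 1000
# 5,000 independent copies of T_n, each based on n Bernoulli samples
t_n <- replicate(5000, mean(rbinom(n, size = 1, prob = theta)))
mean(t_n) - theta   # approximate bias; close to 0
```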
How about \(\text{Var}\left(T_n\right)\)?
A larger sample size leads to less variability in the estimator.
Indeed, \(E\left[T_n\right]=\theta\) and \(\text{Var}\left(T_n\right)=\frac{\theta(1-\theta)}{n}\), which shrinks as \(n\) grows.
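Both facts follow from linearity of expectation and independence of the \(X_i\):

\[E\left[T_n\right]=\frac{1}{n}\sum_{i=1}^n E\left[X_i\right]=\frac{1}{n}\cdot n\theta=\theta\]

\[\text{Var}\left(T_n\right)=\frac{1}{n^2}\sum_{i=1}^n \text{Var}\left(X_i\right)=\frac{1}{n^2}\cdot n\theta\left(1-\theta\right)=\frac{\theta\left(1-\theta\right)}{n}\]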
The definition below also applies when \(W\) is a constant, i.e., \(P\left(W=w\right)=1\) for some constant \(w\).
Let \(Y_1\), \(Y_2\), \(Y_3\), … be an infinite sequence of random variables, and let \(W\) be another random variable. We say the sequence \(\left\{Y_n\right\}\) converges in probability to \(W\) if
\[\lim_{n\to\infty}P\left(\left|Y_n-W\right|\ge\varepsilon\right)=0\]
for all \(\varepsilon >0\), and we write
\[Y_n\overset{p}{\to}W.\]
Suppose \(Z_n\sim\text{Exp}(n)\) for \(n=1,2,\ldots\), and \(y=0\). Let \(\varepsilon\) be any positive number.
\[P\left(\left|Z_n-y\right|\ge\varepsilon\right)=P\left(Z_n\ge\varepsilon\right)=e^{-n\varepsilon}\to0\quad\text{as}\quad n\to\infty\]
\[\implies Z_n\overset{p}{\to}y\]
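A quick numerical check in R, with an assumed \(\varepsilon=0.1\), shows the probability vanishing as \(n\) grows:

```r
eps <- 0.1
n <- c(1, 10, 100, 1000)
pexp(eps, rate = n, lower.tail = FALSE)   # P(Z_n >= eps) = exp(-n * eps)
```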
Let \(U\sim U(0,1)\). Define \(X_n\) by
\[X_n=\begin{cases}1 & U\le \frac{1}{2}-\frac{1}{n} \\ 0 & \text{otherwise}\end{cases}\]
and \(Y\) by
\[Y=\begin{cases}1 & U\le \frac{1}{2} \\ 0 & \text{otherwise}\end{cases}\]
For any \(\varepsilon>0\), the event \(\left\{\left|X_n-Y\right|\ge \varepsilon\right\}\) can occur only when \(X_n\neq Y\), that is, only when \(\frac{1}{2}-\frac{1}{n}< U\le \frac{1}{2}\). Therefore,
\[P\left(\left|X_n-Y\right|\ge\varepsilon\right)\le P\left(X_n\neq Y\right)=P\left(\tfrac{1}{2}-\tfrac{1}{n}< U\le \tfrac{1}{2}\right)\le\frac{1}{n}.\]
\[ \lim_{n\to\infty} P\left(\left\lvert X_n-Y\right\rvert \ge \varepsilon \right) = 0\]
\[\implies X_n\overset{p}{\to}Y\]
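As a sanity check, here is a simulation sketch (not part of the original example) that estimates \(P\left(X_n\neq Y\right)\) for a few values of \(n\); the estimates should be roughly \(1/n\):

```r
set.seed(237)
u <- runif(1e5)                          # U ~ U(0, 1)
sapply(c(10, 100, 1000), function(n) {
  x_n <- as.numeric(u <= 1/2 - 1/n)      # X_n
  y   <- as.numeric(u <= 1/2)            # Y
  mean(x_n != y)                         # estimate of P(X_n != Y)
})
```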
Any random variable \(Y\) with \(E\left(Y\right)<\infty\) and any \(a>0\) satisfy
\[P\left(\left|Y-E\left(Y\right)\right|\ge a\right)\le \frac{\text{Var}\left(Y\right)}{a^2}.\]
Consider a discrete random variable \(Y\) with \(E(Y)=\mu<\infty\) and positive probability masses at \(y_i\) for \(i=1,2,\ldots\).
Since \(\left(y_i-\mu\right)^2\ge0\) and \(P\left(Y=y_i\right)\ge0\) for all \(i\),
\[\text{Var}\left(Y\right)=\sum_{i}\left(y_i-\mu\right)^2P\left(Y=y_i\right)\ge\sum_{i:\left|y_i-\mu\right|\ge a}\left(y_i-\mu\right)^2P\left(Y=y_i\right)\ge a^2\sum_{i:\left|y_i-\mu\right|\ge a}P\left(Y=y_i\right)=a^2P\left(\left|Y-\mu\right|\ge a\right).\]
Dividing both sides by \(a^2\) proves Chebyshev’s inequality for discrete random variables.
If interested, see the proof for continuous random variables in Section 13.2 of Dekking et al.
You won’t be tested on the proof itself, but on understanding its implications and using the inequality.
Calculate \(P\left(\left|Y-\mu\right|<k\sigma\right)\) for \(k=1,2,3\) when \[Y\sim\text{Exp}(1),\] \[\mu=E\left(Y\right),\] and \[\sigma^2=\text{Var}\left(Y\right).\]
Compare the computed values with the bounds from Chebyshev’s inequality.
Exact probability vs. Chebyshev’s inequality bound:
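One way to carry out the comparison in R, computing the exact probabilities with pexp() and the Chebyshev lower bounds \(1-1/k^2\):

```r
k <- 1:3
# Y ~ Exp(1) has mu = 1 and sigma = 1, so P(|Y - mu| < k * sigma) = P(1 - k < Y < 1 + k)
exact <- pexp(1 + k) - pexp(pmax(1 - k, 0))
chebyshev <- 1 - 1 / k^2     # Chebyshev: P(|Y - mu| < k * sigma) >= 1 - 1/k^2
round(rbind(exact, chebyshev), 3)
```

The exact probabilities (about 0.865, 0.950, 0.982) are well above the bounds (0, 0.75, 0.889), as the inequality only guarantees a conservative lower bound.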
Apply Chebyshev’s inequality to \(\overline{X}_n=\left.\sum_{i=1}^n X_i\right/n\), where \(X_1\), \(X_2\), …, \(X_n\) is a random sample from a population. Let \(\mu\) and \(\sigma^2\) be the population mean and variance.
We will assume random samples from a population are independent and identically distributed.
Since \(E\left[\overline{X}_n\right]=\mu\) and \(\text{Var}\left(\overline{X}_n\right)=\sigma^2/n\), for any \(\varepsilon > 0\),
\[P\left(\left|\overline{X}_n-\mu\right|>\varepsilon\right) \le \frac{\text{Var}\left(\overline{X}_n\right)}{\varepsilon^2} = \frac{\sigma^2}{n \varepsilon^2}\]
What happens as we take \(n\) to infinity?
Suppose \(X_1\), \(X_2\), …, \(X_n\) are independent random variables with expectation \(\mu\) and variance \(\sigma^2\). Then for any \(\varepsilon > 0\),
\[\lim_{n\to\infty}P\left(\left|\overline{X}_n-\mu\right|>\varepsilon\right)=0,\]
where \(\overline{X}_n=\left.\sum_{i=1}^n X_i\right/n\).
That is, \(\overline{X}_n\) converges in probability to \(\mu\). This is the weak law of large numbers (WLLN).
The proof shown in class requires a finite variance, but the law can be proved without this assumption.
There is also the strong law of large numbers, which states \[P\left(\lim_{n\to\infty}\overline{X}_n=\mu\right)=1,\] but we will focus on the WLLN in this course.
Roughly speaking, the law states that a sample mean converges to the population mean as we increase the sample size.
For example, we can observe \(\overline{X}_n\) converging quickly to \(0\) when we simulate
\[X_i\sim N(0,1)\]
for \(i=1,2,3,\dots,100\).
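A minimal simulation sketch of this, using rnorm() and running sample means:

```r
set.seed(237)
x <- rnorm(100)                       # X_i ~ N(0, 1), i = 1, ..., 100
xbar <- cumsum(x) / seq_along(x)      # running sample means for n = 1, ..., 100
plot(xbar, type = "l", xlab = "n", ylab = "Running sample mean")
abline(h = 0, lty = 2)                # the population mean
```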
The sample mean does not converge when the population mean doesn’t exist or is not finite.
Cauchy is an example of a distribution without an expectation.
Simulating \(\overline{X}_n\) for a Cauchy distribution does not show convergence even at \(n=1\ 000\).
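Repeating the same sketch with rcauchy() illustrates the lack of convergence:

```r
set.seed(237)
x <- rcauchy(1000)                    # Cauchy samples; the expectation does not exist
xbar <- cumsum(x) / seq_along(x)      # running sample means keep jumping around
plot(xbar, type = "l", xlab = "n", ylab = "Running sample mean")
```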
Computer programs can mimic random samples, e.g., rnorm() in R.
We have been simulating random samples with R to estimate expectations and probabilities.
Suppose we are interested in
\[\theta=P\left(X\in \mathcal{K}\right),\] where \(X\) is some random variable and \(\mathcal{K}\) is a subinterval of \(\mathbb{R}\).
Assume that while you don’t know the distribution of \(X\), you can obtain \(n\) random samples of \(X\) - \(X_1\), \(X_2\), …, \(X_n\).
For example, we haven’t computed the full distribution of winning a blackjack round but we can simulate it.
Let \(T_n=\left.\sum_{i=1}^n \mathcal{I}_{X_i\in\mathcal{K}}\right/n\) where
\[\mathcal{I}_{X_i\in\mathcal{K}}=\begin{cases} 1 & X_i\in\mathcal{K} \\ 0 & \text{otherwise}\end{cases}\]
This is equivalent to counting the number of times \(X_i\in\mathcal{K}\) occurs and dividing by \(n\).
Based on the notion of probability as a long-term relative frequency, \(T_n\) is an estimator of \(\theta\).
\[\implies \mathcal{I}_{X_i\in\mathcal{K}}\sim\text{Ber}\left(\theta\right)\] and
\[E\left[T_n\right]=\theta\]
\[\implies T_n\overset{p}{\to}\theta\quad\text{by the WLLN}\]
With a large \(n\), we can expect the unbiased estimator to provide an estimate close to the parameter of interest.
\[\theta=P\left(X\in \mathcal{K}\right)\]
\[T_n=\left.\sum_{i=1}^n \mathcal{I}_{X_i\in\mathcal{K}}\right/n\]
where
\[\mathcal{I}_{X_i\in\mathcal{K}}=\begin{cases} 1 & X_i\in\mathcal{K} \\ 0 & \text{otherwise}\end{cases}\]
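As a concrete sketch, with an arbitrary choice of \(X\sim N(0,1)\) and \(\mathcal{K}=(1,2)\) so the answer can be checked against pnorm():

```r
set.seed(237)
n <- 50000
x <- rnorm(n)                                   # pretend we can only draw samples of X
t_n <- mean(x > 1 & x < 2)                      # proportion of samples falling in K = (1, 2)
c(estimate = t_n, exact = pnorm(2) - pnorm(1))  # compare with the exact probability
```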
learnr: install and run the R worksheet
Click here to install learnr on r.datatools.utoronto.ca
Follow this link to open the worksheet
If you see an error, try:
rlesson09 from Files pane
Other steps you may try:
.Rmd and .R files on the home directory of r.datatools.utoronto.ca
Tools > Global Options
install.packages("learnr") in RStudio after the steps above, or click here
Chapter 13, Dekking et al.
Read the section “Recovering the probability density function” on page 189
Quick Exercises 13.1, 13.3
Exercises except 13.2 to 13.11
See a collection of corrections by the author here
© 2023. Michael J. Moon. University of Toronto.
Sharing, posting, selling, or using this material outside of your personal use in this course is NOT permitted under any circumstances.