Lecture 6: Variable Transformation

STA237: Probability, Statistics, and Data Analysis I

Michael Jongho Moon

PhD Student, DoSS, University of Toronto

Wednesday, May 31, 2023

Example: My coffee shop


  • Recall Michael’s coffee shop sees \(D\) customers per day and \(D\sim\text{Pois}(4)\).
  • We computed \(E[D]=4\).
  • What is \(\text{Var}(D)\)?

\(\text{Var}(D), D\sim\text{Pois}(\lambda)\)

  • \(E[D]=\lambda\)
  • \(E\left[D^2\right] = \sum_{x=0}^\infty x^2 \frac{\lambda^x{\color{orange}{e^{-\lambda}}}}{x!} ={\color{orange}{e^{-\lambda}}}\sum_{x=\color{red}{1}}^\infty \color{forestgreen}{x^2} \frac{\color{DarkOrchid}{\lambda^x}}{\color{forestgreen}{x!}}\)
  • \(\phantom{E\left[D^2\right]} = \color{DarkOrchid}{\lambda}{\color{orange}{e^{-\lambda}}} \sum_{x=\color{red}{1}}^\infty \color{forestgreen}{x}\frac{\color{DarkOrchid}{\lambda^{x-1}}}{\color{forestgreen}{(x-1)!}}\)

\(\sum x = \sum (x-1) + \sum 1\)

  • \(\phantom{E\left[D^2\right]} = \color{DarkOrchid}{\lambda}{\color{orange}{e^{-\lambda}}} \left(\sum_{x=1}^\infty \color{forestgreen}{(x - 1)}\frac{\color{DarkOrchid}{\lambda^{x-1}}}{\color{forestgreen}{(x-1)!}} + \sum_{x=1}^\infty \frac{\color{DarkOrchid}{\lambda^{x-1}}}{\color{forestgreen}{(x-1)!}}\right)\)
  • \(\phantom{E\left[D^2\right]} = \color{DarkOrchid}{\lambda}{\color{orange}{e^{-\lambda}}} \left(\color{DarkOrchid}{\lambda}\sum_{\color{red}{x''=0}}^\infty \frac{\color{DarkOrchid}{\lambda^{x''}}}{\color{forestgreen}{x''!}} + \sum_{\color{red}{x'=0}}^\infty\frac{\color{DarkOrchid}{\lambda^{x'}}}{\color{forestgreen}{x'!}}\right)\)
  • \(\phantom{E\left[D^2\right]} = \lambda e^{-\lambda}\left(\lambda e^{\lambda} + e^{\lambda}\right) = \lambda^2 + \lambda\)
  • \(\text{Var}(D) = E\left[D^2\right] - E\left[D\right]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda\)
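The derivation gives \(\text{Var}(D)=\lambda\). As a quick sanity check (a Python sketch, not part of the course's R worksheet; `rpois` is a helper written here), we can sample from \(\text{Pois}(4)\) with Knuth's multiplication method and compare the sample mean and variance to \(\lambda=4\):

```python
import math
import random
from statistics import fmean, pvariance

random.seed(1)

def rpois(lam):
    """Draw one Poisson(lam) sample via Knuth's multiplication method."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

lam = 4
draws = [rpois(lam) for _ in range(200_000)]
print(fmean(draws), pvariance(draws))  # both close to lambda = 4
```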

Example: My coffee shop’s profit


Now suppose the coffee shop’s profit is $\(R\) per day, given by

\[R=g\left(D\right)=\begin{cases} 2D - 10 & 0 \le D < 10 \\ 4D - 30 & D \ge 10 \end{cases}\]

Michael is interested in the distribution of \(R\).

\(E[R]\)

  • \(E[R] = E[g(D)]= \sum_{x\ge0}g(x)p_D(x)\)

\(\sum_{u\in\mathbb{R}} u = \sum_{u<a} u + \sum_{u\ge a} u\)

  • \(\phantom{E[R]}=\sum_{x\in[0,10)}g(x)p_D(x) + \sum_{x\in[10,\infty)}g(x)p_D(x)\)
  • \(\phantom{E[R]}=\sum_{x=0}^9\left(2x-10\right)p_D(x) + \sum_{x\ge 10}\left(4x-30\right)p_D(x)\)

\(\sum_{u<a} u = \sum_{u<a} u +\sum_{u\ge a} u-\sum_{u\ge a} u = \sum_{u\in\mathbb{R}} u - \sum_{u \ge a} u\)

  • \(\phantom{E[R]}=\color{forestgreen}{\sum_{x\ge0} \left(4x-30\right) p_D(x)} + \sum_{x=0}^9\left[\left(2x-10\right)\color{forestgreen}{-\left(4x-30\right)}\right]p_D(x)\)
  • \(\phantom{E[R]}=\color{forestgreen}{4E[D]-30} + \sum_{x=0}^9\left(-2x+20\right)p_D(x)\)
  • \(\phantom{E[R]}\approx\color{forestgreen}{-14} + 12.008 = -1.992\)

\(\text{Var}(R)\)

  • \(\text{Var}(R)=E\left[R^2\right]-E\left[R\right]^2\)
  • \(E\left[R^2\right]=E\left[g(D)^2\right]\)
  • \(\phantom{E\left[R^2\right]}=\cdots=E\left[\left(4D-30\right)^2\right] - \sum_{x\in[0,10)}\left[\left(4x-30\right)^2 - \left(2x-10\right)^2\right]p_D(x)\)
  • \(\phantom{E\left[R^2\right]}=\cdots\approx 260 - 239.742\)
  • \(\phantom{E\left[R^2\right]}= 20.258\)
  • \(\text{Var}(R) \approx 20.258 - 1.992^2 = 16.290\)
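The approximate values above can be reproduced numerically. Here is a short Python sketch (an illustration, not the course's R worksheet) that truncates the Poisson sums far into the tail:

```python
import math

lam = 4  # D ~ Pois(4)

def p(x):
    """Poisson(lam) probability mass function."""
    return lam**x * math.exp(-lam) / math.factorial(x)

def g(x):
    """Daily profit as a function of the customer count."""
    return 2*x - 10 if x < 10 else 4*x - 30

# Truncate the infinite sums at x = 60; for lambda = 4
# the remaining tail mass is negligible.
xs = range(61)
ER = sum(g(x) * p(x) for x in xs)       # approx -1.992
ER2 = sum(g(x)**2 * p(x) for x in xs)   # approx 20.258
VarR = ER2 - ER**2                      # approx 16.29
```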

But, I made $\(20\) yesterday!

\[P(R\ge 20)=?\]

To compute the probability, we need the full distribution.

Variable transformation

How transforming a random variable changes its distribution.

Probabilities of equivalent events are the same


  • Recall that for \(X\sim N(\mu, \sigma^2)\) and \(Z\sim N(0,1)\), \[P(X\le x)=P\left(Z\le \frac{x-\mu}{\sigma}\right)\]

  • … because we know the events \(\{X\le x\}\) and \(\{Z \le (x-\mu)/\sigma\}\) are equivalent events. i.e., we only changed the units in this case.

  • When computing the distribution of a transformed random variable, we can start by considering the events. i.e., \(\{R\ge 20\}=\{D\ge 12.5\}\).

\[F_{Y}(y)=P(g(X)\le y),\quad Y = g(X)\]

Variable transformation of discrete random variables

Let \(X\) be a discrete random variable with probability mass function \(p_X\) and \(Y=g\left(X\right)\), where \(g:\mathbb{R}\to\mathbb{R}\) is a function.

Then, \(Y\) is also discrete and its probability mass function \(p_Y\) is defined by

\[p_Y(y)=\sum_{x\in g^{-1}\left\{y\right\}} p_X\left(x\right)\]

where \(g^{-1}\left\{y\right\}\) is the set of all values \(x\) that satisfy \(g\left(x\right)=y\).


Example: My coffee shop’s profit



  • \(P(R\ge 20)=P(D\ge 12.5)\)
  • \(\phantom{P(R\ge 20)}<0.0005\)
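This tail probability is easy to verify numerically; a quick Python sketch:

```python
import math

lam = 4  # D ~ Pois(4)

def pmf(x):
    """Poisson(lam) probability mass function."""
    return lam**x * math.exp(-lam) / math.factorial(x)

# {R >= 20} = {D >= 12.5} = {D >= 13}, since D only takes integer values
tail = 1 - sum(pmf(x) for x in range(13))  # P(D >= 13)
print(tail)  # below 0.0005
```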

Example: Rolling a die


Let \(X\) be the outcome of a fair six-sided die roll and \(Y=X^2-3X+2\).

Compute \(P(Y=0)\).

Note that \(Y=0\) when \(X\in\left\{1, 2\right\}\).

  • \(P(Y=0)=P(X=1) + P(X=2)\)
  • \(\phantom{P(Y=0)}=\frac{2}{6}=\frac{1}{3}\)
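The enumeration can be checked exactly in a few lines of Python (an illustration; the course itself uses R worksheets):

```python
from fractions import Fraction

# X is a fair six-sided die; count the outcomes with x^2 - 3x + 2 = 0
p_y0 = sum(Fraction(1, 6) for x in range(1, 7) if x**2 - 3*x + 2 == 0)
print(p_y0)  # 1/3, since only x = 1 and x = 2 satisfy the equation
```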

Example: Continuous to discrete

Let \(X\sim \text{U}(0,1)\) and \(Y = g\left(X\right)\), where

\[g(x) = \begin{cases}7 & x\le \frac{3}{4} \\ 5 & x > \frac{3}{4}.\end{cases}\]

\(Y\) is discrete with only 2 possible values whereas \(X\) is continuous.

We can compute the full distribution of \(Y\) by computing the probability masses associated with the two values.

  • \(p_Y(7) = P(X\le 3/4)=3/4\)
  • \(p_Y(5) = P(X> 3/4) = 1/4\)

\[p_Y(y) = \begin{cases} \frac{1}{4} & y = 5\\ \frac{3}{4} & y = 7 \\ 0 & \text{otherwise}\end{cases}\]
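We can confirm this pmf by simulation; a Python sketch using only the standard library:

```python
import random

random.seed(0)
n = 100_000

# Y = 7 when X <= 3/4 and Y = 5 otherwise, with X ~ U(0, 1)
ys = [7 if random.random() <= 0.75 else 5 for _ in range(n)]

print(ys.count(7) / n)  # close to 3/4
print(ys.count(5) / n)  # close to 1/4
```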

Example: Continuous to continuous

(Example 4.38 from Devore & Berk)

Let \(X\sim \text{Exp}\left(1/2\right)\) and \(Y=g\left(X\right)=60X\). Determine the distribution of \(Y\).

Both \(Y\) and \(X\) are continuous.

\(f_Y(g(x)) \neq f_X(x)\)

Equivalent events share the same probability, not the same density.

  • \(F_Y(y)=P(Y\le y)=P(60 X \le y)\)
  • \(\phantom{F_Y(y)}=P\left(X\le y / 60\right)\)
  • When \(y>0\),
  • \(F_Y(y)=\int_{-\infty}^{y/60}f_X(u)du\)
  • \(\phantom{F_Y(y)}=\int_0^{y/60}\frac{1}{2}e^{-u/2}du\)
  • \(\phantom{F_Y(y)}=1-e^{-y/120}\)

\[F_Y(y)=\begin{cases} 1-e^{-y/120} & y > 0 \\ 0 & y \le 0 \end{cases}\]

Example: Continuous to continuous

\[F_Y(y)=\begin{cases} 1-e^{-y/120} & y > 0 \\ 0 & y \le 0 \end{cases}\]

When \(F_Y\) is continuous and differentiable, we can differentiate \(F_Y\) to get \(f_Y\).

  • \(f_Y(y)=\begin{cases} \frac{1}{120}e^{-y/120} & y > 0 \\ 0 & y \le 0 \end{cases}\)

  • \(Y\sim \text{Exp}(1/120)\)

Multiplying \(X\) by \(60\) scales the unit of measurement by \(60\); the rate parameter is therefore reduced by a factor of \(60\).
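A simulation check of the result (a Python sketch; note `random.expovariate` takes the rate parameter):

```python
import math
import random
from statistics import fmean

random.seed(2)

# X ~ Exp(1/2), so Y = 60 X should be Exp(1/120)
ys = [60 * random.expovariate(1/2) for _ in range(200_000)]

print(fmean(ys))  # close to the Exp(1/120) mean, 120
emp = sum(y <= 120 for y in ys) / len(ys)
print(emp, 1 - math.exp(-120 / 120))  # empirical vs F_Y(120)
```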

It is NOT always possible to get a closed form for the transformed \(F_Y\).

Under certain conditions, we may derive \(f_Y\) directly from \(f_X\) based on the Fundamental Theorem of Calculus and the chain rule.

\[\begin{align*} f_Y(y) =& \frac{d}{dy}F_Y(y) \\ =& \left.\frac{d}{dy}F_X\left(x\right)\right|_{x=g^{-1}\left(y\right)} \\ =& \left.\frac{dx}{dy}\frac{d}{dx}F_X\left(x\right)\right|_{x=g^{-1}\left(y\right)} \\ =& \left.\frac{dx}{dy}f_X(x)\right|_{x=g^{-1}\left(y\right)} \end{align*}\]

Differentiable and invertible transformation of continuous random variables

The absolute value is needed for decreasing \(g\).

Where \(f_X(x)=0\), \(f_Y(y)=0\) and the property of \(g\) doesn’t matter.

Let \(X\) be a continuous random variable with probability density function \(f_X\) and \(Y=g\left(X\right)\), where \(g:\mathbb{R}\to\mathbb{R}\) is a function that is differentiable, and strictly increasing or strictly decreasing at places for which \(f_X(x)>0\).

Then, \(Y\) is also continuous, and its density function \(f_Y\) is defined by

\[f_Y(y)=\left|\frac{d}{dy}h\left(y\right)\right|\cdot f_X\left(h\left(y\right)\right),\]

where \(X=h(Y)\).

Example: Exercise 8.7 from Dekking et al.

Suppose \(X\) is a continuous random variable with pdf \(f\) for some \(\alpha>0\).

\[f(x)=\begin{cases} \frac{\alpha}{x^{\alpha+1}} & x\ge 1 \\ 0 & \text{otherwise}\end{cases}\]

What is the distribution of \(Y=\log\left(X\right)\)?

\(\log(x)\) is strictly increasing for \(x > 0\).

\(Y\ge 0\) when \(X\ge 1\).

Example: Exercise 8.7 from Dekking et al.

  • Let \(h(y)=e^y\). Then, \(h'(y)=e^y\).
  • When \(y\ge0\), \(f_Y(y)=\left|e^y\right|\cdot \frac{\alpha}{e^{y\left(\alpha+1\right)}} = \alpha e^{-\alpha y}\).

\[f_Y(y)=\begin{cases}\alpha e^{-\alpha y} & y\ge 0\\ 0 & \text{otherwise}\end{cases}\]

\[\implies Y \sim \text{Exp}\left(\alpha\right)\]
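As a check, we can sample \(X\) by inverse-CDF sampling (here \(F_X(x)=1-x^{-\alpha}\) for \(x\ge1\)) and confirm that \(\log(X)\) behaves like an \(\text{Exp}(\alpha)\) variable; a Python sketch with \(\alpha=2\) chosen arbitrarily:

```python
import math
import random
from statistics import fmean

random.seed(3)
alpha = 2  # any alpha > 0 works for the exercise

# F_X(x) = 1 - x^(-alpha) for x >= 1, so X = U^(-1/alpha) with U ~ U(0,1);
# using 1 - random.random() keeps U in (0, 1] and avoids dividing by zero
ys = [math.log((1.0 - random.random()) ** (-1 / alpha))
      for _ in range(200_000)]

print(fmean(ys))  # close to 1/alpha = 0.5, the mean of Exp(alpha)
```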

Jensen’s inequality

Jensen’s inequality is a useful tool when you want to compare the means of two related random variables without computing their distributions.


Recall …

\[E\left(rX+s\right)=r E\left(X\right) + s,\]

where \(r\) and \(s\) are constants.

When the transformation is NOT linear, we cannot, in general, obtain the expectation of the transformed variable by applying the transformation to \(E(X)\).

We may want to gauge the relative value of the transformed expectation, \(E\left[g\left(X\right)\right]\), compared to the original expectation \(E(X)\).

For a convex function, \(g\), you can gauge the value without computing the distributions or the exact expectation.

Convex function

A function \(g\) is called convex if for every \(a<b\), the line segment from \((a, g(a))\) to \((b, g(b))\) is on or above the graph of \(g\) on the interval \((a, b)\).

In other words, for \(a<b\) and \(\lambda \in (0,1)\), \[\lambda g(a) + (1-\lambda)g(b) \ge g\left(\lambda a + \left(1-\lambda\right)b\right).\]

When the line segment is strictly above the graph of \(g\), \(g\) is strictly convex on the interval \((a, b)\).

Jensen’s inequality

Let \(g\) be a convex function on interval \(I\), and let \(X\) be a random variable taking values from \(I\). Then Jensen’s inequality states that

\[g\left(E\left[X\right]\right) \le E\left[g\left(X\right)\right].\]

When \(g\) is strictly convex on interval \(I\) and \(X\) is a random variable taking values from \(I\), \(g\left(E\left[X\right]\right) < E\left[g\left(X\right)\right]\) unless \(\text{Var}\left(X\right)=0\).

Example: My coffee shop’s profit

Recall

\[R=g\left(D\right)=\begin{cases}2D - 10 & \text{when } 0\le D < 10 \\ 4D - 30 &\text{when } D\ge 10\end{cases}\]

\(g(x)\) is convex on \(x \ge 0\): the two linear pieces meet at \(x=10\), and the slope increases from \(2\) to \(4\).

By Jensen’s inequality, we can deduce

\[E[R]\ge g(E[D]).\]

  • \(E[R]\approx-1.992\)
  • \(g(E[D])=g(4)=-2\)

Thanks to the convexity of \(g\), I save almost a cent per day…

Example: Quick exercise 8.4 from Dekking et al.

Let \(X\) be a random variable with \(\text{Var}(X)>0\). Which of the following two quantities is larger?

\[E\left[e^{-X}\right]\quad\text{vs.}\quad e^{-E\left[X\right]}\]

To check the convexity of a twice-differentiable function, check whether its second derivative is positive.

  • \(\frac{d^2}{dx^2}e^{-x} = \frac{d}{dx} -e^{-x} = e^{-x}\)
  • \(\frac{d^2}{dx^2}e^{-x} > 0\) for all \(x\in\mathbb{R}\)
  • Thus, \(e^{-x}\) is strictly convex.

By Jensen’s inequality, \(E\left[e^{-X}\right] > e^{-E\left[X\right]}\).
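A concrete instance: take \(X\) equal to \(0\) or \(1\) with probability \(\tfrac12\) each, so \(\text{Var}(X)>0\) and \(E[X]=\tfrac12\); in Python:

```python
import math

# X = 0 or 1 with probability 1/2 each, so Var(X) > 0 and E[X] = 1/2
lhs = (math.exp(-0) + math.exp(-1)) / 2  # E[exp(-X)], about 0.684
rhs = math.exp(-0.5)                     # exp(-E[X]), about 0.607
print(lhs > rhs)  # True, as Jensen's inequality predicts
```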

Graphical summary of (simulated) data

Histogram

Histograms are used to visualize the distribution of univariate data.


Steps:

  1. Divide the range of the data into (equal-length) intervals, or bins; the length of each interval is called the bin width

  2. Set each bin’s height to

\[\frac{\text{the number of data points that fall in the interval}}{\text{the total number of data points}\times\text{bin width}}\]


Density histogram

The height of each bar reflects the relative number of data points that belong to each interval.

In a regular histogram, we often display the counts along the y-axis. When we instead display the relative proportions computed as above, we call the plot a density histogram.
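The height formula can be applied by hand; a small Python sketch with a made-up data set:

```python
# Density-histogram heights computed by hand for a tiny data set,
# using the formula: count / (total count * bin width).
data = [0.1, 0.4, 0.45, 0.7, 0.8, 0.95]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]  # equal-width bins, width 0.25

n, w = len(data), 0.25
heights = [
    sum(lo <= x < hi for x in data) / (n * w)
    for lo, hi in zip(edges, edges[1:])
]

# The bars integrate to 1, like a probability density
print(sum(h * w for h in heights))
```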

R worksheet

Install learnr and run R worksheet

  1. Click here to install learnr on r.datatools.utoronto.ca

  2. Follow this link to open the worksheet



If you see an error, try:

  1. Log in to r.datatools.utoronto.ca
  2. Find rlesson06 from Files pane
  3. Click Run Document

Other steps you may try:

  1. Remove any .Rmd and .R files in the home directory of r.datatools.utoronto.ca
  2. In RStudio,
    1. Click Tools > Global Options
    2. Uncheck “Restore most recently opened project at startup”
  3. Run install.packages("learnr") in RStudio after the steps above or click here

Summary

  • When you apply a function to a random variable, the transformed variable is also a random variable.
  • The distribution of the transformed random variable can be inferred by deriving the probabilities of events expressed with the transformed random variable.
  • Histograms provide a visual summary of the distribution of observed data points.

Practice questions

Chapter 8, Dekking et al.

  • Quick Exercises 8.1, 8.2, 8.3
  • Exercises 8.1, 8.3, 8.4, 8.5, 8.6, 8.8, 8.9, 8.11, 8.12, 8.13, 8.14

If you want further reference on histograms, you can read Section 15.1 and Section 15.2 from Dekking et al.

  • See a collection of corrections by the author here