Inequalities

3 minute read

Published:

Introduction of probability and expectation inequalities.

Probability Inequalities

Theorem 1 (The Gaussian Tail Inequality). Let \(X\sim N(0.1)\). Then

\[\mathbb{P}(\mid X\mid >\epsilon) \leq \frac{2 e^{-\epsilon^{2} / 2}}{\epsilon}.\]

If \(X_{1}, \ldots, X_{n} \sim N(0,1)\) then

\[\mathbb{P}\left(\mid \bar{X}_{n}\mid >\epsilon\right) \leq \frac{2}{\sqrt{n} \epsilon} e^{-n \epsilon^{2} / 2} \stackrel{\text { large n}}{\leq} e^{-n \epsilon^{2} / 2}.\]

Proof. Density of \(X\) is \(\phi(x)=(2\pi)^{(-1/2)}e^{-x^2/2}\). For one side:

\[\begin{aligned} \mathbb{P}(X>\epsilon)&=\int_{\epsilon}^{\infty} \phi(s) d s=\int_{\epsilon}^{\infty} \frac{s}{s} \phi(s) d s \stackrel{s>\epsilon}{\leq} \frac{1}{\epsilon} \int_{\epsilon}^{\infty} s \phi(s) d s\\ &\stackrel{\phi^{\prime}(x)=-x\phi(x)}{=}\frac{1}{\epsilon}(-\phi^{\prime}(s))\arrowvert^{\infty}_{\epsilon}=\frac{\phi(\epsilon)}{\epsilon} \leq \frac{e^{-\epsilon^{2} / 2}}{\epsilon}. \end{aligned}\]

By symmetry,

\[\mathbb{P}(\mid X\mid >\epsilon) \leq \frac{2 e^{-\epsilon^{2} / 2}}{\epsilon}.\]

Let \(X_{1}, \ldots, X_{n} \sim N(0,1)\). Then \(\bar{X}_{n}=n^{-1} \sum_{i=1}^{n} X_{i} \sim N(0,1 / n)\). Thus, \(\bar{X}_{n} \stackrel{d}{=} n^{-1 / 2} Z\), where \(Z \sim N(0,1)\) and based on the inequality above:

\[\mathbb{P}\left(\mid \bar{X}_{n}\mid >\epsilon\right)=\mathbb{P}\left(n^{-1 / 2}\mid Z\mid >\epsilon\right)=\mathbb{P}(\mid Z\mid >\sqrt{n} \epsilon) \leq \frac{2}{\sqrt{n} \epsilon} e^{-n \epsilon^{2} / 2}.\]

Theorem 2 (Markov Inequality). Let \(X\) be a non-negative random variable and suppose that \(\mathbb{E}(X)\) exists. For any \(t>0\),

\[\label{eq:mk} \mathbb{P}(X>t) \leq \frac{\mathbb{E}(X)}{t}.\]

Proof. Since \(X>0\),

\[\begin{aligned} \mathbb{E}(X)&=\int_{0}^{\infty} x p(x) d x=\int_{0}^{t} x p(x) d x+\int_{t}^{\infty} x p(x) d x\\ &\geq\int_{t}^{\infty} x p(x) d x\stackrel{x\geq t}{\geq}t\int_{t}^{\infty}p(x) d x=t\mathbb{P}(X>t). \end{aligned}\]

Theorem 3 (Chebyshev’s inequality). Let \(\mu=\mathbb{E}(X)\) and \(\sigma^{2}=\operatorname{Var}(X)\). Then

\[\mathbb{P}(\mid X-\mu\mid \geq t) \leq \frac{\sigma^{2}}{t^{2}}\text{ and }\mathbb{P}(\mid Z\mid \geq k) \leq \frac{1}{k^{2}},\]

where \(Z=(X-\mu) / \sigma\). In particular, \(\mathbb{P}(\mid Z\mid>2) \leq 1 / 4 \text { and } \mathbb{P}(\mid Z\mid>3) \leq 1 / 9\).

Proof.

\[\mathbb{P}(\mid X-\mu\mid \geq t)=\mathbb{P}\left(\mid X-\mu\mid ^{2} \geq t^{2}\right) \stackrel{Markov}{\leq} \frac{\mathbb{E}(X-\mu)^{2}}{t^{2}}=\frac{\sigma^{2}}{t^{2}}.\]

Let \(t=k\sigma\),

\[\mathbb{P}(\mid Z\mid \geq k)=\mathbb{P}(\mid X-\mu\mid \geq k\sigma)\leq\frac{1}{k^2}.\]

Hoeffding’s Inequality

Lemma 1. Let \(X\) be a random variable. Then

\[\mathbb{P}(X>\epsilon) \leq \inf _{t \geq 0} e^{-t \epsilon} \mathbb{E}\left(e^{t X}\right).\]

Proof. For any \(t>0\),

\[\mathbb{P}(X>\epsilon)=\mathbb{P}(e^X>e^\epsilon)=\mathbb{P}(e^{tX}>e^{t\epsilon})\stackrel{Markov}{\leq}\frac{\mathbb{E}(e^{tX})}{e^{t\epsilon}}.\]

Lemma 2 (Chernoff’s method). Suppose that \(a\leq X\leq b\). Then

\[\mathbb{E}\left(e^{t X}\right) \leq e^{t \mu} e^{\frac{t^{2}(b-a)^{2}}{8}},\]

where \(\mu=\mathbb{E}[X]\).

Proof. Assume \(\mu=0\) for the simplicity. Since \(a\leq X\leq b\), \(X\) can be written as a convex combination of \(a\) and \(b\), \(X=\alpha b+(1-\alpha)a\), where \(\alpha=(X-a)/(b-a)\). Because \(e^x\) is also a convex function,

\[e^{tX}\stackrel{convex}{\leq}\alpha e^{tb}+(1-\alpha)e^{ta}=\frac{X-a}{b-a} e^{t b}+\frac{b-X}{b-a} e^{t a}.\]

Take the expection of both sides and remember \(\mathbb{E}(X)=\mu=0\),

\[\mathbb{E}(e^{tX})\leq -\frac{a}{b-a} e^{t b}+\frac{b}{b-a} e^{t a}=e^{g(u)},\]

where \(u=t(b-a)\), \(g(u)=-\gamma u+log(1-\gamma+\gamma e^u)\) and \(\gamma=-a/(b-a)\).

Why do we need such a \(g\) function? That because it can provide a upper bound of its value. Basic properties of \(g(u)\) are \(g(0)=0\); the first order derivative is

\[g\prime(u)=-\gamma+\frac{\gamma e^u}{1-\gamma+\gamma e^u}=-\gamma+1-\frac{1-\gamma}{1-\gamma+\gamma e^u},\]

and \(g\prime(0)=0\); and for the second derivative

\[g\prime\prime(u)=\frac{\gamma(1-\gamma)e^u}{(1-\gamma+\gamma e^u)^2},\]

which has a bound \(g\prime\prime(u)\leq1/4\) for all \(u>0\). By Taylor’s theorem to the second order, there is a \(\xi \in(0, u)\) such that

\[g(u)=g(0)+ug\prime(0)+\frac{u^2}{2}g\prime\prime(\xi)\leq\frac{u^2}{8}=\frac{t^2(b-a)^2}{8}.\]

Therefore,

\[\mathbb{E}(e^{tX})\leq e^{g(u)}\leq e^{\frac{t^2(b-a)^2}{8}}.\]

Theorem 4 (Hoeffding’s Inequality). Let \(Y_1,\dots,Y_n\) be iid observations such that \(\mathbb{E}(Y_i)=\mu\) and \(a\leq Y_i\leq b\). Then, for any \(\epsilon>0\),

\[\mathbb{P}(\mid \bar{Y}_n-\mu\mid\geq\epsilon)\leq2e^{-2n\epsilon^2/(b-a)^2}.\]

Proof. For one side \(\mathbb{P}(\bar{Y}_n\geq\epsilon)\), for any \(t>0\), we have

\[\begin{aligned} \mathbb{P}(\bar{Y}_n\geq\epsilon)&=\mathbb{P}(\sum_{i=1}^nY_i\geq n\epsilon)=\mathbb{P}(e^{\sum_{i=1}^nY_i}\geq e^{n\epsilon})\\ &=\mathbb{P}(e^{t\sum_{i=1}^nY_i}\geq e^{tn\epsilon})\stackrel{Markov}{\leq}\frac{\mathbb{E}(e^{t\sum_{i=1}^nY_i})}{e^{tn\epsilon}}\\ &=e^{-tn\epsilon}\prod_{i=1}^n\mathbb{E}(e^{tY_i})=e^{-tn\epsilon}(\mathbb{E}(e^{tY_i})^n. \end{aligned}\]

From Lemma 2, \(\mathbb{E}(e^{tY_i})\leq e^{t^2(b-a)^2/8}\). So

\[\mathbb{P}(\bar{Y}_n\geq\epsilon)\leq e^{-tn\epsilon+nt^2(b-a)^2/8},\]

which can achieve a minimum by setting \(t=4\epsilon/(b-a)^2\):

\[\mathbb{P}(\bar{Y}_n\geq\epsilon)\leq e^{-2n\epsilon^2/(b-a)^2}.\]

By symmetry, it yields the Hoeffding’s inequality.

Example. Let \(X_1,\dots,x_n\sim\)Bernoulli\((p)\). From Hoeffding’s inequality,

\[\mathbb{P}(\bar{X}_n-p)\leq2e^{-2n\epsilon^2}.\]

Expectation Inequalities

Theorem 4 (Cauchy-Schwartz inequality). If \(X\) and \(Y\) have finite variances, then

\[\mid\mathbb{E}(XY)\mid\leq\mathbb{E}\mid XY\mid\leq\sqrt{\mathbb{E}X^2\mathbb{E}Y^2}.\]

It can be written as:

\[\text{Cov}^2(X,Y)\leq\sigma^2_X\sigma^2_Y.\]

Theorem 5 (Jensen’s inequality). If \(g\) is convex, then

\[\mathbb{E}(g(X))\geq g(\mathbb{E}(X));\]

If concave, then

\[\mathbb{E}(g(X))\leq g(\mathbb{E}(X)).\]

Example. Let \(g(x)=x^2\), and it is convex. Then

\[\mathbb{E}(X^2)\geq(\mathbb{E}(X))^2.\]

Example. KL divergence is defined as

\[D_{KL}(p, q)=\int p(x) \log \left(\frac{p(x)}{q(x)}\right) d x.\]

If \(p=q\), \(D_{KL}(p,p)=0\). Else,

\[\begin{aligned} -D(p, q)&=\mathbb{E} \log \left(\frac{q(X)}{p(X)}\right) \stackrel{Jensen}{\leq} \log \mathbb{E}\left(\frac{q(X)}{p(X)}\right)\\ &=\log\int p(x)\frac{q(x)}{p(x)}dx=0. \end{aligned}\]