Machine Learning - 01 Introduction
The notes are based on the session. Many thanks to the great work.
- 0. Motivation
- 1. Frequentist vs Bayesian
- 2.Maximum Likelihood Estimation on Gaussian Distribution
- 3. Gaussian Distribution
- 4. Calculations on Gaussian
- 5. Conclusion
0. Motivation
After setting up this blog, I actually had no idea about what should I post. I am definitely not good at writing, but still I enjoy it a lot. As I am weak in machine learning and mathematics, for writing and learning, a ‘translation’ work of the session may be a good start. So here we are!
1. Frequentist vs Bayesian
Many machine learning methods can be illustrated in a probabilistic way. We now introduce two mainstream views of interpreting probability: frequentist and bayesian.
Firstly let us consider a simple example. Suppose we have a data \(X=(x_1,x_2,\ldots,x_N)^T\) where \(x_i\) are i.i.d. samples of the random variable \(x\) and is of \(d\) dimensions. Also, we define \(\theta\) be the parameter so that \(x\sim P(x\|\theta)\), and we call \(P(X\|\theta)\) the likelihood.
Frequentists: \(\theta\) is an unknown constant. What they care is how to use the data to infer the value of \(\theta\). The best known approach for that is maximum likelihood estimation (MLE):
\[\hat{\theta}_\text{MLE}=\arg\max_{\theta}P(X|\theta).\]Many traditional machine learning methods are based on this view. They construct a probabilistic model, then determine a loss function and minimize it by methods like gradient descent.
Bayesians: \(\theta\) is random variable, following distribution \(P(\theta)\), which is called prior distribution, and \(P(\theta\|X)\) is called posterior distribution. Then by Bayes’s theorem, we have maximum a posteriori (MAP):
\[\hat{\theta}_\text{MAP}=\arg\max_{\theta}P(\theta|X).\]In Bayesian inference, the posterior distribution, rather than the maximum, is the key. However, to determine the distribution sometimes is really hard as it may incur complex calculus:
\[P(\theta|X)=\frac{P(X|\theta)\cdot P(\theta)}{\int_{\theta}P(X|\theta)\cdot P(\theta)\text{d}\theta}\]Many powerful and elegant approaches were proposed for this. For examples, MCMC, probabilistic graphical model, etc.
I’d like to cite this to summary the differences.
“In short, according to the frequentist definition of probability, only repeatable random events (like the result of flipping a coin) have probabilities. These probabilities are equal to the long-term frequency of occurrence of the events in question. Frequentists don’t attach probabilities to hypotheses or to any fixed but unknown values in general.”
“In contrast, Bayesians view probabilities as a more general concept. As a Bayesian, you can use probabilities to represent the uncertainty in any event or hypothesis. Here, it’s perfectly acceptable to assign probabilities to non-repeatable events.”
2.Maximum Likelihood Estimation on Gaussian Distribution
In this section, we will give an example of MLE on Gaussian distribution. Just like what we defined in section 1, there is a data \(X=(x_1,x_2,\ldots,x_N)^T\) where \(x_i\) are i.i.d. samples of the random variable \(x\) and is of \(d=1\) dimension. Thus we have \(X\in\mathbb{R}^{N\times 1}\). We further attach a specific distribution to \(x\) as \(x\sim\mathcal{N}(\mu,\sigma^2)\). The unknown parameter then becomes \(\theta=(\mu,\sigma)\). Now we use MLE method to estimate the parameter:
\[\hat{\theta}_\text{MLE}=\arg\max_\theta P(X|\theta)=\arg\max_\theta \log P(X|\theta).\]Given \(x\sim\mathcal{N}(\mu,\sigma^2)\), the PDF of \(x\) is
\[p(x|\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.\]
As \(x_i,i=1,\ldots,N\) are i.i.d. variables, it follows that
\[P(X|\theta)=\prod_{i=1}^NP(x_i|\theta).\]Hence,
\[\begin{aligned}\arg\max_\theta\log P(X|\theta)&=\arg\max_\theta\log \prod_{i=1}^N P(x_i|\theta)\\&=\arg\max_\theta\sum_{i=1}^N\log P(x_i|\theta)\\&=\arg\max_\theta\sum_{i=1}^N\log\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\\&=\arg\max_\theta\sum_{i=1}^N\left[\log \frac{1}{\sqrt{2\pi}}+\log\frac{1}{\sigma}-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\&=\arg\max_\theta\sum_{i=1}^N\left[\log\frac{1}{\sigma}-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\end{aligned},\]which is equivalent to a numerical optimization problem. Denote the term \(\sum_{i=1}^N\left[\log\frac{1}{\sigma}-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\) as \(\mathcal{L}\). The maximum occurs when
\[\frac{\partial\mathcal{L}}{\partial\mu}=0,\quad \frac{\partial\mathcal{L}}{\partial\sigma}=0.\]Solving the equations we have
\[\mu_\text{MLE}=\frac{1}{N}\sum_{i=1}^N x_i,\quad \sigma^2_\text{MLE}=\frac{1}{N}\sum_{i=1}^N(x_i-\mu_\text{MLE})^2.\]\(\mu_\text{MLE}\) is an unbiased estimation as \(\mathbb{E}[\mu_\text{MLE}]=\mu\), while \(\sigma_\text{MLE}^2\) is a biased estimation since \(\mathbb{E}[\sigma_\text{MLE}^2]=\frac{N-1}{N}\sigma^2\). The bias is incurred by the exploitation on samples rather than the true distribution. Intuitively, the variance of the samples will never be larger than the true distribution since they are totally generated from that!
3. Gaussian Distribution
Many advanced machine learning techniques, like linear Gaussian model, Kalman filter, P-PCA, etc, involve Gaussian distribution. Therefore it is quite practical to get familiar with Gaussian distribution. In this section, we are trying to explain why the PDF of bivariate Gaussian distribution is shaped by numerous ellipses.
Suppose we have a random variable \(x\sim \mathcal{N}(\mu,\Sigma)\) of \(d\) dimensions. Specifically, the definition follows that
\[P(x|\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)},\]and
\[x=\begin{pmatrix}x_{1}\\x_{2}\\\vdots\\x_{d}\end{pmatrix}_{d\times1},\quad \mu=\begin{pmatrix}\mu_1\\\mu_2\\\vdots\\\mu_{d}\end{pmatrix}_{d\times1},\quad\Sigma=\begin{pmatrix}\sigma_{11}&\sigma_{12}&\cdots&\sigma_{1d}\\\sigma_{21}&\sigma_{22}&\cdots&\sigma_{2d}\\\vdots&\vdots&\ddots&\vdots\\\sigma_{d1}&\sigma_{d2}&\cdots&\sigma_{dd}\end{pmatrix}_{d\times d}.\]As it involves matrix, a simplification will be a wise choice. For simplicity, we use \(\Delta(x)\) to represent \((x-\mu)^T\Sigma^{-1}(x-\mu)\) and suppose the covariance matrix \(\Sigma\) to be positive definite, which is reasonable. By the eigenvalue decomposition and the property of positive definite matrix, \(\Sigma\) can be decomposed as \(\Sigma=U\Lambda U^T\), where \(U=(u_1,u_2,\ldots,u_d)\) is an orthogonal matrix whose columns are the eigenvectors of \(\Sigma\), and \(\Lambda=\text{diag}(\lambda_1,\lambda_2,\ldots,\lambda_d)\) is a diagonal matrix whose entries are the eigenvalues of \(\Sigma\). Then we have
\[\begin{aligned}\Sigma^{-1}&=(U\Lambda U^T)^{-1}\\&=(U^T)^{-1}\Lambda^{-1}U^{-1}\\&=U\Lambda^{-1}U^T\\&=(u_1,u_2,\ldots,u_d)\begin{pmatrix}\frac{1}{\lambda_1}&0&\cdots&0\\0&\frac{1}{\lambda_2}&\cdots&0\\\vdots&\vdots&\ddots&0\\0&0&\cdots&\frac{1}{\lambda_d}\end{pmatrix}\begin{pmatrix}u_1\\u_2\\\vdots\\u_d\end{pmatrix}\\&=\begin{pmatrix}\frac{u_1}{\lambda_1},\frac{u_2}{\lambda_2},\ldots,\frac{u_d}{\lambda_d}\end{pmatrix}\begin{pmatrix}u_1\\u_2\\\vdots\\u_d\end{pmatrix}\\&=\sum_{i=1}^d\frac{u_iu_i^T}{\lambda_i}\end{aligned}.\]Plugging it in \(\Delta(x)\), we arrive at
\[\begin{aligned}\Delta(x)&=(x-\mu)^T\left(\sum_{i=1}^d\frac{u_iu_i^T}{\lambda_i}\right)(x-\mu)\\&=\sum_{i=1}^d(x-\mu)^Tu_i\frac{1}{\lambda_i}u_i^T(x-\mu)\end{aligned}.\]Before moving on, we take a pause to be acquainted with those notations. Specifically, \((x-\mu)\in\mathbb{R}^{d\times 1}\) as \(x, \mu\in\mathbb{R}^{d\times 1}\). \(u_i\) is the eigenvector of \(\Sigma\) so \(u_i\in\mathbb{R}^{d\times 1}\), while \(\lambda_i\) is the eigenvalue of \(\Sigma\) so \(\lambda_i\in\mathbb{R}\). Therefore \((x-\mu)^Tu_i\in\mathbb{R}\), \(u_i^T(x-\mu)\in\mathbb{R}\) and \(\Delta(x)\in\mathbb{R}\). It should not be surprising. Recall that the PDF
\[P(x|\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}\Delta(x)}\]represents a probability, which is definitely a real number. Now we introduce \(Y=(y_1,y_2,\ldots,y_d)^T\) where \(y_i=(x-\mu)^Tu_i\). Hence
\[\Delta(x)=\sum_{i=1}^dy_i\frac{1}{\lambda_i}y_i^T=\sum_{i=1}^d\frac{y_i^2}{\lambda_i}.\]The PDF of \(x\) is rewritten as
\[P(x|\mu,\Sigma)=\alpha e^{-\frac{1}{2}\sum_{i=1}^d y_i^2/\lambda_i},\]where \(\alpha=\frac{1}{(2\pi)^{d/2}\|\Sigma\|^{1/2}}\) is a constant. Now suppose we are trying to determine where \(P(x\|\mu,\Sigma)=\beta\). To make it intuitive, we consider the case \(d=2\), then it follows that
\[\begin{alignat*}{3}&& \alpha e^{-\frac{1}{2}\left(\frac{y_1^2}{\lambda_1}+\frac{y_2^2}{\lambda_2}\right)}&=\beta\\\Rightarrow&&\frac{y_1^2}{\lambda_1}+\frac{y_2^2}{\lambda_2}&=2(\ln\alpha-\ln\beta)\\\Rightarrow&&\frac{y_1^2}{a^2}+\frac{y_2^2}{b^2}&=1\end{alignat*},\]where \(a^2=2\lambda_1(\ln\alpha-\ln\beta)\) and \(b^2=2\lambda_2(\ln\alpha-\ln\beta)\), and we finally get a standard equation of ellipse with respect to \(y_1\) and \(y_2\)! Actually, the transformation \(y_i=(x-\mu)^Tu_i\) is the projection of \(x\) on \(u_i\). Thus in the coordinate system represented by \(x_1\) and \(x_2\), the ellipse will be changed linearly hence the ellipse will not be so standard. For different \(\beta\), we have ellipses with different major and minor axes, and that is why the graph of bivariate Gaussian distribution looks like it is composed of numerous concentration ellipses.
PS. The above discussion leaves much correctness proof to be done.
4. Calculations on Gaussian
In this section, all the random variables are Gaussian and we will not touch the complex PDF. Rather, we mainly focus on the statistics, the expectation and the variance, of Gaussian distribution.
4.1. Calculations on Union Distribution
Suppose \(x\sim\mathcal{N}(\mu, \Sigma)\) is the union distribution of \(x_a\) and \(x_b\), i.e. \(x=(x_a,x_b)^T\). Then the definition follows that
\[x=\begin{pmatrix}x_a\\x_b\end{pmatrix}_{d\times1},\quad \mu=\begin{pmatrix}\mu_a\\\mu_b\end{pmatrix}_{d\times1},\quad\Sigma=\begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}_{d\times d},\]where
\[x_a,\mu_a\in\mathbb{R}^{m\times 1},x_b,\mu_b\in\mathbb{R}^{n\times 1}, m+n=d,\] \[\Sigma_{aa}\in\mathbb{R}^{m\times m}, \Sigma_{ab},\Sigma_{ba}^T\in\mathbb{R}^{m\times n}, \Sigma_{bb}\in\mathbb{R}^{n\times n}.\]The problem in this section is how to derive \(P(x_a)\) and \(P(x_b\|x_a)\), or symmetrically, \(P(x_b)\) and \(P(x_a\|x_b)\) given the union distribution. The derivation of \(P(x_a)\) is quite straightforward:
\[\begin{aligned}\mathbb{E}[x_a]&=\mathbb{E}\left[\begin{pmatrix}\mathbf{I}_{m\times m}&\mathbf{0}_{m\times n}\end{pmatrix}\begin{pmatrix}x_a\\x_b\end{pmatrix}\right]\\&=\begin{pmatrix}\mathbf{I}_{m\times m}&\mathbf{0}_{m\times n}\end{pmatrix}\mathbb{E}[x]\\&=\begin{pmatrix}\mathbf{I}_{m\times m}&\mathbf{0}_{m\times n}\end{pmatrix}\begin{pmatrix}\mu_a\\\mu_b\end{pmatrix}\\&=\mu_a\end{aligned},\]and
\[\begin{aligned}var[x_a]&=var\left[\begin{pmatrix}\mathbf{I}_{m\times m}&\mathbf{0}_{m\times n}\end{pmatrix}\begin{pmatrix}x_a\\x_b\end{pmatrix}\right]\\&=\begin{pmatrix}\mathbf{I}_{m\times m}&\mathbf{0}_{m\times n}\end{pmatrix}var[x]\begin{pmatrix}\mathbf{I}_{m\times m}\\\mathbf{0}_{n\times m}\end{pmatrix}\\&=\begin{pmatrix}\mathbf{I}_{m\times m}&\mathbf{0}_{m\times n}\end{pmatrix}\begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}\begin{pmatrix}\mathbf{I}_{m\times m}\\\mathbf{0}_{n\times m}\end{pmatrix}\\&=\Sigma_{aa}\end{aligned}.\]Thus, we have \(x_a\sim\mathcal{N}(\mu_a,\Sigma_{aa})\), and similarly, \(x_b\sim\mathcal{N}(\mu_b,\Sigma_{bb})\). For the conditional probabilities \(P(x_b\|x_a)\), we first define variables
\[x_{b\cdot a}=x_b-\Sigma_{ba}\Sigma^{-1}_{aa}x_a,\] \[\mu_{b\cdot a}=\mu_b-\Sigma_{ba}\Sigma^{-1}_{aa}\mu_a,\]and
\[\Sigma_{bb\cdot a}=\Sigma_{bb}-\Sigma_{ba}\Sigma^{-1}_{aa}\Sigma_{ab}.\]The term \(\Sigma_{bb\cdot a}\) is also the Schur complement of \(\Sigma_{aa}\) of the matrix \(\Sigma\). Then it follows that
\[\begin{aligned}\mathbb{E}[x_{b\cdot a}]&=\mathbb{E}\left[\begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1}&\mathbf{I}_{n\times n}\end{pmatrix}\begin{pmatrix}x_a\\x_b\end{pmatrix}\right]\\&=\begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1}&\mathbf{I}_{n\times n}\end{pmatrix}\mathbb{E}[x]\\&=\begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1}&\mathbf{I}_{n\times n}\end{pmatrix}\begin{pmatrix}\mu_a\\\mu_b\end{pmatrix}\\&=\mu_{b\cdot a}\end{aligned},\]and
\[\begin{aligned}var[x_{b\cdot a}]&=var\left[\begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1}&\mathbf{I}_{n\times n}\end{pmatrix}\begin{pmatrix}x_a\\x_b\end{pmatrix}\right]\\&=\begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1}&\mathbf{I}_{n\times n}\end{pmatrix}var[x]\begin{pmatrix}-\Sigma_{aa}^{-1}\Sigma_{ba}^T\\\mathbf{I}_{n\times n}\end{pmatrix}\\&=\begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1}&\mathbf{I}_{n\times n}\end{pmatrix}\begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}\begin{pmatrix}-\Sigma_{aa}^{-1}\Sigma_{ba}^T\\\mathbf{I}_{n\times n}\end{pmatrix}\\&=\Sigma_{bb\cdot a}\end{aligned}.\]Hence we have \(x_{b\cdot a}\sim\mathcal{N}(\mu_{b\cdot a},\Sigma_{bb\cdot a})\). Before showing the relation between \(x_{b\cdot a}\) and \(x_b\|x_a\), we introduce a helpful theorem.
Theorem:If \(x\sim\mathcal{N}(\mu,\Sigma)\), then \(Mx\perp Nx\iff M\Sigma N=\mathbf{0}.\)
Proof: According to the property of Gaussian, \(Mx\sim\mathcal{N}(M\mu,M\Sigma M^T)\) and \(Nx\sim\mathcal{N}(N\mu,N\Sigma N^T)\). Further,
\[\begin{aligned}Cov(Mx,Nx)&=\mathbb{E}[(Mx-M\mu)(Nx-N\mu)^T]\\&=M\mathbb{E}[(x-\mu)(x-\mu)^T]N\\&=M\Sigma N^T.\end{aligned}\]For Gaussian distribution, uncorrelation implies independence. Therefore \(M\Sigma N^T=\mathbf{0}\iff Mx\perp Nx.\tag*{\blacksquare}\)
Back to our case, we have
\[x_{b\cdot a}=\underbrace{\begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1}&\mathbf{I}\end{pmatrix}}_M\underbrace{\begin{pmatrix}x_a\\x_b\end{pmatrix}}_x,\qquad x_a=\underbrace{\begin{pmatrix}\mathbf{I}&\mathbf{0}\end{pmatrix}}_{N}\underbrace{\begin{pmatrix}x_a\\x_b\end{pmatrix}}_x.\]Check \(M\Sigma N\):
\[\begin{aligned}M\Sigma N&=\begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1}&\mathbf{I}\end{pmatrix}\begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}\begin{pmatrix}\mathbf{I}\\\mathbf{0}\end{pmatrix}\\&=\begin{pmatrix}\mathbf{0}& \Sigma_{bb}-\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab}\end{pmatrix}\begin{pmatrix}\mathbf{I}\\\mathbf{0}\end{pmatrix}\\&=\mathbf{0}\end{aligned}.\]Thus \(x_{b\cdot a}\perp x_a\), which means \(x_{b\cdot a}\|x_a=x_{b\cdot a}\). According to our definition, we finally arrive at
\[\begin{aligned}x_b|x_a&=x_{b\cdot a}|x_a+\Sigma_{ba}\Sigma_{aa}^{-1}x_a|x_a\\&=x_{b\cdot a}+\Sigma_{ba}\Sigma_{aa}^{-1}x_a\end{aligned}.\]There is a little abuse of notation: the term \(x_a\) after \(\|\) is a sample of \(x_a\) rather than a random variable. One should notice that \(x_a\) below is constant. Therefore,
\[\mathbb{E}[x_b|x_a]=\mathbb{E}[x_{b\cdot a}]+\Sigma_{ba}\Sigma_{aa}^{-1}x_a=\mu_{b\cdot a}+\Sigma_{ba}\Sigma_{aa}^{-1}x_a,\]and
\[var[x_b|x_a]=var[x_{b\cdot a}]+0=\Sigma_{bb\cdot a}.\]Removing all the extra variables we defined, we have
\[x_b|x_a\sim\mathcal{N}(\mu_b+\Sigma_{ba}\Sigma_{aa}^{-1}(x_a-\mu_a), \Sigma_{bb}-\Sigma_{ba}\Sigma^{-1}_{aa}\Sigma_{ab}),\]and the case for \(x_a\|x_b\) is similar.
4.2 Calculations on Marginal and Conditional Distribution
In this section, the problem is deriving \(P(y)\) and \(P(x\|y)\) given \(P(x)\) and \(P(y\|x)\). For example, let’s say, \(x\sim\mathcal{N}(\mu, \Sigma)\) and \(y\|x\sim\mathcal{N}(Ax+b,L^{-1})\), which is common in linear Gaussian model. The abuse of notation here is the same as we mentioned above: the \(x\ (y)\) in \(y\|x\ (x\|y)\) represents a sample rather than a random variable. Now we define \(y=Ax+b+\varepsilon\), where \(\varepsilon\sim\mathcal{N}(0,L^{-1})\) is the noise in practice. Then we have
\[\mathbb{E}[y]=\mathbb{E}[Ax+b+\varepsilon]=A\mu+b,\]and
\[var[y]=var[Ax+b]+var[\varepsilon]=A\Sigma A^T+L^{-1}.\]Therefore \(y\sim\mathcal{N}(A\mu+b, A\Sigma A^T+L^{-1})\). For the distribution of \(x\|y\), we can refer to the conclusion of section 4.1. Introducing a union variable \(z=(x,y)^T\), we obtain
\[z\sim\mathcal{N}\left(\begin{pmatrix}\mu\\A\mu+b\end{pmatrix},\begin{pmatrix}\Sigma&Cov(x,y)\\Cov(y,x)&A\Sigma A^T+L^{-1}\end{pmatrix}\right).\]By the property of covariance, we have \(Cov(y,x)=Cov(x,y)^T\). Further,
\[\begin{aligned}Cov(x,y)&=\mathbb{E}[(x-\mu)(y-A\mu-b)^T]\\&=\mathbb{E}[(x-\mu)(Ax+b+\varepsilon-A\mu-b)^T]\\&=\mathbb{E}[(x-\mu)(Ax-A\mu)^T]+\mathbb{E}[(x-\mu)\varepsilon^T]\\&=\mathbb{E}[(x-\mu)(x-\mu)^T]A^T+0\\&=var(x)A^T\\&=\Sigma A^T\end{aligned}.\]With the known union distribution \(z=(x,y)^T\) and \(P(y)\), we can derive \(P(x\|y)\) by the method mentioned in section 4.1.
5. Conclusion
In this post, we talk a lot about Gaussian distribution as it plays an important role in machine learning. Also, we are trying to get familar with those mathematical things.