I think my journey in statistics has been unconventional in that much of my statistical training came from social-behavioral stats classes, where the heavier math is often left at the door, and so I missed out on a lot of the intuition and proofs that many statistics students see in their undergraduate courses.
For me, this is certainly the case in connecting the estimating equations for the regression parameters (and their variances) that you learn in intro statistics courses to the estimating equation in the matrix form of OLS. I know the scalar equations and I know the matrix equation. But I haven’t seen how it is that they result in the same estimates.
This gap in understanding led me to reflect on a statistics professor I had during my master’s program. When it came to derivations with important results, he’d often insist that we “see it once!”. I think what he wanted was for us to know that the methods he was teaching stood on solid ground, and not necessarily for us to memorize each step of the derivation. The idea of “see it once” always struck me as a valuable principle.
This post is another opportunity to “see it once”, bridging the scalar and matrix notations of OLS. The derivations below follow Chapter 11 of Cosma Shalizi’s The Truth about Linear Regression, with a few added details for clarity.
OLS in scalar notation
OLS regression is typically presented in its simplest (and, I think, clearest) form in introductory statistics classes
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i,\quad i = 1, \dots, n. \]
The goal is to predict \(y_i\) from \(x_i\), with the optimal linear prediction of \(y_i\) given \(x_i\) obtained when the regression parameters \(\beta_0\) and \(\beta_1\) minimize the sum of the squared residuals (RSS). The parameters that accomplish this are given by the following equations
\[ \begin{align} \hat \beta_1 &= \frac{\hat{\text{Cov}(x, y)}}{\hat \sigma^2_x} \\ \hat \beta_0 &= \bar y - \hat \beta_1 \bar x. \end{align} \]
Assuming homoskedastic and independent errors, the variances of the parameter estimates can be calculated as follows
\[ \mathbb{V}[\hat\beta_0] = \frac{\sigma^2_\epsilon}{n} \left( 1 + \frac{\bar x^2}{\hat\sigma^2_x} \right) \]
\[ \mathbb{V}[\hat\beta_1] = \frac{\sigma^2_{\epsilon}}{\sum_{i=1}^n (x_i - \bar x)^2}. \]
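If it helps to see these formulas as computation, here is a minimal numpy sketch on simulated toy data. The intercept 2.0, slope 0.5, noise scale, and seed are arbitrary values chosen purely for illustration; they don't come from any dataset in this post.

```python
import numpy as np

# Toy data (purely illustrative; any 1-D x and y of equal length work).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)
n = len(x)

# MLE-style moments (divide by n, not n - 1).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
var_x = np.mean((x - x.mean()) ** 2)

beta1_hat = cov_xy / var_x                      # Cov(x, y) / Var(x)
beta0_hat = y.mean() - beta1_hat * x.mean()     # ybar - beta1 * xbar

# Residual variance estimate to plug into the variance formulas.
resid = y - (beta0_hat + beta1_hat * x)
sigma2_eps = np.mean(resid ** 2)

var_beta0 = sigma2_eps / n * (1 + x.mean() ** 2 / var_x)
var_beta1 = sigma2_eps / np.sum((x - x.mean()) ** 2)
```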
OLS in matrix notation
While the equations in scalar notation are intuitive for simple regression, they become unwieldy with multiple predictors. This is where matrix notation comes in handy, since it gives us a compact way to write the same model. However, we’ll stick to the case of a single predictor so that we can focus on connecting the scalar notation to the matrix notation.
The equation connecting \(y_i\) to \(x_i\) can be re-written using vectors and matrices
\[ \begin{align} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} &= \begin{bmatrix} \beta_0 + \beta_1 x_1 + \epsilon_1\\ \vdots \\ \beta_0 + \beta_1 x_n + \epsilon_n \end{bmatrix} \\ %------------------ &= \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \\ \end{bmatrix}. \end{align} \]
We can compactly write
\[ \begin{align} \mathbf{y} = \mathbf{x} \bm{\beta} + \bm{\epsilon}. \end{align} \]
The regression parameters that minimize the RSS are given by \[ \hat{\bm{\beta}} = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1\end{bmatrix} = \left(\mathbf{x}'\mathbf{x}\right)^{-1} \mathbf{x}'\mathbf{y} \]
and the variance-covariance matrix of the estimated coefficients by
\[ \bm{\Sigma} = \begin{bmatrix} \mathbb{V}[\hat\beta_0] & \text{Cov}(\hat\beta_0, \hat\beta_1) \\ \text{Cov}(\hat\beta_1, \hat\beta_0) & \mathbb{V}[\hat\beta_1]\end{bmatrix} = \sigma^2_\epsilon (\bm{x}'\bm{x})^{-1}. \]
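As a rough sketch of what these two matrix expressions look like in numpy (again on made-up toy data): here I estimate \(\sigma^2_\epsilon\) by the mean of the squared residuals, which is one common plug-in choice, not necessarily the one your favorite software uses.

```python
import numpy as np

# Same kind of toy data as before; the design matrix gets a column of ones.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)
X = np.column_stack([np.ones_like(x), x])          # n x 2 design matrix

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # (x'x)^{-1} x'y
resid = y - X @ beta_hat
sigma2_eps = np.mean(resid ** 2)                    # MLE; n - 2 is another common divisor
Sigma_hat = sigma2_eps * np.linalg.inv(X.T @ X)     # sigma^2 (x'x)^{-1}
```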
Deriving the OLS estimator using matrix notation
The first step we’ll take in connecting the scalar notation equations to the matrix equation is to show that the matrix estimator yields a parameter vector that minimizes the RSS.
The residuals can be collected in a vector that can subsequently be used to compute the residual sum of squares (RSS)
\[ \begin{align} \bm \epsilon &= \begin{bmatrix} y_1 - \hat y_1 \\ \vdots \\ y_n - \hat y_n \end{bmatrix} \\ &= \mathbf{y} - \hat{\mathbf{y}} \\ &= \mathbf{y} - \mathbf{x} \bm{\beta}. \end{align} \]
\[ \begin{align} \text{RSS} &= \bm\epsilon'\bm\epsilon \\ &= \left(\mathbf{y} - \mathbf{x} \bm{\beta} \right)' \left(\mathbf{y} - \mathbf{x} \bm{\beta} \right). \end{align} \]
Expanding the terms in the RSS results in the following expression
\[ = \mathbf{y}'\mathbf{y} - 2\bm{\beta}'\mathbf{x}'\mathbf{y} + \bm{\beta}'\mathbf{x}'\mathbf{x}\bm{\beta}. \]
If you’re curious, the intervening steps go like this: expanding the product gives \(\mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{x}\bm{\beta} - \bm{\beta}'\mathbf{x}'\mathbf{y} + \bm{\beta}'\mathbf{x}'\mathbf{x}\bm{\beta}\), and since \(\mathbf{y}'\mathbf{x}\bm{\beta}\) is a scalar it equals its own transpose \(\bm{\beta}'\mathbf{x}'\mathbf{y}\), so the two middle terms combine into \(-2\bm{\beta}'\mathbf{x}'\mathbf{y}\).
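A quick numerical sanity check of the expansion, using arbitrary made-up \(\mathbf{x}\), \(\mathbf{y}\), and a candidate \(\bm\beta\) (nothing about these particular values matters):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.normal(size=50)
b = np.array([0.3, -1.2])                 # any candidate coefficient vector

rss_direct = (y - X @ b) @ (y - X @ b)
rss_expanded = y @ y - 2 * b @ X.T @ y + b @ X.T @ X @ b
assert np.isclose(rss_direct, rss_expanded)
```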
To determine the vector \(\bm{\beta}\) that minimizes the residual sum of squares (RSS), we differentiate the RSS with respect to \(\bm{\beta}\), set the resulting expression equal to zero, and solve for \(\bm{\beta}\).
The gradient turns out to be given by
\[ \begin{align} \nabla \text{RSS} &= \nabla \mathbf{y}'\mathbf{y} - 2 \nabla \bm{\beta}'\mathbf{x}'\mathbf{y} + \nabla \bm{\beta}'\mathbf{x}'\mathbf{x}\bm{\beta} \\ &= 2 \left( \mathbf{x}'\mathbf{x}\bm{\beta} - \mathbf{x}'\mathbf{y} \right). \end{align} \]
Again, for the curious: \(\mathbf{y}'\mathbf{y}\) does not depend on \(\bm{\beta}\), so its gradient is zero; the gradient of the linear term \(\bm{\beta}'\mathbf{x}'\mathbf{y}\) is \(\mathbf{x}'\mathbf{y}\); and the gradient of the quadratic form \(\bm{\beta}'\mathbf{x}'\mathbf{x}\bm{\beta}\) is \(2\mathbf{x}'\mathbf{x}\bm{\beta}\) because \(\mathbf{x}'\mathbf{x}\) is symmetric.
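If you’d rather trust arithmetic than calculus, here is a small sketch that compares the analytic gradient to a central finite-difference approximation on toy data (the step size and the candidate vector are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.normal(size=50)
b = np.array([0.3, -1.2])

def rss(beta):
    r = y - X @ beta
    return r @ r

grad_analytic = 2 * (X.T @ X @ b - X.T @ y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.array([
    (rss(b + eps * e) - rss(b - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
assert np.allclose(grad_analytic, grad_numeric)
```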
Now, remember that \(\mathbf{x}\) and \(\mathbf{y}\) are sample data. We can estimate the coefficient vector by setting the gradient to zero and solving for it
\[ \begin{align} 2 \left( \mathbf{x}'\mathbf{x}\hat{\bm{\beta}} - \mathbf{x}' \mathbf{y} \right) &= 0 \\ \mathbf{x}'\mathbf{x}\hat{\bm{\beta}} - \mathbf{x}' \mathbf{y} &= 0 \\ \mathbf{x}'\mathbf{x}\hat{\bm{\beta}} &= \mathbf{x}' \mathbf{y} \\ \hat{\bm{\beta}} &= \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}' \mathbf{y}. \end{align} \]
To confirm that \(\hat{\bm{\beta}}\) represents a minimum, we can do a second derivative test. This involves computing the Hessian by differentiating the gradient with respect to \(\bm{\beta}\) again, yielding \(2 \mathbf{x}'\mathbf{x}\). If \(\mathbf{x}\) has full column rank, then its Gram matrix \(\mathbf{x}'\mathbf{x}\) is positive definite, so the Hessian is positive definite and \(\hat{\bm\beta}\) is indeed a minimum.
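A quick way to convince yourself numerically, on made-up data, is to look at the eigenvalues of the Hessian \(2\mathbf{x}'\mathbf{x}\); with a full-column-rank design they should all be strictly positive:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])

hessian = 2 * X.T @ X
# All eigenvalues strictly positive when X has full column rank,
# so the critical point is a minimum.
print(np.linalg.eigvalsh(hessian))
```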
Elements of \(\hat{\boldsymbol{\beta}}\)
Okay so we’ve seen that the estimating equation \(\left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}' \mathbf{y}\) results in a vector of parameter estimates that minimizes the RSS, but let’s make sure that the elements in \(\hat{\bm{\beta}}\) indeed correspond to their scalar counterparts.
Specifically, let’s verify that
\[ \hat{\bm{\beta}} = \begin{bmatrix}\hat\beta_0 \\ \hat\beta_1\end{bmatrix} = \begin{bmatrix} \bar y - \hat \beta_1 \bar x \\ \frac{\hat{\text{Cov}(x, y)}}{\hat \sigma^2_x} \end{bmatrix}. \]
As a starting point, we’ll introduce a normalizing factor \(n^{-1}\) to the estimating equation¹ so that
\[ \hat{\bm{\beta}} = \left( n^{-1}\mathbf{x}'\mathbf{x} \right)^{-1} n^{-1}\mathbf{x}'\mathbf{y}. \]
We’ll compute this product in parts, beginning with the terms inside the parentheses. I’ll note here that in what follows, estimates of variances are defined as their maximum likelihood estimates (i.e., with a divisor of \(n\) rather than \(n - 1\)).
\[ \begin{align} n^{-1}\mathbf{x}'\mathbf{x} &= \frac1n \begin{bmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \\ % --------- &= \frac1n \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x^2_i \end{bmatrix} \\ % --------- &= \begin{bmatrix} 1 & \bar{x}\\ \bar{x}& \bar{x^2} \end{bmatrix} \end{align} \]
Let’s not forget that we need to find the inverse of the resulting matrix²
\[ \begin{align} \left(n^{-1}\mathbf{x}'\mathbf{x}\right)^{-1} &= \begin{bmatrix} 1 & \bar{x}\\ \bar{x}& \bar{x^2} \end{bmatrix}^{-1} \\ % --------- &= \frac{1}{\bar{x^2} - \bar{x}^2} \begin{bmatrix} \bar{x^2} & -\bar{x}\\ -\bar{x} & 1 \end{bmatrix}\\ % --------- &= \frac{1}{\hat\sigma^2_x} \begin{bmatrix} \bar{x^2} & -\bar{x}\\ -\bar{x} & 1 \end{bmatrix}. \end{align} \]
We get \(\hat\sigma^2_x\) in the last step because \(\bar{x^2} - \bar{x}^2\) estimates \(\mathbb{V}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2\).
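Here is a small numerical check of this step on toy data; it just confirms that numpy’s inverse of \(n^{-1}\mathbf{x}'\mathbf{x}\) matches the closed form above (the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])
n = len(x)

lhs = np.linalg.inv(X.T @ X / n)                  # inverse of n^{-1} x'x

var_x = np.mean(x**2) - np.mean(x)**2             # MLE variance of x
rhs = (1 / var_x) * np.array([[np.mean(x**2), -np.mean(x)],
                              [-np.mean(x),    1.0]])
assert np.allclose(lhs, rhs)
```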
Next, we’ll work through the term outside the parentheses
\[ \begin{align} n^{-1}\mathbf{x}'\mathbf{y} &= \frac1n \begin{bmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}\\ % --------- &= \frac1n \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix} \\ % --------- &= \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix}. \end{align} \]
Stitching it all together results in the following vector
\[ \begin{align} \hat{\bm{\beta}} &= \left( n^{-1}\mathbf{x}'\mathbf{x} \right)^{-1} n^{-1}\mathbf{x}'\mathbf{y} \\ % --------- &= \frac{1}{\hat\sigma^2_x} \begin{bmatrix} \bar{x^2} & -\bar{x}\\ -\bar{x} & 1 \end{bmatrix} \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix} \\ % --------- &= \frac{1}{\hat\sigma^2_x} \begin{bmatrix} \bar{x^2}\bar{y} - \bar{x} \overline{xy} \\ - \bar{x} \bar{y} + \overline{xy} \end{bmatrix}. % --------- \end{align} \]
We’ve made good progress up to this point, but it still feels unclear how the elements of this vector correspond to the scalar equations for \(\hat\beta_0\) and \(\hat\beta_1\). To continue, we’ll have to re-express the elements of the vector by taking advantage of the following definitions
- \(\hat{\sigma^2_x} = \bar{x^2} - \bar{x}^2\)
- \(\hat{\text{Cov}(x, y)} = \overline{xy} - \bar{x}\bar{y}\).
Doing so lets us re-express the vector as follows \[ \begin{align} &= \frac{1}{\hat\sigma^2_x} \begin{bmatrix} \left(\hat\sigma^2_x + \bar{x}^2\right)\bar{y} - \bar{x}\left(\hat{\text{Cov}(x,y)} + \bar{x} \bar{y}\right) \\ \hat{\text{Cov}(x,y)} \end{bmatrix} \\ % --------- &= \frac{1}{\hat\sigma^2_x} \begin{bmatrix} \hat\sigma^2_x\bar{y} + \bar{x}^2\bar{y} - \bar{x}\hat{\text{Cov}(x,y)} - \bar{x}^2 \bar{y} \\ \hat{\text{Cov}(x,y)} \end{bmatrix}. % --------- \end{align} \]
Our last task here is distributing \(\frac{1}{\hat\sigma^2_x}\) and simplifying
\[ \begin{align} &= \begin{bmatrix} \bar{y} - \frac{\hat{\text{Cov}(x,y)}}{\hat\sigma^2_x}\bar{x} \\ \frac{\hat{\text{Cov}(x,y)}}{\hat\sigma^2_x} \end{bmatrix} \\ % --------- &= \begin{bmatrix} \bar{y} - \hat\beta_1\bar{x} \\ \hat \beta_1 \end{bmatrix}. \end{align} \]
And there it is. Although it was a bit tedious, we can see that the elements in \(\hat{\bm{\beta}}\) indeed correspond to the scalar equations for \(\hat\beta_0\) and \(\hat\beta_1\).
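For the skeptical, here is a short numerical confirmation that the matrix estimator and the scalar formulas land on the same numbers (toy data again; the generating values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)
X = np.column_stack([np.ones_like(x), x])

# Matrix estimator: (x'x)^{-1} x'y.
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar estimators.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
var_x = np.mean((x - x.mean()) ** 2)
beta1 = cov_xy / var_x
beta0 = y.mean() - beta1 * x.mean()

assert np.allclose(beta_matrix, [beta0, beta1])
```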
Deriving the sampling distribution of \(\hat{\boldsymbol{\beta}}\)
To perform hypothesis tests and compute confidence intervals for the elements in \(\hat{\bm{\beta}}\), we need to understand its sampling distribution. We can do this by splitting \(\hat{\bm{\beta}}\) into a deterministic part and a random part.
Note that up to this point, we haven’t made any distributional assumptions. Now, though, we’ll make the usual assumption that the residuals are distributed \(\bm{\epsilon} \sim \mathcal{N}\left(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I}\right)\).
Remember that we’ve defined \(\mathbf{y} = \mathbf{x}\bm{\beta} + \bm{\epsilon}\) so
\[ \begin{align} \hat{\bm{\beta}} &= \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}' \mathbf{y} \\ % --------- &= \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}' \left(\mathbf{x}\bm{\beta} + \bm{\epsilon}\right) \\ % --------- &= \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\mathbf{x}\bm{\beta} + \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\bm{\epsilon} \\ % --------- &= \bm{\beta} + \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\bm{\epsilon}. \end{align} \]
Taking the expectation we get
\[ \begin{align} \mathbb{E}[\hat{\bm{\beta}}] &= \mathbb{E}[\bm{\beta} + \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\bm{\epsilon}] \\ % --------- &= \mathbb{E}[\bm{\beta}] + \mathbb{E}[\left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\bm{\epsilon}] \\ % --------- &= \bm{\beta} + \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\mathbb{E}[\bm{\epsilon}] \\ % --------- &= \bm{\beta}. \end{align} \]
Now, turning our attention to the variance-covariance matrix of the coefficients
\[ \begin{align} \mathbb{V}[\hat{\bm{\beta}}] &= \mathbb{V}[\bm{\beta} + \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\bm{\epsilon}] \\ % --------- &= \mathbb{V}[\bm{\beta}] + \mathbb{V}[\left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\bm{\epsilon}] \end{align} \]
The variance of \(\bm{\beta}\) is 0 because it is a fixed vector. To compute the variance for the remaining term we’ll use the fact that if \(\mathbf{a}\) is fixed and \(\mathbf{X}\) is random, then \(\mathbb{V}[\mathbf{a}\mathbf{X}] = \mathbf{a} \mathbb{V}[\mathbf{X}]\mathbf{a}'\)
\[ \begin{align} &= \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}' \mathbb{V}[\bm{\epsilon}] \left( \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\right)' \\ % --------- &= \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}' \sigma^2_{\epsilon}\mathbf{I} \mathbf{x}\left(\mathbf{x}'\mathbf{x}\right)^{-1} \\ % --------- Multiplying by I doesn't change the matrix &= \sigma^2_{\epsilon} \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\mathbf{x}\left(\mathbf{x}'\mathbf{x}\right)^{-1} \\ % --------- &= \sigma^2_{\epsilon} \left(\mathbf{x}'\mathbf{x}\right)^{-1}. \end{align} \]
Great, our work has shown that the estimated regression coefficients are normally distributed³ with mean vector \(\bm\beta\), and variance-covariance matrix \(\bm{\Sigma} = \sigma^2_{\epsilon} \left(\mathbf{x}'\mathbf{x}\right)^{-1}\)
\[ \hat{\bm\beta} \sim \mathcal{N} \left(\bm{\beta}, \bm{\Sigma}\right). \]
In practice, we don’t know \(\sigma^2_{\epsilon}\), so we plug in an estimate, \(\hat{\bm{\Sigma}} = \hat{\sigma}^2_\epsilon\left(\mathbf{x}'\mathbf{x}\right)^{-1}\), for inference.
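One way to see this sampling distribution in action is a small Monte Carlo sketch: hold the design fixed, simulate many response vectors, and compare the empirical mean and covariance of \(\hat{\bm\beta}\) to \(\bm\beta\) and \(\sigma^2_\epsilon(\mathbf{x}'\mathbf{x})^{-1}\). The true coefficients, error scale, sample size, and number of replications below are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma_eps = 50, np.array([2.0, 0.5]), 0.3
x = rng.normal(size=n)                       # fixed design across replications
X = np.column_stack([np.ones_like(x), x])

draws = np.empty((5000, 2))
for r in range(5000):
    y = X @ beta + rng.normal(scale=sigma_eps, size=n)
    draws[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(draws.mean(axis=0))                    # approx. beta
print(np.cov(draws, rowvar=False))           # approx. sigma^2 (x'x)^{-1}
print(sigma_eps**2 * np.linalg.inv(X.T @ X))
```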
Elements of the variance-covariance matrix
As we did with the parameter estimates, we want to make sure that the matrix representations of the parameter variances correspond to their scalar counterparts.
To verify, we’ll again start by adding in a normalizing term of \(n^{-1}\) so that
\[ \begin{align} {\bm{\Sigma}} &= n^{-1}\sigma^2_{\epsilon} \left(n^{-1}\mathbf{x}'\mathbf{x}\right)^{-1} \\ % --------- &= \frac{\sigma^2_{\epsilon}}{n} \left(\frac{1}{\hat\sigma^2_x} \begin{bmatrix} \bar{x^2} & -\bar{x}\\ -\bar{x} & 1 \end{bmatrix} \right). \end{align} \]
The variance for \(\hat\beta_0\) corresponds to the first diagonal element of \({\bm{\Sigma}}\). Multiplying the relevant terms gets us
\[ \begin{align} \mathbb{V}[\hat\beta_0] &= \frac{\sigma^2_{\epsilon}}{n} \cdot \frac{1}{\hat\sigma^2_x} \cdot \bar{x^2} \\ % --------- &= \frac{\sigma^2_{\epsilon}\bar{x^2} }{n \hat\sigma^2_x} \\ % --------- &= \frac{\sigma^2_{\epsilon}( \bar{x}^2 + \hat\sigma^2_x )}{n \hat\sigma^2_x} \\ % --------- &= \frac{\sigma^2_{\epsilon}\bar{x}^2 + \sigma^2_{\epsilon}\hat\sigma^2_x}{n \hat\sigma^2_x} \\ % --------- &= \frac{\sigma^2_{\epsilon}\bar{x}^2}{n \hat\sigma^2_x} + \frac{\sigma^2_{\epsilon}\hat\sigma^2_x}{n \hat\sigma^2_x} \\ % --------- &= \frac{\sigma^2_{\epsilon}}{n} \left( \frac{\bar{x}^2}{\hat\sigma^2_x} + 1 \right). \end{align} \]
The variance for \(\hat\beta_1\) corresponds to the second diagonal element of \({\bm{\Sigma}}\), and again multiplying the relevant terms
\[ \begin{align} \mathbb{V}[\hat\beta_1] &= \frac{\sigma^2_{\epsilon}}{n} \cdot \frac{1}{\hat\sigma^2_x} \cdot 1 \\ % --------- &= \frac{\sigma^2_{\epsilon}}{n} \cdot \frac{n}{\sum (x_i - \bar{x})^2}\\ % --------- &= \frac{\sigma^2_{\epsilon}}{\sum_{i=1}^n\left(x_i - \bar{x}\right)^2}. \end{align} \]
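And a final numerical check that the diagonal of \(\bm\Sigma\) reproduces the two scalar variance formulas (with a made-up, “known” error variance used purely for the comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])
n = len(x)
sigma2_eps = 0.3**2                          # pretend the error variance is known

Sigma = sigma2_eps * np.linalg.inv(X.T @ X)  # sigma^2 (x'x)^{-1}

var_x = np.mean((x - x.mean()) ** 2)         # MLE variance of x
var_beta0 = sigma2_eps / n * (1 + x.mean() ** 2 / var_x)
var_beta1 = sigma2_eps / np.sum((x - x.mean()) ** 2)

assert np.allclose(np.diag(Sigma), [var_beta0, var_beta1])
```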
With that, we’ve shown that the matrix representations of the parameter variances also match their scalar counterparts.
Conclusion
Well. There you have it. By working through the derivations, we’ve demonstrated that the OLS regression parameter estimates and their variances derived in matrix form are equivalent to those obtained from the traditional scalar notation.
We don’t need to remember every step, but we can feel confident that the scalar and matrix notations of OLS regression lead to the same results now that we’ve “seen it once”.
Footnotes
1. This will be handy in letting us define the elements in the relevant vectors and matrices as means and variances.
2. The inverse of a \(2 \times 2\) matrix is given by \(\left[\begin{smallmatrix}a & b \\ c & d\end{smallmatrix}\right]^{-1} = \frac{1}{ad-bc}\left[\begin{smallmatrix}d & -b \\ -c & a\end{smallmatrix}\right]\).
3. This stems from the fact that we defined \(\hat{\bm{\beta}} = \bm{\beta} + \left(\mathbf{x}'\mathbf{x}\right)^{-1}\mathbf{x}'\bm{\epsilon}\) and the normality of \(\bm{\epsilon}\) is preserved under affine transformations.