Tuesday, December 15, 2009

The Correlation Coefficient as cosine theta

Mathematicians define the dot product between vectors  \vec{v}= (v_{1}, v_{2}, \, \ldots \, , v_{n}) and  \vec{w}= (w_{1}, w_{2}, \, \ldots \, , w_{n}) as


\vec{v} \cdot \vec{w} = v_{1} w_{1} + v_{2} w_{2} + \, \cdots \, + v_{n} w_{n}


On the other hand, the alternate geometric definition for the dot product popular with physicists is

\vec{v} \cdot \vec{w} = \left|\left|\vec{v}\right|\right| \, \left|\left|\vec{w}\right|\right| \, \cos \, \theta


So
\cos \, \theta = \frac{\vec{v} \cdot \vec{w}}{\left|\left|\vec{v}\right|\right| \, \left|\left|\vec{w}\right|\right|}
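As a quick numerical illustration, here is a minimal Python/NumPy sketch of this formula; the two vectors are chosen arbitrarily just to show the computation.

    import numpy as np

    # cos(theta) is the dot product divided by the product of the norms.
    v = np.array([1.0, 2.0, 3.0])
    w = np.array([2.0, 1.0, 0.5])

    cos_theta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    theta = np.arccos(cos_theta)        # angle in radians
    print(cos_theta, np.degrees(theta))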

And statisticians define Pearson's correlation coefficient r so that

r = \frac {\sum (x_{i} - \bar{x})(y_{i} - \bar{y}) }  {\sqrt{\sum (x_{i} - \bar{x})^2}  \sqrt{ \sum (y_{i} - \bar{y})^2}}


Thus if we set  \vec{v} = (x_1 - \bar{x}, x_2 - \bar{x},\, \ldots \, , x_n - \bar{x}) and  \vec{w} = (y_1 - \bar{y}, y_2 - \bar{y},\, \ldots \, , y_n - \bar{y}) , then r = \cos \,\theta.
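A short sketch that checks this identity numerically: the data below is hypothetical noisy linear data, and the cosine of the angle between the centered vectors is compared against NumPy's np.corrcoef.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=20)
    y = 2.0 * x + rng.normal(size=20)   # made-up noisy linear data

    v = x - x.mean()                    # centered x-values
    w = y - y.mean()                    # centered y-values

    cos_theta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    r = np.corrcoef(x, y)[0, 1]         # Pearson's r from NumPy

    print(np.isclose(cos_theta, r))     # True: r equals cos(theta)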

The idea is to think not of n ordered pairs (x1, y1), (x2, y2), ..., (xn, yn), but rather to think of two vectors in n-dimensional space. When the vectors are pointing in the same direction, the angle between them is zero and the correlation coefficient is cos 0 = 1. When the vectors point in opposite directions, the correlation coefficient is the cosine of a straight angle, r = -1. And when the vectors are orthogonal, the correlation coefficient is the cosine of a right angle, r = 0.
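The three cases can be seen with small made-up data sets (again a NumPy sketch, with vectors chosen only for illustration):

    import numpy as np

    def r_as_cosine(x, y):
        """Pearson's r as the cosine of the angle between centered vectors."""
        v, w = x - x.mean(), y - y.mean()
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(r_as_cosine(x,  2 * x + 5))                          # same direction: r = 1
    print(r_as_cosine(x, -3 * x + 1))                          # opposite direction: r = -1
    print(r_as_cosine(x, np.array([1.0, -1.0, -1.0, 1.0])))    # orthogonal: r = 0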

The only tricky part is that the two n-dimensional vectors are not \vec{x} and  \vec{y}, the vectors containing all the x_{i} and y_{i} respectively.  Instead, they are the mean-centered vectors \vec{v} and \vec{w} defined above.

And nicely, the least-squares regression line for the (x_i , y_i ) data is y = mx + b, where  m= r \frac{\left|\left|\vec{w}\right|\right|}{\left|\left|\vec{v}\right|\right| } and b = \bar{y} - m \bar{x}.  (Notice that the variance \sigma_{x}^{2} = \frac{\vec{v} \cdot \vec{v}}{n} and likewise \sigma_{y}^{2} = \frac{\vec{w} \cdot \vec{w}}{n}, so m can also be written as  m= r \frac{\sigma_y}{\sigma_x}.)
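Here is a sketch of that slope formula, again with hypothetical data, compared against NumPy's np.polyfit:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=50)
    y = 1.7 * x + 0.4 + rng.normal(scale=0.3, size=50)   # made-up data

    v, w = x - x.mean(), y - y.mean()
    r = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    m = r * np.linalg.norm(w) / np.linalg.norm(v)   # m = r * ||w|| / ||v|| = r * sigma_y / sigma_x
    b = y.mean() - m * x.mean()

    m_np, b_np = np.polyfit(x, y, 1)                # NumPy's least-squares line
    print(np.allclose([m, b], [m_np, b_np]))        # True: same line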


One typically derives the least-squares regression line by finding m and b that minimize  \sum  (m x_i +b - y_i )^2.  But one can alternatively use the n-dimensional vector point of view, where the coefficients m and b correspond to the solution of the vector equation m\vec{x} + b\vec{1} = \hat{y}.  The vector \vec{1}= (1, \, 1, \, \ldots \, , \, 1) is the vector of all 1's and the vector \hat{y}  is the orthogonal projection of the vector  \vec{y} onto the space spanned by \vec{x} and \vec{1}.
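The projection point of view can be illustrated with np.linalg.lstsq, which solves exactly this kind of least-squares problem; the data below is made up for the sketch, and the check at the end confirms that the residual \vec{y} - \hat{y} is orthogonal to both spanning vectors.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=30)
    y = -0.8 * x + 2.0 + rng.normal(scale=0.5, size=30)   # made-up data

    # Columns are the vectors x and 1 that span the subspace we project onto.
    A = np.column_stack([x, np.ones_like(x)])

    # Least squares finds (m, b) so that m*x + b*1 is the orthogonal
    # projection y_hat of y onto span{x, 1}.
    (m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ np.array([m, b])

    # The residual y - y_hat is orthogonal to both spanning vectors.
    print(np.isclose(np.dot(y - y_hat, x), 0.0),
          np.isclose(np.dot(y - y_hat, np.ones_like(x)), 0.0))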
