Tuesday, December 15, 2009

The Correlation Coefficient as cosine theta

Mathematicians define the dot product between vectors  \vec{v}= (v_{1}, v_{2}, \, \ldots \, , v_{n}) and  \vec{w}= (w_{1}, w_{2}, \, \ldots \, , w_{n}) as

\vec{v} \cdot \vec{w} = v_{1} w_{1} + v_{2} w_{2} + \, \cdots \, + v_{n} w_{n}

On the other hand, the alternative geometric definition of the dot product, popular with physicists, is

\vec{v} \cdot \vec{w} = \left\|\vec{v}\right\| \, \left\|\vec{w}\right\| \, \cos \, \theta

so that

\cos \, \theta = \frac{\vec{v} \cdot \vec{w}}{\left\|\vec{v}\right\| \, \left\|\vec{w}\right\|}
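As a quick numeric check of this identity (a sketch using NumPy; the two vectors here are made-up examples):

```python
import numpy as np

# Two example vectors in R^3
v = np.array([1.0, 2.0, 2.0])
w = np.array([2.0, 0.0, 1.0])

# cos(theta): algebraic dot product divided by the product of the norms
cos_theta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Recover the angle itself (in radians)
theta = np.arccos(cos_theta)
```

Here v · w = 4, ||v|| = 3, and ||w|| = √5, so cos θ = 4/(3√5) ≈ 0.596.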

And statisticians define Pearson's correlation coefficient r so that

r = \frac {\sum (x_{i} - \bar{x})(y_{i} - \bar{y}) }  {\sqrt{\sum (x_{i} - \bar{x})^2}  \sqrt{ \sum (y_{i} - \bar{y})^2}}

Thus if we set  \vec{v} = (x_1 - \bar{x}, x_2 - \bar{x},\, \ldots \, , x_n - \bar{x}) and  \vec{w} = (y_1 - \bar{y}, y_2 - \bar{y},\, \ldots \, , y_n - \bar{y}) , then r = \cos \,\theta.
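This claim is easy to verify numerically. A small sketch, assuming NumPy and using made-up data, compares the cosine of the angle between the centered vectors with NumPy's built-in correlation coefficient:

```python
import numpy as np

# Sample paired data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Center the data: these are the vectors v and w from the text
v = x - x.mean()
w = y - y.mean()

# Pearson's r computed as a cosine
r_cos = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Compare with NumPy's built-in correlation coefficient
r_np = np.corrcoef(x, y)[0, 1]
```

The two values agree to machine precision.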

The idea is to think not of n ordered pairs (x1, y1), (x2, y2), ..., (xn, yn), but rather to think of two vectors in n-dimensional space. When the vectors are pointing in the same direction, the angle between them is zero and the correlation coefficient is cos 0 = 1. When the vectors point in opposite directions, the correlation coefficient is the cosine of a straight angle, r = -1. And when the vectors are orthogonal, the correlation coefficient is the cosine of a right angle, r = 0.

The only tricky part is that the two n-dimensional vectors are not \vec{x} and  \vec{y} themselves, the vectors containing all the x_{i} and y_{i} respectively.  Instead, the necessary two n-dimensional vectors are the mean-centered \vec{v} and \vec{w} defined above.

And nicely, the least-squares regression line for the (x_i , y_i ) data is y = mx + b, where  m= r \frac{\left\|\vec{w}\right\|}{\left\|\vec{v}\right\|} and b = \bar{y} - m \bar{x}.  (Notice that the variance \sigma_{x}^{2} = \frac{\vec{v} \cdot \vec{v}}{n} = \frac{\left\|\vec{v}\right\|^2}{n}, so m can also be written as  m= r \frac{\sigma_y}{\sigma_x}.)
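The slope formula can be checked against an ordinary least-squares fit; a sketch with NumPy and hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Centered vectors v and w, as in the text
v = x - x.mean()
w = y - y.mean()

r = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Slope via m = r * ||w|| / ||v||  (equivalently r * sigma_y / sigma_x)
m = r * np.linalg.norm(w) / np.linalg.norm(v)
b = y.mean() - m * x.mean()

# Compare against NumPy's least-squares line fit
m_ls, b_ls = np.polyfit(x, y, 1)
```

Both routes give the same line, since m = r ||w||/||v|| simplifies to (v · w)/||v||², the usual least-squares slope.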

One typically derives the least-squares regression line by finding m and b that minimize  \sum  (m x_i +b - y_i )^2.  But one can alternatively use the n-dimensional vector point of view, where the coefficients m and b correspond to the solution of the vector equation m\vec{x} + b\vec{1} = \hat{y}.  The vector \vec{1}= (1, \, 1, \, \ldots \, , \, 1) is the vector of all 1's and the vector \hat{y}  is the orthogonal projection of the vector  \vec{y} onto the space spanned by \vec{x} and \vec{1}.
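The projection point of view can also be verified directly: solving the least-squares system for (m, b) and checking that the residual \vec{y} - \hat{y} is orthogonal to both \vec{x} and \vec{1}. A sketch with NumPy and made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
ones = np.ones_like(x)

# Solve the least-squares problem A [m, b]^T ~ y, where A = [x | 1];
# the fitted values are then the projection of y onto span{x, 1}
A = np.column_stack([x, ones])
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# y_hat is the orthogonal projection of y onto the span of x and 1
y_hat = m * x + b * ones

# The residual should be perpendicular to both spanning vectors
residual = y - y_hat
```

The vanishing dot products residual · x and residual · 1 are exactly the normal equations of least squares.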
