56  Pearson correlation thoughts

Let’s have another look at the formula for computing Pearson’s r:

\[r_{xy} = \frac{\Sigma_{i=1}^N(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\Sigma_{i=1}^N(x_i-\overline{x})^2}\sqrt{\Sigma_{i=1}^N(y_i-\overline{y})^2}}\]

Note that the first step is simply to subtract the mean of variable \(x\) from each participant’s value for \(x\), and the mean of variable \(y\) from each participant’s value for \(y\). This step is referred to as “centring” the data. After subtracting the mean from each value, the resulting deviations themselves have a mean of 0. We could simply do this step in advance (or we might have values that happen to have a mean of 0). In either case, the equation simplifies to:

\[r_{xy} = \frac{\Sigma_{i=1}^N(x_i \times y_i)}{\sqrt{\Sigma_{i=1}^N(x_i^2)} \times \sqrt{\Sigma_{i=1}^N(y_i^2)}} = \frac{\Sigma_{i=1}^N(x_i \times y_i)}{\sqrt{(\Sigma_{i=1}^N(x_i^2)) \times (\Sigma_{i=1}^N(y_i^2))}}\]

Put very simply, you could think of the analysis as applying these steps to the (centred) data:

1. Numerator: multiply each pair of \(x\) and \(y\) values, then sum up the products.
2. Denominator: square each value, sum up the squares for each variable, multiply the two sums, and take the square root.
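If you want to follow along in R, here is a minimal sketch of these steps (the data values are made up for illustration); the result matches R’s built-in cor() function:

```r
x <- c(1, 3, 4, 8)
y <- c(2, 5, 5, 9)

# Step 1: centre both variables
xc <- x - mean(x)
yc <- y - mean(y)

# Numerator: multiply pairwise, then sum up
num <- sum(xc * yc)

# Denominator: square, sum up, multiply, square root
den <- sqrt(sum(xc^2) * sum(yc^2))

num / den   # manual Pearson's r
cor(x, y)   # the built-in function gives the same value
```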

But why do we have to square the numbers in the denominator first? Why don’t we simply do this:

\[r_{xy} = \frac{\Sigma_{i=1}^N(x_i \times y_i)}{\Sigma_{i=1}^Nx_i \times \Sigma_{i=1}^Ny_i}\]

Remember that one of the fundamental properties of the mean is that the sum of deviations from the mean is always 0. If we were to simply sum up the values for \(x\) and \(y\) without squaring them first, the sums would always be 0—which would not be very helpful… We can avoid this problem by squaring the values first. The fact that we need to calculate the square root later is a direct consequence of this step. The square root is required to “undo” the squaring of the values. Otherwise the denominator would be too large and we would never get a perfect correlation, even if there were one.
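You can check this property directly in R (again with made-up values):

```r
x <- c(2, 4, 7, 11)
xc <- x - mean(x)   # centred values: -4, -2, 1, 5

sum(xc)     # 0: summing the centred values would make a useless denominator
sum(xc^2)   # 46: squaring first preserves the information about spread
```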

Of course, to get a perfect (positive) correlation, the numerator and the denominator must be identical (because any number divided by itself is 1).

56.1 A perfect correlation

Examples in Shiny

You can try out and visualise the following examples in this Pearson correlation Shiny app.

Let’s look at a simple example. It’s hopefully easy to see that all of the following pairs of (x, y) values lie on a straight line:

   x    y
  -2   -2
  -1   -1
   1    1
   2    2

Therefore, their correlation must be 1 (a correlation of 1 means essentially that: all values lie on a straight line with a positive slope). If any one value is not on the line, the correlation must be less than 1.

Let’s calculate the numerator for these values. Remember: Multiply, then sum up.

   x    y   x*y
  -2   -2     4
  -1   -1     1
   1    1     1
   2    2     4
           ----
Sum:         10

Now, let’s calculate the denominator for these values. Remember: square, sum up, multiply, square root.

   x    y   x^2   y^2
  -2   -2     4     4
  -1   -1     1     1
   1    1     1     1
   2    2     4     4
           ----  ----
Sum:         10    10

Now, let’s multiply the sums and take the square root:

\[\sqrt{10 \times 10} = \sqrt{100} = 10\]

And \(\frac{10}{10}\) is of course 1.
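Here is the whole calculation in R, using the four value pairs above:

```r
x <- c(-2, -1, 1, 2)
y <- c(-2, -1, 1, 2)

sum(x * y)                  # numerator: 10
sqrt(sum(x^2) * sum(y^2))   # denominator: sqrt(10 * 10) = 10
cor(x, y)                   # 10 / 10 = 1
```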

56.2 A less than perfect correlation

Let’s swap two pairs of neighbouring y values (the first two and the last two) and see what happens:

   x    y
  -2   -1
  -1   -2
   1    2
   2    1

Note that the values are the same as before. Only the order has changed. Thus, the means of x and y must still be 0 (and no centring is required). Let’s see what happens to the correlation, though.

Let’s calculate the numerator for these values:

   x    y   x*y
  -2   -1     2
  -1   -2     2
   1    2     2
   2    1     2
           ----
Sum:          8

Now, our sum is only 8 (whereas it was 10 before).

Let’s see what happens to the denominator:

   x    y   x^2   y^2
  -2   -1     4     1
  -1   -2     1     4
   1    2     1     4
   2    1     4     1
           ----  ----
Sum:         10    10

Our sums are still 10, which makes sense: after all, we’re summing up the same values as before, just in a different order. As the sums are the same, the square root of the product of the sums is still 10.

As our numerator is now smaller than our denominator, our correlation must be smaller than 1. It’s easy to show that the correlation is now 0.8:

\[r_{xy} = \frac{8}{10} = 0.8\]

Thus, if the values are not on a straight line, the correlation must be less than 1.

The key point here is that the size of the numerator depends on how the data points covary. The size of the denominator, on the other hand, only depends on how the data points vary. Rearranging the number pairs thus only affects the numerator, not the denominator.
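In R, you can verify both the new numerator and the unchanged denominator:

```r
x <- c(-2, -1, 1, 2)
y <- c(-1, -2, 2, 1)        # same y values as before, reordered

sum(x * y)                  # numerator: 8 (was 10)
sqrt(sum(x^2) * sum(y^2))   # denominator: still 10
cor(x, y)                   # 8 / 10 = 0.8
```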

56.3 Another perfect correlation: Scaling the values

What happens if we multiply the \(y\) values by 2?

   x    y
  -2   -4
  -1   -2
   1    2
   2    4

Note that the means for both variables are still 0.

Let’s calculate the numerator for these values:

   x    y   x*y
  -2   -4     8
  -1   -2     2
   1    2     2
   2    4     8
           ----
Sum:         20

Now, our sum is 20.

Let’s see what happens to the denominator:

   x    y   x^2   y^2
  -2   -4     4    16
  -1   -2     1     4
   1    2     1     4
   2    4     4    16
           ----  ----
Sum:         10    40

Now, our sums are 10 and 40. Let’s multiply them and take the square root:

\[\sqrt{10 \times 40} = \sqrt{400} = 20\]

And \(\frac{20}{20}\) is of course 1.

As we can see, the correlation is 1 again. This is because we’ve simply scaled the values for \(y\). The correlation is not affected by scaling the values for \(x\) or \(y\) by a positive constant.
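A quick R check of this scale invariance, reusing the values from the first example:

```r
x <- c(-2, -1, 1, 2)
y <- c(-2, -1, 1, 2)

cor(x, y)        # 1
cor(x, 2 * y)    # still 1: rescaling y leaves r unchanged
cor(10 * x, y)   # still 1
```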

56.4 Mean-centring is required

Let’s look at another example. We’ll add 4 to all \(y\) values:

   x    y
  -2    2
  -1    3
   1    5
   2    6

The mean for x is still 0. However, the mean for y is now 4. If you plot these values, you’ll see that they still lie on a straight line. Thus, the correlation must still be 1.

Let’s see what happens if we calculate the correlation without centring the data.

Let’s calculate the numerator for these values:

   x    y   x*y
  -2    2    -4
  -1    3    -3
   1    5     5
   2    6    12
           ----
Sum:         10

The sum is still 10. However, the sum for the denominator is now different:

   x    y   x^2   y^2
  -2    2     4     4
  -1    3     1     9
   1    5     1    25
   2    6     4    36
           ----  ----
Sum:         10    74

Now, our sums are 10 and 74. Let’s multiply them and take the square root:

\[\sqrt{10 \times 74} = \sqrt{740} \approx 27.2\]

Clearly, \(\frac{10}{27.2}\) is not 1. This shows that mean-centring is required to get the correct result. If we first centre \(y\) by subtracting its mean of 4, we get back the values from the first example (\(-2, -1, 1, 2\)), and the correlation works out to 1 again.
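Here is the same comparison in R; note that cor() does the centring for you, so only the “manual” formula goes wrong:

```r
x <- c(-2, -1, 1, 2)
y <- c(2, 3, 5, 6)      # shifted up by 4, but still on a straight line

# Without centring, the naive formula underestimates the correlation:
sum(x * y) / sqrt(sum(x^2) * sum(y^2))     # 10 / 27.2, about 0.37

# Centring y first gives the correct result:
yc <- y - mean(y)
sum(x * yc) / sqrt(sum(x^2) * sum(yc^2))   # 1
cor(x, y)                                  # 1 (centring happens internally)
```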

56.5 Correlation as the slope of the regression line

You might come across the claim that the correlation is identical to the slope of the regression line. This, however, is only true for z-standardised variables.

Let’s look at the following example:

   x    y
  -2   -3
  -2   -1
   2    1
   2    3

If you plot a regression line for these non-standardised values, you’ll see that its slope is 1. However, the correlation for these data points is of course not 1 (as they do not lie on a straight line). Note that this is not an issue with centring the data: The mean for both x and y is 0. The issue is that the slope of the regression line can only be interpreted as the correlation coefficient if the variables are z-standardised.

We z-standardise the variables by subtracting the mean from each value and dividing the result by the standard deviation. As our means are 0, this simplifies to dividing each value by the standard deviation. The SD for x is 2.31 and the SD for y is 2.58 (these are sample SDs, calculated with \(N-1\)). If we divide each value by the respective SD, we get:

       x       y
  -0.866  -1.162
  -0.866  -0.387
   0.866   0.387
   0.866   1.162

If we were to plot the regression line for these values, its slope would indeed be identical to the correlation coefficient, which for this example is 0.894.
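You can check both claims in R; scale() performs the z-standardisation (it uses the sample SD, matching the values above):

```r
x <- c(-2, -2, 2, 2)
y <- c(-3, -1, 1, 3)

coef(lm(y ~ x))["x"]        # slope of the raw regression line: 1
cor(x, y)                   # 0.894: not the same as the raw slope

# After z-standardising, the slope equals the correlation:
zx <- as.numeric(scale(x))  # (x - mean(x)) / sd(x)
zy <- as.numeric(scale(y))
coef(lm(zy ~ zx))["zx"]     # 0.894, identical to cor(x, y)
```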

56.6 Guess the regression line

Here is a more advanced Shiny app that lets you guess the regression line for a set of random data points. The aim is to get as close as possible to the actual regression line by changing the slope and intercept of the default regression line (which has a slope of 1 and an intercept of 0). The smaller the residual sum of squares, the closer you are to the actual regression line.

Note that you can directly enter numbers into the slope and intercept fields (and don’t need to click on the up and down arrows loads of times).

Note also that you can even challenge your friends to a game of “Guess the regression line” by agreeing on a seed you will use for the randomisation. Smaller residual sum of squares wins!
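If you want to score a round by hand, the residual sum of squares is easy to compute in R. Here is a minimal sketch (the seed, data-generating model, and guesses are made up; this is not the app’s actual code):

```r
set.seed(42)                  # the seed you agreed on
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)

# Residual sum of squares for a guessed intercept and slope
rss <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

rss(0, 1)                     # the default line (intercept 0, slope 1)
rss(coef(lm(y ~ x))[1],
    coef(lm(y ~ x))[2])       # the actual regression line: minimal RSS
```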