Measuring correlation

Correlation measures how close two variables are to having a dependence relationship with each other. At first sight it looks simple, but there are two main problems:

  1. Apart from the obvious situations (e.g. correlation = 1), it is difficult to say whether two variables are correlated (e.g. correlation = 0.7). For instance, would you be able to say whether the variables X and Y from the following two plots are correlated?
  2. There are different measures of correlation, and they may not agree when comparing different distributions. As an example, which plot shows a higher correlation? The answer depends on how you measure it: with Pearson correlation you would pick A, whereas with Spearman correlation you would pick B.

Here, I will explain some of the different correlation measures you can use:

Pearson product-moment correlation coefficient

  • What does it measure? Only linear dependencies between the variables.
  • How is it obtained? By dividing the covariance of the two variables by the product of their standard deviations (it is defined only if both standard deviations are finite and nonzero):

\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_{X}\,\sigma_{Y}}

  • Properties:
  1. ρ (X,Y) = +1 : perfect direct (increasing) linear relationship (correlation).
  2. ρ (X,Y) = -1 : perfect decreasing (inverse) linear relationship (anticorrelation).
  3. In all other cases, ρ (X,Y) indicates the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated).
  4. Only gives a perfect value when X and Y are related by a linear function.
  • When is it useful? For the case of a linear model with a single independent variable, the coefficient of determination (R squared) is the square of r, Pearson’s product-moment coefficient.
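As a quick illustration (a minimal sketch using NumPy; the data and variable names are my own), the coefficient can be computed either straight from the definition above or with NumPy's correlation-matrix helper:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)  # noisy linear relationship

# From the definition: cov(X, Y) / (sigma_X * sigma_Y)
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# The same value from NumPy's correlation-matrix helper
r = np.corrcoef(x, y)[0, 1]

print(r_manual, r)  # both close to 1 for a strong linear relationship
```

Because the relationship is linear up to noise, both values come out close to +1; weakening the slope or increasing the noise pushes them toward 0.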

 

Spearman’s rank correlation coefficient:

  • What does it measure? How well the relationship between two variables can be described using a monotonic function (a function that only goes up or only goes down).
  • How is it obtained? As the Pearson correlation between the rank values of the two variables.

r_{s} = \rho_{\operatorname{rg}_{X},\operatorname{rg}_{Y}} = \frac{\operatorname{cov}(\operatorname{rg}_{X},\operatorname{rg}_{Y})}{\sigma_{\operatorname{rg}_{X}}\,\sigma_{\operatorname{rg}_{Y}}}

If all n ranks are distinct integers, it can be computed using the popular formula:

r_{s} = 1 - \frac{6\sum d_{i}^{2}}{n(n^{2}-1)}

where d_i is the difference between the two ranks of each observation.

  • Properties:
  1. rs (X,Y) = +1: X and Y are related by an increasing monotonic function.
  2. rs (X,Y) = -1: X and Y are related by a decreasing monotonic function.
  3. The Spearman correlation increases in magnitude as X and Y become closer to being perfect monotone functions of each other.
  • When is it useful? It is appropriate for both continuous and discrete ordinal variables, and it can be used to look for non-linear (monotonic) dependence relationships.
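To make the rank formula concrete, here is a small sketch (the two hand-picked rankings are my own example) comparing the popular d_i formula with SciPy's implementation:

```python
import numpy as np
from scipy.stats import spearmanr

# Two rankings of 5 items with no ties (all ranks are distinct integers)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# Popular formula: r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
n = len(x)
d = x - y                      # the values here are already ranks
rs_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

rs_scipy, _ = spearmanr(x, y)
print(rs_formula, rs_scipy)    # both 0.8
```

With sum(d_i²) = 4 and n = 5, the formula gives 1 − 24/120 = 0.8, matching `spearmanr`, which ranks the data and then applies Pearson correlation to the ranks.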

Kendall’s tau coefficient

  • What does it measure? The ordinal association between two measured quantities.
  • How is it obtained?

\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2}

Any pair of observations (xi, yi) and (xj, yj) is said to be concordant if the ranks of both elements agree, which happens when xi − xj and yi − yj have the same sign. If their signs differ, the pair is considered discordant.

  • Properties:
  1. τ (X,Y) = +1: The agreement between the two rankings is perfect (i.e., the two rankings are the same)
  2. τ (X,Y) = -1: The disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other)
  3. If X and Y are independent, then we would expect the coefficient to be approximately zero.
  • When is it useful? It is appropriate for both continuous and discrete ordinal variables, and it can be used to look for non-linear (monotonic) dependence relationships.
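The concordant/discordant counting above can be done directly for a small example (the data is my own, chosen to have exactly one discordant pair) and checked against SciPy:

```python
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

x = np.array([1, 2, 3, 4])
y = np.array([1, 3, 2, 4])

# Count concordant / discordant pairs straight from the definition
concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n = len(x)
tau_manual = (concordant - discordant) / (n * (n - 1) / 2)

tau_scipy, _ = kendalltau(x, y)
print(tau_manual, tau_scipy)   # both 2/3: 5 concordant, 1 discordant pair
```

With no ties, `kendalltau` reduces to exactly the formula above; with ties, SciPy applies the tau-b correction to the denominator.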

Distance correlation:

  • What does it measure? Both linear and nonlinear association between two random variables or random vectors.
  • How is it obtained? By dividing the variables' distance covariance by the product of their distance standard deviations:

\operatorname{dCor}(X,Y) = \frac{\operatorname{dCov}(X,Y)}{\sqrt{\operatorname{dVar}(X)\,\operatorname{dVar}(Y)}}

The squared sample distance covariance is defined as:

\operatorname{dCov}_{n}^{2}(X,Y) := \frac{1}{n^{2}}\sum_{j=1}^{n}\sum_{k=1}^{n}A_{j,k}\,B_{j,k}

Where:

A_{j,k} := a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot},\qquad B_{j,k} := b_{j,k} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot}

a_{j,k} = \|X_{j} - X_{k}\|,\qquad b_{j,k} = \|Y_{j} - Y_{k}\|,\qquad j,k = 1,2,\ldots,n,

where ‖·‖ denotes the Euclidean norm.

  • Properties:
  1. dCor (X,Y) = 0 if and only if the random vectors are independent.
  2. dCor (X,Y) = 1: Perfect dependence between the two distributions.
  3. dCor (X,Y) is defined for X and Y in arbitrary dimension.
  • When is it useful? It is appropriate for finding any kind of dependence relationship between the two variables, and it also works when X and Y have different dimensions.
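The double-centering recipe above translates almost line by line into NumPy. The following is a minimal sketch (the function name and test data are my own), checked on a case where Pearson correlation fails to see any dependence but distance correlation does not:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation via double-centered distance matrices."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]          # treat 1-D input as n samples in 1 dimension
    if y.ndim == 1:
        y = y[:, None]

    # Pairwise Euclidean distance matrices a_{j,k} and b_{j,k}
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)

    # Double centering: subtract row and column means, add the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()

    dcov2 = (A * B).mean()      # dCov_n^2(X, Y)
    dvar2_x = (A * A).mean()    # dVar_n^2(X)
    dvar2_y = (B * B).mean()    # dVar_n^2(Y)
    return np.sqrt(dcov2 / np.sqrt(dvar2_x * dvar2_y))

# A symmetric parabola: Pearson is ~0, yet Y is fully determined by X
x = np.linspace(-1.0, 1.0, 101)
y = x**2
print(np.corrcoef(x, y)[0, 1])      # ~0: no linear relationship
print(distance_correlation(x, y))   # clearly positive: dependence detected
```

For heavy use or large n, a dedicated implementation such as the `dcor` package is preferable, since the O(n²) distance matrices above get expensive quickly.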
