Introducing Distance Correlation, a Superior Correlation Metric.

A modern-day metric that addresses the number one problem of Pearson's correlation

Terence Shin, MSc, MBA

Feb 12, 2021

Table of Contents

Introduction
What is Distance Correlation?
Mathematics behind Distance Correlation
Implementing Distance Correlation in Python

Introduction

I think we can agree that one of the most commonly used measures in business is correlation, more specifically, Pearson’s correlation.

To recap, Pearson’s correlation measures the strength of the linear relationship between two variables, and that in itself is already a problem because there are MANY relationships that are not linear.

And so, for the sake of an example, you might conclude that variable X and revenue are unrelated because their Pearson’s correlation is close to zero, when in fact they are related, just not linearly.
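To make this concrete, here is a minimal sketch using only the Python standard library. The data are hypothetical: y is completely determined by x (y = x squared), yet Pearson’s correlation comes out at zero because the relationship is symmetric rather than linear. The helper name pearson is just for illustration.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: y depends entirely on x, but not linearly
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson(x, y)
print(f"Pearson correlation: {r:.3f}")
```

The positive and negative halves of x cancel each other out in the covariance, so a metric that only looks for a straight line sees nothing here.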

And this is where distance correlation comes in!

What is Distance Correlation?

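Distance correlation is a measure of the strength of association between two random variables that captures non-linear as well as linear dependence. It goes beyond Pearson’s correlation in two ways: it can detect associations that are not linear, and it works multi-dimensionally, so X and Y can be random vectors, even of different dimensions. Distance correlation ranges from 0 to 1, where 0 holds if and only if X and Y are independent, and 1 implies that X and Y are related by an exact linear transformation (a shift, a scaling, and an orthogonal transformation such as a rotation or reflection).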

The image below shows how distance correlation measurements compare to Pearson’s correlation.

The formula for distance correlation is as follows:
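In the standard notation, with dCov denoting the distance covariance and dVar the distance variance (both are built up step by step in the mathematics section), the definition can be written as:

```latex
\operatorname{dCor}(X, Y) =
  \frac{\operatorname{dCov}(X, Y)}
       {\sqrt{\operatorname{dVar}(X)\,\operatorname{dVar}(Y)}}
```

It has the same shape as Pearson’s correlation (a covariance divided by the square root of a product of variances), only with distance-based versions of each quantity.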

Distance correlation is not the correlation between the distances themselves. It is a correlation between the scalar products that make up the "double centered" distance matrices.

If that didn’t make sense to you, let’s dive deeper into the math.

Mathematics behind Distance Correlation

Let (X_k, Y_k), k = 1, 2, …, n, be a statistical sample from a pair of random variables, X and Y.

First, we compute the n-by-n distance matrices (a_{j,k}) and (b_{j,k}) containing all pairwise distances.
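Written out in the standard notation, with the double bars denoting the Euclidean norm (for one-dimensional data this is simply the absolute difference), the entries are:

```latex
a_{j,k} = \lVert X_j - X_k \rVert, \qquad
b_{j,k} = \lVert Y_j - Y_k \rVert, \qquad
j, k = 1, 2, \ldots, n
```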

Then we take the double centered distances.
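In the standard notation, double centering subtracts the row mean and the column mean from each entry and adds back the grand mean. Here a bar with a dot in place of an index means "averaged over that index":

```latex
A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot},
\qquad
B_{j,k} = b_{j,k} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot}
```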

From a visual perspective, by taking the double centered distances, we are transforming the matrix representation (the left) to the diagram on the right (double centered matrix).

Why do we do this?

A covariance is an average of products of deviations from a mean, in other words a cross-moment. Raw distances aren’t deviations from a mean, so we first have to turn them into centered quantities. That is what double centering achieves: after subtracting the row and column means and adding back the grand mean, every row and every column of the matrix averages to zero.
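The effect of double centering is easy to check numerically. The sketch below uses plain Python lists and a hypothetical one-dimensional sample; the function names distance_matrix and double_center are made up for illustration and are not part of any library.

```python
def distance_matrix(values):
    """Pairwise absolute distances |v_j - v_k| for a 1-D sample."""
    return [[abs(vj - vk) for vk in values] for vj in values]

def double_center(d):
    """Subtract row and column means, then add back the grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]
    col_means = [sum(d[j][k] for j in range(n)) / n for k in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [d[j][k] - row_means[j] - col_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]

x = [1.0, 2.0, 4.0, 7.0]  # hypothetical sample
A = double_center(distance_matrix(x))

# Every row and every column of the centered matrix sums to (numerically) zero
row_sums = [sum(row) for row in A]
col_sums = [sum(A[j][k] for j in range(len(A))) for k in range(len(A))]
print(row_sums, col_sums)
```

Because the rows and columns are centered, the entries of A now behave like deviations from a mean, which is what a covariance-style average needs.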

Lastly, we compute the arithmetic average of the products A_{j,k} B_{j,k} over all n² entries to get the squared sample distance covariance:
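In the standard notation, with A and B the double centered matrices of the X and Y samples, this average is:

```latex
\operatorname{dCov}_n^2(X, Y) =
  \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k}\, B_{j,k}
```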

The distance variance is simply the distance covariance of two identical variables. It is the square root of the following:
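Setting Y equal to X in the distance covariance gives, in the standard notation:

```latex
\operatorname{dVar}_n^2(X) = \operatorname{dCov}_n^2(X, X) =
  \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k}^2
```

Taking square roots and dividing the distance covariance by the square root of the product of the two distance variances then yields the sample distance correlation.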

Implementing Distance Correlation in Python

Convinced that this is the metric for you? You’re in luck because there’s a Python library for distance correlation called dcor (installable with pip install dcor), making it super easy to implement.

Here’s an example code snippet:

import dcor

def distance_correlation(a, b):
    """Return the sample distance correlation of a and b (a value in [0, 1])."""
    return dcor.distance_correlation(a, b)

With this function, you can easily calculate the distance correlation of two samples, a and b.
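If you want to see what happens under the hood, the steps from the mathematics section can be sketched from scratch. This is a minimal illustration for one-dimensional samples using only the standard library, not the dcor implementation; the name distance_correlation_scratch and the helper functions are hypothetical, and the data are made up.

```python
import math

def centered_distances(values):
    """Double centered pairwise distance matrix of a 1-D sample."""
    n = len(values)
    d = [[abs(vj - vk) for vk in values] for vj in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # d is symmetric, so the column means equal the row means
    return [
        [d[j][k] - row_means[j] - row_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]

def dcov_squared(A, B):
    """Average of the entrywise products of two centered matrices."""
    n = len(A)
    return sum(A[j][k] * B[j][k] for j in range(n) for k in range(n)) / n ** 2

def distance_correlation_scratch(x, y):
    """Sample distance correlation of two equal-length 1-D samples."""
    A = centered_distances(x)
    B = centered_distances(y)
    dcov2 = dcov_squared(A, B)
    # dVar^2(X) * dVar^2(Y); its square root normalises dCov^2
    denom = math.sqrt(dcov_squared(A, A) * dcov_squared(B, B))
    if denom == 0:
        return 0.0  # a constant sample has no distance variance
    return math.sqrt(max(dcov2, 0.0) / denom)

x = [-3, -2, -1, 0, 1, 2, 3]
d_linear = distance_correlation_scratch(x, [2 * v + 1 for v in x])
d_quadratic = distance_correlation_scratch(x, [v ** 2 for v in x])
print(d_linear, d_quadratic)
```

For the exactly linear data the result is 1 (up to floating point error), and for the quadratic data, where Pearson’s correlation is zero, the result is well above zero. In practice the library call is the better choice, since this sketch builds full n-by-n matrices in pure Python and only handles one-dimensional input.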

Thanks for Reading!

I hope you found this interesting! Personally, I’ve found this extremely useful in my day-to-day, and I hope you find it useful too.

There are definitely pros and cons to this metric and I would love to hear your thoughts. What do you think about a correlation metric that can detect non-linear relationships but is bounded between 0 and 1, so it cannot tell you the direction of the relationship?

As always, I wish you the best in your learning endeavors!

Not sure what to read next? I’ve picked another article for you:

10 Statistical Concepts You Should Know For Data Science Interviews

and another one!

21 Tips for Every Data Scientist for 2021

Terence Shin

If you enjoyed this, follow me on Medium for more
Interested in collaborating? Let’s connect on LinkedIn
Sign up for my email list here!

