Why are simple bivariate correlations not sufficient for establishing causation?

A correlation measures the degree of relationship between two variables. A set of data can be positively correlated, negatively correlated, or not correlated at all. If one set of values tends to increase as the other increases, the correlation is called positive.

If one set of values tends to decrease as the other increases, the correlation is called negative.

If the change in values of one set doesn't affect the values of the other, then the variables are said to have "no correlation" or "zero correlation."

A causal relation between two events exists if the occurrence of the first causes the other. The first event is called the cause and the second event is called the effect. A correlation between two variables does not imply causation. Conversely, a causal relationship between two variables usually, though not always, shows up as a correlation between them.

Example:

A study shows that there is a negative correlation between a student's anxiety before a test and the student's score on the test. But we cannot say that the anxiety causes a lower score on the test; there could be other reasons—the student may not have studied well, for example. So the correlation here does not imply causation.

However, consider the positive correlation between the number of hours you spend studying for a test and the grade you get on the test. Here, causation is plausible as well: spending more time studying tends to result in a higher grade.

One of the most commonly used measures of correlation is the Pearson product-moment correlation coefficient, or Pearson's correlation coefficient. It is computed using the formula

$$ r_{xy} = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left( n \sum x^2 - \left( \sum x \right)^2 \right) \left( n \sum y^2 - \left( \sum y \right)^2 \right)}} $$

The value of Pearson's correlation coefficient varies from −1 to +1, where −1 indicates a perfect negative linear correlation, +1 indicates a perfect positive linear correlation, and 0 indicates no linear correlation.
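As a minimal sketch, Pearson's coefficient can be computed directly from the raw sums that appear in the formula. The function name `pearson_r` and the sample data (`hours`, `scores`) are made up for illustration; they are not taken from any study.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from raw sums, term by term as in the formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when one variable is constant")
    return numerator / denominator

# Hypothetical data: hours studied and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 79]
r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

A value close to +1 here only says that the two lists move together; it says nothing about why they do.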

Day by day we find situations in which it is essential to correctly understand the underlying mechanisms from the analysis of data generated by all kinds of systems. We have very clear examples in the field of cybersecurity, where it is fundamental to point to the origin of a threat, given some signs of attack; or in the field of Industry 4.0, where it is equally decisive to know what to do and where to act when a failure is detected in a system or a productive process.

It is often said that correlation does not imply causation, although, inadvertently, we sometimes make the mistake of supposing that there is a causal link between two variables that follow a certain common pattern. In this post we will review some implications that must be taken into account when working with causal analysis techniques. These procedures are key in determining with high certainty the origin of a problem when we only observe some symptoms.

Warming up: Regression and causation

Okay, correlation does not imply causation. But, does a linear regression imply causation?

The quick answer is no. It is easy to find examples of unrelated data that, after a regression calculation, pass all sorts of statistical tests. A popular example that illustrates the concept of data-driven “causality” is the apparent relationship between the declining number of pirates and the rising global average temperature.

In view of the data, do we need more pirates to cool down the planet? Or maybe pirates are extremely sensitive to the increase in global average temperature? :-)

Sometimes it is said that this type of “causality” is only in the eye of the beholder. In a humorous tone, there is a popular website from Tyler Vigen called Spurious Correlations where you can find more funny examples.

For time series there is the concept of causality in Granger’s sense. The purpose of the so-called Granger test is to determine whether the current and past behavior of a time series A helps predict the behavior of a time series B. Its main limitation is that it only finds “predictive causality”, and moreover, it can lead to misleading conclusions like the ones shown above. However, it is still used as a basic tool for causal analysis due to its computational simplicity.
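The idea behind the Granger test can be sketched as a comparison of two least-squares models: one that predicts B from its own past, and one that also uses the past of A. The code below is a simplified illustration with a single lag and synthetic data, not a full implementation of the test (it omits lag selection and p-values). The function name `granger_f_stat` and the series `a` and `b` are assumptions made for this example.

```python
import numpy as np

def granger_f_stat(cause, effect, lag=1):
    """F-statistic: do past values of `cause` improve an autoregression of `effect`?"""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    n = len(effect)
    y = effect[lag:]
    # Lagged copies of each series, aligned with y.
    own = np.column_stack([effect[lag - k:n - k] for k in range(1, lag + 1)])
    other = np.column_stack([cause[lag - k:n - k] for k in range(1, lag + 1)])
    ones = np.ones((n - lag, 1))
    restricted = np.hstack([ones, own])           # effect's own past only
    unrestricted = np.hstack([ones, own, other])  # plus the past of `cause`

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(y) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lag) / (rss_u / dof)

# Synthetic data: b follows a with a one-step delay, a is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = np.zeros(500)
b[1:] = 0.8 * a[:-1] + 0.3 * rng.normal(size=499)

f_a_to_b = granger_f_stat(a, b)
f_b_to_a = granger_f_stat(b, a)
print(f"A -> B: F = {f_a_to_b:.1f}   B -> A: F = {f_b_to_a:.1f}")
```

A large F-statistic in one direction only says that the past of one series improves the prediction of the other; a hidden common driver could produce the same pattern.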

Before continuing, let’s look at the difference between correlation and regression, two types of analysis that are sometimes confused. Correlation analysis quantifies the degree to which two variables are related. Regression analysis, in contrast, tries to find the best-fitting line (or curve) to predict the value of a dependent variable Y from the known value of an independent variable X. In a correlation, both variables are treated equally (the correlation coefficient is the same if they are swapped). In a regression, however, it does matter which is X and which is Y, since in general the function that best predicts Y from X does not match the function that best predicts X from Y.
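The asymmetry of regression versus the symmetry of correlation can be checked numerically. The following sketch uses synthetic data (the coefficient 2.0 and the noise level are arbitrary assumptions) and NumPy's `corrcoef` and `polyfit`.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)  # assumed linear relation plus noise

# Correlation is symmetric: swapping the variables changes nothing.
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# Regression is not: the line for Y given X is not the inverse
# of the line for X given Y.
slope_y_on_x = np.polyfit(x, y, 1)[0]
slope_x_on_y = np.polyfit(y, x, 1)[0]

print(f"r(x, y) = {r_xy:.3f}, r(y, x) = {r_yx:.3f}")
print(f"slope of y on x = {slope_y_on_x:.3f}")
print(f"slope of x on y = {slope_x_on_y:.3f}, inverse of first = {1 / slope_y_on_x:.3f}")
```

For simple linear regression the two slopes multiply to r squared, so they are inverses of each other only when the correlation is perfect.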

Does no correlation imply no causation?

This is not true in general, and any control system serves as a counterexample. Control is by definition impossible without causal relationships, but to control something means, roughly speaking, that some variable remains constant, which implies that this variable will not be correlated with other variables, including those that cause it to be constant.

An example of this is Milton Friedman’s thermostat, illustrated here with a car. As you know, when you press the accelerator of a vehicle, it goes faster, and if the vehicle has to go uphill, it goes slower. But suppose that this information is unknown to a passenger who watches the driver trying to maintain a constant speed on a mountain road. The passenger will see the accelerator pedal going up and down, and the vehicle going downhill and uphill. If the driver is skilled and the car powerful enough, the passenger will notice that the vehicle speed stays constant. Thus, if the passenger just looks at these variables, they could easily conclude that the position of the pedal and the slope of the road have no effect on the speed.

There is no way to avoid this misinterpretation by means of multivariate regression techniques among speed, pedal position and slope. This is because the position of the pedal and the slope are perfectly collinear. In addition, the observed correlation between pedal position and speed, as well as between slope and speed, is zero.
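A small simulation makes the effect visible. It is only a sketch under strong assumptions: the driver compensates perfectly, speed responds linearly to pedal and slope with equal and opposite coefficients, and a tiny measurement noise is added so that the correlation with speed is defined at all (a perfectly constant speed has zero variance).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

slope = rng.normal(size=n)   # road gradient, positive means uphill
pedal = slope.copy()         # assumed perfect driver: presses exactly enough
noise = rng.normal(scale=0.01, size=n)

# Assumed causal model: pedal speeds the car up, slope slows it down.
speed = 100.0 + 1.0 * pedal - 1.0 * slope + noise

r_pedal_speed = np.corrcoef(pedal, speed)[0, 1]
r_slope_speed = np.corrcoef(slope, speed)[0, 1]
r_pedal_slope = np.corrcoef(pedal, slope)[0, 1]

print(f"corr(pedal, speed) = {r_pedal_speed:+.3f}")
print(f"corr(slope, speed) = {r_slope_speed:+.3f}")
print(f"corr(pedal, slope) = {r_pedal_slope:+.3f}")
```

Both causes are real by construction, yet each is essentially uncorrelated with speed, and the two are perfectly collinear with each other, so a regression cannot separate their effects.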

Does causation imply correlation?

We know that there can be multiple explanations for correlation. But let’s reverse the implication: Does causation imply correlation? It seems it does. But again, the answer is that it does not have to.

In the first place, the existence of causality does not imply that there is some kind of linear correlation (the way in which correlation between two variables is usually imagined). Specifically, the correlation coefficient (r) reflects how one variable changes as the other variable changes: if r is positive, there is a tendency for one variable to increase as the other increases; conversely, if r is negative, there is a tendency for one variable to increase as the other decreases. However, the correlation coefficient does not provide information about the slope of the relationship nor many other aspects of nonlinear relationships, as shown in the following sets of (x, y) points.
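A quick numerical check of this point, assuming the simplest nonlinear case: y is fully determined by x through y = x², with x spread symmetrically around zero.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric grid around zero
y = x ** 2                       # y is completely determined by x

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r between x and x^2: {r:.6f}")
```

The coefficient is zero up to floating-point error, because the increasing half of the parabola exactly cancels the decreasing half.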

Secondly, the existence of causality does not even imply that some kind of complex correlation between two variables can be measured.

To illustrate this with an example, suppose that we toss two coins in succession and that a system turns on a light bulb only when both show the same result (two heads or two tails). We can say that both coins are responsible for turning the light on or off (clearly, there is causation). Nevertheless, if we look at just one of the coins and the state of the bulb, we cannot establish any kind of correlation or statistical dependence.

In probability theory and information theory, the concept of mutual information measures the dependence between two random variables. That is, it measures how much knowing one of these variables reduces uncertainty about the other (it is, therefore, closely linked to the concept of entropy). For nonlinear relationships such as those in the lower row in the previous figure, Y could be perfectly caused by X, but the correlation between both variables is zero. What can be asserted in those cases is that causation implies high mutual information.
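The two-coin example can be worked through exactly by enumerating the four equally likely outcomes. The helper `mutual_information` below is a hypothetical name written for this sketch; it assumes every listed outcome has the same probability.

```python
from collections import Counter
from itertools import product
from math import log2

# All four equally likely outcomes: (coin 1, coin 2, bulb on?).
outcomes = [(c1, c2, int(c1 == c2)) for c1, c2 in product([0, 1], repeat=2)]

def mutual_information(pairs):
    """Mutual information in bits for a list of equally likely (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

mi_one_coin = mutual_information([(c1, bulb) for c1, _, bulb in outcomes])
mi_both_coins = mutual_information([((c1, c2), bulb) for c1, c2, bulb in outcomes])

print(f"I(coin 1; bulb)      = {mi_one_coin:.1f} bits")
print(f"I(both coins; bulb)  = {mi_both_coins:.1f} bits")
```

One coin alone carries 0 bits about the bulb, while the pair carries the full 1 bit, so the dependence only appears when the causes are considered jointly.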

Transitivity and causal bidirectionality

If we have a probabilistic causal chain such as A → B → C, that is, where A causes B, and where B causes C, can we infer that A causes C?

Again, intuition can play a trick on us, and the answer (expected at this point) is: not necessarily. The formal explanation is that probabilistic causal relationships are guaranteed to be transitive only if the so-called Markov condition is met. This condition is related to the concept of conditional independence and states that, given the present, the future does not depend on the past.
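A toy chain shows how dependence can hold at each link and still vanish between the endpoints. The construction is an assumption made for illustration: A and an independent noise bit N are fair coin flips, B = 2A + N, and C = B mod 2. All four combinations of (A, N) are equally likely, so enumerating them once gives exact population correlations.

```python
import numpy as np

# The four equally likely combinations of (A, N).
a = np.array([0, 0, 1, 1], dtype=float)
noise_bit = np.array([0, 1, 0, 1], dtype=float)

b = 2 * a + noise_bit  # B depends on A (and on the noise bit)
c = b % 2              # C depends on B, but only on its noise part

r_ab = np.corrcoef(a, b)[0, 1]
r_bc = np.corrcoef(b, c)[0, 1]
r_ac = np.corrcoef(a, c)[0, 1]

print(f"corr(A, B) = {r_ab:.3f}")
print(f"corr(B, C) = {r_bc:.3f}")
print(f"corr(A, C) = {r_ac:.3f}")
```

A is correlated with B (about 0.89) and B with C (about 0.45), yet A and C are independent: the part of B that A influences is not the part that C responds to.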

In general, causal intransitivity may be due to various reasons [1]. One of the most common is causal chunking. Let’s see two examples. The chain “exercising” → “becoming thirsty” → “drinking water” can be understood as a unitary mechanism or chunk, so it is possible to assume a causal transitivity relationship (that is, “exercising” causes “drinking water”). However, let’s consider this popular proverb with butterfly effect reminiscences:

For want of a nail the shoe was lost,
For want of a shoe the horse was lost,
For want of a horse the rider was lost,
For want of a rider the battle was lost,
For want of a battle the kingdom was lost,
And all for the want of a horseshoe nail.

While each causal link in the chain may seem –to some extent– plausible, the overall causal connection between the first cause and the last effect seems too weak, which, from an analytical standpoint, leads to causal intransitivity.

Besides that, causality is not necessarily one-way. One interesting aspect of causal relationships is the possibility of bidirectional or reciprocal causation, giving rise to feedback mechanisms. For example, in dynamic biological systems with predators and prey, predator numbers affect prey numbers and, at the same time, prey numbers (i.e. food supply) affect predator numbers.

Under what conditions does correlation imply causation?

Consider the following example: one can perform an experiment on identical twins who are known to consistently get the same grades on their exams. One twin is sent to study for six hours while the other is sent to the amusement park. If their test scores suddenly diverge, this could be taken as strong evidence that studying produces a causal effect on test scores. In this case, correlation between studying and test scores would almost certainly imply causation.

But correlation is not a sufficient condition for causation. In the previous example, it could be argued that twins always cheat on exams using a device that tells them the answers, and the twin who goes to the amusement park loses his device; hence the low grade.

A good way to clarify all this is to think of the structure of the Bayesian network that may be generating the observable data, as proposed by Pearl in his book Causality [2]: the key is to look for possible hidden variables. If there is some hidden variable on which the observed data depend, then correlation would not imply causation (we would speak of a spurious relationship). If we are able to rule out any hidden variable, then a causal relationship can be inferred.

Let’s clarify it with another example. A study found that children who sleep with the light on are more likely to develop early myopia, which suggests the conclusion that sleeping with the light on causes myopia. However, later research found a strong link between parental myopia and the development of childhood myopia. In addition, it was discovered that myopic parents were more likely to leave a light on in their children’s bedroom. Parental myopia was the hidden variable: it influences both the night light and the child’s myopia.
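The effect of such a hidden variable can be reproduced with synthetic data. In the sketch below all probabilities are invented for illustration and are not taken from the myopia studies; by construction the night light has no effect at all on the child's myopia.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hidden variable: whether the parents are myopic (assumed 30% of families).
parent_myopic = rng.random(n) < 0.3
# Myopic parents are assumed more likely to leave a light on ...
night_light = rng.random(n) < np.where(parent_myopic, 0.7, 0.2)
# ... and their children more likely to be myopic, regardless of the light.
child_myopic = rng.random(n) < np.where(parent_myopic, 0.5, 0.1)

def corr(u, v):
    """Pearson correlation between two boolean arrays."""
    return float(np.corrcoef(u.astype(float), v.astype(float))[0, 1])

overall = corr(night_light, child_myopic)
within_myopic = corr(night_light[parent_myopic], child_myopic[parent_myopic])
within_other = corr(night_light[~parent_myopic], child_myopic[~parent_myopic])

print(f"all families:        r = {overall:+.3f}")
print(f"myopic parents only: r = {within_myopic:+.3f}")
print(f"other parents only:  r = {within_other:+.3f}")
```

The correlation is clearly positive across all families and disappears once the data are split by the hidden variable, which is the signature of a spurious relationship.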

Practical considerations

Answering the question in the title, what are the implications if there is a measurable correlation between A and B? In this post we have not delved into causal analysis techniques or the maths that allow modeling or measuring uncertainty. In any case, the recommendation is to build a comprehensive list covering all possible options and methodically review each of them to determine which one is most likely. So, if A is correlated with B, then:

• A could cause B

• B could cause A

• C could cause both A and B

• Data could be defective

• It could be a coincidence

Reichenbach’s Common Cause Principle (CCP) [3] states that if an improbable coincidence has occurred, there must exist a common cause. This means that strong correlations have causal explanations. For example, suppose that in a room two light bulbs suddenly go out. It is considered improbable that, by chance, both bulbs have blown at the same time, so we will look for the cause in a common burned fuse or in a general interruption of the electrical supply. Thus, the improbable coincidence is explained as the result of a common cause.

Causal analysis and Big Data

Here at Gradiant we work on diverse Big Data Analytics projects where it is essential to correctly identify situations involving causation. In fact, Root Cause Analysis (RCA) has become a buzzword in the IT field. In this context, the use of modeling and causal inference techniques is key to effectively investigate and solve problems which are the cause of incidents affecting one or more services. The challenge is double when combining Big Data and real time, since algorithms not only have to provide high certainty in their outcomes, but also have to respond with the highest possible speed: a fast and accurate notification can mean saving a lot of time and money.

Actually, these techniques are not intended to replace human analysts but to be a powerful tool for them. Training an artificial intelligence with the precision of an expert, but capable of analyzing huge amounts of data, is an exciting challenge that we will address in a future post.

[1] Samuel G. B. Johnson and Woo-Kyoung Ahn, “Causal Networks or Causal Islands? The Representation of Mechanisms and the Transitivity of Causal Judgment”. Cognitive Science, 2015.

[2] Judea Pearl, “Causality: models, reasoning and inference”. Cambridge University Press, 2009.

[3] Hans Reichenbach, “The Direction of Time”. Dover Publications Inc., 2003.

This article was originally published in Spanish at https://www.gradiant.org/blog/claves-analisis-causal/

Are bivariate correlations sufficient to make a causal claim?

Correlation tests for a relationship between two variables. However, seeing two variables moving together does not necessarily mean we know whether one variable causes the other to occur. This is why we commonly say “correlation does not imply causation.”

Why does a correlation not prove causation?

The first reason why correlation may not equal causation is that there is some third variable (Z) that affects both X and Y at the same time, making X and Y move together. The technical term for this missing (often unobserved) variable Z is “omitted variable”.

What are two reasons that multiple regression analyses cannot completely establish causation?

First, multiple regression designs cannot establish temporal precedence. Second, researchers cannot control for variables they do not measure: an unmeasured third variable could be responsible for the relationship.

Does a regression model imply causation? Explain why or why not.

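Regression deals with dependence amongst variables within a model, but it does not by itself imply causation. For example, rainfall affects crop yield, and there are data that support this. However, this is a one-way relationship: crop yield cannot affect rainfall. A regression of rainfall on crop yield would fit the same data just as well, so the direction of the effect comes from our knowledge of the mechanism, not from the fitted model.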