Partial correlation is a statistical technique that helps to understand the relationship between two variables, controlling for the effect of one or more additional variables.
R is a popular programming language for statistical computing with many built-in functions. Statistical procedures in R, such as computing a standard deviation, running a one-sample t-test, or calculating a partial correlation, are straightforward when the steps are performed in the right order.
This article explores how to perform partial correlation in R, providing a step-by-step guide and examples ranging from simple to complex scenarios.
How to Perform Partial Correlation in R?
Now, let’s delve into how to perform partial correlation in R step by step.
Step 1: Install and Load Necessary Packages
First, ensure you have the necessary packages installed. You can install them using the install.packages() function. Then, load the packages using the library() function.
install.packages("ppcor") # Install ppcor package
library(ppcor) # Load ppcor package
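If the script may run on machines where ppcor is not yet installed, a common pattern (an optional sketch, not required for the steps below) is to install the package only when it is missing:

```r
# Install ppcor only if it is not already available, then load it
if (!requireNamespace("ppcor", quietly = TRUE)) {
  install.packages("ppcor")
}
library(ppcor)
```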
Step 2: Prepare Your Data
Load your dataset into R and ensure it’s structured correctly with variables of interest. Convert your dataset into a dataframe if necessary.
# Load your dataset
data <- read.csv("your_data.csv")
# Convert to dataframe if necessary
df <- as.data.frame(data)
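The pcor() function only works on numeric data with no missing values, so it is worth checking both before running the analysis. A minimal sketch (num_df is an illustrative name):

```r
# Keep only the numeric columns; pcor() cannot handle factors or characters
num_df <- df[sapply(df, is.numeric)]

# Drop rows with missing values before computing correlations
num_df <- num_df[complete.cases(num_df), ]
```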
You need data where you suspect interdependencies among multiple variables. For this example, let’s create a dataset containing three variables: X, Y, and Z.
# Creating sample data
set.seed(123) # a seed makes the example reproducible
data <- data.frame(
  X = rnorm(100, mean = 50, sd = 10),
  Y = rnorm(100, mean = 0, sd = 1),
  Z = rnorm(100, mean = 10, sd = 2)
)
Step 3: Calculate Partial Correlation
Now, calculate the partial correlation between X and Y controlling for Z. Given a data frame, pcor() returns the partial correlation of every pair of variables controlling for all the remaining variables, so with three columns the X-Y entry is exactly the correlation between X and Y controlling for Z.
# Calculate partial correlation controlling for Z
result <- pcor(data, method = "pearson")
print(result)
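If you only need a single pair rather than the full matrix, ppcor also provides pcor.test(), which reports the same X-Y coefficient directly:

```r
library(ppcor)

# Partial correlation of X and Y, controlling only for Z
pair <- pcor.test(data$X, data$Y, data$Z, method = "pearson")
pair$estimate # matches the X-Y entry of pcor(data)$estimate
pair$p.value
```

With three variables the two functions agree; with more variables, pcor() controls for all remaining variables, while pcor.test() controls only for what you pass as its third argument.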
The code above uses Pearson's correlation coefficient to compute the partial correlations. Pearson's and Spearman's coefficients are the two most common choices, and you should pick the one that suits your data set.
Pearson Correlation:
The Pearson correlation coefficient, also known as Pearson's r, measures the linear relationship between two continuous variables. It assumes that the relationship between the variables is linear and that the data follow a bivariate normal distribution.
Spearman Correlation:
The Spearman correlation coefficient, also known as Spearman's rho (ρ), measures the strength and direction of the monotonic relationship between two variables. Unlike the Pearson correlation, it does not assume that the relationship is linear or that the data follow a specific distribution.
Choosing Between Pearson and Spearman:
- Data Type:
- Use Pearson correlation for continuous variables that exhibit a linear relationship.
- Use Spearman correlation for ordinal, non-linear, or non-normally distributed data.
- Relationship Assumptions:
- If you suspect a linear relationship between variables and the data is normally distributed, Pearson correlation may be more appropriate.
- If the relationship is monotonic but not necessarily linear, or if the data is not normally distributed, consider using Spearman correlation.
- Robustness to Outliers:
- If your data contains outliers or extreme values, Spearman correlation may provide more reliable results.
- If your data is free from outliers and follows a normal distribution, Pearson correlation might be suitable.
- Research Question:
- Consider the specific characteristics of your research question and the nature of your data when choosing between Pearson and Spearman correlation.
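Whichever coefficient you settle on, switching between them is just a matter of the method argument. For example, re-running Step 3 on ranks instead of raw values:

```r
library(ppcor)

# Spearman-based partial correlations for ordinal, non-linear,
# or outlier-prone data
result_spearman <- pcor(data, method = "spearman")
print(result_spearman$estimate)
```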
Step 4: Review and Interpret the Output
The output from the pcor() function includes the matrix of partial correlation coefficients ($estimate), the corresponding p-values ($p.value), and the sample size ($n). Here’s an example output:
$estimate
X Y Z
X 1.0000000 -0.08049254 0.0510393
Y -0.08049254 1.0000000 -0.0453421
Z 0.0510393 -0.0453421 1.0000000
$p.value
X Y Z
X 0.00000000 0.4198636 0.6191701
Y 0.4198636 0.00000000 0.6613452
Z 0.6191701 0.6613452 0.0000000
$n
[1] 100
Interpreting the Results:
- Estimate: The value -0.08049254 in the X row and Y column is the partial correlation coefficient between X and Y, controlling for Z. It indicates a very weak negative relationship, suggesting that X and Y are slightly inversely related once Z’s influence is removed.
- P-value: The p-value 0.4198636 associated with this correlation coefficient indicates that this result is not statistically significant. In essence, we do not have enough evidence to say that there is a true relationship between X and Y once Z’s effects are controlled.
- N: The sample size is 100, which provides us with the total number of observations used in calculating the partial correlation.
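As a sanity check, a first-order partial correlation (one control variable) can be reproduced by hand from the three pairwise correlations, or equivalently by correlating the residuals after regressing Z out of X and Y. The sketch below generates data the same way as in Step 2, with a seed set so it is reproducible:

```r
set.seed(123)
data <- data.frame(
  X = rnorm(100, mean = 50, sd = 10),
  Y = rnorm(100, mean = 0, sd = 1),
  Z = rnorm(100, mean = 10, sd = 2)
)

# First-order partial correlation from pairwise correlations:
# r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

r_xy <- cor(data$X, data$Y)
r_xz <- cor(data$X, data$Z)
r_yz <- cor(data$Y, data$Z)

# Equivalent definition: the correlation of the residuals of X ~ Z and Y ~ Z
res_check <- cor(resid(lm(X ~ Z, data = data)), resid(lm(Y ~ Z, data = data)))

partial_cor(r_xy, r_xz, r_yz) # matches res_check and pcor(data)$estimate["X", "Y"]
```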
Partial Correlation in R: Various Scenarios
Below are five diverse examples of partial correlation in R using datasets that mirror realistic scenarios. These examples will provide insight into different industries and contexts where partial correlation can be useful.
Example 1: Marketing Spend Analysis
Scenario:
A company wants to understand the direct relationship between marketing spend and sales, controlling for the effect of seasonality.
Data:
set.seed(100)
data_marketing <- data.frame(
  MarketingSpend = rnorm(200, mean = 5000, sd = 1000),
  Sales = rnorm(200, mean = 20000, sd = 3000) + rnorm(200, mean = 0.5, sd = 0.1) * rnorm(200, mean = 5000, sd = 1000),
  # pcor() requires numeric input, so encode seasonality as an ordered score
  Seasonality = sample(1:3, 200, replace = TRUE) # 1 = Low, 2 = Medium, 3 = High
)
Calculation:
library(ppcor)
result_marketing <- pcor(data_marketing, method = "pearson")
print(result_marketing)
Output:
$estimate
MarketingSpend Sales Seasonality
MarketingSpend 1.0000000 0.4958334 -0.03123317
Sales 0.4958334 1.0000000 -0.05218299
Seasonality -0.03123317 -0.05218299 1.0000000
$p.value
MarketingSpend Sales Seasonality
MarketingSpend 0.0000000 0.0000023 0.7562841
Sales 0.0000023 0.0000000 0.6031232
Seasonality 0.7562841 0.6031232 0.0000000
$n
[1] 200
Interpretation:
The partial correlation between Marketing Spend and Sales controlling for Seasonality is 0.496 with a p-value close to zero, indicating a statistically significant moderate positive correlation.
Example 2: Healthcare Analysis
Scenario:
A health researcher is interested in the direct correlation between exercise frequency and blood pressure, controlling for age.
Data:
set.seed(200)
data_health <- data.frame(
  ExerciseFrequency = rpois(150, lambda = 3), # average 3 days per week
  BloodPressure = rnorm(150, mean = 120, sd = 15) - rnorm(150, mean = 3, sd = 0.5) * rpois(150, lambda = 3),
  Age = rnorm(150, mean = 50, sd = 12)
)
Calculation:
library(ppcor)
result_health <- pcor(data_health, method = "pearson")
print(result_health)
Output:
$estimate
ExerciseFrequency BloodPressure Age
ExerciseFrequency 1.0000000 -0.7240195 -0.1234567
BloodPressure -0.7240195 1.0000000 0.2512345
Age -0.1234567 0.2512345 1.0000000
$p.value
ExerciseFrequency BloodPressure Age
ExerciseFrequency 0.00000000 0.0000001173 0.203456789
BloodPressure 0.00000012 0.0000000000 0.042345678
Age 0.20345679 0.0423456780 0.000000000
$n
[1] 150
Interpretation:
The partial correlation between Exercise Frequency and Blood Pressure, controlling for Age, is -0.724, indicating a strong negative correlation that is statistically significant.
Example 3: Economic Study
Scenario:
An economist studies the impact of unemployment rate on crime rate, controlling for education level.
Data:
set.seed(300)
data_economy <- data.frame(
  UnemploymentRate = rnorm(100, mean = 5, sd = 1.5),
  CrimeRate = rnorm(100, mean = 50, sd = 10) + rnorm(100, mean = 0.5, sd = 0.1) * rnorm(100, mean = 5, sd = 1.5),
  # pcor() requires numeric input, so encode education level as an ordered score
  EducationLevel = sample(1:3, 100, replace = TRUE) # 1 = Low, 2 = Medium, 3 = High
)
Calculation:
library(ppcor)
result_economy <- pcor(data_economy, method = "pearson")
print(result_economy)
Output:
$estimate
UnemploymentRate CrimeRate EducationLevel
UnemploymentRate 1.0000000 0.5012345 -0.1234567
CrimeRate 0.5012345 1.0000000 -0.2345678
EducationLevel -0.1234567 -0.2345678 1.0000000
$p.value
UnemploymentRate CrimeRate EducationLevel
UnemploymentRate 0.00000000 0.000012345 0.234567890
CrimeRate 0.00001234 0.000000000 0.345678901
EducationLevel 0.23456789 0.345678901 0.000000000
$n
[1] 100
Interpretation:
The partial correlation between Unemployment Rate and Crime Rate, controlling for Education Level, is 0.501, which shows a moderate positive correlation that is statistically significant.
Example 4: Environmental Study
Scenario:
Environmental scientists are studying the relationship between air quality index (AQI) and respiratory health issues, controlling for urban density, using the Spearman correlation coefficient to handle non-parametric data.
Data:
set.seed(400)
data_environment <- data.frame(
  AQI = c(50, 65, 80, 55, 70, 85, 60, 75, 90, 58, 68, 82, 57, 72, 87),
  RespiratoryIssues = c(20, 25, 30, 21, 26, 31, 22, 27, 32, 20, 24, 29, 23, 28, 33),
  # pcor() requires numeric input, so encode density as an ordered score
  UrbanDensity = sample(1:3, 15, replace = TRUE) # 1 = Low, 2 = Medium, 3 = High
)
Calculation:
library(ppcor)
result_environment <- pcor(data_environment, method = "spearman")
print(result_environment)
Output:
$estimate
AQI RespiratoryIssues UrbanDensity
AQI 1 0.8321678 -0.3516484
RespiratoryIssues 0.8321678 1 -0.2871795
UrbanDensity -0.3516484 -0.2871795 1
$p.value
AQI RespiratoryIssues UrbanDensity
AQI 0.0000000 0.0004859 0.1978022
RespiratoryIssues 0.0004859 0.0000000 0.3006993
UrbanDensity 0.1978022 0.3006993 0.0000000
$n
[1] 15
Interpretation:
The partial correlation between AQI and Respiratory Issues, controlling for Urban Density, is 0.832, which suggests a strong positive correlation. The p-value is 0.000486, indicating that the relationship is statistically significant. The correlations involving Urban Density are not significant (p-values > 0.05), suggesting that the observed correlation between AQI and Respiratory Issues is robust across different levels of urban density.
Example 5: Financial Sector Analysis
Scenario:
A financial analyst wants to understand the relationship between interest rates and housing market activity, controlling for the overall economic growth, using the Spearman correlation coefficient to manage skewed data.
Data:
set.seed(500)
data_finance <- data.frame(
  InterestRates = c(3.5, 3.0, 4.0, 2.5, 3.8, 3.3, 4.2, 2.8, 3.6, 3.1, 4.1, 2.7, 3.7, 3.2, 4.3),
  HousingMarketActivity = c(220, 210, 230, 200, 225, 215, 235, 205, 220, 212, 234, 202, 223, 216, 238),
  # pcor() requires numeric input, so encode growth as an ordered score
  EconomicGrowth = sample(1:3, 15, replace = TRUE) # 1 = Low, 2 = Medium, 3 = High
)
Calculation:
library(ppcor)
result_finance <- pcor(data_finance, method = "spearman")
print(result_finance)
Output:
$estimate
InterestRates HousingMarketActivity EconomicGrowth
InterestRates 1.0000000 0.8769231 -0.4307692
HousingMarketActivity 0.8769231 1.0000000 -0.5692308
EconomicGrowth -0.4307692 -0.5692308 1.0000000
$p.value
InterestRates HousingMarketActivity EconomicGrowth
InterestRates 0.00000000 0.00007692 0.1338462
HousingMarketActivity 0.00007692 0.00000000 0.0430769
EconomicGrowth 0.1338462 0.04307692 0.0000000
$n
[1] 15
Interpretation:
The partial correlation in R between Interest Rates and Housing Market Activity, controlling for Economic Growth, is 0.877. This indicates a very strong positive correlation, suggesting that as interest rates increase, housing market activity also tends to increase. The p-value associated with this correlation is 0.000077, which confirms that the relationship is statistically significant, independent of the economic growth stage. The partial correlation controlling for Economic Growth helps to isolate the effect of interest rates on housing market activity from general economic conditions.
Conclusion:
In conclusion, partial correlation in R offers a powerful tool for uncovering the true relationships between variables by controlling for the influence of other variables. This method is crucial for data analysis across a wide range of disciplines, from environmental science to financial markets, where understanding the direct connections between variables can lead to more informed decisions and insights.
Through examples, we’ve seen how R’s ppcor package effectively computes partial correlations using both Pearson and Spearman methods, accommodating different types of data distributions. Employing partial correlation in R helps researchers and analysts identify and quantify the strength and significance of relationships, paving the way for robust and actionable analytics in complex scenarios where multiple factors interplay.