Correlation and regression are statistical methods used to examine relationships between variables. Correlation measures the strength and direction of a relationship, while regression models how one variable depends on another so that its value can be predicted.

Correlation
Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. It is represented by the correlation coefficient (r), which ranges from −1 to +1.
Types of Correlation
Correlation can be classified into different types based on the direction of the relationship between variables.

Positive Correlation: A positive correlation occurs when both variables move in the same direction. As one variable increases, the other also increases, and vice versa.
Zero or No Correlation: There is no linear relationship between the variables. A change in one variable is not associated with a consistent increase or decrease in the other. The correlation coefficient is r = 0 (or very close to 0 in sample data).
Negative Correlation: A negative correlation occurs when the variables move in opposite directions. As one variable increases, the other decreases.
Types of Correlation Coefficients
Some of the common correlation coefficients are:
- Pearson Correlation Coefficient (r): Measures the linear relationship between two continuous variables.
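- Spearman's Rank Correlation (ρ): Measures the strength of a monotonic relationship by computing the correlation between the ranks of the two variables.
- Kendall's Tau (τ): Measures rank association by comparing the number of concordant and discordant pairs of observations.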
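As a rough illustration of how a rank-based coefficient works, the sketch below computes Kendall's tau in its simplest form (tau-a), which is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. It assumes there are no tied values. The function name `kendall_tau_a` and the sample data are made up for this example.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumes no ties in x or y; tied data would need the tau-b variant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = 0
    # Compare every pair of observations once.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:
            concordant += 1   # both variables moved the same way
        elif direction < 0:
            discordant += 1   # the variables moved in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [52, 60, 58, 71, 80]
print(kendall_tau_a(hours, score))
```

Here nine of the ten pairs are concordant and one is discordant, so tau is (9 - 1) / 10 = 0.8.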
Regression
Regression is a statistical technique used to examine the relationship between a dependent variable and one or more independent variables. It helps predict the value of the dependent variable based on the values of the independent variables.
Types of Regression
Regression can be classified into different types based on the nature of the relationship between variables.
Simple Linear Regression: Simple linear regression examines the relationship between one independent variable and one dependent variable using a straight-line equation.
Multiple Linear Regression: Multiple linear regression analyzes the effect of two or more independent variables on a single dependent variable.
Polynomial Regression: Polynomial regression models a nonlinear relationship between variables using a polynomial equation.
Logistic Regression: Logistic regression is used when the dependent variable is categorical, such as yes/no or pass/fail outcomes.
Formulas
Formulas related to Correlation and Regression are:
Correlation Formula: The Pearson correlation coefficient (r) is calculated using the formula:
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}
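The formula above translates almost term by term into code. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` and the sample data are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, computed directly from the formula."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator terms: sums of squared deviations.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 3))
```

For this data the numerator is 6 and the denominator is the square root of 10 × 6 = 60, so r is roughly 0.775, a fairly strong positive linear relationship.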
Simple Linear Regression Formula: The regression line is represented as:
Y = a + bX
Where,
- Y is the dependent variable,
- X is the independent variable,
- a is the intercept,
- b is the slope.
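A minimal sketch of how the intercept and slope can be estimated by least squares, using b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄. The function name `fit_line` and the data points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("slope is undefined when all X values are equal")
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)       # intercept and slope
print(a + b * 5)  # prediction for X = 5
```

Once a and b are known, a prediction is just a substitution into Y = a + bX.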
Solved Problems
Problem 1: Given two variables, X and Y, calculate the Pearson correlation coefficient.
- X: [1, 2, 3, 4, 5]
- Y: [2, 4, 6, 8, 10]
Solution:
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}
Calculating the mean of X and Y:
- Mean of X (\bar{X}) = 3
- Mean of Y (\bar{Y}) = 6
Calculating the Pearson correlation coefficient:
r = \frac{(1-3)(2-6) + (2-3)(4-6) + (3-3)(6-6) + (4-3)(8-6) + (5-3)(10-6)}{\sqrt{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2} \sqrt{(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2}}
= \frac{(-2)(-4) + (-1)(-2) + 0 + (1)(2) + (2)(4)}{\sqrt{4 + 1 + 0 + 1 + 4} \sqrt{16 + 4 + 0 + 4 + 16}}
= \frac{20}{\sqrt{10} \cdot \sqrt{40}} = \frac{20}{\sqrt{400}} = \frac{20}{20} = 1
So, the Pearson correlation coefficient is r = 1, indicating a perfect positive linear relationship between X and Y.
Problem 2: Calculate the Spearman's rank correlation coefficient for the following data:
- X: [10, 20, 30, 40, 50]
- Y: [30, 40, 10, 20, 50]
Solution:
1. Rank the data points in X and Y:
- Ranks of X: [1, 2, 3, 4, 5]
- Ranks of Y: [3, 4, 1, 2, 5]
2. Calculate the difference between ranks (d_i = rank of Y − rank of X):
- Differences (d_i): [2, 2, -2, -2, 0]
3. Calculate the square of the differences:
- Squared differences (d_i^2): [4, 4, 4, 4, 0]
4. Sum the squared differences:
\sum d_i^2 = 4 + 4 + 4 + 4 + 0 = 16
5. Use the Spearman's rank correlation formula:
r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
Here, n = 5:
r_s = 1 - \frac{6 \cdot 16}{5(5^2 - 1)} = 1 - \frac{96}{120} = 1 - 0.8 = 0.2
So, the Spearman's rank correlation coefficient is r_s = 0.2, indicating a weak positive rank correlation between X and Y.
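The steps of this solution can be sketched in code as follows. This is a simplified version that assumes no tied values, which is the case in which the formula r_s = 1 − 6Σd_i² / (n(n² − 1)) applies directly. The helper names `ranks` and `spearman_rs` are made up for this example.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rs(x, y):
    """Spearman's rank correlation via the sum of squared rank differences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    sum_d2 = sum((ry - rx) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


x = [10, 20, 30, 40, 50]
y = [30, 40, 10, 20, 50]
print(ranks(y))                     # ranks of Y
print(round(spearman_rs(x, y), 3))  # approximately 0.2
```

With tied values, ranks are usually averaged and r_s is computed as the Pearson correlation of the ranks instead, which this sketch does not handle.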
Problem 3: Given data points, predict the value of Y for X = 6 using the equation Y = a + bX.
- Data points: (1, 2), (2, 4), (3, 6), (4, 8), (5, 10)
Solution:
Calculate the slope (b) and intercept (a):
b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}
\bar{X} = 3, \quad \bar{Y} = 6
b = \frac{(1-3)(2-6) + (2-3)(4-6) + (3-3)(6-6) + (4-3)(8-6) + (5-3)(10-6)}{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2} = \frac{20}{10} = 2
a = \bar{Y} - b\bar{X} = 6 - 2 \cdot 3 = 0
Equation: Y = 0 + 2X
For X = 6:
Y = 2 × 6 = 12
So, the predicted value of Y for X = 6 is 12.
Problem 4: Given the following data, predict the value of Y for X1 = 3 and X2 = 4:
Data:
- (X1, X2, Y)
- (1, 2, 3)
- (2, 3, 4)
- (3, 4, 5)
- (4, 5, 6)
- (5, 6, 7)
Solution:
The multiple linear regression equation is Y = a + b1X1 + b2X2.
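In this data set X2 = X1 + 1 in every row, so the two predictors are perfectly collinear and the coefficients are not uniquely determined; statistical software (e.g., Excel, R) may report a warning or drop one predictor. One set of coefficients that fits the data exactly is: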
- b1 = 0.5
- b2 = 0.5
- a = 1.5
So, the equation becomes:
Y = 1.5 + 0.5X1 + 0.5X2
For X1 = 3 and X2 = 4:
Y = 1.5 + 0.5 × 3 + 0.5 × 4 = 1.5 + 1.5 + 2 = 5
So, the predicted value of Y for X1 = 3 and X2 = 4 is 5.
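The sketch below checks the equation against the five data rows. It also illustrates a caveat of this particular data set: because X2 = X1 + 1 in every row, more than one set of coefficients reproduces the data exactly, so the individual coefficients should not be over-interpreted. The helper names `predict` and `fits_exactly` and the alternative coefficients are assumptions introduced for this example.

```python
# Rows of (X1, X2, Y) from the problem statement.
data = [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7)]


def predict(a, b1, b2, x1, x2):
    """Evaluate Y = a + b1*X1 + b2*X2."""
    return a + b1 * x1 + b2 * x2


def fits_exactly(a, b1, b2, rows, tol=1e-9):
    """True if the equation reproduces every Y value within tol."""
    return all(abs(predict(a, b1, b2, x1, x2) - y) <= tol
               for x1, x2, y in rows)


# Coefficients used in the solution.
print(fits_exactly(1.5, 0.5, 0.5, data))
# A different (hypothetical) set that fits just as well,
# because X2 carries no information beyond X1 here.
print(fits_exactly(2.0, 1.0, 0.0, data))
# Both give the same prediction for X1 = 3, X2 = 4.
print(predict(1.5, 0.5, 0.5, 3, 4), predict(2.0, 1.0, 0.0, 3, 4))
```

Predictions at points that follow the same pattern (X2 = X1 + 1) agree across all such coefficient sets, which is why the answer Y = 5 is still well defined.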
Problem 5: Given the following data, determine the probability of Y being 1 for X = 4:
Data:
- (X, Y)
- (1, 0)
- (2, 0)
- (3, 1)
- (4, 1)
- (5, 1)
Solution:
The logistic regression model is:
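\log\left(\frac{p}{1-p}\right) = a + bX
Here p is the probability that Y = 1 and \log is the natural logarithm. The five data points are perfectly separated (Y = 0 for X ≤ 2 and Y = 1 for X ≥ 3), so the fitted coefficients depend on the software and on any regularization it applies. For this example, take: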
- b = 1.1
- a = -3
So, the equation becomes:
\log\left(\frac{p}{1-p}\right) = -3 + 1.1X
For X = 4:
\log\left(\frac{p}{1-p}\right) = -3 + 1.1 \times 4 = 1.4
Solving for p:
\frac{p}{1-p} = e^{1.4} \approx 4.055
p \approx \frac{4.055}{1 + 4.055} \approx 0.802
So, the probability of Y being 1 for X = 4 is approximately 0.802, or 80.2%.
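The conversion from log-odds to probability can be sketched in a few lines. The function name `logistic_probability` is hypothetical; the coefficients are the ones used in this solution and are passed in as given, not estimated by the code.

```python
import math


def logistic_probability(a, b, x):
    """Return p such that log(p / (1 - p)) = a + b*x."""
    log_odds = a + b * x
    # Equivalent to e^z / (1 + e^z), written in the usual sigmoid form.
    return 1.0 / (1.0 + math.exp(-log_odds))


p = logistic_probability(a=-3, b=1.1, x=4)
print(round(p, 3))  # approximately 0.802
```

Evaluating the same function at other X values shows the S-shaped curve: the probability is below 0.5 while a + bX is negative and above 0.5 once it becomes positive.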