The MGS Blog

Friday, January 26, 2018

Pearson's Correlation Coefficient

Linear correlation: the presence and strength of a linear relationship between two quantitative variables. It can be tested by calculating the correlation coefficient (r) between the variables. (Linear regression is the related technique of fitting the line itself.) This approach is one of many tools for data analysis. A caveat, however: correlations don't always correspond to actual, causal, linked relationships; they may simply be coincidental.

Pearson's correlation coefficient is calculated as the ratio of the covariance of two sets of paired data values to the product of their standard deviations. The pairs are easily represented visually as a scatter plot, a cloud of points along two axes (x and y), and the coefficient measures how closely that cloud follows a straight line. This test assumes that a linear dependence can be fitted between pairs of x and y values on the scatter plot (review the Wikipedia article for examples of false correlations). Note that it ignores the order in which the pairs were recorded (for example, if a third property such as time is recorded). It simply focuses on whether the ordered value pairs exhibit correlation. But remember, correlation is not causation.
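Written out, the standard textbook definition looks like the following. The symbols are not from the original post and are assumptions for illustration: n is the number of pairs, x̄ and ȳ are the sample means, and s_x and s_y are the sample standard deviations.

```latex
r \;=\; \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
  \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
              \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The 1/(n-1) factors in the covariance and in the two standard deviations cancel, which is why the second form has no n in it.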

The correlation coefficient, known as r, summarises how tightly the points cluster around a straight line best-fitted to the cloud plot. The magnitude of r is a measure of the closeness of fit between the pairs and the fitted trend line. It may vary from -1 to 1. Values closer to +/-1 suggest a very close linear relationship between the values in each pair. Values closer to 0 suggest little or no linear relationship between pairs. A correlation coefficient with a positive sign indicates that when x is large y tends to be large, and when x is small y tends to be small. A correlation coefficient with a negative sign indicates that when x is large y tends to be small, and when x is small y tends to be large.
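A minimal sketch of how the sign and magnitude of r behave, using made-up illustrative values (not the post's data). The variable names are assumptions chosen for this example.

```r
# Illustrative values only (not from the post's data set).
x <- 1:10

# y rises exactly in step with x: r is +1 (up to rounding)
r_up <- cor(x, 2 * x + 1)

# y falls exactly in step with x: r is -1 (up to rounding)
r_down <- cor(x, -3 * x + 5)

# y depends on x, but symmetrically around the middle of x,
# so there is no linear trend and r is 0 (up to rounding)
r_none <- cor(x, (x - 5.5)^2)

print(c(up = r_up, down = r_down, none = r_none))
```

The last case is a reminder that r near 0 means no linear relationship, not necessarily no relationship at all.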

Example:


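| Year | Unemployment rate (x) | IMF Japan Position (y) |
|------|-----------------------|------------------------|
| 1997 | 3.4 | 1,464.13 |
| 1998 | 4.1 | 1,936.85 |
| 1999 | 4.7 | 8,539.35 |
| 2000 | 4.7 | 9,281.27 |
| 2001 | 5 | 9,294.58 |
| 2002 | 5.4 | 8,015.22 |
| 2003 | 5.2 | 8,109.09 |
| 2004 | 4.7 | 8,942.11 |
| 2005 | 4.4 | 11,300.41 |
| 2006 | 4.1 | 12,028.59 |
| 2007 | 3.8 | 12,431.28 |
| 2008 | 4 | 11,587.69 |
| 2009 | 5.1 | 10,562.95 |
| 2010 | 5.1 | 10,320.52 |
| 2011 | 4.6 | 11,509.82 |
| 2012 | 4.3 | 14,631.47 |
| 2013 | 4 | 15,023.61 |
| 2014 | 3.6 | 15,239.68 |
| 2015 | 3.4 | 15,177.52 |
| 2016 | 3.1 | 27,204.34 |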

Figure: scatter plot with the correlation trend line fitted.
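As a rough sketch, the example can be reproduced in R straight from the table values, without the CSV file. The data frame and its column names (`year`, `unemployment`, `imf_position`) are assumptions made for this example and differ from the CSV's column names. Using `lm()` for the trend line is also an assumption, since the post does not say how its line was fitted.

```r
# Values typed in from the example table.
japan <- data.frame(
  year = 1997:2016,
  unemployment = c(3.4, 4.1, 4.7, 4.7, 5.0, 5.4, 5.2, 4.7, 4.4, 4.1,
                   3.8, 4.0, 5.1, 5.1, 4.6, 4.3, 4.0, 3.6, 3.4, 3.1),
  imf_position = c(1464.13, 1936.85, 8539.35, 9281.27, 9294.58,
                   8015.22, 8109.09, 8942.11, 11300.41, 12028.59,
                   12431.28, 11587.69, 10562.95, 10320.52, 11509.82,
                   14631.47, 15023.61, 15239.68, 15177.52, 27204.34)
)

x <- japan$unemployment
y <- japan$imf_position

# Pearson's r by hand: covariance divided by the product
# of the two standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same quantity from the built-in function
r_builtin <- cor(x, y)

# Least-squares trend line: y = intercept + slope * x
fit <- lm(imf_position ~ unemployment, data = japan)

# Scatter plot with the fitted line drawn over it
plot(x, y,
     xlab = "Unemployment rate (x)",
     ylab = "IMF Japan Position (y)")
abline(fit)

print(c(manual = r_manual, builtin = r_builtin))
print(coef(fit))
```

The two values of r should agree, and the slope of the fitted line always has the same sign as r.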


Notes on using a spreadsheet:
See the spreadsheet in the sandbox area (drive link)

Notes on using R:
Getting started with R. The software first.
Download and install R from CRAN http://cran.r-project.org/
Download and install RStudio for the desktop (the same people also operate Shiny) http://www.rstudio.com/products/rstudio/download/
R reference card (PDF) by Tom Short (more can be found under Short Documents and Reference Cards here)

Steps
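library(readr)  # provides read_csv()
japan <- read_csv("simplecorrelationexercise.csv")  # load the exercise data
head(japan)     # first few rows
names(japan)    # column names
plot(japan$`Unemployment rate (x)`, japan$`IMF Japan Position (y)`)  # scatter plot
summary(japan)  # summary statistics for each column
cor(japan$`Unemployment rate (x)`, japan$`IMF Japan Position (y)`)   # Pearson's r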

Notes on sources:
Statistics Bureau of Japan (http://www.stat.go.jp/english/)

Unemployment FAQs (http://www.stat.go.jp/english/data/roudou/qa-1.htm)
Historical data (http://www.stat.go.jp/english/data/roudou/lngindex.htm) Report a-1
(file Japan_EmploymentRatesHistorical_lt01-a10.xls)

IMF's Holdings of Currency (Holdings Rate) SDR Million
IMF Japan Position at Dec31/year e.g. https://goo.gl/tB37m2
(file IMFHoldingYen_1984-2016.csv)

Historical Statistics of Japan http://www.stat.go.jp/english/data/chouki/index.htm

OECD data on Japan https://data.oecd.org/
Measuring the Digital Economy: A New Perspective (link to pdf)

Footnote:
The p-value is a probability used to judge how confident we can be that a relationship between two variables holds in the wider population and not only in the sample. It is the probability of seeing a correlation at least as strong as the observed one if the null hypothesis were true.
Null hypothesis: that there is no relationship between the two variables.
The p-value is expressed as a percentage or decimal. A small p-value is evidence against the null hypothesis; a large p-value means the data give no grounds to reject it.
Typically we interpret the result as: Significant: <=5%; Marginally significant: <=10%; Insignificant: >10%
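A hedged sketch of how the footnote's thresholds might be applied in R. It uses `cor.test()` from the base `stats` package, which tests the null hypothesis that the true correlation is zero. The vectors are typed in from the example table. The `verdict` labels simply mirror the thresholds listed above and are a convention, not a strict rule.

```r
# Unemployment rate (x) and IMF Japan Position (y), 1997 to 2016,
# typed in from the example table.
x <- c(3.4, 4.1, 4.7, 4.7, 5.0, 5.4, 5.2, 4.7, 4.4, 4.1,
       3.8, 4.0, 5.1, 5.1, 4.6, 4.3, 4.0, 3.6, 3.4, 3.1)
y <- c(1464.13, 1936.85, 8539.35, 9281.27, 9294.58,
       8015.22, 8109.09, 8942.11, 11300.41, 12028.59,
       12431.28, 11587.69, 10562.95, 10320.52, 11509.82,
       14631.47, 15023.61, 15239.68, 15177.52, 27204.34)

# Two-sided test of "no linear relationship" (true correlation = 0)
result <- cor.test(x, y, method = "pearson")

r <- unname(result$estimate)  # the sample correlation coefficient
p <- result$p.value           # the p-value as a decimal

# Apply the conventional thresholds from the footnote
verdict <- if (p <= 0.05) {
  "significant"
} else if (p <= 0.10) {
  "marginally significant"
} else {
  "insignificant"
}

print(result)
cat("r =", r, " p =", p, " ->", verdict, "\n")
```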