Pearson's Correlation Coefficient

Linear correlation: the presence and strength of a linear relationship between two quantitative variables. A linear relationship can be tested by calculating the correlation coefficient (r) between the variables. This approach is one of many tools for data analysis. A caveat, however: correlations don't always correspond to actual causal relationships; they may simply be coincidental.

Pearson's correlation coefficient is calculated as the ratio of the covariance of two sets of paired data values to the product of their standard deviations. The paired data are easily represented visually as a scatter plot, with the data pairs plotted along two axes (x and y). In essence, the coefficient measures how closely the cloud of points follows a fitted straight line. The test assumes that a linear dependence can be fitted between the pairs of x and y values on the scatter plot (review the Wikipedia article for examples of false correlations). Note that it ignores sequence (for example, when a third property such as time is recorded). It simply focuses on whether the ordered pairs of values exhibit correlation. But remember, correlation is not causation.
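Written out, the standard textbook definition for n data pairs is shown below. The symbols (x-bar and y-bar for the sample means, s_x and s_y for the sample standard deviations) are notation introduced here for illustration, not taken from the original post.

```latex
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The order of the pairs does not appear anywhere in the sums, which is why the coefficient ignores sequence.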

The correlation coefficient, known as r, is somewhat like a normalised measure of the distances of the points from a straight line best-fitted to the cloud plot. The magnitude of r measures how closely the pairs fit the trend line. It may vary from -1 to 1. Values closer to +/- 1 suggest a very close linear relationship between the values in the data pairs. Values closer to 0 suggest little or no linear relationship between pairs. A coefficient with a positive sign indicates that when x is large y tends to be large, and when x is small y tends to be small. A coefficient with a negative sign indicates that when x is large y tends to be small, and when x is small y tends to be large.
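A quick way to get a feel for the sign and magnitude of r is to try it on a few tiny made-up vectors. The sketch below uses invented numbers (not the Japan data), and the variable names are hypothetical.

```r
# Made-up illustrative data, not taken from the example table.
x      <- c(1, 2, 3, 4, 5)
y_up   <- 2 * x + 1          # rises exactly in step with x
y_down <- 10 - 3 * x         # falls exactly in step with x
y_flat <- c(2, 1, 3, 1, 2)   # no linear trend with x

r_up   <- cor(x, y_up)       # perfect positive relationship, r close to 1
r_down <- cor(x, y_down)     # perfect negative relationship, r close to -1
r_flat <- cor(x, y_flat)     # no linear relationship, r close to 0

print(c(up = r_up, down = r_down, flat = r_flat))
```

Real data almost never reach +/- 1; the point is only that the sign follows the direction of the trend and the magnitude follows how tightly the points hug a line.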

Example:

Year	Unemployment rate (x)	IMF Japan Position (y)
1997	3.4	1,464.13
1998	4.1	1,936.85
1999	4.7	8,539.35
2000	4.7	9,281.27
2001	5	9,294.58
2002	5.4	8,015.22
2003	5.2	8,109.09
2004	4.7	8,942.11
2005	4.4	11,300.41
2006	4.1	12,028.59
2007	3.8	12,431.28
2008	4	11,587.69
2009	5.1	10,562.95
2010	5.1	10,320.52
2011	4.6	11,509.82
2012	4.3	14,631.47
2013	4	15,023.61
2014	3.6	15,239.68
2015	3.4	15,177.52
2016	3.1	27,204.34

Figure: scatter plot of the data with the fitted correlation trend line.

Notes on using a spreadsheet:
See the spreadsheet in the sandbox area (drive link)

Notes on using R:

Getting started with R. The software first.

Download and install R from CRAN http://cran.r-project.org/

Download and install R Studio for the desktop (the same people also run/operate Shiny) http://www.rstudio.com/products/rstudio/download/
R reference card (PDF) by Tom Short (more can be found under Short Documents and Reference Cards here)

Steps

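library(readr)  # provides read_csv()

# Read the example data (the CSV must be in the working directory)
japan <- read_csv("simplecorrelationexercise.csv")

# Inspect the first rows and the column names
head(japan)

names(japan)

# Scatter plot of unemployment rate (x) against IMF Japan Position (y)
plot(japan$`Unemployment rate (x)`, japan$`IMF Japan Position (y)`)

# Summary statistics for every column
summary(japan)

# Pearson's correlation coefficient (the default method of cor())
cor(japan$`Unemployment rate (x)`, japan$`IMF Japan Position (y)`)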
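If the CSV file is not to hand, the same exercise can be run by typing the table values straight into R. The following is a minimal sketch under that assumption: the short column names (unemployment, imf_position) are invented here for convenience, and the trend line uses a least-squares fit via lm(), which may or may not be how the original figure was drawn.

```r
# Values transcribed from the example table (1997-2016).
japan <- data.frame(
  year = 1997:2016,
  unemployment = c(3.4, 4.1, 4.7, 4.7, 5.0, 5.4, 5.2, 4.7, 4.4, 4.1,
                   3.8, 4.0, 5.1, 5.1, 4.6, 4.3, 4.0, 3.6, 3.4, 3.1),
  imf_position = c(1464.13, 1936.85, 8539.35, 9281.27, 9294.58,
                   8015.22, 8109.09, 8942.11, 11300.41, 12028.59,
                   12431.28, 11587.69, 10562.95, 10320.52, 11509.82,
                   14631.47, 15023.61, 15239.68, 15177.52, 27204.34)
)

x <- japan$unemployment
y <- japan$imf_position

# Pearson's r by hand: covariance over the product of standard deviations.
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# Least-squares trend line drawn over the scatter plot.
fit <- lm(imf_position ~ unemployment, data = japan)
plot(x, y,
     xlab = "Unemployment rate (%)",
     ylab = "IMF Japan Position (SDR million)")
abline(fit)

# For a single-predictor fit, R-squared equals r squared.
r_squared <- summary(fit)$r.squared

print(c(manual = r_manual, builtin = r_builtin, r_squared = r_squared))
```

The hand calculation and cor() should agree, which is a useful check that the formula has been understood before relying on the built-in function.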

Notes on sources:
Statistics Bureau of Japan (http://www.stat.go.jp/english/)

Unemployment FAQs (http://www.stat.go.jp/english/data/roudou/qa-1.htm)
Historical data (http://www.stat.go.jp/english/data/roudou/lngindex.htm) Report a-1
(file Japan_EmploymentRatesHistorical_lt01-a10.xls)

IMF's Holdings of Currency (Holdings Rate) SDR Million
IMF Japan Position at Dec31/year e.g. https://goo.gl/tB37m2
(file IMFHoldingYen_1984-2016.csv)

Historical Statistics of Japan http://www.stat.go.jp/english/data/chouki/index.htm

OECD data on Japan https://data.oecd.org/
Measuring the Digital Economy: A New Perspective (link to pdf)

Footnote:

The P-value is the probability of observing a relationship at least as strong as the one in the sample if the null hypothesis were true. It is used to judge how much confidence we can have that a relationship exists between the two populations.

Null hypothesis - that there is no relationship between the two populations

The P-value is expressed as a percentage or decimal; the smaller it is, the stronger the evidence for rejecting the null hypothesis.

Typically we treat a result as Significant: <=5%; Marginally significant: <=10%; Insignificant: >10%
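In R, a P-value for Pearson's r can be obtained with cor.test() from the built-in stats package. The sketch below re-enters the table values so that it runs on its own; the labelling step simply applies the rule-of-thumb thresholds from this footnote, and the variable names are hypothetical.

```r
# Values transcribed from the example table (1997-2016).
unemployment <- c(3.4, 4.1, 4.7, 4.7, 5.0, 5.4, 5.2, 4.7, 4.4, 4.1,
                  3.8, 4.0, 5.1, 5.1, 4.6, 4.3, 4.0, 3.6, 3.4, 3.1)
imf_position <- c(1464.13, 1936.85, 8539.35, 9281.27, 9294.58,
                  8015.22, 8109.09, 8942.11, 11300.41, 12028.59,
                  12431.28, 11587.69, 10562.95, 10320.52, 11509.82,
                  14631.47, 15023.61, 15239.68, 15177.52, 27204.34)

# Two-sided test of the null hypothesis that the true correlation is zero.
result <- cor.test(unemployment, imf_position, method = "pearson")

r <- unname(result$estimate)  # the sample correlation coefficient
p <- result$p.value           # probability of a result this extreme under the null

# Apply the rule-of-thumb thresholds from the footnote.
label <- if (p <= 0.05) {
  "significant"
} else if (p <= 0.10) {
  "marginally significant"
} else {
  "insignificant"
}

cat(sprintf("r = %.3f, P-value = %.3f (%s)\n", r, p, label))
```

With only 20 data pairs the test has limited power, so the label is best read as a rough guide rather than a verdict.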

Outsourcing and Offshoring - Open Educational Resource

The MGS Blog

Friday, January 26, 2018
