Pearson's correlation coefficient is calculated as the covariance of two sets of paired data values divided by the product of their standard deviations. The paired data are easily represented visually as a scatter plot: a cloud of points, one per data pair, along two axes (x and y). The coefficient measures how closely that cloud follows a straight line, so the test assumes that a linear dependence can be fitted between the x and y values. (Review the Wikipedia article for examples of false correlations.) Note that it ignores sequence: if a third property such as time is recorded, the order of the pairs plays no part. It focuses only on whether the paired values exhibit correlation. Remember that correlation is not causation.
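For reference, the usual textbook definition of the sample coefficient can be written as below. The notation is introduced here and is not taken from the text above: the n data pairs are written (x_i, y_i), and \bar{x} and \bar{y} are the sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is proportional to the covariance of x and y, and the denominator to the product of their standard deviations; the common scaling factor cancels, which is why r has no units.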
The correlation coefficient, known as r, is somewhat like a normalised measure of how far the points lie from a straight line best-fitted to the cloud plot. The magnitude of r measures how closely the pairs fit that trend line. It may vary from -1 to 1. Values close to +1 or -1 suggest a very close linear relationship between the paired values. Values close to 0 suggest little or no linear relationship. A positive coefficient indicates that when x is large y tends to be large, and when x is small y tends to be small. A negative coefficient indicates that when x is large y tends to be small, and when x is small y tends to be large.
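The sign and magnitude of r can be seen with a few made-up numbers. The sketch below is only an illustration: the vectors x, y_up, y_down and y_flat are hypothetical and are not part of the exercise data.

```r
# Hypothetical toy vectors, chosen only to illustrate the sign and size of r.
x <- c(1, 2, 3, 4, 5)

y_up   <- 2 * x + 1          # points lie exactly on a rising line
y_down <- 10 - 3 * x         # points lie exactly on a falling line
y_flat <- c(2, 1, 3, 1, 2)   # symmetric pattern with no linear trend

r_up   <- cor(x, y_up)       # perfect positive relationship
r_down <- cor(x, y_down)     # perfect negative relationship
r_flat <- cor(x, y_flat)     # no linear relationship

print(c(up = r_up, down = r_down, flat = r_flat))
```

The first two cases give the extreme values +1 and -1 because every point sits on the line; in the third case the deviations cancel and r is zero to within rounding.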
Example:
Year | Unemployment rate (x) | IMF Japan Position (y) |
1997 | 3.4 | 1,464.13 |
1998 | 4.1 | 1,936.85 |
1999 | 4.7 | 8,539.35 |
2000 | 4.7 | 9,281.27 |
2001 | 5 | 9,294.58 |
2002 | 5.4 | 8,015.22 |
2003 | 5.2 | 8,109.09 |
2004 | 4.7 | 8,942.11 |
2005 | 4.4 | 11,300.41 |
2006 | 4.1 | 12,028.59 |
2007 | 3.8 | 12,431.28 |
2008 | 4 | 11,587.69 |
2009 | 5.1 | 10,562.95 |
2010 | 5.1 | 10,320.52 |
2011 | 4.6 | 11,509.82 |
2012 | 4.3 | 14,631.47 |
2013 | 4 | 15,023.61 |
2014 | 3.6 | 15,239.68 |
2015 | 3.4 | 15,177.52 |
2016 | 3.1 | 27,204.34 |
Figure: scatter plot of the data with the correlation trend line fitted
See the spreadsheet in the sandbox area (drive link)
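If the spreadsheet is not to hand, the example can be reproduced by typing the table into R directly. This is a minimal sketch, not the original worksheet: the column names unemployment and imf_position are chosen here for convenience, and the trend line is assumed to be an ordinary least-squares fit.

```r
# Values typed in from the example table (1997 to 2016).
japan <- data.frame(
  year = 1997:2016,
  unemployment = c(3.4, 4.1, 4.7, 4.7, 5.0, 5.4, 5.2, 4.7, 4.4, 4.1,
                   3.8, 4.0, 5.1, 5.1, 4.6, 4.3, 4.0, 3.6, 3.4, 3.1),
  imf_position = c(1464.13, 1936.85, 8539.35, 9281.27, 9294.58,
                   8015.22, 8109.09, 8942.11, 11300.41, 12028.59,
                   12431.28, 11587.69, 10562.95, 10320.52, 11509.82,
                   14631.47, 15023.61, 15239.68, 15177.52, 27204.34)
)

# Scatter plot of the (x, y) pairs.
plot(japan$unemployment, japan$imf_position,
     xlab = "Unemployment rate (x)", ylab = "IMF Japan Position (y)")

# Least-squares trend line y = a + b * x, drawn over the scatter plot.
fit <- lm(imf_position ~ unemployment, data = japan)
abline(fit)

# Pearson's r from the built-in function ...
r <- cor(japan$unemployment, japan$imf_position)

# ... and from the definition: covariance term over the product of spreads.
dx <- japan$unemployment - mean(japan$unemployment)
dy <- japan$imf_position - mean(japan$imf_position)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(c(cor = r, manual = r_manual))
```

The two values of r should agree, and the sign of r matches the sign of the slope of the fitted line.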
Notes on using R:
Getting started with R. The software first.
Download and install R from CRAN http://cran.r-project.org/
Download and install RStudio for the desktop (the same people also run Shiny) http://www.rstudio.com/products/rstudio/download/
R reference card (PDF) by Tom Short (more can be found under Short Documents and Reference Cards here)
Steps
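library(readr)  # provides read_csv(); run install.packages("readr") once if it is missing
japan <- read_csv("simplecorrelationexercise.csv")
head(japan)     # first few rows
names(japan)    # column names as read from the file
# scatter plot of x against y
plot(japan$`Unemployment rate (x)`, japan$`IMF Japan Position (y)`)
summary(japan)  # minimum, quartiles, mean and maximum of each column
# Pearson's correlation coefficient r
cor(japan$`Unemployment rate (x)`, japan$`IMF Japan Position (y)`)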
Data sources:
Statistics Bureau of Japan (http://www.stat.go.jp/english/)
Unemployment FAQs (http://www.stat.go.jp/english/data/roudou/qa-1.htm)
Historical data (http://www.stat.go.jp/english/data/roudou/lngindex.htm) Report a-1
(file Japan_EmploymentRatesHistorical_lt01-a10.xls)
IMF's Holdings of Currency (Holdings Rate) SDR Million
IMF Japan Position at Dec31/year e.g. https://goo.gl/tB37m2
(file IMFHoldingYen_1984-2016.csv)
Historical Statistics of Japan http://www.stat.go.jp/english/data/chouki/index.htm
OECD data on Japan https://data.oecd.org/
Measuring the Digital Economy: A New Perspective (link to pdf)
Footnote:
The P-value is a probability used to judge how strongly the sample supports a relationship between two populations. It is the probability of seeing a relationship at least as strong as the one observed if the null hypothesis were true.
Null hypothesis: there is no relationship between the two populations.
The P-value is reported as a percentage or decimal. A small P-value is evidence against the null hypothesis; a large one means the data do not give grounds to reject it.
Typical thresholds: significant, <=5%; marginally significant, <=10%; not significant, >10%.
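In R, a P-value for Pearson's r can be obtained with cor.test() from the base stats package. The sketch below applies it to the example data, typed in from the table. The vector names x and y and the 5% cut-off stored in alpha are choices made here for illustration.

```r
# Example data typed in from the table: unemployment rate (x), IMF position (y).
x <- c(3.4, 4.1, 4.7, 4.7, 5.0, 5.4, 5.2, 4.7, 4.4, 4.1,
       3.8, 4.0, 5.1, 5.1, 4.6, 4.3, 4.0, 3.6, 3.4, 3.1)
y <- c(1464.13, 1936.85, 8539.35, 9281.27, 9294.58,
       8015.22, 8109.09, 8942.11, 11300.41, 12028.59,
       12431.28, 11587.69, 10562.95, 10320.52, 11509.82,
       14631.47, 15023.61, 15239.68, 15177.52, 27204.34)

# Two-sided test of the null hypothesis that the true correlation is zero.
test <- cor.test(x, y, method = "pearson")

r_hat   <- unname(test$estimate)  # sample correlation coefficient
p_value <- test$p.value           # probability under the null hypothesis

# Compare against the conventional 5% threshold.
alpha <- 0.05
reject_null <- p_value <= alpha

print(c(r = r_hat, p = p_value))
print(reject_null)
```

The same P-value can then be read against the 10% threshold to decide whether the result counts as marginally significant.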