Linear regression: An assessment of the presence and strength of a linear relationship (correlation) between two quantitative variables. The strength of the linear relationship can be measured by calculating the correlation coefficient (r) between the variables. This approach is one of many tools for data analysis. A caveat, however: correlations don't always correspond to actual causal relationships - they may simply be coincidental.
Pearson's correlation coefficient is calculated as the covariance of two sets of paired data values divided by the product of their standard deviations. The paired data are easily represented visually as a scatter plot. In essence, a line is fitted to a cloud of points, the data pairs plotted along two axes (x and y). This test assumes that a linear dependence can be fitted between the pairs of x and y values on the scatter plot (review the Wikipedia article for examples of false correlations). Note that it ignores sequence (for example, when a third value such as time is recorded). It focuses only on whether the ordered pairs of values exhibit correlation - but remember, correlation is not causation.
The correlation coefficient, known as r, measures how closely the data pairs cluster around a straight line best-fitted to the cloud of points. It may vary from -1 to 1. The magnitude of r indicates the closeness of fit between the pairs and the fitted trend line: values close to +1 or -1 suggest a very close linear relationship between the data pairs, and values close to 0 suggest little or no linear relationship. A positive r indicates that when x is large y tends to be large, and when x is small y tends to be small. A negative r indicates that when x is large y tends to be small, and when x is small y tends to be large.
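As a minimal sketch of the definition, the R snippet below computes r by hand as the covariance divided by the product of the standard deviations, and compares it with R's built-in cor(). The vectors x and y are made-up illustrative values, not data from this page.

```r
# Hypothetical illustrative values (not from the Japan data set)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson's r from its definition: covariance / (sd of x * sd of y)
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in equivalent (method = "pearson" is the default)
r_builtin <- cor(x, y)

# Reversing the direction of y flips the sign of r but not its magnitude
r_flipped <- cor(x, -y)

print(c(manual = r_manual, builtin = r_builtin, flipped = r_flipped))
```

The manual and built-in values should agree, and the flipped value should be the same size with the opposite sign.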
Example:
year | Unemployment rate (x) | IMF Japan Position (y) |
1997 | 3.4 | 1,464.13 |
1998 | 4.1 | 1,936.85 |
1999 | 4.7 | 8,539.35 |
2000 | 4.7 | 9,281.27 |
2001 | 5 | 9,294.58 |
2002 | 5.4 | 8,015.22 |
2003 | 5.2 | 8,109.09 |
2004 | 4.7 | 8,942.11 |
2005 | 4.4 | 11,300.41 |
2006 | 4.1 | 12,028.59 |
2007 | 3.8 | 12,431.28 |
2008 | 4 | 11,587.69 |
2009 | 5.1 | 10,562.95 |
2010 | 5.1 | 10,320.52 |
2011 | 4.6 | 11,509.82 |
2012 | 4.3 | 14,631.47 |
2013 | 4 | 15,023.61 |
2014 | 3.6 | 15,239.68 |
2015 | 3.4 | 15,177.52 |
2016 | 3.1 | 27,204.34 |
Figure: scatter plot of the data with the correlation trend line fitted.
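The example data can also be typed straight into R without a CSV file. The sketch below is one way to draw the scatter plot and add a fitted trend line. Using lm() and abline() is an assumption here, since the page does not say how the original figure was produced, and the variable names unemployment and imf_position are chosen for illustration.

```r
# Values copied from the example table (1997 to 2016)
unemployment <- c(3.4, 4.1, 4.7, 4.7, 5.0, 5.4, 5.2, 4.7, 4.4, 4.1,
                  3.8, 4.0, 5.1, 5.1, 4.6, 4.3, 4.0, 3.6, 3.4, 3.1)
imf_position <- c(1464.13, 1936.85, 8539.35, 9281.27, 9294.58,
                  8015.22, 8109.09, 8942.11, 11300.41, 12028.59,
                  12431.28, 11587.69, 10562.95, 10320.52, 11509.82,
                  14631.47, 15023.61, 15239.68, 15177.52, 27204.34)

# Scatter plot of y against x
plot(unemployment, imf_position,
     xlab = "Unemployment rate (x)",
     ylab = "IMF Japan Position (y)")

# Least-squares straight line through the cloud of points
fit <- lm(imf_position ~ unemployment)
abline(fit)

# Pearson correlation coefficient for the same pairs
r <- cor(unemployment, imf_position)
print(r)
```

For these 20 pairs the fitted slope and r both come out negative, which would suggest that higher unemployment tends to pair with a lower IMF position in this sample.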
Notes on using a spreadsheet:
See the spreadsheet in the sandbox area (drive link).
Notes on using R:
Getting started with R: the software comes first.
Steps
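library(readr)  # read_csv() comes from the readr package
japan <- read_csv("simplecorrelationexercise.csv")
head(japan)     # first few rows
names(japan)    # column names
summary(japan)  # summary statistics for each column
# scatter plot of y against x
plot(japan$`Unemployment rate (x)`, japan$`IMF Japan Position (y)`)
# Pearson correlation coefficient r
cor(japan$`Unemployment rate (x)`, japan$`IMF Japan Position (y)`)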
Notes on sources:
Statistics Bureau of Japan (http://www.stat.go.jp/english/)
Unemployment FAQs (http://www.stat.go.jp/english/data/roudou/qa-1.htm)
Historical data (http://www.stat.go.jp/english/data/roudou/lngindex.htm), Report a-1 (file Japan_EmploymentRatesHistorical_lt01-a10.xls)
IMF's Holdings of Currency (Holdings Rate), SDR Million
IMF Japan Position at Dec 31 of each year, e.g. https://goo.gl/tB37m2 (file IMFHoldingYen_1984-2016.csv)
Historical Statistics of Japan (http://www.stat.go.jp/english/data/chouki/index.htm)
Footnote:
The P-value is a probability used to judge how strong the evidence is for a relationship between two populations.
Null hypothesis - that there is no relationship between the two populations.
The P-value, expressed as a percentage or decimal, is the probability of seeing a result at least as extreme as the one observed if the null hypothesis were true. A small P-value is evidence against the null hypothesis.
Typically, results are interpreted as: Significant: <=5%; Marginally significant: <=10%; Insignificant: >10%
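For a correlation, R's cor.test() reports a P-value for the null hypothesis that the true correlation is zero. The sketch below applies it to the example data, re-entered as vectors so the snippet runs on its own. The variable names are illustrative, and the 5% and 10% cut-offs are the rule of thumb from this footnote, not something cor.test() applies itself.

```r
# Values copied from the example table (1997 to 2016)
unemployment <- c(3.4, 4.1, 4.7, 4.7, 5.0, 5.4, 5.2, 4.7, 4.4, 4.1,
                  3.8, 4.0, 5.1, 5.1, 4.6, 4.3, 4.0, 3.6, 3.4, 3.1)
imf_position <- c(1464.13, 1936.85, 8539.35, 9281.27, 9294.58,
                  8015.22, 8109.09, 8942.11, 11300.41, 12028.59,
                  12431.28, 11587.69, 10562.95, 10320.52, 11509.82,
                  14631.47, 15023.61, 15239.68, 15177.52, 27204.34)

# Test of the null hypothesis that the true Pearson correlation is 0
result <- cor.test(unemployment, imf_position, method = "pearson")

p <- result$p.value   # probability of a result at least this extreme under the null
r <- result$estimate  # sample correlation coefficient

# Apply the rule-of-thumb thresholds from the footnote
verdict <- if (p <= 0.05) {
  "Significant"
} else if (p <= 0.10) {
  "Marginally significant"
} else {
  "Insignificant"
}

print(c(r = unname(r), p = p))
print(verdict)
```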