R and Harvard's PH525.1x MOOC

Having enrolled in "Statistics and R for the Life Sciences", run on edX by Harvard (MOOC module PH525.1x), here are some notes and pointers for R: introductory material and tutorials for the basic statistical treatment of quantitative data.

As an aside, would you believe that running R commands and generating plots makes data fun and surprising? Or is it just me? At least it's something different, and perhaps more interactive than doing the same thing in a spreadsheet. Did I say it felt like fun already? Perhaps it's generating those beautiful little graphs automatically.

Definitions: p-value, confidence interval, random variables, null distributions, central limit theorem, inference tests like t-test, association test, permutation test.
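To make a few of these terms a bit more concrete, here is a minimal sketch on simulated data. The group names, sample sizes and means are all invented for illustration and are not taken from the course; it only shows where a p-value, a confidence interval and a permutation-based null distribution come from in R.

```r
set.seed(1)                                 # make the simulated numbers reproducible
control   <- rnorm(12, mean = 24, sd = 3)   # invented "control" measurements
treatment <- rnorm(12, mean = 27, sd = 3)   # invented "treatment" measurements

tt <- t.test(treatment, control)            # Welch two-sample t-test
tt$p.value                                  # p-value for the observed difference in means
tt$conf.int                                 # 95% confidence interval for that difference

# Permutation test: shuffle the pooled values to build a null distribution
obs    <- mean(treatment) - mean(control)
pooled <- c(treatment, control)
null   <- replicate(1000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:12]) - mean(shuffled[13:24])
})
mean(abs(null) >= abs(obs))                 # share of shuffles at least as extreme as observed
```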

Getting started with R. The software first.

Download and install R from CRAN http://cran.r-project.org/

Download and install RStudio for the desktop (the same people also operate Shiny) http://www.rstudio.com/products/rstudio/download/

Q: Am I on the latest version of R?
Check that you're on the latest version; if not, download the latest package (on macOS use Safari).
Run the md5 checker from the terminal and verify the checksum.

md5 R-3.2.2.pkg
MD5 (R-3.2.2.pkg) = dd8999f50c5d4e392832797d091642db

Then check that the installation works: run R.app and type version at the R console to verify that the expected version is running.
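Besides typing version, the check can be done with a couple of base R calls. This is just a small sketch; the "3.2.2" in the comparison is simply the release named in the md5 step, so swap in whatever version you expect.

```r
R.version.string            # full version string of the running R
getRversion()               # the same version as an object that can be compared
getRversion() >= "3.2.2"    # TRUE if R is at least the release named in the md5 step
```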

Q: Same again: am I on the latest version of RStudio?
Check that you're on the latest version; if not, download the latest package (on macOS use Safari), then verify it with the md5 checker as before.

md5 RStudio-0.99.489.dmg
MD5 (RStudio-0.99.489.dmg) = 05cf866b07df6552583f98314ed09d38

Again, check that the installation works: run RStudio and go to RStudio>About RStudio to see the version window.

Links for learning.

Pointers to external material (from the EdX course)

R reference card (PDF) by Tom Short (more can be found under Short Documents and Reference Cards on CRAN)
Quick-R: quick online reference for data input, basic statistics and plots: http://www.statmethods.net/
Thomas Girke's R & Bioconductor manuals
R programming class on Coursera, taught by Roger Peng, Jeff Leek and Brian Caffo
The R class from Code School is also a good place to start: http://tryr.codeschool.com/

The underlying concepts.

Vectors (one or more elements, all of a single data type)
Matrices (two-dimensional arrays, all of a single data type)
Lists (ordered collections of mixed data types, organised into named components accessed as $xxx)
Data frames (a table-like form of R list in which each column is a vector, so different columns can hold different data types. They can be addressed in different ways, by named component or by index location)
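A quick sketch of the four structures side by side. The values and names (mouse, chow, hf) are made up for illustration.

```r
v  <- c(10.5, 12.1, 9.8)                      # vector: one data type (numeric)
m  <- matrix(1:6, nrow = 2)                   # matrix: 2 rows x 3 columns, one data type
l  <- list(name = "mouse", weights = v)       # list: mixed types in named components
df <- data.frame(id = 1:3, weight = v,
                 diet = c("chow", "hf", "chow"))  # data frame: one vector per column

class(v); class(m); class(l); class(df)       # ask R what each object is
l$weights          # list component by name
df$weight          # data frame column by name
df[2, "weight"]    # one cell, by row index and column name
df[[2]]            # second column, by position
```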

Matrix commands

> m <- rbind(c(1,4),c(2,2)) # rbind( ) is a function for row bind. cbind( ) is the corresponding function for column bind.
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    2
> m %*% c(1,1) # matrix multiplication operator
     [,1]
[1,]    5
[2,]    4
> m[1,2] # return value at matrix index location 1,2
[1] 4
> m[2,2] # return value at matrix index location 2,2
[1] 2
> m[1,] # row 1, shows how to extract submatrices from a matrix
[1] 1 4
> m[,2] # column 2, shows how to extract submatrices from a matrix
[1] 4 2
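Two small follow-ups to the matrix commands, as a sketch: what cbind( ) does with the same two vectors, and how to stop R from turning a single extracted row into a plain vector. The variable name mc is just for illustration.

```r
m  <- rbind(c(1, 4), c(2, 2))   # the vectors become the rows
mc <- cbind(c(1, 4), c(2, 2))   # the same vectors become the columns
all(mc == t(m))                 # so cbind() here gives the transpose of m

dim(m)                          # number of rows and columns
m[1, ]                          # a single row comes back as a plain vector
m[1, , drop = FALSE]            # drop = FALSE keeps it as a 1 x 2 matrix
```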

A series of setup steps in R/RStudio, using the data at https://github.com/genomicsclass/dagdata:

> install.packages("devtools")

> library(devtools)

> install_github("genomicsclass/dagdata")

> install_github("ririzarr/rafalib")
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> x <- 1:10
> y <- rnorm(10)
> plot(x,y)
> ? read.csv
> dat <- read.csv("femaleMiceWeights.csv")

You might need to set the working directory (with setwd) so that read.csv can find the right file.
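A minimal sketch of the working-directory point. The file example_weights.csv and its contents are hypothetical; it is written to a temporary folder here only so the snippet runs on its own.

```r
getwd()                                           # where R currently looks for files

# Hypothetical example file, written to a temporary folder for illustration
csv_path <- file.path(tempdir(), "example_weights.csv")
write.csv(data.frame(Diet = c("chow", "hf"), Bodyweight = c(21.5, 25.7)),
          csv_path, row.names = FALSE)

file.exists(csv_path)                             # check the file is there before reading
dat_example <- read.csv(csv_path)                 # a full path works from any directory

old_wd <- setwd(tempdir())                        # or move the working directory there...
dat_example2 <- read.csv("example_weights.csv")   # ...and use the bare file name
setwd(old_wd)                                     # put the working directory back
```

In RStudio the same thing can be done from the Session menu, but the setwd call keeps the step visible in your script.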

> dat <- read.csv("msleep_ggplot2.csv") # the commands below use the mammal sleep data
> class(dat)
> head(dat)
> dim(dat)
> dat$order
> colnames(dat)
> dat$sleep_total
> c(dat$sleep_total, 1000)
> plot(dat$brainwt, dat$sleep_total)
> plot(dat$brainwt, dat$sleep_total, log="x")
> summary(dat)
> dat[c(1,2),] # first two rows, all columns
> dat[ dat$sleep_total > 18, ] # rows where total sleep exceeds 18 hours
> dat$sleep_total[ c(1,2)]
> dat[dat$sleep_total > 18,6] # column 6 is sleep_total
> mean(dat[ dat$sleep_total > 18,6 ])
> which(dat$sleep_total>18) # row positions rather than TRUE/FALSE
> dat$sleep_total[which(dat$sleep_total>18)]
> dat$sleep_total[22]
> which(dat$sleep_rem<3)
> which(dat$sleep_rem<3 & dat$sleep_total>18)
> sort(dat$sleep_total)
> order(dat$sleep_total)
> dat$sleep_total[order(dat$sleep_total)]
> rank(dat$sleep_total)
> rank(c(1,2,2,3))
> match(c("Cow", "Owl monkey", "Cheetah"), dat$name)
> idx <- match(c("Cow", "Owl monkey", "Cheetah"), dat$name)
> dat[idx,] # the comma selects rows; dat[idx] would select columns
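Since the msleep file may not be to hand, here is the same family of commands on a tiny invented data frame (the animals and hours are made up), which makes it easier to see what which, order, rank and match each return.

```r
# Toy data, invented for illustration (not the msleep file)
toy <- data.frame(name = c("Cat", "Bat", "Cow", "Owl"),
                  sleep_total = c(12.5, 19.7, 4.0, 17.0),
                  stringsAsFactors = FALSE)

toy$sleep_total > 18                     # logical vector, one value per row
which(toy$sleep_total > 18)              # positions of the TRUE values
o <- order(toy$sleep_total)              # row positions from shortest to longest sleeper
toy$name[o]                              # names in that order
rank(toy$sleep_total)                    # rank of each row within the column
idx <- match(c("Cow", "Cat"), toy$name)  # row positions of the requested names
toy[idx, ]                               # the comma selects rows, not columns
```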

Q: Running a simple correlation analysis in R
The presence and strength of a linear relationship or correlation between two quantitative variables can be tested by calculating the correlation coefficient (r) between the two variables.

> countries <- read.csv('countries.csv')
> head(countries)
    country year gsli.total gsli.financial gsli.people u5_mortality_rate life_expectancy unemployment_rate
1 Argentina 2004       5.07           3.25        0.74              17.9            74.5              12.2
2 Argentina 2005       5.05           3.14        0.93              17.1            75.0              10.6
etc.

Run a correlation test between the two variables (vectors) we want to compare, i.e. countries$gsli.total and countries$unemployment_rate.
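As a sketch of what that looks like, here is cor.test on a small stand-in data frame. The column names follow the text above, the first two rows echo the printed output, and the remaining values are invented for illustration, so the results say nothing about the real countries.csv.

```r
# Stand-in data: first two rows echo the output above, the rest are invented
toy <- data.frame(
  gsli.total        = c(5.07, 5.05, 5.20, 5.31, 5.18, 5.40, 5.52, 5.47),
  unemployment_rate = c(12.2, 10.6, 10.1, 8.5, 7.8, 8.6, 7.7, 7.2)
)

r  <- cor(toy$gsli.total, toy$unemployment_rate)       # Pearson's r by default
ct <- cor.test(toy$gsli.total, toy$unemployment_rate)  # tests whether the true correlation is 0
ct$estimate    # the correlation coefficient again
ct$p.value     # p-value for the test
ct$conf.int    # 95% confidence interval for r
```

With the real data the call would presumably be cor.test(countries$gsli.total, countries$unemployment_rate).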

Q: Making a pairs plot to highlight possible correlations between multiple variables
Produce all possible pair-wise plots.
> names(countries)
[1] "country" "year"
[3] "gsli.total" "gsli.financial"
[5] "gsli.people" "u5mr_mortality_rate_median"
[7] "life_expectancy_from_birth" "GET_UR"
> pairs(countries[,3:8])

Figure: pairs plot for the countries dataset (read the axes as x v y, as illustrated for gsli.total v gsli.financial).

By inspection, the only variable with no obvious correlation to the others is gsli.total.
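Reading correlations off a pairs plot by eye is easier with the numbers next to it. A minimal sketch on invented data (the columns a, b and c are made up, with b deliberately built from a) pairs the plot with a correlation matrix from cor.

```r
set.seed(42)                                # reproducible invented data
n <- 30
toy <- data.frame(a = rnorm(n))
toy$b <- 2 * toy$a + rnorm(n, sd = 0.5)     # built to correlate with a
toy$c <- rnorm(n)                           # independent noise

pairs(toy)                                  # every pair-wise scatter plot
round(cor(toy), 2)                          # matching matrix of correlation coefficients
```

The same two calls should work on the numeric columns of the countries data, e.g. countries[,3:8].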


The MGS Blog

Tuesday, April 18, 2017
