The MGS Blog

Tuesday, April 18, 2017

R and Harvard's PH525.1x MOOC

I have enrolled in "Statistics and R for the Life Sciences", run on EdX by Harvard (MOOC module PH525.1x).
Here are some notes and pointers for R: introductory material and tutorials for the basic statistical treatment of quantitative data.
As an aside, would you believe that running R commands and generating plots makes data fun and surprising? Or is it just me? At least it's something different, and perhaps more interactive than doing the same thing in a spreadsheet. Did I say already that it felt like fun? Perhaps it's those beautiful little graphs being generated automatically.

Definitions: p-value, confidence interval, random variables, null distributions, central limit theorem, inference tests like t-test, association test, permutation test.
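
Several of these terms show up together in a single t-test. The sketch below is only an illustration on simulated numbers (the group names, sizes and means are made up, not course data): it runs a two-sample t-test and pulls out the p-value and the confidence interval.

```r
# Hypothetical data: simulated weights for two groups of 12 animals each.
set.seed(1)                                   # make the simulation reproducible
control   <- rnorm(12, mean = 24, sd = 3)
treatment <- rnorm(12, mean = 27, sd = 3)

# Welch two-sample t-test (the default behaviour of t.test)
result <- t.test(treatment, control)

result$p.value    # p-value under the null hypothesis of equal means
result$conf.int   # 95% confidence interval for the difference in means
```

Here rnorm( ) draws the random variables, and the null hypothesis is that both groups share the same mean.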

Getting started with R. The software first.
Download and install R from CRAN
Download and install RStudio for the desktop (the same people also run/operate Shiny)

Q: Am I on the latest version of R?
Check that you're on the latest version; if not, download the latest package (on MacOS use Safari).
Run the md5 checker from the terminal and verify the checksum.
md5 R-3.2.2.pkg
MD5 (R-3.2.2.pkg) = dd8999f50c5d4e392832797d091642db
Then check that the installation works: run R and type version at the R console to verify that the expected version is running.
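
Both checks can also be done from inside R. This is a minimal sketch, assuming the installer path is replaced with the real download; a throwaway temp file stands in for it here so the snippet runs anywhere.

```r
# Version of the running R installation
R.version.string      # the version line as a single string
getRversion()         # a version object that can be compared

# MD5 checksum from inside R. The temp file below is a placeholder
# (an assumption for illustration); point this at the real .pkg file.
installer <- tempfile(fileext = ".pkg")
writeLines("placeholder contents", installer)
checksum <- unname(tools::md5sum(installer))
checksum              # a 32-character hexadecimal string
```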

Q: Same: am I on the latest version of RStudio?
Check that you're on the latest version; if not, download the latest package (on MacOS use Safari).
md5 RStudio-0.99.489.dmg
MD5 (RStudio-0.99.489.dmg) = 05cf866b07df6552583f98314ed09d38
Again, check that the installation works: run RStudio and go to RStudio>About RStudio to see the version window.

Links for learning.
Pointers to external material (from the EdX course)
The underlying concepts.
  • Vectors (one or more elements in a single sequence, all of one data type)
  • Matrix (two-dimensional array, all of one data type)
  • Lists (an ordered collection that can mix data types, organised into named components accessed as $xxx)
  • Data Frames (a list in table form: each column is a vector, and columns may differ in data type. Can be addressed in different ways, by named component or by index location)
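
The four structures above can be sketched in a few lines. The variable names and values here are made up for illustration.

```r
v   <- c(1.5, 2, 3.25)                      # vector: one data type
mat <- matrix(1:6, nrow = 2)                # matrix: 2 x 3, one data type
l   <- list(name = "mouse", weights = v)    # list: named components of mixed type
df  <- data.frame(id = 1:3, weight = v)     # data frame: equal-length columns

class(v); class(mat); class(l); class(df)

l$weights         # list component by name
df$weight         # data frame column by name
df[[2]]           # the same column, addressed by position
df[2, "weight"]   # a single cell, by row index and column name
is.list(df)       # TRUE: a data frame is a list underneath
```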
Matrix commands
> m <- rbind(c(1,4),c(2,2))  # rbind( ) is a function for row bind. cbind( ) is the corresponding function for column bind.
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    2
> m %*% c(1,1)  # matrix multiplication operator
     [,1]
[1,]    5
[2,]    4
> m[1,2] # return value at matrix index location 1,2
[1] 4
> m[2,2] # return value at matrix index location 2,2
[1] 2
> m[1,] # row 1, shows how to extract submatrices from a matrix
[1] 1 4
> m[,2] # column 2, shows how to extract submatrices from a matrix
[1] 4 2
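
As a small follow-up to the session above, this sketch shows cbind( ) next to rbind( ), plus one indexing detail (drop = FALSE) that is not in the original notes.

```r
m <- rbind(c(1, 4), c(2, 2))   # the matrix from the session above
n <- cbind(c(1, 4), c(2, 2))   # the same vectors, bound as columns instead

dim(m)                  # 2 2
all(n == t(m))          # TRUE: column-binding gives the transpose here
m[, 2]                  # extracting one column returns a plain vector
m[, 2, drop = FALSE]    # drop = FALSE keeps it as a 2 x 1 matrix
```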

A series of setup steps in R/RStudio (and using stuff at
> install.packages("devtools")
> library(devtools)
> install_github("genomicsclass/dagdata")
> install_github("ririzarr/rafalib")
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> x <- 1:10
> y <- rnorm(10)
> plot(x,y)
> ? read.csv
> dat <- read.csv("femaleMiceWeights.csv")
You might need to set the working directory (for example with setwd( ), or via Session > Set Working Directory in RStudio) so that read.csv can find the file.
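
A minimal sketch of the working-directory step. The file name example.csv and its contents are hypothetical stand-ins written to a temp directory so the snippet runs anywhere; in practice the file would be the course download.

```r
# Hypothetical setup: create a tiny CSV in a temp directory.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "example.csv")
write.csv(data.frame(Diet = c("chow", "hf"), Bodyweight = c(21.5, 25.7)),
          csv_path, row.names = FALSE)

getwd()                       # where R currently resolves relative paths
old_wd <- setwd(data_dir)     # change directory; setwd() returns the previous one
dat_example <- read.csv("example.csv")   # the relative path now resolves
setwd(old_wd)                 # restore the original working directory

dat_example
```

Passing a full path to read.csv( ), as in read.csv(csv_path), avoids changing the working directory at all.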

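The commands below use columns such as order, sleep_total and brainwt, which come from the mammal sleep dataset rather than the mice weights file, so load that file first.
> dat <- read.csv("msleep_ggplot2.csv")
> class(dat)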
> head(dat)
> dim(dat)
> dat$order
> colnames(dat)
> dat$sleep_total
> c(dat$sleep_total, 1000)
> plot(dat$brainwt, dat$sleep_total)
> plot(dat$brainwt, dat$sleep_total, log="x")
> summary(dat)
> dat[c(1,2),]
> dat[ dat$sleep_total > 18, ]
> dat$sleep_total[ c(1,2)]
> dat[dat$sleep_total > 18,6]
> mean(dat[ dat$sleep_total > 18,6 ])
> dat <- read.csv("msleep_ggplot2.csv")
> which(dat$sleep_total>18)
> dat$sleep_total[which(dat$sleep_total>18)]
> dat$sleep_total[22]
> which(dat$sleep_rem<3)
> which(dat$sleep_rem<3 & dat$sleep_total>18)
> sort(dat$sleep_total)
> order(dat$sleep_total)
> dat$sleep_total[order(dat$sleep_total)]
> rank(dat$sleep_total)
> rank(c(1,2,2,3))
> match(c("Cow", "Owl monkey", "Cheetah"), dat$name)
> idx <- match(c("Cow", "Owl monkey", "Cheetah"), dat$name)
> dat[idx,]  # the comma selects rows; without it R would try to select columns
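
The difference between which, sort, order, rank and match is easier to see on a tiny example. The data frame below is a made-up stand-in for the sleep data, with invented values.

```r
# Hypothetical mini dataset standing in for the sleep data
toy <- data.frame(name = c("Cheetah", "Bat", "Cow", "Owl monkey"),
                  sleep_total = c(12.1, 19.7, 4.0, 17.0))

toy[toy$sleep_total > 18, ]     # logical subsetting: rows where the test is TRUE
which(toy$sleep_total > 18)     # 2: positions where the test is TRUE
sort(toy$sleep_total)           # 4.0 12.1 17.0 19.7: values in increasing order
order(toy$sleep_total)          # 3 1 4 2: indices that would sort the vector
rank(toy$sleep_total)           # 2 4 1 3: each element's place in the sorted vector
rank(c(1, 2, 2, 3))             # 1.0 2.5 2.5 4.0: ties share the average rank

idx <- match(c("Cow", "Owl monkey", "Cheetah"), toy$name)
idx                             # 3 4 1: where each name is found
toy[idx, ]                      # rows returned in the order requested
```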

Q: Running a simple correlation analysis in R
Useful videos (link) (link2)
The presence and strength of a linear relationship or correlation between two quantitative variables can be tested by calculating the correlation coefficient (r) between the two variables.
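
For reference, and assuming the Pearson coefficient is meant (it is the default in R's cor( ) function), r for n paired observations (x_i, y_i) with sample means x-bar and y-bar can be written as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.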

> countries <- read.csv('countries.csv')
> head(countries)
    country year           gsli.people u5_mortality_rate life_expectancy unemployment_rate
1 Argentina 2004 5.07 3.25        0.74              17.9            74.5              12.2
2 Argentina 2005 5.05 3.14        0.93              17.1            75.0              10.6

Run a correlation test between the two variables (vectors) we want to compare, i.e. countries$ and countries$unemployment_rate.
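
A minimal sketch of such a test using cor.test( ). The two vectors are simulated stand-ins, not columns from countries.csv.

```r
# Hypothetical data: two simulated vectors in place of the real columns.
set.seed(42)
x <- rnorm(30)
y <- 0.8 * x + rnorm(30, sd = 0.5)   # y is built to depend linearly on x

cor(x, y)               # Pearson correlation coefficient r
ct <- cor.test(x, y)    # tests the null hypothesis that the true correlation is 0
ct$estimate             # the same r, as reported by the test
ct$p.value              # small values argue against "no linear relationship"
ct$conf.int             # 95% confidence interval for r
```

With real data, the two arguments would be the two columns of the data frame being compared.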

Q: Making a pairs plot to highlight possible correlations between multiple variables
Produce all possible pair-wise plots (link)
> names(countries)
[1] "country"                    "year"                  
[3] ""                 ""        
[5] "gsli.people"                "u5mr_mortality_rate_median"
[7] "life_expectancy_from_birth" "GET_UR"                
> pairs(countries[,3:8])
pairs-plot for the countries dataset (for axes read x-y as illustrated v gsli-fin)
By inspection the only variable without obvious possible correlation is
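
For anyone without the countries file, the same idea can be tried on a dataset that ships with R. This sketch uses mtcars purely as a stand-in and adds a correlation matrix as a numeric companion to the plot.

```r
# Built-in dataset used as a stand-in for the countries data
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

pairs(vars)            # one scatterplot for every pair of columns
cm <- cor(vars)        # matching matrix of correlation coefficients
round(cm, 2)
```

Pairs whose scatterplot shows no clear trend should have coefficients close to 0 in the matrix.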