Some notes and pointers for R: introductory material and tutorials on the basic statistical treatment of quantitative data.

As an aside, would you believe that running R commands and generating plots makes data fun and surprising? Or is it just me? At least it's something different, and perhaps more interactive than doing the same thing in a spreadsheet. Did I say already that it felt like fun? Perhaps when generating those beautiful little graphs automatically?

Definitions: p-value, confidence interval, random variables, null distributions, central limit theorem, inference tests like t-test, association test, permutation test.
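Several of these terms show up together in one small simulation. The sketch below is only an illustration on made-up data (the group names, sample sizes, means and seed are assumptions, not taken from the course material): it draws two samples with rnorm( ), then reads the p-value and the 95% confidence interval from a two-sample t-test.

```r
set.seed(1)                                  # arbitrary seed, so the sketch is repeatable
control   <- rnorm(12, mean = 24, sd = 3)    # made-up sample 1
treatment <- rnorm(12, mean = 27, sd = 3)    # made-up sample 2

res <- t.test(treatment, control)            # Welch two-sample t-test (R's default)
res$p.value                                  # p-value under the null hypothesis of equal means
res$conf.int                                 # 95% confidence interval for the difference in means
```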

Getting started with R. The software first.

Download and install R from CRAN http://cran.r-project.org/

Download and install RStudio for the desktop (the same people also run Shiny) http://www.rstudio.com/products/rstudio/download/

**Q: Am I on the latest version of R?**

Check that you're on the latest version; if not, download the latest package (on MacOS use Safari).

Run the md5 checker from the terminal and verify the result against the checksum given for the download.

md5 R-3.2.2.pkg

MD5 (R-3.2.2.pkg) = dd8999f50c5d4e392832797d091642db

Then check that the installation works: run R.app and type version at the R console to verify that the expected version is running.
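The same checks can also be made from inside R. A minimal sketch: R.version.string and getRversion( ) report the running version, and md5sum( ) from the tools package (which ships with R) computes a checksum. The file used here is a throwaway temporary file standing in for the downloaded installer, so the checksum it produces is not the one shown above.

```r
R.version.string                  # the full version string of the running R
getRversion() >= "3.2.2"          # TRUE if the running R is at least version 3.2.2

# Stand-in file; point md5sum() at the downloaded .pkg to check the real thing
f <- tempfile(fileext = ".txt")
writeLines("hello", f)
checksum <- unname(tools::md5sum(f))
checksum                          # a 32-character hexadecimal string
```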

**Q: Same again: am I on the latest version of RStudio?**

Check that you're on the latest version of RStudio; if not, download the latest package (on MacOS use Safari), then verify its checksum in the same way.

md5 RStudio-0.99.489.dmg

MD5 (RStudio-0.99.489.dmg) = 05cf866b07df6552583f98314ed09d38

Again, check that the installation works: run RStudio and go to RStudio>About RStudio to see the version window.

**Links for learning.**

- The R Project for Statistical Computing
- R Introduction by Chi Yau
- R Tutorial by Chi Yau on elementary statistics with quantitative data
- Using R for Multivariate Analysis

Pointers to external material (from the EdX course)

- R reference card (PDF) by Tom Short (more can be found under Short Documents and Reference Cards here)
- Quick-R: quick online reference for data input, basic statistics and plots http://www.statmethods.net/
- Thomas Girke's R & Bioconductor manuals
- R programming class on Coursera, taught by Roger Peng, Jeff Leek and Brian Caffo
- The R class from Code School is also a good place to start: http://tryr.codeschool.com/

The underlying concepts.

- Vectors (one or more elements in a single row, all of one data type)
- Matrices (two-dimensional arrays, all elements of one data type)
- Lists (like vectors, but the elements can be of mixed data types and can be organised into named components, accessed as $xxx)
- Data frames (a special form of R list that looks like a matrix: each column is a vector of a single data type, but different columns can hold different types. Can be addressed in different ways, by named component or by index location)
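The four structures can be built and inspected side by side. This is a small sketch with made-up values (the names and weights are purely illustrative), showing how each one is created and the different ways a data frame can be addressed.

```r
v  <- c(1.5, 2, 3.25)                         # vector: one data type
m  <- matrix(1:6, nrow = 2)                   # matrix: 2 rows, 3 columns, one data type
l  <- list(name = "Cow", weight = 600)        # list: mixed types, named components
df <- data.frame(name = c("Cow", "Cheetah"),
                 weight = c(600, 50),
                 stringsAsFactors = FALSE)    # data frame: one vector per column

class(v); class(m); class(l); class(df)       # what R calls each structure

l$name             # list component by name
df$weight          # data frame column by name
df[2, "weight"]    # one cell, by row number and column name
df[[1]]            # first column, by index location
```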

Matrix commands

> m <- rbind(c(1,4),c(2,2)) # rbind( ) is a function for row bind. cbind( ) is the corresponding function for column bind.

> m

[,1] [,2]

[1,] 1 4

[2,] 2 2

> m %*% c(1,1) # matrix multiplication operator

[,1]

[1,] 5

[2,] 4

> m[1,2] # return value at matrix index location 1,2

[1] 4

> m[2,2] # return value at matrix index location 2,2

[1] 2

> m[1,] # row 1, shows how to extract submatrices from a matrix

[1] 1 4

> m[,2] # column 2, shows how to extract submatrices from a matrix

[1] 4 2
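A few more matrix operations follow the same pattern. The sketch below rebuilds the same matrix with cbind( ) and contrasts the element-wise product with the matrix product; the variable name m2 is introduced only for this illustration.

```r
m  <- rbind(c(1, 4), c(2, 2))    # the matrix from above, built row by row
m2 <- cbind(c(1, 2), c(4, 2))    # the same matrix, built column by column
identical(m, m2)                 # TRUE

dim(m)                           # number of rows and columns
t(m)                             # transpose: rows become columns
m * m                            # element-wise product, not matrix multiplication
m %*% m                          # matrix product
```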

A series of setup steps in R/RStudio (and using stuff at https://github.com/genomicsclass/dagdata):

> install.packages("devtools")

> library(devtools)

> install_github("genomicsclass/dagdata")

> install_github("ririzarr/rafalib")

> 1:10

[1] 1 2 3 4 5 6 7 8 9 10

> x <- 1:10

> y <- rnorm(10)

> plot(x,y)

> ?read.csv

> dat <- read.csv("femaleMiceWeights.csv")

You might need to set the working directory (see ?setwd) so that read.csv can find the right file.
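A minimal sketch of that check, assuming the CSV has been saved locally. The folder in the commented-out setwd( ) call is hypothetical and should be replaced with wherever the file actually lives.

```r
getwd()                           # the current working directory
# setwd("~/data")                 # hypothetical folder; change it to where the CSV lives

f <- "femaleMiceWeights.csv"
if (file.exists(f)) {
  dat <- read.csv(f)              # the file is in the working directory, so read it
} else {
  message("Cannot find ", f, " in ", getwd())
}
```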

> class(dat)

> head(dat)

> dim(dat)

> dat <- read.csv("msleep_ggplot2.csv") # switch to the mammal sleep data, which the commands below use

> dat$order

> colnames(dat)

> dat$sleep_total

> c(dat$sleep_total, 1000)

> plot(dat$brainwt, dat$sleep_total)

> plot(dat$brainwt, dat$sleep_total, log="x")

> summary(dat)

> dat[c(1,2),] # the first two rows

> dat[ dat$sleep_total > 18, ] # rows where sleep_total is above 18

> dat$sleep_total[ c(1,2)]

> dat[dat$sleep_total > 18,6] # column 6 only, for the same rows

> mean(dat[ dat$sleep_total > 18,6 ])

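> which(dat$sleep_total>18) # row indices where the condition is TRUE

> dat$sleep_total[which(dat$sleep_total>18)]

> dat$sleep_total[22]

> which(dat$sleep_rem<3)

> which(dat$sleep_rem<3 & dat$sleep_total>18) # both conditions must hold

> sort(dat$sleep_total) # the values in increasing order

> order(dat$sleep_total) # the indices that would put the values in increasing order

> dat$sleep_total[order(dat$sleep_total)] # the sorted values, via order( )

> rank(dat$sleep_total) # the rank of each value; ties share an averaged rank

> rank(c(1,2,2,3))

> match(c("Cow", "Owl monkey", "Cheetah"), dat$name) # position of the first match of each name

> idx <- match(c("Cow", "Owl monkey", "Cheetah"), dat$name)

> dat[idx,] # the comma selects rows; dat[idx] would try to select columns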
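The indexing commands above can be tried without the CSV files. The sketch below uses a tiny made-up data frame (the object name toy and its values are invented for illustration) to show logical subsetting, which( ), order( ), rank( ) and match( ) side by side.

```r
# Made-up values, for illustration only
toy <- data.frame(name = c("Cow", "Cheetah", "Owl monkey", "Bat"),
                  sleep_total = c(4, 12, 17, 19.5),
                  stringsAsFactors = FALSE)

toy[toy$sleep_total > 15, ]                  # logical index: rows where the test is TRUE
which(toy$sleep_total > 15)                  # the same rows as positions: 3 4
order(toy$sleep_total, decreasing = TRUE)    # positions from longest to shortest sleeper: 4 3 2 1
rank(c(1, 2, 2, 3))                          # ties get the average rank: 1.0 2.5 2.5 4.0

idx <- match(c("Cow", "Bat"), toy$name)      # positions of the names: 1 4
toy[idx, ]                                   # rows by position (note the comma)
```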

**Q: Running a simple correlation analysis in R**

Useful videos (link) (link2)

The presence and strength of a linear relationship or correlation between two quantitative variables can be tested by calculating the correlation coefficient (r) between the two variables.

> countries <- read.csv('countries.csv') # named countries, to match the commands that follow

> head(countries)

country year gsli.total gsli.financial gsli.people u5_mortality_rate life_expectancy unemployment_rate

1 Argentina 2004 5.07 3.25 0.74 17.9 74.5 12.2

2 Argentina 2005 5.05 3.14 0.93 17.1 75.0 10.6

etc.

Run a correlation test between the two variables (vectors) we want to compare, i.e. countries$gsli.total and countries$unemployment_rate.
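One way to run such a test is cor.test( ), which by default computes Pearson's r together with a p-value and a confidence interval. The sketch below is hedged: countries.csv is not reproduced here, so it uses simulated numbers in a data frame called toy (a name and values invented for this illustration) with the same two column names.

```r
set.seed(42)                                   # arbitrary seed, so the sketch is repeatable
x <- rnorm(30)
toy <- data.frame(gsli.total = x,
                  unemployment_rate = -0.6 * x + rnorm(30, sd = 0.5))  # simulated, not real data

res <- cor.test(toy$gsli.total, toy$unemployment_rate)
res$estimate    # the correlation coefficient r
res$p.value     # p-value for the null hypothesis that the true correlation is 0
res$conf.int    # 95% confidence interval for r
```

With the real data loaded, the two arguments would be countries$gsli.total and countries$unemployment_rate.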

**Q: Making a pairs plot to highlight possible correlations between multiple variables**

Produce all possible pair-wise plots (link)

> names(countries)

[1] "country" "year"

[3] "gsli.total" "gsli.financial"

[5] "gsli.people" "u5mr_mortality_rate_median"

[7] "life_expectancy_from_birth" "GET_UR"

> pairs(countries[,3:8])

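Pairs plot for the countries dataset (read each panel's axes as x versus y, for example gsli.total v gsli.financial).

By inspection, gsli.total is the only variable without an obvious possible correlation with the others.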
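A numeric companion to the pairs plot is the matrix of correlation coefficients from cor( ). The sketch below uses four columns of R's built-in mtcars data as a stand-in, since countries.csv is not reproduced here; the object name cars4 is introduced only for this illustration.

```r
cars4 <- mtcars[, c("mpg", "disp", "hp", "wt")]   # built-in data set, four numeric columns

pairs(cars4)                 # one scatter plot per pair of columns
cm <- cor(cars4)             # matching matrix of correlation coefficients
round(cm, 2)                 # rounded for easier reading
```

Each off-diagonal entry of the matrix corresponds to one panel of the plot, so a panel that looks like a tight line should have a coefficient close to 1 or -1.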