Some notes and pointers for R: introductory material and tutorials for basic statistical treatment of quantitative data.
As an aside, would you believe that running R commands and generating plots makes data fun and surprising? Or is it just me? At least it's something different, and perhaps more interactive than doing the same thing in a spreadsheet. Did I already say it feels like fun? Perhaps especially when those beautiful little graphs are generated automatically.
Definitions: p-value, confidence interval, random variables, null distributions, central limit theorem, inference tests like t-test, association test, permutation test.
Getting started with R. The software first.
Download and install R from CRAN http://cran.r-project.org/
Download and install RStudio for the desktop (the same people also operate Shiny) http://www.rstudio.com/products/rstudio/download/
Q: Am I on the latest version of R?
Check you're on the latest version; if not, download the latest package (on MacOS use Safari).
Run the md5 checker from the terminal and verify the checksum.
md5 R-3.2.2.pkg
MD5 (R-3.2.2.pkg) = dd8999f50c5d4e392832797d091642db
Then check the installation works: run R.app and type version at the R console to verify that the expected version is running.
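The same checks can also be done from inside R. This is a minimal sketch: the temporary file only stands in for a downloaded installer (an assumption for illustration), and the version string is compared with 3.2.2 simply because that is the installer named in these notes.

```r
# Print the running R version; compare it by eye with the latest release on CRAN
R.version.string
getRversion()

# Compare against a minimum version (3.2.2 is the installer named in these notes)
getRversion() >= "3.2.2"

# tools::md5sum() computes the MD5 checksum of a file from inside R.
# A temporary file stands in for the downloaded installer (illustration only).
tmp <- tempfile(fileext = ".txt")
writeLines("hello", tmp)
checksum <- unname(tools::md5sum(tmp))
checksum
```

For a real download, pass the path of the .pkg or .dmg file to tools::md5sum() and compare the result with the published checksum.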
Q: Same again: am I on the latest version of RStudio?
Check you're on the latest version; if not, download the latest package (on MacOS use Safari).
md5 RStudio-0.99.489.dmg
MD5 (RStudio-0.99.489.dmg) = 05cf866b07df6552583f98314ed09d38
Again, check the installation works: run RStudio and go to RStudio > About RStudio to see the version window.
Links for learning.
- The R Project for Statistical Computing
- R Introduction by Chi Yau
- R Tutorial by Chi Yau on elementary statistics with quantitative data
- Using R for Multivariate Analysis
Pointers to external material (from the EdX course)
- R reference card (PDF) by Tom Short (more can be found under Short Documents and Reference Cards here)
- Quick-R: quick online reference for data input, basic statistics and plots
- Thomas Girke's R & Bioconductor manuals
- R programming class on Coursera, taught by Roger Peng, Jeff Leek and Brian Caffo
- The R class from Code School is also a good place to start: http://tryr.codeschool.com/
- Quick-R: http://www.statmethods.net/
The underlying concepts.
- Vectors (one or more elements, single data type)
- Matrices (two-dimensional arrays, single data type)
- Lists (ordered collections of mixed data types, optionally organised into named components accessed as $xxx)
- Data Frames (a table-like form of R list: each column is a vector of a single data type, but different columns can hold different types. Can be addressed in different ways, by named component or by index location)
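The four structures above can be sketched in a few lines. All names and values here (v, m, l, df, the mouse weights and diets) are made up for illustration.

```r
# Vector: one data type, one or more elements
v <- c(2.5, 3.1, 4.8)

# Matrix: two-dimensional, one data type
m <- matrix(1:6, nrow = 2)

# List: mixed data types in named components, accessed with $
l <- list(name = "mouse", weights = v, treated = TRUE)
l$name

# Data frame: a list of equal-length column vectors, one type per column
df <- data.frame(id = 1:3, weight = v, diet = c("chow", "hf", "chow"))
df$weight      # by named component
df[2, "diet"]  # by row index and column name
df[2, 3]       # by index location
```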
Matrix commands
> m <- rbind(c(1,4),c(2,2)) # rbind( ) is a function for row bind. cbind( ) is the corresponding function for column bind.
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    2
> m %*% c(1,1) # matrix multiplication operator
     [,1]
[1,]    5
[2,]    4
> m[1,2] # return value at matrix index location 1,2
[1] 4
> m[2,2] # return value at matrix index location 2,2
[1] 2
> m[1,] # row 1, shows how to extract submatrices from a matrix
[1] 1 4
> m[,2] # column 2, shows how to extract submatrices from a matrix
[1] 4 2
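A few more matrix operations, as a small sketch that reuses the matrix m from above. The name n is introduced here only for illustration; it is the cbind( ) counterpart mentioned in the first comment.

```r
m <- rbind(c(1, 4), c(2, 2))   # same matrix as above, built row by row
n <- cbind(c(1, 2), c(4, 2))   # the same matrix, built column by column
identical(m, n)                # the two constructions agree

dim(m)                         # number of rows and columns
t(m)                           # transpose
m[, 2, drop = FALSE]           # keep column 2 as a 2x1 matrix rather than a vector
m * m                          # element-wise product, unlike %*%
```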
A series of setup steps in R/RStudio (using the data at https://github.com/genomicsclass/dagdata):
> install.packages("devtools")
> library(devtools)
> install_github("genomicsclass/dagdata")
> install_github("ririzarr/rafalib")
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> x <- 1:10
> y <- rnorm(10)
> plot(x,y)
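> ?read.csv
> dat <- read.csv("femaleMiceWeights.csv") # mice weights data
> class(dat)
> head(dat)
> dim(dat)
> dat <- read.csv("msleep_ggplot2.csv") # the commands below use the mammal sleep columns (order, sleep_total, brainwt)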
> dat$order
> colnames(dat)
> dat$sleep_total
> c(dat$sleep_total, 1000)
> plot(dat$brainwt, dat$sleep_total)
> plot(dat$brainwt, dat$sleep_total, log="x")
> summary(dat)
> dat[c(1,2),]
> dat[ dat$sleep_total > 18, ]
> dat$sleep_total[ c(1,2)]
> dat[dat$sleep_total > 18,6]
> mean(dat[ dat$sleep_total > 18,6 ])
> dat <- read.csv("msleep_ggplot2.csv")
> which(dat$sleep_total>18)
> dat$sleep_total[which(dat$sleep_total>18)]
> dat$sleep_total[22]
> which(dat$sleep_rem<3)
> which(dat$sleep_rem<3 & dat$sleep_total>18)
> sort(dat$sleep_total)
> order(dat$sleep_total)
> dat$sleep_total[order(dat$sleep_total)]
> rank(dat$sleep_total)
> rank(c(1,2,2,3))
> match(c("Cow", "Owl monkey", "Cheetah"), dat$name)
> idx <- match(c("Cow", "Owl monkey", "Cheetah"), dat$name)
> dat[idx,] # the comma selects rows; dat[idx] would try to select columns
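The indexing commands above are easier to follow on a data frame small enough to check by eye. This is a sketch only: toy and its values are illustrative stand-ins, not the course data.

```r
# Toy data frame; the values are for illustration only
toy <- data.frame(
  name        = c("Cow", "Cheetah", "Owl monkey", "Big brown bat"),
  sleep_total = c(4.0, 12.1, 17.0, 19.7),
  stringsAsFactors = FALSE
)

which(toy$sleep_total > 18)       # positions where the condition holds
toy[toy$sleep_total > 18, ]       # rows where the condition holds

sort(toy$sleep_total)             # the values, smallest first
order(toy$sleep_total)            # the positions that would sort the values
toy$name[order(toy$sleep_total)]  # names from least to most sleep
rank(c(1, 2, 2, 3))               # ties share the average rank: 1.0 2.5 2.5 4.0

idx <- match(c("Cow", "Owl monkey"), toy$name)  # positions of the first matches
toy[idx, ]                        # the comma selects rows
```

sort( ) returns values while order( ) returns positions, which is why order( ) is the one to use when reordering a second column or a whole data frame.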
Q: Running a simple correlation analysis in R
Useful videos (link) (link2)
The presence and strength of a linear relationship or correlation between two quantitative variables can be tested by calculating the correlation coefficient (r) between the two variables.
> countries <- read.csv('countries.csv')
> head(countries)
    country year gsli.total gsli.financial gsli.people u5_mortality_rate life_expectancy unemployment_rate
1 Argentina 2004       5.07           3.25        0.74              17.9            74.5              12.2
2 Argentina 2005       5.05           3.14        0.93              17.1            75.0              10.6
etc.
Run a correlation test between the two variables (vectors) we want to compare, i.e. countries$gsli.total and countries$unemployment_rate.
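One way to run such a test is cor.test( ), sketched here on made-up data because countries.csv is not reproduced in these notes. Only the two column names follow the notes; the data frame toy and its values are assumptions for illustration.

```r
# Made-up stand-in for the countries data; only the column names follow the notes
set.seed(42)
toy <- data.frame(gsli.total = seq(4, 6, length.out = 20))
toy$unemployment_rate <- 20 - 2 * toy$gsli.total + rnorm(20)

r  <- cor(toy$gsli.total, toy$unemployment_rate)       # correlation coefficient
ct <- cor.test(toy$gsli.total, toy$unemployment_rate)  # Pearson test by default
ct$estimate   # same r as above
ct$p.value    # p-value for the null hypothesis of zero correlation
ct$conf.int   # 95% confidence interval for r
```

With the real data, the equivalent call would be cor.test(countries$gsli.total, countries$unemployment_rate).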
Q: Making a pairs plot to highlight possible correlations between multiple variables
Produce all possible pair-wise plots (link)
> names(countries)
[1] "country"                    "year"
[3] "gsli.total"                 "gsli.financial"
[5] "gsli.people"                "u5mr_mortality_rate_median"
[7] "life_expectancy_from_birth" "GET_UR"
> pairs(countries[,3:8])
By inspection the only variable without obvious possible correlation is gsli.total.
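A numeric cross-check for what a pairs plot suggests is the correlation matrix from cor( ). The sketch below uses a made-up data frame (toy, with columns a, b and c) standing in for the numeric columns of the countries data.

```r
# Made-up numeric columns standing in for countries[, 3:8]
set.seed(7)
toy <- data.frame(a = rnorm(30))
toy$b <- toy$a + rnorm(30, sd = 0.3)   # built to correlate with a
toy$c <- rnorm(30)                     # generated independently of a and b

pairs(toy)        # scatterplot for every pair of columns
cm <- cor(toy)    # matrix of pairwise correlation coefficients
round(cm, 2)
```

Entries of the matrix near 1 or -1 correspond to the panels in the pairs plot where the points fall close to a straight line.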
Note for the read.csv calls above: you might need to set the working directory first (with setwd) so that read.csv can find the right file.
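Setting and checking the working directory could look like the sketch below. The temporary folder and the file example.csv are assumptions for illustration; with the course data, the folder would be wherever the CSV files were saved.

```r
# A throwaway CSV in a temporary folder stands in for the course data file
data_dir <- tempdir()
write.csv(data.frame(x = 1:3), file.path(data_dir, "example.csv"), row.names = FALSE)

getwd()                          # where R currently looks for relative paths
old <- setwd(data_dir)           # change directory; setwd() returns the previous one
file.exists("example.csv")       # TRUE once the working directory is right
dat <- read.csv("example.csv")
setwd(old)                       # restore the original working directory

# Alternative: leave the working directory alone and give the full path
dat2 <- read.csv(file.path(data_dir, "example.csv"))
```

In RStudio the same change can be made from the Session menu, but doing it in code keeps the steps reproducible.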
Figure: pairs plot for the countries dataset, produced by pairs(countries[,3:8]) (read each panel's axes as x-y, for example gsli.total v gsli.financial).