The MGS Blog

Tuesday, April 18, 2017

R and Harvard's PH525.1x MOOC

Having enrolled in "Statistics and R for the Life Sciences", run on EdX by Harvard (MOOC module PH525.1x), here are some notes and pointers for R: introductory material and tutorials on the basic statistical treatment of quantitative data.
As an aside, would you believe that running R commands and generating plots makes data fun and surprising? Or is it just me? At least it's something different, perhaps more interactive than doing the same thing in a spreadsheet. Did I say already that it felt like fun? Perhaps it's generating those beautiful little graphs automatically.

Definitions: p-value, confidence interval, random variables, null distributions, central limit theorem, inference tests like t-test, association test, permutation test.

Getting started with R. The software first.
Download and install R from CRAN http://cran.r-project.org/
Download and install RStudio for the desktop (the same people also run Shiny) http://www.rstudio.com/products/rstudio/download/

Q: Am I on the latest version of R?
If not, download the latest package (on macOS use Safari).
Run the md5 checker from the terminal and compare the result with the published checksum.
md5 R-3.2.2.pkg
MD5 (R-3.2.2.pkg) = dd8999f50c5d4e392832797d091642db
Then check that the installation works: run R.app and type version at the R console to confirm that the expected version is running.
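The version check can also be done programmatically. This is a minimal sketch using base R only; the threshold "3.2.2" is simply the package version quoted above, not a recommendation.

```r
# Print the full version string of the running R session.
print(R.version.string)

# getRversion() returns a version object that can be compared
# directly against a version written as a string.
installed <- getRversion()
is_current <- installed >= "3.2.2"  # "3.2.2" is the version quoted in these notes
print(is_current)
```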

Q: Same question: am I on the latest version of RStudio?
If not, download the latest package (on macOS use Safari), then run the md5 checker as before.
md5 RStudio-0.99.489.dmg
MD5 (RStudio-0.99.489.dmg) = 05cf866b07df6552583f98314ed09d38
Again, check that the installation works: run RStudio and go to RStudio > About RStudio to see the version window.

Links for learning.
Pointers to external material (from the EdX course)
The underlying concepts.
  • Vectors (single or multi-element row, single data type)
  • Matrix (multi-element array, single data type)
  • Lists (a vector of mixed data types, organised into named components $xxx)
  • Data Frames (an array-like form of R list: a table of mixed data types in which each column is a vector of a single type. Can be addressed in different ways, by named component or by index location)
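The four structures listed above can be sketched in a few lines of base R. All names and values here are invented for illustration and do not come from the course data.

```r
# A vector: every element has the same type.
v <- c(1.5, 2.0, 3.5)

# A matrix: a two-dimensional array, again of a single type.
m <- matrix(1:6, nrow = 2)

# A list: components may differ in type and are reached with $name.
l <- list(name = "mouse", weight = 21.5, treated = TRUE)

# A data frame: a list of equal-length column vectors, shown as a table.
df <- data.frame(
  id = 1:3,
  diet = c("chow", "hf", "chow"),
  weight = c(21.5, 25.3, 22.0),
  stringsAsFactors = FALSE
)

print(class(v))
print(dim(m))
print(l$weight)        # list component by name
print(df$weight[2])    # data frame column by name, then element by index
print(df[2, 3])        # the same cell by row and column index
```

The last two lines address the same cell, which is what "by named component or by index location" means in practice.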
Matrix commands
> m <- rbind(c(1,4),c(2,2))  # rbind( ) is a function for row bind. cbind( ) is the corresponding function for column bind.
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    2
> m %*% c(1,1)  # matrix multiplication operator
     [,1]
[1,]    5
[2,]    4
> m[1,2] # return value at matrix index location 1,2
[1] 4
> m[2,2] # return value at matrix index location 2,2
[1] 2
> m[1,] # row 1, shows how to extract submatrices from a matrix
[1] 1 4
> m[,2] # column 2, shows how to extract submatrices from a matrix
[1] 4 2

A series of setup steps in R/RStudio (and using stuff at https://github.com/genomicsclass/dagdata):
> install.packages("devtools")
> library(devtools)
> install_github("genomicsclass/dagdata")
> install_github("ririzarr/rafalib")
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> x <- 1:10
> y <- rnorm(10)
> plot(x,y)
> ? read.csv
> dat <- read.csv("femaleMiceWeights.csv")
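You might need to set the working directory first (check it with getwd(), change it with setwd()) so that read.csv can find the file.

The commands below use columns such as sleep_total and brainwt, which belong to the mammal sleep dataset (msleep_ggplot2.csv) rather than the mice weights, so load that file before running them:
> dat <- read.csv("msleep_ggplot2.csv")
> class(dat)
> head(dat)
> dim(dat)
> dat$order
> colnames(dat)
> dat$sleep_total
> c(dat$sleep_total, 1000)
> plot(dat$brainwt, dat$sleep_total)
> plot(dat$brainwt, dat$sleep_total, log="x")
> summary(dat)
> dat[c(1,2),]
> dat[ dat$sleep_total > 18, ]
> dat$sleep_total[ c(1,2)]
> dat[dat$sleep_total > 18,6]
> mean(dat[ dat$sleep_total > 18,6 ])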
> which(dat$sleep_total>18)
> dat$sleep_total[which(dat$sleep_total>18)]
> dat$sleep_total[22]
> which(dat$sleep_rem<3)
> which(dat$sleep_rem<3 & dat$sleep_total>18)
> sort(dat$sleep_total)
> order(dat$sleep_total)
> dat$sleep_total[order(dat$sleep_total)]
> rank(dat$sleep_total)
> rank(c(1,2,2,3))
> match(c("Cow", "Owl monkey", "Cheetah"), dat$name)
> idx <- match(c("Cow", "Owl monkey", "Cheetah"), dat$name)
> dat[idx,]  # the trailing comma selects rows; dat[idx] would select columns instead
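To see what which, sort, order, rank and match each return, here is a small sketch on a toy data frame. The names and sleep values are invented so that the results are easy to check by eye; only the column names mirror the dataset used above.

```r
# Toy data (invented values, not taken from msleep_ggplot2.csv).
sleep <- data.frame(
  name = c("A", "B", "C", "D"),
  sleep_total = c(12.1, 19.7, 4.0, 19.9),
  stringsAsFactors = FALSE
)

# which() gives the positions where a logical condition is TRUE.
long_sleepers <- which(sleep$sleep_total > 18)

# sort() returns the values in increasing order;
# order() returns the positions that would sort them.
sorted <- sort(sleep$sleep_total)
ord <- order(sleep$sleep_total)

# rank() gives each value's rank; ties share the average rank.
r <- rank(c(1, 2, 2, 3))

# match() gives the position of each requested name in the name column,
# in the order requested, which can then be used to pull out whole rows.
idx <- match(c("D", "A"), sleep$name)
rows <- sleep[idx, ]

print(long_sleepers)
print(ord)
print(r)
print(rows)
```

Indexing the column with the result of order() gives the same vector as sort(), which is a useful way to remember the difference between the two.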

Q: Running a simple correlation analysis in R
Useful videos (link) (link2)
The presence and strength of a linear relationship or correlation between two quantitative variables can be tested by calculating the correlation coefficient (r) between the two variables.

> countries <- read.csv('countries.csv')
> head(countries)
    country year gsli.total gsli.financial gsli.people u5_mortality_rate life_expectancy unemployment_rate
1 Argentina 2004       5.07           3.25        0.74              17.9            74.5              12.2
2 Argentina 2005       5.05           3.14        0.93              17.1            75.0              10.6
etc.

Run a correlation test between the two variables (vectors) we want to compare, i.e. countries$gsli.total and countries$unemployment_rate.
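A minimal sketch of such a test follows. Because countries.csv is not reproduced here, it uses R's built-in mtcars dataset as a stand-in (an assumption made purely for illustration). With the countries data loaded, the equivalent call would pass countries$gsli.total and countries$unemployment_rate as the two arguments.

```r
# cor.test() computes Pearson's r by default, together with a
# p-value and a confidence interval for the correlation.
# mtcars is a stand-in dataset shipped with R, not the countries data.
result <- cor.test(mtcars$wt, mtcars$mpg)

print(result$estimate)   # the correlation coefficient r
print(result$p.value)    # p-value for the null hypothesis r = 0
print(result$conf.int)   # 95% confidence interval for r
```

A value of r near -1 or 1 indicates a strong linear relationship, and a small p-value suggests the correlation is unlikely to be zero.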

Q: Making a pairs plot to highlight possible correlations between multiple variables
Produce all possible pair-wise plots (link)
> names(countries)
[1] "country"                    "year"                  
[3] "gsli.total"                 "gsli.financial"        
[5] "gsli.people"                "u5mr_mortality_rate_median"
[7] "life_expectancy_from_birth" "GET_UR"                
> pairs(countries[,3:8])
pairs-plot for the countries dataset (for axes read x-y as illustrated gsli.total v gsli-fin)
By inspection the only variable without obvious possible correlation is gsli.total.
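Reading correlations off a pairs plot by eye can be backed up with a correlation matrix. The sketch below again uses the built-in mtcars dataset as a stand-in (an assumption, since countries.csv is not available here). For the countries data the same idea would be cor() applied to countries[,3:8], possibly with use = "complete.obs" if any values are missing.

```r
# Pick a few numeric columns from the stand-in dataset.
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# cor() on a data frame of numeric columns returns the matrix of
# pairwise Pearson correlations; round for easier reading.
cm <- round(cor(vars), 2)
print(cm)
```

Each off-diagonal entry corresponds to one panel of the pairs plot, so entries close to zero flag the pairs with no obvious linear relationship.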

Learning online

Is Python a good language to learn programming?

A question regularly posed to Slashdot followers is "how to become a programmer?" or "what programming language is the best to learn?" There are no easy answers, because it is not easy to learn to program, nor is there one best language for learning to program or for programming 'professionally'. However, a broad consensus exists that Python is useful both as a language for learning how to program and as a programming language in its own right, suitable for developing serious software applications.

A terminal session driving my python program.


Learn Python The Hard Way (2nd Ed) by Zed A. Shaw is a sufficiently challenging yet productive step-wise set of exercises that you can use to gradually learn both how to wrangle your computer and how to program. You will learn that the answer is both out there (thanks search engines and people who post to blogs or groups) so long as you ask the question, and within you if you work hard enough to discover and understand why something doesn't work the way you expect it to work. You will find that docs.python.org is an essential resource for understanding the whats and whys of Python, and Wikipedia an aid for learning fundamental concepts.

Which Python course does Google use? Google's own Python class offers a well-paced introduction to the language: https://developers.google.com/edu/python/

Thursday, April 13, 2017

[guest lecture] Edna Hogan: NGOs, ICT, Tanzania, Africa

Our guest lecture by Edna Lyatuu Hogan took place at 12:00 midday on Tuesday 11th April in N303. The topic was how NGOs revolutionise ICT for development, with a focus on technology and innovation in Tanzania.

Vikas, Edna and Anna at the start of the lecture.
Edna concluded by setting a challenge: to support social enterprises in Tanzania, which benefit from the sales of her book "POEMS for the Soul and the Bold Minds!: Life and Love Poems" on Kindle.

Saturday, April 1, 2017

Adam Grant on "are you a giver or a taker?"

Adam Grant's take on the false economy of performance review, firing underachievers, and all that other bad stuff that happens to good people.


I suppose it's no accident that the constructs he talks about and uses in his research include help-givers, help-seekers, givers and takers.