Testing Hypotheses Using Permutation

Dilini Talagala

Data structures

Learn more on data structures in R - read 'Advanced R' by Hadley Wickham

Data structures

R's base data structures can be organised by their dimensionality (1d, 2d, or nd) and by whether they are homogeneous or heterogeneous.
The most commonly used data structures in data analysis:

Homogeneous

(All contents must be of the same type)

Atomic vector [1d]

Matrix [2d]

Array [nd]

Heterogeneous

(The contents can be of different types)

List [1d]

Data frame [2d]

1. Vectors

Vectors in R are either
- atomic vectors or
- lists

1.1 Atomic vectors

All elements of an atomic vector must be the same type.
Common types of atomic vectors:

c(0.5, 0.6, 0.7) ## numeric (double)

## [1] 0.5 0.6 0.7

# With the L suffix, you get an integer rather than a double
c(1L, 2L, 3L) ## integer

## [1] 1 2 3

c(TRUE, FALSE, TRUE) ## logical

## [1]  TRUE FALSE  TRUE

c("a", "b", "c") ## character

## [1] "a" "b" "c"
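
Because all elements must share one type, mixing types inside c() does not raise an error; R silently coerces everything to the most flexible type present. The short sketch below is an added illustration (it is not part of the original slides) and uses only base R.

```r
# Mixing a number, a string and a logical: everything becomes character
mixed <- c(1, "a", TRUE)
typeof(mixed)
mixed

# Integer combined with double becomes double
typeof(c(1L, 2.5))

# Logical combined with integer becomes integer
typeof(c(TRUE, 2L))
```

Checking typeof() after combining values is a quick way to spot an unintended coercion.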

1.2 Lists

Lists are different from atomic vectors because their elements can be of different types, including lists.

x <- list(a = 1:3,  b = c(TRUE, FALSE, TRUE), 
          c = c(2.3, 5.9), d = list(y = c(1,2,3), z = c("A", "B")))
x

## $a
## [1] 1 2 3
## 
## $b
## [1]  TRUE FALSE  TRUE
## 
## $c
## [1] 2.3 5.9
## 
## $d
## $d$y
## [1] 1 2 3
## 
## $d$z
## [1] "A" "B"

x$b

## [1]  TRUE FALSE  TRUE

x$d$z

## [1] "A" "B"

str(x)

## List of 4
##  $ a: int [1:3] 1 2 3
##  $ b: logi [1:3] TRUE FALSE TRUE
##  $ c: num [1:2] 2.3 5.9
##  $ d:List of 2
##   ..$ y: num [1:3] 1 2 3
##   ..$ z: chr [1:2] "A" "B"

2. Matrices and arrays

Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array.
A special case of the array is the matrix, which has two dimensions.
Matrices are common. Arrays are much rarer.

2.1 Matrix

# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
a

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

a[2, 3] # a[row, column]

## [1] 6

a[ , 3] # third column

## [1] 5 6

a[1, ] # first row

## [1] 1 3 5

is.matrix(a)

## [1] TRUE

is.array(a)

## [1] TRUE

2.2 Array

# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

3. Data frames

A data frame is the most common way of storing data in R.
A few data frames that we are already familiar with: economics and gapminder.

library(dplyr)
data(economics, package = "ggplot2")
glimpse(economics)

## Observations: 574
## Variables: 6
## $ date     <date> 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967...
## $ pce      <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534....
## $ pop      <int> 198712, 198911, 199113, 199311, 199498, 199657, 19980...
## $ psavert  <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6,...
## $ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4...
## $ unemploy <int> 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877,...

data(gapminder, package = "gapminder")
glimpse(gapminder)

## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Managing data frames with the dplyr package

Managing data frames with the dplyr package

Learn more on 'Managing data frames with the dplyr package' - read 'R Programming for Data Science' by Roger D. Peng
Some of the key "verbs" provided by the dplyr package are
- select(): return a subset of the columns of a data frame
- filter(): extract a subset of rows from a data frame
- arrange(): reorder rows of a data frame
- rename(): rename variables in a data frame
- mutate(): add new variables/columns or transform existing variables
- group_by(): take an existing tbl and convert it into a grouped tbl
- summarise(): generate summary statistics of different variables in the data frame, possibly within groups
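
The verbs listed above can be tried on a small data frame. The sketch below is an added illustration: the scores data frame and its column names are hypothetical, made up only for this example, and the calls are written without the pipe.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
scores <- data.frame(
  name  = c("A", "B", "C", "D"),
  group = c("x", "x", "y", "y"),
  mark  = c(50, 70, 60, 90),
  stringsAsFactors = FALSE
)

select(scores, name, mark)        # keep only two columns
filter(scores, mark >= 60)        # keep rows with a mark of at least 60
arrange(scores, desc(mark))       # sort from highest to lowest mark
rename(scores, score = mark)      # rename mark to score
mutate(scores, pct = mark / 100)  # add a new column

# group_by() followed by summarise() gives one row per group
by_group <- summarise(group_by(scores, group), mean_mark = mean(mark))
by_group
```

Each verb takes a data frame as its first argument and returns a new data frame, leaving scores itself unchanged.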

%>% operator

"Ceci n'est pas une pipe" - (This is not a pipe)

%>% operator

%>%: the "pipe" operator is used to connect multiple functions in a sequence of operations.

Format: second_fun(first_fun(x))

Nested calls have to be read from the inside out, which makes a sequence of operations difficult to read.

summarise(group_by(gapminder, continent), max = max(lifeExp))

## # A tibble: 5 x 2
##   continent    max
##      <fctr>  <dbl>
## 1    Africa 76.442
## 2  Americas 80.653
## 3      Asia 82.603
## 4    Europe 81.757
## 5   Oceania 81.235

The %>% operator makes code more readable.
It reads more naturally, in a left-to-right fashion.

Format: x %>% first_fun() %>% second_fun()

gapminder %>%
  group_by(continent) %>%
  summarise(max = max(lifeExp))

## # A tibble: 5 x 2
##   continent    max
##      <fctr>  <dbl>
## 1    Africa 76.442
## 2  Americas 80.653
## 3      Asia 82.603
## 4    Europe 81.757
## 5   Oceania 81.235

Once you travel down the pipeline with %>%, the first argument is taken to be the output of the previous function in the pipeline.
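
A small added sketch of this rule (not from the original slides): the value on the left of %>% becomes the first argument of the call on the right, and any further arguments are written as usual. The vector x is an arbitrary example.

```r
library(dplyr)  # loading dplyr makes %>% available

x <- c(1, 4, 9)

# These two expressions are equivalent
nested <- sum(sqrt(x))
piped  <- x %>% sqrt() %>% sum()
identical(nested, piped)

# Extra arguments go after the piped-in first argument:
# this is the same as round(c(2.345, 3.141), digits = 1)
c(2.345, 3.141) %>% round(digits = 1)
```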

Creating data frames with the tibble package

Creating a data frame with the tibble package

Learn more on tibbles - read
- 'R for Data Science' by Garrett Grolemund and Hadley Wickham
- RStudio blog

     vignette("tibble")

A data frame can be created using tibble().

library(tibble)
df <- tibble(x = 1:3, y = 3:1)
df

## # A tibble: 3 x 2
##       x     y
##   <int> <int>
## 1     1     3
## 2     2     2
## 3     3     1

# The add_row() and add_column() functions allow
# control over where the new rows/columns are added
df %>% 
  add_row(x = 4, y = 0, .before = 2)

## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     3
## 2     4     0
## 3     2     2
## 4     3     1

df %>%
  add_column(z = -1:1, .after = "x")

## # A tibble: 3 x 3
##       x     z     y
##   <int> <int> <int>
## 1     1    -1     3
## 2     2     0     2
## 3     3     1     1

Subsetting

# Extract by name
df$x

## [1] 1 2 3

df[["x"]]

## [1] 1 2 3

# Extract by position
df[[1]]

## [1] 1 2 3

# To use in a pipe, use
# the special placeholder .:
df %>% .$x

## [1] 1 2 3

df %>% .[["x"]]

## [1] 1 2 3

yawn_expt <- tibble(group = c(rep("control", 16),
                              rep("treatment", 34)),
                    yawn = c(rep("no", 12), rep("yes", 4),
                             rep("no", 24), rep("yes", 10)))

Let's take a look at the data frame we created

#Print out the first few rows
head(yawn_expt)

## # A tibble: 6 x 2
##     group  yawn
##     <chr> <chr>
## 1 control    no
## 2 control    no
## 3 control    no
## 4 control    no
## 5 control    no
## 6 control    no

#Get a glimpse of your data.
glimpse(yawn_expt)

## Observations: 50
## Variables: 2
## $ group <chr> "control", "control", "control", "control", "control", "...
## $ yawn  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "n...

#Print out the last few rows
tail(yawn_expt)

## # A tibble: 6 x 2
##       group  yawn
##       <chr> <chr>
## 1 treatment   yes
## 2 treatment   yes
## 3 treatment   yes
## 4 treatment   yes
## 5 treatment   yes
## 6 treatment   yes

Creating a contingency table from a data frame

library(dplyr)
library(tidyr)
library(knitr)
yawn_expt %>%
   group_by(group, yawn) %>% 
   tally() %>%
   ungroup() %>%
   spread(yawn, n) %>% 
   mutate(total = rowSums(.[-1]))

## # A tibble: 2 x 4
##       group    no   yes total
##       <chr> <int> <int> <dbl>
## 1   control    12     4    16
## 2 treatment    24    10    34

Your turn

Compute the proportion of the treatment and control groups who yawned. Add this to the table.
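
One possible approach to this exercise is sketched below; it is an added illustration, not the author's solution. It assumes the yawn_expt tibble defined earlier (recreated here so the block runs on its own), uses count() as a shorthand for group_by() plus tally(), and the column name prop_yawned is an arbitrary choice.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Recreate the experiment data so this block is self-contained
yawn_expt <- tibble(
  group = c(rep("control", 16), rep("treatment", 34)),
  yawn  = c(rep("no", 12), rep("yes", 4), rep("no", 24), rep("yes", 10))
)

yawn_tbl <- yawn_expt %>%
  count(group, yawn) %>%             # counts per group and yawn outcome
  spread(yawn, n) %>%                # one column per yawn outcome
  mutate(total = no + yes,
         prop_yawned = yes / total)  # proportion who yawned in each group
yawn_tbl
```

The proportions are 4/16 for the control group and 10/34 for the treatment group.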

Permutation Test

prop_dif <- function(data){
  dtbl <- data %>%
    mutate(yawn = sample(yawn)) # Permute the yawn variable
  # Your turn to compute the difference
  # between the proportions of the treatment and control groups
  return(pdif)
}
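
A minimal sketch of how the function skeleton could be completed is given below. It is an added illustration under stated assumptions, not the author's solution: the difference is taken as treatment minus control (the sign convention is an assumption), and yawn_expt is recreated so the block runs on its own.

```r
library(dplyr)
library(tibble)

# Recreate the experiment data so this block is self-contained
yawn_expt <- tibble(
  group = c(rep("control", 16), rep("treatment", 34)),
  yawn  = c(rep("no", 12), rep("yes", 4), rep("no", 24), rep("yes", 10))
)

prop_dif <- function(data) {
  dtbl <- data %>%
    mutate(yawn = sample(yawn)) %>%        # permute the yawn variable
    group_by(group) %>%
    summarise(prop = mean(yawn == "yes"))  # proportion who yawned per group
  # Assumed convention: treatment proportion minus control proportion
  pdif <- dtbl$prop[dtbl$group == "treatment"] -
    dtbl$prop[dtbl$group == "control"]
  return(pdif)
}

set.seed(1)
one_dif <- prop_dif(yawn_expt)
one_dif
```

Each call shuffles the yawn labels afresh, so repeated calls give different values unless the seed is reset.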

Setting the random number seed

Setting the random number seed with set.seed() ensures reproducibility of the sequence of random numbers.

Compare the outputs of the following commands:

set.seed(100)
rnorm(5)

## [1] -0.50219235  0.13153117 -0.07891709  0.88678481  0.11697127

rnorm(5)

## [1]  0.3186301 -0.5817907  0.7145327 -0.8252594 -0.3598621

set.seed(100)
rnorm(5)

## [1] -0.50219235  0.13153117 -0.07891709  0.88678481  0.11697127

Run the function 10000 times, saving the results

set.seed(444)
# Here we create a numeric vector (initially all zeros)
# of length 10000 to store our results
pdif <- numeric(10000)
# Your turn to write the for-loop
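
One way the loop could look is sketched below. This is an added illustration, not the author's solution. To keep the block self-contained and quick to run, it uses a base R stand-in for prop_dif() (an assumption; the slides build the function with dplyr) and takes the difference as treatment minus control.

```r
# Recreate the experiment data as a plain data frame
yawn_expt <- data.frame(
  group = rep(c("control", "treatment"), times = c(16, 34)),
  yawn  = c(rep("no", 12), rep("yes", 4), rep("no", 24), rep("yes", 10)),
  stringsAsFactors = FALSE
)

# Base R stand-in for the exercise's prop_dif()
prop_dif <- function(data) {
  yawn <- sample(data$yawn)  # permute the yawn variable
  mean(yawn[data$group == "treatment"] == "yes") -
    mean(yawn[data$group == "control"] == "yes")
}

set.seed(444)
pdif <- numeric(10000)
for (i in seq_along(pdif)) {
  pdif[i] <- prop_dif(yawn_expt)  # store one permuted difference per iteration
}
summary(pdif)
```

Pre-allocating pdif and filling it by index avoids growing the vector inside the loop.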

Plotting with ggplot2

Histogram

library(ggplot2)
# 'economics' is the name of the data frame and
# it has a variable called 'pce'.
ggplot(data = economics, aes(x = pce)) +
  geom_histogram(binwidth = 500, colour = "blue", fill = "lightblue") +
  geom_vline(xintercept = 10000, colour = "red")

binwidth is the width of the histogram bins

Your turn

Make a histogram of the results.
Draw a vertical line on the plot that represents the difference observed in the actual data.

pdif <- data.frame(pdif)
# your turn to use ggplot to produce the histogram
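
A possible version of the plot is sketched below as an added illustration, not the author's solution. So that the block runs on its own, it regenerates the permutation results with a base R stand-in for prop_dif() (an assumption), takes the difference as treatment minus control, and computes the observed difference from the counts in the contingency table (10/34 and 4/16). The bin width is an illustrative choice.

```r
library(ggplot2)

# Regenerate the permutation results so this block is self-contained
group <- rep(c("control", "treatment"), times = c(16, 34))
yawn  <- c(rep("no", 12), rep("yes", 4), rep("no", 24), rep("yes", 10))

set.seed(444)
pdif <- numeric(10000)
for (i in seq_along(pdif)) {
  shuffled <- sample(yawn)  # permute the yawn variable
  pdif[i] <- mean(shuffled[group == "treatment"] == "yes") -
    mean(shuffled[group == "control"] == "yes")
}

# Observed difference in the actual data (treatment minus control)
obs_dif <- 10 / 34 - 4 / 16

pdif <- data.frame(pdif)
p <- ggplot(data = pdif, aes(x = pdif)) +
  # binwidth equals the spacing between possible differences
  geom_histogram(binwidth = 1 / 34 + 1 / 16,
                 colour = "blue", fill = "lightblue") +
  geom_vline(xintercept = obs_dif, colour = "red")
p
```

The position of the red line relative to the bulk of the histogram shows how unusual the observed difference is under random relabelling.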

Most of the material I've used here is based on

'Advanced R' by Hadley Wickham

'R Programming for Data Science' by Roger D. Peng

'R for Data Science' by Garrett Grolemund and Hadley Wickham

Happy learning with R :)
