Testing Hypotheses Using Permutation

Dilini Talagala

1 / 31

Data structures

  • R's base data structures can be organised by their dimensionality (1d, 2d, or nd) and by whether they are homogeneous or heterogeneous.

  • The data structures most commonly used in data analysis:

    • Homogeneous (all contents must be of the same type): atomic vector [1d], matrix [2d], array [nd]

    • Heterogeneous (the contents can be of different types): list [1d], data frame [2d]

3 / 31

1. Vectors

  • Vectors in R are either

    • atomic vectors or

    • lists

4 / 31

1.1 Atomic vectors

  • All elements of an atomic vector must be the same type.
  • Common types of atomic vectors:
c(0.5, 0.6, 0.7) ## numeric (double)
## [1] 0.5 0.6 0.7
# With the L suffix, you get an integer rather than a double
c(1L, 2L, 3L) ## integer
## [1] 1 2 3
c(TRUE, FALSE, TRUE) ## logical
## [1] TRUE FALSE TRUE
c("a", "b", "c") ## character
## [1] "a" "b" "c"
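As a small illustration of the same-type rule, the sketch below shows what c() does when it is given mixed types: it coerces everything to the most flexible type present. The variable names are chosen only for this illustration.

```r
# Mixing types in c() triggers coercion to a single common type.
int_dbl <- c(1L, 2.5)   # integer + double   -> double
lgl_int <- c(TRUE, 2L)  # logical + integer  -> integer
num_chr <- c(1, "a")    # double + character -> character

typeof(int_dbl)  # "double"
typeof(lgl_int)  # "integer"
typeof(num_chr)  # "character"
```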
5 / 31

1.2 Lists

  • Lists are different from atomic vectors because their elements can be of different types, including lists.
x <- list(a = 1:3, b = c(TRUE, FALSE, TRUE),
c = c(2.3, 5.9), d = list(y = c(1,2,3), z = c("A", "B")))
x
## $a
## [1] 1 2 3
##
## $b
## [1] TRUE FALSE TRUE
##
## $c
## [1] 2.3 5.9
##
## $d
## $d$y
## [1] 1 2 3
##
## $d$z
## [1] "A" "B"
6 / 31
x$b
## [1] TRUE FALSE TRUE
x$d$z
## [1] "A" "B"
str(x)
## List of 4
## $ a: int [1:3] 1 2 3
## $ b: logi [1:3] TRUE FALSE TRUE
## $ c: num [1:2] 2.3 5.9
## $ d:List of 2
## ..$ y: num [1:3] 1 2 3
## ..$ z: chr [1:2] "A" "B"
7 / 31

2. Matrices and arrays

  • Adding a dim attribute to an atomic vector (for example with dim<-) makes it behave like a multi-dimensional array.

  • A special case of the array is the matrix, which has two dimensions.

  • Matrices are common. Arrays are much rarer.
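The points above can be sketched in base R by setting the dim attribute directly. This is only an illustration; the object names v and w are not from the slides.

```r
# Start from a plain atomic vector and give it dimensions.
v <- 1:6
dim(v) <- c(2, 3)       # 2 rows, 3 columns, filled column by column
is.matrix(v)            # TRUE: two dimensions make it a matrix

# Three dimensions give an array that is not a matrix.
w <- 1:12
dim(w) <- c(2, 3, 2)
is.array(w)             # TRUE
is.matrix(w)            # FALSE
```

Setting dim this way gives the same layout as calling matrix() or array() with the same dimensions.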

8 / 31

2.1 Matrix

# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
a
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
a[2, 3] # a[row, column]
## [1] 6
a[, 3] # third column
## [1] 5 6
a[1, ] # first row
## [1] 1 3 5
is.matrix(a)
## [1] TRUE
is.array(a)
## [1] TRUE
9 / 31

2.2 Array

# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
b
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
10 / 31

3. Data frames

  • A data frame is the most common way of storing data in R.

  • A few data frames that we are already familiar with: economics, gapminder

library(dplyr)
data(economics, package = "ggplot2")
glimpse(economics)
## Observations: 574
## Variables: 6
## $ date <date> 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967...
## $ pce <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534....
## $ pop <int> 198712, 198911, 199113, 199311, 199498, 199657, 19980...
## $ psavert <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6,...
## $ uempmed <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4...
## $ unemploy <int> 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877,...
11 / 31
data(gapminder, package = "gapminder")
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
12 / 31

Managing data frames with the
dplyr package

13 / 31

Managing data frames with the dplyr package

  • To learn more about managing data frames with the dplyr package, read 'R Programming for Data Science' by Roger D. Peng.

  • Some of the key "verbs" provided by the dplyr package are

    • select(): return a subset of the columns of a data frame

    • filter(): extract a subset of rows from a data frame

    • arrange(): reorder rows of a data frame

    • rename(): rename variables in a data frame

    • mutate(): add new variables/columns or transform existing variables

    • group_by(): takes an existing tbl and converts it into a grouped tbl

    • summarise(): generate summary statistics of different variables in the data frame, possibly within groups
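A minimal sketch of these verbs applied one step at a time, assuming the dplyr package is installed. It uses the built-in mtcars data frame rather than the data sets from the slides, and the intermediate object names are purely illustrative.

```r
library(dplyr)

cars <- select(mtcars, mpg, cyl, wt)        # keep three columns
cars <- filter(cars, wt < 4)                # keep rows for lighter cars
cars <- rename(cars, weight = wt)           # rename a variable
cars <- mutate(cars, kpl = mpg * 0.425144)  # new column: miles/gallon -> km/litre

by_cyl <- group_by(cars, cyl)               # group rows by number of cylinders
cars_summary <- summarise(by_cyl,
                          n = n(),
                          mean_kpl = mean(kpl))  # one summary row per group
cars_summary <- arrange(cars_summary, desc(mean_kpl))  # highest mean first
cars_summary
```

Each verb takes a data frame as its first argument and returns a new data frame, which is why the calls can be chained.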

14 / 31

%>% operator

"Ceci n'est pas une pipe" - (This is not a pipe)

15 / 31

%>% operator

  • %>%: the "pipe" operator is used to connect multiple functions in a sequence of operations.

Nested format: second_fun(first_fun(x))

  • A nested sequence of operations is difficult to read, because it must be read from the inside out
summarise(group_by(gapminder, continent), max = max(lifeExp))
## # A tibble: 5 x 2
## continent max
## <fctr> <dbl>
## 1 Africa 76.442
## 2 Americas 80.653
## 3 Asia 82.603
## 4 Europe 81.757
## 5 Oceania 81.235
16 / 31
  • %>% operator makes code more readable

  • Reads more naturally in a left-to-right fashion.

Format: x %>% first_fun() %>% second_fun()

gapminder %>%
group_by(continent) %>%
summarise(max = max(lifeExp))
## # A tibble: 5 x 2
## continent max
## <fctr> <dbl>
## 1 Africa 76.442
## 2 Americas 80.653
## 3 Asia 82.603
## 4 Europe 81.757
## 5 Oceania 81.235
  • At each step of the pipeline, %>% passes the output of the previous function as the first argument of the next function.
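To check that a pipeline and the nested form are the same computation, here is a small sketch on a plain numeric vector. It assumes dplyr is installed (loading it makes %>% available); the vector x and the result names are illustrative.

```r
library(dplyr)  # makes the %>% operator available

x <- c(1, 4, 9, 16)

# Nested form: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: read from left to right. Each result becomes
# the first argument of the next call.
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE
```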
17 / 31

Creating data frames with the
tibble package

18 / 31

Creating a data frame with the tibble package

vignette("tibble")
  • A data frame can be created using tibble().
library(tibble)
df <- tibble(x = 1:3, y = 3:1)
df
## # A tibble: 3 x 2
## x y
## <int> <int>
## 1 1 3
## 2 2 2
## 3 3 1
19 / 31
# The add_row() and add_column() functions allow
# control over where the new rows/columns are added
df %>%
add_row(x = 4, y = 0, .before = 2)
## # A tibble: 4 x 2
## x y
## <dbl> <dbl>
## 1 1 3
## 2 4 0
## 3 2 2
## 4 3 1
df %>%
add_column(z = -1:1, .after = "x")
## # A tibble: 3 x 3
## x z y
## <int> <int> <int>
## 1 1 -1 3
## 2 2 0 2
## 3 3 1 1
20 / 31

Subsetting

# Extract by name
df$x
## [1] 1 2 3
df[["x"]]
## [1] 1 2 3
# Extract by position
df[[1]]
## [1] 1 2 3
# To use in a pipe, use
# the special placeholder .:
df %>% .$x
## [1] 1 2 3
df %>% .[["x"]]
## [1] 1 2 3
21 / 31
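# Yawning experiment data: 16 subjects in the control group
# and 34 in the treatment group, with whether each one yawned
yawn_expt <- tibble(group = c(rep("control", 16),
                              rep("treatment", 34)),
                    yawn = c(rep("no", 12), rep("yes", 4),
                             rep("no", 24), rep("yes", 10)))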
22 / 31

Let's take a look at the data frame we created

# Print out the first few rows
head(yawn_expt)
## # A tibble: 6 x 2
## group yawn
## <chr> <chr>
## 1 control no
## 2 control no
## 3 control no
## 4 control no
## 5 control no
## 6 control no
# Get a glimpse of your data
glimpse(yawn_expt)
## Observations: 50
## Variables: 2
## $ group <chr> "control", "control", "control", "control", "control", "...
## $ yawn <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "n...
# Print out the last few rows
tail(yawn_expt)
## # A tibble: 6 x 2
## group yawn
## <chr> <chr>
## 1 treatment yes
## 2 treatment yes
## 3 treatment yes
## 4 treatment yes
## 5 treatment yes
## 6 treatment yes
23 / 31

Creating a contingency table from a data frame

library(dplyr)
library(tidyr)
library(knitr)
yawn_expt %>%
group_by(group, yawn) %>%
tally() %>%
ungroup() %>%
spread(yawn, n) %>%
mutate(total = rowSums(.[-1]))
## # A tibble: 2 x 4
## group no yes total
## <chr> <int> <int> <dbl>
## 1 control 12 4 16
## 2 treatment 24 10 34

Your turn

Compute the proportion of the treatment and control groups who yawned. Add this to the table.
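One possible way to approach this exercise, offered as a sketch rather than the intended solution. It assumes dplyr and tibble are installed, rebuilds yawn_expt so that it runs on its own, and uses the illustrative names yawn_props and prop_yes.

```r
library(dplyr)
library(tibble)

# Same data as the yawn_expt tibble created earlier in the slides.
yawn_expt <- tibble(group = c(rep("control", 16), rep("treatment", 34)),
                    yawn = c(rep("no", 12), rep("yes", 4),
                             rep("no", 24), rep("yes", 10)))

# Count outcomes per group, then divide the "yes" count by the group total.
yawn_props <- yawn_expt %>%
  group_by(group) %>%
  summarise(no = sum(yawn == "no"),
            yes = sum(yawn == "yes"),
            total = n(),
            prop_yes = yes / total)
yawn_props
```

With these counts the control proportion is 4/16 = 0.25 and the treatment proportion is 10/34, roughly 0.294.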

24 / 31

Permutation Test

prop_dif <- function(data){
  dtbl <- data %>%
    mutate(yawn = sample(yawn)) # Permute the yawn variable
  # Your turn to compute the difference between the
  # proportions of the treatment and control groups
  return(pdif)
}
25 / 31

Setting the random number seed

  • Setting the random number seed with set.seed() ensures reproducibility of the sequence of random numbers.

Compare the outputs of the following commands:

set.seed(100)
rnorm(5)
## [1] -0.50219235 0.13153117 -0.07891709 0.88678481 0.11697127
rnorm(5)
## [1] 0.3186301 -0.5817907 0.7145327 -0.8252594 -0.3598621
set.seed(100)
rnorm(5)
## [1] -0.50219235 0.13153117 -0.07891709 0.88678481 0.11697127
26 / 31

Run the function 10000 times, saving the results

set.seed(444)
# Create a numeric vector of length 10000
# (initialised to zeros) to store our results
pdif <- numeric(10000)
## Your turn to write the for-loop
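For reference, the whole procedure could be sketched as follows. This is one possible approach written in base R so that it runs on its own, not the intended dplyr solution. The names yawn_data, prop_dif_sketch, perm_dif, observed_dif and prop_extreme are hypothetical and are not part of the slides.

```r
# Rebuild the experiment data with the same counts as yawn_expt.
yawn_data <- data.frame(
  group = c(rep("control", 16), rep("treatment", 34)),
  yawn = c(rep("no", 12), rep("yes", 4), rep("no", 24), rep("yes", 10)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: difference in the proportion who yawned
# (treatment minus control), optionally after permuting yawn.
prop_dif_sketch <- function(data, permute = TRUE) {
  yawn <- if (permute) sample(data$yawn) else data$yawn
  p_treatment <- mean(yawn[data$group == "treatment"] == "yes")
  p_control <- mean(yawn[data$group == "control"] == "yes")
  p_treatment - p_control
}

set.seed(444)
n_perm <- 10000
perm_dif <- numeric(n_perm)  # storage for the permuted differences
for (i in seq_len(n_perm)) {
  perm_dif[i] <- prop_dif_sketch(yawn_data)
}

# Difference for the actual (unpermuted) data.
observed_dif <- prop_dif_sketch(yawn_data, permute = FALSE)

# Share of permuted differences at least as large as the observed one.
prop_extreme <- mean(perm_dif >= observed_dif)
```

Pre-allocating the result vector and filling it by index avoids growing the vector inside the loop.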
27 / 31

Plotting with ggplot2

28 / 31

Histogram

library(ggplot2)
# 'economics' is the name of the data frame and
# it has a variable called 'pce'.
ggplot(data = economics, aes(x = pce)) +
  geom_histogram(binwidth = 500, colour = "blue", fill = "lightblue") +
  geom_vline(xintercept = 10000, colour = "red")

  • binwidth is the width of the histogram bins
29 / 31

Your turn

  1. Make a histogram of the results.

  2. Draw a vertical line on the plot that represents the difference for the actual data.

pdif <- data.frame(pdif)
# your turn to use ggplot to produce the histogram
30 / 31