R's base data structures can be organised by their dimensionality (1d, 2d, or nd) and by whether they are homogeneous (all elements of the same type) or heterogeneous (elements of different types).
The most commonly used data structures in data analysis:
Homogeneous
Atomic vector [1d]
Matrix [2d]
Array [nd]
Heterogeneous
List [1d]
Data frame [2d]
Vectors in R are either
atomic vectors or
lists
c(0.5, 0.6, 0.7) ## numeric (double)
## [1] 0.5 0.6 0.7
# With the L suffix, you get an integer rather than a double
c(1L, 2L, 3L) ## integer
## [1] 1 2 3
c(TRUE, FALSE, TRUE) ## logical
## [1] TRUE FALSE TRUE
c("a", "b", "c") ## character
## [1] "a" "b" "c"
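As a quick check of the four types above, typeof() reports how each vector is stored, and is.atomic() and is.list() tell the two kinds of vector apart. This is a minimal sketch in base R; the variable names (dbl_vec, int_vec, and so on) are made up for illustration.

```r
# Illustrative names only; the values mirror the examples above
dbl_vec <- c(0.5, 0.6, 0.7)
int_vec <- c(1L, 2L, 3L)
lgl_vec <- c(TRUE, FALSE, TRUE)
chr_vec <- c("a", "b", "c")

# Collect the storage type of each vector
types <- c(typeof(dbl_vec), typeof(int_vec), typeof(lgl_vec), typeof(chr_vec))
types

# Atomic vectors and lists are the two kinds of vector
is.atomic(dbl_vec)
is.list(list(1, "a"))

# Atomic vectors are homogeneous: mixing types in c() coerces
# every element to the most flexible type (here, character)
mixed <- c(1L, 2.5, "a")
typeof(mixed)
```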
x <- list(a = 1:3, b = c(TRUE, FALSE, TRUE), c = c(2.3, 5.9), d = list(y = c(1, 2, 3), z = c("A", "B")))
x
## $a
## [1] 1 2 3
##
## $b
## [1] TRUE FALSE TRUE
##
## $c
## [1] 2.3 5.9
##
## $d
## $d$y
## [1] 1 2 3
##
## $d$z
## [1] "A" "B"
x$b
## [1] TRUE FALSE TRUE
x$d$z
## [1] "A" "B"
str(x)
## List of 4
##  $ a: int [1:3] 1 2 3
##  $ b: logi [1:3] TRUE FALSE TRUE
##  $ c: num [1:2] 2.3 5.9
##  $ d:List of 2
##   ..$ y: num [1:3] 1 2 3
##   ..$ z: chr [1:2] "A" "B"
Adding a dim attribute to an atomic vector makes it behave like a multi-dimensional array.
A special case of the array is the matrix, which has two dimensions.
Matrices are common. Arrays are much rarer.
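To see the role of the dim attribute directly, the sketch below starts from a plain atomic vector and assigns dimensions with dim<-. It uses base R only; v and arr are illustrative names.

```r
# A plain atomic vector has no dim attribute
v <- 1:6
is.null(dim(v))

# Assigning two dimensions turns it into a matrix (filled column by column)
dim(v) <- c(2, 3)
v
is.matrix(v)

# Assigning three dimensions gives an array that is not a matrix
arr <- 1:12
dim(arr) <- c(2, 3, 2)
is.array(arr)
is.matrix(arr)
```

The same objects can be built with matrix() and array(), which set the dim attribute for you.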
# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
a[2, 3] # a[row, column]
## [1] 6
a[ , 3] # third column
## [1] 5 6
a[1, ] # first row
## [1] 1 3 5
is.matrix(a)
## [1] TRUE
is.array(a)
## [1] TRUE
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
b
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
A data frame is the most common way of storing data in R.
A few data frames that we are already familiar with: economics and gapminder
library(dplyr)
data(economics, package = "ggplot2")
glimpse(economics)
## Observations: 574
## Variables: 6
## $ date     <date> 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967...
## $ pce      <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534....
## $ pop      <int> 198712, 198911, 199113, 199311, 199498, 199657, 19980...
## $ psavert  <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6,...
## $ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4...
## $ unemploy <int> 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877,...
data(gapminder, package = "gapminder")
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
Some of the key "verbs" provided by the dplyr package are
select(): return a subset of the columns of a data frame
filter(): extract a subset of rows from a data frame
arrange(): reorder rows of a data frame
rename(): rename variables in a data frame
mutate(): add new variables/columns or transform existing variables
group_by(): take an existing tbl and convert it into a grouped tbl
summarise(): generate summary statistics of different variables in the data frame, possibly within groups
summarise(group_by(gapminder, continent), max = max(lifeExp))
## # A tibble: 5 x 2
##   continent    max
##      <fctr>  <dbl>
## 1    Africa 76.442
## 2  Americas 80.653
## 3      Asia 82.603
## 4    Europe 81.757
## 5   Oceania 81.235
The %>% (pipe) operator makes code more readable.
The code reads naturally from left to right: each step passes its result on to the next.
gapminder %>% group_by(continent) %>% summarise(max = max(lifeExp))
## # A tibble: 5 x 2
##   continent    max
##      <fctr>  <dbl>
## 1    Africa 76.442
## 2  Americas 80.653
## 3      Asia 82.603
## 4    Europe 81.757
## 5   Oceania 81.235
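The other verbs listed earlier chain together with %>% in the same way. The sketch below is only an illustration: toy is a small made-up tibble (not gapminder data), and the column names life_exp and life_exp_months are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, invented purely for illustration
toy <- tibble(
  country   = c("A", "B", "C", "D"),
  continent = c("X", "X", "Y", "Y"),
  lifeExp   = c(60, 70, 80, 75),
  pop       = c(1e6, 2e6, 3e6, 4e6)
)

result <- toy %>%
  filter(lifeExp > 65) %>%                     # keep rows with lifeExp above 65
  select(country, continent, lifeExp) %>%      # keep three of the columns
  rename(life_exp = lifeExp) %>%               # new_name = old_name
  mutate(life_exp_months = life_exp * 12) %>%  # add a derived column
  arrange(desc(life_exp))                      # sort from highest to lowest
result
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets the pipe hand one result to the next step.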
To learn more about tibbles, read
vignette("tibble")
library(tibble)
df <- tibble(x = 1:3, y = 3:1)
df
## # A tibble: 3 x 2
##       x     y
##   <int> <int>
## 1     1     3
## 2     2     2
## 3     3     1
# The add_row() / add_column() functions allow
# control over where the new rows/columns are added
df %>% add_row(x = 4, y = 0, .before = 2)
## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     3
## 2     4     0
## 3     2     2
## 4     3     1
df %>% add_column(z = -1:1, .after = "x")
## # A tibble: 3 x 3
##       x     z     y
##   <int> <int> <int>
## 1     1    -1     3
## 2     2     0     2
## 3     3     1     1
# Extract by name
df$x
## [1] 1 2 3
df[["x"]]
## [1] 1 2 3
# Extract by position
df[[1]]
## [1] 1 2 3
# To use in a pipe, use
# the special placeholder .:
df %>% .$x
## [1] 1 2 3
df %>% .[["x"]]
## [1] 1 2 3
yawn_expt <- tibble(
  group = c(rep("control", 16), rep("treatment", 34)),
  yawn = c(rep("no", 12), rep("yes", 4), rep("no", 24), rep("yes", 10))
)
Let's take a look at the data frame we created
# Print out the first few rows
head(yawn_expt)
## # A tibble: 6 x 2
##     group  yawn
##     <chr> <chr>
## 1 control    no
## 2 control    no
## 3 control    no
## 4 control    no
## 5 control    no
## 6 control    no
# Get a glimpse of your data.
glimpse(yawn_expt)
## Observations: 50
## Variables: 2
## $ group <chr> "control", "control", "control", "control", "control", "...
## $ yawn  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "n...
# Print out the last few rows
tail(yawn_expt)
## # A tibble: 6 x 2
##       group  yawn
##       <chr> <chr>
## 1 treatment   yes
## 2 treatment   yes
## 3 treatment   yes
## 4 treatment   yes
## 5 treatment   yes
## 6 treatment   yes
library(dplyr)
library(tidyr)
library(knitr)
yawn_expt %>%
  group_by(group, yawn) %>%
  tally() %>%
  ungroup() %>%
  spread(yawn, n) %>%
  mutate(total = rowSums(.[-1]))
## # A tibble: 2 x 4
##       group    no   yes total
##       <chr> <int> <int> <dbl>
## 1   control    12     4    16
## 2 treatment    24    10    34
Compute the proportion of the treatment and control groups who yawned. Add this to the table.
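One possible way to approach this, offered as a sketch rather than the intended solution: divide the "yes" count by the group total inside mutate(). The data are re-created here so the block runs on its own; yawn_tbl and the column name prop_yawn are hypothetical, and count() is used as a shorthand for group_by() followed by tally().

```r
library(dplyr)
library(tidyr)
library(tibble)

# Re-create the yawning data so the sketch is self-contained
yawn_expt <- tibble(
  group = c(rep("control", 16), rep("treatment", 34)),
  yawn  = c(rep("no", 12), rep("yes", 4), rep("no", 24), rep("yes", 10))
)

yawn_tbl <- yawn_expt %>%
  count(group, yawn) %>%            # one row per group/yawn combination
  spread(yawn, n) %>%               # one column each for "no" and "yes"
  mutate(total = no + yes,
         prop_yawn = yes / total)   # hypothetical column name
yawn_tbl
```

The proportions work out to 4/16 for the control group and 10/34 for the treatment group.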
prop_dif <- function(data){
  dtbl <- data %>%
    mutate(yawn = sample(yawn)) # Permute the yawn variable
  # Your turn to compute the difference
  # between the proportions of the treatment and control groups
  return(pdif)
}
Compare the outputs of the following commands:
set.seed(100)
rnorm(5)
## [1] -0.50219235 0.13153117 -0.07891709 0.88678481 0.11697127
rnorm(5)
## [1] 0.3186301 -0.5817907 0.7145327 -0.8252594 -0.3598621
set.seed(100)
rnorm(5)
## [1] -0.50219235 0.13153117 -0.07891709 0.88678481 0.11697127
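The comparison above can also be checked programmatically. In this sketch (the names first, second, and third are illustrative), resetting the seed reproduces the first set of draws, while a second call without resetting continues the random stream and gives different values.

```r
set.seed(100)
first  <- rnorm(5)   # draws 1 to 5 after seeding
second <- rnorm(5)   # draws 6 to 10 from the same stream

set.seed(100)
third  <- rnorm(5)   # seed reset, so these repeat the first five draws

identical(first, third)   # same seed, same numbers
identical(first, second)  # different part of the stream
```

This is why a simulation that uses sample() or rnorm() is usually preceded by set.seed(): anyone rerunning the code gets the same results.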
set.seed(444)
# here we create an empty numeric vector of
# length 10000 to store our results
pdif <- numeric(10000)
## Your turn to write the for-loop
library(ggplot2)
# 'economics' is the name of the data frame and
# it has a variable called 'pce'.
ggplot(data = economics, aes(x = pce)) +
  geom_histogram(binwidth = 500, colour = "blue", fill = "lightblue") +
  geom_vline(xintercept = 10000, colour = "red")
Make a histogram of the results.
Draw a vertical line on the plot that represents the difference for the actual data.
pdif <- data.frame(pdif)
# your turn to use ggplot to produce the histogram
Most of the material used here is based on
'Advanced R' by Hadley Wickham
'R Programming for Data Science' by Roger D. Peng
'R for Data Science' by Garrett Grolemund and Hadley Wickham