Data Visualization

If you’d like to learn more about the theoretical underpinnings of ggplot2 before you start, I’d recommend reading “The Layered Grammar of Graphics”, http://vita.had.co.nz/papers/layered-grammar.pdf.

Workflow

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.4
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

That one line of code loads the core tidyverse: packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).
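
If you need one of the masked base-R functions after loading the tidyverse, you can always reach it explicitly with its package prefix. A minimal sketch (presidents is a built-in time series, used here purely for illustration):

# dplyr::filter() masks stats::filter(), so the bare name now refers to the dplyr verb
dplyr::filter(mpg, cyl == 4)            # the tidyverse verb, written explicitly
stats::filter(presidents, rep(1/3, 3))  # the base-R time-series filter (a moving average)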

First Step

We’ll use the mpg dataset. A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.

Among the variables in mpg are:

  • displ, a car’s engine size, in litres.
  • hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
##  1 audi         a4        1.80  1999     4 auto(l~ f        18    29 p    
##  2 audi         a4        1.80  1999     4 manual~ f        21    29 p    
##  3 audi         a4        2.00  2008     4 manual~ f        20    31 p    
##  4 audi         a4        2.00  2008     4 auto(a~ f        21    30 p    
##  5 audi         a4        2.80  1999     6 auto(l~ f        16    26 p    
##  6 audi         a4        2.80  1999     6 manual~ f        18    26 p    
##  7 audi         a4        3.10  2008     6 auto(a~ f        18    27 p    
##  8 audi         a4 quat~  1.80  1999     4 manual~ 4        18    26 p    
##  9 audi         a4 quat~  1.80  1999     4 auto(l~ 4        16    25 p    
## 10 audi         a4 quat~  2.00  2008     4 manual~ 4        20    28 p    
## # ... with 224 more rows, and 1 more variable: class <chr>
with(mpg, plot(displ, hwy))

# create a ggplot
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the data argument, in this case, mpg.

Exercise

  1. Run ggplot(data = mpg). What do you see?

  2. How many rows are in mpg? How many columns?

  3. What does the drv variable describe? Read the help for ?mpg to find out.

  4. Make a scatterplot of hwy vs cyl.

  5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

# 1. 
ggplot(data=mpg) # see an empty plot

# 2.
dim(mpg) # 234  11
## [1] 234  11
# 3. f = front-wheel drive, r = rear wheel drive, 4 = 4wd
# 4. 
ggplot(data = mpg) + geom_point(aes(x = hwy, y = cyl))

# 5. 
ggplot(data = mpg) + geom_point(aes(x = class, y = drv))

# not useful because both x and y are categorical, with no numerical relationships between them.

Aesthetic Mappings

Problem: one group of points (highlighted in red in the original plot) seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. How can you explain these cars?

Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.
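
The same idea applies to the other aesthetics. As a quick sketch, mapping class to alpha or shape is also scaled automatically; note that ggplot2 only uses six shapes at a time by default, so one of the seven classes is dropped with a warning:

# map class to transparency (alpha) or to point shape instead of colour
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))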

Explanation: The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

Exercise

  1. What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

  2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

  3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

  4. What happens if you map the same variable to multiple aesthetics?

  5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

  6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

# 1. need to put color outside of aes()
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# 2. 
mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
##  1 audi         a4        1.80  1999     4 auto(l~ f        18    29 p    
##  2 audi         a4        1.80  1999     4 manual~ f        21    29 p    
##  3 audi         a4        2.00  2008     4 manual~ f        20    31 p    
##  4 audi         a4        2.00  2008     4 auto(a~ f        21    30 p    
##  5 audi         a4        2.80  1999     6 auto(l~ f        16    26 p    
##  6 audi         a4        2.80  1999     6 manual~ f        18    26 p    
##  7 audi         a4        3.10  2008     6 auto(a~ f        18    27 p    
##  8 audi         a4 quat~  1.80  1999     4 manual~ 4        18    26 p    
##  9 audi         a4 quat~  1.80  1999     4 auto(l~ 4        16    25 p    
## 10 audi         a4 quat~  2.00  2008     4 manual~ 4        20    28 p    
## # ... with 224 more rows, and 1 more variable: class <chr>
# from the column types we can see that the character variables are categorical; some of the integer variables, such as cyl (number of cylinders), can also be treated as categorical
# 3. 
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy, color = cty, shape = factor(cyl)), size = 2)

# 4. 
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy, color = cyl, shape = factor(cyl)), size = 2)

# 5. stroke modifies the width of the border for shapes that have one (shapes 21-24)
# 6. 
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy, color = displ < 5, shape = factor(cyl)), size = 2)

Facets

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cyl, nrow = 1)

To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

# If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(. ~ cyl)

Geometric Objects

Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses.

# left
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

# right
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'

geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess'

In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example).
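
You can also set the group aesthetic explicitly. As a minimal sketch, this draws a separate smooth for each value of drv, but adds no legend and doesn’t change the lines’ appearance:

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))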

Exercise

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
## Warning: package 'bindrcpp' was built under R version 3.4.4
## `geom_smooth()` using method = 'loess'

# Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

# Recreate the R code necessary to generate the following graphs.
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(size=3) + geom_smooth(size=2, se=FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(size=3) + geom_smooth(aes(group = drv), size=2, se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) + geom_point(size=3) + geom_smooth(size=2, se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = drv), size=3) + geom_smooth(size=2, se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = drv), size=3) + geom_smooth(aes(linetype = drv), size=2, se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(aes(fill = drv), shape = 21, color = "white", stroke = 2, size=3) + theme_dark()

Statistical Transformation

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(), and if you scroll down you can find a section called “Computed variables”. That describes how it computes two new variables: count and prop.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))

This works because every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation.

There are three reasons you might need to use a stat explicitly:

  1. You might want to override the default stat. In the code below, the stat of geom_bar() is changed from count (the default) to identity, which maps the height of the bars to raw values of a y variable that are already present in the data. (Unfortunately, when people talk about bar charts casually, they might mean either this type of bar chart, where the height of the bar is already present in the data, or the previous kind, where the height is generated by counting rows.)

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

  2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

  3. You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing:

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

In our proportion bar chart, we need to set group = 1. Why? In other words, why would the graph not be useful without it?

..prop.. computes proportions within groups. If we don’t specify that all of the data should be treated as one group, then geom_bar() treats each cut as a separate group, and the proportion of “Premium” diamonds that are “Premium” is obviously 1, so every bar ends up with height one.
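
To see the problem, here is roughly what the chart looks like without group = 1: within each cut, the proportion of that cut is 1, so every bar has the same height.

# without group = 1 the proportions are computed within each x group, so all bars reach 1
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))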

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut), size=2)

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity. If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))
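
position = “identity” places each object exactly where it falls in the context of the graph, so the bars overlap rather than stack; it’s mainly useful with some transparency or with hollow bars. A sketch:

# overlapping bars: make them slightly transparent, or unfilled with coloured outlines
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + 
  geom_bar(fill = NA, position = "identity")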

position = “fill” works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

position = “dodge” places overlapping objects directly beside one another. This makes it easier to compare individual values.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Use jitter in geom_point(): The values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. This arrangement makes it hard to see where the mass of the data is. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?

You can avoid this gridding by setting the position adjustment to jitter. position = “jitter” adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

Exercise

  1. Compare and contrast geom_jitter() with geom_count().

geom_jitter randomly moves the points to stop them overlapping. geom_count deterministically counts the points at a given point and maps them to the size of a single point. The determinism of geom_count makes it useful in discrete situations, but it does not work when the points are overlapping but not in exactly the same place.
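
A small sketch of the contrast, using cty and hwy (both rounded, so the points overlap heavily):

# geom_jitter() adds random noise; geom_count() sizes each point by how many rows share it
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_count()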

  2. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
ggplot(data = mpg) + geom_boxplot(aes(x = drv, y = hwy, color = class))

The default adjustment is position_dodge(): the boxes for the different classes are moved sideways by a discrete amount so that they don’t overlap.

Coordinate Systems

Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.

coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2 (which unfortunately we don’t have the space to cover in this book).

nz <- map_data("nz")
## Warning: package 'maps' was built under R version 3.4.4
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()

bar + coord_polar()

Exercise

  1. Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(data = mpg) + geom_bar(aes(x = drv, fill = class)) + coord_polar() + labs(fill = "Classes", x = "")

  2. What’s the difference between coord_quickmap() and coord_map()?

coord_quickmap() uses an approximation to the Mercator projection, whereas coord_map() can use a variety of projections from the mapproj package. This means that coord_quickmap() runs faster and doesn’t require additional packages, but it isn’t as accurate and won’t work well far from the equator.
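
A sketch of the difference using the New Zealand map from above; coord_map() assumes the mapproj package is installed:

# coord_quickmap(): fast approximation; coord_map(): a true map projection via mapproj
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_map()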

  3. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() + 
  coord_fixed(ratio = 1)

geom_abline() plots lines defined by a slope and an intercept. Used with no arguments, as here, it draws a line with slope 1 and intercept 0, i.e. one passing through the origin at 45 degrees. coord_fixed() is important because x and y are in the same units, so fixing the aspect ratio preserves the slope of that reference line; we can then see that city mileage is worse than highway mileage, and, more importantly, that the difference looks more like a constant offset than a multiplicative factor.

# ggplot(data = <DATA>) + 
#   <GEOM_FUNCTION>(
#      mapping = aes(<MAPPINGS>),
#      stat = <STAT>, 
#      position = <POSITION>
#   ) +
#   <COORDINATE_FUNCTION> +
#   <FACET_FUNCTION>

Data Transformation

This chapter will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.

library(nycflights13)
## Warning: package 'nycflights13' was built under R version 3.4.4
library(tidyverse)
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run View(flights) which will open the dataset in the RStudio viewer). It prints differently because it’s a tibble. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don’t need to worry about the differences; we’ll come back to tibbles in more detail in wrangle.

dplyr basics

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:

  1. The first argument is a data frame.
  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
  3. The result is a new data frame.

Filter rows

filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
filter(flights, !(arr_delay > 120 | dep_delay > 120))
## # A tibble: 316,050 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 316,040 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Missing values

One important feature of R that can make comparison tricky is missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
## [1] NA
# We don't know!

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly with is.na(x):

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 x 1
##       x
##   <dbl>
## 1    3.
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1   NA 
## 2    3.

Arrange

arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

Use desc() to re-order by a column in descending order:

arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900     1301.     1242
##  2  2013     6    15     1432           1935     1137.     1607
##  3  2013     1    10     1121           1635     1126.     1239
##  4  2013     9    20     1139           1845     1014.     1457
##  5  2013     7    22      845           1600     1005.     1044
##  6  2013     4    10     1100           1900      960.     1342
##  7  2013     3    17     2321            810      911.      135
##  8  2013     6    27      959           1900      899.     1236
##  9  2013     7    22     2257            759      898.      121
## 10  2013    12     5      756           1700      896.     1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Missing values are always sorted at the end:

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1    2.
## 2    5.
## 3   NA
arrange(df, desc(x))
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1    5.
## 2    2.
## 3   NA

Select

select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

# Select columns by name
select(flights, year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515        2.      830            819       11.
##  2      533            529        4.      850            830       20.
##  3      542            540        2.      923            850       33.
##  4      544            545       -1.     1004           1022      -18.
##  5      554            600       -6.      812            837      -25.
##  6      554            558       -4.      740            728       12.
##  7      555            600       -5.      913            854       19.
##  8      557            600       -3.      709            723      -14.
##  9      557            600       -3.      838            846       -8.
## 10      558            600       -2.      753            745        8.
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

There are a number of helper functions you can use within select(); a short example follows the list:

  • starts_with(“abc”): matches names that begin with “abc”.
  • ends_with(“xyz”): matches names that end with “xyz”.
  • contains(“ijk”): matches names that contain “ijk”.
  • matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
  • num_range(“x”, 1:3): matches x1, x2 and x3.
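
A short sketch of a few of these helpers applied to flights:

select(flights, starts_with("dep"))   # dep_time, dep_delay
select(flights, ends_with("delay"))   # dep_delay, arr_delay
select(flights, contains("time"))     # every column whose name contains "time"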

select() can be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename().

Rename

rename() is a variant of select() that keeps all the variables that aren’t explicitly mentioned:

rename(flights, tail_num = tailnum) # new name = original name
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00     227.  2013     1     1      517            515
##  2 2013-01-01 05:00:00     227.  2013     1     1      533            529
##  3 2013-01-01 05:00:00     160.  2013     1     1      542            540
##  4 2013-01-01 05:00:00     183.  2013     1     1      544            545
##  5 2013-01-01 06:00:00     116.  2013     1     1      554            600
##  6 2013-01-01 05:00:00     150.  2013     1     1      554            558
##  7 2013-01-01 06:00:00     158.  2013     1     1      555            600
##  8 2013-01-01 06:00:00      53.  2013     1     1      557            600
##  9 2013-01-01 06:00:00     140.  2013     1     1      557            600
## 10 2013-01-01 06:00:00     138.  2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

Mutate

mutate() always adds new columns at the end of your dataset so we’ll start by creating a narrower dataset so we can see the new variables. Remember that when you’re in RStudio, the easiest way to see all the columns is View().

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1        2.       11.    1400.     227.   -9.  370.
##  2  2013     1     1        4.       20.    1416.     227.  -16.  374.
##  3  2013     1     1        2.       33.    1089.     160.  -31.  408.
##  4  2013     1     1       -1.      -18.    1576.     183.   17.  517.
##  5  2013     1     1       -6.      -25.     762.     116.   19.  394.
##  6  2013     1     1       -4.       12.     719.     150.  -16.  288.
##  7  2013     1     1       -5.       19.    1065.     158.  -24.  404.
##  8  2013     1     1       -3.      -14.     229.      53.   11.  259.
##  9  2013     1     1       -3.       -8.     944.     140.    5.  405.
## 10  2013     1     1       -2.        8.     733.     138.  -10.  319.
## # ... with 336,766 more rows

Note that you can refer to columns that you’ve just created:

mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay distance air_time  gain hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1        2.       11.    1400.     227.   -9. 3.78 
##  2  2013     1     1        4.       20.    1416.     227.  -16. 3.78 
##  3  2013     1     1        2.       33.    1089.     160.  -31. 2.67 
##  4  2013     1     1       -1.      -18.    1576.     183.   17. 3.05 
##  5  2013     1     1       -6.      -25.     762.     116.   19. 1.93 
##  6  2013     1     1       -4.       12.     719.     150.  -16. 2.50 
##  7  2013     1     1       -5.       19.    1065.     158.  -24. 2.63 
##  8  2013     1     1       -3.      -14.     229.      53.   11. 0.883
##  9  2013     1     1       -3.       -8.     944.     140.    5. 2.33 
## 10  2013     1     1       -2.        8.     733.     138.  -10. 2.30 
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

If you only want to keep the new variables, use transmute():

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1   -9. 3.78          -2.38
##  2  -16. 3.78          -4.23
##  3  -31. 2.67         -11.6 
##  4   17. 3.05           5.57
##  5   19. 1.93           9.83
##  6  -16. 2.50          -6.40
##  7  -24. 2.63          -9.11
##  8   11. 0.883         12.5 
##  9    5. 2.33           2.14
## 10  -10. 2.30          -4.35
## # ... with 336,766 more rows

Creation functions

  • Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
## # A tibble: 336,776 x 3
##    dep_time  hour minute
##       <int> <dbl>  <dbl>
##  1      517    5.    17.
##  2      533    5.    33.
##  3      542    5.    42.
##  4      544    5.    44.
##  5      554    5.    54.
##  6      554    5.    54.
##  7      555    5.    55.
##  8      557    5.    57.
##  9      557    5.    57.
## 10      558    5.    58.
## # ... with 336,766 more rows
  • Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.
(x <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
lag(x)
##  [1] NA  1  2  3  4  5  6  7  8  9
lead(x)
##  [1]  2  3  4  5  6  7  8  9 10 NA
  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
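
A minimal sketch of the cumulative functions on a small vector (the rolling-window case is omitted here, since RcppRoll isn’t loaded):

x <- 1:10
cumsum(x)    # running total:          1  3  6 10 15 21 28 36 45 55
cummean(x)   # running mean (dplyr):   1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
cummax(x)    # running maximum:        1  2  3  4  5  6  7  8  9 10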

  • Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.

y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1]  1  2  2 NA  4  5
min_rank(desc(y)) # desc() transforms a vector into a format that will be sorted in descending order
## [1]  5  3  3 NA  2  1
  • If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile(). See their help pages for more details.
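
A quick sketch of how the variants behave on the same vector y used above:

row_number(y)    # ties broken by position: 1  2  3 NA  4  5
dense_rank(y)    # ties share a rank, with no gaps: 1  2  2 NA  3  4
percent_rank(y)  # min_rank() rescaled to [0, 1]
cume_dist(y)     # proportion of values less than or equal to each value
ntile(y, 2)      # breaks the values into 2 roughly equal-sized buckets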

Summarise

summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

Pipe

Imagine that we want to explore the relationship between the distance and average delay for each location. There are three steps to prepare this data:

  1. Group flights by destination.
  2. Summarise to compute distance, average delay, and number of flights.
  3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

Written step by step (as in the sketch below), this code is a little frustrating because we have to give each intermediate data frame a name, even though we don’t care about it. Naming things is hard, so this slows down our analysis.
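
For comparison, here is a sketch of the step-by-step version with named intermediates (by_dest and delay are just placeholder names):

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")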

There’s another way to tackle the same problem with the pipe, %>%:

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z), and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We’ll use piping frequently from now on because it considerably improves the readability of code.
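
As a tiny illustration, these two calls are equivalent:

# x %>% f(y) is just f(x, y): both lines return the flights that departed on January 1
flights %>% filter(month == 1, day == 1)
filter(flights, month == 1, day == 1)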

Counts

Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay)
  )

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_histogram(binwidth = 10)

Wow, there are some planes that have an average delay of 5 hours (300 minutes)!

The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay:

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_jitter(size=0.3)

The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases.

When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. It’s a bit painful that you have to switch from %>% to +, but once you get the hang of it, it’s quite convenient.

delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends the previously sent chunk from the editor to the console. This is very convenient when you're (e.g.) exploring the value of n in the example above. You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of n and press Cmd/Ctrl + Shift + P to resend the complete block.

Other summary functions

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
  )
## # A tibble: 365 x 5
## # Groups:   year, month [?]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # ... with 355 more rows
# When do the first and last flights leave each day?
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
## # A tibble: 365 x 5
## # Groups:   year, month [?]
##     year month   day first  last
##    <int> <int> <int> <dbl> <dbl>
##  1  2013     1     1  517. 2356.
##  2  2013     1     2   42. 2354.
##  3  2013     1     3   32. 2349.
##  4  2013     1     4   25. 2358.
##  5  2013     1     5   14. 2357.
##  6  2013     1     6   16. 2355.
##  7  2013     1     7   49. 2359.
##  8  2013     1     8  454. 2351.
##  9  2013     1     9    2. 2252.
## 10  2013     1    10    3. 2320.
## # ... with 355 more rows
  • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first_dep = first(dep_time), 
    last_dep = last(dep_time)
  )
## # A tibble: 365 x 5
## # Groups:   year, month [?]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # ... with 355 more rows
not_cancelled %>% 
  group_by(year, month, day) %>% 
  mutate(r = min_rank(desc(dep_time))) %>% 
  filter(r %in% range(r))
## # A tibble: 770 x 20
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1     2356           2359       -3.      425
##  3  2013     1     2       42           2359       43.      518
##  4  2013     1     2     2354           2359       -5.      413
##  5  2013     1     3       32           2359       33.      504
##  6  2013     1     3     2349           2359      -10.      434
##  7  2013     1     4       25           2359       26.      505
##  8  2013     1     4     2358           2359       -1.      429
##  9  2013     1     4     2358           2359       -1.      436
## 10  2013     1     5       14           2359       15.      503
## # ... with 760 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, r <int>
  • Counts: You’ve seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).
# Which destinations have the most carriers?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))
## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ... with 94 more rows

Simple Count

not_cancelled %>% 
  count(dest)
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows

You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:

not_cancelled %>% 
  count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
##    tailnum       n
##    <chr>     <dbl>
##  1 D942DN    3418.
##  2 N0EGMQ  239143.
##  3 N10156  109664.
##  4 N102UW   25722.
##  5 N103US   24619.
##  6 N104UW   24616.
##  7 N10575  139903.
##  8 N105UW   23618.
##  9 N107US   21677.
## 10 N108UW   32070.
## # ... with 4,027 more rows

Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
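
A minimal sketch: after ungroup(), summarise() collapses all rows at once rather than per group.

daily <- group_by(flights, year, month, day)
daily %>% 
  ungroup() %>%             # no longer grouped by date
  summarise(flights = n())  # a single row counting all 336,776 flights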

Exercise

  1. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
# 1. 
not_cancelled %>% count(dest)
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows
not_cancelled %>% group_by(dest) %>% summarise(n = n())
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows
# 2. 
not_cancelled %>% count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
##    tailnum       n
##    <chr>     <dbl>
##  1 D942DN    3418.
##  2 N0EGMQ  239143.
##  3 N10156  109664.
##  4 N102UW   25722.
##  5 N103US   24619.
##  6 N104UW   24616.
##  7 N10575  139903.
##  8 N105UW   23618.
##  9 N107US   21677.
## 10 N108UW   32070.
## # ... with 4,027 more rows
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))
## # A tibble: 4,037 x 2
##    tailnum       n
##    <chr>     <dbl>
##  1 D942DN    3418.
##  2 N0EGMQ  239143.
##  3 N10156  109664.
##  4 N102UW   25722.
##  5 N103US   24619.
##  6 N104UW   24616.
##  7 N10575  139903.
##  8 N105UW   23618.
##  9 N107US   21677.
## 10 N108UW   32070.
## # ... with 4,027 more rows
  2. Our definition of non-cancelled flights (!is.na(dep_delay) & !is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?
flights %>%
    group_by(departed = !is.na(dep_delay), arrived = !is.na(arr_delay)) %>%
    summarise(n=n())
## # A tibble: 3 x 3
## # Groups:   departed [?]
##   departed arrived      n
##   <lgl>    <lgl>    <int>
## 1 FALSE    FALSE     8255
## 2 TRUE     FALSE     1175
## 3 TRUE     TRUE    327346

There are no flights which arrived but did not depart, so we can just use !is.na(dep_delay).

  3. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
flights %>%
  mutate(dep_date = lubridate::make_datetime(year, month, day)) %>%
  group_by(dep_date) %>%
  summarise(cancelled = sum(is.na(dep_delay)), 
            n = n(),
            mean_dep_delay = mean(dep_delay,na.rm=TRUE),
            mean_arr_delay = mean(arr_delay,na.rm=TRUE)) %>%
    ggplot(aes(x= cancelled/n)) + 
    geom_point(aes(y=mean_dep_delay), colour='blue', alpha=0.5) + 
    geom_point(aes(y=mean_arr_delay), colour='red', alpha=0.5) + 
    ylab('mean delay (minutes)')

# or my ans
flights  %>% 
  group_by(year, month, day) %>%
  summarise(cancel_n = sum(is.na(dep_delay)),
            total_n = n(),
            cancel_prop = cancel_n / total_n,
            mean_dep_delay = mean(dep_delay, na.rm = TRUE),
            mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = cancel_prop, y = mean_dep_delay)) + 
  geom_point(color = 'blue', alpha = .5) + 
  geom_point(aes(y = mean_arr_delay), color = 'red', alpha = .5) + 
  labs(y = "Mean Delay in Min", x = "Proportion of Cancellation")

We can see that on most days, there is not a strong relationship between cancellations and delay, but if one is unusually high, then the other probably is, too.

  4. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
flights %>% filter(dep_delay > 0) %>% group_by(carrier, dest) %>% summarise(n()) %>%
  ggplot(aes(x = dest, fill = carrier, y = `n()`)) + 
  geom_bar(stat = "identity") +
  ylab("Number of Delay") + coord_flip() + theme(axis.text.y = element_text(size = 5))

ggsave("test.png")
## Saving 7 x 5 in image

I don’t think we can fully disentangle the bad airports from the bad carriers, because each carrier performs differently at each airport.

  5. For each plane, count the number of flights before the first delay of greater than 1 hour.
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
flights %>%
    mutate(dep_date = lubridate::make_datetime(year, month, day)) %>%
    group_by(tailnum) %>%
    arrange(dep_date) %>%
    filter(!cumany(dep_delay > 60)) %>%
    tally(sort = TRUE)
## # A tibble: 3,755 x 2
##    tailnum     n
##    <chr>   <int>
##  1 N954UW    206
##  2 N952UW    163
##  3 N957UW    142
##  4 N5FAAA    117
##  5 N38727     99
##  6 N3742C     98
##  7 N5EWAA     98
##  8 N705TW     97
##  9 N765US     97
## 10 N635JB     94
## # ... with 3,745 more rows

Exploratory Data Analysis (EDA)

EDA is an iterative cycle. You:

  1. Generate questions about your data.
  2. Search for answers by visualising, transforming, and modelling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you'll eventually write up and communicate to others.

Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.

First step in EDA: ask questions!

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

To compute a histogram’s bin counts by hand, combine count() with cut_width():

diamonds %>% 
  count(cut_width(carat, .5))
## # A tibble: 11 x 2
##    `cut_width(carat, 0.5)`     n
##    <fct>                   <int>
##  1 [-0.25,0.25]              785
##  2 (0.25,0.75]             29498
##  3 (0.75,1.25]             15977
##  4 (1.25,1.75]              5313
##  5 (1.75,2.25]              2002
##  6 (2.25,2.75]               322
##  7 (2.75,3.25]                32
##  8 (3.25,3.75]                 5
##  9 (3.75,4.25]                 4
## 10 (4.25,4.75]                 1
## 11 (4.75,5.25]                 1

Zoom in on small values of the y-axis with coord_cartesian():

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

(coord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.)

Exercise

  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.4
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
p1 <- ggplot(diamonds) + geom_histogram(aes(x = x), binwidth = 0.1) + coord_cartesian(xlim = c(0, 10))
# need to first use filter to remove the potential outliers for y and z
p2 <- ggplot(filter(diamonds, y < 20)) + geom_histogram(aes(x = y), binwidth = 0.1) + coord_cartesian(xlim = c(0, 10))
p3 <- ggplot(filter(diamonds, z < 20)) + geom_histogram(aes(x = z), binwidth = 0.1) + coord_cartesian(xlim = c(0, 10))
grid.arrange(p1, p2, p3, ncol = 1)

Plotted on the same scale, z has a lower mean than x and y, while x and y share essentially the same distribution. Hence x and y are most likely the length and width of a diamond, and z its depth.

  1. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(diamonds) + geom_histogram(aes(x = price), binwidth = 100)

The price distribution has many spikes, and there is a gap around $1,500 where there are no diamonds at that price.

  1. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>% 
  filter(carat == .99 | carat == 1) %>% 
  count(carat)
## # A tibble: 2 x 2
##   carat     n
##   <dbl> <int>
## 1 0.990    23
## 2 1.00   1558
# keep only these two groups and compare the distributions/bar plots of potential explanatory variables
carat_99vs1 <- diamonds %>% 
  filter(carat == .99 | carat == 1)
# distribution of clarity
p1 <- ggplot(carat_99vs1, aes(x = clarity, group = 1)) + geom_bar(aes(y = ..prop..), position = "dodge", show.legend = FALSE) + facet_wrap(~ factor(carat)) + theme(axis.text.x = element_text(size = 9, vjust = 0.5, angle = -45))
p2 <- ggplot(carat_99vs1, aes(x = cut, group = 1)) + geom_bar(aes(y = ..prop..), position = "dodge") + facet_wrap(~ factor(carat)) + theme(axis.text.x = element_text(size = 9, vjust = 0.5, angle = -45))
p3 <- ggplot(carat_99vs1, aes(x = depth, color = factor(carat))) + geom_freqpoly(size = 1.1, stat = 'density', show.legend = FALSE) + scale_y_continuous(labels = scales::percent_format())
p4 <- ggplot(carat_99vs1, aes(x = table, color = factor(carat))) + geom_freqpoly(size = 1.1, stat = 'density') + scale_y_continuous(labels = scales::percent_format())
grid.arrange(p1, p2, p3, p4, nrow = 2) 

We could also compute counts for carat values binned at one decimal place, to see whether merchants round 0.99 up to 1 for a better sale price.

diamonds %>%
  count(cut_width(carat, 0.1, boundary = 0, closed = "left")) %>%
  ggplot(aes(x = `cut_width(carat, 0.1, boundary = 0, closed = "left")`, y = n)) + geom_bar(stat = "identity") + labs(x = "Carat Range", y = "Count") + theme(axis.text.x = element_text(angle = -45, size = 8, vjust = 0.5))

Notice the major jump from the [0.9, 1) range to the [1, 1.1) range, which indicates a strong tendency for merchants to round carat values up.

  1. Difference between xlim(), ylim() and coord_cartesian()

The coord_cartesian() function zooms in on the area specified by the limits, after having calculated and drawn the geoms. Since the histogram bins have already been calculated, it is unaffected.

However, the xlim() and ylim() functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.
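A minimal sketch of the difference, reusing the y histogram from above (the ylim() version drops the bars whose counts exceed the limit and warns about removed rows, while coord_cartesian() merely zooms in):

p_zoom <- ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) + 
  coord_cartesian(ylim = c(0, 50))
p_clip <- ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) + 
  ylim(0, 50)
grid.arrange(p_zoom, p_clip, nrow = 1)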

Visualizing Covariation

  • Categorical vs. continuous variable
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

Or we can use a boxplot:

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

Exercise

  1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
p1 <- flights %>%
  mutate(cancelled = ifelse(is.na(dep_time), "Cancelled", "Non-cancelled")) %>%
  ggplot(aes(x = sched_dep_time, color = cancelled)) + geom_freqpoly(stat = 'density', size = 0.8, show.legend = FALSE)

p2 <- flights %>%
  mutate(cancelled = ifelse(is.na(dep_time), "Cancelled", "Non-cancelled")) %>%
  ggplot(aes(x = cancelled, y = sched_dep_time, fill = cancelled)) + geom_boxplot() + theme(axis.text.x = element_blank())

grid.arrange(p1, p2, nrow = 1)

  1. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
p1 <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
p2 <- ggplot(diamonds, aes(x = color, y = price)) +
  geom_boxplot()
p3 <- ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))

grid.arrange(p1, p2, p3, ncol = 1)

For both clarity and color, there is a large amount of variation within each category, which overwhelms the between-category trend. Carat is clearly the best predictor of price.

Now that we have established that carat appears to be the best predictor of price, what is the relationship between it and cut?

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()

Noticeably, the largest-carat diamonds have a cut of “Fair” (the lowest). This negative relationship may be due to the way diamonds are selected for sale: a larger diamond can be sold profitably with a lower quality cut, while a smaller diamond requires a better cut.

  1. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
library("lvplot")
## Warning: package 'lvplot' was built under R version 3.4.4
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv(aes(fill=..LV..))  + scale_fill_lv()

ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv(aes(fill=..LV..))  + scale_fill_brewer()
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors

  1. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
p1 <- ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
p2 <- ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  facet_wrap(~ cut, ncol = 1, scales = "free_y")
p3 <- ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin() +
  coord_flip()
grid.arrange(p1, p3)

p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

geom_freqpoly() is better for look-up: given a price, it is easy to tell which cut has the highest density. However, the overlapping lines make it difficult to see how the overall distributions relate to each other. geom_violin() and a faceted geom_histogram() have similar strengths and weaknesses: it is easy to visually distinguish differences in the overall shape of the distributions (skewness, central values, variance, etc.), but since we can’t easily compare vertical values across distributions, it is difficult to look up which category has the highest density at a given price.

All of these methods depend on tuning parameters to determine the level of smoothness of the distribution.

  1. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
  • geom_quasirandom() produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.
  • geom_beeswarm() produces a plot similar to a violin plot, but by offsetting the points.
library("ggbeeswarm")
## Warning: package 'ggbeeswarm' was built under R version 3.4.4
p1 <- ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
                                 y = hwy)) + labs(x = "", y ="", title = "Default")

p2 <- ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
                                 y = hwy),
                   method = "tukey")+ labs(x = "", y = "", title = "tukey")
p3 <- ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
                                 y = hwy),
                   method = "tukeyDense")+ labs(x = "", y = "", title = "tukeyDense")

p4 <- ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
                                 y = hwy),
                   method = "frowney")+ labs(x = "", y = "", title = "frowney")
p5 <-  ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
                                 y = hwy),
                   method = "smiley")+ labs(x = "", y = "", title = "smiley")
p6 <- ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(x = reorder(class, hwy, FUN = median),
                                 y = hwy))+ labs(x = "", y = "", title = "beeswarm")

grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)

Two Categorical Variables

Visualize the covariation between categorical variables.

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

Another way is to compute count with dplyr and visualize with geom_tile()

diamonds %>%
  count(color, cut) %>%
  ggplot(aes(x = color, y = cut)) + 
  geom_tile(aes(fill = n), show.legend = FALSE)

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.

Exercise

  1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
# cut within color
ggplot(diamonds, aes(x = color, fill = cut)) + geom_bar(position = "fill")

# color within cut
ggplot(diamonds, aes(x = cut, fill = color)) + geom_bar(position = "fill")
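Another option, sketched here as an alternative to the bar charts above, is to rescale the count table itself by computing proportions within each colour and mapping them to the tile fill:

diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%   # distribution of cut within each color
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop))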

  1. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>%
  filter(dep_delay > 0) %>%
  group_by(dest, month) %>%
  summarise(avg_delay = mean(dep_delay)) %>%
  ggplot(aes(dest, factor(month))) + geom_tile(aes(fill = avg_delay)) + 
  labs(y = "Months", x = "Destination")

The plot is difficult to read because there are too many destinations to show. To improve a large plot like this, try the d3heatmap or heatmaply packages, which create interactive plots.

library(d3heatmap)
## Warning: package 'd3heatmap' was built under R version 3.4.4
flights %>%
  filter(dep_delay > 0) %>%
  group_by(dest, month) %>%
  summarise(avg_delay = mean(dep_delay)) %>%
  spread(dest, avg_delay) %>%
  d3heatmap(scale = "column", dendrogram = "none", color = "Blues")

Or we can keep only destinations that have at least one flight in every month and reorder dest by average departure delay to make the pattern easier to see.

library("viridis")
## Warning: package 'viridis' was built under R version 3.4.4
## Loading required package: viridisLite
flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  group_by(dest) %>%
  filter(n() == 12) %>%
  ungroup() %>%
  mutate(dest = reorder(dest, dep_delay)) %>%
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  scale_fill_viridis() +
  labs(x = "Month", y = "Destination", fill = "Departure Delay")

  1. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
# let's compare 
p1 <- ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
p2 <- ggplot(data = diamonds) +
  geom_count(mapping = aes(x = color, y = cut))
grid.arrange(p1, p2, nrow = 1)

It’s usually better to put the categorical variable with more categories, or with longer labels, on the y axis. Another justification for switching the order is that the larger counts end up at the top when x = color and y = cut, which lowers the cognitive burden of interpreting the plot.

Two Continuous Variables

Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins.

library(hexbin)
## Warning: package 'hexbin' was built under R version 3.4.4
p1 <- diamonds %>% 
  filter(carat < 3) %>%
  ggplot() +
  geom_bin2d(mapping = aes(x = carat, y = price), show.legend = FALSE)

p2 <- diamonds %>% 
  filter(carat < 3) %>%
ggplot() +
  geom_hex(mapping = aes(x = carat, y = price), show.legend = FALSE)

grid.arrange(p1, p2, nrow = 1)

Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a categorical and a continuous variable that you learned about.

diamonds %>% 
  filter(carat < 3) %>%
  ggplot(mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.

diamonds %>% 
  filter(carat < 3) %>%
  ggplot(mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number():

diamonds %>% 
  filter(carat < 3) %>%
ggplot(mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

Exercise

  1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

Using a frequency polygon, I need to consider the smoothness (the binwidth, or the number of points per bin): we don’t want bins so wide that we lose the trend of the covariation, nor so narrow that the trend is swamped by local noise. Bins need to be large enough to remove noise but small enough to preserve the pattern.

small_data <-  diamonds %>% 
  filter(carat < 3)

ggplot(small_data, aes(x = carat, color = cut_width(price, 2500, boundary = 0))) + geom_freqpoly(size = .8, binwidth = 0.1) + scale_color_discrete("Price Range")

  1. Visualise the distribution of carat, partitioned by price.

Besides the plots above, we can also use boxplots to do this:

p1 <- ggplot(small_data, aes(x = price, y = carat)) + geom_boxplot(aes(group = cut_number(price, 10))) + coord_flip()
p2 <- ggplot(small_data, aes(x = price, y = carat)) + geom_boxplot(aes(group = cut_width(price, 2000, boundary = 0))) + coord_flip()

grid.arrange(p1, p2, nrow = 1)

  1. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
ggplot(small_data, aes(x = price)) + geom_freqpoly(aes(color = cut_width(carat, 0.5, boundary = 0)), binwidth = 1000, size = .8, stat = "density") + scale_color_discrete("Carat Range")
## Warning: Ignoring unknown parameters: binwidth

The price distribution of large diamonds is left-skewed (most of its mass is at higher prices), while the distribution for small diamonds is right-skewed. This accords with my expectation, because large diamonds are more easily sold at a much higher price than average diamonds.

  1. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
ggplot(small_data, aes(x = carat, y = price)) + geom_jitter(size = .5, alpha = .5) + facet_wrap(~cut)

  1. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y), size = .7) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

Patterns and Models

Patterns in your data provide clues about relationships. If you spot a pattern, ask yourself:

  • Could this pattern be due to coincidence (i.e. random chance)?

  • How can you describe the relationship implied by the pattern?

  • How strong is the relationship implied by the pattern?

  • What other variables might affect the relationship?

  • Does the relationship change if you look at individual subgroups of the data?

A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions.

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))

Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.

library(modelr)
## Warning: package 'modelr' was built under R version 3.4.4
mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))

The residuals give us a view of the price of the diamond, once the effect of carat has been removed.

Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

ggplot(data = diamonds2) + 
  geom_boxplot(mapping = aes(x = cut, y = resid))

You’ll learn how models, and the modelr package, work in the final part of the book, model. We’re saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.

Data Wrangling

This part of the book proceeds as follows:

tibbles

Almost all of the functions that you’ll use in this book produce tibbles, as tibbles are one of the unifying features of the tidyverse. Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble():

as_tibble(iris)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1         5.10        3.50         1.40       0.200 setosa 
##  2         4.90        3.00         1.40       0.200 setosa 
##  3         4.70        3.20         1.30       0.200 setosa 
##  4         4.60        3.10         1.50       0.200 setosa 
##  5         5.00        3.60         1.40       0.200 setosa 
##  6         5.40        3.90         1.70       0.400 setosa 
##  7         4.60        3.40         1.40       0.300 setosa 
##  8         5.00        3.40         1.50       0.200 setosa 
##  9         4.40        2.90         1.40       0.200 setosa 
## 10         4.90        3.10         1.50       0.100 setosa 
## # ... with 140 more rows

You can create a new tibble from individual vectors with tibble().

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
## # A tibble: 5 x 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1    1.    2.
## 2     2    1.    5.
## 3     3    1.   10.
## 4     4    1.   17.
## 5     5    1.   26.

If you’re already familiar with data.frame(), note that tibble() does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
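A minimal sketch of the contrast (this assumes the pre-R-4.0 default stringsAsFactors = TRUE, which applies to the R version used in these notes):

df <- data.frame(`crazy name` = 1, x = c("a", "b"))
str(df) # name mangled to crazy.name, x converted to a factor
tb <- tibble(`crazy name` = 1, x = c("a", "b"))
str(tb) # name kept as-is, x stays a character vector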

It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names.

tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
## # A tibble: 1 x 3
##   `:)`  ` `   `2000`
##   <chr> <chr> <chr> 
## 1 smile space number

Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
## # A tibble: 2 x 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a        2.  3.60
## 2 b        1.  8.50

Print tibble

nycflights13::flights %>% 
  print(n = 10, width = Inf) # print all columns
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
##    sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
##             <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
##  1            819       11. UA        1545 N14228  EWR    IAH       227.
##  2            830       20. UA        1714 N24211  LGA    IAH       227.
##  3            850       33. AA        1141 N619AA  JFK    MIA       160.
##  4           1022      -18. B6         725 N804JB  JFK    BQN       183.
##  5            837      -25. DL         461 N668DN  LGA    ATL       116.
##  6            728       12. UA        1696 N39463  EWR    ORD       150.
##  7            854       19. B6         507 N516JB  EWR    FLL       158.
##  8            723      -14. EV        5708 N829AS  LGA    IAD        53.
##  9            846       -8. B6          79 N593JB  JFK    MCO       140.
## 10            745        8. AA         301 N3ALAA  LGA    ORD       138.
##    distance  hour minute time_hour          
##       <dbl> <dbl>  <dbl> <dttm>             
##  1    1400.    5.    15. 2013-01-01 05:00:00
##  2    1416.    5.    29. 2013-01-01 05:00:00
##  3    1089.    5.    40. 2013-01-01 05:00:00
##  4    1576.    5.    45. 2013-01-01 05:00:00
##  5     762.    6.     0. 2013-01-01 06:00:00
##  6     719.    5.    58. 2013-01-01 05:00:00
##  7    1065.    6.     0. 2013-01-01 06:00:00
##  8     229.    6.     0. 2013-01-01 06:00:00
##  9     944.    6.     0. 2013-01-01 06:00:00
## 10     733.    6.     0. 2013-01-01 06:00:00
## # ... with 3.368e+05 more rows

You can also control the default print behaviour by setting options:

  • options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.

  • Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.
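A quick sketch combining both options (restoring the previous settings afterwards):

old <- options(tibble.print_max = 20, tibble.print_min = 5, tibble.width = Inf)
nycflights13::flights # at most 5 rows shown (the data has more than 20 rows), all columns printed
options(old)          # restore the earlier settings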

A final option is to use RStudio’s built-in data viewer to get a scrollable view of the complete dataset.

nycflights13::flights %>% 
  View()

Data Extraction

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# Extract by name
df$x
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137
df[["x"]]
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137
# Extract by position
df[[1]]
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137

To use these in a pipe, you’ll need to use the special placeholder .:

df %>% .$x
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137
df %>% .[["x"]]
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137

Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back into a data.frame.

The main reason that some older functions don’t work with tibbles is the [ function.

With base R data frames, [ sometimes returns a data frame, and sometimes returns a vector. With tibbles, [ always returns another tibble.

df <- data.frame(abc = 1, xyz = "a")
df$x # partial matching
## [1] a
## Levels: a
df[, "xyz"]
## [1] a
## Levels: a
df[, c("abc", "xyz")]
##   abc xyz
## 1   1   a
df <- as.tibble(df)
df$x
## Warning: Unknown or uninitialised column: 'x'.
## NULL
df[, "xyz"]
## # A tibble: 1 x 1
##   xyz  
##   <fct>
## 1 a
df[, c("abc", "xyz")]
## # A tibble: 1 x 2
##     abc xyz  
##   <dbl> <fct>
## 1    1. a

Exercise

  1. Practice referring to non-syntactic names in the following data frame by:
  • Extracting the variable called 1.

  • Plotting a scatterplot of 1 vs 2.

  • Creating a new column called 3 which is 2 divided by 1.

  • Renaming the columns to one, two and three.

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

# 1.
annoying$`1`
##  [1]  1  2  3  4  5  6  7  8  9 10
annoying[["1"]] # will return a vector
##  [1]  1  2  3  4  5  6  7  8  9 10
annoying["1"] # will return a tibble
## # A tibble: 10 x 1
##      `1`
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10
# 2. 
ggplot(annoying) + geom_point(aes(x = `1`, y = `2`))

# 3. 
annoying[["3"]] <- annoying[[2]] / annoying[[1]]
# 4. renaming
annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)
  1. What does tibble::enframe() do? When might you use it?

enframe() converts named atomic vectors or lists to two-column data frames. For unnamed vectors, the natural sequence is used as the name column.

enframe(1:3)
## # A tibble: 3 x 2
##    name value
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     3     3
enframe(c(a = 5, b = 7))
## # A tibble: 2 x 2
##   name  value
##   <chr> <dbl>
## 1 a        5.
## 2 b        7.

  3. What option controls how many additional column names are printed at the footer of a tibble?

The n_extra argument of the tibble print method determines how many additional column names are listed in the footer; the corresponding package option is tibble.max_extra_cols.
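For example (a sketch; n_extra caps how many of the unprinted columns are listed in the footer):

nycflights13::flights %>% 
  print(n = 3, n_extra = 3)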

Data Import

Load flat files in R with the readr package, which is part of the core tidyverse.

Most of readr’s functions are concerned with turning flat files into data frames:

Parsing Vectors

We need to take a little detour to talk about the parse_*() functions. These functions take a character vector and return a more specialised vector like a logical, integer, or date:

str(parse_logical(c("TRUE", "FALSE", "NA")))
##  logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
##  int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
##  Date[1:2], format: "2010-01-01" "1979-10-14"

These functions are useful in their own right, but are also an important building block for readr. Once you’ve learned how the individual parsers work in this section, we’ll circle back and see how they fit together to parse a complete file in the next section.

If parsing fails, you’ll get a warning:

x <- parse_integer(c("123", "345", "abc", "123.45"))
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 2 parsing failures.
## row # A tibble: 2 x 4 col     row   col expected               actual expected   <int> <int> <chr>                  <chr>  actual 1     3    NA an integer             abc    row 2     4    NA no trailing characters .45
x
## [1] 123 345  NA  NA
## attr(,"problems")
## # A tibble: 2 x 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA an integer             abc   
## 2     4    NA no trailing characters .45

And the failures will be missing in the output. If there are many parsing failures, you’ll need to use problems() to get the complete set. This returns a tibble, which you can then manipulate with dplyr.

problems(x)
## # A tibble: 2 x 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA an integer             abc   
## 2     4    NA no trailing characters .45

Using parsers is mostly a matter of understanding what’s available and how they deal with different types of input. There are eight particularly important parsers:

  1. parse_logical() and parse_integer() parse logicals and integers respectively.
  2. parse_double() (a strict numeric parser) and parse_number() (a flexible numeric parser).
  3. parse_character() (seems trivial, except for character encodings).
  4. parse_factor() creates factors.
  5. parse_datetime(), parse_date(), and parse_time() parse various date and time specifications.

Parse Numbers

  1. People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.
  2. Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.
  3. Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.

To address the first problem, readr has the notion of a “locale”, an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of . by creating a new locale and setting the decimal_mark argument:

parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23

parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.

parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
parse_number("It cost $123.45")
## [1] 123.45

The final problem is addressed by the combination of parse_number() and the locale as parse_number() will ignore the “grouping mark”:

parse_number("$123,456,789")
## [1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789

Parse Strings

x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
## [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"

How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.

In R, we can get at the underlying representation of a string using charToRaw():

guess_encoding(charToRaw(x1))
## # A tibble: 2 x 2
##   encoding   confidence
##   <chr>           <dbl>
## 1 ISO-8859-1      0.460
## 2 ISO-8859-9      0.230
guess_encoding(charToRaw(x2))
## # A tibble: 1 x 2
##   encoding confidence
##   <chr>         <dbl>
## 1 KOI8-R        0.420

Parse Factors

Give parse_factor() a vector of known levels to generate a warning whenever an unexpected value is present:

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## Warning: 1 parsing failure.
## row # A tibble: 1 x 4 col     row   col expected           actual   expected   <int> <int> <chr>              <chr>    actual 1     3    NA value in level set bananana
## [1] apple  banana <NA>  
## attr(,"problems")
## # A tibble: 1 x 4
##     row   col expected           actual  
##   <int> <int> <chr>              <chr>   
## 1     3    NA value in level set bananana
## Levels: apple banana

Dates, date-times and times

You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments:

parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
## [1] "2010-10-10 UTC"
parse_date("2010-10-01")
## [1] "2010-10-01"
parse_date("2010/10/01")
## [1] "2010-10-01"
library(hms)
## Warning: package 'hms' was built under R version 3.4.4
parse_time("01:10 am")
## 01:10:00
parse_time("20:10:01")
## 20:10:01

If you’re using %b or %B with non-English month names, you’ll need to set the lang argument to locale(). See the list of built-in languages in date_names_langs(), or if your language is not already included, create your own with date_names().

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-01"
parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))
## [1] "1979-10-14"

Exercise

  1. What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?

If the decimal and grouping marks are set to the same character, locale() throws an error; if decimal_mark is set to the comma “,”, the grouping mark defaults to the period “.”; if the grouping mark is set to a period, the decimal mark defaults to a comma.
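A quick check of these behaviours (the first call is left commented out because it errors):

# locale(decimal_mark = ".", grouping_mark = ".") # error: the two marks must differ
locale(decimal_mark = ",")  # grouping_mark defaults to "."
locale(grouping_mark = ".") # decimal_mark defaults to ","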

  1. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
parse_date("02/01/2006", locale = locale(date_format = "%m/%d/%Y"))
## [1] "2006-02-01"
parse_date("02/01/2006", "%m/%d/%Y")
## [1] "2006-02-01"
  1. Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
parse_date(d1, "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, "%I:%M:%OS %p")
## 23:15:10.12

Parsing a file

readr reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. You can emulate this process with a character vector using guess_parser(), which returns readr’s best guess, and parse_guess(), which uses that guess to parse the column:

guess_parser("2010-10-01")
## [1] "date"
guess_parser("15:01")
## [1] "time"
guess_parser(c("TRUE", "FALSE"))
## [1] "logical"
guess_parser(c("1", "5", "9"))
## [1] "integer"
guess_parser(c("12,352,561"))
## [1] "number"
str(parse_guess("2010-10-10"))
##  Date[1:1], format: "2010-10-10"

The heuristic tries each of the following types, stopping when it finds a match:

  • logical: contains only “F”, “T”, “FALSE”, or “TRUE”.
  • integer: contains only numeric characters (and -).
  • double: contains only valid doubles (including numbers like 4.5e-5).
  • number: contains valid doubles with the grouping mark inside.
  • time: matches the default time_format.
  • date: matches the default date_format.
  • date-time: any ISO8601 date.

If none of these rules apply, then the column will stay as a vector of strings.

Problems

These defaults don’t always work for larger files. There are two basic problems:

The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.

The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it’s a character vector, whereas you probably want to parse it as something more specific.

challenge <- read_csv(readr_example("challenge.csv"))
## Parsed with column specification:
## cols(
##   x = col_integer(),
##   y = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 1000 parsing failures.
## row # A tibble: 5 x 5 col     row col   expected               actual             file               expected   <int> <chr> <chr>                  <chr>              <chr>              actual 1  1001 x     no trailing characters .23837975086644292 'F:/R-3.4.3/libra~ file 2  1002 x     no trailing characters .41167997173033655 'F:/R-3.4.3/libra~ row 3  1003 x     no trailing characters .7460716762579978  'F:/R-3.4.3/libra~ col 4  1004 x     no trailing characters .723450553836301   'F:/R-3.4.3/libra~ expected 5  1005 x     no trailing characters .614524137461558   'F:/R-3.4.3/libra~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.
problems(challenge)
## # A tibble: 1,000 x 5
##      row col   expected               actual             file             
##    <int> <chr> <chr>                  <chr>              <chr>            
##  1  1001 x     no trailing characters .23837975086644292 'F:/R-3.4.3/libr~
##  2  1002 x     no trailing characters .41167997173033655 'F:/R-3.4.3/libr~
##  3  1003 x     no trailing characters .7460716762579978  'F:/R-3.4.3/libr~
##  4  1004 x     no trailing characters .723450553836301   'F:/R-3.4.3/libr~
##  5  1005 x     no trailing characters .614524137461558   'F:/R-3.4.3/libr~
##  6  1006 x     no trailing characters .473980569280684   'F:/R-3.4.3/libr~
##  7  1007 x     no trailing characters .5784610391128808  'F:/R-3.4.3/libr~
##  8  1008 x     no trailing characters .2415937229525298  'F:/R-3.4.3/libr~
##  9  1009 x     no trailing characters .11437866208143532 'F:/R-3.4.3/libr~
## 10  1010 x     no trailing characters .2983446326106787  'F:/R-3.4.3/libr~
## # ... with 990 more rows

A good strategy is to work column by column until there are no problems remaining. Here we can see that there are a lot of parsing problems with the x column - there are trailing characters after the integer value. That suggests we need to use a double parser instead.

Every parse_xyz() function has a corresponding col_xyz() function. You use parse_xyz() when the data is in a character vector in R already; you use col_xyz() when you want to tell readr how to load the data.

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
tail(challenge)
## # A tibble: 6 x 2
##       x y         
##   <dbl> <date>    
## 1 0.805 2019-11-21
## 2 0.164 2018-03-29
## 3 0.472 2014-08-04
## 4 0.718 2015-08-16
## 5 0.270 2020-02-04
## 6 0.608 2019-01-06
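An alternative sketch for the first problem: rather than specifying every column type, let readr look at more rows before guessing, via read_csv()'s guess_max argument (challenge_guess is just an illustrative name):

challenge_guess <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
# looking at 1001 rows lets readr see the decimal values that start at row 1001,
# so x is guessed as a double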
  • Sometimes it’s easier to diagnose problems if you just read in all the columns as character vectors:
challenge2 <- read_csv(readr_example("challenge.csv"), 
  col_types = cols(.default = col_character())
)

This is particularly useful in conjunction with type_convert(), which applies the parsing heuristics to the character columns in a data frame.

df <- tribble(
  ~x,  ~y,
  "1", "1.21",
  "2", "2.32",
  "3", "4.56"
)
df
## # A tibble: 3 x 2
##   x     y    
##   <chr> <chr>
## 1 1     1.21 
## 2 2     2.32 
## 3 3     4.56
type_convert(df)
## Parsed with column specification:
## cols(
##   x = col_integer(),
##   y = col_double()
## )
## # A tibble: 3 x 2
##       x     y
##   <int> <dbl>
## 1     1  1.21
## 2     2  2.32
## 3     3  4.56

Save CSV

Note that the type information is lost when you save to CSV.
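A quick sketch of what gets lost (challenge still has the double/date columns we specified above):

write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
# the column types are guessed again from the plain-text file, so the
# col_types specification built above does not survive the round trip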

This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load the data back in. There are two alternatives:

  • write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS:
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
## # A tibble: 2,000 x 2
##        x y         
##    <dbl> <date>    
##  1  404. NA        
##  2 4172. NA        
##  3 3004. NA        
##  4  787. NA        
##  5   37. NA        
##  6 2332. NA        
##  7 2489. NA        
##  8 1449. NA        
##  9 3665. NA        
## 10 3863. NA        
## # ... with 1,990 more rows
  • The feather package implements a fast binary file format that can be shared across programming languages:
library(feather)
## Warning: package 'feather' was built under R version 3.4.4
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
## # A tibble: 2,000 x 2
##        x y         
##    <dbl> <date>    
##  1  404. NA        
##  2 4172. NA        
##  3 3004. NA        
##  4  787. NA        
##  5   37. NA        
##  6 2332. NA        
##  7 2489. NA        
##  8 1449. NA        
##  9 3665. NA        
## 10 3863. NA        
## # ... with 1,990 more rows

Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you’ll learn about in many models); feather currently does not.

Tidy Data

Spreading and Gathering

To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:

  • The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.

  • The name of the variable whose values form the column names. I call that the key, and here it is year.

  • The name of the variable whose values are spread over the cells. I call that value, and here it’s the number of cases.

  • To combine the tidied versions of table4a and table4b into a single tibble, we need to use dplyr::left_join()

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")
## # A tibble: 6 x 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583
tidy4a <- table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583

To tidy this up, we first analyse the representation in a similar way to gather(). This time, however, we only need two parameters:

  • The column that contains variable names, the key column. Here, it’s type.

  • The column that contains values from multiple variables, the value column. Here it’s count.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table2 %>%
    spread(key = type, value = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
spreading

Exercise

  1. Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:
stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>% 
  spread(year, return) %>%
  gather(`2015`:`2016`, key = year, value = return)
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1    1. 2015   1.88 
## 2    2. 2015   0.590
## 3    1. 2016   0.920
## 4    2. 2016   0.170

The functions spread() and gather() are not perfectly symmetrical because column type information is not transferred between them. In the original table the column year was numeric, but after running spread() and gather() it is a character vector. This is because variable names are always converted to a character vector by gather().

The convert argument tries to convert character vectors to the appropriate type. In the background this uses the type.convert() function.

stocks %>% 
  spread(year, return) %>%
  gather(`2015`:`2016`, key = year, value = return, convert = TRUE)
## # A tibble: 4 x 3
##    half  year return
##   <dbl> <int>  <dbl>
## 1    1.  2015  1.88 
## 2    2.  2015  0.590
## 3    1.  2016  0.920
## 4    2.  2016  0.170
  1. Why does this code fail?
table4a 
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
# table4a %>% 
#   gather(1999, 2000, key = "year", value = "cases")

Because non-syntactic column names like 1999 must be wrapped in backticks (`1999`). Without the backticks, gather() interprets 1999 and 2000 as positions, i.e. it looks for the 1999th and 2000th columns of the data frame.

table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
  1. Why does spreading this tibble fail? How could you add a new column to fix the problem?
people <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

Spreading the data frame fails because there are two rows with “age” for “Phillip Woods”. We would need to add another column indicating which observation each row belongs to (an alternative using row_number() is sketched after the output below).

people <- tribble(
  ~name,             ~key,    ~value, ~obs,
  #-----------------|--------|------|------
  "Phillip Woods",   "age",       45, 1,
  "Phillip Woods",   "height",   186, 1,
  "Phillip Woods",   "age",       50, 2,
  "Jessica Cordero", "age",       37, 1,
  "Jessica Cordero", "height",   156, 1
)

people %>%
  spread(key, value)
## # A tibble: 3 x 4
##   name              obs   age height
##   <chr>           <dbl> <dbl>  <dbl>
## 1 Jessica Cordero    1.   37.   156.
## 2 Phillip Woods      1.   45.   186.
## 3 Phillip Woods      2.   50.    NA
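An alternative sketch: instead of typing the indicator by hand, generate it with row_number() (people2 below is just the original table re-entered without the obs column):

people2 <- tribble(
  ~name,             ~key,    ~value,
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)
people2 %>%
  group_by(name, key) %>%
  mutate(obs = row_number()) %>% # numbers the repeated (name, key) pairs
  ungroup() %>%
  spread(key, value)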
  1. Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

We need to gather it!

preg %>%
  gather(male, female, key = Sex, value = Number)
## # A tibble: 4 x 3
##   pregnant Sex    Number
##   <chr>    <chr>   <dbl>
## 1 yes      male      NA 
## 2 no       male      20.
## 3 yes      female    10.
## 4 no       female    12.
gather(preg, sex, count, male, female) %>%
  mutate(pregnant = pregnant == "yes",
         female = sex == "female") %>%
  select(-sex)
## # A tibble: 4 x 3
##   pregnant count female
##   <lgl>    <dbl> <lgl> 
## 1 TRUE       NA  FALSE 
## 2 FALSE      20. FALSE 
## 3 TRUE       10. TRUE  
## 4 FALSE      12. TRUE

Separating and Uniting

table3 has a different problem: we have one column (rate) that contains two variables (cases and population). To fix this problem, we’ll need the separate() function. You’ll also learn about the complement of separate(): unite(), which you use if a single variable is spread across multiple columns.

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

separate() takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure 12.4 and the code below.

The default sep value is a regular expression that matches any sequence of non-alphanumeric values.

If you wish to use a specific character to separate a column, you can pass the character to the sep argument of separate()

table3 %>% 
  separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Look carefully at the column types: you’ll notice that cases and population are character columns. This is the default behaviour in separate(): it leaves the type of the column as is. Here, however, it’s not very useful as those really are numbers. We can ask separate() to try and convert to better types using convert = TRUE:

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
##   country      year  cases population
## * <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You can also pass a vector of integers to sep. separate() will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings. When using integers to separate strings, the length of sep should be one less than the number of names in into.

(table5 <- table3 %>% 
  separate(year, into = c("century", "year"), sep = 2) %>%
  separate(rate, into = c("class", "population"), convert = TRUE))
## # A tibble: 6 x 5
##   country     century year   class population
## * <chr>       <chr>   <chr>  <int>      <int>
## 1 Afghanistan 19      99       745   19987071
## 2 Afghanistan 20      00      2666   20595360
## 3 Brazil      19      99     37737  172006362
## 4 Brazil      20      00     80488  174504898
## 5 China       19      99    212258 1272915272
## 6 China       20      00    213766 1280428583

unite() is the inverse of separate(): it combines multiple columns into a single column. You’ll need it much less frequently than separate(), but it’s still a useful tool to have in your back pocket.

uniting

We can use unite() to rejoin the century and year columns that we created. unite() takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select() style:

table5 %>% 
  unite(new, century, year)
## # A tibble: 6 x 4
##   country     new    class population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 19_99    745   19987071
## 2 Afghanistan 20_00   2666   20595360
## 3 Brazil      19_99  37737  172006362
## 4 Brazil      20_00  80488  174504898
## 5 China       19_99 212258 1272915272
## 6 China       20_00 213766 1280428583

In this case we also need to use the sep argument. The default will place an underscore (_) between the values from different columns. Here we don’t want any separator so we use “”:

table5 %>% 
  unite(new, century, year, sep = "")
## # A tibble: 6 x 4
##   country     new    class population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

Exercise

  1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 h     i     j
# use extra
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"), extra = "merge")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f,g  
## 3 h     i     j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i
# use fill
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "left") # fill with missing values to the left
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 <NA>  d     e    
## 3 f     g     i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i
  2. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

remove = TRUE (the default) drops the input column(s) from the output data frame. You would set it to FALSE if you want to create the new variables but also keep the original column.

table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table3 %>% 
  separate(year, into = c("century", "short_year"), sep = 2, remove = FALSE)
## # A tibble: 6 x 5
##   country      year century short_year rate             
## * <chr>       <int> <chr>   <chr>      <chr>            
## 1 Afghanistan  1999 19      99         745/19987071     
## 2 Afghanistan  2000 20      00         2666/20595360    
## 3 Brazil       1999 19      99         37737/172006362  
## 4 Brazil       2000 20      00         80488/174504898  
## 5 China        1999 19      99         212258/1272915272
## 6 China        2000 20      00         213766/1280428583
  3. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

separate() uses a regular expression (or positions) to match the separator, while extract() uses a regular expression whose capturing groups define the new columns. There is only one unite() because combining already-separated columns is much simpler than splitting them apart: there is only one way to put the pieces back together.

library(dplyr)
df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
df %>% extract(x, "A")
##      A
## 1 <NA>
## 2    a
## 3    a
## 4    b
## 5    d
df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)")
##      A    B
## 1 <NA> <NA>
## 2    a    b
## 3    a    d
## 4    b    c
## 5    d    e

Missing values

A value can be missing in one of two possible ways:

  • Explicitly, i.e. flagged with NA.
  • Implicitly, i.e. simply not present in the data.
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks
## # A tibble: 7 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1 2015.    1.  1.88 
## 2 2015.    2.  0.590
## 3 2015.    3.  0.350
## 4 2015.    4. NA    
## 5 2016.    2.  0.920
## 6 2016.    3.  0.170
## 7 2016.    4.  2.66

The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA.

The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:

stocks %>% 
  spread(year, return)
## # A tibble: 4 x 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1    1.  1.88  NA    
## 2    2.  0.590  0.920
## 3    3.  0.350  0.170
## 4    4. NA      2.66

Because these explicit missing values may not be important in other representations of the data, you can set na.rm = TRUE in gather() to turn explicit missing values implicit:

stocks %>% 
  spread(year, return) %>% 
  gather(year, return, `2015`:`2016`, na.rm = TRUE)
## # A tibble: 6 x 3
##     qtr year  return
## * <dbl> <chr>  <dbl>
## 1    1. 2015   1.88 
## 2    2. 2015   0.590
## 3    3. 2015   0.350
## 4    2. 2016   0.920
## 5    3. 2016   0.170
## 6    4. 2016   2.66

Another important tool for making missing values explicit in tidy data is complete():

stocks %>% 
  complete(year, qtr)
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1 2015.    1.  1.88 
## 2 2015.    2.  0.590
## 3 2015.    3.  0.350
## 4 2015.    4. NA    
## 5 2016.    1. NA    
## 6 2016.    2.  0.920
## 7 2016.    3.  0.170
## 8 2016.    4.  2.66

complete() takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

Filling missing

There is one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

You can fill in these missing values with fill(). It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

treatment %>% 
  fill(person)
## # A tibble: 4 x 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore        1.       7.
## 2 Derrick Whitmore        2.      10.
## 3 Derrick Whitmore        3.       9.
## 4 Katherine Burke         1.       4.
# .direction    argument
# Direction in which to fill missing values. Currently either "down" (the default) or "up".
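# For example, a hedged sketch (not in the original notes) of filling upward instead:
treatment %>% 
  fill(person, .direction = "up")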

Case Study

The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.

who
## # A tibble: 7,240 x 60
##    country     iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534
##    <chr>       <chr> <chr> <int>       <int>        <int>        <int>
##  1 Afghanistan AF    AFG    1980          NA           NA           NA
##  2 Afghanistan AF    AFG    1981          NA           NA           NA
##  3 Afghanistan AF    AFG    1982          NA           NA           NA
##  4 Afghanistan AF    AFG    1983          NA           NA           NA
##  5 Afghanistan AF    AFG    1984          NA           NA           NA
##  6 Afghanistan AF    AFG    1985          NA           NA           NA
##  7 Afghanistan AF    AFG    1986          NA           NA           NA
##  8 Afghanistan AF    AFG    1987          NA           NA           NA
##  9 Afghanistan AF    AFG    1988          NA           NA           NA
## 10 Afghanistan AF    AFG    1989          NA           NA           NA
## # ... with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## #   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## #   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## #   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## #   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## #   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## #   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## #   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## #   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## #   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## #   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## #   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## #   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## #   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## #   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## #   newrel_f65 <int>

This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it.

Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you’ll usually need to string together multiple verbs into a pipeline.

So we need to gather together all the columns from new_sp_m014 to newrel_f65. We don’t know what those values represent yet, so we’ll give them the generic name “key”. We know the cells represent the count of cases, so we’ll use the variable cases. There are a lot of missing values in the current representation, so for now we’ll use na.rm just so we can focus on the values that are present.

who1 <- who %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
## # A tibble: 76,046 x 6
##    country     iso2  iso3   year key         cases
##  * <chr>       <chr> <chr> <int> <chr>       <int>
##  1 Afghanistan AF    AFG    1997 new_sp_m014     0
##  2 Afghanistan AF    AFG    1998 new_sp_m014    30
##  3 Afghanistan AF    AFG    1999 new_sp_m014     8
##  4 Afghanistan AF    AFG    2000 new_sp_m014    52
##  5 Afghanistan AF    AFG    2001 new_sp_m014   129
##  6 Afghanistan AF    AFG    2002 new_sp_m014    90
##  7 Afghanistan AF    AFG    2003 new_sp_m014   127
##  8 Afghanistan AF    AFG    2004 new_sp_m014   139
##  9 Afghanistan AF    AFG    2005 new_sp_m014   151
## 10 Afghanistan AF    AFG    2006 new_sp_m014   193
## # ... with 76,036 more rows

We can get some hint of the structure of the values in the new key column by counting them:

who1 %>%
  count(key)
## # A tibble: 56 x 2
##    key              n
##    <chr>        <int>
##  1 new_ep_f014   1032
##  2 new_ep_f1524  1021
##  3 new_ep_f2534  1021
##  4 new_ep_f3544  1021
##  5 new_ep_f4554  1017
##  6 new_ep_f5564  1017
##  7 new_ep_f65    1014
##  8 new_ep_m014   1038
##  9 new_ep_m1524  1026
## 10 new_ep_m2534  1020
## # ... with 46 more rows

You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:

  1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

  2. The next two letters describe the type of TB:

  • rel stands for cases of relapse
  • ep stands for cases of extrapulmonary TB
  • sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
  • sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
  3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).

  4. The remaining numbers give the age group. The dataset groups cases into seven age groups:

  • 014 = 0 - 14 years old
  • 1524 = 15 - 24 years old
  • 2534 = 25 - 34 years old
  • 3544 = 35 - 44 years old
  • 4554 = 45 - 54 years old
  • 5564 = 55 - 64 years old
  • 65 = 65 or older

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent, because instead of new_rel we have newrel (it's hard to spot this here, but if you don't fix it we'll get errors in subsequent steps). You'll learn about str_replace() in strings, but the basic idea is pretty simple: replace the characters "newrel" with "new_rel". This makes all variable names consistent.
who2 <- who1 %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
## # A tibble: 76,046 x 6
##    country     iso2  iso3   year key         cases
##    <chr>       <chr> <chr> <int> <chr>       <int>
##  1 Afghanistan AF    AFG    1997 new_sp_m014     0
##  2 Afghanistan AF    AFG    1998 new_sp_m014    30
##  3 Afghanistan AF    AFG    1999 new_sp_m014     8
##  4 Afghanistan AF    AFG    2000 new_sp_m014    52
##  5 Afghanistan AF    AFG    2001 new_sp_m014   129
##  6 Afghanistan AF    AFG    2002 new_sp_m014    90
##  7 Afghanistan AF    AFG    2003 new_sp_m014   127
##  8 Afghanistan AF    AFG    2004 new_sp_m014   139
##  9 Afghanistan AF    AFG    2005 new_sp_m014   151
## 10 Afghanistan AF    AFG    2006 new_sp_m014   193
## # ... with 76,036 more rows

We can separate the values in each code with two passes of separate(). The first pass will split the codes at each underscore.

who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
## # A tibble: 76,046 x 8
##    country     iso2  iso3   year new   type  sexage cases
##    <chr>       <chr> <chr> <int> <chr> <chr> <chr>  <int>
##  1 Afghanistan AF    AFG    1997 new   sp    m014       0
##  2 Afghanistan AF    AFG    1998 new   sp    m014      30
##  3 Afghanistan AF    AFG    1999 new   sp    m014       8
##  4 Afghanistan AF    AFG    2000 new   sp    m014      52
##  5 Afghanistan AF    AFG    2001 new   sp    m014     129
##  6 Afghanistan AF    AFG    2002 new   sp    m014      90
##  7 Afghanistan AF    AFG    2003 new   sp    m014     127
##  8 Afghanistan AF    AFG    2004 new   sp    m014     139
##  9 Afghanistan AF    AFG    2005 new   sp    m014     151
## 10 Afghanistan AF    AFG    2006 new   sp    m014     193
## # ... with 76,036 more rows

Then we might as well drop the new column because it’s constant in this dataset. While we’re dropping columns, let’s also drop iso2 and iso3 since they’re redundant.

who3 %>% 
  count(new)
## # A tibble: 1 x 2
##   new       n
##   <chr> <int>
## 1 new   76046
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)

Next we’ll separate sexage into sex and age by splitting after the first character:

who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5
## # A tibble: 76,046 x 6
##    country      year type  sex   age   cases
##    <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # ... with 76,036 more rows

The who dataset is now tidy!

This typically isn’t how you’d work interactively. Instead, you’d gradually build up a complex pipe:

who %>%
  gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

Exercise

  1. I claimed that iso2 and iso3 were redundant with country. Confirm this claim.
select(who3, country, iso2, iso3) %>%
  distinct() %>%
  group_by(country) %>%
  filter(n() > 1)
## # A tibble: 0 x 3
## # Groups:   country [0]
## # ... with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>

Relational Data

We will use the nycflights13 package to learn about relational data. nycflights13 contains four tibbles that are related to the flights table that you used in data transformation:

  • airlines lets you look up the full carrier name from its abbreviated code:
  • airports gives information about each airport, identified by the faa airport code:
  • planes gives information about each plane, identified by its tailnum:
  • weather gives the weather at each NYC airport for each hour:

One way to show the relationships between the different tables is with a drawing: nyc_flights

library(tidyverse)
library(nycflights13)

The key to understanding diagrams like this is to remember each relation always concerns a pair of tables. You don’t need to understand the whole thing; you just need to understand the chain of relations between the tables that you are interested in.

For nycflights13:

  • flights connects to planes via a single variable, tailnum.

  • flights connects to airlines through the carrier variable.

  • flights connects to airports in two ways: via the origin and dest variables.

  • flights connects to weather via origin (the location), and year, month, day and hour (the time).
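
For example, to combine flights with the matching weather records you would join on that compound key. A hedged sketch (not from the original notes):

flights %>% 
  select(origin, year, month, day, hour, dep_delay) %>% 
  left_join(weather, by = c("origin", "year", "month", "day", "hour"))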

Mutating Joins

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

Like mutate(), the join functions add variables to the right, so if you have a lot of variables already, the new variables won’t get printed out.

Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with left_join():

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1    5. EWR    IAH   N14228  UA     
##  2  2013     1     1    5. LGA    IAH   N24211  UA     
##  3  2013     1     1    5. JFK    MIA   N619AA  AA     
##  4  2013     1     1    5. JFK    BQN   N804JB  B6     
##  5  2013     1     1    6. LGA    ATL   N668DN  DL     
##  6  2013     1     1    5. EWR    ORD   N39463  UA     
##  7  2013     1     1    6. EWR    FLL   N516JB  B6     
##  8  2013     1     1    6. LGA    IAD   N829AS  EV     
##  9  2013     1     1    6. JFK    MCO   N593JB  B6     
## 10  2013     1     1    6. LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows
flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1    5. N14228  UA      United Air Lines Inc.   
##  2  2013     1     1    5. N24211  UA      United Air Lines Inc.   
##  3  2013     1     1    5. N619AA  AA      American Airlines Inc.  
##  4  2013     1     1    5. N804JB  B6      JetBlue Airways         
##  5  2013     1     1    6. N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1    5. N39463  UA      United Air Lines Inc.   
##  7  2013     1     1    6. N516JB  B6      JetBlue Airways         
##  8  2013     1     1    6. N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1    6. N593JB  B6      JetBlue Airways         
## 10  2013     1     1    6. N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

Inner Join: An inner join matches pairs of observations whenever their keys are equal:
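
Here x and y stand for two generic tables that share a key column; an illustrative pair (not from the original notes) would be:

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)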

x %>% 
  inner_join(y, by = "key")

Outer joins: An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

  • A left join keeps all observations in x.
  • A right join keeps all observations in y.
  • A full join keeps all observations in x and y.
four joins

Defining Key Columns:

  • The default, by = NULL, uses all variables that appear in both tables (a natural join).
  • A character vector, by = "x". This is like a natural join, but uses only some of the common variables.
  • A named character vector, by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

For example, if we want to draw a map we need to combine the flights data with the airports data which contains the location (lat and lon) of each airport. Each flight has an origin and destination airport, so we need to specify which one we want to join to:

flights2 %>% 
  left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl>
##  1  2013     1     1    5. EWR    IAH   N14228  UA      Georg~  30.0 -95.3
##  2  2013     1     1    5. LGA    IAH   N24211  UA      Georg~  30.0 -95.3
##  3  2013     1     1    5. JFK    MIA   N619AA  AA      Miami~  25.8 -80.3
##  4  2013     1     1    5. JFK    BQN   N804JB  B6      <NA>    NA    NA  
##  5  2013     1     1    6. LGA    ATL   N668DN  DL      Harts~  33.6 -84.4
##  6  2013     1     1    5. EWR    ORD   N39463  UA      Chica~  42.0 -87.9
##  7  2013     1     1    6. EWR    FLL   N516JB  B6      Fort ~  26.1 -80.2
##  8  2013     1     1    6. LGA    IAD   N829AS  EV      Washi~  38.9 -77.5
##  9  2013     1     1    6. JFK    MCO   N593JB  B6      Orlan~  28.4 -81.3
## 10  2013     1     1    6. LGA    ORD   N3ALAA  AA      Chica~  42.0 -87.9
## # ... with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## #   dst <chr>, tzone <chr>

Filtering Joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y.
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest
## # A tibble: 10 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
# Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:
flights %>% 
  filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540        2.      923
##  2  2013     1     1      554            600       -6.      812
##  3  2013     1     1      554            558       -4.      740
##  4  2013     1     1      555            600       -5.      913
##  5  2013     1     1      557            600       -3.      838
##  6  2013     1     1      558            600       -2.      753
##  7  2013     1     1      558            600       -2.      924
##  8  2013     1     1      558            600       -2.      923
##  9  2013     1     1      559            559        0.      702
## 10  2013     1     1      600            600        0.      851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

But it's difficult to extend that approach to multiple variables. For example, imagine that you'd found the 10 days with the highest average delays. How would you construct the filter statement that used year, month, and day to match it back to flights?

flights %>% 
  semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540        2.      923
##  2  2013     1     1      554            600       -6.      812
##  3  2013     1     1      554            558       -4.      740
##  4  2013     1     1      555            600       -5.      913
##  5  2013     1     1      557            600       -3.      838
##  6  2013     1     1      558            600       -2.      753
##  7  2013     1     1      558            600       -2.      924
##  8  2013     1     1      558            600       -2.      923
##  9  2013     1     1      559            559        0.      702
## 10  2013     1     1      600            600        0.      851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
# If NULL, the default, *_join() will do a natural join, using all variables with common names across the two tables.
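
A semi-join also answers the multi-variable question above. As a hedged sketch (assuming "the 10 days with the highest average delays" means the highest mean arrival delay per day):

worst_days <- flights %>% 
  group_by(year, month, day) %>% 
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>% 
  ungroup() %>% 
  arrange(desc(avg_delay)) %>% 
  head(10)

flights %>% 
  semi_join(worst_days, by = c("year", "month", "day"))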

Only the existence of a match is important; it doesn't matter which observation is matched. This means that filtering joins never duplicate rows the way mutating joins can.

Anti-joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don’t have a match in planes:

flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ... with 712 more rows

Strings

Matching with RegExp

To learn regular expressions, we’ll use str_view() and str_view_all(). These functions take a character vector and a regular expression, and show you how they match.

If \ is used as an escape character in regular expressions, how do you match a literal \? Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\" - you need four backslashes to match one!

x <- c("apple", "banana", "pear")
str_view(x, "an")
# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)
## \.
#> \.

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
x <- "a\\b"
writeLines(x)
## a\b
#> a\b

str_view(x, "\\\\")

Character Classes

There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:

  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.
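
A quick illustrative sketch (the strings are made up):

str_view(c("abc123", "def"), "\\d")
str_view(c("gray", "grey", "groy"), "gr[ea]y")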

Repetition

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more
  • {n}: exactly n
  • {n,}: n or more
  • {,m}: at most m
  • {n,m}: between n and m
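
A short sketch of these quantifiers in action (the example string is illustrative):

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, "C{2,3}")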

Grouping and backreferences

Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2, etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2, etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.

str_view(fruit, "(..)\\1", match = TRUE)

Notice

Regular expressions are powerful, but they are not always the simplest way to solve a problem!

A word of caution before we continue: because regular expressions are so powerful, it’s easy to try and solve every problem with a single regular expression.

As a cautionary tale, check out this regular expression that checks if an email address is valid:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code. See the stackoverflow discussion at http://stackoverflow.com/a/201378 for more details.

Replacing

Instead of replacing with a fixed string you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
## [1] "The canoe birch slid on the smooth planks." 
## [2] "Glue sheet the to the dark blue background."
## [3] "It's to easy tell the depth of a well."     
## [4] "These a days chicken leg is a rare dish."   
## [5] "Rice often is served in round bowls."

With str_replace_all() you can perform multiple replacements by supplying a named vector:

x <- c("1 house 3", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house three" "two cars"        "three people"

Date and Time

This chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you’re working with dates/times. We will also need nycflights13 for practice data.

library(tidyverse)
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:hms':
## 
##     hms
## The following object is masked from 'package:base':
## 
##     date
library(nycflights13)

today()
## [1] "2018-08-10"
now()
## [1] "2018-08-10 02:51:20 EDT"

There are three ways you're likely to create a date/time:

  • From a string.
  • From individual date-time components.
  • From an existing date/time object.

ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
ymd(20170131)
## [1] "2017-01-31"
ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"
ymd(20170131, tz = "UTC")
## [1] "2017-01-31 UTC"

To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1    5.    15. 2013-01-01 05:15:00
##  2  2013     1     1    5.    29. 2013-01-01 05:29:00
##  3  2013     1     1    5.    40. 2013-01-01 05:40:00
##  4  2013     1     1    5.    45. 2013-01-01 05:45:00
##  5  2013     1     1    6.     0. 2013-01-01 06:00:00
##  6  2013     1     1    5.    58. 2013-01-01 05:58:00
##  7  2013     1     1    6.     0. 2013-01-01 06:00:00
##  8  2013     1     1    6.     0. 2013-01-01 06:00:00
##  9  2013     1     1    6.     0. 2013-01-01 06:00:00
## 10  2013     1     1    6.     0. 2013-01-01 06:00:00
## # ... with 336,766 more rows

Let’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once I’ve created the date-time variables, I focus in on the variables we’ll explore in the rest of the chapter.

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt
## # A tibble: 328,063 x 9
##    origin dest  dep_delay arr_delay dep_time           
##    <chr>  <chr>     <dbl>     <dbl> <dttm>             
##  1 EWR    IAH          2.       11. 2013-01-01 05:17:00
##  2 LGA    IAH          4.       20. 2013-01-01 05:33:00
##  3 JFK    MIA          2.       33. 2013-01-01 05:42:00
##  4 JFK    BQN         -1.      -18. 2013-01-01 05:44:00
##  5 LGA    ATL         -6.      -25. 2013-01-01 05:54:00
##  6 EWR    ORD         -4.       12. 2013-01-01 05:54:00
##  7 EWR    FLL         -5.       19. 2013-01-01 05:55:00
##  8 LGA    IAD         -3.      -14. 2013-01-01 05:57:00
##  9 JFK    MCO         -3.       -8. 2013-01-01 05:57:00
## 10 LGA    ORD         -2.        8. 2013-01-01 05:58:00
## # ... with 328,053 more rows, and 4 more variables: sched_dep_time <dttm>,
## #   arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

With this data, I can visualise the distribution of departure times across the year:

flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day

Or within a single day:

flights_dt %>% 
  filter(dep_time < ymd(20130102)) %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes

You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():

as_datetime(today())
## [1] "2018-08-10 UTC"
as_date(now())
## [1] "2018-08-10"
# Sometimes you'll get date/times as numeric offsets from the "Unix Epoch", 1970-01-01.
as_datetime(60 * 60 * 10)
## [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
## [1] "1980-01-01"

You can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().

datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
## [1] 2016
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
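
The other accessors work the same way; a quick sketch on the same datetime:

month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
hour(datetime)
## [1] 12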

We can use wday() to see that more flights depart during the week than on the weekend:

flights_dt %>% 
  mutate(wday = wday(dep_time, label = TRUE)) %>% 
  ggplot(aes(x = wday)) +
    geom_bar()

Time Durations

A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.

h_age <- today() - ymd(19791014)
h_age
## Time difference of 14180 days
as.duration(h_age)
## [1] "1225152000s (~38.82 years)"
# Durations come with a bunch of convenient constructors
dseconds(15)
## [1] "15s"
dminutes(10)
## [1] "600s (~10 minutes)"
dhours(c(12, 24))
## [1] "43200s (~12 hours)" "86400s (~1 days)"
dweeks(3)
## [1] "1814400s (~3 weeks)"

Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds at the standard rate (60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, 365 days in a year).

You can add and multiply durations:

2 * dyears(1)
## [1] "63072000s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
## [1] "38847600s (~1.23 years)"

However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"

Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zone has changed. Because of DST, March 12 only has 23 hours, so if we add a full day's worth of seconds we end up with a different time.

Periods: Like durations, periods can be created with a number of friendly constructor functions.

Compared to durations, periods are more likely to do what you expect:

one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"
seconds(15)
## [1] "15S"
minutes(10)
## [1] "10M 0S"
weeks(3:6)
## [1] "21d 0H 0M 0S" "28d 0H 0M 0S" "35d 0H 0M 0S" "42d 0H 0M 0S"
years(1)
## [1] "1y 0m 0d 0H 0M 0S"
10 * (months(6) + days(1))
## [1] "60m 10d 0H 0M 0S"
days(50) + hours(25) + minutes(2)
## [1] "50d 25H 2M 0S"
# A leap year
ymd("2016-01-01") + dyears(1)
## [1] "2016-12-31"
ymd("2016-01-01") + years(1)
## [1] "2017-01-01"
# Daylight Savings Time
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"

Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City. These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.

flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )

Now all of our flights obey the laws of physics.

flights_dt %>% 
  filter(overnight, arr_time < dep_time) 
## # A tibble: 0 x 10
## # ... with 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,
## #   arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>,
## #   arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## #   overnight <lgl>

datetime summary

Model

Transformation

Transformations are useful because you can use them to approximate non-linear functions. If you’ve taken a calculus class, you may have heard of Taylor’s theorem which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a polynomial function to get arbitrarily close to a smooth function by fitting an equation like y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3. Typing that sequence by hand is tedious, so R provides a helper function: poly():

df <- tribble(
  ~y, ~x,
   1,  1,
   2,  2, 
   3,  3
)

library(modelr)
model_matrix(df, y ~ poly(x, 2))
## # A tibble: 3 x 3
##   `(Intercept)` `poly(x, 2)1` `poly(x, 2)2`
##           <dbl>         <dbl>         <dbl>
## 1            1.     -7.07e- 1         0.408
## 2            1.     -7.85e-17        -0.816
## 3            1.      7.07e- 1         0.408

However there’s one major problem with using poly(): outside the range of the data, polynomials rapidly shoot off to positive or negative infinity. One safer alternative is to use the natural spline, splines::ns().

library(splines)
model_matrix(df, y ~ ns(x, 2))
## # A tibble: 3 x 3
##   `(Intercept)` `ns(x, 2)1` `ns(x, 2)2`
##           <dbl>       <dbl>       <dbl>
## 1            1.       0.          0.   
## 2            1.       0.566      -0.211
## 3            1.       0.344       0.771

Let’s see what that looks like when we try and approximate a non-linear function:

library(modelr)
sim5 <- tibble(
  x = seq(0, 3.5 * pi, length = 50),
  y = 4 * sin(x) + rnorm(length(x))
)

ggplot(sim5, aes(x, y)) +
  geom_point()

# fit 5 models to this data
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)

grid <- sim5 %>% 
  data_grid(x = seq_range(x, n = 50, expand = 0.1)) %>% 
  gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")

ggplot(sim5, aes(x, y)) + 
  geom_point() +
  geom_line(data = grid, colour = "red") +
  facet_wrap(~ model)

Notice that the extrapolation outside the range of the data is clearly bad. This is the downside to approximating a function with a polynomial. But this is a very real problem with every model: the model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen. You must rely on theory and science.

Other Model Families

  • Generalised linear models, e.g. stats::glm(). Linear models assume that the response is continuous and the error has a normal distribution. Generalised linear models extend linear models to include non-continuous responses (e.g. binary data or counts). They work by defining a distance metric based on the statistical idea of likelihood (see the sketch after this list).

  • Generalised additive models, e.g. mgcv::gam(), extend generalised linear models to incorporate arbitrary smooth functions. That means you can write a formula like y ~ s(x) which becomes an equation like y = f(x) and let gam() estimate what that function is (subject to some smoothness constraints to make the problem tractable).

  • Penalised linear models, e.g. glmnet::glmnet(), add a penalty term to the distance that penalises complex models (as defined by the distance between the parameter vector and the origin). This tends to make models that generalise better to new datasets from the same population.

  • Robust linear models, e.g. MASS::rlm(), tweak the distance to downweight points that are very far away. This makes them less sensitive to the presence of outliers, at the cost of being not quite as good when there are no outliers.

  • Trees, e.g. rpart::rpart(), attack the problem in a completely different way than linear models. They fit a piece-wise constant model, splitting the data into progressively smaller and smaller pieces. Trees aren't terribly effective by themselves, but they are very powerful when used in aggregate by models like random forests (e.g. randomForest::randomForest()) or gradient boosting machines (e.g. xgboost::xgboost()).
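
As a minimal sketch of the first family above, here is a logistic regression fit with stats::glm(); the simulated data and variable names are purely illustrative:

set.seed(1014)
sim_bin <- tibble(
  x = runif(200),
  y = rbinom(200, 1, plogis(2 * x - 1))
)
mod_glm <- glm(y ~ x, family = binomial, data = sim_bin)
summary(mod_glm)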

Graphics for Communication

Non-overlapping Labels

We can use the ggrepel package by Kamil Slowikowski. This useful package will automatically adjust labels so that they don't overlap:

library(ggrepel)
## Warning: package 'ggrepel' was built under R version 3.4.4
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_point(size = 3, shape = 1, data = best_in_class) +
  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)

Themes

themes

Saving Plots

There are two main ways to get your plots out of R and into your final write-up: ggsave() and knitr. ggsave() will save the most recent plot to disk.

If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them.
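
For example (an illustrative sketch; the file name is made up):

ggplot(mpg, aes(displ, hwy)) + 
  geom_point()
ggsave("my-plot.pdf", width = 6, height = 4)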

Generally, however, I think you should be assembling your final reports using R Markdown, so I want to focus on the important code chunk options that you should know about for graphics. You can learn more about ggsave() in the documentation.

There are five main options that control figure sizing: fig.width, fig.height, fig.asp, out.width and out.height. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).

I only ever use three of the five options:

If you find that you’re having to squint to read the text in your plot, you need to tweak fig.width. If fig.width is larger than the size the figure is rendered in the final doc, the text will be too small; if fig.width is smaller, the text will be too big. You’ll often need to do a little experimentation to figure out the right ratio between the fig.width and the eventual width in your document.

When mingling code and text, like I do in this book, I recommend setting fig.show = “hold” so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.

To add a caption to the plot, use fig.cap. In R Markdown this will change the figure from inline to “floating”.
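
An illustrative chunk header combining these options (the chunk label and caption are made up):

```{r efficiency-plot, fig.width = 6, fig.asp = 0.618, out.width = "70%", fig.show = "hold", fig.cap = "Highway mileage by engine size."}
ggplot(mpg, aes(displ, hwy)) + geom_point()
```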

It’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse in other circumstances (i.e. if you want to quickly drop a single plot into an email or a tweet).