If you’d like to learn more about the theoretical underpinnings of ggplot2 before you start, I’d recommend reading "The Layered Grammar of Graphics", http://vita.had.co.nz/papers/layered-grammar.pdf.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.4
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
That one line of code loads the core tidyverse; packages which you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).
We'll use the mpg dataset. A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
Among the variables in mpg are: 1. displ, a car’s engine size, in litres. 2. hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.80 1999 4 auto(l~ f 18 29 p
## 2 audi a4 1.80 1999 4 manual~ f 21 29 p
## 3 audi a4 2.00 2008 4 manual~ f 20 31 p
## 4 audi a4 2.00 2008 4 auto(a~ f 21 30 p
## 5 audi a4 2.80 1999 6 auto(l~ f 16 26 p
## 6 audi a4 2.80 1999 6 manual~ f 18 26 p
## 7 audi a4 3.10 2008 6 auto(a~ f 18 27 p
## 8 audi a4 quat~ 1.80 1999 4 manual~ 4 18 26 p
## 9 audi a4 quat~ 1.80 1999 4 auto(l~ 4 16 25 p
## 10 audi a4 quat~ 2.00 2008 4 manual~ 4 20 28 p
## # ... with 224 more rows, and 1 more variable: class <chr>
with(mpg, plot(displ, hwy))
# create a ggplot
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the data argument, in this case, mpg.
Run ggplot(data = mpg). What do you see?
How many rows are in mpg? How many columns?
What does the drv variable describe? Read the help for ?mpg to find out.
Make a scatterplot of hwy vs cyl.
What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
# 1.
ggplot(data=mpg) # see an empty plot
# 2.
dim(mpg) # 234 11
## [1] 234 11
# 3. f = front-wheel drive, r = rear wheel drive, 4 = 4wd
# 4.
ggplot(data = mpg) + geom_point(aes(x = hwy, y = cyl))
# 5.
ggplot(data = mpg) + geom_point(aes(x = class, y = drv))
# not useful because both x and y are categorical, with no numerical relationships between them.
Problem: one group of points (highlighted in red) seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. How can you explain these cars?
Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.
Explanation: The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.
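The same scaling applies to other aesthetics. As a minimal sketch, mapping class to shape instead of color works the same way (by default ggplot2 only maps six shapes at a time, so one class is dropped with a warning):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))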
What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
What happens if you map the same variable to multiple aesthetics?
What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
# 1. need to put color outside of aes()
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
# 2.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.80 1999 4 auto(l~ f 18 29 p
## 2 audi a4 1.80 1999 4 manual~ f 21 29 p
## 3 audi a4 2.00 2008 4 manual~ f 20 31 p
## 4 audi a4 2.00 2008 4 auto(a~ f 21 30 p
## 5 audi a4 2.80 1999 6 auto(l~ f 16 26 p
## 6 audi a4 2.80 1999 6 manual~ f 18 26 p
## 7 audi a4 3.10 2008 6 auto(a~ f 18 27 p
## 8 audi a4 quat~ 1.80 1999 4 manual~ 4 18 26 p
## 9 audi a4 quat~ 1.80 1999 4 auto(l~ 4 16 25 p
## 10 audi a4 quat~ 2.00 2008 4 manual~ 4 20 28 p
## # ... with 224 more rows, and 1 more variable: class <chr>
# from the printed column types we see that character variables must be categorical; we can also treat some of the integer variables as categorical, such as cyl (number of cylinders)
# 3.
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy, color = cty, shape = factor(cyl)), size = 2)
# 4.
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy, color = cyl, shape = factor(cyl)), size = 2)
# 5. Use the stroke aesthetic to modify the width of the border for shapes that have a border (shapes 21-25)
# 6.
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy, color = displ < 5, shape = factor(cyl)), size = 2)
One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here "formula" is the name of a data structure in R, not a synonym for "equation"). The variable that you pass to facet_wrap() should be discrete.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ cyl, nrow = 1)
To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
# If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses.
# left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# right
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'
geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess'
In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example).
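As a minimal sketch, you can also set the group aesthetic explicitly to get one smooth per drive train without adding a legend:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))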
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
## Warning: package 'bindrcpp' was built under R version 3.4.4
## `geom_smooth()` using method = 'loess'
# Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
# Recreate the R code necessary to generate the following graphs.
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(size=3) + geom_smooth(size=2, se=FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(size=3) + geom_smooth(aes(group = drv), size=2, se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) + geom_point(size=3) + geom_smooth(size=2, se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = drv), size=3) + geom_smooth(size=2, se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = drv), size=3) + geom_smooth(aes(linetype = drv), size=2, se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point(aes(fill = drv), shape = 21, color = "white", stroke = 2, size=3) + theme_dark()
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is "count", which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(), and if you scroll down you can find a section called "Computed variables". That describes how it computes two new variables: count and prop.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
This works because every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation.
There are three reasons you might need to use a stat explicitly: 1. You might want to override the default stat. In the bar chart below, the height of the bar is already present in the data, so we use stat = "identity" instead of having the height generated by counting rows, as in the previous bar chart. (Unfortunately, when people talk about bar charts casually, they might be referring to either type.)
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
3. You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing:
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
In our proportion barchart, we need to set group = 1. Why? In other words, why is this graph not useful?
..prop.. computes proportions within groups. If we don't specify that all the data should be regarded as one group, then geom_bar() treats each cut as a separate group, and the proportion of "Premium" diamonds that are "Premium" is obviously 1.
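For comparison, a sketch of the not-useful version without group = 1: ..prop.. is computed within each cut group, so every bar has height 1.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))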
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut), size=2)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity. If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
position = “fill” works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
position = “dodge” places overlapping objects directly beside one another. This makes it easier to compare individual values.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
Using jitter in geom_point(): the values of hwy and displ are rounded, so the points appear on a grid and many points overlap each other. This problem is known as overplotting, and it makes it hard to see where the mass of the data is. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?
You can avoid this gridding by setting the position adjustment to jitter. position = “jitter” adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
#### Exercise
geom_jitter() randomly moves points to stop them overlapping. geom_count() instead counts the observations at each location and maps that count to the size of a single point. The determinism of geom_count() makes it useful for discrete data, but it does not help when points overlap without sitting at exactly the same location.
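As a minimal sketch, the geom_count() alternative on the same displ/hwy data:
ggplot(data = mpg) +
geom_count(mapping = aes(x = displ, y = hwy))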
ggplot(data = mpg) + geom_boxplot(aes(x = drv, y = hwy, color = class))
The default position adjustment is position_dodge. This means that the boxes are moved to the side by a discrete amount.
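To see the effect, a sketch that overrides the default with position = "identity", so the boxplots overlap instead of being dodged:
ggplot(data = mpg) +
geom_boxplot(aes(x = drv, y = hwy, color = class), position = "identity")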
Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2 (which unfortunately we don’t have the space to cover in this book).
nz <- map_data("nz")
## Warning: package 'maps' was built under R version 3.4.4
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
ggplot(data = mpg) + geom_bar(aes(x = drv, fill = class)) + coord_polar() + labs(fill = "Classes", x = "")
coord_quickmap() uses an approximation to the Mercator projection, whereas coord_map() can use a variety of projections from the mapproj package. This means that coord_quickmap() runs faster and doesn't require additional packages, but isn't as accurate, and won't work well far from the equator.
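For comparison, a sketch using coord_map() on the same New Zealand polygons (this assumes the mapproj package is installed; with no arguments coord_map() uses the Mercator projection):
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_map()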
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed(ratio = 1)
geom_abline is used to plot lines defined by slope (a) and intercept (b) parameters. Used with no arguments, like here, it will plot a line with slope 1 and intercept 0, so passing through the origin at 45 degrees. coord_fixed is important because x and y have the same units, so we want to maintain the slope of the line, and see that city mileage is worse than highway, but more importantly that this is better explained by a constant offset than a multiplicative factor.
# ggplot(data = <DATA>) +
# <GEOM_FUNCTION>(
# mapping = aes(<MAPPINGS>),
# stat = <STAT>,
# position = <POSITION>
# ) +
# <COORDINATE_FUNCTION> +
# <FACET_FUNCTION>
This chapter will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 3.4.4
library(tidyverse)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run View(flights) which will open the dataset in the RStudio viewer). It prints differently because it’s a tibble. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don’t need to worry about the differences; we’ll come back to tibbles in more detail in wrangle.
These verbs (filter(), arrange(), select(), mutate(), and summarise()) can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly: the first argument is a data frame, the subsequent arguments describe what to do with it (using variable names without quotes), and the result is a new data frame.
filter rows
filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, !(arr_delay > 120 | dep_delay > 120))
## # A tibble: 316,050 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 316,040 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Missing values
One important feature of R that can make comparison tricky is missing values, or NAs ("not availables"). NA represents an unknown value, so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
## [1] NA
# We don't know!
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly with is.na(x):
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 x 1
## x
## <dbl>
## 1 3.
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
## x
## <dbl>
## 1 NA
## 2 3.
Arrange
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
Use desc() to re-order by a column in descending order:
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301. 1242
## 2 2013 6 15 1432 1935 1137. 1607
## 3 2013 1 10 1121 1635 1126. 1239
## 4 2013 9 20 1139 1845 1014. 1457
## 5 2013 7 22 845 1600 1005. 1044
## 6 2013 4 10 1100 1900 960. 1342
## 7 2013 3 17 2321 810 911. 135
## 8 2013 6 27 959 1900 899. 1236
## 9 2013 7 22 2257 759 898. 121
## 10 2013 12 5 756 1700 896. 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Missing values are always sorted at the end:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 2.
## 2 5.
## 3 NA
arrange(df, desc(x))
## # A tibble: 3 x 1
## x
## <dbl>
## 1 5.
## 2 2.
## 3 NA
Select
select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
# Select columns by name
select(flights, year, month, day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2. 830 819 11.
## 2 533 529 4. 850 830 20.
## 3 542 540 2. 923 850 33.
## 4 544 545 -1. 1004 1022 -18.
## 5 554 600 -6. 812 837 -25.
## 6 554 558 -4. 740 728 12.
## 7 555 600 -5. 913 854 19.
## 8 557 600 -3. 709 723 -14.
## 9 557 600 -3. 838 846 -8.
## 10 558 600 -2. 753 745 8.
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
There are a number of helper functions you can use within select(): starts_with(), ends_with(), contains(), matches() (which takes a regular expression), and num_range().
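For example, a minimal sketch selecting every column whose name ends with "delay":
select(flights, ends_with("delay"))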
select() can be used to rename variables, but it's rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), a variant of select() that keeps all the variables that aren't explicitly mentioned:
rename(flights, tail_num = tailnum) # new name = original name
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time
## <dttm> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227. 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227. 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160. 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183. 2013 1 1 544 545
## 5 2013-01-01 06:00:00 116. 2013 1 1 554 600
## 6 2013-01-01 05:00:00 150. 2013 1 1 554 558
## 7 2013-01-01 06:00:00 158. 2013 1 1 555 600
## 8 2013-01-01 06:00:00 53. 2013 1 1 557 600
## 9 2013-01-01 06:00:00 140. 2013 1 1 557 600
## 10 2013-01-01 06:00:00 138. 2013 1 1 558 600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>
Mutate
mutate() always adds new columns at the end of your dataset so we’ll start by creating a narrower dataset so we can see the new variables. Remember that when you’re in RStudio, the easiest way to see all the columns is View().
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
## # A tibble: 336,776 x 9
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2. 11. 1400. 227. -9. 370.
## 2 2013 1 1 4. 20. 1416. 227. -16. 374.
## 3 2013 1 1 2. 33. 1089. 160. -31. 408.
## 4 2013 1 1 -1. -18. 1576. 183. 17. 517.
## 5 2013 1 1 -6. -25. 762. 116. 19. 394.
## 6 2013 1 1 -4. 12. 719. 150. -16. 288.
## 7 2013 1 1 -5. 19. 1065. 158. -24. 404.
## 8 2013 1 1 -3. -14. 229. 53. 11. 259.
## 9 2013 1 1 -3. -8. 944. 140. 5. 405.
## 10 2013 1 1 -2. 8. 733. 138. -10. 319.
## # ... with 336,766 more rows
Note that you can refer to columns that you’ve just created:
mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 10
## year month day dep_delay arr_delay distance air_time gain hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2. 11. 1400. 227. -9. 3.78
## 2 2013 1 1 4. 20. 1416. 227. -16. 3.78
## 3 2013 1 1 2. 33. 1089. 160. -31. 2.67
## 4 2013 1 1 -1. -18. 1576. 183. 17. 3.05
## 5 2013 1 1 -6. -25. 762. 116. 19. 1.93
## 6 2013 1 1 -4. 12. 719. 150. -16. 2.50
## 7 2013 1 1 -5. 19. 1065. 158. -24. 2.63
## 8 2013 1 1 -3. -14. 229. 53. 11. 0.883
## 9 2013 1 1 -3. -8. 944. 140. 5. 2.33
## 10 2013 1 1 -2. 8. 733. 138. -10. 2.30
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>
If you only want to keep the new variables, use transmute():
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 -9. 3.78 -2.38
## 2 -16. 3.78 -4.23
## 3 -31. 2.67 -11.6
## 4 17. 3.05 5.57
## 5 19. 1.93 9.83
## 6 -16. 2.50 -6.40
## 7 -24. 2.63 -9.11
## 8 11. 0.883 12.5
## 9 5. 2.33 2.14
## 10 -10. 2.30 -4.35
## # ... with 336,766 more rows
Creation functions
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5. 17.
## 2 533 5. 33.
## 3 542 5. 42.
## 4 544 5. 44.
## 5 554 5. 54.
## 6 554 5. 54.
## 7 555 5. 55.
## 8 557 5. 57.
## 9 557 5. 57.
## 10 558 5. 58.
## # ... with 336,766 more rows
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
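A minimal sketch of the cumulative functions on a small vector:
x <- c(1, 3, 5, 10)
cumsum(x) # 1 4 9 19
cummean(x) # 1 2 3 4.75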
Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
min_rank(desc(y)) # desc() Transform a vector into a format that will be sorted in descending order.
## [1] 5 3 3 NA 2 1
Summarise
summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied "by group". For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
Pipe
Imagine that we want to explore the relationship between the distance and average delay for each location. There are three steps to prepare this data: group flights by destination; summarise to compute distance, average delay, and number of flights; then filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
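A sketch of the step-by-step version (the intermediate names by_dest and delays are just illustrative):
by_dest <- group_by(flights, dest)
delays <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delays <- filter(delays, count > 20, dest != "HNL")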
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don’t care about it. Naming things is hard, so this slows down our analysis.
There’s another way to tackle the same problem with the pipe, %>%:
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We’ll use piping frequently from now on because it considerably improves the readability of code.
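As a tiny sketch of that equivalence, these two lines produce the same result:
filter(flights, month == 1)
flights %>% filter(month == 1)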
Counts
Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay)
)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_histogram(binwidth = 10)
Wow, there are some planes that have an average delay of 5 hours (300 minutes)!
The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay:
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_jitter(size=0.3)
The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. It’s a bit painful that you have to switch from %>% to +, but once you get the hang of it, it’s quite convenient.
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends the previously sent chunk from the editor to the console. This is very convenient when you're (e.g.) exploring the value of n in the example above. You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of n and press Cmd/Ctrl + Shift + P to resend the complete block.
Other summary functions
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 -1.93 28.3
## 5 2013 1 5 -1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 -4.95 27.8
## 8 2013 1 8 -3.23 20.8
## 9 2013 1 9 -0.264 25.6
## 10 2013 1 10 -5.90 27.3
## # ... with 355 more rows
# When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day first last
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 517. 2356.
## 2 2013 1 2 42. 2354.
## 3 2013 1 3 32. 2349.
## 4 2013 1 4 25. 2358.
## 5 2013 1 5 14. 2357.
## 6 2013 1 6 16. 2355.
## 7 2013 1 7 49. 2359.
## 8 2013 1 8 454. 2351.
## 9 2013 1 9 2. 2252.
## 10 2013 1 10 3. 2320.
## # ... with 355 more rows
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
## # A tibble: 770 x 20
## # Groups: year, month, day [365]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 2356 2359 -3. 425
## 3 2013 1 2 42 2359 43. 518
## 4 2013 1 2 2354 2359 -5. 413
## 5 2013 1 3 32 2359 33. 504
## 6 2013 1 3 2349 2359 -10. 434
## 7 2013 1 4 25 2359 26. 505
## 8 2013 1 4 2358 2359 -1. 429
## 9 2013 1 4 2358 2359 -1. 436
## 10 2013 1 5 14 2359 15. 503
## # ... with 760 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, r <int>
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
## # A tibble: 104 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ... with 94 more rows
Simple Count
not_cancelled %>%
count(dest)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
You can optionally provide a weight variable. For example, you could use this to "count" (sum) the total number of miles a plane flew:
not_cancelled %>%
count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418.
## 2 N0EGMQ 239143.
## 3 N10156 109664.
## 4 N102UW 25722.
## 5 N103US 24619.
## 6 N104UW 24616.
## 7 N10575 139903.
## 8 N105UW 23618.
## 9 N107US 21677.
## 10 N108UW 32070.
## # ... with 4,027 more rows
Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
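A minimal sketch, reusing the by_day grouped data frame from earlier:
by_day %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights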
# 1.
not_cancelled %>% count(dest)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
not_cancelled %>% group_by(dest) %>% summarise(n = n())
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
# 2.
not_cancelled %>% count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418.
## 2 N0EGMQ 239143.
## 3 N10156 109664.
## 4 N102UW 25722.
## 5 N103US 24619.
## 6 N104UW 24616.
## 7 N10575 139903.
## 8 N105UW 23618.
## 9 N107US 21677.
## 10 N108UW 32070.
## # ... with 4,027 more rows
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418.
## 2 N0EGMQ 239143.
## 3 N10156 109664.
## 4 N102UW 25722.
## 5 N103US 24619.
## 6 N104UW 24616.
## 7 N10575 139903.
## 8 N105UW 23618.
## 9 N107US 21677.
## 10 N108UW 32070.
## # ... with 4,027 more rows
flights %>%
group_by(departed = !is.na(dep_delay), arrived = !is.na(arr_delay)) %>%
summarise(n=n())
## # A tibble: 3 x 3
## # Groups: departed [?]
## departed arrived n
## <lgl> <lgl> <int>
## 1 FALSE FALSE 8255
## 2 TRUE FALSE 1175
## 3 TRUE TRUE 327346
There are no flights which arrived but did not depart, so we can just use !is.na(dep_delay).
flights %>%
mutate(dep_date = lubridate::make_datetime(year, month, day)) %>%
group_by(dep_date) %>%
summarise(cancelled = sum(is.na(dep_delay)),
n = n(),
mean_dep_delay = mean(dep_delay,na.rm=TRUE),
mean_arr_delay = mean(arr_delay,na.rm=TRUE)) %>%
ggplot(aes(x= cancelled/n)) +
geom_point(aes(y=mean_dep_delay), colour='blue', alpha=0.5) +
geom_point(aes(y=mean_arr_delay), colour='red', alpha=0.5) +
ylab('mean delay (minutes)')
# or my ans
flights %>%
group_by(year, month, day) %>%
summarise(cancel_n = sum(is.na(dep_delay)),
total_n = n(),
cancel_prop = cancel_n / total_n,
mean_dep_delay = mean(dep_delay, na.rm = TRUE),
mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(x = cancel_prop, y = mean_dep_delay)) +
geom_point(color = 'blue', alpha = .5) +
geom_point(aes(y = mean_arr_delay), color = 'red', alpha = .5) +
labs(y = "Mean Delay in Min", x = "Proportion of Cancellation")
We can see that on most days, there is not a strong relationship between cancellations and delay, but if one is unusually high, then the other probably is, too.
flights %>% filter(dep_delay > 0) %>% group_by(carrier, dest) %>% summarise(n = n()) %>%
ggplot(aes(x = dest, fill = carrier, y = n)) +
geom_bar(stat = "identity") +
ylab("Number of Delays") + coord_flip() + theme(axis.text.y = element_text(size = 5))
ggsave("test.png")
## Saving 7 x 5 in image
I don’t think we can disentangle the bad airports from the bad carriers, because each carrier performs differently at each airport.
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
flights %>%
mutate(dep_date = lubridate::make_datetime(year, month, day)) %>%
group_by(tailnum) %>%
arrange(dep_date) %>%
filter(!cumany(dep_delay > 60)) %>%
tally(sort = TRUE)
## # A tibble: 3,755 x 2
## tailnum n
## <chr> <int>
## 1 N954UW 206
## 2 N952UW 163
## 3 N957UW 142
## 4 N5FAAA 117
## 5 N38727 99
## 6 N3742C 98
## 7 N5EWAA 98
## 8 N705TW 97
## 9 N765US 97
## 10 N635JB 94
## # ... with 3,745 more rows
EDA is an iterative cycle. You: 1. Generate questions about your data. 2. Search for answers by visualising, transforming, and modelling your data. 3. Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you'll eventually write up and communicate to others.
Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
our goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
To create ggplot histogram bins by hand:
diamonds %>%
count(cut_width(carat, .5))
## # A tibble: 11 x 2
## `cut_width(carat, 0.5)` n
## <fct> <int>
## 1 [-0.25,0.25] 785
## 2 (0.25,0.75] 29498
## 3 (0.75,1.25] 15977
## 4 (1.25,1.75] 5313
## 5 (1.75,2.25] 2002
## 6 (2.25,2.75] 322
## 7 (2.75,3.25] 32
## 8 (3.25,3.75] 5
## 9 (3.75,4.25] 4
## 10 (4.25,4.75] 1
## 11 (4.75,5.25] 1
Zoom to small values of the y-axis with coord_cartesian()
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
(coord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.4
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- ggplot(diamonds) + geom_histogram(aes(x = x), binwidth = 0.1) + coord_cartesian(xlim = c(0, 10))
# need to first use filter to remove the potential outliers for y and z
p2 <- ggplot(filter(diamonds, y < 20)) + geom_histogram(aes(x = y), binwidth = 0.1) + coord_cartesian(xlim = c(0, 10))
p3 <- ggplot(filter(diamonds, z < 20)) + geom_histogram(aes(x = z), binwidth = 0.1) + coord_cartesian(xlim = c(0, 10))
grid.arrange(p1, p2, p3, ncol = 1)
Based on the distributions of x, y, and z on the same scale, z has a lower mean than both x and y; in addition, x and y share the same distribution. Hence x and y should be the length and width of a diamond, and z should be its depth.
ggplot(diamonds) + geom_histogram(aes(x = price), binwidth = 100)
The price distribution has many spikes, and there is a gap around 1500 where there are no diamonds at that price.
diamonds %>%
filter(carat == .99 | carat == 1) %>%
count(carat)
## # A tibble: 2 x 2
## carat n
## <dbl> <int>
## 1 0.990 23
## 2 1.00 1558
# filter out these two groups and compare variable distributions/bar plots on potential metrics
carat_99vs1 <- diamonds %>%
filter(carat == .99 | carat == 1)
# distribution of clarity
p1 <- ggplot(carat_99vs1, aes(x = clarity, group = 1)) + geom_bar(aes(y = ..prop..), position = "dodge", show.legend = FALSE) + facet_wrap(~ factor(carat)) + theme(axis.text.x = element_text(size = 9, vjust = 0.5, angle = -45))
p2 <- ggplot(carat_99vs1, aes(x = cut, group = 1)) + geom_bar(aes(y = ..prop..), position = "dodge") + facet_wrap(~ factor(carat)) + theme(axis.text.x = element_text(size = 9, vjust = 0.5, angle = -45))
p3 <- ggplot(carat_99vs1, aes(x = depth, color = factor(carat))) + geom_freqpoly(size = 1.1, stat = 'density', show.legend = FALSE) + scale_y_continuous(labels = scales::percent_format())
p4 <- ggplot(carat_99vs1, aes(x = table, color = factor(carat))) + geom_freqpoly(size = 1.1, stat = 'density') + scale_y_continuous(labels = scales::percent_format())
grid.arrange(p1, p2, p3, p4, nrow = 2)
We could also compute the count for each carat value at 1 decimal point to see if some merchant rounds the number up from 0.99 to 1 for better sale price.
diamonds %>%
count(cut_width(carat, 0.1, boundary = 0, closed = "left")) %>%
ggplot(aes(x = `cut_width(carat, 0.1, boundary = 0, closed = "left")`, y = n)) + geom_bar(stat = "identity") + labs(x = "Carat Range", y = "Count") + theme(axis.text.x = element_text(angle = -45, size = 8, vjust = 0.5))
Notice that there is a major increase from the range [0.9, 1) to the range [1, 1.1), which indicates a strong tendency for merchants to round the carat value up.
The coord_cartesian() function zooms in on the area specified by the limits, after having calculated and drawn the geoms. Since the histogram bins have already been calculated, it is unaffected.
However, the xlim() and ylim() functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.
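A sketch comparing the two on the diamonds y variable (the ylim() version drops bins whose counts fall outside the limits and warns about removed rows):
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
ylim(0, 50)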
Visualizing Covariation
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
Or we can use a boxplot:
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
p1 <- flights %>%
mutate(cancelled = ifelse(is.na(dep_time), "Cancelled", "Non-cancelled")) %>%
ggplot(aes(x = sched_dep_time, color = cancelled)) + geom_freqpoly(stat = 'density', size = 0.8, show.legend = FALSE)
p2 <- flights %>%
mutate(cancelled = ifelse(is.na(dep_time), "Cancelled", "Non-cancelled")) %>%
ggplot(aes(x = cancelled, y = sched_dep_time, fill = cancelled)) + geom_boxplot() + theme(axis.text.x = element_blank())
grid.arrange(p1, p2, nrow = 1)
p1 <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
p2 <- ggplot(diamonds, aes(x = color, y = price)) +
geom_boxplot()
p3 <- ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = clarity, y = price))
grid.arrange(p1, p2, p3, ncol = 1)
For both clarity and color, there is a large amount of variation within each category, which overwhelms the between category trend. Carat is clearly the best predictor of its price.
Now that we have established that carat appears to be the best predictor of price, what is the relationship between it and cut?
ggplot(diamonds, aes(x = cut, y = carat)) +
geom_boxplot()
Noticeably, the largest carat diamonds have a cut of "Fair" (the lowest). This negative relationship may be due to the way in which diamonds are selected for sale: a larger diamond can be profitably sold with a lower quality cut, while a smaller diamond requires a better cut.
library("lvplot")
## Warning: package 'lvplot' was built under R version 3.4.4
ggplot(diamonds, aes(x = cut, y = price)) +
geom_lv(aes(fill=..LV..)) + scale_fill_lv()
ggplot(diamonds, aes(x = cut, y = price)) +
geom_lv(aes(fill=..LV..)) + scale_fill_brewer()
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors
p1 <- ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
p2 <- ggplot(data = diamonds, mapping = aes(x = price)) +
geom_histogram() +
facet_wrap(~ cut, ncol = 1, scales = "free_y")
p3 <- ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_violin() +
coord_flip()
grid.arrange(p1, p3)
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
geom_freqpoly() is better for look-up: given a price, it is easy to tell which cut has the highest density. However, the overlapping lines make it difficult to see how the overall distributions relate to each other. geom_violin() and the faceted geom_histogram() have similar strengths and weaknesses: it is easy to visually distinguish differences in the overall shape of the distributions (skewness, central values, variance, etc.), but since we can’t easily compare the vertical values of the distributions, it is difficult to look up which category has the highest density for a given price.
All of these methods depend on tuning parameters to determine the level of smoothness of the distribution.
library("ggbeeswarm")
## Warning: package 'ggbeeswarm' was built under R version 3.4.4
p1 <- ggplot(data = mpg) +
geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
y = hwy)) + labs(x = "", y ="", title = "Default")
p2 <- ggplot(data = mpg) +
geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
y = hwy),
method = "tukey")+ labs(x = "", y = "", title = "tukey")
p3 <- ggplot(data = mpg) +
geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
y = hwy),
method = "tukeyDense")+ labs(x = "", y = "", title = "tukeyDense")
p4 <- ggplot(data = mpg) +
geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
y = hwy),
method = "frowney")+ labs(x = "", y = "", title = "frowney")
p5 <- ggplot(data = mpg) +
geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median),
y = hwy),
method = "smiley")+ labs(x = "", y = "", title = "smiley")
p6 <- ggplot(data = mpg) +
geom_beeswarm(mapping = aes(x = reorder(class, hwy, FUN = median),
y = hwy))+ labs(x = "", y = "", title = "beeswarm")
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)
Visualize the covariation between categorical variables.
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
Another way is to compute count with dplyr and visualize with geom_tile()
diamonds %>%
count(color, cut) %>%
ggplot(aes(x = color, y = cut)) +
geom_tile(aes(fill = n), show.legend = FALSE)
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
# cut within color
ggplot(diamonds, aes(x = color, fill = cut)) + geom_bar(position = "fill")
# color within cut
ggplot(diamonds, aes(x = cut, fill = color)) + geom_bar(position = "fill")
flights %>%
filter(dep_delay > 0) %>%
group_by(dest, month) %>%
summarise(avg_delay = mean(dep_delay)) %>%
ggplot(aes(dest, factor(month))) + geom_tile(aes(fill = avg_delay)) +
labs(y = "Months", x = "Destination")
The plot is difficult to read because there are too many destinations to show. To improve a large plot like this, try the d3heatmap or heatmaply packages, which create interactive plots.
library(d3heatmap)
## Warning: package 'd3heatmap' was built under R version 3.4.4
flights %>%
filter(dep_delay > 0) %>%
group_by(dest, month) %>%
summarise(avg_delay = mean(dep_delay)) %>%
spread(dest, avg_delay) %>%
d3heatmap(scale = "column", dendrogram = "none", color = "Blues")
Or we can filter out destinations that do not have at least one flight every month, and reorder dest according to average departure delay to better visualize the pattern.
library("viridis")
## Warning: package 'viridis' was built under R version 3.4.4
## Loading required package: viridisLite
flights %>%
group_by(month, dest) %>%
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
group_by(dest) %>%
filter(n() == 12) %>%
ungroup() %>%
mutate(dest = reorder(dest, dep_delay)) %>%
ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
geom_tile() +
scale_fill_viridis() +
labs(x = "Month", y = "Destination", fill = "Departure Delay")
# let's compare
p1 <- ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
p2 <- ggplot(data = diamonds) +
geom_count(mapping = aes(x = color, y = cut))
grid.arrange(p1, p2, nrow = 1)
It’s usually better to put the categorical variable with the larger number of categories, or the longer labels, on the y axis. Another justification for switching the order is that the larger counts are at the top when x = color and y = cut, which lowers the cognitive burden of interpreting the plot.
Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.
geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins.
library(hexbin)
## Warning: package 'hexbin' was built under R version 3.4.4
p1 <- diamonds %>%
filter(carat < 3) %>%
ggplot() +
geom_bin2d(mapping = aes(x = carat, y = price), show.legend = FALSE)
p2 <- diamonds %>%
filter(carat < 3) %>%
ggplot() +
geom_hex(mapping = aes(x = carat, y = price), show.legend = FALSE)
grid.arrange(p1, p2, nrow = 1)
Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a categorical and a continuous variable that you learned about.
diamonds %>%
filter(carat < 3) %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.
diamonds %>%
filter(carat < 3) %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number():
diamonds %>%
filter(carat < 3) %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
Using freqpoly, I need to consider the smoothness (the binwidth, or number of points in a bin): we don’t want to bin so many points that we lose the trend of the covariate, nor so few that the trend is heavily affected by local points. Bins need to be large enough to remove noise but not so large that they obscure the pattern.
small_data <- diamonds %>%
filter(carat < 3)
ggplot(small_data, aes(x = carat, color = cut_width(price, 2500, boundary = 0))) + geom_freqpoly(size = .8, binwidth = 0.1) + scale_color_discrete("Price Range")
Besides the plots above, we can also use boxplots to show this:
p1 <- ggplot(small_data, aes(x = price, y = carat)) + geom_boxplot(aes(group = cut_number(price, 10))) + coord_flip()
p2 <- ggplot(small_data, aes(x = price, y = carat)) + geom_boxplot(aes(group = cut_width(price, 2000, boundary = 0))) + coord_flip()
grid.arrange(p1, p2, nrow = 1)
ggplot(small_data, aes(x = price)) + geom_freqpoly(aes(color = cut_width(carat, 0.5, boundary = 0)), binwidth = 1000, size = .8, stat = "density") + scale_color_discrete("Carat Range")
## Warning: Ignoring unknown parameters: binwidth
The price distribution of large diamonds is left skewed (denser in the right tail) while that of small diamonds is right skewed. This matches my expectation, because large diamonds can more easily be sold for a much higher price than average diamonds.
ggplot(small_data, aes(x = carat, y = price)) + geom_jitter(size = .5, alpha = .5) + facet_wrap(~cut)
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y), size = .7) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Patterns in your data provide clues about relationships. If you spot a pattern, ask yourself:
Could this pattern be due to coincidence (i.e. random chance)?
How can you describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of the data?
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
library(modelr)
## Warning: package 'modelr' was built under R version 3.4.4
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
You’ll learn how models, and the modelr package, work in the final part of the book, model. We’re saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
This part of the book proceeds as follows:
In tibbles, you'll learn about the variant of the data frame that we use in this book: the tibble. You'll learn what makes them different from regular data frames, and how you can construct them "by hand".
In data import, you’ll learn how to get your data from disk and into R. We’ll focus on plain-text rectangular formats, but will give you pointers to packages that help with other types of data.
In tidy data, you’ll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier. You’ll learn the underlying principles, and how to get your data into a tidy form.
Relational data will give you tools for working with multiple interrelated datasets.
Strings will introduce regular expressions, a powerful tool for manipulating strings.
Factors are how R stores categorical data. They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
Dates and times will give you the key tools for working with dates and date-times.
Almost all of the functions that you’ll use in this book produce tibbles, as tibbles are one of the unifying features of the tidyverse. Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble()
as_tibble(iris)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.10 3.50 1.40 0.200 setosa
## 2 4.90 3.00 1.40 0.200 setosa
## 3 4.70 3.20 1.30 0.200 setosa
## 4 4.60 3.10 1.50 0.200 setosa
## 5 5.00 3.60 1.40 0.200 setosa
## 6 5.40 3.90 1.70 0.400 setosa
## 7 4.60 3.40 1.40 0.300 setosa
## 8 5.00 3.40 1.50 0.200 setosa
## 9 4.40 2.90 1.40 0.200 setosa
## 10 4.90 3.10 1.50 0.100 setosa
## # ... with 140 more rows
You can create a new tibble from individual vectors with tibble().
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
## # A tibble: 5 x 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1. 2.
## 2 2 1. 5.
## 3 3 1. 10.
## 4 4 1. 17.
## 5 5 1. 26.
If you’re already familiar with data.frame(), note that tibble() does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
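For comparison, a minimal sketch of the difference (under the R 3.4.x defaults used in these notes, where data.frame() converts strings to factors and repairs names):
df <- data.frame(x = c("a", "b"), `two words` = 1:2)
str(df) # x becomes a factor; the second column is renamed to two.words
tb <- tibble(x = c("a", "b"), `two words` = 1:2)
str(tb) # x stays character; the name `two words` is kept as-is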
It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names.
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
## # A tibble: 1 x 3
## `:)` ` ` `2000`
## <chr> <chr> <chr>
## 1 smile space number
Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
## # A tibble: 2 x 3
## x y z
## <chr> <dbl> <dbl>
## 1 a 2. 3.60
## 2 b 1. 8.50
Print tibble
nycflights13::flights %>%
print(n = 10, width = Inf) # print all columns
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
## 1 819 11. UA 1545 N14228 EWR IAH 227.
## 2 830 20. UA 1714 N24211 LGA IAH 227.
## 3 850 33. AA 1141 N619AA JFK MIA 160.
## 4 1022 -18. B6 725 N804JB JFK BQN 183.
## 5 837 -25. DL 461 N668DN LGA ATL 116.
## 6 728 12. UA 1696 N39463 EWR ORD 150.
## 7 854 19. B6 507 N516JB EWR FLL 158.
## 8 723 -14. EV 5708 N829AS LGA IAD 53.
## 9 846 -8. B6 79 N593JB JFK MCO 140.
## 10 745 8. AA 301 N3ALAA LGA ORD 138.
## distance hour minute time_hour
## <dbl> <dbl> <dbl> <dttm>
## 1 1400. 5. 15. 2013-01-01 05:00:00
## 2 1416. 5. 29. 2013-01-01 05:00:00
## 3 1089. 5. 40. 2013-01-01 05:00:00
## 4 1576. 5. 45. 2013-01-01 05:00:00
## 5 762. 6. 0. 2013-01-01 06:00:00
## 6 719. 5. 58. 2013-01-01 05:00:00
## 7 1065. 6. 0. 2013-01-01 06:00:00
## 8 229. 6. 0. 2013-01-01 06:00:00
## 9 944. 6. 0. 2013-01-01 06:00:00
## 10 733. 6. 0. 2013-01-01 06:00:00
## # ... with 3.368e+05 more rows
You can also control the default print behaviour by setting options:
options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.
Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.
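For example, a possible setup (these are tibble package options):
options(tibble.print_max = 15, tibble.print_min = 10) # if more than 15 rows, print only 10
options(tibble.width = Inf) # always print all columns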
A final option is to use RStudio’s built-in data viewer to get a scrollable view of the complete dataset.
nycflights13::flights %>%
View()
Data Extraction
df <- tibble(
x = runif(5),
y = rnorm(5)
)
# Extract by name
df$x
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137
df[["x"]]
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137
# Extract by position
df[[1]]
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137
To use these in a pipe, you’ll need to use the special placeholder .:
df %>% .$x
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137
df %>% .[["x"]]
## [1] 0.4596974 0.7775757 0.4863925 0.7403212 0.6311137
Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame
The main reason that some older functions don’t work with tibble is the [ function.
With base R data frames, [ sometimes returns a data frame, and sometimes returns a vector. With tibbles, [ always returns another tibble.
df <- data.frame(abc = 1, xyz = "a")
df$x # partial matching
## [1] a
## Levels: a
df[, "xyz"]
## [1] a
## Levels: a
df[, c("abc", "xyz")]
## abc xyz
## 1 1 a
df <- as.tibble(df)
df$x
## Warning: Unknown or uninitialised column: 'x'.
## NULL
df[, "xyz"]
## # A tibble: 1 x 1
## xyz
## <fct>
## 1 a
df[, c("abc", "xyz")]
## # A tibble: 1 x 2
## abc xyz
## <dbl> <fct>
## 1 1. a
Practice referring to non-syntactic names in the following data frame by:
Extracting the variable called 1.
Plotting a scatterplot of 1 vs 2.
Creating a new column called 3 which is 2 divided by 1.
Renaming the columns to one, two and three.
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
# 1.
annoying$`1`
## [1] 1 2 3 4 5 6 7 8 9 10
annoying[["1"]] # will return a vector
## [1] 1 2 3 4 5 6 7 8 9 10
annoying["1"] # will return a tibble
## # A tibble: 10 x 1
## `1`
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
# 2.
ggplot(annoying) + geom_point(aes(x = `1`, y = `2`))
# 3.
annoying[["3"]] <- annoying[[2]] / annoying[[1]]
# 4. renaming
annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)
enframe() converts named atomic vectors or lists to two-column data frames. For unnamed vectors, the natural sequence is used as name column.
enframe(1:3)
## # A tibble: 3 x 2
## name value
## <int> <int>
## 1 1 1
## 2 2 2
## 3 3 3
enframe(c(a = 5, b = 7))
## # A tibble: 2 x 2
## name value
## <chr> <dbl>
## 1 a 5.
## 2 b 7.
What option controls how many additional column names are printed at the footer of a tibble?
The n_extra argument determines the number of extra columns to print information for.
Load flat files in R with the readr package, which is part of the core tidyverse.
Most of readr’s functions are concerned with turning flat files into data frames:
read_csv() reads comma delimited files, read_csv2() reads semicolon separated files (common in countries where , is used as the decimal place), read_tsv() reads tab delimited files, and read_delim() reads in files with any delimiter.
read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed width files where columns are separated by white space.
read_log() reads Apache style log files. (But also check out webreadr which is built on top of read_log() and provides many more helpful tools.)
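These functions all have a similar syntax. For instance, read_csv() can also read an inline string instead of a file path, which is handy for small experiments:
read_csv("a,b,c\n1,2,3\n4,5,6")
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z")) # supply column names yourself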
Parsing Vector
we need to take a little detour to talk about the parse_*() functions. These functions take a character vector and return a more specialised vector like a logical, integer, or date:
str(parse_logical(c("TRUE", "FALSE", "NA")))
## logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
## int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
## Date[1:2], format: "2010-01-01" "1979-10-14"
These functions are useful in their own right, but are also an important building block for readr. Once you’ve learned how the individual parsers work in this section, we’ll circle back and see how they fit together to parse a complete file in the next section.
If parsing fails, you’ll get a warning:
x <- parse_integer(c("123", "345", "abc", "123.45"))
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 2 parsing failures.
## row # A tibble: 2 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 3 NA an integer abc row 2 4 NA no trailing characters .45
x
## [1] 123 345 NA NA
## attr(,"problems")
## # A tibble: 2 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA an integer abc
## 2 4 NA no trailing characters .45
And the failures will be missing in the output. If there are many parsing failures, you’ll need to use problems() to get the complete set. This returns a tibble, which you can then manipulate with dplyr.
problems(x)
## # A tibble: 2 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA an integer abc
## 2 4 NA no trailing characters .45
Using parsers is mostly a matter of understanding what’s available and how they deal with different types of input. There are eight particularly important parsers:
parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.
parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.
parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.
parse_factor() create factors, the data structure that R uses to represent categorical variables with fixed and known values.
parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.
Parse Numbers
Parsing numbers seems straightforward, but three problems make it tricky: people write numbers differently in different parts of the world (some use . as the decimal mark, others use ,); numbers are often surrounded by other characters, like "$1000" or "10%"; and numbers often contain grouping characters to make them easier to read, like "1,000,000". To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of . by creating a new locale and setting the decimal_mark argument:
parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
parse_number("It cost $123.45")
## [1] 123.45
The final problem is addressed by the combination of parse_number() and the locale, as parse_number() will ignore the "grouping mark":
parse_number("$123,456,789")
## [1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789
Parse String
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
## [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.
In R, we can get at the underlying representation of a string using charToRaw():
guess_encoding(charToRaw(x1))
## # A tibble: 2 x 2
## encoding confidence
## <chr> <dbl>
## 1 ISO-8859-1 0.460
## 2 ISO-8859-9 0.230
guess_encoding(charToRaw(x2))
## # A tibble: 1 x 2
## encoding confidence
## <chr> <dbl>
## 1 KOI8-R 0.420
Give parse_factor() a vector of known levels to generate a warning whenever an unexpected value is present:
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## Warning: 1 parsing failure.
## row # A tibble: 1 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 3 NA value in level set bananana
## [1] apple banana <NA>
## attr(,"problems")
## # A tibble: 1 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA value in level set bananana
## Levels: apple banana
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments:
parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
## [1] "2010-10-10 UTC"
parse_date("2010-10-01")
## [1] "2010-10-01"
parse_date("2010/10/01")
## [1] "2010-10-01"
library(hms)
## Warning: package 'hms' was built under R version 3.4.4
parse_time("01:10 am")
## 01:10:00
parse_time("20:10:01")
## 20:10:01
If you’re using %b or %B with non-English month names, you’ll need to set the lang argument to locale(). See the list of built-in languages in date_names_langs(), or if your language is not already included, create your own with date_names().
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-01"
parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))
## [1] "1979-10-14"
If the decimal and grouping marks are set to the same character, locale() throws an error. If decimal_mark is set to the comma ",", then the grouping mark defaults to the period "."; if the grouping mark is set to a period, then the decimal mark defaults to a comma.
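A quick way to check this behaviour (printing the locale object shows the marks in use):
# locale(decimal_mark = ".", grouping_mark = ".") # error: marks must be different
locale(decimal_mark = ",") # grouping_mark defaults to "."
locale(grouping_mark = ".") # decimal_mark defaults to ","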
parse_date("02/01/2006", locale = locale(date_format = "%m/%d/%Y"))
## [1] "2006-02-01"
parse_date("02/01/2006", "%m/%d/%Y")
## [1] "2006-02-01"
d1 <- "January 1, 2010"
parse_date(d1, "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, "%I:%M:%OS %p")
## 23:15:10.12
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. You can emulate this process with a character vector using guess_parser(), which returns readr’s best guess, and parse_guess() which uses that guess to parse the column:
guess_parser("2010-10-01")
## [1] "date"
guess_parser("15:01")
## [1] "time"
guess_parser(c("TRUE", "FALSE"))
## [1] "logical"
guess_parser(c("1", "5", "9"))
## [1] "integer"
guess_parser(c("12,352,561"))
## [1] "number"
str(parse_guess("2010-10-10"))
## Date[1:1], format: "2010-10-10"
The heuristic tries each of the following types, stopping when it finds a match: logical (contains only "F", "T", "FALSE", or "TRUE"), integer (contains only numeric characters and -), double (contains only valid doubles, including numbers like 4.5e-5), number (contains valid doubles with the grouping mark inside), time (matches the default time_format), date (matches the default date_format), and date-time (any ISO8601 date).
If none of these rules apply, then the column will stay as a vector of strings.
Problems
These defaults don’t always work for larger files. There are two basic problems:
The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.
The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it’s a character vector, whereas you probably want to parse it as something more specific.
challenge <- read_csv(readr_example("challenge.csv"))
## Parsed with column specification:
## cols(
## x = col_integer(),
## y = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 1000 parsing failures.
## row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1001 x no trailing characters .23837975086644292 'F:/R-3.4.3/libra~ file 2 1002 x no trailing characters .41167997173033655 'F:/R-3.4.3/libra~ row 3 1003 x no trailing characters .7460716762579978 'F:/R-3.4.3/libra~ col 4 1004 x no trailing characters .723450553836301 'F:/R-3.4.3/libra~ expected 5 1005 x no trailing characters .614524137461558 'F:/R-3.4.3/libra~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.
problems(challenge)
## # A tibble: 1,000 x 5
## row col expected actual file
## <int> <chr> <chr> <chr> <chr>
## 1 1001 x no trailing characters .23837975086644292 'F:/R-3.4.3/libr~
## 2 1002 x no trailing characters .41167997173033655 'F:/R-3.4.3/libr~
## 3 1003 x no trailing characters .7460716762579978 'F:/R-3.4.3/libr~
## 4 1004 x no trailing characters .723450553836301 'F:/R-3.4.3/libr~
## 5 1005 x no trailing characters .614524137461558 'F:/R-3.4.3/libr~
## 6 1006 x no trailing characters .473980569280684 'F:/R-3.4.3/libr~
## 7 1007 x no trailing characters .5784610391128808 'F:/R-3.4.3/libr~
## 8 1008 x no trailing characters .2415937229525298 'F:/R-3.4.3/libr~
## 9 1009 x no trailing characters .11437866208143532 'F:/R-3.4.3/libr~
## 10 1010 x no trailing characters .2983446326106787 'F:/R-3.4.3/libr~
## # ... with 990 more rows
A good strategy is to work column by column until there are no problems remaining. Here we can see that there are a lot of parsing problems with the x column - there are trailing characters after the integer value. That suggests we need to use a double parser instead.
Every parse_xyz() function has a corresponding col_xyz() function. You use parse_xyz() when the data is in a character vector in R already; you use col_xyz() when you want to tell readr how to load the data.
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
tail(challenge)
## # A tibble: 6 x 2
## x y
## <dbl> <date>
## 1 0.805 2019-11-21
## 2 0.164 2018-03-29
## 3 0.472 2014-08-04
## 4 0.718 2015-08-16
## 5 0.270 2020-02-04
## 6 0.608 2019-01-06
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
This is particularly useful in conjunction with type_convert(), which applies the parsing heuristics to the character columns in a data frame.
df <- tribble(
~x, ~y,
"1", "1.21",
"2", "2.32",
"3", "4.56"
)
df
## # A tibble: 3 x 2
## x y
## <chr> <chr>
## 1 1 1.21
## 2 2 2.32
## 3 3 4.56
type_convert(df)
## Parsed with column specification:
## cols(
## x = col_integer(),
## y = col_double()
## )
## # A tibble: 3 x 2
## x y
## <int> <dbl>
## 1 1 1.21
## 2 2 2.32
## 3 3 4.56
Save CSV
Note that the type information is lost when you save to CSV:
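For example, a round trip that writes challenge out and reads it back forces readr to guess the column types from scratch, so the hand-specified specification is not preserved:
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")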
This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load them in. There are two alternatives:
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
## # A tibble: 2,000 x 2
## x y
## <dbl> <date>
## 1 404. NA
## 2 4172. NA
## 3 3004. NA
## 4 787. NA
## 5 37. NA
## 6 2332. NA
## 7 2489. NA
## 8 1449. NA
## 9 3665. NA
## 10 3863. NA
## # ... with 1,990 more rows
library(feather)
## Warning: package 'feather' was built under R version 3.4.4
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
## # A tibble: 2,000 x 2
## x y
## <dbl> <date>
## 1 404. NA
## 2 4172. NA
## 3 3004. NA
## 4 787. NA
## 5 37. NA
## 6 2332. NA
## 7 2489. NA
## 8 1449. NA
## 9 3665. NA
## 10 3863. NA
## # ... with 1,990 more rows
Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you'll learn about in the many models chapter); feather currently does not.
Spreading and Gathering
To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:
The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.
The name of the variable whose values form the column names. I call that the key, and here it is year.
The name of the variable whose values are spread over the cells. I call that value, and here it’s the number of cases.
To combine the tidied versions of table4a and table4b into a single tibble, we need to use dplyr::left_join()
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
To tidy this up, we first analyse the representation in similar way to gather(). This time, however, we only need two parameters:
The column that contains variable names, the key column. Here, it’s type.
The column that contains values from multiple variables, the value column. Here it’s count.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>%
spread(key = type, value = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
spread(year, return) %>%
gather(`2015`:`2016`, key = year, value = return)
## # A tibble: 4 x 3
## half year return
## <dbl> <chr> <dbl>
## 1 1. 2015 1.88
## 2 2. 2015 0.590
## 3 1. 2016 0.920
## 4 2. 2016 0.170
The functions spread() and gather() are not perfectly symmetrical because column type information is not transferred between them. In the original table the column year was numeric, but after running spread() and gather() it is a character vector. This is because variable names are always converted to a character vector by gather().
The convert argument tries to convert character vectors to the appropriate type. In the background this uses the type.convert() function.
stocks %>%
spread(year, return) %>%
gather(`2015`:`2016`, key = year, value = return, convert = TRUE)
## # A tibble: 4 x 3
## half year return
## <dbl> <int> <dbl>
## 1 1. 2015 1.88
## 2 2. 2015 0.590
## 3 1. 2016 0.920
## 4 2. 2016 0.170
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
# table4a %>%
# gather(1999, 2000, key = "year", value = "cases")
Because the column names 1999 and 2000 are non-syntactic, we need to surround them in backticks: `1999` and `2000`. Without backticks, the tidyverse functions interpret 1999 and 2000 as positions, i.e. the 1999th and 2000th columns of the data frame.
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
Spreading the data frame fails because there are two rows with "age" for "Phillip Woods". We would need to add another column with an indicator for which observation it is.
people <- tribble(
~name, ~key, ~value, ~obs,
#-----------------|--------|------|------
"Phillip Woods", "age", 45, 1,
"Phillip Woods", "height", 186, 1,
"Phillip Woods", "age", 50, 2,
"Jessica Cordero", "age", 37, 1,
"Jessica Cordero", "height", 156, 1
)
people %>%
spread(key, value)
## # A tibble: 3 x 4
## name obs age height
## <chr> <dbl> <dbl> <dbl>
## 1 Jessica Cordero 1. 37. 156.
## 2 Phillip Woods 1. 45. 186.
## 3 Phillip Woods 2. 50. NA
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
We need to gather it!
preg %>%
gather(male, female, key = Sex, value = Number)
## # A tibble: 4 x 3
## pregnant Sex Number
## <chr> <chr> <dbl>
## 1 yes male NA
## 2 no male 20.
## 3 yes female 10.
## 4 no female 12.
gather(preg, sex, count, male, female) %>%
mutate(pregnant = pregnant == "yes",
female = sex == "female") %>%
select(-sex)
## # A tibble: 4 x 3
## pregnant count female
## <lgl> <dbl> <lgl>
## 1 TRUE NA FALSE
## 2 FALSE 20. FALSE
## 3 TRUE 10. TRUE
## 4 FALSE 12. TRUE
Separating and Uniting
table3 has a different problem: we have one column (rate) that contains two variables (cases and population). To fix this problem, we’ll need the separate() function. You’ll also learn about the complement of separate(): unite(), which you use if a single variable is spread across multiple columns.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
separate() takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure 12.4 and the code below.
The default sep value is a regular expression that matches any sequence of non-alphanumeric characters.
If you wish to use a specific character to separate a column, you can pass the character to the sep argument of separate()
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Look carefully at the column types: you’ll notice that cases and population are character columns. This is the default behaviour in separate(): it leaves the type of the column as is. Here, however, it’s not very useful as those really are numbers. We can ask separate() to try and convert to better types using convert = TRUE:
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can also pass a vector of integers to sep. separate() will interpret the integers as positions to split at. Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right of the strings. When using integers to separate strings, the length of sep should be one less than the number of names in into.
(table5 <- table3 %>%
separate(year, into = c("century", "year"), sep = 2) %>%
separate(rate, into = c("class", "population"), convert = TRUE))
## # A tibble: 6 x 5
## country century year class population
## * <chr> <chr> <chr> <int> <int>
## 1 Afghanistan 19 99 745 19987071
## 2 Afghanistan 20 00 2666 20595360
## 3 Brazil 19 99 37737 172006362
## 4 Brazil 20 00 80488 174504898
## 5 China 19 99 212258 1272915272
## 6 China 20 00 213766 1280428583
unite() is the inverse of separate(): it combines multiple columns into a single column. You’ll need it much less frequently than separate(), but it’s still a useful tool to have in your back pocket.
We can use unite() to rejoin the century and year columns that we created. unite() takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select() style:
table5 %>%
unite(new, century, year)
## # A tibble: 6 x 4
## country new class population
## <chr> <chr> <int> <int>
## 1 Afghanistan 19_99 745 19987071
## 2 Afghanistan 20_00 2666 20595360
## 3 Brazil 19_99 37737 172006362
## 4 Brazil 20_00 80488 174504898
## 5 China 19_99 212258 1272915272
## 6 China 20_00 213766 1280428583
In this case we also need to use the sep argument. The default will place an underscore (_) between the values from different columns. Here we don’t want any separator so we use “”:
table5 %>%
unite(new, century, year, sep = "")
## # A tibble: 6 x 4
## country new class population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
# use extra
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "merge")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f,g
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
# use fill
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "left") # fill with missing values to the left
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 <NA> d e
## 3 f g i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
You would set remove = FALSE if you want to create the new variables but also keep the old one:
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table3 %>%
separate(year, into = c("century", "short_year"), sep = 2, remove = FALSE)
## # A tibble: 6 x 5
## country year century short_year rate
## * <chr> <int> <chr> <chr> <chr>
## 1 Afghanistan 1999 19 99 745/19987071
## 2 Afghanistan 2000 20 00 2666/20595360
## 3 Brazil 1999 19 99 37737/172006362
## 4 Brazil 2000 20 00 80488/174504898
## 5 China 1999 19 99 212258/1272915272
## 6 China 2000 20 00 213766/1280428583
separate() uses a regular expression to match the separator, while extract() uses a regular expression with capture groups to define the new columns. There is only one unite() because uniting (concatenating already-separated values) is much easier than separating them apart.
library(dplyr)
df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
df %>% extract(x, "A")
## A
## 1 <NA>
## 2 a
## 3 a
## 4 b
## 5 d
df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)")
## A B
## 1 <NA> <NA>
## 2 a b
## 3 a d
## 4 b c
## 5 d e
Missing values
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks
## # A tibble: 7 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015. 1. 1.88
## 2 2015. 2. 0.590
## 3 2015. 3. 0.350
## 4 2015. 4. NA
## 5 2016. 2. 0.920
## 6 2016. 3. 0.170
## 7 2016. 4. 2.66
The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA.
The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:
stocks %>%
spread(year, return)
## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1. 1.88 NA
## 2 2. 0.590 0.920
## 3 3. 0.350 0.170
## 4 4. NA 2.66
Because these explicit missing values may not be important in other representations of the data, you can set na.rm = TRUE in gather() to turn explicit missing values implicit:
stocks %>%
spread(year, return) %>%
gather(year, return, `2015`:`2016`, na.rm = TRUE)
## # A tibble: 6 x 3
## qtr year return
## * <dbl> <chr> <dbl>
## 1 1. 2015 1.88
## 2 2. 2015 0.590
## 3 3. 2015 0.350
## 4 2. 2016 0.920
## 5 3. 2016 0.170
## 6 4. 2016 2.66
Another important tool for making missing values explicit in tidy data is complete():
stocks %>%
complete(year, qtr)
## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015. 1. 1.88
## 2 2015. 2. 0.590
## 3 2015. 3. 0.350
## 4 2015. 4. NA
## 5 2016. 1. NA
## 6 2016. 2. 0.920
## 7 2016. 3. 0.170
## 8 2016. 4. 2.66
complete() takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.
Filling missing
There is one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
You can fill in these missing values with fill(). It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
treatment %>%
fill(person)
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1. 7.
## 2 Derrick Whitmore 2. 10.
## 3 Derrick Whitmore 3. 9.
## 4 Katherine Burke 1. 4.
# .direction argument
# Direction in which to fill missing values. Currently either "down" (the default) or "up".
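For instance, reusing the treatment tibble above, a sketch of filling upwards instead of downwards:
treatment %>%
fill(person, .direction = "up") # missing names take the next non-missing value below them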
The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Afghanistan AF AFG 1980 NA NA NA
## 2 Afghanistan AF AFG 1981 NA NA NA
## 3 Afghanistan AF AFG 1982 NA NA NA
## 4 Afghanistan AF AFG 1983 NA NA NA
## 5 Afghanistan AF AFG 1984 NA NA NA
## 6 Afghanistan AF AFG 1985 NA NA NA
## 7 Afghanistan AF AFG 1986 NA NA NA
## 8 Afghanistan AF AFG 1987 NA NA NA
## 9 Afghanistan AF AFG 1988 NA NA NA
## 10 Afghanistan AF AFG 1989 NA NA NA
## # ... with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## # new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## # new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## # new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## # new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## # new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## # new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## # newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## # newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## # newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## # newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## # newrel_f65 <int>
This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it.
Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you’ll usually need to string together multiple verbs into a pipeline.
So we need to gather together all the columns from new_sp_m014 to newrel_f65. We don’t know what those values represent yet, so we’ll give them the generic name “key”. We know the cells represent the count of cases, so we’ll use the variable cases. There are a lot of missing values in the current representation, so for now we’ll use na.rm just so we can focus on the values that are present.
who1 <- who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## * <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1998 new_sp_m014 30
## 3 Afghanistan AF AFG 1999 new_sp_m014 8
## 4 Afghanistan AF AFG 2000 new_sp_m014 52
## 5 Afghanistan AF AFG 2001 new_sp_m014 129
## 6 Afghanistan AF AFG 2002 new_sp_m014 90
## 7 Afghanistan AF AFG 2003 new_sp_m014 127
## 8 Afghanistan AF AFG 2004 new_sp_m014 139
## 9 Afghanistan AF AFG 2005 new_sp_m014 151
## 10 Afghanistan AF AFG 2006 new_sp_m014 193
## # ... with 76,036 more rows
We can get some hint of the structure of the values in the new key column by counting them:
who1 %>%
count(key)
## # A tibble: 56 x 2
## key n
## <chr> <int>
## 1 new_ep_f014 1032
## 2 new_ep_f1524 1021
## 3 new_ep_f2534 1021
## 4 new_ep_f3544 1021
## 5 new_ep_f4554 1017
## 6 new_ep_f5564 1017
## 7 new_ep_f65 1014
## 8 new_ep_m014 1038
## 9 new_ep_m1524 1026
## 10 new_ep_m2534 1020
## # ... with 46 more rows
You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB: rel stands for cases of relapse, ep for cases of extrapulmonary TB, sn for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative), and sp for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive).
The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
The remaining numbers give the age group. The dataset groups cases into seven age groups: 014 (0-14), 1524 (15-24), 2534 (25-34), 3544 (35-44), 4554 (45-54), 5564 (55-64), and 65 (65 or older).
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1998 new_sp_m014 30
## 3 Afghanistan AF AFG 1999 new_sp_m014 8
## 4 Afghanistan AF AFG 2000 new_sp_m014 52
## 5 Afghanistan AF AFG 2001 new_sp_m014 129
## 6 Afghanistan AF AFG 2002 new_sp_m014 90
## 7 Afghanistan AF AFG 2003 new_sp_m014 127
## 8 Afghanistan AF AFG 2004 new_sp_m014 139
## 9 Afghanistan AF AFG 2005 new_sp_m014 151
## 10 Afghanistan AF AFG 2006 new_sp_m014 193
## # ... with 76,036 more rows
We can separate the values in each code with two passes of separate(). The first pass will split the codes at each underscore.
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
who3
## # A tibble: 76,046 x 8
## country iso2 iso3 year new type sexage cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m014 0
## 2 Afghanistan AF AFG 1998 new sp m014 30
## 3 Afghanistan AF AFG 1999 new sp m014 8
## 4 Afghanistan AF AFG 2000 new sp m014 52
## 5 Afghanistan AF AFG 2001 new sp m014 129
## 6 Afghanistan AF AFG 2002 new sp m014 90
## 7 Afghanistan AF AFG 2003 new sp m014 127
## 8 Afghanistan AF AFG 2004 new sp m014 139
## 9 Afghanistan AF AFG 2005 new sp m014 151
## 10 Afghanistan AF AFG 2006 new sp m014 193
## # ... with 76,036 more rows
Then we might as well drop the new column because it’s constant in this dataset. While we’re dropping columns, let’s also drop iso2 and iso3 since they’re redundant.
who3 %>%
count(new)
## # A tibble: 1 x 2
## new n
## <chr> <int>
## 1 new 76046
who4 <- who3 %>%
select(-new, -iso2, -iso3)
Next we’ll separate sexage into sex and age by splitting after the first character:
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
who5
## # A tibble: 76,046 x 6
## country year type sex age cases
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # ... with 76,036 more rows
The who dataset is now tidy!
This typically isn’t how you’d work interactively. Instead, you’d gradually build up a complex pipe:
who %>%
gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
select(who3, country, iso2, iso3) %>%
distinct() %>%
group_by(country) %>%
filter(n() > 1)
## # A tibble: 0 x 3
## # Groups: country [0]
## # ... with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>
We will use the nycflights13 package to learn about relational data. nycflights13 contains four tibbles that are related to the flights table that you used in data transformation: airlines, airports, planes, and weather.
One way to show the relationships between the different tables is with a drawing:
library(tidyverse)
library(nycflights13)
The key to understanding diagrams like this is to remember each relation always concerns a pair of tables. You don’t need to understand the whole thing; you just need to understand the chain of relations between the tables that you are interested in.
For nycflights13:
flights connects to planes via a single variable, tailnum.
flights connects to airlines through the carrier variable.
flights connects to airports in two ways: via the origin and dest variables.
flights connects to weather via origin (the location), and year, month, day and hour (the time).
Mutating Joins
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
Like mutate(), the join functions add variables to the right, so if you have a lot of variables already, the new variables won’t get printed out.
Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with left_join():
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5. EWR IAH N14228 UA
## 2 2013 1 1 5. LGA IAH N24211 UA
## 3 2013 1 1 5. JFK MIA N619AA AA
## 4 2013 1 1 5. JFK BQN N804JB B6
## 5 2013 1 1 6. LGA ATL N668DN DL
## 6 2013 1 1 5. EWR ORD N39463 UA
## 7 2013 1 1 6. EWR FLL N516JB B6
## 8 2013 1 1 6. LGA IAD N829AS EV
## 9 2013 1 1 6. JFK MCO N593JB B6
## 10 2013 1 1 6. LGA ORD N3ALAA AA
## # ... with 336,766 more rows
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5. N14228 UA United Air Lines Inc.
## 2 2013 1 1 5. N24211 UA United Air Lines Inc.
## 3 2013 1 1 5. N619AA AA American Airlines Inc.
## 4 2013 1 1 5. N804JB B6 JetBlue Airways
## 5 2013 1 1 6. N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5. N39463 UA United Air Lines Inc.
## 7 2013 1 1 6. N516JB B6 JetBlue Airways
## 8 2013 1 1 6. N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6. N593JB B6 JetBlue Airways
## 10 2013 1 1 6. N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
Inner Join: An inner join matches pairs of observations whenever their keys are equal:
x %>%
inner_join(y, by = "key")
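x and y aren't defined in these notes; a minimal pair of demo tables (mirroring the book's join diagrams) makes the line above runnable:
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
x %>% inner_join(y, by = "key") # keeps only the keys present in both tables: 1 and 2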
Outer joins: An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
Defining Key Columns
The default, by = NULL, uses all variables that appear in both tables.
A character vector, by = "x", is like a natural join, but uses only some of the common variables.
A named character vector, by = c("a" = "b"), matches variable a in table x to variable b in table y. The variables from x will be used in the output.
For example, if we want to draw a map we need to combine the flights data with the airports data which contains the location (lat and lon) of each airport. Each flight has an origin and destination airport, so we need to specify which one we want to join to:
flights2 %>%
left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2013 1 1 5. EWR IAH N14228 UA Georg~ 30.0 -95.3
## 2 2013 1 1 5. LGA IAH N24211 UA Georg~ 30.0 -95.3
## 3 2013 1 1 5. JFK MIA N619AA AA Miami~ 25.8 -80.3
## 4 2013 1 1 5. JFK BQN N804JB B6 <NA> NA NA
## 5 2013 1 1 6. LGA ATL N668DN DL Harts~ 33.6 -84.4
## 6 2013 1 1 5. EWR ORD N39463 UA Chica~ 42.0 -87.9
## 7 2013 1 1 6. EWR FLL N516JB B6 Fort ~ 26.1 -80.2
## 8 2013 1 1 6. LGA IAD N829AS EV Washi~ 38.9 -77.5
## 9 2013 1 1 6. JFK MCO N593JB B6 Orlan~ 28.4 -81.3
## 10 2013 1 1 6. LGA ORD N3ALAA AA Chica~ 42.0 -87.9
## # ... with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## # dst <chr>, tzone <chr>
Filtering Joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types: semi_join(x, y) keeps all observations in x that have a match in y, and anti_join(x, y) drops all observations in x that have a match in y.
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
## # A tibble: 10 x 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
# Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:
flights %>%
filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 542 540 2. 923
## 2 2013 1 1 554 600 -6. 812
## 3 2013 1 1 554 558 -4. 740
## 4 2013 1 1 555 600 -5. 913
## 5 2013 1 1 557 600 -3. 838
## 6 2013 1 1 558 600 -2. 753
## 7 2013 1 1 558 600 -2. 924
## 8 2013 1 1 558 600 -2. 923
## 9 2013 1 1 559 559 0. 702
## 10 2013 1 1 600 600 0. 851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
But it’s difficult to extend that approach to multiple variables. For example, imagine that you’d found the 10 days with highest average delays. How would you construct the filter statement that used year, month, and day to match it back to flights?
flights %>%
semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 542 540 2. 923
## 2 2013 1 1 554 600 -6. 812
## 3 2013 1 1 554 558 -4. 740
## 4 2013 1 1 555 600 -5. 913
## 5 2013 1 1 557 600 -3. 838
## 6 2013 1 1 558 600 -2. 753
## 7 2013 1 1 558 600 -2. 924
## 8 2013 1 1 558 600 -2. 923
## 9 2013 1 1 559 559 0. 702
## 10 2013 1 1 600 600 0. 851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
# If NULL, the default, *_join() will do a natural join, using all variables with common names across the two tables.
Only the existence of a match is important; it doesn't matter which observation is matched. This means that filtering joins never duplicate rows the way mutating joins can.
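A small sketch with hypothetical tables containing duplicate keys makes the difference visible:
df1 <- tibble(key = c(1, 1, 2), val = c("a", "b", "c"))
df2 <- tibble(key = c(1, 1, 2), extra = c("x", "y", "z"))
nrow(left_join(df1, df2, by = "key")) # 5 rows: matching rows are duplicated
nrow(semi_join(df1, df2, by = "key")) # 3 rows: only the existence of a match counts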
Anti-joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don’t have a match in planes:
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
## tailnum n
## <chr> <int>
## 1 <NA> 2512
## 2 N725MQ 575
## 3 N722MQ 513
## 4 N723MQ 507
## 5 N713MQ 483
## 6 N735MQ 396
## 7 N0EGMQ 371
## 8 N534MQ 364
## 9 N542MQ 363
## 10 N531MQ 349
## # ... with 712 more rows
Matching with RegExp
To learn regular expressions, we’ll use str_view() and str_view_all(). These functions take a character vector and a regular expression, and show you how they match.
If \ is used as an escape character in regular expressions, how do you match a literal \? Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\" - you need four backslashes to match one!
x <- c("apple", "banana", "pear")
str_view(x, "an")
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
## \.
#> \.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
x <- "a\\b"
writeLines(x)
## a\b
#> a\b
str_view(x, "\\\\")
Character Classes
There are a number of special patterns that match more than one character. You've already seen ., which matches any character apart from a newline. There are four other useful tools: \d matches any digit; \s matches any whitespace (e.g. space, tab, newline); [abc] matches a, b, or c; and [^abc] matches anything except a, b, or c.
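A couple of small illustrations with str_view() (the strings are just examples):
str_view(c("grey", "gray"), "gr[ea]y") # character class matching e or a
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") # a literal . inside a character class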
Repetition
The next step up in power involves controlling how many times a pattern matches: ? matches 0 or 1 times, + matches 1 or more, and * matches 0 or more. You can also specify the number of matches precisely: {n} exactly n times, {n,} n or more, and {n,m} between n and m.
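A minimal illustration of the repetition operators, using the example string from the book:
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?") # 0 or 1
str_view(x, "CC+") # 1 or more
str_view(x, "C{2,3}") # between 2 and 3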
Grouping and backreferences
Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2, etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.
str_view(fruit, "(a)(..)\\2", match = TRUE)
Notice
Regular expressions are powerful, but they are not always the simplest way to solve a problem.
A word of caution before we continue: because regular expressions are so powerful, it’s easy to try and solve every problem with a single regular expression.
As a cautionary tale, check out this regular expression that checks whether an email address is valid:
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
\t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
\t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
This is a somewhat pathological example (because email addresses are actually surprisingly complex), but it is used in real code. See the Stack Overflow discussion at http://stackoverflow.com/a/201378 for more details.
Replacing
Instead of replacing with a fixed string, you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
## [1] "The canoe birch slid on the smooth planks."
## [2] "Glue sheet the to the dark blue background."
## [3] "It's to easy tell the depth of a well."
## [4] "These a days chicken leg is a rare dish."
## [5] "Rice often is served in round bowls."
With str_replace_all() you can perform multiple replacements by supplying a named vector:
x <- c("1 house 3", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house three" "two cars" "three people"
This chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you’re working with dates/times. We will also need nycflights13 for practice data.
library(tidyverse)
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:hms':
##
## hms
## The following object is masked from 'package:base':
##
## date
library(nycflights13)
today()
## [1] "2018-08-10"
now()
## [1] "2018-08-10 02:51:20 EDT"
There are three ways you’re likely to create a date/time: from a string, from individual date-time components, or from an existing date/time object.
ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
ymd(20170131)
## [1] "2017-01-31"
ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"
ymd(20170131, tz = "UTC")
## [1] "2017-01-31 UTC"
Sometimes the individual components of a date-time are spread across several columns, as in the flights data, where year, month, day, hour, and minute are separate variables. To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5. 15. 2013-01-01 05:15:00
## 2 2013 1 1 5. 29. 2013-01-01 05:29:00
## 3 2013 1 1 5. 40. 2013-01-01 05:40:00
## 4 2013 1 1 5. 45. 2013-01-01 05:45:00
## 5 2013 1 1 6. 0. 2013-01-01 06:00:00
## 6 2013 1 1 5. 58. 2013-01-01 05:58:00
## 7 2013 1 1 6. 0. 2013-01-01 06:00:00
## 8 2013 1 1 6. 0. 2013-01-01 06:00:00
## 9 2013 1 1 6. 0. 2013-01-01 06:00:00
## 10 2013 1 1 6. 0. 2013-01-01 06:00:00
## # ... with 336,766 more rows
Let’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once I’ve created the date-time variables, I focus in on the variables we’ll explore in the rest of the chapter.
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
## # A tibble: 328,063 x 9
## origin dest dep_delay arr_delay dep_time
## <chr> <chr> <dbl> <dbl> <dttm>
## 1 EWR IAH 2. 11. 2013-01-01 05:17:00
## 2 LGA IAH 4. 20. 2013-01-01 05:33:00
## 3 JFK MIA 2. 33. 2013-01-01 05:42:00
## 4 JFK BQN -1. -18. 2013-01-01 05:44:00
## 5 LGA ATL -6. -25. 2013-01-01 05:54:00
## 6 EWR ORD -4. 12. 2013-01-01 05:54:00
## 7 EWR FLL -5. 19. 2013-01-01 05:55:00
## 8 LGA IAD -3. -14. 2013-01-01 05:57:00
## 9 JFK MCO -3. -8. 2013-01-01 05:57:00
## 10 LGA ORD -2. 8. 2013-01-01 05:58:00
## # ... with 328,053 more rows, and 4 more variables: sched_dep_time <dttm>,
## # arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
With this data, I can visualise the distribution of departure times across the year:
flights_dt %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
Or within a single day:
flights_dt %>%
filter(dep_time < ymd(20130102)) %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():
as_datetime(today())
## [1] "2018-08-10 UTC"
as_date(now())
## [1] "2018-08-10"
# Sometimes you'll get date/times as numeric offsets from the "Unix Epoch", 1970-01-01.
as_datetime(60 * 60 * 10)
## [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
## [1] "1980-01-01"
You can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
## [1] 2016
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
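The other accessors work the same way; a couple more examples on the same datetime (expected output shown as comments):
month(datetime, label = TRUE)  # Jul
mday(datetime)                 # 8
yday(datetime)                 # 190
hour(datetime)                 # 12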
We can use wday() to see that more flights depart during the week than on the weekend:
flights_dt %>%
mutate(wday = wday(dep_time, label = TRUE)) %>%
ggplot(aes(x = wday)) +
geom_bar()
Time Durations
A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.
h_age <- today() - ymd(19791014)
h_age
## Time difference of 14180 days
as.duration(h_age)
## [1] "1225152000s (~38.82 years)"
# Durations come with a bunch of convenient constructors
dseconds(15)
## [1] "15s"
dminutes(10)
## [1] "600s (~10 minutes)"
dhours(c(12, 24))
## [1] "43200s (~12 hours)" "86400s (~1 days)"
dweeks(3)
## [1] "1814400s (~3 weeks)"
Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds at the standard rate (60 seconds in a minute, 60 minutes in an hour, 24 hours in day, 7 days in a week, 365 days in a year).
You can add and multiply durations:
2 * dyears(1)
## [1] "63072000s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
## [1] "38847600s (~1.23 years)"
However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"
Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if we add a full day’s worth of seconds we end up with a different time.
Periods: time spans that, unlike durations, work with "human" units such as days and months rather than an exact number of seconds. Like durations, periods can be created with a number of friendly constructor functions.
Compared to durations, periods are more likely to do what you expect:
one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"
seconds(15)
## [1] "15S"
minutes(10)
## [1] "10M 0S"
weeks(3:6)
## [1] "21d 0H 0M 0S" "28d 0H 0M 0S" "35d 0H 0M 0S" "42d 0H 0M 0S"
years(1)
## [1] "1y 0m 0d 0H 0M 0S"
10 * (months(6) + days(1))
## [1] "60m 10d 0H 0M 0S"
days(50) + hours(25) + minutes(2)
## [1] "50d 25H 2M 0S"
# A leap year
ymd("2016-01-01") + dyears(1)
## [1] "2016-12-31"
ymd("2016-01-01") + years(1)
## [1] "2017-01-01"
# Daylight Savings Time
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"
Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City. These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.
flights_dt <- flights_dt %>%
mutate(
overnight = arr_time < dep_time,
arr_time = arr_time + days(overnight * 1),
sched_arr_time = sched_arr_time + days(overnight * 1)
)
Now all of our flights obey the laws of physics.
flights_dt %>%
filter(overnight, arr_time < dep_time)
## # A tibble: 0 x 10
## # ... with 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,
## # arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>,
## # arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## # overnight <lgl>
Transformations are useful because you can use them to approximate non-linear functions. If you’ve taken a calculus class, you may have heard of Taylor’s theorem which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a polynomial function to get arbitrarily close to a smooth function by fitting an equation like y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3. Typing that sequence by hand is tedious, so R provides a helper function: poly():
library(modelr)  # model_matrix() comes from the modelr package
df <- tribble(
~y, ~x,
1, 1,
2, 2,
3, 3
)
model_matrix(df, y ~ poly(x, 2))
## # A tibble: 3 x 3
## `(Intercept)` `poly(x, 2)1` `poly(x, 2)2`
## <dbl> <dbl> <dbl>
## 1 1. -7.07e- 1 0.408
## 2 1. -7.85e-17 -0.816
## 3 1. 7.07e- 1 0.408
However there’s one major problem with using poly(): outside the range of the data, polynomials rapidly shoot off to positive or negative infinity. One safer alternative is to use the natural spline, splines::ns().
library(splines)
model_matrix(df, y ~ ns(x, 2))
## # A tibble: 3 x 3
## `(Intercept)` `ns(x, 2)1` `ns(x, 2)2`
## <dbl> <dbl> <dbl>
## 1 1. 0. 0.
## 2 1. 0.566 -0.211
## 3 1. 0.344 0.771
Let’s see what that looks like when we try and approximate a non-linear function:
library(modelr)
sim5 <- tibble(
x = seq(0, 3.5 * pi, length = 50),
y = 4 * sin(x) + rnorm(length(x))
)
ggplot(sim5, aes(x, y)) +
geom_point()
# fit 5 models to this data
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)
grid <- sim5 %>%
data_grid(x = seq_range(x, n = 50, expand = 0.1)) %>%
gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")
ggplot(sim5, aes(x, y)) +
geom_point() +
geom_line(data = grid, colour = "red") +
facet_wrap(~ model)
Notice that the extrapolation outside the range of the data is clearly bad. This is the downside to approximating a function with a polynomial. But this is a very real problem with every model: the model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen. You must rely on theory and science.
Generalised linear models, e.g. stats::glm(). Linear models assume that the response is continuous and the error has a normal distribution. Generalised linear models extend linear models to include non-continuous responses (e.g. binary data or counts). They work by defining a distance metric based on the statistical idea of likelihood.
Generalised additive models, e.g. mgcv::gam(), extend generalised linear models to incorporate arbitrary smooth functions. That means you can write a formula like y ~ s(x) which becomes an equation like y = f(x) and let gam() estimate what that function is (subject to some smoothness constraints to make the problem tractable).
Penalised linear models, e.g. glmnet::glmnet(), add a penalty term to the distance that penalises complex models (as defined by the distance between the parameter vector and the origin). This tends to make models that generalise better to new datasets from the same population.
Robust linear models, e.g. MASS::rlm(), tweak the distance to downweight points that are very far away. This makes them less sensitive to the presence of outliers, at the cost of being not quite as good when there are no outliers.
Trees, e.g. rpart::rpart(), attack the problem in a completely different way than linear models. They fit a piece-wise constant model, splitting the data into progressively smaller and smaller pieces. Trees aren’t terribly effective by themselves, but they are very powerful when used in aggregate by models like random forests (e.g. randomForest::randomForest()) or gradient boosting machines (e.g. xgboost::xgboost()). A short sketch of how these alternatives plug into the same formula interface follows.
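All of these families share R’s formula interface, so trying one is usually a one-line change. A minimal sketch on the sim5 data from above (assuming the recommended MASS and rpart packages are installed; results not shown):
mod_glm  <- glm(y ~ ns(x, 3), data = sim5, family = gaussian)  # same fit as lm() here
mod_rlm  <- MASS::rlm(y ~ ns(x, 3), data = sim5)               # downweights outliers
mod_tree <- rpart::rpart(y ~ x, data = sim5)                   # piece-wise constant fit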
Non-Overlapping Labels
We can use the ggrepel package by Kamil Slowikowski. This useful package will automatically adjust labels so that they don’t overlap:
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 3.4.4
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_point(size = 3, shape = 1, data = best_in_class) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
Themes
Saving Plots
There are two main ways to get your plots out of R and into your final write-up: ggsave() and knitr. ggsave() will save the most recent plot to disk.
If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them.
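For example (the file name and dimensions here are only illustrative):
ggplot(mpg, aes(displ, hwy)) + geom_point()
# width and height are in inches by default; specifying them keeps the output reproducible
ggsave("my-plot.pdf", width = 6, height = 4)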
Generally, however, I think you should be assembling your final reports using R Markdown, so I want to focus on the important code chunk options that you should know about for graphics. You can learn more about ggsave() in the documentation.
There are five main options that control figure sizing: fig.width, fig.height, fig.asp, out.width and out.height. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
I only ever use three of the five options:
I find it most aesthetically pleasing for plots to have a consistent width. To enforce this, I set fig.width = 6 (6 inches) and fig.asp = 0.618 (the golden ratio) in the defaults. Then in individual chunks, I only adjust fig.asp.
I control the output size with out.width, setting it to a percentage of the line width. I default to out.width = "70%" and fig.align = "center". That gives plots room to breathe, without taking up too much space.
To put multiple plots in a single row, I set out.width to 50% for two plots, 33% for three plots, or 25% for four plots, and set fig.align = "default". Depending on what I’m trying to illustrate (e.g. show data or show plot variations), I’ll also tweak fig.width, as discussed below. A setup chunk collecting these defaults is sketched just after this paragraph.
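A setup chunk along these lines captures the defaults described above (values are illustrative; adjust to taste):
knitr::opts_chunk$set(
  fig.width = 6,
  fig.asp = 0.618,
  out.width = "70%",
  fig.align = "center"
)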
If you find that you’re having to squint to read the text in your plot, you need to tweak fig.width. If fig.width is larger than the size the figure is rendered in the final doc, the text will be too small; if fig.width is smaller, the text will be too big. You’ll often need to do a little experimentation to figure out the right ratio between the fig.width and the eventual width in your document.
When mingling code and text, like I do in this book, I recommend setting fig.show = “hold” so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.
To add a caption to the plot, use fig.cap. In R Markdown this will change the figure from inline to “floating”.
It’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse in other circumstances (i.e. if you want to quickly drop a single plot into an email or a tweet).
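For example, a named figure chunk in the R Markdown source might look like this (the label, caption, and aspect ratio are illustrative):
```{r best-in-class-plot, fig.cap = "Best-in-class cars highlighted", fig.asp = 0.8}
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))
```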