5 Manipulating data

Functions introduced in this chapter

read_csv, select, rename, filter, slice, arrange, mutate, all.equal, ifelse, transmute, summarise, group_by, |>, count

5.1 Introduction

This tutorial will import some data from the web and then explore it using the amazing dplyr package, a package which is quickly becoming the de facto standard among R users for manipulating data. It’s part of the tidyverse that we’ve already used in several chapters.

5.1.1 Install new packages

There are no new packages used in this chapter.

5.1.2 Download the Quarto file

Look at either the top (Posit Cloud) or the upper right corner of the RStudio screen to make sure you are in your intro_stats project.

Then click on the following link to download this chapter as a Quarto file (.qmd).

https://vectorposse.github.io/intro_stats/chapter_downloads/05-manipulating_data.qmd

Once the file is downloaded, move it to your project folder in RStudio and open it there.

5.1.3 Restart R and run all chunks

In RStudio, select “Restart R and Run All Chunks” from the “Run” menu.

5.1.4 Load packages

We load the tidyverse package as usual, but this time it is to give us access to the dplyr package, which is loaded alongside our other tidyverse packages like ggplot2. The tidyverse also has a package called readr that will allow us to import data from an external source (in this case, a web site).

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

5.2 Importing CSV data

For most of the chapters, we use data sets that are either included in base R or included in a package that can be loaded into R. But it is useful to see how to get a data set from outside the R ecosystem. This depends a lot on the format of the data file, but a common format is a “comma-separated values” file, or CSV file. If you have a data set that is not formatted as a CSV file, it is usually pretty easy to open it in something like Google Spreadsheets or Microsoft Excel and then re-save it as a CSV file.

The file we’ll import is a random sample from all the commercial domestic flights that departed from Houston, Texas, in 2011.

We use the read_csv command to import a CSV file. In this case, we’re grabbing the file from a web page where the file is hosted. If you have a file on your computer, you can also put the file into your project directory and import it from there. Put the URL (for a web page) or the filename (for a file in your project directory) in quotes inside the read_csvcommand. We also need to assign the output to a tibble, so we’ve called it hf for “Houston flights”.

hf <- read_csv("https://vectorposse.github.io/intro_stats/data/hf.csv")

Rows: 22758 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): UniqueCarrier, TailNum, Origin, Dest, CancellationCode
dbl (16): Year, Month, DayofMonth, DayOfWeek, DepTime, ArrTime, FlightNum, A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here is a preview of the data set. If you’re looking at this on the website, you’ll note that only the first 10,000 rows print here.

hf

The glimpse command is the most reliable place to see the correct number of rows.

glimpse(hf)

Rows: 22,758
Columns: 21
$ Year              <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
$ Month             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ DayofMonth        <dbl> 12, 17, 24, 9, 18, 22, 11, 14, 26, 14, 18, 20, 3, 12…
$ DayOfWeek         <dbl> 3, 1, 1, 7, 2, 6, 2, 5, 3, 5, 2, 4, 1, 3, 6, 4, 1, 3…
$ DepTime           <dbl> 1419, 1530, 1356, 714, 721, 717, 1953, 2119, 2009, 1…
$ ArrTime           <dbl> 1515, 1634, 1513, 829, 827, 829, 2051, 2229, 2103, 1…
$ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
$ FlightNum         <dbl> 428, 428, 428, 460, 460, 460, 533, 533, 533, 1121, 1…
$ TailNum           <chr> "N577AA", "N518AA", "N531AA", "N586AA", "N558AA", "N…
$ ActualElapsedTime <dbl> 56, 64, 77, 75, 66, 72, 58, 70, 54, 65, 135, 144, 64…
$ AirTime           <dbl> 41, 48, 43, 51, 46, 47, 44, 45, 39, 47, 114, 111, 46…
$ ArrDelay          <dbl> 5, 84, 3, -6, -8, -6, -29, 69, -17, -11, 39, -1, -2,…
$ DepDelay          <dbl> 19, 90, -4, -6, 1, -3, -12, 74, 4, -1, 44, -5, -1, 1…
$ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IA…
$ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DF…
$ Distance          <dbl> 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 96…
$ TaxiIn            <dbl> 4, 8, 6, 11, 7, 18, 3, 5, 9, 8, 7, 20, 5, 8, 8, 7, 4…
$ TaxiOut           <dbl> 11, 8, 28, 13, 13, 7, 11, 20, 6, 10, 14, 13, 13, 10,…
$ Cancelled         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ CancellationCode  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Diverted          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

The one disadvantage of a file imported from the internet or your computer is that it does not come with a help file. (Only packages in R have help files.) Hopefully you have access to some kind of information about the data you’re importing. In this case, we get lucky because the full Houston flights data set happens to be available in a package called hflights.

Exercise 1

Click or copy this link to view a copy of the documentation for the full Houston flights data. Read the webpage so you know what all the variables mean. Report below how many flights are in the original hflights data. What fraction of the original data has been sampled in the CSV file we imported above?

Please write up your answer here.

5.3 Introduction to `dplyr`

The dplyr package (pronounced “dee-ply-er”) contains tools for manipulating the rows and columns of tibbles. The key to using dplyr is to familiarize yourself with the “key verbs”:

select (and rename)
filter (and slice)
arrange
mutate (and transmute)
summarise (with group_by)
count

We’ll consider these one by one. We won’t have time to cover every aspect of these functions. More information appears in the help files, as well as this very helpful “cheat sheet”: https://rstudio.github.io/cheatsheets/html/data-transformation.html

5.4 `select`

The select verb is very easy. It just selects some subset of variables (the columns of your data set).

The select command from the dplyr package illustrates one of the common issues R users face. Because the word “select” is pretty common, and selecting things is a common task, there are multiple packages that have a function called select. Depending on the order in which packages were loaded, R might get confused as to which version of select you want and try to apply the wrong one. One way to get the correct version is to specify the package in the syntax. Instead of typing select, we can type dplyr::select to ensure we get the version from the dplyr package. We’ll do this in all future uses of the select function. (The other functions in this chapter don’t cause us trouble because we don’t use any other packages whose functions conflict like this.)

Suppose all we wanted to see was the carrier, origin, and destination. The command to do this appears in the next code chunk. (Note that we’re only going to print the first 100 rows. It takes a lot of memory to keep printing tibbles that have 22,758 rows!)

hf_select <- dplyr::select(hf, UniqueCarrier, Origin, Dest)
head(hf_select, n = 100)

A brief but important aside here: there is nothing special about the variable name hf_select. We could have typed

beef_gravy <- dplyr::select(hf, UniqueCarrier, Origin, Dest)

and it would work just as well. Generally speaking, though, you want to give variables a name that reflects the intent of your analysis.

Also, it is important to assign the result to a variable. If we had just typed

dplyr::select(hf, UniqueCarrier, Origin, Dest)

the result would print to the screen, but it wouldn’t be stored anywhere for future use.

Also, it is important to assign the result to a new variable. If we had typed

hf <- dplyr::select(hf, UniqueCarrier, Origin, Dest)

this would have overwritten the original tibble hf with this new version with only three variables. We want to preserve hf because we want to do other things with the entire data set later.

The take-home message here is this: Major modifications to your data should generally be given a new variable name. There are caveats here, though. Every time you create a new tibble, you also fill up more memory with your creation. So if you don’t need the old version of a tibble anymore, you are safe to overwrite the old version with something new and continue working with that newer modification.

Okay, back to the select function. The first argument of select is the tibble. After that, just list all the names of the variables you want to select.

If you don’t like the names of the variables, you can change them as part of the select process. For example,

hf_select <- dplyr::select(hf,
                           carrier = UniqueCarrier,
                           origin = Origin,
                           dest = Dest)
head(hf_select, n = 100)

(Note here that we are overwriting hf_select which had been defined slightly differently before. However, these two versions of hf_select are basically the same object, so no need to keep the older version around just taking up space unnecessarily.)

There are a few notational shortcuts. For example, try to figure out what the following commands are doing:

hf_select2 <- dplyr::select(hf, DayOfWeek:UniqueCarrier)
head(hf_select2, n = 100)

hf_select3 <- dplyr::select(hf, starts_with("Taxi"))
head(hf_select3, n = 100)

Exercise 2

What is contained in the new tibbles hf_select2 and hf_select3? In other words, what does the colon (:) appear to do and what does starts_with appear to do in the select function?

Please write up your answer here.

The cheat sheet shows a lot more of these “helper functions” if you’re interested.

The other command that’s related to select is rename. The only difference is that select will throw away any columns you don’t select (which is what you want and expect, typically), whereas rename will keep all the columns, but rename those you designate.

Exercise 3

Putting a minus sign in front of a variable name in the select command will remove the variable. Create a tibble called hf_select4 that removes Year, DayofMonth, DayOfWeek, FlightNum, and Diverted. (Be careful with the unusual—and inconsistent!—capitalization in those variable names.) In the second part of the code chunk below, type head(hf_select4, n = 100) so that the first 100 rows of the tibble print to the screen (just like in all the above examples).

# Add code here to define hf_select4.
# Add code here to print the first 100 rows of hf_select4.

To free up some memory, we’re going to remove some of the tibbles we’ve created. This won’t affect how they will print out above. You will never have to do this on your own.

rm(hf_select, hf_select2, hf_select3)

5.5 `filter`

The filter verb works a lot like select, but for rows instead of columns.

For example, let’s say we only want to see Delta flights. We use filter. (We can print all the rows here because the resulting tibble is much smaller!)

hf_filter <- filter(hf, UniqueCarrier == "DL")
hf_filter

In the printout of the tibble above, if you can’t see the UniqueCarrier column, click the black arrow on the right to scroll through the columns until you can see it. You can click “Next” at the bottom to scroll through the rows.

Exercise 4

How many rows did we get in the hf_filter tibble? What do you notice about the UniqueCarrier of all those rows?

Please write up your answer here.

Just like select, the first argument of filter is the name of the tibble. Following that, you must specify some condition. Only rows meeting that condition will be included in the output.

One thing that is unusual here is the double equal sign (UniqueCarrier == "DL"). This won’t be a mystery to people with programming experience, but it tends to be a sticking point for the rest of us. A single equals sign represents assignment. If I type x = 3, what I mean is, “Take the letter x and assign it the value 3.” In R, we would also write x <- 3 to mean the same thing. The first line of the code chunk below assigns x to be 3. Therefore, the following line that just says x creates the output “3”.

x = 3
x

[1] 3

On the other hand, x == 3 means something completely different. This is a logical statement that is either true or false. Either x is 3, in which case we get TRUE or x is not 3, and we get FALSE.

x == 3

[1] TRUE

(It’s true because we just assigned x to be 3 in the previous code chunk!)

In the above filter command, we are saying, “Give us the rows where the value of UniqueCarrier is "DL", or, in other words, where the statement UniqueCarrier == "DL" is true.

As another example, suppose we wanted to find out all flights that leave before 6:00 a.m. Due to the interesting way that times are stored in this data, we can do something like the following:

hf_filter2 <- filter(hf, DepTime < 600)
hf_filter2

Exercise 5

Look at the help file for hflights again (linked in Exercise 1). Why do we have to use the number 600 in the command above? (Read the description of the DepTime variable.)

Please write up your answer here.

If we need two or more conditions, we use & for “and” and | for “or”. The following will give us only the Delta flights that departed before 6:00 a.m.

hf_filter3 <- filter(hf, UniqueCarrier == "DL" & DepTime < 600)
hf_filter3

Again, check the cheat sheet for more complicated condition-checking if needed.

Exercise 6(a)

The symbol != means “not equal to” in R. Use the filter command to create a tibble called hf_filter4 that finds all flights except those flying into Salt Lake City (“SLC”). As before, print the first 100 rows to the screen.

# Add code here to define hf_filter4.
# Add code here to print the first 100 rows of hf_filter4.

Exercise 6(b)

How many flights were there flying into SLC? In other words, how many rows were removed from the original hf tibble to produce hf_filter4?

Be careful! You only printed 100 rows to the screen, but that doesn’t represent all the filtered data. How can you figure out how many rows are actually left in hf_filter4?

Please write up your answer here.

The slice command is related, but fairly useless in practice. It will allow you to extract rows by position. So slice(hf, 1:10) will give you the first 10 rows. As a general rule, the information available in a tibble should never depend on the order in which the rows appear. Therefore, no function you run should make any assumptions about the ordering of your data. The only reason one might want to think about the order of data is for convenience in presenting that data visually for someone to inspect. And that brings us to…

5.6 `arrange`

This just re-orders the rows, sorting on the values of one or more specified columns. As we mentioned before, in most data analyses you work with summaries of the data that do not depend on the order of the rows, so this is not quite as interesting as some of the other verbs. In fact, since the re-ordering is usually for the visual benefit of the reader, there is often no need to store the output in a new variable. We’ll just print the output to the screen (well, the first 100 rows anyway).

head(arrange(hf, ActualElapsedTime), n = 100)

Scroll over to the ActualElapsedTime variable in the output above (using the black right arrow) to see that these are now sorted in ascending order.

Exercise 7

How long is the shortest actual elapsed time? Why is this flight so short? (Hint: look at the destination.) Which airline flies that route? You may have to use your best friend Google to look up airport and airline codes.

Please write up your answer here.

If you want descending order, do this:

head(arrange(hf, desc(ActualElapsedTime)), n = 100)

Exercise 8

How long is the longest actual elapsed time? Why is this flight so long? Which airline flies that route? Again, you may have to use your best friend Google to look up airport and airline codes.

Please write up your answer here.

Exercise 9(a)

You can sort by multiple columns. The first column listed will be the first in the sort order, and then within each level of that first variable, the next column will be sorted, etc. Print a tibble that sorts first by destination (Dest) and then by arrival time (ArrTime), both in the default ascending order. Print only the first 100 rows as above.

# Add code here to sort hf first by Dest and then by ArrTime.

Exercise 9(b)

Based on the output of the previous part, what is the first airport code alphabetically and to what city does it correspond? (Use Google if you need to link the airport code to a city name.) At what time did the earliest flight to that city arrive?

Please write up your answer here.

5.7 `mutate`

Frequently, we want to create new variables that combine information from one or more existing variables. We use mutate for this. For example, suppose we wanted to find the total time of the flight. We might do this by adding up the minutes from several variables: TaxiOut, AirTime, and TaxiIn, and assigning that sum to a new variable called total. Scroll all the way to the right in the output below (using the black right arrow) to see the new total variable.

hf_mutate <- mutate(hf, total = TaxiOut + AirTime + TaxiIn)
head(hf_mutate, n = 100)

As it turns out, that was wasted effort because that variable already exists in ActualElapsedTime. The all.equal command below checks that both specified columns contain the exact same values.

all.equal(hf_mutate$total, hf$ActualElapsedTime)

[1] TRUE

Perhaps we want a variable that just classifies a flight as arriving late or not. Scroll all the way to the right in the output below to see the new late variable.

hf_mutate2 <- mutate(hf, late = (ArrDelay > 0))
head(hf_mutate2, n = 100)

This one is a little tricky. Keep in mind that ArrDelay > 0 is a logical condition that is either true or false, so that truth value is what is recorded in the late variable. If the arrival delay is a positive number of minutes, the flight is considered “late”, and if the arrival delay is zero or negative, it’s not late.

If we would rather see more descriptive words than TRUE or FALSE, we have to do something even more tricky. Look at the late variable in the output below.

hf_mutate3 <- mutate(hf,
                     late = as_factor(ifelse(ArrDelay > 0, 
                                             "Late", "On time")))
head(hf_mutate3, n = 100)

The as_factor command tells R that late should be a categorical variable. Without it, the variable would be a “character” variable, meaning a list of character strings. It won’t matter for us here, but in any future analysis, you want categorical data to be treated as such by R.

The main focus here is on the ifelse construction. The ifelse function takes a condition as its first argument. If the condition is true, it returns the value in the second slot, and if it’s false (the “else” part of if/else), it returns the value in the third slot. In other words, if ArrDelay > 0, this means the flight is late, so the new late variable should say “Late”; whereas, if ArrDelay is not greater than zero (so either zero or possibly negative if the flight arrived early), then the new variable should say “On Time”.

Having said that, I would generally recommend that you leave these kinds of variables as logical types. It’s much easier to summarize such variables in R, namely because R treats TRUE as 1 and FALSE as 0, allowing us to do things like this:

mean(hf_mutate2$late, na.rm = TRUE)

[1] 0.4761522

This gives us the proportion of late flights.

Note that we needed na.rm as you’ve seen in previous chapter. For example, look at the 93rd row of the tibble:

slice(hf_mutate2, 93)

Notice that all the times are missing. There are a bunch of rows like this. Since there is not always an arrival delay listed, the ArrDelay variable doesn’t always have a value, and if ArrDelay is NA, the late variable will be too. So if we try to calculate the mean with just the mean command, this happens:

mean(hf_mutate2$late)

[1] NA

Exercise 10

Why does taking the mean of a bunch of zeros and ones give us the proportion of ones? (Think about the formula for the mean. What happens when we take the sum of all the zeros and ones, and what happens when we divide by the total?)

Please write up your answer here.

Exercise 11

Create a new tibble called hf_mutate4 that uses the mutate command to create a new variable called dist_k which measures the flight distance in kilometers instead of miles. (Hint: to get from miles to kilometers, multiply the distance by 1.60934.) Print the first 100 rows of the output to the screen.

# Add code here to define hf_mutate4.
# Add code here to print the first 100 rows of hf_mutate4.

A related verb is transmute. The only difference between mutate and transmute is that mutate creates the new column(s) and keeps all the old ones too, whereas transmute will throw away all the columns except the newly created ones. This is not something that you generally want to do, but there are exceptions. For example, if we were preparing a report and we needed only to talk about flights being late or not, it would do no harm (and would save some memory) to throw away everything except the late variable.

To free up some memory, we’re going to remove some of the tibbles we’ve created. This won’t affect how they will print out above. You will never have to do this on your own.

rm(hf_mutate, hf_mutate2, hf_mutate3)

5.8 `summarise` (with `group_by`)

First, before you mention that summarise is spelled wrong…well, the author of the dplyr package is named Hadley Wickham (same author as the ggplot2 package) and he is from New Zealand. So that’s the way he spells it. He was nice enough to include the summarize function as an alias if you need to use it ’cause this is ’Murica!

The summarise function, by itself, is kind of boring, and doesn’t do anything that couldn’t be done more easily with base R functions.

summarise(hf, mean(Distance))

mean(hf$Distance)

[1] 790.5861

Where summarise shines is in combination with group_by. For example, let’s suppose that we want to see average flight distances, but broken down by airline. We can do the following:

hf_summ_grouped <- group_by(hf, UniqueCarrier)
hf_summ <- summarise(hf_summ_grouped, mean(Distance))
hf_summ

Also note that we can give summary columns a new name if we wish. In hf_summ, we didn’t give the new column an explicit name, so it showed up in our tibble as a column called mean(Distance). If we want to change it, we can do this:

hf_summ_grouped <- group_by(hf, UniqueCarrier)
hf_summ <- summarise(hf_summ_grouped, mean_dist = mean(Distance))
hf_summ

Look at the earlier version of hf_summ and compare it to the one above. Make sure you see that the name of the second column changed.

To free up some memory, we’re going to remove some of the tibbles we’ve created. This won’t affect how they will print out above. You will never have to do this on your own.

rm(hf_summ_grouped)

5.9 Piping

This is a good spot to introduce a time-saving and helpful device called “piping”, denoted by the symbol |>. We’ve seen this weird combination of symbols in past chapters, but we haven’t really explained what they do.

Piping always looks more complicated than it really is. The technical definition is that

x |> f(y)

is equivalent to

f(x, y).

As a simple example, we could add two numbers like this:

sum(2, 3)

[1] 5

Or using the pipe, we could do it like this:

2 |> sum(3)

[1] 5

All this is really saying is that the pipe takes the thing on its left, and plugs it into the first slot of the function on its right. So why do we care?

Let’s revisit the combination group_by/summarise example above. There are two ways to do this without pipes, and both are a little ugly. One way is above, where you have to keep reassigning the output to new variables (in the case above, to hf_summ_grouped and then hf_summ). The other way is to nest the functions:

summarise(group_by(hf, UniqueCarrier), mean_dist = mean(Distance))

This requires a lot of brain power to parse. In part, this is because the function is inside-out: first you group hf by UniqueCarrier, and then the result of that is summarized. Here’s how the pipe fixes it:

hf |>
  group_by(UniqueCarrier) |>
  summarise(mean_dist = mean(Distance))

Look at the group_by line. The group_by function would normally take two arguments: the tibble, and then the grouping variable. But it appears to have only one argument. Now look at the previous line. The pipe says to insert whatever is on its left (hf) into the first slot of the function on its right (group_by). So the net effect is still to evaluate the function group_by(hf, UniqueCarrier).

Now look at the summarise line. Again, summarise is a function of two inputs, but all we see is the part that finds the mean. The pipe at the end of the previous line tells the summarise function to insert the stuff already computed (the grouped tibble returned by group_by(hf, UniqueCarrier)) into the first slot of the summarise function.

Piping takes a little getting used to, but once you’re good at it, you’ll never go back. It’s just makes more sense semantically. When we read the above set of commands, we see a set of instructions in chronological order:

Start with the tibble hf.
Next, group by the carrier.
Next, summarize each group using the mean distance.

Now we can assign the result of all that to the new variable hf_summ:

hf_summ <- hf |>
  group_by(UniqueCarrier) |>
  summarise(mean_dist = mean(Distance))
hf_summ

Some people even take this one step further. The result of all the above is assigned to a new variable hf_summ that currently appears as the first command (hf_summ <- ...) But you could write this as

hf |>
  group_by(UniqueCarrier) |>
  summarise(mean_dist = mean(Distance)) -> hf_summ

Now it says the following:

Start with the tibble hf.
Next, group by the carrier.
Next, summarize each group using the mean distance.
Finally, assign the result to a new variable called hf_summ.

(As we’ve seen in a previous chapter, the arrow operator for assignment works both directions!)

5.10 `count`

Let’s try some counting. Counting is a form of summarizing data by groups. But this is a common enough task that dplyr doesn’t make us use group_by and summarise. It provides a count command instead. What if we wanted to know how many flights correspond to each carrier?

hf_summ2 <- hf |>
  count(UniqueCarrier)
hf_summ2

The new count column of hf_summ2 is just called n. That’s okay, but if we insist on giving it a more user-friendly name, we can do so as follows:

hf_summ2 <- hf |>
  count(UniqueCarrier, name = "total_count")
hf_summ2

This is a little different because it requires us to use a name argument and put the new name in quotes.

Exercise 12(a)

Create a tibble called hf_summ3 that lists the total count of flights for each day of the week. Be sure to use the pipe as above. Print the output to the screen. (You don’t need to give the count column a new name.)

# Add code here to define hf_summ3.
# Add code here to print hf_summ3.

Exercise 12(b)

According to the output in the previous part, what day of the week had the fewest flights? (Assume 1 = Monday.)

Please write up your answer here.

5.11 Putting it all together

Often we need more than one of these verbs. In many data analyses, we need to do a sequence of operations to get at the answer we seek. This is most easily accomplished using a more complicated sequence of pipes.

Here’s an example of multi-step piping. Let’s say that we only care about Delta flights, and even then, we only want to know about the month of the flight and the departure delay. From there, we wish to group by month so we can find the maximum departure delay by month. Here is a solution, piping hot and ready to go. [groan]

hf_grand_finale <- hf |>
  filter(UniqueCarrier == "DL") |>
  dplyr::select(Month, DepDelay) |>
  group_by(Month) |>
  summarise(max_delay = max(DepDelay, na.rm = TRUE))
hf_grand_finale

Go through each line of code carefully and translate it into English:

We define a variable called hf_grand_finale that starts with the original hf data.
We filter this data so that only Delta flights will be analyzed.
We select the variables Month and DepDelay, throwing away all other variables that are not of interest to us. (Don’t forget to use the dplyr::select syntax to make sure we get the right function!)
We group_by month so that the results will be displayed by month.
We summarise each month by listing the maximum value of DepDelay that appears within each month.
We print the result to the screen.

Notice in the summarise line, we again took advantage of dplyr’s ability to rename any variable along the way, assigning our computation to the new variable max_delay. Also note the need for na.rm = TRUE so that the max command ignores any missing values.

A minor simplification results from the realization that summarise must throw away any variables it doesn’t need. (Think about why for a second: what would summarise do with, say, ArrTime if we’ve only asked it to calculate the maximum value of DepDelay for each month?) So you could write this instead, removing the select clause:

hf_grand_finale <- hf |>
  filter(UniqueCarrier == "DL") |>
  group_by(Month) |>
  summarise(max_delay = max(DepDelay, na.rm = TRUE))
hf_grand_finale

Check that you get the same result. With massive data sets, it’s possible that the selection and sequence of these verbs matter, but you don’t see an appreciable difference here, even with 22758 rows. (There are ways of benchmarking performance in R, but that is a more advanced topic.)

Exercise 13

Summarize in your own words what information is contained in the hf_grand_finale tibble. In other words, what are the numbers in the max_delay column telling us? Be specific!

Please write up your answer here.

The remaining exercises are probably the most challenging you’ve seen so far in the course. Take each slowly. Read the instructions carefully. Go back through the chapter and identify which “verb” needs to be used for each part of the task. Examine the sample code in those sections carefully to make sure you get the syntax right. Create a sequence of “pipes” to do each task, one-by-one. Copy and paste pieces of code from earlier and make minor changes to adapt the code to the given task.

Exercise 14

Create a tibble named hf_ex14 that counts the flights to LAX grouped by day of the week. (Hint: you need to filter to get flights to LAX. Then you’ll need to count using DayOfWeek as a grouping variable. Because you’re using count, you don’t need group_by or summarise.) Print the output to the screen.

# Add code here to count the flights to LAX
# grouped by day of the week.
# Print hf_ex14 to the screen.

Exercise 15

Create a tibble named hf_ex15 that finds the median distance flight for each airline. Sort the resulting tibble from highest distance to lowest. (Hint: You’ll need to group_by carrier and summarise using the median function. Finally, you’ll need to arrange the result according to the median distance variable that you just created.) Print the output to the screen.

# Add code here to find the median distance by airline.
# Print hf_ex15 to the screen.

5.12 Conclusion

Raw data often doesn’t come in the right form for us to run our analyses. The dplyr verbs are powerful tools for manipulating tibbles until they are in the right form.

5.12.1 Preparing and submitting your assignment

From the “Run” menu, select “Restart R and Run All Chunks”.
Deal with any code errors that crop up. Repeat steps 1–2 until there are no more code errors.
Spell check your document by clicking the icon with “ABC” and a check mark.
Hit the “Render” button one last time to generate the final draft of the HTML file. (If there are errors here, you may need to go back and fix broken inline code or other markdown issues.)
Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1–5 again.

If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.

5.1 Introduction

5.1.1 Install new packages

5.1.2 Download the Quarto file

5.1.3 Restart R and run all chunks

5.1.4 Load packages

5.2 Importing CSV data

Exercise 1

5.3 Introduction to dplyr

5.4 select

Exercise 2

Exercise 3

5.5 filter

Exercise 4

Exercise 5

Exercise 6(a)

Exercise 6(b)

5.6 arrange

Exercise 7

Exercise 8

Exercise 9(a)

Exercise 9(b)

5.7 mutate

Exercise 10

Exercise 11

5.8 summarise (with group_by)

5.9 Piping

5.10 count

Exercise 12(a)

Exercise 12(b)

5.11 Putting it all together

Exercise 13

Exercise 14

Exercise 15

5.12 Conclusion

5.12.1 Preparing and submitting your assignment

5.3 Introduction to `dplyr`

5.4 `select`

5.5 `filter`

5.6 `arrange`

5.7 `mutate`

5.8 `summarise` (with `group_by`)

5.10 `count`