Data Manipulation with R Package dplyr

class: center, middle, inverse, title-slide

.title[
# Data Manipulation with R Package dplyr
]
.author[
### <br>Ying-Ju Tessa Chen, PhD <br><br> Associate Professor <br> Department of Mathematics<br> University of Dayton <br><br> <a href="https://twitter.com/ju_tessa"><svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> <span class="citation">@ying-ju</span></a> <br> <a href="https://github.com/ying-ju/"><svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> ying-ju</a> <br> <a href="mailto:ychen4@udayton.edu"><svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"></path></svg> ychen4@udayton.edu</a><br>
]

---

## Learning Objectives

In this session, we will talk about data manipulation using R package .red[dplyr]. This lecture is based on [R for Data Science](https://r4ds.had.co.nz/).

The R package .red[dplyr] is included in the package .red[tidyverse], which contains a collection of R packages that help us do data manipulation and management. The key packages in tidyverse are:

- dplyr: data manipulation
  - ggplot2: data visualization 
  - purr: functional programming toolkit
  - readr: read data and write files
  - tibble: simple data frame
  - tidyr: data management
 
 
---
## Core Functions in dplyr

In this session, we will focus on the following key functions in .red[dplyr] using the dataset **flights** from the R package .red[nycflights13].

- filter(): pick observations by their values
  - arrange(): reorder the rows
  - select(): select variables by their names
  - mutate(): create new variables with functions of existing variables 
  - group_by(): group data by existing variables
  - summarize(): collapse many values done to a single summary (with group_by)

All functions above work similarly.

1. The first argument is a data frame. 
2. The subsequent arguments describe what to do with the data frame using the variable names.
3. The result is a new data frame (but we can save it back to the original data frame if needed).

---
## Load packages and read the Flights Data

First, we load the necessary packages, check conflict functions, and import the dataset **flights** from the R package .red[nycflights13].

``` r
library(tidyverse)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
df <- nycflights13::flights
```

---
## Understand Our Data

Now we need to understand the data and each variable before we move on. This dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 and there are 19 variables ([Flights Data](https://rdrr.io/cran/nycflights13/man/flights.html)). 
.small[

- year, month, day: Date of departure.
- dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local time zone.
- sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local time zone.
- dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- carrier: Two letter carrier abbreviation. See airlines to get name.
- flight: Flight number.
- tailnum: Plane tail number. See planes for additional metadata.
- origin, dest: Origin and destination. See airports for additional metadata.
- air_time: Amount of time spent in the air, in minutes.
- distance: Distance between airports, in miles.
- hour, minute: Time of scheduled departure broken into hour and minutes.
- time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.
]

---
## Get a Glimpse of the Data

We get a glimpse of the data.

``` r
glimpse(df)
```

```
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
```

---
## filter() - Filtering Observations
The .blue[filter()] function is used to subset observations based on logical conditions. Here are a few examples showcasing common comparisons and logical operations:

.pull-left[

``` r
# Select flights on December 25th
filter(df, month == 12, day == 25)
```

```
## # A tibble: 719 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    25      456            500        -4      649            651
##  2  2013    12    25      524            515         9      805            814
##  3  2013    12    25      542            540         2      832            850
##  4  2013    12    25      546            550        -4     1022           1027
##  5  2013    12    25      556            600        -4      730            745
##  6  2013    12    25      557            600        -3      743            752
##  7  2013    12    25      557            600        -3      818            831
##  8  2013    12    25      559            600        -1      855            856
##  9  2013    12    25      559            600        -1      849            855
## 10  2013    12    25      600            600         0      850            846
## # ℹ 709 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```
]
.pull-right[

``` r
# Filter for flights in July or August
filter(df, month %in% c(7, 8))

# Filter for flights not delayed by more than 1 hour
filter(df, arr_delay <= 60, dep_delay <= 60)
```
]

---
### filter() - Filtering Observations (Continued)

.orange[Logical Operations] - R provides the following syntax: & is "and", | is "or", ! is "not". The following code finds all flights that departed in July or August.

``` r
flights0708 <- filter(df, month == 7 | month == 8)
# only shows the first 5 rows and 10 columns in the data
head(filter(df, month %in% c(7, 8)), c(5, 10))
```

```
## # A tibble: 5 × 10
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     7     1        1           2029       212      236           2359
## 2  2013     7     1        2           2359         3      344            344
## 3  2013     7     1       29           2245       104      151              1
## 4  2013     7     1       43           2130       193      322             14
## 5  2013     7     1       44           2150       174      300            100
## # ℹ 2 more variables: arr_delay <dbl>, carrier <chr>
```

**Note:**

1. If we use .blue[filter(df, month == 7 | 8)], it finds all months are equal 7 | 8, an expression that evaluates to **TRUE**. In a numeric context, TRUE becomes one, so this finds all fights in the data. 
2. .blue[filter()] only includes rows where the condition is **TRUE** and it excludes both FALSE and NA values.

---
## arrange() - Ordering Data
The .blue[arrange()] function sorts a dataset by specified variables. Use .blue[desc()] for descending order and note that missing values are sorted last.

.pull-left[

``` r
# Sort flights by date
arrange(df, year, month, day)
```

```
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```
]
.pull-right[

``` r
# Sort by arrival delay in descending order
arrange(df, desc(arr_delay))
```

```
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     7    22     2257            759       898      121           1026
##  9  2013    12     5      756           1700       896     1058           2020
## 10  2013     5     3     1133           2055       878     1250           2215
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```
]

---
### arrange() - Ordering Data (Continued)

**Note:** 
1. We can save the data frame back to the original data frame after sorting the data.
2. Use .blue[desc()] for sorting data via descending order. The following code chunk arranges the Flights Data by arrival delay in descending order.
3. Missing values are always sorted at the end.

``` r
df <- arrange(df, desc(arr_delay))
```

We can use the .blue[tail()] function to read the last part of the data. Below, we display the last 4 rows of 8 columns in the data.

``` r
tail(df[, 1:8], n = 4)
```

```
## # A tibble: 4 × 8
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA           2200        NA       NA           2312
## 2  2013     9    30       NA           1210        NA       NA           1330
## 3  2013     9    30       NA           1159        NA       NA           1344
## 4  2013     9    30       NA            840        NA       NA           1020
```

---
## select()

.blue[select()] is used when we would like to obtain several variables in the data. For example, we can use the following code chunk to select the Flights Data with only a few variables.

``` r
# select specific columns
select(df, year, month, day)

# select all columns between year and day
select(df, year:day)

# select all columns except those from year and day
select(df, -(year:day)) 
```

**Note:**

1. We can use a minus sign - to drop variables. 
2. There are several helper functions we can use within .blue[select()]. See .blue[?select] for the information. 
3. .blue[select()] can be used with the .blue[everything()] function when we have a handful of variables we would like to move to the start of the data frame.

``` r
# move carrier, origin, dest, and distance to the start of the data
select(df, carrier, origin, dest, distance, everything())
```

---
## mutate() - Creating New Variables

The .blue[mutate()] function adds new columns based on calculations using existing variables. Use .blue[transmute()] if only the new variables should be kept.

.pull-left[

``` r
# Add variables based on existing columns
df <- mutate(df,
    gain = arr_delay - dep_delay,
    speed = distance / air_time * 60)
print(df[1:6,18:21])
```

```
## # A tibble: 6 × 4
##   minute time_hour            gain speed
##    <dbl> <dttm>              <dbl> <dbl>
## 1      0 2013-01-09 09:00:00   -29  467.
## 2     35 2013-06-15 19:00:00   -10  392.
## 3     35 2013-01-10 16:00:00   -17  389.
## 4     45 2013-09-20 18:00:00    -7  438.
## 5      0 2013-07-22 16:00:00   -16  368.
## 6      0 2013-04-10 19:00:00   -29  434.
```
]

.pull-right[

``` r
# Use transmute to keep only the new variables
transmute(df, gain, speed)
```

```
## # A tibble: 336,776 × 2
##     gain speed
##    <dbl> <dbl>
##  1   -29  467.
##  2   -10  392.
##  3   -17  389.
##  4    -7  438.
##  5   -16  368.
##  6   -29  434.
##  7     4  366.
##  8    -3  419.
##  9   -18  437.
## 10    -3  385.
## # ℹ 336,766 more rows
```
]

**Note:** There are many functions for creating new variables that we can use with .blue[mutate()]. The key property is that the function must be vectorized, which means it must take a vector of values as input and returns a vector with the same number of values as output.

---
### summarize() and group_by() - Summarizing Data

In general, .blue[summarize()] function is used together with .blue[group_by()] as we group rows for some purposes. .blue[group_by()] is used to group rows by one or more variables, giving priority to the variable entered first.

``` r
group_by(df, year, month, day)
```

```
## # A tibble: 336,776 × 21
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     7    22     2257            759       898      121           1026
##  9  2013    12     5      756           1700       896     1058           2020
## 10  2013     5     3     1133           2055       878     1250           2215
## # ℹ 336,766 more rows
## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>
```

The result shows the original data but indicates groups: year, month, day, in our example.

---
### summarize() and group_by() - Summarizing Data (Continued)

For example, we can study the average departure / arrival delays for each day.

``` r
by_day <- group_by(df, year, month, day) 
summarize(by_day, 
          ave_dep_delay = mean(dep_delay, na.rm = T),
          ave_arr_delay = mean(arr_delay, na.rm = T)
          )
```

```
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day ave_dep_delay ave_arr_delay
##    <int> <int> <int>         <dbl>         <dbl>
##  1  2013     1     1         11.5         12.7  
##  2  2013     1     2         13.9         12.7  
##  3  2013     1     3         11.0          5.73 
##  4  2013     1     4          8.95        -1.93 
##  5  2013     1     5          5.73        -1.53 
##  6  2013     1     6          7.15         4.24 
##  7  2013     1     7          5.42        -4.95 
##  8  2013     1     8          2.55        -3.23 
##  9  2013     1     9          2.28        -0.264
## 10  2013     1    10          2.84        -5.90 
## # ℹ 355 more rows
```

---
## Using Pipes (%>% or |>) to Combine Operations

In other to handle the data processing well in data science, it is essential to know the use of pipes. Pipes are great tool for presenting a sequence of multiple operations and therefore, pipes increase readability of the code. The pipe, **|>**, is from the package .red[magrittr] and it is loaded automatically when tidyverse is loaded.

The logic when using pipe: .orange[object |> function1 |> function 2....]

.pull-left[
If we want to group the Flights Data by the destination and then find the number of flights, the average distance, the average arrival delay at each destination, and filter to remove Honolulu airport (HNL), we may use the following code chunk to achieve this.

``` r
by_dest <- group_by(df, dest)
delay <- summarize(by_dest,
                   count = n(),
                   ave_dist = mean(distance, na.rm=T),
                   ave_arr_delay = mean(arr_delay, na.rm=T))
delay <- filter(delay, count > 20, dest != "HNL")
```

]
.pull-right[
The following code chunk does the same task with the pipe, |> and it makes the code easier to read.

``` r
delay <- df |> 
  group_by(dest) |>
  summarize(
    count = n(),
    ave_dist = mean(distance, na.rm=T),
    ave_arr_delay = mean(arr_delay, na.rm=T)
    ) |>
  filter(count > 20, dest != "HNL")
```
]

---
## Helper Functions

Here we introduce some helper functions.

### Counting & Unique Rows

We can use .blue[count()] to count the observations in each group and .blue[distinct()] to keep distinct/unique rows. 
.small[
.pull-left[

``` r
# count: Quick way to summarize counts
count(df, dest)
```

```
## # A tibble: 105 × 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     265
##  3 ALB     439
##  4 ANC       8
##  5 ATL   17215
##  6 AUS    2439
##  7 AVL     275
##  8 BDL     443
##  9 BGR     375
## 10 BHM     297
## # ℹ 95 more rows
```
]

.pull-right[

``` r
# distinct: Remove duplicate rows
distinct(df, carrier)
```

```
## # A tibble: 16 × 1
##    carrier
##    <chr>  
##  1 HA     
##  2 MQ     
##  3 AA     
##  4 DL     
##  5 F9     
##  6 9E     
##  7 VX     
##  8 EV     
##  9 FL     
## 10 B6     
## 11 US     
## 12 UA     
## 13 WN     
## 14 YV     
## 15 AS     
## 16 OO
```
]
]

---
### Counting & Unique Rows (Continued)

You have seen .blue[n()], which takes no arguments, and returns the size of the current group. To count the number of non-missing values, we can use .blue[sum(is.na(x))]. To count the number of distinct values, use .blue[n_distinct()].

.pull-left[

``` r
df |> 
  summarize(total_na = sum(is.na(arr_delay)))
```

```
## # A tibble: 1 × 1
##   total_na
##      <int>
## 1     9430
```
]

.pull-right[

``` r
not_cancelled <- df |> 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled |>
  group_by(dest) |>
  summarize(carriers = n_distinct(carrier)) |>
  arrange(desc(carriers))
```

```
## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ℹ 94 more rows
```
]

---
### Counting & Unique Rows (Continued)

.pull-left[
We can use .blue[count()] directly if all we want is a count.

``` r
not_cancelled_count <- not_cancelled |> 
  count(dest)
head(not_cancelled_count)
```

```
## # A tibble: 6 × 2
##   dest      n
##   <chr> <int>
## 1 ABQ     254
## 2 ACK     264
## 3 ALB     418
## 4 ANC       8
## 5 ATL   16837
## 6 AUS    2411
```
]
.pull-right[
We can optionally provide a weight variable. For example we could use this to "count" the total number of miles a plane flew.

``` r
not_cancelled_weight <- not_cancelled |>
  count(tailnum, wt = distance)
head(not_cancelled_weight)
```

```
## # A tibble: 6 × 2
##   tailnum      n
##   <chr>    <dbl>
## 1 D942DN    3418
## 2 N0EGMQ  239143
## 3 N10156  109664
## 4 N102UW   25722
## 5 N103US   24619
## 6 N104UW   24616
```
]

---
### Summary for Numeric Data

**Measures of rank:** .blue[min()], .blue[quantile()], .blue[max()].

``` r
not_cancelled |>
  group_by(year, month, day) |>
  summarize(
    first = min(dep_time), # the first flight departed each day
    last = max(dep_time) # the last flight departed each day
  )
```

```
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # ℹ 355 more rows
```

---
### Summary for Numeric Data (Continued)

**Measure of Counts of logical values:** When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, .blue[sum()] gives the number of TRUEs and .blue[mean()] gives the proportion in the variable. For example, we can check how many flights left before 5AM using the following code chunk.

``` r
not_cancelled |>
  group_by(year, month, day) |>
  summarize(n_early = sum(dep_time < 500))
```

```
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ℹ 355 more rows
```

---
### Summary for Numeric Data (Continued)

**Measure of proportion of logistic values**

So, what proportion of flights are delayed by more than one hour?

``` r
not_cancelled |>
  group_by(year, month, day) |>
  summarize(hour_perc = mean(arr_delay > 60))
```

```
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day hour_perc
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # ℹ 355 more rows
```

---
### Row Selection

**Measures of position:** .blue[slice()], .blue[first()], .blue[nth(x, 2)], .blue[last()].

.pull-left[

The .blue[slice] function can be used to subset rows using their positions.

``` r
slice(df, 1:5)
```

```
## # A tibble: 5 × 21
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     9      641            900      1301     1242           1530
## 2  2013     6    15     1432           1935      1137     1607           2120
## 3  2013     1    10     1121           1635      1126     1239           1810
## 4  2013     9    20     1139           1845      1014     1457           2210
## 5  2013     7    22      845           1600      1005     1044           1815
## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>
```
]
.pull-right[

The following code chunk finds the first and last departure for each day.

``` r
not_cancelled |>
  group_by(year, month, day) |>
  summarize(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
```

```
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       848     1904
##  2  2013     1     2      1607      654
##  3  2013     1     3      2056      941
##  4  2013     1     4      2123     1026
##  5  2013     1     5      1344     1329
##  6  2013     1     6       943      812
##  7  2013     1     7      2021     1623
##  8  2013     1     8      1307     1847
##  9  2013     1     9       641     1842
## 10  2013     1    10      1121     1844
## # ℹ 355 more rows
```
]

---
## Grouping by Multiple Variables

Here we show some examples to demonstrate how to group the data by multiple variables.

``` r
per_day <- df |> 
   group_by(year, month, day) |>
  summarize(flights = n())
per_month <- summarize(per_day, flights = sum(flights))
per_year <- summarize(per_month, flights = sum(flights))

head(per_month, 3)
```

```
## # A tibble: 3 × 3
## # Groups:   year [1]
##    year month flights
##   <int> <int>   <int>
## 1  2013     1   27004
## 2  2013     2   24951
## 3  2013     3   28834
```

``` r
per_year
```

```
## # A tibble: 1 × 2
##    year flights
##   <int>   <int>
## 1  2013  336776
```

---
## Ungrouping

If we need to remove grouping, and return to operations on ungrouped data, use .blue[ungroup()].

``` r
daily <- df |> group_by(year, month, day)
daily |> 
  ungroup() |> # no longer grouped by date
  summarize(flights=n()) # all flights
```

```
## # A tibble: 1 × 1
##   flights
##     <int>
## 1  336776
```

---
## Conditional / Replacement Functions

- .blue[if_else()]: conditional values

- **Usage**: .blue[if_else(condition, true, false, ...)]
  
- .blue[drop_na()]: drops rows contain a missing value.

- **Usage**: .blue[drop_na(data, ...)]

- .blue[rename()]: renames columns

- **Usage**: .blue[rename(.data, ...)]

- .blue[recode()]: replaces numeric values based on their position, and replaces character values by their name. Usage: .orange[.default], if supplied, gives all cases not matched a new value. .orange[.missing] replaces the missing values by the specified new value.

- **Usage**: .blue[recode(.x, ..., .default = NULL, .missing = NULL)]

- .blue[case_when()]: a general vectorised if

- **Usage**: .blue[case_when(...)]

---
## if_else() function

The .blue[ifelse()] function allows us to perform element-wise conditional statements on vectors or data frames, and the .blue[if_else()] function is a vectorized if ... else statement.

.pull-left[
The following code chunk shows how we use the traditional if ... else statement to obtain a vector that records if the flight was delayed on its arrival.

``` r
# Create a new column for the results
df$is_delayed <- NA

for (i in seq_along(df$arr_delay)) {
  if (df$arr_delay[i] > 0) {
    df$is_delayed[i] <- TRUE
  } else {
    df$is_delayed[i] <- FALSE
  }
}
```
]

.pull-right[
The following code chunks show the usage of the .blue[ifelse()] and .blue[if_else()] functions.

``` r
df$is_delayed <- ifelse(df$arr_delay>0, TRUE, FALSE)
```

``` r
# if_else: Simple conditional assignment
df <- mutate(df, 
             is_delayed = if_else(arr_delay > 0, 
                                  TRUE, FALSE))
```
]

**Note:** This function always considers TRUE, FALSE, and missing when deciding what the output type should be, and it lets us handle missing values in the condition with missing, unlike its base R version, ifelse().

---
## drop_na() function

We can use the .blue[drop_na()] function to drop rows containing missing values.

``` r
nrow(df)
```

```
## [1] 336776
```

``` r
df <- df |> 
  drop_na()
nrow(df)
```

```
## [1] 327346
```

---
## rename() function

We can use the .blue[rename()] function to rename column names. This function makes the task much easier since we can rename a variable directly without knowing which column that variable is.

``` r
df <- df |> rename(
  DepTime = dep_time,
  SchedDepTime = sched_dep_time, 
  DepDelay = dep_delay,
  ArrTime = arr_time, 
  SchedArrTime = sched_arr_time,
  ArrDelay = arr_delay
)

colnames(df)
```

```
##  [1] "year"         "month"        "day"          "DepTime"      "SchedDepTime"
##  [6] "DepDelay"     "ArrTime"      "SchedArrTime" "ArrDelay"     "carrier"     
## [11] "flight"       "tailnum"      "origin"       "dest"         "air_time"    
## [16] "distance"     "hour"         "minute"       "time_hour"    "gain"        
## [21] "speed"        "is_delayed"
```

---
## recode() function

We might want to recode specific values in an R data frame sometimes. The .blue[recode()(] function from the .red[dplyr package] makes this simple to accomplish.

``` r
df$month <- recode(df$month, 
                   `1` = "Jan",
                   `2` = "Feb",
                   `3` = "Mar",
                   `4` = "Apr")
unique(df$month)
```

```
## [1] "Jan" NA    "Apr" "Mar" "Feb"
```

We should pay attention to the data type of a variable. In the example above, the original data type is integer and it becomes character after we recode some values. This is due to unreplaced values treated as NA as `.x` is not compatible.

``` r
x  <- c(12, 31, 0, 34, 8, 5, 7, 16, 8)
x <- x |> recode(`0` = 1314)
x
```

```
## [1]   12   31 1314   34    8    5    7   16    8
```

---
## case_when() function

We can use .blue[case_when()(] function to create a new variable with the .blue[mutate()(] function together.

``` r
df <- df |> mutate(arr_status = case_when(
  ArrDelay > 60 ~ "LongDelay",
  ArrDelay > 0 & ArrDelay <= 60 ~ "Delay",
  ArrDelay == 0  ~ "OnTime",
  ArrDelay < 0 ~ "NotDelay",
  is.na(ArrDelay) ~ "Unknown"
))

count(df, arr_status)
```

```
## # A tibble: 4 × 2
##   arr_status      n
##   <chr>       <int>
## 1 Delay      105215
## 2 LongDelay   27789
## 3 NotDelay   188933
## 4 OnTime       5409
```

---
# Wrap Up

By now, you should know how to utilize basic functions in the R package .red[dplyr] for data manipulation and analysis.

.pull-left[
**Core Functions**
- `filter()`: subset data
- `arrange()`: order rows
- `select()`: choose columns
- `mutate()`: create new columns
- `group_by()`: group data
- `summarize()`: aggregate data
]

.pull-right[
**Helper Functions**
- Counting & Unique Rows: `count()`, `distinct()`, `n()`, `n_distinct()`
- Summary for Numeric Data: `min()`, `quantile()`, `max()`, `sum()`, `mean()`
- Row Selection: `slice()`, `first()`, `nth()`, `last()`

**Conditional/Replacement Functions**
- `if_else()`: conditional values
- `drop_na()`: remove missing data
- `rename()`: rename columns
- `recode()`: replace values
- `case_when()`: complex conditions
]

---
## In-Class Activity

You can download the worksheet [here](https://bitbucket.org/tessachen/tessachen.bitbucket.io/src/main/worksheets/dplyr_activity.docx).

<br>

## Thank you!

.pull-left-2[
Please do not hesitate to contact me (Tessa Chen) at <a href="mailto:ychen4@udayton.edu"><i class="fa fa-paper-planee fa-fw"></i?&nbsp: ychen4@udayton.edu</a> for questions or further discussions.
]
.pull-right-2[
<img src="data:image/png;base64,#../figs/Tessa_grey_G.gif" width="50%" style="display: block; margin: auto;" />
]