URISE Workshop: Data Wrangling

In this tutorial, we will introduce some of the basic tools used in data wrangling.

Make sure to click “Run Code” in the interactive code chunks.

NOTE: YOU CAN RUN JUST A PORTION OF ANY CODE CHUNK BY HIGHLIGHTING IT AND USING THE FOLLOWING KEYBOARD SHORTCUTS: PCS - (ctrl + Enter); MACs - (cmd + Enter).

Before we get started, some R basics.

Up until now, we have been focusing on gaining enough R coding skills to be able to demonstrate the insights that can be gained from data chiefly through data visualization. Now, we are transitioning into topics that will require a stronger understanding of certain R functions.

Let’s start by reviewing how we create objects in our R environment:

Even though we are moving to a different code chunk, thing1 and thing2 still exist in our coding environment. That means we can continue to use them.

It is a good idea to store values as objects if you think you are going to need them in the future.

R is made up of packages and functions, which make it easier for us all to do commonly needed actions. This is a double edged sword. Pre-made functions make it quicker to get some things done, but you don’t necessarily know that the code that is behind the function is doing what you think it’s doing.

Here are a few examples of functions you might need:

Calculating basic summary statistics:

Randomly sampling values from a normal distribution:

Navigating matrices and dataframes:

Data Wrangling using dplyr

Note: This tutorial is largely created by Kevin Markham–I added some exercises for students to work through.

dplyr is a package from the Tidyverse that makes data wrangling much more intuitive and fun!

Why even use dplyr?

Great for data exploration and transformation
Intuitive to write and easy to read, especially when using the “chaining” syntax (covered below)
Fast on data frames

dplyr functionality

Five basic verbs: filter, select, arrange, mutate, summarise (plus group_by)
Can work with data stored in databases and data tables
Joins: inner join, left join, semi-join, anti-join (not covered below)
Window functions for calculating ranking, offsets, and more

Our example data set

dplyr will mask a few base functions (namely: filter())
hflights is flights departing from two Houston airports in 2011

as_tibble creates a special tidyverse dataframe called a tibble
tibbles have a number of features, but one of them is that they print more nicely in this type of tutorial environment

Use glimpse() to get a summary output of all the columns:

dplyr::filter: Keep rows matching criteria

Base R approach to filtering forces you to repeat the data frame’s name
dplyr approach is simpler to write and read
Command structure (for all dplyr verbs):
- first argument is a data frame
- return value is a data frame
- nothing is modified in place
Note: dplyr generally does not preserve row names

We can use the “|” for the OR condition:

We can also use the %in% operator to select for specific values in a column:

Now you try.. How would we code it if we wanted to view all flights United Airlines or American Airlines flights that departed after 2PM?

dplyr::select: Pick columns by name

Base R approach is awkward to type and to read
dplyr approach uses similar syntax to filter
Like a SELECT in SQL

We can use the colon (:) to select multiple contiguous columns, and use ‘contains’ to match columns by name.

Note: we can also use ‘starts_with’, ‘ends_with’, and ‘matches’ (for regular expressions) to match columns by name.

Now you try: Select all of the columns that contain information about the date and time of the flight, as well as those that contain information about the flight’s cancellation.

“Chaining” or “Pipelining”

Usual way to perform multiple operations in one line is by nesting
Can write commands in a natural order by using the %>% infix operator (which can be thought of as “and then”)

Chaining increases readability significantly when there are many commands
Operator is automatically imported from the magrittr package
Can be used to replace nesting in R commands outside of dplyr

Now you try: Using the chaining method select the columns indicating the flight carrier, flight number, and departure time, and filter so I see only United Airlines flights that left after 2PM.

dplyr::arrange: Reorder rows

Use desc() to display things in descending order:

Now you try: Using piping select all American Airlines flights that left in the afternoon on January 1, and arrange them based on departure times.

dplyr::mutate: Add new variables

Create new variables that are functions of existing variables

Note that the above code just printed our manipulations, but didn’t store them. To store them, we need to use the assignment arrow, as seen in the code chunk below:

Now you try: Calculate the total delay of each flight based on the Departure delay and Arrival delay, and store the resulting value.

dplyr::summarise: Reduce variables to values

Primarily useful with data that has been grouped by one or more variables
group_by creates the groups that will be operated on
summarise uses the provided aggregation function to summarise each group

Now you try: Calculate the mean delay in arrival times for each month of the year.

Helper function n() counts the number of rows in a group
Helper function n_distinct(vector) counts the number of unique items in that vector

Grouping can sometimes be useful without summarising

Resources

Official dplyr reference manual and vignettes on CRAN: vignettes are well-written and cover many aspects of dplyr
July 2014 webinar about dplyr (and ggvis) by Hadley Wickham and related slides/code: mostly conceptual, with a bit of code
dplyr tutorial by Hadley Wickham at the useR! 2014 conference: excellent, in-depth tutorial with lots of example code (Dropbox link includes slides, code files, and data files)
dplyr GitHub repo and list of releases