URISE Workshop: Data Wrangling
In this tutorial, we will introduce some of the basic tools used in data wrangling.
Make sure to click “Run Code” in the interactive code chunks.
NOTE: YOU CAN RUN JUST A PORTION OF ANY CODE CHUNK BY HIGHLIGHTING IT AND USING THE FOLLOWING KEYBOARD SHORTCUTS: PCS - (ctrl + Enter); MACs - (cmd + Enter).
Before we get started, some R basics.
Up until now, we have been focusing on gaining enough R coding skills to be able to demonstrate the insights that can be gained from data chiefly through data visualization. Now, we are transitioning into topics that will require a stronger understanding of certain R functions.
Let’s start by reviewing how we create objects in our R environment:
Even though we are moving to a different code chunk, thing1 and thing2 still exist in our coding environment. That means we can continue to use them.
It is a good idea to store values as objects if you think you are going to need them in the future.
R is made up of packages and functions, which make it easier for us all to do commonly needed actions. This is a double edged sword. Pre-made functions make it quicker to get some things done, but you don’t necessarily know that the code that is behind the function is doing what you think it’s doing.
Here are a few examples of functions you might need:
Calculating basic summary statistics:
Randomly sampling values from a normal distribution:
Navigating matrices and dataframes:
Data Wrangling using dplyr
Note: This tutorial is largely created by Kevin Markham–I added some exercises for students to work through.
dplyr is a package from the Tidyverse that makes data wrangling much more intuitive and fun!
Why even use dplyr?
- Great for data exploration and transformation
- Intuitive to write and easy to read, especially when using the “chaining” syntax (covered below)
- Fast on data frames
dplyr functionality
Five basic verbs:
filter
,select
,arrange
,mutate
,summarise
(plusgroup_by
)Can work with data stored in databases and data tables
Joins: inner join, left join, semi-join, anti-join (not covered below)
Window functions for calculating ranking, offsets, and more
Our example data set
- dplyr will mask a few base functions (namely: filter())
- hflights is flights departing from two Houston airports in 2011
as_tibble
creates a special tidyverse dataframe called a tibble- tibbles have a number of features, but one of them is that they print more nicely in this type of tutorial environment
Use glimpse() to get a summary output of all the columns:
dplyr::filter: Keep rows matching criteria
- Base R approach to filtering forces you to repeat the data frame’s name
- dplyr approach is simpler to write and read
- Command structure (for all dplyr verbs):
- first argument is a data frame
- return value is a data frame
- nothing is modified in place
- Note: dplyr generally does not preserve row names
We can use the “|” for the OR condition:
We can also use the %in% operator to select for specific values in a column:
Now you try.. How would we code it if we wanted to view all flights United Airlines or American Airlines flights that departed after 2PM?
dplyr::select: Pick columns by name
- Base R approach is awkward to type and to read
- dplyr approach uses similar syntax to
filter
- Like a SELECT in SQL
We can use the colon (:) to select multiple contiguous columns, and use ‘contains’ to match columns by name.
Note: we can also use ‘starts_with’, ‘ends_with’, and ‘matches’ (for regular expressions) to match columns by name.
Now you try: Select all of the columns that contain information about the date and time of the flight, as well as those that contain information about the flight’s cancellation.
“Chaining” or “Pipelining”
- Usual way to perform multiple operations in one line is by nesting
- Can write commands in a natural order by using the
%>%
infix operator (which can be thought of as “and then”)
- Chaining increases readability significantly when there are many commands
- Operator is automatically imported from the magrittr package
- Can be used to replace nesting in R commands outside of dplyr
Now you try: Using the chaining method select the columns indicating the flight carrier, flight number, and departure time, and filter so I see only United Airlines flights that left after 2PM.
dplyr::arrange: Reorder rows
Use desc() to display things in descending order:
Now you try: Using piping select all American Airlines flights that left in the afternoon on January 1, and arrange them based on departure times.
dplyr::mutate: Add new variables
- Create new variables that are functions of existing variables
Note that the above code just printed our manipulations, but didn’t store them. To store them, we need to use the assignment arrow, as seen in the code chunk below:
Now you try: Calculate the total delay of each flight based on the Departure delay and Arrival delay, and store the resulting value.
dplyr::summarise: Reduce variables to values
- Primarily useful with data that has been grouped by one or more variables
group_by
creates the groups that will be operated onsummarise
uses the provided aggregation function to summarise each group
Now you try: Calculate the mean delay in arrival times for each month of the year.
- Helper function
n()
counts the number of rows in a group - Helper function
n_distinct(vector)
counts the number of unique items in that vector
- Grouping can sometimes be useful without summarising
Resources
Official dplyr reference manual and vignettes on CRAN: vignettes are well-written and cover many aspects of dplyr
July 2014 webinar about dplyr (and ggvis) by Hadley Wickham and related slides/code: mostly conceptual, with a bit of code
dplyr tutorial by Hadley Wickham at the useR! 2014 conference: excellent, in-depth tutorial with lots of example code (Dropbox link includes slides, code files, and data files)