Tidyverse & dplyr
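Tidyverse is a collection of data science packages for R. It is not part of base R, so it needs to be installed and then loaded.
dplyr (“dee-plier”) is one of the core tidyverse packages, so it is installed with tidyverse and attached by library(tidyverse). It can also be installed and loaded on its own.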
To use these libraries:
library(tidyverse)
library(dplyr)
For this document I’ll be using the gapminder dataset, which is loaded with library(gapminder). If the package isn’t installed, install it first with install.packages("gapminder") and then load it with the library() function.
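The install-then-load step can be written so that it only installs when needed. This is a minimal sketch, assuming a CRAN mirror is already configured for the session; requireNamespace() is base R and simply reports whether the package is available.

```r
# Install gapminder only if it is not already available, then load it.
if (!requireNamespace("gapminder", quietly = TRUE)) {
  install.packages("gapminder")
}
library(gapminder)

# Quick look at the data: one row per country per year.
head(gapminder)
```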
Verbs (dplyr verbs)
Verbs are the steps to transform data using the dplyr package.
Verbs (or steps in an operation) are added after pipes. Pipes are represented with this notation: %>%
This is similar in spirit to the | pipe in a Unix shell.
The idea of a pipe is to take what’s on the left and pass it as the first argument to the function (verb) that follows.
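To make the pipe idea concrete, here is a small sketch comparing a piped call with the same call written directly. The names piped and direct are just illustrative; head() is used because it is a simple base R function.

```r
library(dplyr)
library(gapminder)

# The pipe passes the left-hand side as the first argument
# of the function on the right, so these two calls do the same thing.
piped  <- gapminder %>% head(3)
direct <- head(gapminder, 3)

identical(piped, direct)
```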
With the filter verb, we use the filter() function, which works similarly to the subset() function described in the R Basics notes.
library(gapminder)
gapminder %>% filter(year == 2007)
In the above example, we are subsetting the gapminder dataframe, keeping only the rows where the variable year equals 2007.
We could have used an operator like > or <, such as filter(year > 2005).
The original dataframe isn’t modified; filter() returns a new dataframe built from the original.
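A quick way to see that the original is left alone is to store the filtered result under a new name and compare. This is only a sketch; gap_2007 and rows_before are hypothetical names chosen for illustration.

```r
library(dplyr)
library(gapminder)

rows_before <- nrow(gapminder)

# Keep the filtered rows in a new object; gapminder itself is not changed.
gap_2007 <- gapminder %>% filter(year == 2007)

nrow(gapminder) == rows_before  # the original still has all of its rows
all(gap_2007$year == 2007)      # the new dataframe only has 2007 rows
```

To keep a filtered result around, it has to be assigned like this; otherwise it is only printed.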
gapminder %>% filter(year > 2003, country == "United States")
We can further subset quite easily, by passing in multiple conditions. In the above example, I use year and country, where the year is greater than 2003 and the country is United States.
Each condition is passed to filter() as a separate argument, and a row is kept only when all of the conditions are true.
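Since comma-separated conditions all have to hold, the same filter can be written with the & operator. The sketch below compares the two forms; with_comma and with_and are illustrative names.

```r
library(dplyr)
library(gapminder)

# Conditions separated by commas must all be true for a row to be kept.
with_comma <- gapminder %>% filter(year > 2003, country == "United States")

# The same filter written as a single condition using & (logical AND).
with_and <- gapminder %>% filter(year > 2003 & country == "United States")

isTRUE(all.equal(with_comma, with_and))
```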
The arrange() verb sorts the rows of a dataframe, much like indexing a dataframe with the result of the order() function.
gapminder %>% arrange(gdpPercap)
This dplyr verb sorts rows by the given variable in ascending order by default. Wrapping the variable in desc() sorts in descending order.
gapminder %>% arrange(desc(gdpPercap))
Above, I’ve arranged this again by gdpPercap, but this time it’s arranged in descending order.
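For comparison with base R, here is a rough sketch of the same ascending sort done with order(). The names by_arrange and by_order are just for illustration.

```r
library(dplyr)
library(gapminder)

# Sorting with the dplyr verb.
by_arrange <- gapminder %>% arrange(gdpPercap)

# The base R approach: order() returns row positions, which index the rows.
by_order <- gapminder[order(gapminder$gdpPercap), ]

# Both approaches put gdpPercap in the same ascending order.
identical(by_arrange$gdpPercap, by_order$gdpPercap)
```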
So far I’ve noted the arrange and filter verbs. These verbs can be stacked together by piping one to another:
gapminder %>% filter(year == 2007) %>% arrange(desc(gdpPercap))
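It can help to see what the stacked version replaces. The sketch below does the same two steps with an intermediate variable and checks that the result matches; step1, stepwise and piped are hypothetical names.

```r
library(dplyr)
library(gapminder)

# Step by step, saving the intermediate result.
step1    <- filter(gapminder, year == 2007)
stepwise <- arrange(step1, desc(gdpPercap))

# The same work as one pipeline, read left to right.
piped <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(gdpPercap))

identical(stepwise, piped)
```

The pipeline avoids naming throwaway intermediate objects, which is the main reason to stack verbs this way.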
The mutate() function changes or adds variables in a dataset.
Inside mutate(), the name of the variable being replaced or created goes on the left side of the = sign, and the value or calculation goes on the right.
gapminder %>% mutate(pop = pop/100000)
To add a new variable with mutate():
gapminder %>% mutate(gdp = gdpPercap * pop)
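Like filter(), mutate() returns a new dataframe rather than changing gapminder, so the result needs to be assigned to be kept. A small sketch, with gap_gdp as an illustrative name:

```r
library(dplyr)
library(gapminder)

# Add a total GDP column and keep the result under a new name.
gap_gdp <- gapminder %>% mutate(gdp = gdpPercap * pop)

"gdp" %in% names(gap_gdp)    # the new dataframe has the extra column
"gdp" %in% names(gapminder)  # the original dataframe does not
```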