Most of the data wrangling I’ve done in the past year? I’d say 99% of it was with dplyr, tidyr, janitor, and lubridate. It can take forever to learn every single function in those packages, so don’t! I’ve put together a top 7 list of the functions I use the most. It’s a quick 10-minute walkthrough, with everything from basic to advanced examples to help you hit the ground running.
1
SELECT
Source: R/select.R
select() is a function from the dplyr package that helps you keep or drop columns in your dataset.
If you only want to keep two columns from the iris dataset, you could:
iris %>% select(Sepal.Length, Sepal.Width)
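Dropping columns works the same way; a minimal sketch of my own (only the built-in iris dataset and dplyr assumed):
# drop a column by negating its name
iris %>% select(-Species)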
However, if your dataset has more than 20 columns and you don’t want to type the name for each one, you might want to combine select() with the tidyselect helpers (a short sketch follows this list):
starts_with(): Starts with an exact prefix.
ends_with(): Ends with an exact suffix.
contains(): Contains a literal string.
matches(): Matches a regular expression.
num_range(): Matches a numerical range like x01, x02, x03.
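A quick sketch of the first few helpers on iris (num_range() is left as a comment because iris has no numbered columns):
iris %>% select(starts_with("Petal"))  # Petal.Length, Petal.Width
iris %>% select(ends_with("Width"))    # Sepal.Width, Petal.Width
# num_range("wk", 1:3) would pick columns named wk1, wk2, wk3 in a dataset that has them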
Advanced:
iris %>% select(contains("Sepal"))
or
iris %>% select(matches("Sepal.[a-z]*"))
will give you the same columns as above.
2
COUNT
Source: R/count-tally.R
count() is easily the function I use most in my daily work. It lets you quickly count the unique values of one or more columns in a dataset. For example:
Basic:
> iris %>% count(Species)
Species n
1 setosa 50
2 versicolor 50
3 virginica 50
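count() also accepts more than one column (including expressions), and sort = TRUE orders the result by frequency; a quick sketch of my own, not from the original example:
# count combinations of Species and whether Sepal.Length exceeds 5, largest groups first
iris %>% count(Species, Sepal.Length > 5, sort = TRUE)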
Advanced:
When you want a quick summary of a subset of the dataset, you can combine count() with adorn_totals() from the janitor package.
> iris %>%
+ filter(Sepal.Length > 5) %>%
+ count(Species) %>% adorn_totals()
Species n
setosa 22
versicolor 47
virginica 49
Total 118
3
SUMMARISE
Source: R/summarise.R
summarise() supports column-wise calculations and returns one row for each combination of grouping variables. If there are no grouping variables, the output will have a single row summarising all observations in the input.
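For instance, without any grouping the result is a single row; a quick sketch of my own, not from the original post:
# no group_by(): one row summarising all 150 observations in iris
iris %>% summarise(mean = mean(Sepal.Length), n = n())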
Basic:
> iris %>%
+ group_by(Species) %>%
+ summarise(mean=mean(Sepal.Length))
# A tibble: 3 × 2
Species mean
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
Advanced:
When you want to apply a statistic or calculation across multiple columns, it’s a good idea to combine summarise() with across(), which lets you pick columns inside “data-masking” functions like summarise(), filter(), or mutate() and apply the same transformation to all of them.
# mean of every column that contains "Sepal", rounded to zero digits
iris %>%
  group_by(Species) %>%
  summarise(across(contains("Sepal"), ~ round(mean(.x), 0)))
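across() can also apply several functions at once if you give it a named list; a minimal sketch (the mean/sd pair here is my choice, just for illustration):
# mean and sd of every "Sepal" column; output columns are named Sepal.Length_mean, Sepal.Length_sd, ...
iris %>%
  group_by(Species) %>%
  summarise(across(contains("Sepal"), list(mean = mean, sd = sd)))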
4
DISTINCT
Source: R/distinct.R
distinct() is the function that removes duplicates and keeps only unique/distinct rows in a data frame.
Our example dataset:
> df
user_id age sex group
1 1 45 female store
2 1 45 online
3 2 36 female store
4 2 36 store
5 5 51 male store
6 6 21 online
7 7 27 male store
8 8 60 online
9 9 58 online
10 10 65 store
11 11 64 female online
12 12 21 store
13 13 62 female online
14 14 24 female online
15 15 31 female online
In the case of multiple entries for one user, it would be appropriate to keep only one record per user. To do that we could:
# to remove duplicates
> df %>%
distinct(user_id, .keep_all = T)
user_id age sex
1 1 45
2 2 36
3 3 33 male
4 4 34 female
5 5 37 female
6 6 53 female
7 7 46 female
8 8 53 female
9 9 62 female
10 10 43 male
11 11 18 female
12 12 33 male
13 13 30 female
14 14 58 male
15 15 43 male
Notice that with .keep_all = TRUE, the dataset keeps all variables, and if a combination is not distinct, only the first row of values is retained.
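A quick sketch of the difference on the same df:
df %>% distinct(user_id)                   # just the unique user_id values, one column
df %>% distinct(user_id, .keep_all = TRUE) # the first full row for each user_id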
In our case, it would be better to keep, for each user, the entry where sex is not blank. We can combine distinct() with arrange() to achieve this.
Advanced:
> df %>%
arrange(user_id, desc(sex)) %>%
distinct(user_id, .keep_all = T)
user_id age sex
1 1 45 female
2 2 36 female
3 3 53 male
4 4 59 female
5 5 69 female
6 6 62 male
7 7 67 male
8 8 70 female
9 9 55 male
10 10 47 female
11 11 49 male
12 12 65 male
13 13 35 male
14 14 63 male
15 15 24 female
5
LEFT JOIN
Source: R/join.R, R/join-by.R
The dplyr package offers several functions for joining two tables (a short sketch follows this list):
inner_join(x, y): only keeps observations from x that have a matching key in y.
left_join(x, y): keeps all observations in x.
right_join(x, y): keeps all observations in y.
full_join(x, y): keeps all observations in x and y.
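Here is a minimal, self-contained sketch of the four join types (the two toy tibbles are made up for illustration):
library(dplyr)

x <- tibble(id = c(1, 2, 3), a = c("A", "B", "C"))
y <- tibble(id = c(2, 3, 4), b = c("P", "Q", "R"))

inner_join(x, y, by = "id")  # ids 2 and 3 only
left_join(x, y, by = "id")   # ids 1, 2, 3 (b is NA for id 1)
right_join(x, y, by = "id")  # ids 2, 3, 4 (a is NA for id 4)
full_join(x, y, by = "id")   # ids 1, 2, 3, 4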
Most of the time, I would just use left_join(). See an example below:
> user
user_id age sex join_date
1 1 45 female 2018-12-31
2 2 36 female 2019-01-02
3 3 34 male 2019-09-28
4 4 61 female 2019-10-07
5 5 56 male 2019-07-08
6 6 27 male 2019-03-05
7 7 24 male 2019-03-17
8 8 48 male 2019-08-08
9 9 67 male 2019-11-01
10 10 35 male 2019-11-13
11 11 68 male 2019-02-27
12 12 21 female 2019-02-25
13 13 62 female 2019-03-29
14 14 18 male 2019-01-11
15 15 38 male 2019-11-16
> order
user_id group date
1 13 store 2019-02-12
2 10 online 2019-12-01
3 8 other 2019-11-14
4 11 online 2019-11-11
5 10 store 2019-03-15
6 NA online 2019-08-22
7 NA store 2019-03-04
8 11 other 2019-03-18
9 9 other 2019-04-14
10 6 other 2019-01-10
By default, because we have not specified the relationship between the tables, every matching record in order is returned, so users with more than one order appear more than once. Notice the duplicated user_id values below.
Basic:
> user %>% left_join(order) # joining 'user' to 'order'
Joining with `by = join_by(user_id)`
user_id age sex join_date group date
1 1 45 female 2018-12-31 <NA> <NA>
2 2 36 female 2019-01-02 <NA> <NA>
3 3 34 male 2019-09-28 <NA> <NA>
4 4 61 female 2019-10-07 <NA> <NA>
5 5 56 male 2019-07-08 <NA> <NA>
6 6 27 male 2019-03-05 other 2019-01-10
7 7 24 male 2019-03-17 <NA> <NA>
8 8 48 male 2019-08-08 other 2019-11-14
9 9 67 male 2019-11-01 other 2019-04-14
10 10 35 male 2019-11-13 online 2019-12-01
11 10 35 male 2019-11-13 store 2019-03-15
12 11 68 male 2019-02-27 online 2019-11-11
13 11 68 male 2019-02-27 other 2019-03-18
14 12 21 female 2019-02-25 <NA> <NA>
15 13 62 female 2019-03-29 store 2019-02-12
16 14 18 male 2019-01-11 <NA> <NA>
17 15 38 male 2019-11-16 <NA> <NA>
Advanced:
If we are only interested in each user’s most recent order, we can remove the duplicates by setting multiple = "first". This tells left_join() to return only the first match found in order. We first sort order from the latest date to the oldest with arrange(desc(date)), so the first match is the most recent one.
> user %>% left_join(order %>% arrange(desc(date)), multiple = "first")
Joining with `by = join_by(user_id)`
user_id age sex join_date group date
1 1 45 female 2018-12-31 <NA> <NA>
2 2 36 female 2019-01-02 <NA> <NA>
3 3 34 male 2019-09-28 <NA> <NA>
4 4 61 female 2019-10-07 <NA> <NA>
5 5 56 male 2019-07-08 <NA> <NA>
6 6 27 male 2019-03-05 other 2019-01-10
7 7 24 male 2019-03-17 <NA> <NA>
8 8 48 male 2019-08-08 other 2019-11-14
9 9 67 male 2019-11-01 other 2019-04-14
10 10 35 male 2019-11-13 online 2019-12-01
11 11 68 male 2019-02-27 online 2019-11-11
12 12 21 female 2019-02-25 <NA> <NA>
13 13 62 female 2019-03-29 store 2019-02-12
14 14 18 male 2019-01-11 <NA> <NA>
15 15 38 male 2019-11-16 <NA> <NA>
You may have noticed that some users have a purchase date earlier than the date they joined the shopping site, which doesn’t make sense. We can remove these records by adding a condition inside join_by().
> user %>% left_join(order, join_by(user_id, join_date <= date))
user_id age sex join_date group date
1 1 45 female 2018-12-31 <NA> <NA>
2 2 36 female 2019-01-02 <NA> <NA>
3 3 34 male 2019-09-28 <NA> <NA>
4 4 61 female 2019-10-07 <NA> <NA>
5 5 56 male 2019-07-08 <NA> <NA>
6 6 27 male 2019-03-05 <NA> <NA>
7 7 24 male 2019-03-17 <NA> <NA>
8 8 48 male 2019-08-08 other 2019-11-14
9 9 67 male 2019-11-01 <NA> <NA>
10 10 35 male 2019-11-13 online 2019-12-01
11 11 68 male 2019-02-27 online 2019-11-11
12 12 21 female 2019-02-25 <NA> <NA>
13 13 62 female 2019-03-29 <NA> <NA>
14 14 18 male 2019-01-11 <NA> <NA>
15 15 38 male 2019-11-16 <NA> <NA>
6
COMPLETE
Source: R/complete.R
The complete() function in the tidyr package is handy for recreating rows or records that are missing from a data frame. It turns implicit missing values into explicit ones, filling in the gaps in the data.
See the example dataset below: a count of orders by date and group. For reporting purposes, we would want to show a complete set of dates.
> order %>% count(date, group)
date group n
1 2019-01-01 other 1
2 2019-01-02 online 1
3 2019-01-03 online 1
4 2019-01-03 other 1
5 2019-01-04 store 1
6 2019-01-05 online 3
7 2019-01-07 online 1
8 2019-01-07 store 1
9 2019-01-08 online 1
10 2019-01-08 other 1
11 2019-01-09 online 1
12 2019-01-09 other 1
13 2019-01-10 store 2
14 2019-01-11 online 1
15 2019-01-11 store 1
16 2019-01-12 online 2
17 2019-01-12 other 1
18 2019-01-13 online 2
19 2019-01-13 other 1
20 2019-01-13 store 2
21 2019-01-14 online 1
22 2019-01-14 store 1
23 2019-01-15 online 2
Basic:
> order %>% count(date, group) %>%
+ complete(date = seq.Date(min(order$date), max(order$date), by = 'day'), group)
# A tibble: 45 × 3
date group n
<date> <chr> <int>
1 2019-01-01 online NA
2 2019-01-01 other 1
3 2019-01-01 store NA
4 2019-01-02 online 1
5 2019-01-02 other NA
6 2019-01-02 store NA
7 2019-01-03 online 1
8 2019-01-03 other 1
9 2019-01-03 store NA
10 2019-01-04 online NA
# ℹ 35 more rows
# ℹ Use `print(n = ...)` to see more rows
For the NAs in the table, you can use replace_na() from the tidyr package to replace NA with specific values; this can be done directly after the pipe. Alternatively, you can use the fill argument of complete(), as shown in the advanced example below.
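A quick sketch of the replace_na() route first, reusing the order data from above (complete() and replace_na() both come from tidyr):
order %>%
  count(date, group) %>%
  complete(date = seq.Date(min(order$date), max(order$date), by = 'day'), group) %>%
  replace_na(list(n = 0))  # turn the NA counts into 0 in a separate step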
Advanced:
> order %>% count(date, group) %>%
+ complete(date = seq.Date(min(order$date), max(order$date), by = 'day'), group, fill = list(n = 0))
# A tibble: 45 × 3
date group n
<date> <chr> <int>
1 2019-01-01 online 0
2 2019-01-01 other 1
3 2019-01-01 store 0
4 2019-01-02 online 1
5 2019-01-02 other 0
6 2019-01-02 store 0
7 2019-01-03 online 1
8 2019-01-03 other 1
9 2019-01-03 store 0
10 2019-01-04 online 0
# ℹ 35 more rows
# ℹ Use `print(n = ...)` to see more rows
7
PIVOT WIDER
Source: R/pivot-wide.R
pivot_wider() is another very commonly used function, this time from tidyr. It transforms the dataset into a “wider” format, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer(), sketched briefly after the first wide example below.
> order %>% mutate(month = month(date, label = T)) %>% count(month, group)
month group n
1 Jan online 49
2 Jan store 41
3 Feb online 38
4 Feb store 52
5 Mar online 45
6 Mar store 45
7 Apr online 24
8 Apr store 36
9 May online 27
10 May store 23
11 Jun online 67
12 Jun store 73
13 Jul online 54
14 Jul store 66
15 Aug online 35
16 Aug store 55
17 Sep online 52
18 Sep store 38
19 Oct online 82
20 Oct store 58
21 Nov online 19
22 Nov store 21
If we were asked to lay out the monthly number of orders as columns, we could:
> order %>% mutate(month = month(date, label = T)) %>% count(month, group) %>%
+ pivot_wider(id_cols = "group", names_from = "month",
+ values_from = "n")
# A tibble: 2 × 12
group Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 online 49 38 45 24 27 67 54 35 52 82 19
2 store 41 52 45 36 23 73 66 55 38 58 21
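As an aside, pivot_longer() (the inverse mentioned earlier) would take this wide table straight back to the long month/group/n layout; a quick sketch:
# pipe the wide result back through pivot_longer() to recover month, group and n
order %>% mutate(month = month(date, label = T)) %>% count(month, group) %>%
  pivot_wider(id_cols = "group", names_from = "month", values_from = "n") %>%
  pivot_longer(-group, names_to = "month", values_to = "n")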
A great feature of pivot_wider() is that it also has arguments for renaming columns and transforming values. For example, if you want to see how the number of orders changes with a 15% increase, you can pass a function to values_fn in pivot_wider() to show this increase.
Advanced:
> order %>% mutate(month = month(date, label = T)) %>% count(month, group) %>%
+ pivot_wider(id_cols = "group", names_from = "month",
+ values_from = "n", values_fn = ~ round(.x * 1.15, 0), names_glue = "{month}_(+15%)")
# A tibble: 2 × 12
group `Jan_(+15%)` `Feb_(+15%)` `Mar_(+15%)` `Apr_(+15%)` `May_(+15%)`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 online 56 44 52 28 31
2 store 47 60 52 41 26
# ℹ 6 more variables: `Jun_(+15%)` <dbl>, `Jul_(+15%)` <dbl>, `Aug_(+15%)` <dbl>,
# `Sep_(+15%)` <dbl>, `Oct_(+15%)` <dbl>, `Nov_(+15%)` <dbl>
Note that we use names_glue in the example above to rename all the resulting columns. From the documentation:
names_glue: Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and the special .value) to create custom column names.
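For completeness, here is a small made-up sketch of that special .value placeholder, which matters when values_from contains more than one column (the sales table below is hypothetical and not part of this post’s data):
library(dplyr)
library(tidyr)

sales <- tibble(
  group = c("online", "store", "online", "store"),
  month = c("Jan", "Jan", "Feb", "Feb"),
  n     = c(49, 41, 38, 52),
  rev   = c(1200, 980, 900, 1300)
)

# ".value" stands in for whichever values_from column (n or rev) a cell comes from
sales %>%
  pivot_wider(id_cols = group, names_from = month,
              values_from = c(n, rev), names_glue = "{month}_{.value}")
# one column per month/value pair: Jan_n, Feb_n, Jan_rev, Feb_rev (plus group)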