Most of the data wrangling I’ve done in the past year? I’d say 99% of it was with dplyr, tidyr, janitor, and lubridate. It can take forever to learn every single functions in those packages, so don’t! I’ve put together a top 7 list of the functions I use the most. It’s a quick 10 minute walkthrough, with everything from basic to advanced examples to help you hit the ground running.

1

SELECT

R/select.R is a function from the dplyr package that helps you keep or drop columns in your dataset.

If you want to keep the two columns in the iris dataset, you could:

iris%>%select(Sepal.Length, Sepal.Width)

However, if your dataset has more than 20 columns and you don’t want to type the name for each one, you might want to combine the use of:

  • starts_with(): Starts with an exact prefix.
  • ends_with(): Ends with an exact suffix.
  • contains(): Contains a literal string.
  • matches(): Matches a regular expression.
  • num_range(): Matches a numerical range like x01, x02, x03.

Advanced:

iris%>%select(contains("Sepal")) 

or  

iris%>%select(matches("Sepal.[a-z]*")) 

will give you the same columns like above.

2

COUNT

  1. count()

source: R/count-tally.R

count() is literally the most used function in my daily task. It lets you quickly count the unique values when you specify one or more columns in the dataset for example:

Basic:

> iris %>% count(Species)
     
  Species  n     
1 setosa 50 
2 versicolor 50 
3 virginica 50

Advanced:

In the case when you want to get a quick summary of a subset of the dataset, you can combine the use of adorn_totals() in the janitor package.

> iris %>% 
+   filter(Sepal.Length > 5) %>% 
+   count(Species) %>% adorn_totals()    

Species   n     
setosa  22 
versicolor  47  
virginica  49      
Total 118

3

SUMMARISE

Source: R/summarise.R

summarise() supports column-wise calculations and returns one row for each combination of grouping variables. If there are no grouping variables, the output will have a single row summarising all observations in the input.

Basic:

> iris %>%
+   group_by(Species) %>% 
+   summarise(mean=mean(Sepal.Length))

# A tibble: 3 × 2  
  Species     mean  
   <fct>      <dbl>
1 setosa      5.01
2 versicolor  5.94
3 virginica   6.59

Advanced:

In some cases, when you want to get a statistic or calculation across multiple columns, it’s a good idea to combine its use with:

  • across() – allows you to pick columns inside in “data-masking” functions like summarise(), filter() or mutate()and apply the same transformation to multiple columns.
# rounding to zero digit for multiple columns that contains "Sepal" 
iris %>%
  group_by(Species) %>%
     summarise(across(contains("Sepal"), ~ round(.x, 0)))

4

DISTINCT

Source: R/distinct.R

distinct() is the function that helps remove duplicates and keeps only unique/distinct rows in a data frame.

Our example dataset:

> df   
    user_id age    sex  group
1        1  45 female  store
2        1  45        online
3        2  36 female  store
4        2  36         store
5        5  51   male  store
6        6  21        online
7        7  27   male  store
8        8  60        online
9        9  58        online
10      10  65         store
11      11  64 female online
12      12  21         store
13      13  62 female online
14      14  24 female online
15      15  31 female online

In the case of multiple entries for one user, it would be appropriate to keep only one record per user. To do that we could:

# to remove duplicates 
> df  %>%
	 distinct(user_id, .keep_all = T)
 
		   user_id age    sex
   1        1  45       
   2        2  36       
   3        3  33   male
   4        4  34 female
   5        5  37 female
   6        6  53 female
   7        7  46 female
   8        8  53 female
   9        9  62 female
   10      10  43   male
   11      11  18 female
   12      12  33   male
   13      13  30 female
   14      14  58   male
   15      15  43   male

Notice that if .keep_all = TRUE, the dataset keeps all variables and if the combination is not distinct, this only keeps the first row of values.

In our case, it would be appropriate to only keep the entries for each user where sex is not showing as blank. We could combine the use of arrange() to achieve this.

Advanced:

> df  %>%
 arrange(user_id, desc(sex)) %>%
  distinct(user_id, .keep_all = T)
  
     user_id age    sex
	 1        1  45 female
	 2        2  36 female
	 3        3  53   male
	 4        4  59 female
	 5        5  69 female
	 6        6  62   male
	 7        7  67   male
	 8        8  70 female
	 9        9  55   male
	 10      10  47 female
	 11      11  49   male
	 12      12  65   male
	 13      13  35   male
	 14      14  63   male
	 15      15  24 female

5

LEFT JOIN

Source: R/join.R R/join-by.R

The dplyr package offers several functions for joining two tables:

An inner_join(x, y) only keeps observations from x that have a matching key in y.

left_join(x, y) keeps all observations in x.

right_join(x, y) keeps all observations in y.

full_join(x, y) keeps all observations in x and y.

Most of the time, I would just use left_join(). See an example below:

   > user
      user_id age    sex  join_date
      1        1  45 female 2018-12-31
      2        2  36 female 2019-01-02
      3        3  34   male 2019-09-28
      4        4  61 female 2019-10-07
      5        5  56   male 2019-07-08
      6        6  27   male 2019-03-05
      7        7  24   male 2019-03-17
      8        8  48   male 2019-08-08
      9        9  67   male 2019-11-01
      10      10  35   male 2019-11-13
      11      11  68   male 2019-02-27
      12      12  21 female 2019-02-25
      13      13  62 female 2019-03-29
      14      14  18   male 2019-01-11
      15      15  38   male 2019-11-16
      
   > order
      user_id  group       date
   1       13  store 2019-02-12
   2       10 online 2019-12-01
   3        8  other 2019-11-14
   4       11 online 2019-11-11
   5       10  store 2019-03-15
   6       NA online 2019-08-22
   7       NA  store 2019-03-04
   8       11  other 2019-03-18
   9        9  other 2019-04-14
   10       6  other 2019-01-10

By default, because we have not specified the relationship between tables, all records in order are returned. See some duplicates below against user_id.

Basic:

> user %>% left_join(order) # joining 'user' to 'order'

Joining with `by = join_by(user_id)`
   user_id age    sex  join_date  group       date
1        1  45 female 2018-12-31   <NA>       <NA>
2        2  36 female 2019-01-02   <NA>       <NA>
3        3  34   male 2019-09-28   <NA>       <NA>
4        4  61 female 2019-10-07   <NA>       <NA>
5        5  56   male 2019-07-08   <NA>       <NA>
6        6  27   male 2019-03-05  other 2019-01-10
7        7  24   male 2019-03-17   <NA>       <NA>
8        8  48   male 2019-08-08  other 2019-11-14
9        9  67   male 2019-11-01  other 2019-04-14
10      10  35   male 2019-11-13 online 2019-12-01
11      10  35   male 2019-11-13  store 2019-03-15
12      11  68   male 2019-02-27 online 2019-11-11
13      11  68   male 2019-02-27  other 2019-03-18
14      12  21 female 2019-02-25   <NA>       <NA>
15      13  62 female 2019-03-29  store 2019-02-12
16      14  18   male 2019-01-11   <NA>       <NA>
17      15  38   male 2019-11-16   <NA>       <NA>

Advanced:

If we are only interested in the most recent order record, we can remove duplicates by including multiple = "first". This tells left_join to return only the first match found in order. We ensure the date is ranked from the latest to the oldest using arrange(desc(date)).

> user %>% left_join(order %>% arrange(desc(date)), multiple = "first")

Joining with `by = join_by(user_id)`   
    user_id age    sex  join_date  group       date
1        1  45 female 2018-12-31   <NA>       <NA>
2        2  36 female 2019-01-02   <NA>       <NA>
3        3  34   male 2019-09-28   <NA>       <NA>
4        4  61 female 2019-10-07   <NA>       <NA>
5        5  56   male 2019-07-08   <NA>       <NA>
6        6  27   male 2019-03-05  other 2019-01-10
7        7  24   male 2019-03-17   <NA>       <NA>
8        8  48   male 2019-08-08  other 2019-11-14
9        9  67   male 2019-11-01  other 2019-04-14
10      10  35   male 2019-11-13 online 2019-12-01
11      11  68   male 2019-02-27 online 2019-11-11
12      12  21 female 2019-02-25   <NA>       <NA>
13      13  62 female 2019-03-29  store 2019-02-12
14      14  18   male 2019-01-11   <NA>       <NA>
15      15  38   male 2019-11-16   <NA>       <NA>

You may have seen that some users have a purchase date earlier than their joining date on the shopping site. This doesn’t make sense. We could remove these records by setting a condition in join_by().

> user %>% left_join(order, join_by(user_id, join_date <= date))
     user_id age    sex  join_date  group       date
 1        1  45 female 2018-12-31   <NA>       <NA>
 2        2  36 female 2019-01-02   <NA>       <NA>
 3        3  34   male 2019-09-28   <NA>       <NA>
 4        4  61 female 2019-10-07   <NA>       <NA>
 5        5  56   male 2019-07-08   <NA>       <NA>
 6        6  27   male 2019-03-05   <NA>       <NA>
 7        7  24   male 2019-03-17   <NA>       <NA>
 8        8  48   male 2019-08-08  other 2019-11-14
 9        9  67   male 2019-11-01   <NA>       <NA>
 10      10  35   male 2019-11-13 online 2019-12-01
 11      11  68   male 2019-02-27 online 2019-11-11
 12      12  21 female 2019-02-25   <NA>       <NA>
 13      13  62 female 2019-03-29   <NA>       <NA>
 14      14  18   male 2019-01-11   <NA>       <NA>
 15      15  38   male 2019-11-16   <NA>       <NA>

6

COMPLETE

Source: R/complete.R

The function complete() in tidyr package is handy when recreating rows or records that are missing from a data frame. It transforms implicit missing values into explicit ones, filling in the gaps in the data frame.

See the example for the dataset below. We have a list of the number of orders by date. For reporting purposes, we would want to show a complete set of dates.

> order %>% count(date, group)

	date  group n
1  2019-01-01  other 1
2  2019-01-02 online 1
3  2019-01-03 online 1
4  2019-01-03  other 1
5  2019-01-04  store 1
6  2019-01-05 online 3
7  2019-01-07 online 1
8  2019-01-07  store 1
9  2019-01-08 online 1
10 2019-01-08  other 1
11 2019-01-09 online 1
12 2019-01-09  other 1
13 2019-01-10  store 2
14 2019-01-11 online 1
15 2019-01-11  store 1
16 2019-01-12 online 2
17 2019-01-12  other 1
18 2019-01-13 online 2
19 2019-01-13  other 1
20 2019-01-13  store 2
21 2019-01-14 online 1
22 2019-01-14  store 1
23 2019-01-15 online 2

Basic

> order %>% count(date, group) %>%
 +   complete(date = seq.Date(min(order$date), max(order$date), by = 'day'), group)

 # A tibble: 45 × 3
    date       group      n 
    <date>     <chr>  <int> 
1 2019-01-01 online    NA 
2 2019-01-01 other      1 
3 2019-01-01 store     NA 
4 2019-01-02 online     1 
5 2019-01-02 other     NA 
6 2019-01-02 store     NA 
7 2019-01-03 online     1 
8 2019-01-03 other      1 
9 2019-01-03 store     NA
10 2019-01-04 online    NA
# ℹ 35 more rows# ℹ Use `print(n = ...)` to see more rows

For the NAs in the table, you can use replace_na() from the tidyr package to replace NA with specific values. This can be done directly after the pipe. Alternatively, you can use fill in complete(), as shown in the example below.

Advanced:

> order %>% count(date, group) %>% 
+   complete(date = seq.Date(min(order$date), max(order$date), by = 'day'), group, fill = list(n = 0))

# A tibble: 45 × 3   
       date       group      n   
      <date>     <chr>  <int> 
1 2019-01-01 online     0 
2 2019-01-01 other      1 
3 2019-01-01 store      0 
4 2019-01-02 online     1 
5 2019-01-02 other      0 
6 2019-01-02 store      0 
7 2019-01-03 online     1 
8 2019-01-03 other      1 
9 2019-01-03 store      0
10 2019-01-04 online     0
# ℹ 35 more rows# ℹ Use `print(n = ...)` to see more rows

7

PIVOT WIDER


R/pivot-wide.R
 is another very commonly used function in tidyr() that transforms the dataset to a “wider” version, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().

> order %>% mutate(month = month(date, label = T)) %>% count(month, group)

   month  group  n
1    Jan online 49
2    Jan  store 41
3    Feb online 38
4    Feb  store 52
5    Mar online 45
6    Mar  store 45
7    Apr online 24
8    Apr  store 36
9    May online 27
10   May  store 23
11   Jun online 67
12   Jun  store 73
13   Jul online 54
14   Jul  store 66
15   Aug online 35
16   Aug  store 55
17   Sep online 52
18   Sep  store 38
19   Oct online 82
20   Oct  store 58
21   Nov online 19
22   Nov  store 21

If we were asked to lay out the monthly number of orders by columns, we could:

> order %>% mutate(month = month(date, label = T)) %>% count(month, group) %>% 
+   pivot_wider(id_cols = "group", names_from = "month",
+               values_from = "n")

# A tibble: 2 × 12
  group    Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
  <chr>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 online    49    38    45    24    27    67    54    35    52    82    19
 2 store     41    52    45    36    23    73    66    55    38    58    21

A great feature of pivot_wider() is it also has arguments for renaming columns and redefining values. For example, if you want to see how the number of order changes with a 15% increase, you can include a function in value_fn in pivot_wider() to show this increase.

Advanced:

> order %>% mutate(month = month(date, label = T)) %>% count(month, group) %>%
 +   pivot_wider(id_cols = "group", names_from = "month",
 +       values_from = "n", values_fn = ~ round(.x * 1.15, 0), names_glue = "{month}_(+15%)")

 # A tibble: 2 × 12
   group  `Jan_(+15%)` `Feb_(+15%)` `Mar_(+15%)` `Apr_(+15%)` `May_(+15%)`
   <chr>         <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
 1 online         56           44            52           28           31
 2 store          47           60            52           41           26
 # ℹ 6 more variables: `Jun_(+15%)` <dbl>, `Jul_(+15%)` <dbl>, `Aug_(+15%)` <dbl>,
 #   `Sep_(+15%)` <dbl>, `Oct_(+15%)` <dbl>, `Nov_(+15%)` <dbl>

Note that we use names_glue in the example above to rename all the resulting columns.

names_glue

Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and special .value) to create custom column names.