Modern data manipulation:
dplyr and data.table

Raphael Rehms

dplyr

Basics

dplyr is an R package that implements an advanced version of standard data frames

From the official website:

“dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges”

  • Part of the `tidyverse` (package collection)

    library(dplyr)

pipe %>%

The %>% is an operator to concatenate function calls. Originally from the `magrittr` package.

A short example will show this merits:

  • Take a list of numbers.

  • Square each number.

  • Sum the squared values.

# Without using pipe operator

sum(sapply(list(1, 2, 3, 4),function(x) x^2))
[1] 30
# Use pipe
list(1, 2, 3, 4) %>%
  sapply(function(x) x^2) %>%
  sum()
[1] 30

There is a keyboard shortcut in Rstudio to insert a pipe operator. On my RStudio, it is Ctrl + Shift + m`

The pipe was so popular, that there is an official implementation in base R (`|>`). However, I prefer %>% for reasons…

tibble

tibble

Here, we use the starwars data set from the dplyr package.

head(starwars)
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Data class is already `tibble`

  • Slightly more informative than standard data frame

  • We can define a tibble by using the according function, e.g. tibble(iris)

  • A tibble behaves like a standard data frame (e.g. it is still list-like)

Basic operations

tibble is designed to work smoothly with the pipe

  • Select rows and columns

    • We can use filter to select rows with defined conditions

    • We can use select to select columns

starwars %>% 
  filter(height > 170, mass < 130) %>%
  select(name, homeworld)
# A tibble: 37 × 2
   name              homeworld 
   <chr>             <chr>     
 1 Luke Skywalker    Tatooine  
 2 Owen Lars         Tatooine  
 3 Biggs Darklighter Tatooine  
 4 Obi-Wan Kenobi    Stewjon   
 5 Anakin Skywalker  Tatooine  
 6 Chewbacca         Kashyyyk  
 7 Han Solo          Corellia  
 8 Greedo            Rodia     
 9 Jek Tono Porkins  Bestine IV
10 Boba Fett         Kamino    
# ℹ 27 more rows
  • Note that we can call variables from the data set without $

Basic operations

Define new variables

  • We can use mutate to define new variables
starwars %>% 
  mutate(bmi = mass / (height/100)^2) 
# A tibble: 87 × 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, bmi <dbl>
  • Note that we returned the data here!

  • Hence, no inplace operation. Use <- or %<>% to for an assigment

Group and summary operations

  • Use group_by to make a operations on subgroups

  • Use summarise to summarize variables

  • Sort results using arrange and possibly desc

Question: What is the mean height and weight for each species? Sort the result by weight in descending order.

starwars %>% 
  group_by(species) %>% 
  summarise(
    mean_height = mean(height, na.rm = TRUE),
    mean_weight = mean(mass, na.rm = TRUE),
  ) %>% 
  arrange(desc(mean_weight))
# A tibble: 38 × 3
   species      mean_height mean_weight
   <chr>              <dbl>       <dbl>
 1 Hutt                 175        1358
 2 Kaleesh              216         159
 3 Wookiee              231         124
 4 Trandoshan           190         113
 5 Besalisk             198         102
 6 Neimodian            191          90
 7 Kaminoan             221          88
 8 Nautolan             196          87
 9 Mon Calamari         180          83
10 Cerean               198          82
# ℹ 28 more rows

A lot more…

We can only scratch on the surface. See the webpage for more examples:

https://dplyr.tidyverse.org/index.html

data.table

data. table

  • data.table is used for large data sets and is designed to be fast and memory efficient.

  • Like a tibble, it is a ‘list-like’ object.

  • We stick to the starwars data set:

    library(data.table)
    
    sw_dt <- starwars
    setDT(sw_dt)
    head(sw_dt)
                 name height  mass  hair_color  skin_color eye_color birth_year
               <char>  <int> <num>      <char>      <char>    <char>      <num>
    1: Luke Skywalker    172    77       blond        fair      blue       19.0
    2:          C-3PO    167    75        <NA>        gold    yellow      112.0
    3:          R2-D2     96    32        <NA> white, blue       red       33.0
    4:    Darth Vader    202   136        none       white    yellow       41.9
    5:    Leia Organa    150    49       brown       light     brown       19.0
    6:      Owen Lars    178   120 brown, grey       light      blue       52.0
          sex    gender homeworld species
       <char>    <char>    <char>  <char>
    1:   male masculine  Tatooine   Human
    2:   none masculine  Tatooine   Droid
    3:   none masculine     Naboo   Droid
    4:   male masculine  Tatooine   Human
    5: female  feminine  Alderaan   Human
    6:   male masculine  Tatooine   Human
                                                                                                                       films
                                                                                                                      <list>
    1:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
    2:     A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
    3: A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith,...
    4:                                             A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith
    5:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
    6:                                                                   A New Hope,Attack of the Clones,Revenge of the Sith
                                vehicles               starships
                                  <list>                  <list>
    1: Snowspeeder,Imperial Speeder Bike X-wing,Imperial shuttle
    2:                                                          
    3:                                                          
    4:                                           TIE Advanced x1
    5:             Imperial Speeder Bike                        
    6:                                                          
  • data.table implements the function fread for fast and easy file reading.

general syntax

data.table uses as basic syntac

DT[i, j, by]

with an analogy to SQL:

DT[where | order by,  select | update ,  group by]
  • i to filter rows or order

  • j to select columns OR create new ones

  • by to do it for subgroups

Warning

Wrap j in a list() or its alias in DT .() to ensure a data table object! \(\rightarrow\) see exercise.

Example from before

dplyr

starwars %>% 
  filter(height > 170, mass < 130) %>%
  select(name, homeworld)

data.table

sw_dt[height > 170 & mass < 130, .(name, homeworld)]
                 name      homeworld
               <char>         <char>
 1:    Luke Skywalker       Tatooine
 2:         Owen Lars       Tatooine
 3: Biggs Darklighter       Tatooine
 4:    Obi-Wan Kenobi        Stewjon
 5:  Anakin Skywalker       Tatooine
 6:         Chewbacca       Kashyyyk
 7:          Han Solo       Corellia
 8:            Greedo          Rodia
 9:  Jek Tono Porkins     Bestine IV
10:         Boba Fett         Kamino
11:             Bossk      Trandosha
12:  Lando Calrissian        Socorro
13:             Lobot         Bespin
14:            Ackbar       Mon Cala
15:      Qui-Gon Jinn           <NA>
16:       Nute Gunray Cato Neimoidia
17:     Padmé Amidala          Naboo
18:     Jar Jar Binks          Naboo
19:      Roos Tarpals          Naboo
20:        Darth Maul       Dathomir
21:       Ayla Secura         Ryloth
22:        Mace Windu     Haruun Kal
23:      Ki-Adi-Mundi          Cerea
24:         Kit Fisto    Glee Anselm
25:        Adi Gallia      Coruscant
26:          Plo Koon          Dorin
27:      Gregar Typho          Naboo
28: Poggle the Lesser       Geonosis
29:             Dooku        Serenno
30:        Jango Fett   Concord Dawn
31:   Dexter Jettster           Ojom
32:           Lama Su         Kamino
33:        Wat Tambor          Skako
34:          Shaak Ti          Shili
35:   Raymus Antilles       Alderaan
36:         Sly Moore         Umbara
37:        Tion Medon         Utapau
                 name      homeworld
# sw_dt[height > 170 & mass < 130, list(name, homeworld)]  # the same

Special argument .N

  • data.table uses .N to count rows (in a group)
starwars[species == "Human", .N]  # 35 humans in the data set
[1] 35
  • we can combine it with by to count the number in each group
head(starwars[,.N, by = species])
   species     N
    <char> <int>
1:   Human    35
2:   Droid     6
3: Wookiee     2
4:  Rodian     1
5:    Hutt     1
6:    <NA>     4

More complex example from before

starwars %>% 
  group_by(species) %>% 
  summarise(
    mean_height = mean(height, na.rm = TRUE),
    mean_weight = mean(mass, na.rm = TRUE),
  ) %>% 
  arrange(desc(mean_weight))
# A tibble: 38 × 3
   species      mean_height mean_weight
   <chr>              <dbl>       <dbl>
 1 Hutt                 175        1358
 2 Kaleesh              216         159
 3 Wookiee              231         124
 4 Trandoshan           190         113
 5 Besalisk             198         102
 6 Neimodian            191          90
 7 Kaminoan             221          88
 8 Nautolan             196          87
 9 Mon Calamari         180          83
10 Cerean               198          82
# ℹ 28 more rows
  • Giving the list in j names to return a data.table with according names.

  • We use the fact that the returned object is a data table

sw_dt[,.(
    mean_height = mean(height, na.rm = TRUE),
    mean_weight = mean(mass, na.rm = TRUE)
), by = species][order(-mean_weight),]
           species mean_height mean_weight
            <char>       <num>       <num>
 1:           Hutt    175.0000     1358.00
 2:        Kaleesh    216.0000      159.00
 3:        Wookiee    231.0000      124.00
 4:     Trandoshan    190.0000      113.00
 5:       Besalisk    198.0000      102.00
 6:      Neimodian    191.0000       90.00
 7:       Kaminoan    221.0000       88.00
 8:       Nautolan    196.0000       87.00
 9:   Mon Calamari    180.0000       83.00
10:         Cerean    198.0000       82.00
11:          Human    178.0000       81.31
12:           <NA>    175.0000       81.00
13:         Zabrak    173.0000       80.00
14:        Kel Dor    188.0000       80.00
15:      Geonosian    183.0000       80.00
16:         Pau'an    206.0000       80.00
17:         Rodian    173.0000       74.00
18:         Gungan    208.6667       74.00
19:          Droid    131.2000       69.75
20:      Sullustan    160.0000       68.00
21:          Toong    163.0000       65.00
22:        Togruta    178.0000       57.00
23:        Twi'lek    179.0000       55.00
24:       Clawdite    168.0000       55.00
25:       Mirialan    168.0000       53.10
26:     Tholothian    184.0000       50.00
27:        Skakoan    193.0000       48.00
28:     Vulptereen     94.0000       45.00
29:            Dug    112.0000       40.00
30:           Ewok     88.0000       20.00
31: Yoda's species     66.0000       17.00
32:         Aleena     79.0000       15.00
33:      Toydarian    137.0000         NaN
34:          Xexto    122.0000         NaN
35:       Iktotchi    188.0000         NaN
36:       Quermian    264.0000         NaN
37:       Chagrian    196.0000         NaN
38:           Muun    191.0000         NaN
           species mean_height mean_weight

Example with multiple arguments in group

We want to find out what the max and min value of height for each subgroup of species and gender is. We also want to know, how many data points were used to calculate the result:

sw_dt[,.(minimum = min(height, na.rm = T), 
         maximum = (max(height, na.rm = T)),
         .N),   # no name used for .N!
      by = .(species, gender)]
           species    gender minimum maximum     N
            <char>    <char>   <int>   <int> <int>
 1:          Human masculine     170     202    26
 2:          Droid masculine      96     200     5
 3:          Human  feminine     150     185     9
 4:        Wookiee masculine     228     234     2
 5:         Rodian masculine     173     173     1
 6:           Hutt masculine     175     175     1
 7:           <NA>      <NA>     157     185     4
 8: Yoda's species masculine      66      66     1
 9:     Trandoshan masculine     190     190     1
10:   Mon Calamari masculine     180     180     1
11:           Ewok masculine      88      88     1
12:      Sullustan masculine     160     160     1
13:      Neimodian masculine     191     191     1
14:         Gungan masculine     196     224     3
15:      Toydarian masculine     137     137     1
16:            Dug masculine     112     112     1
17:         Zabrak masculine     171     175     2
18:        Twi'lek masculine     180     180     1
19:        Twi'lek  feminine     178     178     1
20:         Aleena masculine      79      79     1
21:     Vulptereen masculine      94      94     1
22:          Xexto masculine     122     122     1
23:          Toong masculine     163     163     1
24:         Cerean masculine     198     198     1
25:       Nautolan masculine     196     196     1
26:     Tholothian  feminine     184     184     1
27:       Iktotchi masculine     188     188     1
28:       Quermian masculine     264     264     1
29:        Kel Dor masculine     188     188     1
30:       Chagrian masculine     196     196     1
31:      Geonosian masculine     183     183     1
32:       Mirialan  feminine     166     170     2
33:       Clawdite  feminine     168     168     1
34:       Besalisk masculine     198     198     1
35:       Kaminoan masculine     229     229     1
36:       Kaminoan  feminine     213     213     1
37:          Droid  feminine      96      96     1
38:        Skakoan masculine     193     193     1
39:           Muun masculine     191     191     1
40:        Togruta  feminine     178     178     1
41:        Kaleesh masculine     216     216     1
42:         Pau'an masculine     206     206     1
           species    gender minimum maximum     N

New variables

  • When we define new variables in base data sets, this invokes a deep copy (very inefficient)

  • data.table relies on reference using a custom operator :=

We create a new variable bmi

sw_dt[, bmi := mass / (height/100)^2]

This is a lot more efficient as we fo not make a copy of the full data set.



Delete variables

  • We can delete variables from the data table by assigning a NULL to it:
sw_dt[, vehicles := NULL]

New variables cont’d

:= can also be used…

  • to assign multiple variables

  • in combination with selection of rows

  • in combination with by and .N

Here we make an example where we calculate the mean bmi (by hand) and its contribution (in %) to the common weight of all individuals from one species.

  • We don’t create a new data table, but add the information to the original one:
sw_dt[, c("mean_bmi", "mass_contrib") := .(sum(mass/(height/100)^2, na.rm=T)/.N, 
                                             mass/sum(mass, na.rm = T)*100), 
      by = species]

# look at the data set
sw_dt[, .(name, species, bmi, mean_bmi, mass_contrib)]
                     name        species       bmi   mean_bmi mass_contrib
                   <char>         <char>     <num>      <num>        <num>
 1:        Luke Skywalker          Human  26.02758  14.243690     4.734965
 2:                 C-3PO          Droid  26.89232  21.770756    26.881720
 3:                 R2-D2          Droid  34.72222  21.770756    11.469534
 4:           Darth Vader          Human  33.33007  14.243690     8.363055
 5:           Leia Organa          Human  21.77778  14.243690     3.013160
 6:             Owen Lars          Human  37.87401  14.243690     7.379166
 7:    Beru Whitesun Lars          Human  27.54821  14.243690     4.611979
 8:                 R5-D4          Droid  34.00999  21.770756    11.469534
 9:     Biggs Darklighter          Human  25.08286  14.243690     5.165416
10:        Obi-Wan Kenobi          Human  23.24598  14.243690     4.734965
11:      Anakin Skywalker          Human  23.76641  14.243690     5.165416
12:        Wilhuff Tarkin          Human        NA  14.243690           NA
13:             Chewbacca        Wookiee  21.54509  23.191276    45.161290
14:              Han Solo          Human  24.69136  14.243690     4.919444
15:                Greedo         Rodian  24.72518  24.725183   100.000000
16: Jabba Desilijic Tiure           Hutt 443.42857 443.428571   100.000000
17:        Wedge Antilles          Human  26.64360  14.243690     4.734965
18:      Jek Tono Porkins           <NA>  33.95062  18.483967    45.267490
19:                  Yoda Yoda's species  39.02663  39.026630   100.000000
20:             Palpatine          Human  25.95156  14.243690     4.611979
21:             Boba Fett          Human  23.35095  14.243690     4.808757
22:                 IG-88          Droid  35.00000  21.770756    50.179211
23:                 Bossk     Trandoshan  31.30194  31.301939   100.000000
24:      Lando Calrissian          Human  25.21625  14.243690     4.857951
25:                 Lobot          Human  25.79592  14.243690     4.857951
26:                Ackbar   Mon Calamari  25.61728  25.617284   100.000000
27:            Mon Mothma          Human        NA  14.243690           NA
28:          Arvel Crynyd          Human        NA  14.243690           NA
29: Wicket Systri Warrick           Ewok  25.82645  25.826446   100.000000
30:             Nien Nunb      Sullustan  26.56250  26.562500   100.000000
31:          Qui-Gon Jinn          Human  23.89326  14.243690     5.472882
32:           Nute Gunray      Neimodian  24.67038  24.670376   100.000000
33:         Finis Valorum          Human        NA  14.243690           NA
34:         Padmé Amidala          Human  13.14828  14.243690     2.767187
35:         Jar Jar Binks         Gungan  17.18034  11.174272    44.594595
36:          Roos Tarpals         Gungan  16.34247  11.174272    55.405405
37:            Rugor Nass         Gungan        NA  11.174272           NA
38:              Ric Olié          Human        NA  14.243690           NA
39:                 Watto      Toydarian        NA   0.000000           NA
40:               Sebulba            Dug  31.88776  31.887755   100.000000
41:         Quarsh Panaka          Human        NA  14.243690           NA
42:        Shmi Skywalker          Human        NA  14.243690           NA
43:            Darth Maul         Zabrak  26.12245  13.061224   100.000000
44:           Bib Fortuna        Twi'lek        NA   8.679460           NA
45:           Ayla Secura        Twi'lek  17.35892   8.679460   100.000000
46:          Ratts Tyerel         Aleena  24.03461  24.034610   100.000000
47:              Dud Bolt     Vulptereen  50.92802  50.928022   100.000000
48:               Gasgano          Xexto        NA   0.000000           NA
49:        Ben Quadinaros          Toong  24.46460  24.464602   100.000000
50:            Mace Windu          Human  23.76641  14.243690     5.165416
51:          Ki-Adi-Mundi         Cerean  20.91623  20.916233   100.000000
52:             Kit Fisto       Nautolan  22.64681  22.646814   100.000000
53:             Eeth Koth         Zabrak        NA  13.061224           NA
54:            Adi Gallia     Tholothian  14.76843  14.768431   100.000000
55:           Saesee Tiin       Iktotchi        NA   0.000000           NA
56:           Yarael Poof       Quermian        NA   0.000000           NA
57:              Plo Koon        Kel Dor  22.63468  22.634676   100.000000
58:            Mas Amedda       Chagrian        NA   0.000000           NA
59:          Gregar Typho           <NA>  24.83565  18.483967    34.979424
60:                 Cordé           <NA>        NA  18.483967           NA
61:           Cliegg Lars          Human        NA  14.243690           NA
62:     Poggle the Lesser      Geonosian  23.88844  23.888441   100.000000
63:       Luminara Unduli       Mirialan  19.44637  18.795618    52.919021
64:         Barriss Offee       Mirialan  18.14487  18.795618    47.080979
65:                 Dormé          Human        NA  14.243690           NA
66:                 Dooku          Human  21.47709  14.243690     4.919444
67:   Bail Prestor Organa          Human        NA  14.243690           NA
68:            Jango Fett          Human  23.58984  14.243690     4.857951
69:            Zam Wesell       Clawdite  19.48696  19.486961   100.000000
70:       Dexter Jettster       Besalisk  26.01775  26.017753   100.000000
71:               Lama Su       Kaminoan  16.78076   8.390382   100.000000
72:               Taun We       Kaminoan        NA   8.390382           NA
73:            Jocasta Nu          Human        NA  14.243690           NA
74:                R4-P17          Droid        NA  21.770756           NA
75:            Wat Tambor        Skakoan  12.88625  12.886252   100.000000
76:              San Hill           Muun        NA   0.000000           NA
77:              Shaak Ti        Togruta  17.99015  17.990153   100.000000
78:              Grievous        Kaleesh  34.07922  34.079218   100.000000
79:               Tarfful        Wookiee  24.83746  23.191276    54.838710
80:       Raymus Antilles          Human  22.35174  14.243690     4.857951
81:             Sly Moore           <NA>  15.14960  18.483967    19.753086
82:            Tion Medon         Pau'an  18.85192  18.851918   100.000000
83:                  Finn          Human        NA  14.243690           NA
84:                   Rey          Human        NA  14.243690           NA
85:           Poe Dameron          Human        NA  14.243690           NA
86:                   BB8          Droid        NA  21.770756           NA
87:        Captain Phasma          Human        NA  14.243690           NA
                     name        species       bmi   mean_bmi mass_contrib

Note that the mean bmi here is so small because be calculated it by hand using /.N! So we ignore NAs here…It is rather an example to show a potential use for .N.