This post is a translation from «Unir datos – un repaso de las diferencias entre merge, inner join, left join, right join, full join, cbind y rbind cuando se usa objetos tipo data.table en R«, in response to a request in TW, so pardon my english.
This week someone ask me how to make joins using data.table objects, this person was hesitating whether to use: merge, rbind o cbind; so I’ve made this blog post leveraging the script that we used to explain the differences.
When we want to join two datasets usually do one of this:
- Add Rows: Increase the rows of a dataset under the other.
- Add Columns: Increase the columns of a data set to another.
- Join (vlookup): In this case we have some columns or variables as «key» or «id», with this columns or variables the data from the first set is added to the second when the «key» or «id» in both datasets is the same. Vlookup is a MSExcels users’s term but actually is a particular case of Join which is the right term for computacional people. There are different types of Join, in summary:
- Inner Join: Returns only the data wich has «matched keys» in both datasets.
- Left Join: Return all data from the left dataset and the data with matched key from the right dataset (vlookup is a left join).
- Right Join: Return all data from the right dataset and the data with matched key from the left dataset.
- Full Join: Returns all data from both datasets, obviously combining the data from the matched keys
The data.table package is an excellent choice to perform tasks more efficiently in R, but to learn how to use it a bit of reading and patience is required, you can read their vignettes as a good introduction.
To perform joins you can use the Dt [X] syntax from data.table package or use the merge command as if they were data.frame objects. Dt [X] is more efficient than merge for merge counterpart is more intuitive (at least for the average user of R). To understand a little the Dt [X] syntax you have to know that when you write Dt [X] the software will search the keys in Dt object based on the X’s key, ie, the basis for the merge is the object X
The following script shows how to make Inner Join, Left Join (vlookup), Right Join, Full Join, add columns and rows using 3 data.table object type (note that the output starts with a # # #, as you see when used knitr)
# Load package library(data.table) # We will create three data.frames with two 'id columns': 'id1' and 'id2' # but with different columns between them, also the 'ids' for each # data.frames will have matching and different elements df1 <- data.frame(id1 = c(11, 12, 13, 14), id2 = c(21, 22, 23, 24), c11 = c(1, 2, 3, 4), c12 = c("a", "b", "c", "d")) df2 <- data.frame(id1 = c(99, 12, 13, 14), id2 = c(98, 22, 23, 24), c21 = c(5, 6, 7, 8), c22 = c("e", "f", "g", "h")) df3 <- data.frame(id1 = c(999, 9999, 13, 14), id2 = c(998, 9998, 23, 24), c31 = c(9, 10, 11, 12), c32 = c("i", "j", "k", "l")) # Show data.frames df1 ## id1 id2 c11 c12 ## 1 11 21 1 a ## 2 12 22 2 b ## 3 13 23 3 c ## 4 14 24 4 d df2 ## id1 id2 c21 c22 ## 1 99 98 5 e ## 2 12 22 6 f ## 3 13 23 7 g ## 4 14 24 8 h df3 ## id1 id2 c31 c32 ## 1 999 998 9 i ## 2 9999 9998 10 j ## 3 13 23 11 k ## 4 14 24 12 l # Transform data.frame to data.table and # set 'key' as id1 e id2 columns Dt1 <- data.table(df1, key = c("id1", "id2")) Dt2 <- data.table(df2, key = c("id1", "id2")) Dt3 <- data.table(df3, key = c("id1", "id2")) # *** INNER JOIN *** # In the Inner Join we want to get the data # wich has "matched keys" in both datasets # Inner join between Dt1 y Dt2 with data.table syntax Dt1[Dt2, nomatch = 0] ## id1 id2 c11 c12 c21 c22 ## 1: 12 22 2 b 6 f ## 2: 13 23 3 c 7 g ## 3: 14 24 4 d 8 h # Inner join between Dt1 y Dt2, using merge command merge(x = Dt1, y = Dt2) ## id1 id2 c11 c12 c21 c22 ## 1: 12 22 2 b 6 f ## 2: 13 23 3 c 7 g ## 3: 14 24 4 d 8 h # Inner join between Dt1, Dt2 and Dt3 with data.table syntax Dt1[Dt2[Dt3, nomatch = 0], nomatch = 0] ## id1 id2 c11 c12 c21 c22 c31 c32 ## 1: 13 23 3 c 7 g 11 k ## 2: 14 24 4 d 8 h 12 l # Inner join between Dt1, Dt2 and Dt3 using Reduce and merge commands Reduce(merge, list(Dt1, Dt2, Dt3)) ## id1 id2 c11 c12 c21 c22 c31 c32 ## 1: 13 23 3 c 7 g 11 k ## 2: 14 24 4 d 8 h 12 l # *** RIGHT JOIN *** # In the Right join we want to get all data from the # right dataset and only the rows with matched key from # the left dataset. # Right join between Dt1 y Dt2 with data.table syntax Dt1[Dt2, nomatch = NA] ## id1 id2 c11 c12 c21 c22 ## 1: 12 22 2 b 6 f ## 2: 13 23 3 c 7 g ## 3: 14 24 4 d 8 h ## 4: 99 98 NA NA 5 e # Right join between Dt1 y Dt2, using merge command merge(x = Dt1, y = Dt2, all.y = TRUE) ## id1 id2 c11 c12 c21 c22 ## 1: 12 22 2 b 6 f ## 2: 13 23 3 c 7 g ## 3: 14 24 4 d 8 h ## 4: 99 98 NA NA 5 e # Right join between Dt1, Dt2 and Dt3 with data.table syntax Dt1[Dt2[Dt3, nomatch = NA], nomatch = NA] ## id1 id2 c11 c12 c21 c22 c31 c32 ## 1: 13 23 3 c 7 g 11 k ## 2: 14 24 4 d 8 h 12 l ## 3: 999 998 NA NA NA NA 9 i ## 4: 9999 9998 NA NA NA NA 10 j # To make a Right Join between several data.tables we are going to # create the mergeRight function which is the "merge command" # setting parameter all.y = TRUE; doing this we will force to show # all data from the right dataset and only data with matched key # from the left dataset mergeRight <- function(x, y) merge(x, y, all.y = TRUE) # Right join between Dt1, Dt2 and Dt3 using Reduce and mergeRight function # Noted how "NAs" are shown where there is no match Reduce(mergeRight, list(Dt1, Dt2, Dt3)) ## id1 id2 c11 c12 c21 c22 c31 c32 ## 1: 13 23 3 c 7 g 11 k ## 2: 14 24 4 d 8 h 12 l ## 3: 999 998 NA NA NA NA 9 i ## 4: 9999 9998 NA NA NA NA 10 j # *** LEFT JOIN *** # In the Left join we want to get all data from the right dataset # and only the rows with matched key from the right dataset # Left join between Dt1 y Dt2 with data.table syntax # To make a Left join in data.table we simply change the order # of the data.tables and make a right join. # You can see that the only problem with this method is the # order of the columns because the Dt2's columns are shown first Dt2[Dt1, nomatch = NA] ## id1 id2 c21 c22 c11 c12 ## 1: 11 21 NA NA 1 a ## 2: 12 22 6 f 2 b ## 3: 13 23 7 g 3 c ## 4: 14 24 8 h 4 d # Left join between Dt1 y Dt2, using merge command merge(x = Dt1, y = Dt2, all.x = TRUE) ## id1 id2 c11 c12 c21 c22 ## 1: 11 21 1 a NA NA ## 2: 12 22 2 b 6 f ## 3: 13 23 3 c 7 g ## 4: 14 24 4 d 8 h # Left join between Dt1, Dt2 and Dt3 with data.table syntax # idem: see the order of columns Dt3[Dt2[Dt1, nomatch = NA], nomatch = NA] ## id1 id2 c31 c32 c21 c22 c11 c12 ## 1: 11 21 NA NA NA NA 1 a ## 2: 12 22 NA NA 6 f 2 b ## 3: 13 23 11 k 7 g 3 c ## 4: 14 24 12 l 8 h 4 d # To make a Left Join between several data.tables we are going # to create the mergeLeft function which is the merge command # setting parameter all.x = TRUE; doing this we will force to # show all data from the left dataset and only data with # matched key from the right dataset mergeLeft <- function(x, y) merge(x, y, all.x = TRUE) # Left join between Dt1, Dt2 and Dt3 using Reduce and mergeLeft function Reduce(mergeLeft, list(Dt1, Dt2, Dt3)) ## id1 id2 c11 c12 c21 c22 c31 c32 ## 1: 11 21 1 a NA NA NA NA ## 2: 12 22 2 b 6 f NA NA ## 3: 13 23 3 c 7 g 11 k ## 4: 14 24 4 d 8 h 12 l # *** FULL JOIN *** # In the Full join we want to get all data from both datasets # regardless of whether or not there are matched keys # Full join between Dt1 y Dt2 with data.table syntax # To make a "Full Join" with data.table syntax we'll going to # create an object that contains all the ids we have and use it # as the index in our join, like this: # 1.- Create a variable from the union of 'ids/keys' ids <- unique(rbind(Dt1[, list(id1, id2)], Dt2[, list(id1, id2)])) ids ## id1 id2 ## 1: 11 21 ## 2: 12 22 ## 3: 13 23 ## 4: 14 24 ## 5: 99 98 # 2.- Full join between Dt1 y Dt2 with data.table syntax # (using ids object) Dt1[Dt2[ids]] ## id1 id2 c11 c12 c21 c22 ## 1: 11 21 1 a NA NA ## 2: 12 22 2 b 6 f ## 3: 13 23 3 c 7 g ## 4: 14 24 4 d 8 h ## 5: 99 98 NA NA 5 e # Full join between Dt1 y Dt2, using merge command merge(x = Dt1, y = Dt2, all = TRUE) ## id1 id2 c11 c12 c21 c22 ## 1: 11 21 1 a NA NA ## 2: 12 22 2 b 6 f ## 3: 13 23 3 c 7 g ## 4: 14 24 4 d 8 h ## 5: 99 98 NA NA 5 e # Full join between Dt1, Dt2 and Dt3 with data.table syntax # 1.- Create a variable from the union of 'ids/keys' ids <- unique(rbind(Dt1[, list(id1, id2)], Dt2[, list(id1, id2)], Dt3[, list(id1, id2)])) ids ## id1 id2 ## 1: 11 21 ## 2: 12 22 ## 3: 13 23 ## 4: 14 24 ## 5: 99 98 ## 6: 999 998 ## 7: 9999 9998 # 2.- Full join between Dt1 y Dt2 with data.table syntax # (using ids object) Dt1[Dt2[Dt3[ids]]] ## id1 id2 c11 c12 c21 c22 c31 c32 ## 1: 11 21 1 a NA NA NA NA ## 2: 12 22 2 b 6 f NA NA ## 3: 13 23 3 c 7 g 11 k ## 4: 14 24 4 d 8 h 12 l ## 5: 99 98 NA NA 5 e NA NA ## 6: 999 998 NA NA NA NA 9 i ## 7: 9999 9998 NA NA NA NA 10 j # To make a Full Join between several data.tables we are going # to create the mergeFull function which is the merge command # setting parameter all = TRUE; doing this we will force to # show all data from both dataset mergeFull <- function(x, y) merge(x, y, all = TRUE) # Full join between Dt1, Dt2 and Dt3 using Reduce and mergeFull function Reduce(mergeFull, list(Dt1, Dt2, Dt3)) ## id1 id2 c11 c12 c21 c22 c31 c32 ## 1: 11 21 1 a NA NA NA NA ## 2: 12 22 2 b 6 f NA NA ## 3: 13 23 3 c 7 g 11 k ## 4: 14 24 4 d 8 h 12 l ## 5: 99 98 NA NA 5 e NA NA ## 6: 999 998 NA NA NA NA 9 i ## 7: 9999 9998 NA NA NA NA 10 j # *** PARA AUMENTAR FILAS - RBIND *** # The rbind command is used to increase the rows of # a dataset under the other, the following will fail # since the variables are not the same name rbind(Dt1, Dt2) ## Error: Some colnames of argument 2 (c21,c22) are not present in colnames ## of item 1. If an argument has colnames they can be in a different order, ## but they must all be present. Alternatively, you can drop names (by using ## an unnamed list) and the columns will then be joined by position. Or, set ## use.names=FALSE. # *** PARA AUMENTAR COLUMNAS - CBIND *** # The command cbind columns is used to increase the columns # of a data set to another. Here you can see how the cbind # (without options) does not respect the indices simply # puts a whole dataset beside the other (see the order of "ids" ) cbind(Dt1, Dt2) ## id1 id2 c11 c12 id1 id2 c21 c22 ## 1: 11 21 1 a 12 22 6 f ## 2: 12 22 2 b 13 23 7 g ## 3: 13 23 3 c 14 24 8 h ## 4: 14 24 4 d 99 98 5 e cbind(Dt1, Dt2, Dt3) ## id1 id2 c11 c12 id1 id2 c21 c22 id1 id2 c31 c32 ## 1: 11 21 1 a 12 22 6 f 13 23 11 k ## 2: 12 22 2 b 13 23 7 g 14 24 12 l ## 3: 13 23 3 c 14 24 8 h 999 998 9 i ## 4: 14 24 4 d 99 98 5 e 9999 9998 10 j
Created by Pretty R at inside-R.org
The perceptive reader will have noticed how to do a Left Join without having the problem with the order of the columns… I leave you the question.