Merging data frames: An example with artificial data¶
df1 <- data.frame(
x = c(1,3,2,4,6,5),
y = c(1,1,2,2,2,4)
)
df1
x y
1 1 1
2 3 1
3 2 2
4 4 2
5 6 2
6 5 4
df2 <- data.frame(
a = c(51,42,22),
b = c(1,2,3)
)
df2
a b
1 51 1
2 42 2
3 22 3
In this first attempt at merging, the data frames do not share any variables, hence there is no way of determining which of the rows of the two data frames “belong together”. In such a case each row of the first data frame is matched with each of the second data frame. Hence the number of rows of the result equals the products of the numbers of rows of the two data frames.
df12 <- merge(df1,df2)
df12
x y a b
1 1 1 51 1
2 3 1 51 1
3 2 2 51 1
4 4 2 51 1
5 6 2 51 1
6 5 4 51 1
7 1 1 42 2
8 3 1 42 2
9 2 2 42 2
10 4 2 42 2
11 6 2 42 2
12 5 4 42 2
13 1 1 22 3
14 3 1 22 3
15 2 2 22 3
16 4 2 22 3
17 6 2 22 3
18 5 4 22 3
nrow(df1)
[1] 6
nrow(df2)
[1] 3
nrow(df12)
[1] 18
By explicitly specifying the variables used for matching, the result is different: It contains only rows for which matches can be found in both data frames
merge(df1,df2,by.x="y",by.y="b")
y x a
1 1 1 51
2 1 3 51
3 2 2 42
4 2 4 42
5 2 6 42
With the optional argument all.x=TRUE
the result has a row for each
row from the first data frame, whether or not a match is find for it:
Missing information (from non-existing rows of the second data frame)
is filled up with NA
.
merge(df1,df2,by.x="y",by.y="b",
all.x=TRUE)
y x a
1 1 1 51
2 1 3 51
3 2 2 42
4 2 4 42
5 2 6 42
6 4 5 NA
With all.y=TRUE
the result contains all rows from the second data
frame:
merge(df1,df2,by.x="y",by.y="b",
all.y=TRUE)
y x a
1 1 1 51
2 1 3 51
3 2 2 42
4 2 4 42
5 2 6 42
6 3 NA 22
The argument setting all=TRUE
is equivalent with
all.x=TRUE
and all.y=TRUE
merge(df1,df2,by.x="y",by.y="b",
all=TRUE)
y x a
1 1 1 51
2 1 3 51
3 2 2 42
4 2 4 42
5 2 6 42
6 3 NA 22
7 4 5 NA
- R file: merging-artificial.R
- Rmarkdown file: merging-artificial.Rmd
- Jupyter notebook file: merging-artificial.ipynb
- Interactive version of the Jupyter notebook (shuts down after 60s):
- Interactive version of the Jupyter notebook (sign in required):