21. Combining dataframes

The ability to combine datasets from different original files is often a difficult and poorly documented process. It usually comes down to opening the files and “copying and pasting” the data from one file to another.  From a reproducible research perspective this is very poor practice as there is no real documentation.  In addition, it is tedious and fraught with the risk of mistakes.  These issues are magnified if it is a process that is repeated often, for example merging individual files of each subject in a trial.  Using R to merge or combine datasets obviates many of these problems (i) a script is the source documentation, (ii) a script cannot suffer from errors of copy/paste nature, (iii) a script can be easily ran repeatedly and/or (iv) modified to do the same or similar job for other datasets (iv) if a mistake is noted (see point (i)), or more data comes in, then the script can be fixed and re-ran with very little effort.  These are especially useful properties when it comes to merging data.

 

Binding columns (cbind)

The simplest process is combining a new dataset which contains new variable(s) we wish to added to corresponding data original dataset.  In other words joining two dataframes by column – obviously they need the same number of rows.  For example, the eye colour of the subjects in the data we have been working with in this course.  Download the IDdata.csv file which contains subject ID’s and eye colours.  The original recruit’s data didn’t have subject ID’s in it, so we assume they are match correctly.  Import it into R and then have a look at it:

head(IDdata)

str(IDdata)

unique(IDdata$eye)

plot(ID ~ eye, data=IDdata)

These functions and simple plot show that there are 205 entries and there are 3 values for eye. These are 1=Blue, 2=Brown, 3=Other.  If you haven’t done so re-read in your latest saved version of the recuirts dataframe (saved as data.csv).  Now we can bind the two dataframes using the cbind(<dataframe1, dataframe2, dataframe3, etc…) function:

data <- cbind(IDdata, data)

head(data)

Now assign some factors to give eye colour some meaning:

unique(data$eye)  # checking that there are still only 3 values. Yes I am paranoid

data$eyef[data$eye == "1"] <- "Blue"

data$eyef[data$eye == "2"] <- "Brown"

data$eyef[data$eye == "3"] <- "Other"

data$eyef <- as.factor(data$eyef)

Factor_summary <- with(data, table(eye, eyef))

Factor_summary

head(data)

plot(ID ~ eyef, data=IDdata)

 

Now save the new data, lets give it a new name when we save it (just to make sure you can!):

write.csv(data, file="data_eye_ID.csv", na = "NA", row.names=F)  # saving the new updated data

 

Binding rows (rbind)

The next case is simply joining sets of data which are identical in the variables they contain.  For example, sets of data from several subjects.  In this example we will pretend that we obtained the original Recruits.csv data in two halves – say from two trial sites.  Download the Recruits1_100.csv and Recruits101_205.csv files, then read them in and check them out a little bit.

data1_100 <- read.csv(file = 'Recruits1_100.csv', header = TRUE, sep = ',', na.strings=c("",".","NA"))

data101_205 <- read.csv(file = 'Recruits101_205.csv', header = TRUE, sep = ',', na.strings=c("",".","NA"))

head(data1_100)

head(data101_205)   #make sure the headings match

now bind them together using the rbind(<dataframe1, dataframe2, dataframe3, etc…) function and do a little bit of data integrity checking:

data1_205 <- rbind(data1_100, data101_205)

dim(data1_100)

dim(data101_205)

dim(data1_205)

dim(data)  # check lengths match /add up as expected to the 205 of the original data

 

You might notice that the combined dataframe (data1_205) has more columns than the original data (data) – this is because earlier we created several factors. If you compare the dataframes using the head() commands, you’ll see the extra factor columns.