Data Processing and Transformation

So far, we have seen different ways to read and store data. Now, let's focus on the kind of data processing and transformation required to perform data analysis and draw insights or build models. Data in its raw form is hardly of any use, so it becomes essential to process it to make it suitable for any useful purpose. This section focuses on many methods in R that have widespread usage during data analysis.

cbind

As the name suggests, it combines two or more vector, matrix, DataFrame, or table by column. cbind is useful when we have more than one vector, matrix, or DataFrame that need to be combined into one for analysis or visualization. The output of cbind varies based on the input data. The following exercise provides a few examples of cbind, which combines two vectors.

Exercise 9: Exploring the cbind Function

In this exercise, we will implement the cbind function to combine two DataFrame objects.

Perform the following steps to complete the exercise:

  1. Generate 16 random numbers drawn from a binomial distribution with parameter size equal to 100 and probability of success equal to 0.4:

    r_numbers <- rbinom(n = 16, size = 100, prob = 0.4)

  2. Next, print the r_numbers values using the following command:

    r_numbers

    The output is as follows:

    ## [1] 38 46 40 42 45 39 37 35 44 39 46 41 31 32 34 43

  3. Select any 16 alphabets from English LETTERS without repetition:

    r_characters <- sample(LETTERS, 18, FALSE)

  4. Now, print the r_characters values using the following command:

    r_characters

    The output is as follows:

    ## [1] "C" "K" "Z" "I" "E" "A" "X" "O" "H" "Y" "T" "B" "N" "F" "U" "V" "S"

    ## [18] "P"

  5. Combine r_numbers and r_characters using cbind:

    cbind(r_numbers, r_characters)

    The output is as follows:

    ## Warning in cbind(r_numbers, r_characters): number of rows of result is not a multiple of vector length (arg 1)

    ## r_numbers r_characters

    ## [1,] "38" "C"

    ## [2,] "46" "K"

    ## [3,] "40" "Z"

    ## [4,] "42" "I"

    ## [5,] "45" "E"

    ## [6,] "39" "A"

    ## [7,] "37" "X"

    ## [8,] "35" "O"

    ## [9,] "44" "H"

    "

  6. Print the class (type of data structure) we obtain after using cbind:

    class(cbind(r_numbers, r_characters))

    The output is as follows:

    ## [1] "matrix"

    Observe a warning message in the output of cbind in the 5th step of this exercise:

    number of rows of result is not a multiple of vector length (arg 1)

    r_numbers r_characters

The error means that the lengths of r_numbers and r_characters are not same (16 and 18, respectively). Note that the cbind() method, unlike as.data.frame(), doesn't throw an error. Instead, it automatically performs what is known as Recycling, and the vector of shorter length gets recycled. In the output, the r_numbers 38 and 48 are recycled from the top to fill the 17th and 18th index.

Consider that we write the following command instead:

cbind(as.data.frame(r_numbers), as.data.frame(r_characters))

It will now throw an error as we had shown earlier in the DataFrame section:

Error in data.frame(..., check.names = FALSE) :

arguments imply differing number of rows: 16, 18

One needs to be careful by always checking for the dimensions and the class of data. Otherwise, it may lead to unwanted results. When we give two vectors, it creates a matrix by default on doing a cbind.

Note

Since we are not setting any seed value, the output of sample and rbinom will differ in each execution of the code.

rbind

rbind is like cbind, but it combines by row instead of column. For rbind to work, the number of columns should be equal in both the DataFrames. It is useful in cases when we want to append an additional set of observations with an existing dataset where all the columns of the original dataset are the same and are in the same order. Let's explore rbind in the following exercise.

Exercise 10: Exploring the rbind Function

In this exercise, we will combine two DataFrames using the rbind function.

Perform the following steps to complete the exercise:

  1. Generate 16 random numbers drawn from a binomial distribution with parameter size equal to 100 and probability of success equal to 0.4:

    r_numbers <- rbinom(n = 18, size = 100, prob = 0.4)

  2. Next, print the r_numbers values:

    r_numbers

    The output is as follows:

    ## [1] 38 46 40 42 45 39 37 35 44 39 46 41 31 32 34 43

  3. Select any 16 alphabets from English LETTERS without repetition:

    r_characters <- sample(LETTERS, 18, FALSE)

  4. Now, print the r_characters using the following command:

    r_characters

    The output is as follows:

    ## [1] "C" "K" "Z" "I" "E" "A" "X" "O" "H" "Y" "T" "B" "N" "F" "U" "V" "S"

    ## [18] "P"

  5. Finally, use the rbind method to print the combined value of r_numbers and r_characters:

    rbind(r_numbers, r_characters)

    The output is as follows:

    ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]

    ## r_numbers "37" "44" "38" "38" "41" "35" "38" "40" "38" "45" "37"

    ## r_characters "Q" "Y" "O" "L" "A" "G" "V" "S" "B" "U" "D"

    ## [,12] [,13] [,14] [,15] [,16] [,17] [,18]

    ## r_numbers "40" "41" "42" "36" "44" "37" "44"

    ## r_characters "R" "T" "P" "F" "X" "C" "I"

From the last step, observe that the rbind function concatenates (binds) the r_numbers and r_characters as two rows of data, unlike cbind, where it was stacked in two columns. Except for the output, all the other rules of cbind apply to rbind as well.

The merge Function

The merge() function in R is particularly useful when there is more than one DataFrame to join using a common column (what we call a primary key in the database world). Merge has two different implementations for the DataFrame and data table, which behave mostly in the same way.

Exercise 11: Exploring the merge Function

In this exercise, we will generate two DataFrames, that is, df_one and df_two, such that the r_numbers column uniquely identifies each row in each of the DataFrame.

Perform the following steps to complete the exercise:

First DataFrame

  1. Use the set.seed() method to ensure that the same random numbers are generated every time the code is run:

    set.seed(100)

  2. Next, generate any 16 random numbers between 1 to 30 without repetition:

    r_numbers <- sample(1:30,10, replace = FALSE)

  3. Generate any 16 characters from the English alphabet with repetition:

    r_characters <- sample(LETTERS, 10, TRUE)

  4. Combine r_numbers and r_characters into one DataFrame named df_one:

    df_one <- cbind(as.data.frame(r_numbers), as.data.frame(r_characters))

    df_one

    The output is as follows:

    ## r_numbers r_characters

    ## 1 10 Q

    ## 2 8 W

    ## 3 16 H

    ## 4 2 K

    ## 5 13 T

    ## 6 26 R

    ## 7 20 F

    ## 8 9 J

    ## 9 25 J

    ## 10 4 R

Second DataFrame

  1. Use the set.seed() method for preserving the same random numbers over multiple runs:

    set.seed(200)

  2. Next, generate any 16 random numbers between 1 to 30 without repetition:

    r_numbers <- sample(1:30,10, replace = FALSE)

  3. Now, generate any 16 characters from the English alphabet with repetition:

    r_characters <- sample(LETTERS, 10, TRUE)

  4. Combine r_numbers and r_characters into one DataFrame named df_two:

    df_two <- cbind(as.data.frame(r_numbers), as.data.frame(r_characters))

    df_two

    The output is as follows:

    ## r_numbers r_characters

    ## 1 17 L

    ## 2 30 Q

    ## 3 29 D

    ## 4 19 Q

    ## 5 18 J

    ## 6 21 H

    ## 7 26 O

    ## 8 3 D

    ## 9 12 X

    ## 10 5 Q

Once we create the df_one and df_two DataFrames using the cbind() function, we are ready to perform some merge (will use the word JOIN, which means the same as merge()).

Now, let's see how different type of joins give different results.

In the world of databases, JOINs are used to combine two or more than two tables using a common primary key. In databases, we use Structured Query Language (SQL) to perform the JOINs. In R, the merge() function helps us with the same functionality as SQL offers in databases. Also, instead of tables, we have DataFrames here, which is again a table with rows and columns of data.

Inner Join

In Exercise 11, Exploring the merge Function, we created two DataFrames: df_one and df_ two. We will now join the two DataFrames using Inner Join. Observe that only the value 26 (row number 7) in the r_numbers column is common between the two DataFrames, where the corresponding character in the r_characters column is R in df_one and character O in df_two. In the output, X corresponds to the df_one DataFrame and Y correspond to the df_two DataFrame.

To merge the df_one and df_two DataFrames using the r_numbers column, use the following command:

merge(df_one, df_two, by = "r_numbers")

## r_numbers r_characters.x r_characters.y

## 1 26 R O

Left Join

Left Join gives all the values of df_one in the r_numbers column and adds <NA> as a value wherever the corresponding value in df_two is not found. For example, for r_number = 2, there is no value in df_two, whereas for r_number = 26, values in df_one and df_two, for the r_characters column is R and O, respectively.

To merge the df_one and df_two DataFrames using the r_numbers column, use the following command:

merge(df_one, df_two, by = "r_numbers", all.x = TRUE)

## r_numbers r_characters.x r_characters.y

## 1 2 K <NA>

## 2 4 R <NA>

## 3 8 W <NA>

## 4 9 J <NA>

## 5 10 Q <NA>

## 6 13 T <NA>

## 7 16 H <NA>

## 8 20 F <NA>

## 9 25 J <NA>

## 10 26 R O

Right Join

Right Joins works just like Left Join, except for that the values in the r_character columns of df_one are <NA> wherever a match is not found. Again, r_numbers = 26 is the only match.

To merge the df_one and df_two DataFrames using the r_numbers column, use the following command:

merge(df_one, df_two, by = "r_numbers", all.y = TRUE)

## r_numbers r_characters.x r_characters.y

## 1 3 <NA> D

## 2 5 <NA> Q

## 3 12 <NA> X

## 4 17 <NA> L

## 5 18 <NA> J

## 6 19 <NA> Q

## 7 21 <NA> H

## 8 26 R O

## 9 29 <NA> D

## 10 30 <NA> Q

Full Join

Unlike Left and Right Join, Full Join gives all the unique values of the r_numbers column from both the DataFrames and adds <NA> in the r_characters column from the respective DataFrame. Observe that only the r_number = 26 row has values from both the DataFrame.

To merge the df_one and df_two DataFrames using the r_numbers column, use the following command:

merge(df_one, df_two, by = "r_numbers", all = TRUE)

## r_numbers r_characters.x r_characters.y

## 1 2 K <NA>

## 2 3 <NA> D

## 3 4 R <NA>

## 4 5 <NA> Q

## 5 8 W <NA>

## 6 9 J <NA>

## 7 10 Q <NA>

## 8 12 <NA> X

## 9 13 T <NA>

## 10 16 H <NA>

## 11 17 <NA> L

## 12 18 <NA> J

## 13 19 <NA> Q

The reshape Function

Data is known to be in a wide format if each subject has only a single row, with each measurement present as a different variable or column. Similarly, it is a long format if each measurement has a single observation (thus, multiple rows per subject). The reshape function is used often to convert between wide and long formats for a variety of operations to make the data useful for computation or analysis. In many visualizations, we use reshape() to convert wide format to long and vice versa.

We will use the Iris dataset. This dataset contains variables named Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, whose measurements are given in centimeters, for 50 flowers from each of 3 species of Iris, namely setosa, versicolor, and virginica.

Exercise 12: Exploring the reshape Function

In this exercise, we will explore the reshape function.

Perform the following steps to complete the exercise:

  1. First, print the top five rows of the iris dataset using the following command:

    head(iris)

    The output of the previous command is as follows:

    ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

    ## 1 5.1 3.5 1.4 0.2 setosa

    ## 2 4.9 3.0 1.4 0.2 setosa

    ## 3 4.7 3.2 1.3 0.2 setosa

    ## 4 4.6 3.1 1.5 0.2 setosa

    ## 5 5.0 3.6 1.4 0.2 setosa

    ## 6 5.4 3.9 1.7 0.4 setosa

  2. Now, create a variable called Type based on the following condition. When Sepal.Width > 2 and Sepal Width <= 3, we will assign TYPE 1 or TYPE 2. The type column is for demo purpose only and has no particular logic:

    iris$Type <- ifelse((iris$Sepal.Width>2 & iris$Sepal.Width <=3),"TYPE 1","TYPE 2")

  3. Store the Type, Sepal.Width, and Species columns in the df_iris DataFrame:

    df_iris <- iris[,c("Type","Sepal.Width","Species")]

  4. Next, reshape df_iris into wide DataFrame using the following reshape command:

    reshape(df_iris,idvar = "Species", timevar = "Type", direction = "wide")

    The output is as follows:

    ## Species Sepal.Width.TYPE 2 Sepal.Width.TYPE 1

    ## 1 setosa 3.5 3.0

    ## 51 versicolor 3.2 2.3

    ## 101 virginica 3.3 2.7

    You will get a warning while running the reshape command, saying as follows:

    multiple rows match for Type=TYPE 2: first taken multiple rows match for Type=TYPE 1: first taken

This warning means there were multiple values for Type 1 and Type 2 for the three species, so the reshape has picked the first occurrence of each of the species. In this case, the 1, 51, and 101 row numbers. We will now see how we could handle this transformation better in the aggregate function.

The aggregate Function

Aggregation is a useful method for computing statistics such as count, averages, standard deviations, and quartiles, and it also allows for writing a custom function. In the following code, the formula (formula is a name of the data structure in R, not a mathematical equation) for each Iris species computes the mean of the numeric measures sepal and petal width and length. The first of the aggregate function argument is a formula that takes species and all the other measurements to compute the mean from all the observations.

aggregate(formula =. ~ Species, data = iris, FUN = mean)

The output of the previous command is as follows:

## Species Sepal.Length Sepal.Width Petal.Length Petal.Width

## 1 setosa 5.006 3.428 1.462 0.246

## 2 versicolor 5.936 2.770 4.260 1.326

## 3 virginica 6.588 2.974 5.552 2.026