Chapter 2 Data preprocessing

2.1 Importing data

2.1.1 CSV import

CSV (comma-separated values) is a common plain-text data format. The read.csv() function imports a CSV file into a data frame.

Note: Replace “data.csv” with the path to your own CSV file.

data_csv <- read.csv("data.csv")

2.1.2 Excel import

To import data from Excel files, you can use the readxl package.

Note: Replace “data.xlsx” with the path to your own Excel file.

library(readxl)
data_xlsx <- read_excel("data.xlsx")

2.1.3 Multiple dataset import

To efficiently import and concatenate multiple datasets in R, follow these steps:

  1. List CSV files in directory: Use the list.files function to create a list of CSV files in the specified directory. Ensure that the data files are the only items in the directory.

Note: Here, the files within the previously set working directory are used.


file.list <- list.files(
  path = ".",            # "." represents the current working directory
  pattern = "\\.csv$",   # Select files ending with the ".csv" extension
  recursive = TRUE       # Include files in subdirectories
)
  2. Filter empty files: Identify and filter out empty files from the list, so that only files with actual content are considered.
file.list <- file.list[file.info(file.list)$size > 10]  # keep files larger than 10 bytes
  3. Check for duplicate filenames: Verify whether any filenames are duplicated, which might indicate that the data folder has been replicated. If duplicates exist, clean the folder before proceeding.
file.list1 <- basename(file.list)  # strip directory paths, keeping only the filenames
if (length(file.list) != length(unique(file.list1))) {
  stop("Duplicate filenames found. Remove duplicates and rerun the script.")
}

Note: The process of checking for duplicates and cleaning the folder is crucial for accurate data consolidation.

  4. Import and concatenate: Use the ldply function from the plyr package to read and row-bind all CSV files into a single data frame. Install the plyr package first if it is not already installed.
if (!require(plyr)) install.packages("plyr")
library(plyr)
merged_data <- ldply(.data = file.list, .fun = read.csv, .inform = TRUE, header = TRUE)  # use read.csv2 instead for semicolon-separated files

The resulting merged_data contains the combined dataset from all CSV files.
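
A quick check of the merged object helps confirm that the import worked as expected; the column names and types shown will depend on your own files.

# Inspect the merged dataset
dim(merged_data)    # number of rows and columns
head(merged_data)   # first few rows
str(merged_data)    # column names and types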

2.2 Exporting data

2.2.1 CSV export

The write.csv() function can be used to export data frames to CSV files.

Note: Replace “exported_data.csv” with your preferred file name.

write.csv(data_csv, "exported_data.csv")
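
By default, write.csv() adds a column of row names to the output file. If you do not want this column, the row.names argument can be set to FALSE.

# Export without the row-name column
write.csv(data_csv, "exported_data.csv", row.names = FALSE)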

2.2.2 Excel export

For Excel file exports, you can use the writexl package.

Note: Replace “exported_data.xlsx” with your preferred file name.

library(writexl)
write_xlsx(data_xlsx, "exported_data.xlsx")

2.3 Data cleaning

2.3.1 Drop variables

You can drop variables using either base R or the dplyr package; the dplyr approach is shown first, with a base R alternative sketched below.

Note: Replace “var1” and “var2” with the variables you want to drop.


library(dplyr)
data2 <- select(data_csv, -c(var1, var2))
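
If you prefer not to load dplyr, base R can do the same thing. This is a minimal sketch; var1 and var2 are placeholders, as above.

# Base R: keep every column except var1 and var2
data2 <- data_csv[, !(names(data_csv) %in% c("var1", "var2"))]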

2.3.2 Drop rows based on conditions

You can drop rows that do not meet certain conditions.

Note: Replace “var1” with the variable based on which you want to filter rows, and replace “10” with your filtering threshold.

library(dplyr)
data2 <- filter(data_csv, var1 >= 10)

2.3.3 Handling missing values

2.3.3.1 Removing NA values

You can remove rows containing NA values.

library(tidyr)
data2 <- drop_na(data_csv)
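
Base R offers an equivalent through na.omit() or complete.cases(); this sketch performs the same row-wise deletion without tidyr.

# Base R: drop rows containing at least one NA
data2 <- na.omit(data_csv)
data2 <- data_csv[complete.cases(data_csv), ]  # equivalent, using complete.cases()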

2.3.3.2 Replacing NA values

You can replace NA values with a specific value.

Note: Replace “var1” and “var2” with the variables for which you want to replace NA values.

library(tidyr)
data2 <- replace_na(data_csv, replace = list(var1 = 0, var2 = 1))

2.4 Data transformation

2.4.1 Transform variable characteristics

You can convert variables to different types.

Note: Replace “var1” and “var2” with the variables you want to convert.

# To numeric
data_csv$var1 <- as.numeric(data_csv$var1)

# To character
data_csv$var2 <- as.character(data_csv$var2)

2.4.2 Variable renaming

You can rename variables for easier interpretation.

2.4.2.1 Renaming variables manually

Renaming variables is a common data preparation task. It can make the data more readable and compatible with the specific naming conventions you intend to follow.

Note: Replace the name pairs in the rename() call according to your dataset. The left-hand side is the new name and the right-hand side is the existing name.

library(dplyr)

# Renaming columns in a data frame
data_frame <- data_frame %>%
  rename(
    new_var1 = old_var1,
    new_var2 = old_var2
  )

2.4.2.2 Renaming variables based on data

When your dataset is large or you have a codebook, it’s efficient to programmatically rename variables.

Note: Replace codebook_data and data_frame with your own codebook and data frame, respectively. The old and new names should be stored in corresponding columns of the codebook.

# Using a codebook to rename columns
codebook_data <- read.csv("your_codebook.csv")
rename_list <- list(
  oldnames = codebook_data$old_names,
  newnames = codebook_data$new_names
)
# Find where each old name occurs in the data frame and rename only the matches
matched_positions <- match(rename_list$oldnames, names(data_frame))
valid_positions <- which(!is.na(matched_positions))
names(data_frame)[matched_positions[valid_positions]] <- rename_list$newnames[valid_positions]

2.4.2.3 Renaming cell content

Sometimes, the cell content needs to be renamed to be more interpretable.

Note: Replace gender with the specific column where you need to recode the values. Specify the old and new values in the case_when() function.

library(dplyr)

# Renaming cell content in a data frame
data_frame <- data_frame %>%
  mutate(
    gender = case_when(
      gender == 1 ~ "Male",
      gender == 2 ~ "Female",
      TRUE ~ as.character(gender)
    )
  )

2.4.2.4 Variable reversing (recoding)

Variable reversing is often required when different items in a survey are scaled in opposite directions.

Note: The expression abs(8 - data_frame[, reverse_columns]) is specific to a 7-point Likert scale scored 1 to 7. Replace 8 with max_scale_value + 1, where max_scale_value is the maximum value on your scale (assuming the scale starts at 1).

# Identify the columns whose names end in "_rev"
reverse_columns <- grepl("_rev$", colnames(data_frame))
# Reverse the values in the identified columns (8 - x for a 1-7 scale)
data_frame[, reverse_columns] <- abs(8 - data_frame[, reverse_columns])

2.4.2.5 Reversing items based on data

In some situations, you may need to reverse variables based on certain conditions or a codebook.

Note: Replace codebook_data and data_frame with your own data. The codebook should indicate which variables to reverse; in the example below, the character string "yes" marks reversed items. As above, 8 - x assumes a 7-point scale starting at 1.

# Using a codebook to reverse variables
cols_to_reverse <- codebook_data$variable_name[trimws(tolower(codebook_data$to_reverse)) == "yes"]
for (col in cols_to_reverse) {
  if (col %in% names(data_frame)) {
    # Reverse only non-missing, non-negative responses
    positive_indices <- which(data_frame[[col]] >= 0 & !is.na(data_frame[[col]]))
    data_frame[positive_indices, col] <- 8 - data_frame[positive_indices, col]
  }
}

2.5 Data imputation

2.5.1 Perform mean substitution

Handling missing data is crucial. Simple techniques include replacing missing values with the mean or median of the variable.

Note: The mutate(across(where(is.numeric), ...)) call replaces NA values in every numeric column with that column's mean. Replace data_frame with your dataset.

library(dplyr)

# Performing mean substitution to handle missing values
data_frame_noNA <- data_frame %>%
  mutate(across(where(is.numeric), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))

2.5.2 Multiple imputation with MICE in R

Multiple imputation is a statistical technique for handling missing data: several plausible imputed datasets are created, each is analysed, and the results are pooled. The mice package in R performs multiple imputation using chained equations.

Note: The mice() function in the mice package creates multiple imputed datasets (m) based on the pattern of missingness in the data. The complete() function then extracts the first completed dataset for analysis, but all imputed sets should typically be used for a proper multiple imputation analysis.

# Ensure the mice package is installed and then load it
if (!require(mice)) install.packages("mice")
library(mice)

# Perform multiple imputation on your dataset
imputed_data <- mice(data_frame, m = 5, maxit = 50, method = 'pmm', seed = 500)

# Extract the first imputed dataset
completed_data <- complete(imputed_data, 1)

# Note: To perform a proper multiple imputation analysis,
# use all imputed datasets, typically by pooling the results.

The mice() function’s parameters include m for the number of datasets to impute, maxit for the maximum number of iterations for the imputations, method for the imputation method (with ‘pmm’ indicating predictive mean matching), and seed for the random number generation to ensure reproducibility.
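
To illustrate the pooling step, the same analysis can be fitted to each imputed dataset with with() and combined with pool(), both from the mice package. This is a minimal sketch; the formula outcome ~ predictor is a placeholder for your own model.

# Fit the same model to each of the m imputed datasets
fit <- with(imputed_data, lm(outcome ~ predictor))

# Pool the estimates across imputations using Rubin's rules
pooled <- pool(fit)
summary(pooled)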

2.6 Data manipulation

Data manipulation is an essential skill in data analysis. This section provides examples of joining, filtering, sorting, grouping, and summarizing data in R.

2.6.1 Joins

2.6.1.1 Inner join

An inner join returns only the rows that have matching values in both tables.

Note: Replace “key” with the common variable existing in both “data1” and “data2”.

library(dplyr)
merged_data <- inner_join(data1, data2, by = "key")
Table 1
ID Value
1 A
2 B
3 C
+
Table 2
ID Value
2 B
3 C
4 D
=
Result of Inner Join
ID Value.x Value.y
2 B B
3 C C

2.6.1.2 Left and right join

A left join returns all records from the left table and the matched records from the right table. Here, “key” should be replaced with the column name that is common to both “data1” and “data2” and serves as the linking key between the two tables. If you wish to return all records from the right table and the matched records from the left table, a right join (right_join) is appropriate; a short sketch follows the illustration below.

library(dplyr)
merged_data <- left_join(data1, data2, by = "key")
Table 1
ID Value
1 A
2 B
3 C
+
Table 2
ID Value
2 B
3 C
4 D
=
Result of Left Join
ID Value.x Value.y
1 A NA
2 B B
3 C C
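
For the right join mentioned above, only the function name changes; this sketch assumes the same data1, data2, and "key" placeholders.

library(dplyr)
# Keep all rows of data2 and only the matching rows of data1
merged_data <- right_join(data1, data2, by = "key")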

2.6.1.3 Full join

A full join returns all records from both tables, matching rows where possible and filling the remaining cells with NA.

library(dplyr)
merged_data <- full_join(data1, data2, by = "ID")
Table 1
ID Value
1 A
2 B
3 C
+
Table 2
ID Value
2 B
3 C
4 D
=
Result of Full Join
ID Value.x Value.y
1 A NA
2 B B
3 C C
4 NA D

2.6.2 Filtering

Filtering is a way to select specific rows based on conditions.

Note: Replace “var1” with the variable based on which you want to filter rows. Replace “10” with your threshold value.

# Base R
data_filtered <- data_csv[data_csv$var1 >= 10, ]

# dplyr
data_filtered <- filter(data_csv, var1 >= 10)

2.6.3 Sorting

Sorting allows you to organize your data frame based on specific variables. This is especially useful when you have a multi-dimensional dataset and want to prioritize certain variables for analysis. I use the mtcars dataset as an example.

Note: Replace “mtcars” with the name of your data frame, and replace “cyl”, “gear”, etc. with the variables you wish to sort by. For a custom sort order, replace the gear_order vector in the last example with your own vector of levels.

2.6.3.1 Sorting single variable

To sort a data frame by a single variable, use the arrange function from the dplyr package.

library(dplyr)
# Sort the mtcars dataset by cyl in ascending order
data_sorted_single <- mtcars %>%
  arrange(cyl)
# Display the first few rows of the sorted data
head(data_sorted_single)
Table 2.1: A table of mtcars sorted by cyl in ascending order.
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

2.6.3.2 Sorting multiple variables

You can also sort by multiple variables. The data will be sorted by the first variable first, and then within each group by the second variable, and so on.

# Sort the mtcars dataset by cyl and then by mpg
data_sorted_multi <- mtcars %>%
  arrange(cyl, mpg)
# Display the first few rows of the sorted data
head(data_sorted_multi)
Table 2.2: A table of mtcars sorted by cyl and mpg in ascending order.
mpg cyl disp hp drat wt qsec vs am gear carb
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2

2.6.3.3 Sorting based on custom order

Sometimes, you may need to sort the data in a custom order, not necessarily ascending or descending.

# Define a custom order for the 'gear' variable (gear takes the values 3, 4, and 5 in mtcars)
gear_order <- c(4, 3, 5)

# Sort the mtcars dataset by 'gear' in the custom order
data_sorted_custom <- mtcars %>%
  mutate(gear = factor(gear, levels = gear_order)) %>%
  arrange(gear)
# Display the first few rows of the sorted data
head(data_sorted_custom)
Table 2.3: A table of mtcars sorted by gear in the custom order 4, 3, 5.
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

2.6.4 Grouping

Grouping allows you to categorize your data based on one or more variables and then perform operations on each group separately. This is particularly useful when you want to perform calculations on subsets of data (e.g., subscales within a scale).

The group_by function is used to split the data into groups based on one or more variables.

library(dplyr)
grouped_data <- mtcars %>%
  group_by(cyl)

After grouping, you can summarize each group’s data, which is shown in the subsequent subchapter.

2.6.5 Summarizing

After grouping, you can calculate a summary statistic (here, the mean) for each group. The summarise function applies summary functions to each of the previously created groups, in this case each value of cyl.

# Calculate the average horsepower (hp) for each group of cylinders
summary_data <- grouped_data %>%
  summarise(avg_hp = mean(hp, na.rm = TRUE))
# Display the summarized data
summary_data
Table 2.4: A table of average horsepower (hp) for each group of cylinders.
cyl avg_hp
4 82.63636
6 122.28571
8 209.21429