Chapter 2 Data preprocessing
2.1 Importing data
2.1.1 CSV import
CSV (Comma Separated Values) is a common data format. The read.csv()
function can import CSV files.
Note: Replace “data.csv” with the path to your own CSV file.
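A minimal sketch of the import, assuming a file named "data.csv" in the current working directory:

```r
# Import a CSV file into a data frame
data_frame <- read.csv("data.csv", header = TRUE, stringsAsFactors = FALSE)

# Inspect the first rows and the structure of the imported data
head(data_frame)
str(data_frame)
```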
2.1.2 Excel import
To import data from Excel files, you can use the readxl
package.
Note: Replace “data.xlsx” with the path to your own Excel file.
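A sketch using the readxl package; "data.xlsx" and the sheet number are placeholders:

```r
# Install and load readxl if it is not already available
if (!require(readxl)) install.packages("readxl")
library(readxl)

# Import the first sheet of an Excel file
data_frame <- read_excel("data.xlsx", sheet = 1)
```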
2.1.3 Multiple dataset import
To efficiently import and concatenate multiple datasets in R, follow these steps:
- List CSV files in directory:
Use the list.files function to create a vector of CSV file paths in the specified directory. Ensure that the data files are the only items in the directory.
Note: Here, files within the previously set working directory are used.
file.list <- list.files(
  path = ".",           # "." represents the current working directory
  pattern = "\\.csv$",  # select files whose names end in ".csv"
  recursive = TRUE      # include files in subdirectories
)
- Filter Empty Files: Identify and filter out empty files from the list. This step helps to ensure that only meaningful data is considered.
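One way to implement this step, assuming file.list was created with list.files() as above, is to check file sizes:

```r
# Keep only files that actually contain data (size greater than zero bytes)
file.list <- file.list[file.size(file.list) > 0]
```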
- Check for Duplicate Filenames: Verify if there are any duplicate filenames, which might indicate having the data folder replicated. If duplicates exist, consider cleaning the folder.
file.list1 <- gsub(".*\\/", "", file.list)  # strip directory paths, keeping only filenames
if (length(file.list) != length(unique(file.list1))) {
  stop("Duplicate filenames found. Remove duplicates and rerun the script.")
}
Note: The process of checking for duplicates and cleaning the folder is crucial for accurate data consolidation.
- Import and Concatenate: Use the ldply function from the plyr package to read and concatenate all CSV files into a single dataset. Make sure to install the plyr package if it is not already installed.
if (!require(plyr)) install.packages("plyr")
library(plyr)
merged_data <- ldply(.data = file.list, .fun = read.csv2, .inform = TRUE, header = TRUE, sep = ",")
The resulting merged_data contains the combined dataset from all CSV files.
2.2 Exporting data
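Exporting mirrors the import examples. A minimal sketch, where data_frame and "output.csv" are placeholders:

```r
# Export a data frame to a CSV file; row.names = FALSE drops the row-name column
write.csv(data_frame, "output.csv", row.names = FALSE)
```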
2.3 Data cleaning
2.3.1 Drop variables
You can drop variables using base R or the dplyr package.
Note: Replace “var1” and “var2” with the variables you want to drop.
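A sketch of both approaches; var1 and var2 are placeholder column names, and the two options are alternatives:

```r
# Option 1 (base R): drop variables by assigning NULL
data_frame$var1 <- NULL
data_frame$var2 <- NULL

# Option 2 (dplyr): drop variables with select() and the minus sign
library(dplyr)
data_frame <- data_frame %>% select(-var1, -var2)
```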
2.3.2 Drop rows based on conditions
You can drop rows that do not meet certain conditions.
Note: Replace “var1” with the variable based on which you want to filter rows, and replace “10” with your filtering threshold.
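A sketch where var1 and the threshold 10 are placeholders; rows not meeting the condition are dropped:

```r
library(dplyr)
# Keep only rows where var1 exceeds the threshold
data_frame <- data_frame %>% filter(var1 > 10)

# Base R equivalent
data_frame <- subset(data_frame, var1 > 10)
```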
2.4 Data transformation
2.4.1 Transform variable characteristics
You can convert variables to different types.
Note: Replace “var1” and “var2” with the variables you want to convert.
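A sketch converting two placeholder variables to different types:

```r
# Convert var1 to numeric and var2 to factor (var1 and var2 are placeholders)
data_frame$var1 <- as.numeric(data_frame$var1)
data_frame$var2 <- as.factor(data_frame$var2)

# Verify the new types
str(data_frame)
```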
2.4.2 Variable renaming
You can rename variables for easier interpretation.
2.4.2.1 Renaming variables manually
Renaming variables is a common data preparation task. It can make the data more readable and compatible with the specific naming conventions you intend to follow.
Note: Replace the key-value pairs in the rename() function according to your dataset. The key is the new name and the value is the existing name.
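A sketch with hypothetical names, where age and score are the new names and var1 and var2 the existing ones:

```r
library(dplyr)
# rename(new_name = old_name)
data_frame <- data_frame %>%
  rename(
    age   = var1,  # "age" is the new name, "var1" the existing one
    score = var2
  )
```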
2.4.2.2 Renaming variables based on data
When your dataset is large or you have a codebook, it’s efficient to programmatically rename variables.
Note: Replace codebook_data and data_frame with your own codebook and data frame, respectively. The old and new names should correspond in the codebook.
# Using a codebook to rename columns
codebook_data <- read.csv("your_codebook.csv")
rename_list <- list(
  oldnames = codebook_data$old_names,
  newnames = codebook_data$new_names
)

# Match old names to column positions in the data frame and rename only valid matches
matched_positions <- match(rename_list$oldnames, names(data_frame))
valid_positions <- which(!is.na(matched_positions))
names(data_frame)[matched_positions[valid_positions]] <- rename_list$newnames[valid_positions]
2.4.2.3 Renaming cell content
Sometimes, the cell content needs to be renamed to be more interpretable.
Note: Replace gender with the specific column where you need to recode the values. Specify the old and new values in the case_when() function.
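A sketch assuming gender is coded numerically; the codes 1 and 2 are assumptions:

```r
library(dplyr)
# Recode numeric gender codes into readable labels
data_frame <- data_frame %>%
  mutate(gender = case_when(
    gender == 1 ~ "male",
    gender == 2 ~ "female",
    TRUE ~ NA_character_  # anything else becomes missing
  ))
```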
2.4.2.4 Variable reversing (recoding)
Variable reversing is often required when different items in a survey are scaled in opposite directions.
Note: The abs(8 - data_frame[, reverse_columns]) is specific to a 7-point Likert scale. Replace 8 with max_scale_value + 1, where max_scale_value is the maximum value on your scale.
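A sketch for a 7-point scale; item2 and item5 are placeholder names of the reversed items:

```r
# Columns that need reversing (placeholders)
reverse_columns <- c("item2", "item5")

# For a 1-7 scale, 8 - x maps 1 -> 7, 2 -> 6, ..., 7 -> 1
data_frame[, reverse_columns] <- abs(8 - data_frame[, reverse_columns])
```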
2.4.2.5 Reversing items based on data
In some situations, you may need to reverse variables based on certain conditions or a codebook.
Note: Replace codebook_data and data_frame with your own data. The codebook should indicate which variables to reverse. In the example below, the character string "yes" is used to indicate reversed items.
# Using a codebook to reverse variables
cols_to_reverse <- codebook_data$variable_name[trimws(tolower(codebook_data$to_reverse)) == "yes"]
for (col in cols_to_reverse) {
  if (col %in% names(data_frame)) {
    # Reverse only non-missing, non-negative values; 8 = max_scale_value + 1 for a 7-point scale
    positive_indices <- which(data_frame[[col]] >= 0 & !is.na(data_frame[[col]]))
    data_frame[positive_indices, col] <- 8 - data_frame[positive_indices, col]
  }
}
2.5 Data imputation
2.5.1 Perform mean substitution
Handling missing data is crucial. Simple techniques include replacing missing values with mean or median.
Note: The mutate(across(where(is.numeric), ...)) call replaces NA values in numeric columns with their mean. Replace data_frame with your dataset.
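A sketch of the mean substitution described above:

```r
library(dplyr)
# Replace NA values in every numeric column with that column's mean
data_frame <- data_frame %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
```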
2.5.2 Multiple imputation with MICE in R
Multiple imputation is a statistical technique for handling missing data: it creates several plausible imputed datasets and then pools the results. The mice package in R performs multiple imputation using chained equations.
Note: The mice() function in the mice package creates multiple imputed datasets (m) based on the pattern of missingness in the data. The complete() function then extracts a single completed dataset for analysis (the fifth one in the code below), but all imputed sets should typically be used for a proper multiple imputation analysis.
# Ensure the mice package is installed and then load it
if (!require(mice)) install.packages("mice")
library(mice)
# Perform multiple imputation on your dataset
imputed_data <- mice(data_frame, m = 5, maxit = 50, method = 'pmm', seed = 500)
# Extract one completed dataset (here, the fifth imputation)
completed_data <- complete(imputed_data, 5)
# Note: To perform a proper multiple imputation analysis,
# use all imputed datasets, typically by pooling the results.
The mice() function’s parameters include m for the number of datasets to impute, maxit for the maximum number of iterations, method for the imputation method (with ‘pmm’ indicating predictive mean matching), and seed for the random number generation to ensure reproducibility.
2.6 Data manipulation
Data manipulation is an essential skill in data analysis. This document provides clear examples of sorting, grouping, and summarizing data using R.
2.6.1 Joins
2.6.1.1 Inner join
An inner join returns only the rows that have matching values in both tables.
Note: Replace “key” with the common variable existing in both “data1” and “data2”.
Table 1

ID | Value
---|---
1 | A
2 | B
3 | C

Table 2

ID | Value
---|---
2 | B
3 | C
4 | D

Result of Inner Join

ID | Value.x | Value.y
---|---|---
2 | B | B
3 | C | C
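The join illustrated above could be written as follows; data1, data2, and "key" are placeholders:

```r
library(dplyr)
# Keep only rows whose key value appears in both tables
joined_data <- inner_join(data1, data2, by = "key")
```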
2.6.1.2 Left and right join
A left join returns all records from the left table and the matched records from the right table. Here, “key” should be replaced with the column name that is common to both “data1” and “data2” and serves as the linking key between the two tables. If you wish to return all records from the right table and the matched records from the left table, a right_join is appropriate.
Table 1

ID | Value
---|---
1 | A
2 | B
3 | C

Table 2

ID | Value
---|---
2 | B
3 | C
4 | D

Result of Left Join

ID | Value.x | Value.y
---|---|---
1 | A | NA
2 | B | B
3 | C | C
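A sketch of both directions; data1, data2, and "key" are placeholders:

```r
library(dplyr)
# Left join: keep all rows of data1; unmatched rows get NA in data2's columns
left_result <- left_join(data1, data2, by = "key")

# Right join: keep all rows of data2 instead
right_result <- right_join(data1, data2, by = "key")
```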
2.6.2 Filtering
Filtering is a way to select specific rows based on conditions.
Note: Replace “var1” with the variable based on which you want to filter rows. Replace “10” with your threshold value.
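A sketch where var1 and the threshold 10 are placeholders:

```r
library(dplyr)
# Keep rows where var1 exceeds the threshold
filtered_data <- data_frame %>% filter(var1 > 10)

# Conditions can be combined; here rows with missing var1 are also dropped
filtered_data <- data_frame %>% filter(var1 > 10, !is.na(var1))
```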
2.6.3 Sorting
Sorting allows you to organize your data frame based on specific variables. This is especially useful when you have a multi-dimensional dataset and want to prioritize certain variables for analysis. I use the mtcars dataset as an example.
Note: Replace “mtcars” with the name of your data frame. Replace “cyl”, “gear” etc. with the variables you wish to sort by. If you have a specific vector of sorted variables, replace “sorted_vars_vector” with that vector name.
2.6.3.1 Sorting single variable
To sort a data frame by a single variable, use the arrange function from the dplyr package.
library(dplyr)
# Sort the mtcars dataset by cyl in ascending order
data_sorted_single <- mtcars %>%
arrange(cyl)
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
2.6.3.2 Sorting multiple variables
You can also sort by multiple variables. The data will be sorted by the first variable first, and then within each group by the second variable, and so on.
# Sort the mtcars dataset by cyl and then by mpg
data_sorted_multi <- mtcars %>%
arrange(cyl, mpg)
# Display the first few rows of the sorted data
head(data_sorted_multi)
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
2.6.3.3 Sorting based on custom order
Sometimes, you may need to sort the data in a custom order, not necessarily ascending or descending.
# Define a custom order for the 'gear' variable (mtcars gears are 3, 4, and 5)
gear_order <- c(4, 3, 5)  # values not listed here would become NA and sort last
# Sort the mtcars dataset by 'gear' in the custom order
data_sorted_custom <- mtcars %>%
mutate(gear = factor(gear, levels = gear_order)) %>%
arrange(gear)
# Display the first few rows of the sorted data
head(data_sorted_custom)
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
2.6.4 Grouping
Grouping allows you to categorize your data based on one or more variables, and then perform operations on each group separately. This is particularly useful when you want to perform calculations on subsets of data (e.g. subscales within a scale).
The group_by function from dplyr is used to split the data into groups based on one or more variables. After grouping, you can summarize each group’s data, as shown in the next subsection.
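A sketch using mtcars, grouping by the number of cylinders; grouped_data is the object summarized in the next subsection:

```r
library(dplyr)
# Split mtcars into groups by number of cylinders
grouped_data <- mtcars %>% group_by(cyl)
```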
2.6.5 Summarizing
After grouping, you can calculate any index (I use the mean) for each group. The summarise function is used to apply a summary function to each of the previously created groups.
# Calculate the average horsepower (hp) for each group of cylinders
summary_data <- grouped_data %>%
summarise(avg_hp = mean(hp, na.rm = TRUE))
# Display the summarized data
summary_data
cyl | avg_hp |
---|---|
4 | 82.63636 |
6 | 122.28571 |
8 | 209.21429 |