Chapter 3 Statistical analysis
3.1 Descriptive statistics
Descriptive statistics summarize and organize the characteristics of a dataset. These summaries can enable comparisons across individuals or various units.
3.1.1 Univariate analysis: Using describe function
The describe
function in R provides a comprehensive summary of various attributes of each variable like mean, standard deviation, range, etc., making it a powerful function for quickly summarizing data.
Note: Replace your_data
with your actual data frame.
While the describe
function offers a robust summary of each individual variable, in many scenarios, the interest lies in a set of related items rather than individual ones. To address this, we can calculate the mean for each set of related variables and then apply the describe function to these newly created mean variables using describe(your_data[, "new_variable"])
.
3.1.2 Measures of central tendency for item sets
3.1.2.1 Calculating item set means
# Calculating mean for specific columns
your_data$item_set_mean <- rowMeans(your_data[, c("item1", "item2")], na.rm = TRUE)
Note: Replace your_data
, item1
, and item2
with your specific details.
3.1.2.1.1 Difference between mean and rowMeans
While mean
computes the average for a single variable, rowMeans
allows us to compute row-wise means across multiple variables. This is especially useful when you are interested in summarizing a set of related variables.
- Use
mean
when you need to find the average of a single variable. - Use
rowMeans
when you are interested in the average of a set of related variables for each row.
3.1.2.1.2 Using startsWith
for item selection
If you have multiple items that start with a common string, you can use the startsWith
function.
3.1.2.1.3 Using endsWith
as an alternative
Alternatively, you can use endsWith
if the items of interest end with a common string.
# Mean for columns ending with a specific string
your_data$set_mean_end <- rowMeans(your_data[, endsWith(colnames(your_data), "suffix")], na.rm = TRUE)
Note: Replace your_data and prefix with your actual data frame and common starting string.
3.1.2.2 Calculating subscale means using loop functions
Why specify each subscale?
In psychometrics, a scale often consists of multiple subscales. Each subscale might start with a unique string pattern. By specifying these, you can dynamically compute subscale means.
# Define subscale patterns
subscale_patterns <- list(
`Subscale1` = "prefix1_",
`Subscale2` = "prefix2_"
)
# Loop through each subscale and calculate mean
for(subscale in names(subscale_patterns)) {
items <- colnames(your_data)[startsWith(colnames(your_data), subscale_patterns[[subscale]])]
your_data[[subscale]] <- rowMeans(your_data[, items], na.rm = TRUE)
}
Note: Replace your_data, Subscale1, Subscale2, prefix1_, and prefix2_ with your actual data frame and subscale names and prefixes.
3.2 Regression analysis
3.2.1 Simple linear regression
Concept: Simple linear regression models the linear relationship between a single independent variable (predictor) and a dependent variable (response).
Implementation:
- Replace
your_response
,your_predictor
, andyour_data
with your specific variables and dataset.
Diagnostic checks:
1. Residual analysis: Examine residuals to check for non-linearity, heteroscedasticity, and outliers.
r plot(simple_model$residuals)
2. Normality of residuals: Shapiro-Wilk test or QQ plot.
r shapiro.test(simple_model$residuals) qqnorm(simple_model$residuals); qqline(simple_model$residuals)
3.2.2 Multiple linear regression
Concept: Multiple linear regression extends simple linear regression by incorporating multiple predictors.
Implementation:
multiple_model <- lm(your_response ~ predictor1 + predictor2 + ..., data = your_data)
summary(multiple_model)
Multicollinearity check:
- Use Variance Inflation Factor (VIF) to check for multicollinearity.
r library(car) vif(multiple_model)
Model selection:
- Stepwise selection (forward, backward, or both) to identify significant predictors.
r stepAIC(multiple_model, direction = "both")
3.2.3 Logistic regression
Concept: Logistic regression is used for binary outcome prediction, modeling the probability that an observation falls into one of two categories.
Implementation:
logistic_model <- glm(your_response ~ predictor1 + predictor2, family = binomial, data = your_data)
summary(logistic_model)
Model evaluation:
- Confusion matrix and ROC curve for assessing model performance.
r library(caret) confusionMatrix(predict(logistic_model, type = "response"), your_data$your_response)
3.2.4 Polynomial regression
Concept: Polynomial regression models non-linear relationships by incorporating polynomial terms.
Implementation:
polynomial_model <- lm(your_response ~ poly(your_predictor, degree = 2), data = your_data)
summary(polynomial_model)
Choosing polynomial degree:
- Use cross-validation to find the optimal degree.
r library(boot) cv.glm(your_data, polynomial_model, K = 10)$delta
3.2.5 Model diagnostics for regression
- Residual vs. Fitted Plot: Checks for homoscedasticity.
- Normal Q-Q Plot: Verifies normality of residuals.
- Scale-Location Plot: Assesses homoscedasticity.
- Residuals vs. Leverage Plot: Identifies influential cases.
Creating diagnostic plots:
3.2.6 Advanced topics
- Interaction Effects: Model the effect of interacting predictors.
- Non-linear Regression: Use generalized additive models (GAM) for complex relationships.
- Ridge and Lasso Regression: For high-dimensional data.
- Survival Analysis: Regression for time-to-event data.
- Mixed-Effects Models: Handle clustered or grouped data.
3.3 Inferential statistics
3.3.1 Hypothesis testing
Hypothesis testing is a statistical method that allows us to make decisions about a population parameter based on a sample. In R, several functions facilitate hypothesis testing.
3.3.2 Analysis of variance (ANOVA)
ANOVA is used to compare the means of three or more groups.
3.3.2.1 One-way ANOVA
Tests for significant differences among group means based on a single independent variable.
Implementation:
3.3.2.2 Two-way ANOVA
Explores the interaction between two independent variables on a dependent variable.
Implementation:
Assumptions:
- Normality: Each group’s residuals should be normally distributed.
- Homogeneity of variances: Use Levene’s test (leveneTest
from car
package).
- Independence: Observations should be independent.
3.3.3 Non-parametric tests
For data that do not meet the assumptions of parametric tests, non-parametric tests are used.
3.3.4 Advanced inferential statistics
- Repeated Measures ANOVA: For dependent samples.
- Multivariate Analysis of Variance (MANOVA): Analyzes multiple dependent variables.
- Kruskal-Wallis Test: Non-parametric alternative to one-way ANOVA.
- Friedman Test: Non-parametric alternative for repeated measures ANOVA.
3.3.5 Model checking and assumptions
For each inferential statistic method, it’s critical to check for model assumptions:
- Normality: Shapiro-Wilk test or visual assessment using QQ plots.
- Homogeneity of Variances: Levene’s test or visual assessment using boxplots.
- Independence: Ensuring data points are independent, usually verified during study design.
3.4 Psychometric analysis
3.4.1 Classical test theory
3.4.1.1 Setting the working directory and reading data
- Setting directory and reading CSV:
r setwd("your_directory_path") # Replace "BFI_data.csv" with the actual filename data <- read.csv("BFI_data.csv", sep = ";") head(data) tail(data)
3.4.1.2 Data preparation
Selecting variables:
r # Select relevant columns, typically ID, demographics, and test items data <- data[,c("ID", "age", "sex", "BFI_item1", ..., "BFI_item44")] head(data)
Handling missing values:
r # Replace specific codes for missing values with NA data[data == -77] <- NA
Checking missing values:
r # Summarize missing values for each variable missing_values <- apply(data, 2, function(x) sum(is.na(x))) print(missing_values)
Removing incomplete responses:
r # Exclude participants with excessive missing responses data <- subset(data, rowSums(is.na(data[,c("BFI_item1", ..., "BFI_item44")])) < threshold)
3.4.1.3 Item analysis
Descriptive statistics for items:
r library(psych) descriptive_stats <- describe(data[,c("BFI_item1", ..., "BFI_item44")]) print(descriptive_stats)
Item reliability:
r # Cronbach's alpha for the scale scale_alpha <- alpha(data[,c("BFI_item1", ..., "BFI_item44")]) print(scale_alpha$total)
3.4.1.4 Exploratory factor analysis (EFA)
Conducting EFA: ```r # Correlation matrix calculation R <- cor(data[,c(“BFI_item1”, …, “BFI_item44”)], use = “pairwise.complete.obs”)
# Factor extraction library(GPArotation) efa_result <- fa(R, nfactors = 5, rotate = “varimax”) print(efa_result$loadings) ```