Chapter 3 Statistical analysis

3.1 Descriptive statistics

Descriptive statistics summarize and organize the characteristics of a dataset. These summaries can enable comparisons across individuals or various units.

3.1.1 Univariate analysis: Using describe function

The describe function in R provides a comprehensive summary of various attributes of each variable like mean, standard deviation, range, etc., making it a powerful function for quickly summarizing data.

# Using the psych package for the describe function
library(psych)
describe(your_data)

Note: Replace your_data with your actual data frame.

While the describe function offers a robust summary of each individual variable, in many scenarios, the interest lies in a set of related items rather than individual ones. To address this, we can calculate the mean for each set of related variables and then apply the describe function to these newly created mean variables using describe(your_data[, "new_variable"]).

3.1.2 Measures of central tendency for item sets

3.1.2.1 Calculating item set means

# Calculating mean for specific columns
your_data$item_set_mean <- rowMeans(your_data[, c("item1", "item2")], na.rm = TRUE)

Note: Replace your_data, item1, and item2 with your specific details.

3.1.2.1.1 Difference between mean and rowMeans

While mean computes the average for a single variable, rowMeans allows us to compute row-wise means across multiple variables. This is especially useful when you are interested in summarizing a set of related variables.

  • Use mean when you need to find the average of a single variable.
  • Use rowMeans when you are interested in the average of a set of related variables for each row.
3.1.2.1.2 Using startsWith for item selection

If you have multiple items that start with a common string, you can use the startsWith function.

# Mean for columns starting with a specific string
your_data$set_mean_start <- rowMeans(your_data[, startsWith(colnames(your_data), "prefix")], na.rm = TRUE)
3.1.2.1.3 Using endsWith as an alternative

Alternatively, you can use endsWith if the items of interest end with a common string.

# Mean for columns ending with a specific string
your_data$set_mean_end <- rowMeans(your_data[, endsWith(colnames(your_data), "suffix")], na.rm = TRUE)

Note: Replace your_data and prefix with your actual data frame and common starting string.

3.1.2.2 Calculating subscale means using loop functions

Why specify each subscale?

In psychometrics, a scale often consists of multiple subscales. Each subscale might start with a unique string pattern. By specifying these, you can dynamically compute subscale means.

# Define subscale patterns
subscale_patterns <- list(
  `Subscale1` = "prefix1_",
  `Subscale2` = "prefix2_"
)

# Loop through each subscale and calculate mean
for(subscale in names(subscale_patterns)) {
  items <- colnames(your_data)[startsWith(colnames(your_data), subscale_patterns[[subscale]])]
  your_data[[subscale]] <- rowMeans(your_data[, items], na.rm = TRUE)
}

Note: Replace your_data, Subscale1, Subscale2, prefix1_, and prefix2_ with your actual data frame and subscale names and prefixes.

3.2 Regression analysis

3.2.1 Simple linear regression

Concept: Simple linear regression models the linear relationship between a single independent variable (predictor) and a dependent variable (response).

Implementation:

simple_model <- lm(your_response ~ your_predictor, data = your_data)
summary(simple_model)
  • Replace your_response, your_predictor, and your_data with your specific variables and dataset.

Diagnostic checks: 1. Residual analysis: Examine residuals to check for non-linearity, heteroscedasticity, and outliers. r plot(simple_model$residuals) 2. Normality of residuals: Shapiro-Wilk test or QQ plot. r shapiro.test(simple_model$residuals) qqnorm(simple_model$residuals); qqline(simple_model$residuals)

3.2.2 Multiple linear regression

Concept: Multiple linear regression extends simple linear regression by incorporating multiple predictors.

Implementation:

multiple_model <- lm(your_response ~ predictor1 + predictor2 + ..., data = your_data)
summary(multiple_model)

Multicollinearity check: - Use Variance Inflation Factor (VIF) to check for multicollinearity. r library(car) vif(multiple_model)

Model selection: - Stepwise selection (forward, backward, or both) to identify significant predictors. r stepAIC(multiple_model, direction = "both")

3.2.3 Logistic regression

Concept: Logistic regression is used for binary outcome prediction, modeling the probability that an observation falls into one of two categories.

Implementation:

logistic_model <- glm(your_response ~ predictor1 + predictor2, family = binomial, data = your_data)
summary(logistic_model)

Model evaluation: - Confusion matrix and ROC curve for assessing model performance. r library(caret) confusionMatrix(predict(logistic_model, type = "response"), your_data$your_response)

3.2.4 Polynomial regression

Concept: Polynomial regression models non-linear relationships by incorporating polynomial terms.

Implementation:

polynomial_model <- lm(your_response ~ poly(your_predictor, degree = 2), data = your_data)
summary(polynomial_model)

Choosing polynomial degree: - Use cross-validation to find the optimal degree. r library(boot) cv.glm(your_data, polynomial_model, K = 10)$delta

3.2.5 Model diagnostics for regression

  1. Residual vs. Fitted Plot: Checks for homoscedasticity.
  2. Normal Q-Q Plot: Verifies normality of residuals.
  3. Scale-Location Plot: Assesses homoscedasticity.
  4. Residuals vs. Leverage Plot: Identifies influential cases.

Creating diagnostic plots:

par(mfrow = c(2, 2))
plot(your_model)

3.2.6 Advanced topics

  1. Interaction Effects: Model the effect of interacting predictors.
  2. Non-linear Regression: Use generalized additive models (GAM) for complex relationships.
  3. Ridge and Lasso Regression: For high-dimensional data.
  4. Survival Analysis: Regression for time-to-event data.
  5. Mixed-Effects Models: Handle clustered or grouped data.

3.3 Inferential statistics

3.3.1 Hypothesis testing

Hypothesis testing is a statistical method that allows us to make decisions about a population parameter based on a sample. In R, several functions facilitate hypothesis testing.

3.3.1.1 One-sample t-test

Used to determine whether the mean of a single group differs from a specified value.

Implementation:

t.test(your_data$your_column, mu = your_value)
  • Replace your_column with your variable, and your_value with the hypothesized mean.

3.3.1.2 Two-sample t-test

Compares the means of two independent groups to determine if they differ significantly.

Implementation:

t.test(your_data$column1, your_data$column2)
  • Replace column1 and column2 with the respective variables.

3.3.2 Analysis of variance (ANOVA)

ANOVA is used to compare the means of three or more groups.

3.3.2.1 One-way ANOVA

Tests for significant differences among group means based on a single independent variable.

Implementation:

one_way_anova <- aov(your_response ~ your_factor, data = your_data)
summary(one_way_anova)

3.3.2.2 Two-way ANOVA

Explores the interaction between two independent variables on a dependent variable.

Implementation:

two_way_anova <- aov(your_response ~ factor1 * factor2, data = your_data)
summary(two_way_anova)

Assumptions: - Normality: Each group’s residuals should be normally distributed. - Homogeneity of variances: Use Levene’s test (leveneTest from car package). - Independence: Observations should be independent.

3.3.3 Non-parametric tests

For data that do not meet the assumptions of parametric tests, non-parametric tests are used.

3.3.3.1 Chi-square test

Used to test the independence between two categorical variables.

Implementation:

chisq.test(your_data$table)
  • Create a contingency table using table(your_data$var1, your_data$var2).

3.3.3.2 Mann-Whitney U test

Compares the medians of two independent groups.

Implementation:

wilcox.test(your_data$group1, your_data$group2)
  • Replace group1 and group2 with the respective variables.

3.3.4 Advanced inferential statistics

  1. Repeated Measures ANOVA: For dependent samples.
  2. Multivariate Analysis of Variance (MANOVA): Analyzes multiple dependent variables.
  3. Kruskal-Wallis Test: Non-parametric alternative to one-way ANOVA.
  4. Friedman Test: Non-parametric alternative for repeated measures ANOVA.

3.3.5 Model checking and assumptions

For each inferential statistic method, it’s critical to check for model assumptions:

  • Normality: Shapiro-Wilk test or visual assessment using QQ plots.
  • Homogeneity of Variances: Levene’s test or visual assessment using boxplots.
  • Independence: Ensuring data points are independent, usually verified during study design.

3.4 Psychometric analysis

3.4.1 Classical test theory

3.4.1.1 Setting the working directory and reading data

  • Setting directory and reading CSV: r setwd("your_directory_path") # Replace "BFI_data.csv" with the actual filename data <- read.csv("BFI_data.csv", sep = ";") head(data) tail(data)

3.4.1.2 Data preparation

  • Selecting variables: r # Select relevant columns, typically ID, demographics, and test items data <- data[,c("ID", "age", "sex", "BFI_item1", ..., "BFI_item44")] head(data)

  • Handling missing values: r # Replace specific codes for missing values with NA data[data == -77] <- NA

  • Checking missing values: r # Summarize missing values for each variable missing_values <- apply(data, 2, function(x) sum(is.na(x))) print(missing_values)

  • Removing incomplete responses: r # Exclude participants with excessive missing responses data <- subset(data, rowSums(is.na(data[,c("BFI_item1", ..., "BFI_item44")])) < threshold)

3.4.1.3 Item analysis

  • Descriptive statistics for items: r library(psych) descriptive_stats <- describe(data[,c("BFI_item1", ..., "BFI_item44")]) print(descriptive_stats)

  • Item reliability: r # Cronbach's alpha for the scale scale_alpha <- alpha(data[,c("BFI_item1", ..., "BFI_item44")]) print(scale_alpha$total)

3.4.1.4 Exploratory factor analysis (EFA)

  • Conducting EFA: ```r # Correlation matrix calculation R <- cor(data[,c(“BFI_item1”, …, “BFI_item44”)], use = “pairwise.complete.obs”)

    # Factor extraction library(GPArotation) efa_result <- fa(R, nfactors = 5, rotate = “varimax”) print(efa_result$loadings) ```

3.4.1.5 Reliability analysis

  • Testing for measurement models: r library(lavaan) # Example for a single factor model model <- 'latent_factor =~ BFI_item1 + BFI_item2 + BFI_item3' fit <- cfa(model, data = data) summary(fit)

3.4.1.6 Validity analysis

  • Correlations for construct validity: r # Assuming another construct (e.g., self-esteem) for validity analysis validity_correlation <- cor(data$BFI_total, data$self_esteem) print(validity_correlation)

3.4.2 Item response theory