Week 11: Bayesian Models & Spatial Epidemiology

This week, we will:
Apply Bayesian statistical methods using the rstanarm package
Learn spatial interpolation techniques with kriging (gstat package)
Conduct hotspot analysis for malaria using the spdep package


1️⃣ Bayesian Analysis for Malaria Modeling

Why Use Bayesian Methods?

📌 Bayesian analysis allows you to incorporate prior knowledge, quantify uncertainty in parameters, and make probabilistic statements about malaria risk and intervention effects.

Step 1: Introduction to Bayesian Thinking


# Install and load necessary packages
install.packages(c("rstanarm", "bayesplot", "ggplot2", "dplyr", "tidyr", "tidybayes"))
library(rstanarm)
library(bayesplot)
library(ggplot2)
library(dplyr)
library(tidyr)      # for pivot_longer(), used in Step 4
library(tidybayes)

# Load malaria dataset
malaria_data <- read.csv("malaria_data.csv")

# Examine data structure
str(malaria_data)
summary(malaria_data)

# Understand Bayesian vs. Frequentist approaches
# In frequentist statistics, we typically get:
#  - Point estimates (e.g., coefficient = 2.5)
#  - Confidence intervals (e.g., 95% CI: 1.8 - 3.2)
#  - p-values (e.g., p = 0.001)

# In Bayesian statistics, we get:
#  - Posterior distributions for parameters
#  - Credible intervals (direct probability statements about parameters)
#  - Posterior probabilities

# Simple example: Binomial model for malaria prevalence
# Suppose we surveyed 100 people and found 23 with malaria
# Frequentist approach would give a point estimate of 23%
n_surveyed <- 100
n_positive <- 23
freq_estimate <- n_positive / n_surveyed
freq_ci <- binom.test(n_positive, n_surveyed)$conf.int

cat("Frequentist estimate:", freq_estimate, "\n")
cat("95% Confidence interval:", freq_ci[1], "-", freq_ci[2], "\n")

# Bayesian approach with different priors
# 1. Non-informative prior (Beta(1,1))
# 2. Informative prior based on previous studies (e.g., Beta(10,40) for ~20% prevalence)

# Posterior with non-informative prior
alpha_non_inf <- 1 + n_positive
beta_non_inf <- 1 + (n_surveyed - n_positive)
posterior_mean_non_inf <- alpha_non_inf / (alpha_non_inf + beta_non_inf)

# Posterior with informative prior
alpha_inf <- 10 + n_positive
beta_inf <- 40 + (n_surveyed - n_positive)
posterior_mean_inf <- alpha_inf / (alpha_inf + beta_inf)

cat("Bayesian estimate (non-informative prior):", posterior_mean_non_inf, "\n")
cat("Bayesian estimate (informative prior):", posterior_mean_inf, "\n")

# Plot the posterior distributions
x <- seq(0, 0.5, length.out = 1000)
posterior_non_inf <- dbeta(x, alpha_non_inf, beta_non_inf)
posterior_inf <- dbeta(x, alpha_inf, beta_inf)
prior_inf <- dbeta(x, 10, 40)

plot_data <- data.frame(
  x = rep(x, 3),
  density = c(posterior_non_inf, posterior_inf, prior_inf),
  Distribution = rep(c("Posterior (Non-inf Prior)", "Posterior (Inf Prior)", "Informative Prior"), 
                    each = length(x))
)

ggplot(plot_data, aes(x = x, y = density, color = Distribution)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = freq_estimate, linetype = "dashed", color = "black") +
  annotate("text", x = freq_estimate + 0.03, y = max(posterior_non_inf) * 0.8, 
           label = "Frequentist Estimate", angle = 90) +
  labs(
    title = "Posterior Distributions for Malaria Prevalence",
    subtitle = "Comparing different prior distributions",
    x = "Prevalence",
    y = "Density"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

# Calculate 95% credible intervals
credible_interval_non_inf <- qbeta(c(0.025, 0.975), alpha_non_inf, beta_non_inf)
credible_interval_inf <- qbeta(c(0.025, 0.975), alpha_inf, beta_inf)

cat("95% Credible interval (non-informative prior):", 
    credible_interval_non_inf[1], "-", credible_interval_non_inf[2], "\n")
cat("95% Credible interval (informative prior):", 
    credible_interval_inf[1], "-", credible_interval_inf[2], "\n")
            

🔍 Explanation:

✔ Bayesian statistics combines prior knowledge with observed data
✔ The result is a posterior distribution that represents our updated belief
✔ Informative priors can improve precision when based on reliable prior knowledge
✔ Credible intervals have a direct probabilistic interpretation
✔ With large sample sizes, Bayesian and frequentist methods often converge
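The last point can be checked directly with the conjugate Beta-Binomial update used above. A short sketch, assuming a hypothetical ten-times-larger survey with the same 23% positivity:

```r
# With more data, the prior matters less (conjugate Beta-Binomial update)
n_surveyed <- 1000
n_positive <- 230  # hypothetical larger survey, same 23% positivity

# Posterior means under the same two priors as above: Beta(1,1) and Beta(10,40)
post_mean_non_inf <- (1 + n_positive) / (2 + n_surveyed)
post_mean_inf <- (10 + n_positive) / (50 + n_surveyed)

# The gap between the two estimates shrinks from ~0.015 (at n = 100) to ~0.002
round(c(non_informative = post_mean_non_inf, informative = post_mean_inf), 4)
```

The same informative prior that visibly shifted the estimate with 100 observations barely moves it with 1,000.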


Step 2: Bayesian Regression Models with rstanarm


# Prepare data for modeling
# For this example, we'll model malaria incidence as a function of environmental factors

# Check for missing values and handle them
missing_values <- colSums(is.na(malaria_data))
print(missing_values)

# Handle missing values if necessary
malaria_data <- na.omit(malaria_data)

# Convert categorical variables to factors
if("region" %in% names(malaria_data)) {
  malaria_data$region <- as.factor(malaria_data$region)
}

# Create a binary outcome if needed (e.g., high vs. low incidence)
if("incidence" %in% names(malaria_data)) {
  median_incidence <- median(malaria_data$incidence)
  malaria_data$high_incidence <- as.factor(ifelse(malaria_data$incidence > median_incidence, 1, 0))
}

# Standardize numeric predictors for better mixing in MCMC
malaria_data <- malaria_data %>%
  mutate(across(c("rainfall", "temperature", "humidity", "elevation"), 
               ~scale(.)[,1]))

# Simple Bayesian linear regression
# Model incidence as a function of environmental factors
bayes_model <- stan_glm(
  incidence ~ rainfall + temperature + humidity + elevation,
  data = malaria_data,
  family = gaussian(),
  prior = normal(0, 2.5),      # Prior for coefficients
  prior_intercept = normal(0, 5),  # Prior for intercept
  prior_aux = exponential(1),  # Prior for error term
  chains = 4,                  # Number of Markov chains
  iter = 2000,                 # Number of iterations
  seed = 123                   # For reproducibility
)

# Examine the model summary
print(bayes_model)
summary(bayes_model)

# Look at posterior distributions of parameters
posterior_samples <- as.matrix(bayes_model)
mcmc_areas(posterior_samples, pars = c("rainfall", "temperature", "humidity", "elevation"),
          prob = 0.95)  # 95% credible intervals

# Overlay per-chain posterior densities to check that the chains agree
mcmc_dens_overlay(as.array(bayes_model), pars = c("rainfall", "temperature", "humidity", "elevation"))

# Investigate the effect of rainfall on incidence
rainfall_effect <- posterior_samples[, "rainfall"]
prob_positive_effect <- mean(rainfall_effect > 0)
cat("Probability that rainfall has a positive effect on incidence:", 
    prob_positive_effect, "\n")

# More detailed visualization of rainfall effect
ggplot(data.frame(rainfall_effect = rainfall_effect), aes(x = rainfall_effect)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Posterior Distribution of Rainfall Effect on Malaria Incidence",
    subtitle = paste0("Probability of positive effect: ", round(prob_positive_effect * 100, 1), "%"),
    x = "Effect Size",
    y = "Count"
  ) +
  theme_minimal()

# Compare variables' relative importance
variable_importance <- abs(colMeans(posterior_samples[, c("rainfall", "temperature", 
                                                        "humidity", "elevation")]))
importance_df <- data.frame(
  Variable = names(variable_importance),
  Importance = variable_importance
)

ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Variable Importance in Bayesian Model",
    subtitle = "Based on absolute effect size",
    x = "",
    y = "Mean Absolute Effect Size"
  ) +
  theme_minimal()

# Posterior predictive checks
pp_check(bayes_model, nreps = 50)

# Fit similar model with frequentist approach for comparison
freq_model <- glm(
  incidence ~ rainfall + temperature + humidity + elevation,
  data = malaria_data,
  family = gaussian()
)

# Compare coefficients
coef_comparison <- data.frame(
  Variable = names(coef(freq_model)),
  Frequentist = coef(freq_model),
  Bayesian = colMeans(posterior_samples)[names(coef(freq_model))]
)
print(coef_comparison)
            

🔍 Explanation:

✔ rstanarm implements Bayesian regression using Stan
✔ The model estimates posterior distributions, not just point estimates
✔ Credible intervals show the range of plausible values for parameters
✔ Posterior probability statements (e.g., "95% probability rainfall increases incidence") are straightforward
✔ Posterior predictive checks assess model fit by comparing observed data to predictions
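Those posterior probability statements are just summaries of the MCMC draws. A standalone sketch using simulated normal draws in place of real posterior samples (so it runs without Stan; the mean and SD are invented):

```r
set.seed(123)
# Stand-in for 4,000 posterior draws of one coefficient (e.g., the rainfall effect)
draws <- rnorm(4000, mean = 1.2, sd = 0.5)

# Direct probability statement: P(effect > 0 | data)
prob_positive <- mean(draws > 0)

# 95% credible interval: the central 95% of the draws
ci_95 <- quantile(draws, probs = c(0.025, 0.975))

round(prob_positive, 3)
round(ci_95, 2)
```

With real model output, `posterior_samples[, "rainfall"]` plays the role of `draws` and the computations are identical.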


Step 3: Bayesian Hierarchical Models for Malaria


# Hierarchical (multilevel) models account for grouped data
# For malaria data, we might have observations nested within regions

# Check if we have hierarchical data structure
if("region" %in% names(malaria_data) && length(unique(malaria_data$region)) > 1) {
  # Examine the number of observations per region
  table(malaria_data$region)
  
  # Plot incidence by region
  ggplot(malaria_data, aes(x = region, y = incidence)) +
    geom_boxplot() +
    theme_minimal() +
    labs(title = "Malaria Incidence by Region",
        x = "Region",
        y = "Incidence") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  # Fit a Bayesian hierarchical model
  # This allows region-specific effects while borrowing strength across regions
  hierarchical_model <- stan_glmer(
    incidence ~ rainfall + temperature + (1 + rainfall | region),
    data = malaria_data,
    family = gaussian(),
    prior = normal(0, 2.5),
    prior_intercept = normal(0, 5),
    prior_aux = exponential(1),
    prior_covariance = decov(regularization = 1, concentration = 1, shape = 1, scale = 1),
    chains = 4,
    iter = 2000,
    seed = 123
  )
  
  # Examine model summary
  print(hierarchical_model)
  
  # Extract and print random effects (region-specific deviations)
  print(ranef(hierarchical_model))
  
  # Visualize varying slopes and intercepts
  re_effects <- ranef(hierarchical_model)
  region_effects <- re_effects$region
  
  # Create a dataframe for plotting
  region_df <- data.frame(
    region = rownames(region_effects),
    intercept = region_effects[, 1],
    rainfall_slope = region_effects[, 2]
  )
  
  # Plot region-specific intercepts
  ggplot(region_df, aes(x = reorder(region, intercept), y = intercept)) +
    geom_point(size = 3) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    coord_flip() +
    labs(
      title = "Region-Specific Intercepts",
      subtitle = "Deviations from the overall mean",
      x = "Region",
      y = "Intercept Adjustment"
    ) +
    theme_minimal()
  
  # Plot region-specific rainfall effects
  ggplot(region_df, aes(x = reorder(region, rainfall_slope), y = rainfall_slope)) +
    geom_point(size = 3) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    coord_flip() +
    labs(
      title = "Region-Specific Rainfall Effects",
      subtitle = "How rainfall effect varies across regions",
      x = "Region",
      y = "Rainfall Slope Adjustment"
    ) +
    theme_minimal()
  
  # Plot the relationship between intercepts and slopes
  ggplot(region_df, aes(x = intercept, y = rainfall_slope, label = region)) +
    geom_point(size = 3) +
    geom_text(hjust = -0.3, vjust = 0.3) +
    geom_vline(xintercept = 0, linetype = "dashed", color = "gray") +
    geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
    labs(
      title = "Relationship Between Region Intercepts and Rainfall Effects",
      x = "Intercept Adjustment",
      y = "Rainfall Effect Adjustment"
    ) +
    theme_minimal()
  
  # Extract the overall fixed effects
  fixed_effects <- fixef(hierarchical_model)
  print(fixed_effects)
  
  # Make predictions for each region
  # Create a sequence of rainfall values
  new_rainfall <- seq(min(malaria_data$rainfall), max(malaria_data$rainfall), length.out = 100)
  
  # Base data for predictions
  base_data <- data.frame(
    rainfall = new_rainfall,
    temperature = mean(malaria_data$temperature),
    humidity = mean(malaria_data$humidity),
    elevation = mean(malaria_data$elevation)
  )
  
  # Make predictions for each region
  prediction_list <- list()
  
  for (region_name in levels(malaria_data$region)) {
    new_data <- base_data
    new_data$region <- region_name
    
    # Get posterior predictions
    post_pred <- posterior_predict(hierarchical_model, newdata = new_data)
    
    # Calculate mean and credible intervals
    pred_mean <- colMeans(post_pred)
    pred_lower <- apply(post_pred, 2, quantile, 0.025)
    pred_upper <- apply(post_pred, 2, quantile, 0.975)
    
    # Store results
    region_pred <- data.frame(
      rainfall = new_rainfall,
      region = region_name,
      mean = pred_mean,
      lower = pred_lower,
      upper = pred_upper
    )
    
    prediction_list[[region_name]] <- region_pred
  }
  
  # Combine all predictions
  all_predictions <- do.call(rbind, prediction_list)
  
  # Plot the region-specific rainfall effects
  ggplot(all_predictions, aes(x = rainfall, y = mean, color = region, fill = region)) +
    geom_line() +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, color = NA) +
    labs(
      title = "Region-Specific Effects of Rainfall on Malaria Incidence",
      subtitle = "Lines show mean prediction; shaded areas show 95% credible intervals",
      x = "Rainfall (standardized)",
      y = "Predicted Incidence"
    ) +
    theme_minimal()
}
            

🔍 Explanation:

✔ Hierarchical (multilevel) models account for grouped data structure
✔ For malaria studies, regions or health facilities often form natural groupings
✔ stan_glmer() allows varying intercepts and slopes by group
✔ This approach "borrows strength" across regions while allowing regional differences
✔ Region-specific predictions enable targeted intervention planning
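The "borrowing strength" idea can be sketched without fitting anything in Stan: under a normal-normal model, each region's raw mean is pulled toward the overall mean, and regions with fewer observations are pulled harder. All numbers below are invented for illustration:

```r
# Toy partial-pooling (shrinkage) calculation for three regions
region_mean <- c(A = 40, B = 80, C = 60)  # raw region means (hypothetical)
n_obs <- c(A = 50, B = 5, C = 50)         # observations per region
sigma2 <- 100                             # assumed within-region variance
tau2 <- 25                                # assumed between-region variance

grand_mean <- mean(region_mean)
# Shrinkage weight: how much each raw mean is trusted relative to the grand mean
w <- tau2 / (tau2 + sigma2 / n_obs)
pooled <- w * region_mean + (1 - w) * grand_mean

# Region B, with only 5 observations, is pulled hardest toward the overall mean
round(pooled, 1)
```

This is exactly what the hierarchical model does, except that it estimates the variance components from the data instead of assuming them.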


Step 4: Bayesian Prediction and Decision Making


# Using Bayesian models for prediction and decision-making
# Define new locations for prediction
new_locations <- data.frame(
  rainfall = c(0.5, -1.0, 1.5),  # Standardized values
  temperature = c(1.0, 0.2, -0.8),
  humidity = c(0.2, -0.5, 1.2),
  elevation = c(-0.8, 1.2, 0.3),
  region = c("Region1", "Region2", "Region3")
)

# Ensure region is a factor with the same levels as in the training data
new_locations$region <- factor(new_locations$region, levels = levels(malaria_data$region))

# Generate posterior predictions
posterior_preds <- posterior_predict(hierarchical_model, newdata = new_locations)

# Calculate mean and 95% credible intervals
pred_summary <- data.frame(
  Location = 1:nrow(new_locations),
  Mean = colMeans(posterior_preds),
  Lower_95 = apply(posterior_preds, 2, quantile, 0.025),
  Upper_95 = apply(posterior_preds, 2, quantile, 0.975)
)

print(pred_summary)

# Visualize the predictions with uncertainty
ggplot(pred_summary, aes(x = factor(Location), y = Mean)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = Lower_95, ymax = Upper_95), width = 0.2) +
  labs(
    title = "Predicted Malaria Incidence with Uncertainty",
    subtitle = "Points show mean prediction; error bars show 95% credible intervals",
    x = "Location",
    y = "Predicted Incidence"
  ) +
  theme_minimal()

# Probability of exceeding a threshold (e.g., epidemic threshold)
epidemic_threshold <- 50  # Define an appropriate threshold
prob_epidemic <- colMeans(posterior_preds > epidemic_threshold)

# Add to our summary
pred_summary$Prob_Epidemic <- prob_epidemic
print(pred_summary)

# Visualize epidemic probability
ggplot(pred_summary, aes(x = factor(Location), y = Prob_Epidemic, fill = Prob_Epidemic)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(Prob_Epidemic * 100, 1), "%")), vjust = -0.5) +
  scale_fill_gradient(low = "green", high = "red") +
  labs(
    title = "Probability of Exceeding Epidemic Threshold",
    x = "Location",
    y = "Probability"
  ) +
  ylim(0, 1) +
  theme_minimal()

# Expected loss analysis for intervention decisions
# Assume:
# - Loss without intervention when there's an outbreak: 100 units
# - Loss with intervention when there's no outbreak: 30 units
# - Loss with intervention when there is an outbreak: 20 units
# - Loss without intervention when there's no outbreak: 0 units

loss_no_int_outbreak <- 100
loss_int_no_outbreak <- 30
loss_int_outbreak <- 20
loss_no_int_no_outbreak <- 0

# Calculate expected loss for both decisions
exp_loss_intervention <- loss_int_outbreak * prob_epidemic + loss_int_no_outbreak * (1 - prob_epidemic)
exp_loss_no_intervention <- loss_no_int_outbreak * prob_epidemic + loss_no_int_no_outbreak * (1 - prob_epidemic)

# Determine optimal decision
optimal_decision <- ifelse(exp_loss_intervention < exp_loss_no_intervention, 
                         "Intervene", "Don't Intervene")

# Create decision table
decision_df <- data.frame(
  Location = 1:nrow(new_locations),
  Prob_Epidemic = prob_epidemic,
  Expected_Loss_Intervention = exp_loss_intervention,
  Expected_Loss_No_Intervention = exp_loss_no_intervention,
  Optimal_Decision = optimal_decision
)

print(decision_df)

# Visualize the decision analysis
decision_df_long <- decision_df %>%
  select(Location, Expected_Loss_Intervention, Expected_Loss_No_Intervention) %>%
  pivot_longer(cols = c(Expected_Loss_Intervention, Expected_Loss_No_Intervention),
              names_to = "Decision", values_to = "Expected_Loss")

ggplot(decision_df_long, aes(x = factor(Location), y = Expected_Loss, fill = Decision)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = round(Expected_Loss, 1)), position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(
    title = "Expected Loss by Decision",
    subtitle = "Lower expected loss indicates preferred decision",
    x = "Location",
    y = "Expected Loss"
  ) +
  theme_minimal()

# Create a function to determine optimal decision threshold
find_threshold <- function(loss_no_int_outbreak, loss_int_no_outbreak, 
                         loss_int_outbreak, loss_no_int_no_outbreak) {
  # Threshold probability where expected losses are equal
  p_threshold <- (loss_int_no_outbreak - loss_no_int_no_outbreak) / 
                ((loss_no_int_outbreak - loss_int_outbreak) + 
                 (loss_int_no_outbreak - loss_no_int_no_outbreak))
  
  return(p_threshold)
}

decision_threshold <- find_threshold(loss_no_int_outbreak, loss_int_no_outbreak, 
                                   loss_int_outbreak, loss_no_int_no_outbreak)

cat("Decision threshold: Intervene when P(epidemic) >", decision_threshold, "\n")

# Visualize the loss function and decision threshold
p_seq <- seq(0, 1, by = 0.01)
exp_loss_int <- loss_int_outbreak * p_seq + loss_int_no_outbreak * (1 - p_seq)
exp_loss_no_int <- loss_no_int_outbreak * p_seq + loss_no_int_no_outbreak * (1 - p_seq)

loss_df <- data.frame(
  Probability = p_seq,
  Intervention = exp_loss_int,
  No_Intervention = exp_loss_no_int
)

loss_df_long <- loss_df %>%
  pivot_longer(cols = c(Intervention, No_Intervention),
              names_to = "Decision", values_to = "Expected_Loss")

ggplot(loss_df_long, aes(x = Probability, y = Expected_Loss, color = Decision)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = decision_threshold, linetype = "dashed") +
  annotate("text", x = decision_threshold + 0.05, y = 50, 
           label = paste("Threshold =", round(decision_threshold, 2))) +
  labs(
    title = "Expected Loss by Decision for Different Epidemic Probabilities",
    subtitle = "Vertical line shows the decision threshold probability",
    x = "Probability of Epidemic",
    y = "Expected Loss"
  ) +
  theme_minimal()
            

🔍 Explanation:

✔ Bayesian models naturally quantify prediction uncertainty
✔ We can directly calculate probabilities of exceeding epidemic thresholds
✔ Decision analysis combines probabilities with costs/losses
✔ Expected loss minimization provides a framework for intervention decisions
✔ The decision threshold is influenced by the relative costs of different outcomes
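The last point is easy to verify numerically: plugging the loss values from the example into the threshold formula, the expected losses under the two decisions coincide exactly at the threshold probability.

```r
# Check the decision threshold with the loss values used above
loss_no_int_outbreak <- 100
loss_int_no_outbreak <- 30
loss_int_outbreak <- 20
loss_no_int_no_outbreak <- 0

p_star <- (loss_int_no_outbreak - loss_no_int_no_outbreak) /
  ((loss_no_int_outbreak - loss_int_outbreak) +
   (loss_int_no_outbreak - loss_no_int_no_outbreak))
round(p_star, 3)  # 30 / 110

# At p_star, the two expected losses are equal
el_int <- loss_int_outbreak * p_star + loss_int_no_outbreak * (1 - p_star)
el_no_int <- loss_no_int_outbreak * p_star + loss_no_int_no_outbreak * (1 - p_star)
all.equal(el_int, el_no_int)
```

Making intervention cheaper (lowering `loss_int_no_outbreak`) lowers the threshold, so we would intervene at lower epidemic probabilities.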


2️⃣ Spatial Interpolation with Kriging

Step 1: Understanding Spatial Dependency in Malaria Data


# Install and load necessary packages
install.packages(c("sp", "gstat", "automap", "raster", "ggplot2", "sf", "dplyr"))
library(sp)
library(gstat)
library(automap)
library(raster)
library(ggplot2)
library(sf)
library(dplyr)

# Load malaria point data with spatial coordinates
malaria_points <- read.csv("malaria_points.csv")

# Examine the data structure
str(malaria_points)
summary(malaria_points)

# We need at minimum: longitude, latitude, and a value to interpolate (e.g., prevalence)
# Ensure columns are properly named
names(malaria_points)

# Create a spatial points data frame
coordinates(malaria_points) <- c("longitude", "latitude")

# Set the coordinate reference system (CRS)
# Here we use WGS84 (EPSG:4326), but use the appropriate CRS for your region
proj4string(malaria_points) <- CRS("+proj=longlat +datum=WGS84")

# Visualize the point data
plot(malaria_points, pch = 20, cex = sqrt(malaria_points$prevalence) * 3,
     main = "Malaria Prevalence at Sampling Locations")

# Convert to sf object for better plotting
malaria_sf <- st_as_sf(malaria_points)

ggplot(malaria_sf) +
  geom_sf(aes(size = prevalence, color = prevalence)) +
  scale_color_viridis_c() +
  labs(
    title = "Malaria Prevalence at Sampling Locations",
    size = "Prevalence",
    color = "Prevalence"
  ) +
  theme_minimal()

# Explore spatial autocorrelation
# Calculate distances between all pairs of points
dist_matrix <- spDists(malaria_points, longlat = TRUE)

# Function to calculate semivariogram manually
calculate_semivariogram <- function(data, distances, n_bins = 15) {
  # Get the variable to analyze
  var_values <- data$prevalence
  n <- length(var_values)
  
  # Calculate value differences for all pairs
  diff_matrix <- matrix(NA, n, n)
  for(i in 1:n) {
    for(j in 1:n) {
      diff_matrix[i, j] <- (var_values[i] - var_values[j])^2
    }
  }
  
  # Determine distance bins
  max_dist <- max(distances)
  bins <- seq(0, max_dist, length.out = n_bins + 1)
  bin_centers <- (bins[-1] + bins[-length(bins)]) / 2
  
  # Calculate semivariance for each bin
  semivariance <- numeric(n_bins)
  for(i in 1:n_bins) {
    # Find pairs in this distance bin (excluding zero-distance self-pairs)
    bin_pairs <- which(distances >= bins[i] & distances < bins[i+1] & distances > 0,
                       arr.ind = TRUE)
    
    if(nrow(bin_pairs) > 0) {
      # Calculate semivariance as half the mean squared difference
      semivariance[i] <- sum(diff_matrix[bin_pairs]) / (2 * nrow(bin_pairs))
    } else {
      semivariance[i] <- NA
    }
  }
  
  return(data.frame(distance = bin_centers, semivariance = semivariance))
}

# Calculate empirical semivariogram
semivariogram <- calculate_semivariogram(malaria_points@data, dist_matrix)

# Plot empirical semivariogram
ggplot(semivariogram, aes(x = distance, y = semivariance)) +
  geom_point(size = 3) +
  geom_line() +
  labs(
    title = "Empirical Semivariogram for Malaria Prevalence",
    subtitle = "Shows spatial dependence at different distances",
    x = "Distance (km)",
    y = "Semivariance"
  ) +
  theme_minimal()

# Use gstat's variogram function for a more robust calculation
v <- variogram(prevalence ~ 1, data = malaria_points)
plot(v, main = "Empirical Variogram for Malaria Prevalence")

# Fit a variogram model; include a nugget term so v_fit has a Nug row plus the
# model row (swap "Sph" for "Exp" or "Gau" and keep whichever fits best)
v_fit <- fit.variogram(v, model = vgm(psill = NA, model = "Sph", range = NA, nugget = NA))
print(v_fit)

# Plot the fitted variogram model
plot(v, v_fit, main = "Fitted Variogram Model")

# Explain variogram parameters
cat("Nugget:", v_fit$psill[1], "- Represents measurement error or small-scale variation\n")
cat("Partial Sill:", v_fit$psill[2], "- Variance explained by spatial dependence\n")
cat("Range:", v_fit$range[2], "- Distance beyond which spatial correlation becomes negligible\n")
cat("Model:", as.character(v_fit$model[2]), "- Mathematical model describing spatial dependence\n")

# Calculate the proportion of spatial dependence
spatial_dependence <- v_fit$psill[2] / sum(v_fit$psill)
cat("Proportion of variance due to spatial dependence:", round(spatial_dependence * 100, 1), "%\n")
            

🔍 Explanation:

✔ Spatial interpolation relies on the principle of spatial autocorrelation
✔ The semivariogram quantifies how similarity changes with distance
✔ Key parameters include the nugget (measurement error), sill (total variance), and range (correlation distance)
✔ Different variogram models (spherical, exponential, Gaussian) describe different spatial patterns
✔ A well-fitted variogram is essential for accurate kriging interpolation
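To make these parameters concrete, the spherical model (one of the candidate models above) can be computed by hand. The nugget, partial sill, and range below are made-up values, not the fitted ones:

```r
# Spherical variogram model evaluated by hand (illustrative parameters)
spherical_gamma <- function(h, nugget, psill, range) {
  ifelse(h == 0, 0,  # semivariance is zero at distance zero by definition
         ifelse(h <= range,
                nugget + psill * (1.5 * h / range - 0.5 * (h / range)^3),
                nugget + psill))  # flat at the sill beyond the range
}

h <- seq(0, 100, by = 1)
gamma <- spherical_gamma(h, nugget = 0.02, psill = 0.08, range = 60)

# At and beyond the range, semivariance equals nugget + partial sill (the sill)
gamma[h == 60]
plot(h, gamma, type = "l", xlab = "Distance", ylab = "Semivariance",
     main = "Spherical Variogram Model")
```

The curve rises from the nugget, then levels off at the sill once the distance reaches the range, which is where spatial correlation effectively vanishes.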


Step 2: Ordinary Kriging for Malaria Risk Mapping


# Create a grid for interpolation
# Define the extent of your study area (slightly larger than the point data)
bbox <- bbox(malaria_points)
x_range <- bbox[1, ] + c(-0.1, 0.1)  # Add some buffer
y_range <- bbox[2, ] + c(-0.1, 0.1)

# Create a grid with desired resolution
grid_res <- 0.01  # Adjust based on your needs and area size
grid <- expand.grid(
  longitude = seq(x_range[1], x_range[2], by = grid_res),
  latitude = seq(y_range[1], y_range[2], by = grid_res)
)
coordinates(grid) <- c("longitude", "latitude")
gridded(grid) <- TRUE
proj4string(grid) <- proj4string(malaria_points)

# Perform ordinary kriging
krige_model <- gstat(formula = prevalence ~ 1, 
                   data = malaria_points,
                   model = v_fit)

# Make predictions on the grid
krige_pred <- predict(krige_model, grid)

# Extract prediction and variance
pred_raster <- raster(krige_pred["var1.pred"])
var_raster <- raster(krige_pred["var1.var"])
sd_raster <- sqrt(var_raster)  # Standard deviation raster

# Plot the prediction raster
plot(pred_raster, main = "Kriging Prediction: Malaria Prevalence")

# Plot the prediction standard deviation
plot(sd_raster, main = "Kriging Standard Deviation")

# Convert to data frame for ggplot
pred_df <- as.data.frame(krige_pred)
# Columns come out as var1.pred, var1.var plus the coordinates, so rename by name
names(pred_df)[names(pred_df) == "var1.pred"] <- "prediction"
names(pred_df)[names(pred_df) == "var1.var"] <- "variance"
pred_df$sd <- sqrt(pred_df$variance)

# Create nicer maps with ggplot
ggplot(pred_df, aes(x = longitude, y = latitude, fill = prediction)) +
  geom_raster() +
  scale_fill_viridis_c(option = "plasma", name = "Prevalence") +
  labs(
    title = "Kriged Malaria Prevalence Surface",
    subtitle = "Ordinary kriging interpolation",
    x = "Longitude",
    y = "Latitude"
  ) +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  coord_fixed()

# Plot the prediction uncertainty
ggplot(pred_df, aes(x = longitude, y = latitude, fill = sd)) +
  geom_raster() +
  scale_fill_viridis_c(option = "inferno", name = "SD") +
  labs(
    title = "Kriging Interpolation Uncertainty",
    subtitle = "Standard deviation of predictions",
    x = "Longitude",
    y = "Latitude"
  ) +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  coord_fixed()

# Add original data points to the prediction map
ggplot() +
  geom_raster(data = pred_df, aes(x = longitude, y = latitude, fill = prediction)) +
  geom_point(data = as.data.frame(malaria_points), 
            aes(x = longitude, y = latitude, size = prevalence),
            color = "black", alpha = 0.7) +
  scale_fill_viridis_c(option = "plasma", name = "Prevalence") +
  labs(
    title = "Kriged Malaria Prevalence with Sampling Locations",
    x = "Longitude",
    y = "Latitude",
    size = "Observed Prevalence"
  ) +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  coord_fixed()

# Probability of exceeding a threshold
threshold <- 0.20  # e.g., 20% prevalence
# Calculate probability under normal distribution assumption
pred_df$prob_exceed <- pnorm(threshold, mean = pred_df$prediction, sd = pred_df$sd, lower.tail = FALSE)

# Map the exceedance probabilities
ggplot(pred_df, aes(x = longitude, y = latitude, fill = prob_exceed)) +
  geom_raster() +
  scale_fill_gradientn(colors = c("green", "yellow", "red"), 
                      limits = c(0, 1), name = "Probability") +
  labs(
    title = "Probability of Malaria Prevalence Exceeding 20%",
    subtitle = "Based on kriging prediction and uncertainty",
    x = "Longitude",
    y = "Latitude"
  ) +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  coord_fixed()

# Save the results for further use
writeRaster(pred_raster, "malaria_prevalence_kriged.tif", overwrite = TRUE)
writeRaster(sd_raster, "malaria_uncertainty_kriged.tif", overwrite = TRUE)
            

🔍 Explanation:

✔ Ordinary kriging creates a continuous surface from discrete point data
✔ The method honors the original data points while interpolating between them
✔ Kriging provides both predictions and associated uncertainties
✔ Higher uncertainty typically occurs in areas far from sampling points
✔ Exceedance probability maps help identify high-risk areas for targeted interventions
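Under the hood, ordinary kriging solves a small linear system for the interpolation weights. A self-contained sketch for three hypothetical sample points and an exponential covariance (invented distances and range, not the fitted model above):

```r
# Ordinary kriging weights for one prediction location (toy example)
# Exponential covariance with unit sill and range parameter 10
cov_fun <- function(d) exp(-d / 10)

# Distances among 3 sample points, and from each point to the target location
d_points <- matrix(c(0, 4, 7,
                     4, 0, 5,
                     7, 5, 0), nrow = 3)
d_target <- c(2, 3, 6)

# Ordinary kriging system: augment with a Lagrange multiplier so weights sum to 1
A <- rbind(cbind(cov_fun(d_points), 1), c(1, 1, 1, 0))
b <- c(cov_fun(d_target), 1)
sol <- solve(A, b)

weights <- sol[1:3]  # interpolation weights (sum to 1)
sum(weights)
# Kriging variance: sill minus weighted target covariances, minus the multiplier
pred_var <- 1 - sum(weights * cov_fun(d_target)) - sol[4]
pred_var
```

Closer points receive larger weights, and the kriging variance grows as the target moves away from the samples, which is why uncertainty maps peak in sparsely sampled areas.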


Step 3: Universal Kriging with Environmental Covariates


# Universal kriging (kriging with external drift) incorporates covariates
# For malaria, we might use environmental factors like elevation, rainfall, temperature

# First, make sure our point data includes these covariates
print(names(malaria_points))

if(all(c("elevation", "rainfall", "temperature") %in% names(malaria_points@data))) {
  # Explore relationships between prevalence and covariates
  malaria_df <- as.data.frame(malaria_points)
  
  # Scatterplots
  par(mfrow = c(1, 3))
  plot(malaria_df$prevalence ~ malaria_df$elevation, 
       main = "Prevalence vs Elevation", pch = 16)
  abline(lm(prevalence ~ elevation, data = malaria_df), col = "red")
  
  plot(malaria_df$prevalence ~ malaria_df$rainfall, 
       main = "Prevalence vs Rainfall", pch = 16)
  abline(lm(prevalence ~ rainfall, data = malaria_df), col = "red")
  
  plot(malaria_df$prevalence ~ malaria_df$temperature, 
       main = "Prevalence vs Temperature", pch = 16)
  abline(lm(prevalence ~ temperature, data = malaria_df), col = "red")
  par(mfrow = c(1, 1))
  
  # Create raster layers for covariates (you would typically have these already)
  # For this example, we'll create them from the point data using IDW interpolation
  
  # Interpolate elevation
  idw_elev <- idw(elevation ~ 1, malaria_points, grid)
  elev_raster <- raster(idw_elev["var1.pred"])
  
  # Interpolate rainfall
  idw_rain <- idw(rainfall ~ 1, malaria_points, grid)
  rain_raster <- raster(idw_rain["var1.pred"])
  
  # Interpolate temperature
  idw_temp <- idw(temperature ~ 1, malaria_points, grid)
  temp_raster <- raster(idw_temp["var1.pred"])
  
  # Create a raster stack of covariates
  cov_stack <- stack(elev_raster, rain_raster, temp_raster)
  names(cov_stack) <- c("elevation", "rainfall", "temperature")
  
  # Extract covariate values at grid points
  # (the IDW predictions are already on the same grid, so assign them directly)
  grid$elevation <- idw_elev$var1.pred
  grid$rainfall <- idw_rain$var1.pred
  grid$temperature <- idw_temp$var1.pred
  
  # Perform variogram analysis with trend model
  v_trend <- variogram(prevalence ~ elevation + rainfall + temperature, 
                      data = malaria_points)
  v_trend_fit <- fit.variogram(v_trend, model = vgm(psill = NA, model = "Sph", range = NA, nugget = NA))
  plot(v_trend, v_trend_fit, main = "Variogram with Trend Model")
  
  # Universal kriging (kriging with external drift)
  uk_model <- gstat(formula = prevalence ~ elevation + rainfall + temperature, 
                   data = malaria_points,
                   model = v_trend_fit)
  
  # Make predictions
  uk_pred <- predict(uk_model, grid)
  
  # Extract prediction and variance
  uk_pred_raster <- raster(uk_pred["var1.pred"])
  uk_var_raster <- raster(uk_pred["var1.var"])
  uk_sd_raster <- sqrt(uk_var_raster)
  
  # Plot the universal kriging prediction
  plot(uk_pred_raster, main = "Universal Kriging: Malaria Prevalence")
  
  # Convert to data frame for ggplot
  uk_pred_df <- as.data.frame(uk_pred)
  names(uk_pred_df)[names(uk_pred_df) == "var1.pred"] <- "prediction"
  names(uk_pred_df)[names(uk_pred_df) == "var1.var"] <- "variance"
  uk_pred_df$sd <- sqrt(uk_pred_df$variance)
  
  # Create map of universal kriging predictions
  ggplot(uk_pred_df, aes(x = longitude, y = latitude, fill = prediction)) +
    geom_raster() +
    scale_fill_viridis_c(option = "plasma", name = "Prevalence") +
    labs(
      title = "Universal Kriging of Malaria Prevalence",
      subtitle = "Incorporating elevation, rainfall, and temperature",
      x = "Longitude",
      y = "Latitude"
    ) +
    theme_minimal() +
    theme(panel.grid = element_blank()) +
    coord_fixed()
  
  # Compare ordinary and universal kriging
  # Extract predictions for both methods at the original points
  # (raster::extract returns one value per point)
  ok_at_points <- extract(pred_raster, malaria_points)
  uk_at_points <- extract(uk_pred_raster, malaria_points)
  
  # Combine with observed values
  validation_df <- data.frame(
    observed = malaria_points$prevalence,
    ok_predicted = ok_at_points,
    uk_predicted = uk_at_points
  )
  
  # Calculate error metrics for both methods
  ok_rmse <- sqrt(mean((validation_df$observed - validation_df$ok_predicted)^2))
  uk_rmse <- sqrt(mean((validation_df$observed - validation_df$uk_predicted)^2))
  
  ok_mae <- mean(abs(validation_df$observed - validation_df$ok_predicted))
  uk_mae <- mean(abs(validation_df$observed - validation_df$uk_predicted))
  
  cat("Ordinary Kriging RMSE:", ok_rmse, "\n")
  cat("Universal Kriging RMSE:", uk_rmse, "\n")
  cat("Ordinary Kriging MAE:", ok_mae, "\n")
  cat("Universal Kriging MAE:", uk_mae, "\n")
  
  # Visualize the comparison
  par(mfrow = c(1, 2))
  plot(validation_df$observed, validation_df$ok_predicted,
       xlab = "Observed", ylab = "Predicted",
       main = paste0("Ordinary Kriging (RMSE = ", round(ok_rmse, 3), ")"),
       pch = 16, xlim = c(0, max(validation_df$observed)),
       ylim = c(0, max(validation_df$observed)))
  abline(0, 1, col = "red")
  
  plot(validation_df$observed, validation_df$uk_predicted,
       xlab = "Observed", ylab = "Predicted",
       main = paste0("Universal Kriging (RMSE = ", round(uk_rmse, 3), ")"),
       pch = 16, xlim = c(0, max(validation_df$observed)),
       ylim = c(0, max(validation_df$observed)))
  abline(0, 1, col = "red")
  par(mfrow = c(1, 1))
  
  # Save the universal kriging results
  writeRaster(uk_pred_raster, "malaria_prevalence_uk.tif", overwrite = TRUE)
  writeRaster(uk_sd_raster, "malaria_uncertainty_uk.tif", overwrite = TRUE)
} else {
  cat("Covariates (elevation, rainfall, temperature) not found in the dataset.\n")
  cat("Universal kriging requires these covariates.\n")
}
            

🔍 Explanation:

✔ Universal kriging (or kriging with external drift) incorporates covariates
✔ For malaria, environmental factors often influence spatial distribution
✔ The method models the trend related to covariates plus residual spatial correlation
✔ Universal kriging typically improves prediction accuracy compared to ordinary kriging
✔ The approach is particularly useful when clear relationships exist between malaria and environmental factors
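
To build intuition for what universal kriging is doing, the base-R sketch below decomposes a prediction into a covariate trend plus an interpolation of the regression residuals. The toy dataset, coefficients, and the IDW residual step are illustrative assumptions, not the course data; real universal kriging interpolates the residuals using the fitted variogram rather than IDW.

```r
# Minimal sketch of trend + residual interpolation (base R only)
# Toy data (hypothetical villages): prevalence declines with elevation
set.seed(42)
n <- 30
toy <- data.frame(
  x = runif(n), y = runif(n),
  elevation = runif(n, 1000, 2000)
)
toy$prevalence <- 0.6 - 0.0002 * toy$elevation + rnorm(n, sd = 0.02)

# Step 1: fit the trend on the covariate
trend <- lm(prevalence ~ elevation, data = toy)

# Step 2: interpolate the residuals at a new location (IDW, power 2)
new_pt <- data.frame(x = 0.5, y = 0.5, elevation = 1500)
d <- sqrt((toy$x - new_pt$x)^2 + (toy$y - new_pt$y)^2)
w <- 1 / d^2
resid_interp <- sum(w * residuals(trend)) / sum(w)

# Step 3: combined prediction = trend + interpolated residual
pred <- predict(trend, new_pt) + resid_interp
cat("Trend-only prediction:", round(predict(trend, new_pt), 3), "\n")
cat("Trend + residual prediction:", round(pred, 3), "\n")
```

The second prediction differs from the first exactly by the locally interpolated residual, which is the extra information universal kriging exploits.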


Step 4: Cross-Validation and Comparison of Interpolation Methods


# Cross-validation to assess interpolation accuracy
# We'll use leave-one-out cross-validation (LOOCV)

# Ordinary kriging cross-validation
ok_cv <- krige.cv(prevalence ~ 1, 
                 malaria_points,
                 model = v_fit)

# Summary of ordinary kriging CV results
print(summary(ok_cv))

# Calculate error metrics
ok_cv$residual <- ok_cv$observed - ok_cv$var1.pred
ok_cv$squared_residual <- ok_cv$residual^2
ok_cv$standardized_residual <- ok_cv$residual / sqrt(ok_cv$var1.var)

ok_rmse_cv <- sqrt(mean(ok_cv$squared_residual))
ok_mae_cv <- mean(abs(ok_cv$residual))
ok_mean_error <- mean(ok_cv$residual)
ok_mean_std_error <- mean(ok_cv$standardized_residual)
ok_std_error_sd <- sd(ok_cv$standardized_residual)

cat("Ordinary Kriging CV Results:\n")
cat("RMSE:", ok_rmse_cv, "\n")
cat("MAE:", ok_mae_cv, "\n")
cat("Mean Error:", ok_mean_error, "- Should be close to 0 for unbiased predictions\n")
cat("Mean Standardized Error:", ok_mean_std_error, "- Should be close to 0\n")
cat("SD of Standardized Errors:", ok_std_error_sd, "- Should be close to 1 if uncertainty is well-estimated\n")

# Universal kriging cross-validation
if(exists("v_trend_fit")) {
  uk_cv <- krige.cv(prevalence ~ elevation + rainfall + temperature, 
                   malaria_points,
                   model = v_trend_fit)
  
  # Calculate error metrics for universal kriging
  uk_cv$residual <- uk_cv$observed - uk_cv$var1.pred
  uk_cv$squared_residual <- uk_cv$residual^2
  uk_cv$standardized_residual <- uk_cv$residual / sqrt(uk_cv$var1.var)
  
  uk_rmse_cv <- sqrt(mean(uk_cv$squared_residual))
  uk_mae_cv <- mean(abs(uk_cv$residual))
  uk_mean_error <- mean(uk_cv$residual)
  uk_mean_std_error <- mean(uk_cv$standardized_residual)
  uk_std_error_sd <- sd(uk_cv$standardized_residual)
  
  cat("\nUniversal Kriging CV Results:\n")
  cat("RMSE:", uk_rmse_cv, "\n")
  cat("MAE:", uk_mae_cv, "\n")
  cat("Mean Error:", uk_mean_error, "- Should be close to 0 for unbiased predictions\n")
  cat("Mean Standardized Error:", uk_mean_std_error, "- Should be close to 0\n")
  cat("SD of Standardized Errors:", uk_std_error_sd, "- Should be close to 1 if uncertainty is well-estimated\n")
  
  # Compare the methods
  cat("\nImprovement with Universal Kriging:\n")
  cat("RMSE reduction:", round((ok_rmse_cv - uk_rmse_cv) / ok_rmse_cv * 100, 1), "%\n")
  cat("MAE reduction:", round((ok_mae_cv - uk_mae_cv) / ok_mae_cv * 100, 1), "%\n")
}

# Compare kriging with simpler methods (IDW)
idw_cv <- krige.cv(prevalence ~ 1, 
                  malaria_points,
                  model = NULL,  # NULL model means IDW
                  nmax = 12,     # Number of neighbors to use
                  nfold = nrow(malaria_points),  # LOOCV
                  set = list(idp = 2))  # Power parameter for IDW

idw_cv$residual <- idw_cv$observed - idw_cv$var1.pred
idw_rmse_cv <- sqrt(mean(idw_cv$residual^2))
idw_mae_cv <- mean(abs(idw_cv$residual))

cat("\nIDW Cross-Validation Results:\n")
cat("RMSE:", idw_rmse_cv, "\n")
cat("MAE:", idw_mae_cv, "\n")

# Visualize the cross-validation results
# Combine CV results from all methods
cv_results <- data.frame(
  Method = c(rep("Ordinary Kriging", nrow(ok_cv)),
            if(exists("uk_cv")) rep("Universal Kriging", nrow(uk_cv)) else NULL,
            rep("IDW", nrow(idw_cv))),
  Observed = c(ok_cv$observed,
              if(exists("uk_cv")) uk_cv$observed else NULL,
              idw_cv$observed),
  Predicted = c(ok_cv$var1.pred,
               if(exists("uk_cv")) uk_cv$var1.pred else NULL,
               idw_cv$var1.pred),
  Residual = c(ok_cv$residual,
              if(exists("uk_cv")) uk_cv$residual else NULL,
              idw_cv$residual)
)

# Create a comparison plot
ggplot(cv_results, aes(x = Observed, y = Predicted, color = Method)) +
  geom_point(alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(
    title = "Cross-Validation: Observed vs. Predicted",
    subtitle = "Comparison of interpolation methods",
    x = "Observed Prevalence",
    y = "Predicted Prevalence"
  ) +
  theme_minimal() +
  facet_wrap(~ Method)

# Residual analysis
ggplot(cv_results, aes(x = Predicted, y = Residual, color = Method)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals by Predicted Value",
    subtitle = "Check for systematic prediction errors",
    x = "Predicted Value",
    y = "Residual (Observed - Predicted)"
  ) +
  theme_minimal() +
  facet_wrap(~ Method)

# RMSE comparison
method_comparison <- data.frame(
  Method = c("Ordinary Kriging", if(exists("uk_cv")) "Universal Kriging" else NULL, "IDW"),
  RMSE = c(ok_rmse_cv, if(exists("uk_cv")) uk_rmse_cv else NULL, idw_rmse_cv),
  MAE = c(ok_mae_cv, if(exists("uk_cv")) uk_mae_cv else NULL, idw_mae_cv)
)

ggplot(method_comparison, aes(x = Method, y = RMSE, fill = Method)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(RMSE, 3)), vjust = -0.5) +
  labs(
    title = "RMSE Comparison Across Interpolation Methods",
    x = "",
    y = "Root Mean Square Error"
  ) +
  theme_minimal()
            

🔍 Explanation:

✔ Cross-validation assesses interpolation accuracy by predicting at known points
✔ Leave-one-out cross-validation (LOOCV) predicts each point using all other points
✔ Key metrics include RMSE, MAE, and standardized errors
✔ Universal kriging typically outperforms ordinary kriging when relevant covariates are available
✔ Standardized errors help assess whether uncertainty estimates are reliable
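
The bookkeeping that krige.cv automates can be written out by hand. The base-R sketch below runs leave-one-out cross-validation for a plain IDW interpolator on a hypothetical toy dataset (all names and values here are illustrative), producing the same observed-minus-predicted metrics used above.

```r
# Hand-rolled LOOCV for IDW (base R only; toy data are hypothetical)
set.seed(1)
n <- 25
toy <- data.frame(x = runif(n), y = runif(n))
toy$prev <- 0.2 + 0.3 * toy$x + rnorm(n, sd = 0.03)

# IDW prediction at one target point from a training set
idw_pred <- function(train, target, p = 2) {
  d <- sqrt((train$x - target$x)^2 + (train$y - target$y)^2)
  w <- 1 / d^p
  sum(w * train$prev) / sum(w)
}

# Leave each point out in turn and predict it from the rest
loo <- sapply(seq_len(n), function(i) idw_pred(toy[-i, ], toy[i, ]))

residuals_loo <- toy$prev - loo
rmse <- sqrt(mean(residuals_loo^2))
mae <- mean(abs(residuals_loo))
cat("LOOCV RMSE:", round(rmse, 4), " MAE:", round(mae, 4), "\n")
```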


3️⃣ Hotspot Analysis for Malaria Surveillance

Step 1: Understanding Spatial Clustering of Malaria


# Install and load necessary packages
install.packages(c("spdep", "sf", "ggplot2", "dplyr", "mapview"))
library(spdep)
library(sf)
library(ggplot2)
library(dplyr)
library(mapview)

# Load malaria incidence data
# This could be point data or polygon data (e.g., by district)
malaria_data <- read.csv("malaria_districts.csv")

# Convert to sf object for spatial analysis
# For polygon data
if("geometry" %in% names(malaria_data) || "WKT" %in% names(malaria_data)) {
  # If geometry is provided as Well-Known Text (WKT)
  if("WKT" %in% names(malaria_data)) {
    malaria_sf <- st_as_sf(malaria_data, wkt = "WKT")
  } else {
    # Directly use the geometry column
    malaria_sf <- st_as_sf(malaria_data)
  }
} else if(all(c("longitude", "latitude") %in% names(malaria_data))) {
  # For point data with coordinates
  malaria_sf <- st_as_sf(malaria_data, coords = c("longitude", "latitude"), 
                       crs = 4326)
} else {
  stop("No geometry column (WKT) or longitude/latitude coordinates found in the data.")
}

# Set coordinate reference system if not already set
if(is.na(st_crs(malaria_sf))) {
  st_crs(malaria_sf) <- 4326  # WGS84
}

# Examine the data
print(head(malaria_sf))

# Visualize malaria incidence across districts
ggplot(malaria_sf) +
  geom_sf(aes(fill = incidence)) +
  scale_fill_viridis_c(option = "plasma") +
  labs(
    title = "Malaria Incidence by District",
    fill = "Incidence"
  ) +
  theme_minimal()

# Interactive map
mapview(malaria_sf, zcol = "incidence")

# Create a spatial weights matrix to define neighbors
# For polygon data (e.g., districts)
if(length(unique(st_geometry_type(malaria_sf))) == 1 && 
   unique(st_geometry_type(malaria_sf)) %in% c("POLYGON", "MULTIPOLYGON")) {
  
  # Queen contiguity (neighbors share at least one point)
  nb_queen <- poly2nb(malaria_sf, queen = TRUE)
  
  # Rook contiguity (neighbors share edges)
  nb_rook <- poly2nb(malaria_sf, queen = FALSE)
  
  # Compare the number of neighbors
  par(mfrow = c(1, 2))
  plot(st_geometry(malaria_sf), main = "Queen Contiguity")
  plot(nb_queen, st_coordinates(st_centroid(malaria_sf)), add = TRUE)
  
  plot(st_geometry(malaria_sf), main = "Rook Contiguity")
  plot(nb_rook, st_coordinates(st_centroid(malaria_sf)), add = TRUE)
  par(mfrow = c(1, 1))
  
  # We'll proceed with queen contiguity
  nb <- nb_queen
  
} else {
  # For point data, use distance-based neighbors
  # k-nearest neighbors
  coords <- st_coordinates(malaria_sf)
  nb_knn <- knn2nb(knearneigh(coords, k = 5))
  
  # Distance-based neighbors (connect points within a certain distance)
  d <- 0.1  # Distance threshold (in map units)
  nb_dist <- dnearneigh(coords, 0, d)
  
  # Visualize the neighbor connections
  par(mfrow = c(1, 2))
  plot(st_geometry(malaria_sf), main = "K-Nearest Neighbors (k=5)")
  plot(nb_knn, coords, add = TRUE)
  
  plot(st_geometry(malaria_sf), main = paste0("Distance-based (d=", d, ")"))
  plot(nb_dist, coords, add = TRUE)
  par(mfrow = c(1, 1))
  
  # We'll proceed with k-nearest neighbors
  nb <- nb_knn
}

# Create spatial weights matrix
# Row-standardized weights
w_row <- nb2listw(nb, style = "W")

# Binary weights
w_bin <- nb2listw(nb, style = "B")

# Global spatial autocorrelation (Moran's I)
moran_test <- moran.test(malaria_sf$incidence, w_row)
print(moran_test)

# Moran's I test results
cat("Moran's I statistic:", moran_test$estimate[1], "\n")
cat("Expected value (no autocorrelation):", moran_test$estimate[2], "\n")
cat("Variance:", moran_test$estimate[3], "\n")
cat("p-value:", moran_test$p.value, "\n")

# Interpret Moran's I
if(moran_test$p.value < 0.05) {
  if(moran_test$estimate[1] > moran_test$estimate[2]) {
    cat("There is significant positive spatial autocorrelation (clustering).\n")
  } else {
    cat("There is significant negative spatial autocorrelation (dispersion).\n")
  }
} else {
  cat("There is no significant spatial autocorrelation (random pattern).\n")
}

# Moran scatter plot
mp <- moran.plot(malaria_sf$incidence, w_row,
                labels = as.character(1:nrow(malaria_sf)),
                xlab = "Malaria Incidence", 
                ylab = "Spatially Lagged Incidence")

# Monte Carlo simulation to assess significance
moran_mc <- moran.mc(malaria_sf$incidence, w_row, nsim = 999)
plot(moran_mc, main = "Monte Carlo Simulation of Moran's I")
            

🔍 Explanation:

✔ Spatial clustering analysis assesses whether malaria cases are randomly distributed
✔ Neighbor relationships can be defined using contiguity or distance
✔ Moran's I measures global spatial autocorrelation
✔ Positive Moran's I indicates clustering, negative indicates dispersion
✔ The Moran scatter plot visualizes local relationships between values and their neighbors
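
Moran's I is simple to compute from its definition. A minimal base-R example on four hypothetical areas in a row (two high-incidence next to two low-incidence; the values and weights are illustrative) shows where the statistic comes from before handing the job to moran.test.

```r
# I = (n / S0) * sum_ij[ w_ij (x_i - xbar)(x_j - xbar) ] / sum_i (x_i - xbar)^2
x <- c(10, 12, 3, 2)          # incidence in four areas along a line
W <- matrix(c(0, 1, 0, 0,     # binary contiguity: each area borders
              1, 0, 1, 0,     # only its immediate neighbors
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

n <- length(x)
z <- x - mean(x)              # deviations from the mean
S0 <- sum(W)                  # total weight
I <- (n / S0) * sum(W * outer(z, z)) / sum(z^2)
cat("Moran's I:", round(I, 3), "\n")   # positive: similar values cluster
</imports>
```

With high areas adjacent to high and low adjacent to low, the cross-products of deviations are mostly positive, so I comes out positive, which is exactly the clustering that moran.test flags.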


Step 2: Local Indicators of Spatial Association (LISA)


# Local Indicators of Spatial Association (LISA)
# Calculate local Moran's I
local_moran <- localmoran(malaria_sf$incidence, w_row)

# Examine the results
head(local_moran)

# Add local Moran's I results to the data
malaria_sf$local_moran_i <- local_moran[, "Ii"]  # Local Moran's I statistic
malaria_sf$local_moran_p <- local_moran[, 5]     # p-value (5th column; its name varies across spdep versions)

# Map of local Moran's I values
ggplot(malaria_sf) +
  geom_sf(aes(fill = local_moran_i)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(
    title = "Local Moran's I",
    subtitle = "Red = positive spatial autocorrelation, Blue = negative",
    fill = "Local Moran's I"
  ) +
  theme_minimal()

# Map of statistical significance
malaria_sf$significant <- ifelse(malaria_sf$local_moran_p < 0.05, "Significant", "Not Significant")

ggplot(malaria_sf) +
  geom_sf(aes(fill = significant)) +
  scale_fill_manual(values = c("Significant" = "red", "Not Significant" = "gray")) +
  labs(
    title = "Statistical Significance of Local Moran's I",
    subtitle = "Areas with significant spatial autocorrelation (p < 0.05)",
    fill = "Significance"
  ) +
  theme_minimal()

# Classify clusters into High-High, Low-Low, High-Low, Low-High
# First, calculate spatially lagged incidence
malaria_sf$inc_lag <- lag.listw(w_row, malaria_sf$incidence)

# Mean values for classification
mean_inc <- mean(malaria_sf$incidence)
mean_inc_lag <- mean(malaria_sf$inc_lag)

# Classify into LISA categories
malaria_sf$lisa_cluster <- NA  # Initialize
malaria_sf$lisa_cluster[malaria_sf$incidence >= mean_inc & 
                       malaria_sf$inc_lag >= mean_inc_lag & 
                       malaria_sf$local_moran_p < 0.05] <- "High-High"
malaria_sf$lisa_cluster[malaria_sf$incidence < mean_inc & 
                       malaria_sf$inc_lag < mean_inc_lag & 
                       malaria_sf$local_moran_p < 0.05] <- "Low-Low"
malaria_sf$lisa_cluster[malaria_sf$incidence >= mean_inc & 
                       malaria_sf$inc_lag < mean_inc_lag & 
                       malaria_sf$local_moran_p < 0.05] <- "High-Low"
malaria_sf$lisa_cluster[malaria_sf$incidence < mean_inc & 
                       malaria_sf$inc_lag >= mean_inc_lag & 
                       malaria_sf$local_moran_p < 0.05] <- "Low-High"
malaria_sf$lisa_cluster[malaria_sf$local_moran_p >= 0.05] <- "Not Significant"

# Create a color-coded map of LISA clusters
lisa_colors <- c("High-High" = "red", "Low-Low" = "blue", 
                "High-Low" = "pink", "Low-High" = "lightblue", 
                "Not Significant" = "white")

ggplot(malaria_sf) +
  geom_sf(aes(fill = lisa_cluster)) +
  scale_fill_manual(values = lisa_colors) +
  labs(
    title = "LISA Cluster Map for Malaria Incidence",
    subtitle = "Identifying hotspots (High-High) and coldspots (Low-Low)",
    fill = "Cluster Type"
  ) +
  theme_minimal()

# Interpret the results
cluster_counts <- table(malaria_sf$lisa_cluster)
print(cluster_counts)

cat("Interpretation of LISA clusters:\n")
cat("High-High: Hotspots (high incidence areas surrounded by high incidence)\n")
cat("Low-Low: Coldspots (low incidence areas surrounded by low incidence)\n")
cat("High-Low: Spatial outliers (high incidence surrounded by low incidence)\n")
cat("Low-High: Spatial outliers (low incidence surrounded by high incidence)\n")
cat("Not Significant: No significant spatial pattern\n")

# Interactive map of LISA clusters
mapview(malaria_sf, zcol = "lisa_cluster")
            

🔍 Explanation:

✔ Local Indicators of Spatial Association (LISA) identify significant local clusters
✔ Local Moran's I detects hotspots, coldspots, and spatial outliers
✔ High-High clusters (hotspots) represent areas with high incidence surrounded by high values
✔ Low-Low clusters (coldspots) are areas with low incidence surrounded by low values
✔ Spatial outliers (High-Low or Low-High) indicate areas that differ from their neighbors
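
Each area's local Moran's I is just its own deviation times the (weighted) average deviation of its neighbors. The base-R sketch below computes it for the same kind of four-area toy layout used to illustrate global Moran's I (values are hypothetical, and we use the m2 = sum(z^2)/n scaling):

```r
# Ii = (z_i / m2) * sum_j w_ij z_j, with z_i = x_i - xbar
x <- c(10, 12, 3, 2)           # two high areas next to two low areas
W <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
Wr <- W / rowSums(W)           # row-standardized, as with style = "W"

z <- x - mean(x)
m2 <- sum(z^2) / length(x)
Ii <- (z / m2) * as.vector(Wr %*% z)
print(round(Ii, 3))
# Areas 1 and 4 score strongly positive (like surrounded by like);
# areas 2 and 3 sit on the high/low boundary, so their lags nearly cancel
```

This is why LISA can flag some areas as significant clusters while their immediate neighbors, sitting on a transition, are not.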


Step 3: Getis-Ord Gi* Hotspot Analysis


# Getis-Ord Gi* statistic for hotspot analysis
# Unlike local Moran's I, Gi* focuses specifically on identifying concentrations of high or low values

# Calculate Getis-Ord Gi* statistic
gi_star <- localG(malaria_sf$incidence, w_row)

# Add Gi* results to the data
malaria_sf$gi_star <- as.numeric(gi_star)

# Map of Gi* z-scores
ggplot(malaria_sf) +
  geom_sf(aes(fill = gi_star)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", 
                      midpoint = 0, name = "Gi* z-score") +
  labs(
    title = "Getis-Ord Gi* Hotspot Analysis",
    subtitle = "Red = hotspot, Blue = coldspot (z-scores)"
  ) +
  theme_minimal()

# Classify areas based on Gi* significance
# Convert Gi* to p-values
gi_pvalues <- 2 * (1 - pnorm(abs(gi_star)))

# Add p-values to the data
malaria_sf$gi_pvalue <- gi_pvalues

# Classify hotspots and coldspots based on significance levels
malaria_sf$hotspot_cat <- NA
# Hotspots
malaria_sf$hotspot_cat[gi_star > 0 & gi_pvalues < 0.01] <- "High (99% confidence)"
malaria_sf$hotspot_cat[gi_star > 0 & gi_pvalues >= 0.01 & gi_pvalues < 0.05] <- "High (95% confidence)"
malaria_sf$hotspot_cat[gi_star > 0 & gi_pvalues >= 0.05 & gi_pvalues < 0.1] <- "High (90% confidence)"
# Coldspots
malaria_sf$hotspot_cat[gi_star < 0 & gi_pvalues < 0.01] <- "Low (99% confidence)"
malaria_sf$hotspot_cat[gi_star < 0 & gi_pvalues >= 0.01 & gi_pvalues < 0.05] <- "Low (95% confidence)"
malaria_sf$hotspot_cat[gi_star < 0 & gi_pvalues >= 0.05 & gi_pvalues < 0.1] <- "Low (90% confidence)"
# Not significant
malaria_sf$hotspot_cat[gi_pvalues >= 0.1 | is.na(gi_pvalues)] <- "Not Significant"

# Create a factor with ordered levels
hotspot_levels <- c("High (99% confidence)", "High (95% confidence)", "High (90% confidence)",
                   "Not Significant",
                   "Low (90% confidence)", "Low (95% confidence)", "Low (99% confidence)")
malaria_sf$hotspot_cat <- factor(malaria_sf$hotspot_cat, levels = hotspot_levels)

# Define color scheme for hotspots
hotspot_colors <- c("High (99% confidence)" = "#d7191c",
                   "High (95% confidence)" = "#fdae61",
                   "High (90% confidence)" = "#ffffbf",
                   "Not Significant" = "#ffffff",
                   "Low (90% confidence)" = "#abd9e9",
                   "Low (95% confidence)" = "#2c7bb6",
                   "Low (99% confidence)" = "#053061")

# Create a map of Gi* hotspots and coldspots
ggplot(malaria_sf) +
  geom_sf(aes(fill = hotspot_cat)) +
  scale_fill_manual(values = hotspot_colors, name = "Hotspot Category") +
  labs(
    title = "Getis-Ord Gi* Hotspot Analysis for Malaria Incidence",
    subtitle = "Red = hotspots, Blue = coldspots"
  ) +
  theme_minimal()

# Overlay district borders on hotspot map to show context
ggplot() +
  geom_sf(data = malaria_sf, aes(fill = hotspot_cat)) +
  geom_sf(data = malaria_sf, fill = NA, color = "black", size = 0.2) +
  scale_fill_manual(values = hotspot_colors, name = "Hotspot Category") +
  labs(
    title = "Malaria Hotspots by District",
    subtitle = "Getis-Ord Gi* statistic with confidence levels"
  ) +
  theme_minimal()

# Compare Local Moran's I and Getis-Ord Gi*
# Prepare a comparison dataframe
comparison_df <- data.frame(
  District = 1:nrow(malaria_sf),
  Incidence = malaria_sf$incidence,
  LocalMoran = malaria_sf$local_moran_i,
  LocalMoran_pvalue = malaria_sf$local_moran_p,
  LISA_Cluster = malaria_sf$lisa_cluster,
  GiStar = malaria_sf$gi_star,
  GiStar_pvalue = malaria_sf$gi_pvalue,
  Hotspot_Category = malaria_sf$hotspot_cat
)

# Print the first few rows
head(comparison_df)

# Check agreement between Local Moran's I and Gi*
# For simplicity, we'll classify districts as hotspot, coldspot, or neither
comparison_df$moran_hotspot <- ifelse(comparison_df$LISA_Cluster == "High-High", "Hotspot",
                                    ifelse(comparison_df$LISA_Cluster == "Low-Low", "Coldspot", "Neither"))

comparison_df$gistar_hotspot <- ifelse(comparison_df$GiStar > 0 & comparison_df$GiStar_pvalue < 0.05, "Hotspot",
                                     ifelse(comparison_df$GiStar < 0 & comparison_df$GiStar_pvalue < 0.05, "Coldspot", "Neither"))

# Create a confusion matrix
confusion_matrix <- table(LocalMoran = comparison_df$moran_hotspot, 
                        GiStar = comparison_df$gistar_hotspot)
print(confusion_matrix)

# Calculate percentage agreement
agreement <- sum(diag(confusion_matrix)) / sum(confusion_matrix) * 100
cat("Agreement between Local Moran's I and Getis-Ord Gi*:", round(agreement, 1), "%\n")

# Visualize both methods side by side
# (pivot_longer() is from tidyr, which must be loaded; gather() is superseded)
library(tidyr)

side_by_side <- malaria_sf %>%
  mutate(hotspot_cat = as.character(hotspot_cat)) %>%
  select(incidence, lisa_cluster, hotspot_cat) %>%
  pivot_longer(cols = c(lisa_cluster, hotspot_cat),
               names_to = "method", values_to = "classification")

# Reuse the two named palettes so each classification keeps its colour
# (an unnamed colour vector would be matched to the categories arbitrarily)
all_colors <- c(lisa_colors, hotspot_colors)

# Plot the side-by-side comparison
ggplot(side_by_side) +
  geom_sf(aes(fill = classification)) +
  facet_wrap(~ method, labeller = labeller(method = c(lisa_cluster = "Local Moran's I",
                                                      hotspot_cat = "Getis-Ord Gi*"))) +
  scale_fill_manual(values = all_colors) +
  labs(
    title = "Comparison of Local Moran's I and Getis-Ord Gi*",
    subtitle = "Identifying malaria hotspots and coldspots",
    fill = "Classification"
  ) +
  theme_minimal()
            

🔍 Explanation:

✔ The Getis-Ord Gi* statistic specifically identifies hotspots and coldspots
✔ Positive Gi* values indicate clustering of high values (hotspots)
✔ Negative Gi* values indicate clustering of low values (coldspots)
✔ Statistical significance is determined using z-scores and p-values
✔ Hotspots with different confidence levels (90%, 95%, 99%) can be identified
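
The Gi* z-score can likewise be computed directly from its definition. The base-R sketch below does so for the same four-area toy example (hypothetical values; note that the focal area is included in its own neighborhood, which is what distinguishes Gi* from Gi). With only four areas the normal approximation is meaningless, so treat this purely as arithmetic:

```r
x <- c(10, 12, 3, 2)
W <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
Wstar <- W
diag(Wstar) <- 1               # Gi* includes the focal area itself

n <- length(x)
xbar <- mean(x)
S <- sqrt(sum(x^2) / n - xbar^2)

# Gi* = (sum_j w_ij x_j - xbar * sum_j w_ij) /
#       (S * sqrt((n * sum_j w_ij^2 - (sum_j w_ij)^2) / (n - 1)))
gi_star_toy <- apply(Wstar, 1, function(w) {
  num <- sum(w * x) - xbar * sum(w)
  den <- S * sqrt((n * sum(w^2) - sum(w)^2) / (n - 1))
  num / den
})
print(round(gi_star_toy, 3))   # positive for the high pair, negative for the low pair
```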


Step 4: Temporal Hotspot Analysis and Applications


# Temporal hotspot analysis
# This requires data for multiple time periods
# Let's assume we have data for several years

# Check if we have temporal data available
has_time_data <- FALSE

if("year" %in% names(malaria_sf) && length(unique(malaria_sf$year)) > 1) {
  has_time_data <- TRUE
  # Analyze how hotspots change over time
  years <- sort(unique(malaria_sf$year))
  
  # Create a list to store yearly hotspot results
  yearly_hotspots <- list()
  
  for(yr in years) {
    # Subset data for the current year
    year_data <- malaria_sf %>% filter(year == yr)
    
    # Rebuild neighbors from this year's subset so the weights matrix
    # matches its length and row order
    # (for point data, substitute knearneigh()/knn2nb() here)
    year_nb <- poly2nb(year_data, queen = TRUE)
    year_listw <- nb2listw(year_nb, style = "W")
    
    # Calculate Gi* for the year
    year_gi <- localG(year_data$incidence, year_listw)
    
    # Add results to the data
    year_data$gi_star <- as.numeric(year_gi)
    year_data$gi_pvalue <- 2 * (1 - pnorm(abs(year_gi)))
    
    # Classify hotspots
    year_data$hotspot_cat <- NA
    # Hotspots
    year_data$hotspot_cat[year_data$gi_star > 0 & year_data$gi_pvalue < 0.05] <- "Hotspot"
    # Coldspots
    year_data$hotspot_cat[year_data$gi_star < 0 & year_data$gi_pvalue < 0.05] <- "Coldspot"
    # Not significant
    year_data$hotspot_cat[year_data$gi_pvalue >= 0.05 | is.na(year_data$gi_pvalue)] <- "Not Significant"
    
    # Store results
    yearly_hotspots[[as.character(yr)]] <- year_data
  }
  
  # Create panels showing hotspot evolution over time
  hotspot_panels <- do.call(rbind, yearly_hotspots)
  
  # Plot temporal hotspots
  ggplot(hotspot_panels) +
    geom_sf(aes(fill = hotspot_cat)) +
    scale_fill_manual(values = c("Hotspot" = "red", "Coldspot" = "blue", 
                               "Not Significant" = "white")) +
    facet_wrap(~ year) +
    labs(
      title = "Temporal Evolution of Malaria Hotspots",
      subtitle = "Red = hotspots, Blue = coldspots",
      fill = "Classification"
    ) +
    theme_minimal()
  
  # Track persistent hotspots
  # Identify districts that are hotspots in multiple years
  # (assumes every year contains the same districts in the same order)
  first_year <- yearly_hotspots[[as.character(years[1])]]
  n_districts <- nrow(first_year)
  hotspot_years <- data.frame(
    District = 1:n_districts,
    district_name = if("name" %in% names(first_year)) first_year$name else paste("District", 1:n_districts),
    total_years = length(years),
    hotspot_years = 0,
    coldspot_years = 0
  )
  
  for(yr in years) {
    year_data <- yearly_hotspots[[as.character(yr)]]
    hotspot_years$hotspot_years <- hotspot_years$hotspot_years + 
      (year_data$hotspot_cat == "Hotspot")
    hotspot_years$coldspot_years <- hotspot_years$coldspot_years + 
      (year_data$hotspot_cat == "Coldspot")
  }
  
  # Calculate persistence indicators
  hotspot_years$hotspot_persistence <- hotspot_years$hotspot_years / hotspot_years$total_years
  hotspot_years$coldspot_persistence <- hotspot_years$coldspot_years / hotspot_years$total_years
  
  # Add persistence indicators back to the spatial data
  # (with one row per district-year, join by district instead of assigning directly)
  malaria_sf$hotspot_persistence <- hotspot_years$hotspot_persistence
  malaria_sf$coldspot_persistence <- hotspot_years$coldspot_persistence
  
  # Classify persistence
  malaria_sf$persistence_cat <- "Not Persistent"
  malaria_sf$persistence_cat[malaria_sf$hotspot_persistence >= 0.75] <- "Persistent Hotspot"
  malaria_sf$persistence_cat[malaria_sf$hotspot_persistence >= 0.5 & 
                            malaria_sf$hotspot_persistence < 0.75] <- "Semi-Persistent Hotspot"
  malaria_sf$persistence_cat[malaria_sf$coldspot_persistence >= 0.75] <- "Persistent Coldspot"
  malaria_sf$persistence_cat[malaria_sf$coldspot_persistence >= 0.5 & 
                            malaria_sf$coldspot_persistence < 0.75] <- "Semi-Persistent Coldspot"
  
  # Map of persistence
  ggplot(malaria_sf) +
    geom_sf(aes(fill = persistence_cat)) +
    scale_fill_manual(values = c("Persistent Hotspot" = "red", 
                               "Semi-Persistent Hotspot" = "orange",
                               "Persistent Coldspot" = "blue", 
                               "Semi-Persistent Coldspot" = "lightblue",
                               "Not Persistent" = "white")) +
    labs(
      title = "Persistence of Malaria Hotspots Over Time",
      subtitle = "Districts that consistently appear as hotspots across years",
      fill = "Persistence Category"
    ) +
    theme_minimal()
}

# If we don't have time series data, proceed with spatial hotspot application
# Prioritizing intervention areas based on hotspot analysis

# Create an intervention priority index
malaria_sf$priority_score <- NA

# If we have temporal data, use persistence as a factor
if(has_time_data) {
  # Weight based on hotspot persistence and current incidence
  malaria_sf$priority_score <- scale(malaria_sf$hotspot_persistence)[,1] * 0.7 + 
                             scale(malaria_sf$incidence)[,1] * 0.3
} else {
  # Weight based on Gi* z-score and incidence
  malaria_sf$priority_score <- scale(malaria_sf$gi_star)[,1] * 0.6 + 
                             scale(malaria_sf$incidence)[,1] * 0.4
}

# Classify into priority levels
malaria_sf$priority_level <- cut(malaria_sf$priority_score, 
                               breaks = c(-Inf, -0.5, 0.5, 1.5, Inf),
                               labels = c("Low", "Medium", "High", "Very High"))

# Map the intervention priorities
ggplot(malaria_sf) +
  geom_sf(aes(fill = priority_level)) +
  scale_fill_manual(values = c("Low" = "#1a9641", "Medium" = "#a6d96a", 
                             "High" = "#fdae61", "Very High" = "#d7191c")) +
  labs(
    title = "Malaria Intervention Priority Areas",
    subtitle = "Based on hotspot analysis and incidence levels",
    fill = "Priority Level"
  ) +
  theme_minimal()

# Calculate population at risk in each priority level
if("population" %in% names(malaria_sf)) {
  priority_pop <- malaria_sf %>%
    st_drop_geometry() %>%
    group_by(priority_level) %>%
    summarize(
      districts = n(),
      total_population = sum(population, na.rm = TRUE),
      total_cases = sum(incidence * population / 1000, na.rm = TRUE),  # Assuming incidence per 1000
      percent_population = round(sum(population, na.rm = TRUE) / 
                                sum(malaria_sf$population, na.rm = TRUE) * 100, 1)
    )
  
  print(priority_pop)
  
  # Calculate resource needs based on intervention type and population
  # Example: bed nets for high and very high priority areas
  # (WHO universal-coverage planning figure: one net per 1.8 people)
  bed_nets_needed <- sum(malaria_sf$population[malaria_sf$priority_level %in% c("High", "Very High")]) / 1.8 / 1000  # in thousands
  
  cat("Estimated bed nets needed (thousands):", round(bed_nets_needed, 1), "\n")
  
  # Chart showing population by priority level
  ggplot(priority_pop, aes(x = priority_level, y = total_population/1000000, fill = priority_level)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(percent_population, "%")), vjust = -0.5) +
    scale_fill_manual(values = c("Low" = "#1a9641", "Medium" = "#a6d96a", 
                               "High" = "#fdae61", "Very High" = "#d7191c")) +
    labs(
      title = "Population by Intervention Priority Level",
      subtitle = "Numbers show percentage of total population",
      x = "Priority Level",
      y = "Population (millions)"
    ) +
    theme_minimal()
}
            

🔍 Explanation:

✔ Temporal hotspot analysis tracks how malaria clusters change over time
✔ Persistent hotspots are areas that consistently show high incidence
✔ Priority indices combine multiple factors (hotspot status, persistence, incidence)
✔ Resource allocation can be optimized by targeting high-priority areas
✔ Population-based metrics help quantify intervention needs
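
The priority index above is just a weighted sum of standardised inputs. A minimal sketch with hypothetical numbers (the districts, weights, and cut points below are illustrative) shows how the pieces combine:

```r
# Four hypothetical districts
persistence <- c(1.0, 0.5, 0.25, 0.0)   # share of years flagged as a hotspot
incidence   <- c(350, 220, 180, 90)     # cases per 1,000 population

# Standardise each input, then combine with 70/30 weights as above
score <- scale(persistence)[, 1] * 0.7 + scale(incidence)[, 1] * 0.3
level <- cut(score,
             breaks = c(-Inf, -0.5, 0.5, 1.5, Inf),
             labels = c("Low", "Medium", "High", "Very High"))

print(data.frame(persistence, incidence, score = round(score, 2), level))
```

Standardising first puts persistence (a proportion) and incidence (a rate) on the same scale, so the 0.7/0.3 weights express relative importance rather than units.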


✅ Summary of Week 11

By the end of this week, you should be able to:

Bayesian Analysis
Fit Bayesian regression models with rstanarm, choose priors, and interpret posterior distributions and credible intervals

Spatial Interpolation
Interpolate malaria prevalence with IDW, ordinary kriging, and universal kriging (gstat), and compare methods with cross-validation

Hotspot Analysis
Detect spatial clustering with Moran's I, map local clusters with LISA and Getis-Ord Gi* (spdep), and translate hotspots into intervention priorities

Next Steps

Now that you've learned Bayesian and spatial analysis techniques for malaria data, you can apply them to your own surveillance datasets.


Additional Resources