Project Introduction

In response to challenges in urban drone navigation, this project develops a predictive risk model to ensure drone safety. Standard map data is often outdated, failing to reflect the true height of obstacles. Our objective is to analyze map data from various sources, model its inaccuracies, and ultimately build a safety-aware pathfinding algorithm that can navigate a grid by predicting the real height of obstacles, not just what the map reports.

Part 1: Project Setup & Data Loading

We begin by loading all necessary libraries and our core dataset.

Load All Required Libraries

# --- Load all required libraries ---

# install.packages("tidyverse")
# install.packages("summarytools")
# install.packages("DataExplorer")
# install.packages("MissMech")
# install.packages("corrplot")
# install.packages("lmtest")
# install.packages("car")
# install.packages("arm")
# install.packages("brglm")
# install.packages("logistf")
# install.packages("brms")
# install.packages("igraph")
# install.packages("posterior") # brms dependency for as_draws_df

# --- Load all required libraries ---
library(tidyverse)
library(summarytools)
library(DataExplorer)
library(ggplot2)
library(dplyr)
library(readr)
library(MissMech)
library(corrplot)
library(lmtest)
library(car)
library(arm)
library(brglm)
library(logistf)
library(brms)
library(igraph)
library(tidyr)
library(posterior)

This block consolidates all packages, making the script clean and easy to run. We’ve included everything from data manipulation (tidyverse) to advanced modeling (brms).

Load the Map Dataset

# --- 2. LOAD AND INSPECT DATA ---

# File location
map_data <- read.csv("C:/Users/karki/Downloads/map_info.csv")

# View first few rows
head(map_data)
##   x y sq map_height real_height reported.obstacles map_last_update map_source
## 1 1 1  1          9          11                  2              33     Google
## 2 1 2  2         18          55                 38              53     Google
## 3 1 3  3         15          42                 42              49     Google
## 4 1 4  4         25          51                 47              59     Google
## 5 1 5  5         25          69                 NA              12     Google
## 6 1 6  6         23          71                 29              55     Google
##   obstacle
## 1        0
## 2        1
## 3        1
## 4        1
## 5        1
## 6        1

The data is loaded successfully. We can see our key columns: coordinates (x, y), map metadata (map_height, map_source), and our two target variables: real_height (continuous) and obstacle (binary).

Initial Data Structure Inspection

str(map_data)
## 'data.frame':    100 obs. of  9 variables:
##  $ x                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ y                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sq                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ map_height        : int  9 18 15 25 25 23 7 37 34 28 ...
##  $ real_height       : int  11 55 42 51 69 71 30 53 41 75 ...
##  $ reported.obstacles: int  2 38 42 47 NA 29 4 49 37 25 ...
##  $ map_last_update   : int  33 53 49 59 12 55 47 76 18 72 ...
##  $ map_source        : chr  "Google" "Google" "Google" "Google" ...
##  $ obstacle          : int  0 1 1 1 1 1 1 1 1 1 ...
# Summary statistics
summary(map_data)
##        x              y              sq           map_height     real_height   
##  Min.   : 1.0   Min.   : 1.0   Min.   :  1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 3.0   1st Qu.: 3.0   1st Qu.: 25.75   1st Qu.:13.50   1st Qu.:23.50  
##  Median : 5.5   Median : 5.5   Median : 50.50   Median :25.00   Median :36.50  
##  Mean   : 5.5   Mean   : 5.5   Mean   : 50.50   Mean   :25.18   Mean   :36.92  
##  3rd Qu.: 8.0   3rd Qu.: 8.0   3rd Qu.: 75.25   3rd Qu.:36.00   3rd Qu.:49.25  
##  Max.   :10.0   Max.   :10.0   Max.   :100.00   Max.   :50.00   Max.   :83.00  
##                                                                                
##  reported.obstacles map_last_update  map_source           obstacle   
##  Min.   : 0.00      Min.   : 0.00   Length:100         Min.   :0.00  
##  1st Qu.: 3.00      1st Qu.:24.00   Class :character   1st Qu.:1.00  
##  Median :16.00      Median :52.00   Mode  :character   Median :1.00  
##  Mean   :21.46      Mean   :50.66                      Mean   :0.79  
##  3rd Qu.:39.00      3rd Qu.:72.00                      3rd Qu.:1.00  
##  Max.   :50.00      Max.   :99.00                      Max.   :1.00  
##  NA's   :7

The summary() output immediately reveals our first data quality issue: the reported.obstacles column has 7 missing values (NA's) that we will need to address.

Part 2: Exploratory Data Analysis (EDA) - Uncovering the Risk

Our first formal task is to apply exploratory data analysis to understand the data’s characteristics, find the problems, and validate our project’s core assumptions.

Numerical Variable Analysis

# Reshape summary output to long format
map_data %>%
  dplyr::select(where(is.numeric)) %>%  # <-- FIX IS HERE
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE)
  ) %>%
  arrange(variable)
## # A tibble: 8 × 6
##   variable            mean median     sd   min   max
##   <chr>              <dbl>  <dbl>  <dbl> <int> <int>
## 1 map_height         25.2    25   14.4       1    50
## 2 map_last_update    50.7    52   29.5       0    99
## 3 obstacle            0.79    1    0.409     0     1
## 4 real_height        36.9    36.5 17.0       1    83
## 5 reported.obstacles 21.5    16   17.1       0    50
## 6 sq                 50.5    50.5 29.0       1   100
## 7 x                   5.5     5.5  2.89      1    10
## 8 y                   5.5     5.5  2.89      1    10

This summary confirms our project’s core premise: the real_height (max 83 ft) is often significantly greater than the map_height (max 50 ft). This discrepancy is the danger we need to model.

Correlation Analysis

# Compute correlation matrix
cor_matrix <- map_data %>%
  dplyr::select(where(is.numeric)) %>% 
  cor(use = "pairwise.complete.obs")

# Visualize correlation matrix
corrplot::corrplot(cor_matrix,      
         method = "color",
         type = "full",
         tl.col = "black",
         tl.cex = 0.8,
         diag = FALSE,
         addCoef.col = "black",
         number.cex = 0.7)

The correlation matrix is very insightful. It confirms map_height is a strong positive predictor for real_height (0.82), making our modeling viable. obstacle also strongly correlates with real_height (0.69), which is expected.

Part 3: Data Cleaning & Feature Engineering

Before we can model, we must handle the missing data identified in our EDA and engineer new features that will help quantify risk.

Step 3.1: Check Missing Value Counts

# Shows NA counts for every column
colSums(is.na(map_data))
##                  x                  y                 sq         map_height 
##                  0                  0                  0                  0 
##        real_height reported.obstacles    map_last_update         map_source 
##                  0                  7                  0                  0 
##           obstacle 
##                  0

This check confirms that all 7 missing values are isolated in the reported.obstacles column.

Step 3.2: Clean Data (Pre-Imputation)

# Clean map_source and remove duplicates
map_data_clean <- map_data %>%
  mutate(map_source = as.factor(tolower(map_source))) %>%
  filter(!duplicated(.))

# Confirm structure
glimpse(map_data_clean)
## Rows: 100
## Columns: 9
## $ x                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
## $ y                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7,…
## $ sq                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ map_height         <int> 9, 18, 15, 25, 25, 23, 7, 37, 34, 28, 50, 37, 48, 1…
## $ real_height        <int> 11, 55, 42, 51, 69, 71, 30, 53, 41, 75, 72, 37, 83,…
## $ reported.obstacles <int> 2, 38, 42, 47, NA, 29, 4, 49, 37, 25, NA, 34, 14, 3…
## $ map_last_update    <int> 33, 53, 49, 59, 12, 55, 47, 76, 18, 72, 88, 11, 87,…
## $ map_source         <fct> google, google, google, google, google, google, goo…
## $ obstacle           <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, …

We create our map_data_clean object by converting map_source to a factor and removing any duplicate rows. This clean dataset is now ready for the MCAR test.

Step 3.3: Test Missing Data Pattern (MCAR)

# We'll include: reported.obstacles, map_height, real_height, map_last_update
# We use map_data_clean, which we just created.
data_to_check <- as.data.frame(map_data_clean[
  , c("reported.obstacles", "map_height", "real_height", "map_last_update")
])

# Run MCAR test
TestMCARNormality(data_to_check)
## Call:
## TestMCARNormality(data = data_to_check)
## 
## Number of Patterns:  2 
## 
## Total number of cases used in the analysis:  100 
## 
##  Pattern(s) used:
##           reported.obstacles   map_height   real_height   map_last_update
## group.1                    1            1             1                 1
## group.2                   NA            1             1                 1
##           Number of cases
## group.1                93
## group.2                 7
## 
## 
##     Test of normality and Homoscedasticity:
##   -------------------------------------------
## 
## Hawkins Test:
## 
##     P-value for the Hawkins test of normality and homoscedasticity:  0.0006565239 
## 
##     Either the test of multivariate normality or homoscedasticity (or both) is rejected.
##     Provided that normality can be assumed, the hypothesis of MCAR is 
##     rejected at 0.05 significance level. 
## 
## Non-Parametric Test:
## 
##     P-value for the non-parametric test of homoscedasticity:  0.3252182 
## 
##     Reject Normality at 0.05 significance level.
##     There is not sufficient evidence to reject MCAR at 0.05 significance level.

The high p-value for the non-parametric test (0.325) means we fail to reject the null hypothesis. We can conclude the data is Missing Completely At Random (MCAR). This is great news, as it statistically validates our plan to use a simple and robust method like mean imputation.

Step 3.4: Impute Missing Data

# Mean imputation for reported.obstacles on our clean data
map_data_clean$reported.obstacles_imp <- ifelse(
  is.na(map_data_clean$reported.obstacles),
  mean(map_data_clean$reported.obstacles, na.rm = TRUE),
  map_data_clean$reported.obstacles
)

# Check new column after imputation
summary(map_data_clean$reported.obstacles_imp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    3.00   21.23   21.46   38.00   50.00

With our imputation method validated, we now create a new, complete column reported.obstacles_imp. Our dataset is now fully clean, complete, and ready for feature engineering.

Step 3.5: Feature Engineering

# Derive height_error: the difference between real and map height
map_data_clean$height_error <- map_data_clean$real_height - map_data_clean$map_height

# Derive obstacle_report_ratio: number of obstacle reports per day since last update
# We add +1 to denominator to avoid division by zero
map_data_clean$obstacle_report_ratio <- map_data_clean$reported.obstacles_imp / (map_data_clean$map_last_update + 1)

# Check summaries of new variables
summary(map_data_clean[, c("height_error", "obstacle_report_ratio")])
##   height_error   obstacle_report_ratio
##  Min.   : 0.00   Min.   :0.0000       
##  1st Qu.: 5.00   1st Qu.:0.1228       
##  Median :10.00   Median :0.4447       
##  Mean   :11.74   Mean   :0.7421       
##  3rd Qu.:17.00   3rd Qu.:0.8350       
##  Max.   :48.00   Max.   :7.0000

We engineer two powerful new features. height_error directly quantifies the danger (max error is 48 ft!), and obstacle_report_ratio measures drone activity relative to map freshness. These will be crucial for our risk maps.

Part 4: Visualizing the Drone’s “Minefield”

Now that our data is clean and enriched, we create a series of visualizations to clearly communicate the scale of the risk.

Visualization 1: Map Update Freshness (Heatmap)

# Define freshness levels
map_data_clean <- map_data_clean %>%
  mutate(update_status = case_when(
    map_last_update <= 5 ~ "Fresh (0–5 days)",
    map_last_update <= 20 ~ "Moderate (6–20 days)",
    TRUE ~ "Stale (21+ days)"
  ))

# Create a heatmap of update status by square
ggplot(map_data_clean, aes(x = x, y = y, fill = update_status)) +
  geom_tile(color = "white") +
  scale_fill_manual(
    values = c(
      "Fresh (0–5 days)" = "#4CAF50",     # green
      "Moderate (6–20 days)" = "#FFC107", # yellow
      "Stale (21+ days)" = "#F44336"      # red
    )
  ) +
  coord_fixed() +
  labs(
    title = "Map Update Freshness by Square",
    x = "X Coordinate",
    y = "Y Coordinate",
    fill = "Update Status"
  ) +
  theme_minimal()

This map is a sea of red, and that’s a huge problem. Nearly all the data is “Stale” (over 21 days old), so relying on it is like “flying blind.” This risk is exactly why we need a predictive model.

Visualization 2: Map Freshness (Bar Plot)

map_data_clean %>%
  count(update_status) %>%
  ggplot(aes(x = update_status, y = n, fill = update_status)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = n), vjust = -0.5, size = 5) +
  scale_fill_manual(values = c(
    "Fresh (0–5 days)" = "#2ca02c",
    "Moderate (6–20 days)" = "#ff7f0e",
    "Stale (21+ days)" = "#d62728"
  )) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    title = "Number of Squares by Map Update Freshness",
    x = "Update Status",
    y = "Number of Squares",
    fill = "Legend"
  ) +
  theme_minimal(base_size = 13)

This chart quantifies the risk: 76 out of 100 squares have dangerously stale data, while only 7 are freshly updated and reliable.

Visualization 3: Drone Safety Risk Map (Logic)

This brings us to the question: “Where are the highest risks for drone collisions?” We will define a set of safety rules based on our engineered variables. - Safe to fly: Low height error and a recently updated map. - Caution zone: Medium height error but a freshly updated map. - Danger zone: High height error OR a stale map with medium error. - Map needs update: Low height error but an outdated map (risk is uncertain).

# Reclassify safety using our defined logic
map_data_clean <- map_data_clean %>%
  mutate(
    height_risk = case_when(
      height_error <= 10 ~ "Low",
      height_error <= 25 ~ "Medium",
      TRUE ~ "High"
    ),
    map_fresh = ifelse(map_last_update <= 5, TRUE, FALSE),
    safety_status = case_when(
      height_risk == "Low" & map_fresh ~ "Safe to Fly",
      height_risk == "Low" & !map_fresh ~ "Map Needs Update",
      height_risk == "Medium" & map_fresh ~ "Caution Zone",
      height_risk == "Medium" & !map_fresh ~ "Danger Zone",
      height_risk == "High" ~ "Danger Zone"
    )
  )

This dplyr code block is the “brain” behind our final risk map. It translates our raw data into these four clear, actionable safety statuses by combining height_error with map_fresh.

Visualization 4: Drone Safety Risk Map (Plot)

# Better color scale
color_scheme <- c(
  "Safe to Fly" = "#2ca02c",        # Green
  "Map Needs Update" = "#bdbdbd",   # Gray
  "Caution Zone" = "#ffcc00",       # Yellow
  "Danger Zone" = "#d62728"         # Red
)

# Plot again with fixed logic
ggplot(map_data_clean, aes(x = x, y = y, fill = safety_status)) +
  geom_tile(color = "white", size = 0.3) +
  scale_fill_manual(values = color_scheme) +
  labs(
    title = "Refined Drone Safety Risk Map (Height Error + Map Freshness)",
    fill = "Drone Risk Status",
    x = "X Coordinate",
    y = "Y Coordinate"
  ) +
  theme_minimal(base_size = 13)

This map confirms our fears. The grid is a minefield of confirmed dangers and high-risk unknowns, dominated by “Danger Zones” (red) and “Map Needs Update” (gray).Safe to Fly” (green) squares—the only ones meeting our strict criteria - are exceptionally rare.*

Visualization 5: Map Source Reliability

ggplot(map_data_clean, aes(x = map_source, y = height_error, fill = map_source)) +
  geom_boxplot() +
  labs(
    title = "Comparison of Height Error by Map Source",
    x = "Map Source",
    y = "Height Error (in feet)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This visual tells a clear story: the google map is an extreme gamble. Its massive blue box shows it’s wildly inconsistent, with errors ranging from near-zero to a catastrophic 50 ft. In contrast, the army map is the clear professional choice, offering the most predictable and consistently low-error performance.

Part 5: Model 1 - Predicting Exact Height (Linear Regression)

Our second task is to create a linear regression model to predict the real_height for each square. This model will help us understand which variables drive the map’s inaccuracy.

Initial Linear Model

# Fit the linear regression model (excluding obstacle)
model_real_height <- lm(real_height ~ map_height + reported.obstacles_imp + map_last_update + map_source,
                        data = map_data_clean)
summary(model_real_height)
## 
## Call:
## lm(formula = real_height ~ map_height + reported.obstacles_imp + 
##     map_last_update + map_source, data = map_data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.4855  -5.0021  -0.5669   4.9448  23.3375 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            11.08124    3.07344   3.605 0.000501 ***
## map_height              0.92660    0.06224  14.888  < 2e-16 ***
## reported.obstacles_imp  0.10913    0.05424   2.012 0.047073 *  
## map_last_update         0.03853    0.02845   1.354 0.178960    
## map_sourcearmy         -4.36321    2.53354  -1.722 0.088327 .  
## map_sourcegoogle        9.98551    3.22030   3.101 0.002547 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.258 on 94 degrees of freedom
## Multiple R-squared:  0.7748, Adjusted R-squared:  0.7628 
## F-statistic: 64.68 on 5 and 94 DF,  p-value: < 2.2e-16

This first model explains 76% of the height variance (Adjusted R-squared = 0.7628), confirming map_heightis the single most dominant predictor. However,the wide residual range hints that potential outliers are skewing the results.

— Outlier detection and removal

# Calculate Cook's distance
cooks_d <- cooks.distance(model_real_height)
# Visualize Cook's Distance
map_data_clean$cooks_d <- cooks_d
plot(cooks_d, type = "h", main = "Cook's Distance", ylab = "Distance")
cutoff <- 10 / nrow(map_data_clean) # Set a cutoff threshold
abline(h = cutoff, col = "red", lty = 2)

# Filter out influential points
map_data_clean_no_outliers <- map_data_clean %>%
  filter(cooks_d <= cutoff)

The Cook’s Distance plot identifies several influential outliers (points above the red line). We filter these out to create a more robust and accurate model.

Refined Linear Model

# Refit model without outliers
model_real_height_clean <- lm(real_height ~ map_height + reported.obstacles_imp +map_last_update + map_source,
                  data = map_data_clean_no_outliers)
summary(model_real_height_clean)
## 
## Call:
## lm(formula = real_height ~ map_height + reported.obstacles_imp + 
##     map_last_update + map_source, data = map_data_clean_no_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2937  -4.4884  -0.5406   4.5887  13.1419 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            12.57927    2.38292   5.279 9.26e-07 ***
## map_height              0.92477    0.04774  19.369  < 2e-16 ***
## reported.obstacles_imp  0.08355    0.04211   1.984  0.05035 .  
## map_last_update         0.02231    0.02230   1.000  0.31995    
## map_sourcearmy         -4.47804    1.93172  -2.318  0.02276 *  
## map_sourcegoogle        9.77953    2.88016   3.395  0.00103 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.295 on 88 degrees of freedom
## Multiple R-squared:  0.8534, Adjusted R-squared:  0.8451 
## F-statistic: 102.5 on 5 and 88 DF,  p-value: < 2.2e-16

Success. After removing outliers, our Adjusted R-squared improved, and more predictors (like map_sourcearmy) became statistically significant. This model is a much better fit, and we can now confidently validate its assumptions.

Linear Model Assumption Checks

# 1. Linearity
plot(model_real_height_clean$fitted.values, model_real_height_clean$residuals,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red")

# 2. Normality of residuals ((H0: Residuals are normal))
shapiro.test(model_real_height_clean$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_real_height_clean$residuals
## W = 0.97994, p-value = 0.1582
# 3.Homoscedasticity (H0: Residuals have constant variance)
bptest(model_real_height_clean)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_real_height_clean
## BP = 10.147, df = 5, p-value = 0.07118
# 4. Independence (H0: No autocorrelation)
# Durbin-Watson Test (H0: No autocorrelation)
dwtest(model_real_height_clean)
## 
##  Durbin-Watson test
## 
## data:  model_real_height_clean
## DW = 1.981, p-value = 0.3759
## alternative hypothesis: true autocorrelation is greater than 0
# 5. Multicollinearity
vif(model_real_height_clean)
##                            GVIF Df GVIF^(1/(2*Df))
## map_height             1.151345  1        1.073007
## reported.obstacles_imp 1.154930  1        1.074677
## map_last_update        1.023586  1        1.011724
## map_source             1.081254  2        1.019722

All assumptions are satisfied. The residual plot shows random scatter, the Shapiro-Wilk and Breusch-Pagan tests are not significant (p > 0.05), and all VIF scores are near 1. Our linear model is statistically valid and reliable.

Part 6: Model 2 - Predicting Obstacle Presence (Logistic Regression)

Our third task is to build a logistic regression model. Instead of predicting the exact height, this model will predict the probability of an obstacle being present (the obstacle column).

Fit Logistic Regression Model

# Fit logistic regression model
obstacle_glm <- glm(obstacle ~ map_height + reported.obstacles_imp + map_last_update + map_source,
                    family = binomial(link = "logit"),
                    data = map_data_clean)
summary(obstacle_glm)
## 
## Call:
## glm(formula = obstacle ~ map_height + reported.obstacles_imp + 
##     map_last_update + map_source, family = binomial(link = "logit"), 
##     data = map_data_clean)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)  
## (Intercept)            -8.23952    4.47363  -1.842   0.0655 .
## map_height              0.19421    0.07878   2.465   0.0137 *
## reported.obstacles_imp  1.13647    0.72485   1.568   0.1169  
## map_last_update         0.02480    0.03420   0.725   0.4683  
## map_sourcearmy         -0.30143    2.75555  -0.109   0.9129  
## map_sourcegoogle        0.66797    2.56983   0.260   0.7949  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 102.791  on 99  degrees of freedom
## Residual deviance:  14.789  on 94  degrees of freedom
## AIC: 26.789
## 
## Number of Fisher Scoring iterations: 12

This model is also a good fit. Unsurprisingly, map_height is again the most significant predictor. For every 1-foot increase in map height, the log-odds of there being a real obstacle increase by 0.19.

Get Model Predictions (Probabilities)

# 1. Use the predict() function
# 'type = "response"' is the key part that says "give me probabilities 0-1"
probabilities <- predict(obstacle_glm, type = "response")

# 2. Let's look at the first 10 probabilities
head(probabilities)
##          1          2          3          4          5          6 
## 0.06109876 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000

The summary() output shows the model’s formula, but this predict() function is what we use to get the actual answers. The type = "response" command converts the mathematical “log-odds” into a simple probability (from 0 to 1), telling us the percentage chance of an obstacle for each square.It , shows the model is 6.1% sure row 1 is an obstacle, but 100% sure rows 2-6 are.

Logistic Model Goodness-of-Fit Check

# Generate a binned residual plot to assess model fit
binnedplot(predict(obstacle_glm), resid(obstacle_glm),
           main = "Binned Residual Plot",
           xlab = "Predicted Probabilities",
           ylab = "Residuals")

The binned residual plot confirms our model fits the data well. All the black dots (average residuals) lie comfortably within the 95% confidence bands, indicating no major specification errors.

Part 7: Model 3 - Predicting Risk Probability (Bayesian Regression)

Fit Bayesian Regression Model

# Fit the Bayesian regression model
Bayesian_model <- brm(
  real_height ~ map_height + reported.obstacles_imp + map_last_update +         map_source,
  data = map_data_clean,
  iter = 2000,
  warmup = 200,
  chains = 3,
  thin = 2
)
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 4.1e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.41 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  201 / 2000 [ 10%]  (Sampling)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Sampling)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Sampling)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Sampling)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.034 seconds (Warm-up)
## Chain 1:                0.053 seconds (Sampling)
## Chain 1:                0.087 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 1.4e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.14 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  201 / 2000 [ 10%]  (Sampling)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Sampling)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Sampling)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Sampling)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.052 seconds (Warm-up)
## Chain 2:                0.053 seconds (Sampling)
## Chain 2:                0.105 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 9e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  201 / 2000 [ 10%]  (Sampling)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Sampling)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Sampling)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Sampling)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.033 seconds (Warm-up)
## Chain 3:                0.056 seconds (Sampling)
## Chain 3:                0.089 seconds (Total)
## Chain 3:
# Print summary of the model
print(summary(Bayesian_model))
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: real_height ~ map_height + reported.obstacles_imp + map_last_update + map_source 
##    Data: map_data_clean (Number of observations: 100) 
##   Draws: 3 chains, each with iter = 2000; warmup = 200; thin = 2;
##          total post-warmup draws = 2700
## 
## Regression Coefficients:
##                        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept                 11.12      3.08     5.15    17.19 1.00     2493
## map_height                 0.93      0.06     0.80     1.05 1.00     2191
## reported.obstacles_imp     0.11      0.06    -0.00     0.22 1.00     2596
## map_last_update            0.04      0.03    -0.02     0.09 1.00     2627
## map_sourcearmy            -4.31      2.54    -9.30     0.69 1.00     2327
## map_sourcegoogle           9.90      3.19     3.37    16.07 1.00     2405
##                        Tail_ESS
## Intercept                  2438
## map_height                 1830
## reported.obstacles_imp     2598
## map_last_update            2303
## map_sourcearmy             2344
## map_sourcegoogle           2236
## 
## Further Distributional Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     8.36      0.61     7.30     9.66 1.00     2368     2110
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The model converged successfully, indicated by all Rhat values being 1.00. The estimates are similar to our linear model, but now we have 95% credible intervals for each predictor, giving us a much richer understanding of their potential impact.

Part 8: Application - Building a Safe Pathfinding Algorithm

In our final section, we use our Bayesian model to build a real-world application: a path-planning algorithm that avoids danger by considering height uncertainty.

Step 1: Calculate Safety Probabilities

We will use our model to calculate the probability that the true height is 20 feet or less - a safe altitude for a drone.

# 1. Get all predictions at once (Vectorized is faster than a loop)
lin_preds <- posterior_linpred(Bayesian_model, newdata = map_data_clean)

# 2. Calculate the probability for each column (observation)
# This finds the % of draws that were <= 20 for each square.
safe_probs <- apply(lin_preds, 2, function(col) mean(col <= 20))

# Add probabilities and safety classification to the data frame
map_data_clean$safe_prob <- safe_probs
map_data_clean$bayesian_safety <- ifelse(map_data_clean$safe_prob > 0.5, "Safe", "Unsafe")

# View the safety breakdown
table(map_data_clean$bayesian_safety)
## 
##   Safe Unsafe 
##     15     85

Our Bayesian model has classified the grid based on a 50% probability threshold of being under 20ft. It classifies 85 squares as “Unsafe” and 15 as “Safe,” giving us a clear risk-based map to build our pathfinder.

Step 2: Identify Start and End Nodes

# Get top 2 blocks with highest predicted safety probability
top_safe_coords <- map_data_clean %>%
  arrange(desc(safe_prob)) %>%
  slice(1:2) %>%
  dplyr::select(x, y)

print("Top 2 Safest Coordinates (Start and End):")
## [1] "Top 2 Safest Coordinates (Start and End):"
print(top_safe_coords)
##   x y
## 1 2 8
## 2 7 3

For this demonstration, our model has identified the two “safest” squares in the entire grid to act as our origin (2_8)and destination (7_3).

Step 3: Build the Weighted Graph

# --- Build Edge List ---
# Helper function to get safety prob for a coordinate
get_prob <- function(x, y, data) {
  prob <- data$safe_prob[data$x == x & data$y == y]
  if (length(prob) == 0) { return(NA) }
  return(prob)
}

# Create the edge list
edge_list <- list()
for (i in 1:nrow(map_data_clean)) {
  row <- map_data_clean[i, ]
  x1 <- row$x
  y1 <- row$y
  node1 <- paste(x1, y1, sep = "_")
  
  # Define potential neighbors (up, down, left, right)
  neighbors_coords <- data.frame(
    x2 = c(x1, x1, x1 - 1, x1 + 1),
    y2 = c(y1 - 1, y1 + 1, y1, y1)
  )
  
  for (j in 1:nrow(neighbors_coords)) {
    x2 <- neighbors_coords$x2[j]
    y2 <- neighbors_coords$y2[j]
    
    # Check if neighbor is within the 10x10 grid
    if (x2 >= 1 && x2 <= 10 && y2 >= 1 && y2 <= 10) {
      node2 <- paste(x2, y2, sep = "_")
      neighbor_prob <- get_prob(x2, y2, map_data_clean)
      
      # Calculate weight (cost): 1 + (1 - safety_prob_neighbor)
      cost <- 1 + (1 - neighbor_prob)
      
      if (node1 < node2) {
        edge_list[[length(edge_list) + 1]] <- data.frame(from = node1, to = node2, weight = cost)
      }
    }
  }
}
edges <- do.call(rbind, edge_list)
head(edges)
##   from  to weight
## 1  1_1 1_2      2
## 2  1_1 2_1      2
## 3  1_2 1_3      2
## 4  1_2 2_2      2
## 5  1_3 1_4      2
## 6  1_3 2_3      2

This is the core of our application. We build a graph where the “cost” to move to a new square is inversely proportional to its safety. Moving to a very safe square (prob = 0.9) costs little (1.1), while moving to an unsafe square (prob = 0.1) is very expensive (1.9). This penalizes unsafe routes.

Step 4: Find the Safest Path

# 1) Build the graph from the edge list
grid_graph <- graph_from_data_frame(d = edges, directed = FALSE)

# 2) Get our start and end nodes from Step 2
start_node <- paste(top_safe_coords$x[1], top_safe_coords$y[1], sep = "_")
end_node   <- paste(top_safe_coords$x[2], top_safe_coords$y[2], sep = "_")

# 3) Compute the safe path
# We use weights = NA to tell igraph to use the "weight" attribute we just built
shortest_path <- shortest_paths(
  grid_graph,
  from    = start_node,
  to      = end_node,
  weights = NA, 
  output  = "vpath"
)$vpath[[1]]

# 4) Print the path
cat("Safest path from", start_node, "to", end_node, ":\n")
## Safest path from 2_8 to 7_3 :
print(names(shortest_path))
##  [1] "2_8" "2_7" "2_6" "2_5" "2_4" "2_3" "3_3" "4_3" "5_3" "6_3" "7_3"

Success! Our algorithm can now find the path from 2_8 to 7_3with the lowest cumulative safety cost. This path, derived directly from our Bayesian model, provides a truly “safety-aware” route that intelligently avoids the grid’s most dangerous areas, successfully completing our project’s objective.