In response to challenges in urban drone navigation, this project develops a predictive risk model to ensure drone safety. Standard map data is often outdated, failing to reflect the true height of obstacles. Our objective is to analyze map data from various sources, model its inaccuracies, and ultimately build a safety-aware pathfinding algorithm that can navigate a grid by predicting the real height of obstacles, not just what the map reports.
We begin by loading all necessary libraries and our core dataset.
# --- Install required packages (run once; uncomment as needed) ---
# install.packages("tidyverse")
# install.packages("summarytools")
# install.packages("DataExplorer")
# install.packages("MissMech")
# install.packages("corrplot")
# install.packages("lmtest")
# install.packages("car")
# install.packages("arm")
# install.packages("brglm")
# install.packages("logistf")
# install.packages("brms")
# install.packages("igraph")
# install.packages("posterior") # brms dependency for as_draws_df
# --- Load all required libraries ---
library(tidyverse)
library(summarytools)
library(DataExplorer)
library(ggplot2)
library(dplyr)
library(readr)
library(MissMech)
library(corrplot)
library(lmtest)
library(car)
library(arm)
library(brglm)
library(logistf)
library(brms)
library(igraph)
library(tidyr)
library(posterior)
This block consolidates all packages, making the script clean and
easy to run. We’ve included everything from data manipulation
(tidyverse) to advanced modeling
(brms).
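If you are running this on a fresh machine, a small helper can install whatever is missing before the library() calls. This is a convenience sketch, assuming all of these packages are available on CRAN:
# Install any packages that are not already present (sketch)
pkgs <- c("tidyverse", "summarytools", "DataExplorer", "MissMech", "corrplot",
          "lmtest", "car", "arm", "brglm", "logistf", "brms", "igraph", "posterior")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)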
# --- 2. LOAD AND INSPECT DATA ---
# File location
map_data <- read.csv("C:/Users/karki/Downloads/map_info.csv")
# View first few rows
head(map_data)
## x y sq map_height real_height reported.obstacles map_last_update map_source
## 1 1 1 1 9 11 2 33 Google
## 2 1 2 2 18 55 38 53 Google
## 3 1 3 3 15 42 42 49 Google
## 4 1 4 4 25 51 47 59 Google
## 5 1 5 5 25 69 NA 12 Google
## 6 1 6 6 23 71 29 55 Google
## obstacle
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
The data is loaded successfully. We can see our key columns:
coordinates (x, y), map metadata
(map_height, map_source), and our two target
variables: real_height (continuous) and
obstacle (binary).
str(map_data)
## 'data.frame': 100 obs. of 9 variables:
## $ x : int 1 1 1 1 1 1 1 1 1 1 ...
## $ y : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sq : int 1 2 3 4 5 6 7 8 9 10 ...
## $ map_height : int 9 18 15 25 25 23 7 37 34 28 ...
## $ real_height : int 11 55 42 51 69 71 30 53 41 75 ...
## $ reported.obstacles: int 2 38 42 47 NA 29 4 49 37 25 ...
## $ map_last_update : int 33 53 49 59 12 55 47 76 18 72 ...
## $ map_source : chr "Google" "Google" "Google" "Google" ...
## $ obstacle : int 0 1 1 1 1 1 1 1 1 1 ...
# Summary statistics
summary(map_data)
## x y sq map_height real_height
## Min. : 1.0 Min. : 1.0 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 3.0 1st Qu.: 3.0 1st Qu.: 25.75 1st Qu.:13.50 1st Qu.:23.50
## Median : 5.5 Median : 5.5 Median : 50.50 Median :25.00 Median :36.50
## Mean : 5.5 Mean : 5.5 Mean : 50.50 Mean :25.18 Mean :36.92
## 3rd Qu.: 8.0 3rd Qu.: 8.0 3rd Qu.: 75.25 3rd Qu.:36.00 3rd Qu.:49.25
## Max. :10.0 Max. :10.0 Max. :100.00 Max. :50.00 Max. :83.00
##
## reported.obstacles map_last_update map_source obstacle
## Min. : 0.00 Min. : 0.00 Length:100 Min. :0.00
## 1st Qu.: 3.00 1st Qu.:24.00 Class :character 1st Qu.:1.00
## Median :16.00 Median :52.00 Mode :character Median :1.00
## Mean :21.46 Mean :50.66 Mean :0.79
## 3rd Qu.:39.00 3rd Qu.:72.00 3rd Qu.:1.00
## Max. :50.00 Max. :99.00 Max. :1.00
## NA's :7
The summary() output immediately reveals our first
data quality issue: the reported.obstacles column has
7 missing values (NA's) that we will need
to address.
Our first formal task is to apply exploratory data analysis to understand the data’s characteristics, find the problems, and validate our project’s core assumptions.
# Reshape summary output to long format
map_data %>%
dplyr::select(where(is.numeric)) %>% # keep only numeric columns
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
group_by(variable) %>%
summarise(
mean = mean(value, na.rm = TRUE),
median = median(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
min = min(value, na.rm = TRUE),
max = max(value, na.rm = TRUE)
) %>%
arrange(variable)
## # A tibble: 8 × 6
## variable mean median sd min max
## <chr> <dbl> <dbl> <dbl> <int> <int>
## 1 map_height 25.2 25 14.4 1 50
## 2 map_last_update 50.7 52 29.5 0 99
## 3 obstacle 0.79 1 0.409 0 1
## 4 real_height 36.9 36.5 17.0 1 83
## 5 reported.obstacles 21.5 16 17.1 0 50
## 6 sq 50.5 50.5 29.0 1 100
## 7 x 5.5 5.5 2.89 1 10
## 8 y 5.5 5.5 2.89 1 10
This summary confirms our project’s core premise: the
real_height (max 83 ft) is often significantly greater than
the map_height (max 50 ft). This discrepancy is the danger
we need to model.
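To put a number on that premise before modeling, a quick check (a sketch using the map_data object loaded above) measures how often and by how much the map underreports:
# Share of squares where reality exceeds the map (sketch)
mean(map_data$real_height > map_data$map_height)
# Average underreporting in feet (sketch)
mean(map_data$real_height - map_data$map_height)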
# Compute correlation matrix
cor_matrix <- map_data %>%
dplyr::select(where(is.numeric)) %>%
cor(use = "pairwise.complete.obs")
# Visualize correlation matrix
corrplot::corrplot(cor_matrix,
method = "color",
type = "full",
tl.col = "black",
tl.cex = 0.8,
diag = FALSE,
addCoef.col = "black",
number.cex = 0.7)
The correlation matrix confirms that map_height is a strong positive predictor of real_height (0.82), making our modeling approach viable. obstacle also correlates strongly with real_height (0.69), which is expected.
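For a more focused view, the real_height column of the matrix can be pulled out and ranked; a small sketch using the cor_matrix computed above:
# Rank all numeric variables by their correlation with real_height (sketch)
sort(cor_matrix[, "real_height"], decreasing = TRUE)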
Before we can model, we must handle the missing data identified in our EDA and engineer new features that will help quantify risk.
# Shows NA counts for every column
colSums(is.na(map_data))
## x y sq map_height
## 0 0 0 0
## real_height reported.obstacles map_last_update map_source
## 0 7 0 0
## obstacle
## 0
This check confirms that all 7 missing values are isolated in the
reported.obstacles column.
# Clean map_source and remove duplicates
map_data_clean <- map_data %>%
mutate(map_source = as.factor(tolower(map_source))) %>%
filter(!duplicated(.))
# Confirm structure
glimpse(map_data_clean)
## Rows: 100
## Columns: 9
## $ x <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
## $ y <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7,…
## $ sq <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ map_height <int> 9, 18, 15, 25, 25, 23, 7, 37, 34, 28, 50, 37, 48, 1…
## $ real_height <int> 11, 55, 42, 51, 69, 71, 30, 53, 41, 75, 72, 37, 83,…
## $ reported.obstacles <int> 2, 38, 42, 47, NA, 29, 4, 49, 37, 25, NA, 34, 14, 3…
## $ map_last_update <int> 33, 53, 49, 59, 12, 55, 47, 76, 18, 72, 88, 11, 87,…
## $ map_source <fct> google, google, google, google, google, google, goo…
## $ obstacle <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, …
We create our map_data_clean object by converting
map_source to a factor and removing any duplicate rows.
This clean dataset is now ready for the MCAR test.
# We'll include: reported.obstacles, map_height, real_height, map_last_update
# We use map_data_clean, which we just created.
data_to_check <- as.data.frame(map_data_clean[
, c("reported.obstacles", "map_height", "real_height", "map_last_update")
])
# Run MCAR test
TestMCARNormality(data_to_check)
## Call:
## TestMCARNormality(data = data_to_check)
##
## Number of Patterns: 2
##
## Total number of cases used in the analysis: 100
##
## Pattern(s) used:
## reported.obstacles map_height real_height map_last_update
## group.1 1 1 1 1
## group.2 NA 1 1 1
## Number of cases
## group.1 93
## group.2 7
##
##
## Test of normality and Homoscedasticity:
## -------------------------------------------
##
## Hawkins Test:
##
## P-value for the Hawkins test of normality and homoscedasticity: 0.0006565239
##
## Either the test of multivariate normality or homoscedasticity (or both) is rejected.
## Provided that normality can be assumed, the hypothesis of MCAR is
## rejected at 0.05 significance level.
##
## Non-Parametric Test:
##
## P-value for the non-parametric test of homoscedasticity: 0.3252182
##
## Reject Normality at 0.05 significance level.
## There is not sufficient evidence to reject MCAR at 0.05 significance level.
Because the Hawkins test rejected multivariate normality, the non-parametric test is the appropriate one here. Its high p-value (0.325) means we fail to reject the null hypothesis: the data are consistent with being Missing Completely At Random (MCAR). This is great news, as it statistically supports our plan to use a simple and robust method like mean imputation.
# Mean imputation for reported.obstacles on our clean data
map_data_clean$reported.obstacles_imp <- ifelse(
is.na(map_data_clean$reported.obstacles),
mean(map_data_clean$reported.obstacles, na.rm = TRUE),
map_data_clean$reported.obstacles
)
# Check new column after imputation
summary(map_data_clean$reported.obstacles_imp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 3.00 21.23 21.46 38.00 50.00
With our imputation method validated, we now create a new,
complete column reported.obstacles_imp. Our dataset is now
fully clean, complete, and ready for feature engineering.
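One known side effect of mean imputation is that it shrinks the variance slightly, since every filled-in value sits exactly at the center. A quick before/after comparison (sketch) makes this explicit:
# Mean imputation preserves the mean but shrinks the spread (sketch)
sd(map_data_clean$reported.obstacles, na.rm = TRUE) # before imputation
sd(map_data_clean$reported.obstacles_imp) # after (slightly smaller)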
# Derive height_error: the difference between real and map height
map_data_clean$height_error <- map_data_clean$real_height - map_data_clean$map_height
# Derive obstacle_report_ratio: number of obstacle reports per day since last update
# We add +1 to denominator to avoid division by zero
map_data_clean$obstacle_report_ratio <- map_data_clean$reported.obstacles_imp / (map_data_clean$map_last_update + 1)
# Check summaries of new variables
summary(map_data_clean[, c("height_error", "obstacle_report_ratio")])
## height_error obstacle_report_ratio
## Min. : 0.00 Min. :0.0000
## 1st Qu.: 5.00 1st Qu.:0.1228
## Median :10.00 Median :0.4447
## Mean :11.74 Mean :0.7421
## 3rd Qu.:17.00 3rd Qu.:0.8350
## Max. :48.00 Max. :7.0000
We engineer two powerful new features. height_error
directly quantifies the danger (max error is 48 ft!), and
obstacle_report_ratio measures drone activity relative to
map freshness. These will be crucial for our risk maps.
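To see where the danger concentrates, a quick sketch lists the squares with the largest map-versus-reality gap:
# Five squares with the largest height_error (sketch)
map_data_clean %>%
  arrange(desc(height_error)) %>%
  dplyr::select(x, y, map_height, real_height, height_error) %>%
  head(5)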
Now that our data is clean and enriched, we create a series of visualizations to clearly communicate the scale of the risk.
# Define freshness levels
map_data_clean <- map_data_clean %>%
mutate(update_status = case_when(
map_last_update <= 5 ~ "Fresh (0–5 days)",
map_last_update <= 20 ~ "Moderate (6–20 days)",
TRUE ~ "Stale (21+ days)"
))
# Create a heatmap of update status by square
ggplot(map_data_clean, aes(x = x, y = y, fill = update_status)) +
geom_tile(color = "white") +
scale_fill_manual(
values = c(
"Fresh (0–5 days)" = "#4CAF50", # green
"Moderate (6–20 days)" = "#FFC107", # yellow
"Stale (21+ days)" = "#F44336" # red
)
) +
coord_fixed() +
labs(
title = "Map Update Freshness by Square",
x = "X Coordinate",
y = "Y Coordinate",
fill = "Update Status"
) +
theme_minimal()
This map is a sea of red, and that’s a huge problem. Nearly all the data is “Stale” (over 21 days old), so relying on it is like “flying blind.” This risk is exactly why we need a predictive model.
map_data_clean %>%
count(update_status) %>%
ggplot(aes(x = update_status, y = n, fill = update_status)) +
geom_bar(stat = "identity", width = 0.6) +
geom_text(aes(label = n), vjust = -0.5, size = 5) +
scale_fill_manual(values = c(
"Fresh (0–5 days)" = "#2ca02c",
"Moderate (6–20 days)" = "#ff7f0e",
"Stale (21+ days)" = "#d62728"
)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(
title = "Number of Squares by Map Update Freshness",
x = "Update Status",
y = "Number of Squares",
fill = "Legend"
) +
theme_minimal(base_size = 13)
This chart quantifies the risk: 76 out of 100 squares have
dangerously stale data, while only 7 are freshly updated and
reliable.
This brings us to the question: "Where are the highest risks for drone collisions?" We define a set of safety rules based on our engineered variables:
- Safe to fly: low height error and a recently updated map.
- Caution zone: medium height error but a freshly updated map.
- Danger zone: high height error, or a stale map with medium error.
- Map needs update: low height error but an outdated map (risk is uncertain).
# Reclassify safety using our defined logic
map_data_clean <- map_data_clean %>%
mutate(
height_risk = case_when(
height_error <= 10 ~ "Low",
height_error <= 25 ~ "Medium",
TRUE ~ "High"
),
map_fresh = ifelse(map_last_update <= 5, TRUE, FALSE),
safety_status = case_when(
height_risk == "Low" & map_fresh ~ "Safe to Fly",
height_risk == "Low" & !map_fresh ~ "Map Needs Update",
height_risk == "Medium" & map_fresh ~ "Caution Zone",
height_risk == "Medium" & !map_fresh ~ "Danger Zone",
height_risk == "High" ~ "Danger Zone"
)
)
This dplyr block is the "brain" behind our final risk map: by combining height_risk (derived from height_error) with map_fresh, it translates the raw data into four clear, actionable safety statuses.
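Before mapping, a quick tally (sketch) shows how the 100 squares distribute across the four statuses:
# Count squares per safety status (sketch)
map_data_clean %>%
  count(safety_status, sort = TRUE)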
# Better color scale
color_scheme <- c(
"Safe to Fly" = "#2ca02c", # Green
"Map Needs Update" = "#bdbdbd", # Gray
"Caution Zone" = "#ffcc00", # Yellow
"Danger Zone" = "#d62728" # Red
)
# Plot again with fixed logic
ggplot(map_data_clean, aes(x = x, y = y, fill = safety_status)) +
geom_tile(color = "white", size = 0.3) +
scale_fill_manual(values = color_scheme) +
labs(
title = "Refined Drone Safety Risk Map (Height Error + Map Freshness)",
fill = "Drone Risk Status",
x = "X Coordinate",
y = "Y Coordinate"
) +
theme_minimal(base_size = 13)
This map confirms our fears. The grid is a minefield of confirmed dangers and high-risk unknowns, dominated by "Danger Zones" (red) and "Map Needs Update" (gray). "Safe to Fly" (green) squares, the only ones meeting our strict criteria, are exceptionally rare.
ggplot(map_data_clean, aes(x = map_source, y = height_error, fill = map_source)) +
geom_boxplot() +
labs(
title = "Comparison of Height Error by Map Source",
x = "Map Source",
y = "Height Error (in feet)"
) +
theme_minimal() +
theme(legend.position = "none")
This visual tells a clear story: the
google map
is an extreme gamble. Its massive blue box shows it’s wildly
inconsistent, with errors ranging from near-zero to a catastrophic 50
ft. In contrast, the army map is the clear
professional choice, offering the most predictable and
consistently low-error performance.
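The boxplot impression can be backed with numbers; a per-source summary (sketch) quantifies each map's average error and consistency:
# Mean, spread, and worst-case height_error by map source (sketch)
map_data_clean %>%
  group_by(map_source) %>%
  summarise(
    mean_error = mean(height_error),
    sd_error = sd(height_error),
    max_error = max(height_error),
    n = n()
  )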
Our second task is to create a linear regression model to predict the
real_height for each square. This model will help us
understand which variables drive the map’s inaccuracy.
# Fit the linear regression model (excluding obstacle)
model_real_height <- lm(real_height ~ map_height + reported.obstacles_imp + map_last_update + map_source,
data = map_data_clean)
summary(model_real_height)
##
## Call:
## lm(formula = real_height ~ map_height + reported.obstacles_imp +
## map_last_update + map_source, data = map_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.4855 -5.0021 -0.5669 4.9448 23.3375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.08124 3.07344 3.605 0.000501 ***
## map_height 0.92660 0.06224 14.888 < 2e-16 ***
## reported.obstacles_imp 0.10913 0.05424 2.012 0.047073 *
## map_last_update 0.03853 0.02845 1.354 0.178960
## map_sourcearmy -4.36321 2.53354 -1.722 0.088327 .
## map_sourcegoogle 9.98551 3.22030 3.101 0.002547 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.258 on 94 degrees of freedom
## Multiple R-squared: 0.7748, Adjusted R-squared: 0.7628
## F-statistic: 64.68 on 5 and 94 DF, p-value: < 2.2e-16
This first model explains 76% of the height variance (Adjusted R-squared = 0.7628), confirming map_height is the single most dominant predictor. However, the wide residual range hints that influential outliers may be skewing the results.
# Calculate Cook's distance
cooks_d <- cooks.distance(model_real_height)
# Visualize Cook's Distance
map_data_clean$cooks_d <- cooks_d
plot(cooks_d, type = "h", main = "Cook's Distance", ylab = "Distance")
cutoff <- 10 / nrow(map_data_clean) # Set a cutoff threshold
abline(h = cutoff, col = "red", lty = 2)
# Filter out influential points
map_data_clean_no_outliers <- map_data_clean %>%
filter(cooks_d <= cutoff)
The Cook’s Distance plot identifies several influential outliers (points above the red line). We filter these out to create a more robust and accurate model.
# Refit model without outliers
model_real_height_clean <- lm(real_height ~ map_height + reported.obstacles_imp + map_last_update + map_source,
data = map_data_clean_no_outliers)
summary(model_real_height_clean)
##
## Call:
## lm(formula = real_height ~ map_height + reported.obstacles_imp +
## map_last_update + map_source, data = map_data_clean_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.2937 -4.4884 -0.5406 4.5887 13.1419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.57927 2.38292 5.279 9.26e-07 ***
## map_height 0.92477 0.04774 19.369 < 2e-16 ***
## reported.obstacles_imp 0.08355 0.04211 1.984 0.05035 .
## map_last_update 0.02231 0.02230 1.000 0.31995
## map_sourcearmy -4.47804 1.93172 -2.318 0.02276 *
## map_sourcegoogle 9.77953 2.88016 3.395 0.00103 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.295 on 88 degrees of freedom
## Multiple R-squared: 0.8534, Adjusted R-squared: 0.8451
## F-statistic: 102.5 on 5 and 88 DF, p-value: < 2.2e-16
Success. After removing outliers, our Adjusted R-squared
improved, and more predictors (like map_sourcearmy) became
statistically significant. This model is a much better fit, and we can
now confidently validate its assumptions.
# 1. Linearity
plot(model_real_height_clean$fitted.values, model_real_height_clean$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red")
# 2. Normality of residuals (H0: residuals are normal)
shapiro.test(model_real_height_clean$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_real_height_clean$residuals
## W = 0.97994, p-value = 0.1582
# 3. Homoscedasticity (H0: residuals have constant variance)
bptest(model_real_height_clean)
##
## studentized Breusch-Pagan test
##
## data: model_real_height_clean
## BP = 10.147, df = 5, p-value = 0.07118
# 4. Independence: Durbin-Watson test (H0: no autocorrelation)
dwtest(model_real_height_clean)
##
## Durbin-Watson test
##
## data: model_real_height_clean
## DW = 1.981, p-value = 0.3759
## alternative hypothesis: true autocorrelation is greater than 0
# 5. Multicollinearity
vif(model_real_height_clean)
## GVIF Df GVIF^(1/(2*Df))
## map_height 1.151345 1 1.073007
## reported.obstacles_imp 1.154930 1 1.074677
## map_last_update 1.023586 1 1.011724
## map_source 1.081254 2 1.019722
All assumptions are satisfied. The residual plot shows random scatter; the Shapiro-Wilk, Breusch-Pagan, and Durbin-Watson tests are all non-significant (p > 0.05); and every VIF score is near 1. Our linear model is statistically valid and reliable.
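As a visual complement to the Shapiro-Wilk test, a QQ plot of the residuals (sketch) should hug the reference line if normality holds:
# Visual normality check for the cleaned model's residuals (sketch)
qqnorm(model_real_height_clean$residuals, main = "Normal Q-Q Plot of Residuals")
qqline(model_real_height_clean$residuals, col = "red")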
Our third task is to build a logistic regression model. Instead of
predicting the exact height, this model will predict the
probability of an obstacle being present (the
obstacle column).
# Fit logistic regression model
obstacle_glm <- glm(obstacle ~ map_height + reported.obstacles_imp + map_last_update + map_source,
family = binomial(link = "logit"),
data = map_data_clean)
summary(obstacle_glm)
##
## Call:
## glm(formula = obstacle ~ map_height + reported.obstacles_imp +
## map_last_update + map_source, family = binomial(link = "logit"),
## data = map_data_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.23952 4.47363 -1.842 0.0655 .
## map_height 0.19421 0.07878 2.465 0.0137 *
## reported.obstacles_imp 1.13647 0.72485 1.568 0.1169
## map_last_update 0.02480 0.03420 0.725 0.4683
## map_sourcearmy -0.30143 2.75555 -0.109 0.9129
## map_sourcegoogle 0.66797 2.56983 0.260 0.7949
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 102.791 on 99 degrees of freedom
## Residual deviance: 14.789 on 94 degrees of freedom
## AIC: 26.789
##
## Number of Fisher Scoring iterations: 12
This model also fits well: the residual deviance drops from 102.8 (null) to 14.8. Unsurprisingly, map_height is again the most significant predictor. For every 1-foot increase in map height, the log-odds of there being a real obstacle increase by 0.19.
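Log-odds are hard to read directly; exponentiating the coefficient (a one-line sketch) converts it into an odds ratio:
# Convert the map_height coefficient from log-odds to an odds ratio (sketch)
exp(coef(obstacle_glm)["map_height"])
# ~1.21: each extra foot of map height multiplies the odds of an obstacle by about 1.21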
# 1. Use the predict() function
# 'type = "response"' is the key part that says "give me probabilities 0-1"
probabilities <- predict(obstacle_glm, type = "response")
# 2. Look at the first few probabilities
head(probabilities)
## 1 2 3 4 5 6
## 0.06109876 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000
The summary() output shows the model's formula, but the predict() function is what we use to get the actual answers. Setting type = "response" converts the mathematical "log-odds" into a simple probability (from 0 to 1), telling us the percentage chance of an obstacle for each square. Here the model gives row 1 only a 6.1% chance of an obstacle but is essentially certain (probability ≈ 1) about rows 2-6.
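With a 0.5 cutoff, these probabilities can be turned into hard predictions and cross-tabulated against the truth; a quick confusion-table sketch:
# Confusion table: predicted obstacle (prob > 0.5) vs. actual obstacle (sketch)
table(
  actual = map_data_clean$obstacle,
  predicted = as.integer(probabilities > 0.5)
)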
# Generate a binned residual plot to assess model fit
binnedplot(predict(obstacle_glm), resid(obstacle_glm),
main = "Binned Residual Plot",
xlab = "Predicted Probabilities",
ylab = "Residuals")
The binned residual plot confirms our model fits the data well. All
the black dots (average residuals) lie comfortably within the 95%
confidence bands, indicating no major specification errors.
# Fit the Bayesian regression model
Bayesian_model <- brm(
real_height ~ map_height + reported.obstacles_imp + map_last_update + map_source,
data = map_data_clean,
iter = 2000,
warmup = 200,
chains = 3,
thin = 2
)
##
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 4.1e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.41 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 201 / 2000 [ 10%] (Sampling)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Sampling)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Sampling)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Sampling)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.034 seconds (Warm-up)
## Chain 1: 0.053 seconds (Sampling)
## Chain 1: 0.087 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1.4e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.14 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 201 / 2000 [ 10%] (Sampling)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Sampling)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Sampling)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Sampling)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.052 seconds (Warm-up)
## Chain 2: 0.053 seconds (Sampling)
## Chain 2: 0.105 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 9e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 201 / 2000 [ 10%] (Sampling)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Sampling)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Sampling)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Sampling)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.033 seconds (Warm-up)
## Chain 3: 0.056 seconds (Sampling)
## Chain 3: 0.089 seconds (Total)
## Chain 3:
# Print summary of the model
print(summary(Bayesian_model))
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: real_height ~ map_height + reported.obstacles_imp + map_last_update + map_source
## Data: map_data_clean (Number of observations: 100)
## Draws: 3 chains, each with iter = 2000; warmup = 200; thin = 2;
## total post-warmup draws = 2700
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept 11.12 3.08 5.15 17.19 1.00 2493
## map_height 0.93 0.06 0.80 1.05 1.00 2191
## reported.obstacles_imp 0.11 0.06 -0.00 0.22 1.00 2596
## map_last_update 0.04 0.03 -0.02 0.09 1.00 2627
## map_sourcearmy -4.31 2.54 -9.30 0.69 1.00 2327
## map_sourcegoogle 9.90 3.19 3.37 16.07 1.00 2405
## Tail_ESS
## Intercept 2438
## map_height 1830
## reported.obstacles_imp 2598
## map_last_update 2303
## map_sourcearmy 2344
## map_sourcegoogle 2236
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 8.36 0.61 7.30 9.66 1.00 2368 2110
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The model converged successfully, indicated by all Rhat values being 1.00. The estimates are similar to our linear model, but now we have 95% credible intervals for each predictor, giving us a much richer understanding of their potential impact.
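Beyond the summary table, the raw draws can be interrogated directly with as_draws_df() from the posterior package. A sketch, assuming brms's default coefficient naming (b_map_sourcegoogle for the google-source effect):
# Work with the raw posterior draws (sketch; assumes default brms parameter names)
draws <- posterior::as_draws_df(Bayesian_model)
# Posterior probability that the google source inflates heights at all
mean(draws$b_map_sourcegoogle > 0)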
In our final section, we use our Bayesian model to build a real-world application: a path-planning algorithm that avoids danger by considering height uncertainty.
We will use our model to calculate the probability that the true height is 20 feet or less - a safe altitude for a drone.
# 1. Get posterior draws of the expected height for every square at once
# (posterior_linpred is vectorized, so this is faster than a loop; it captures
# uncertainty in the mean height rather than observation-level noise)
lin_preds <- posterior_linpred(Bayesian_model, newdata = map_data_clean)
# 2. Calculate the probability for each column (observation)
# This finds the % of draws that were <= 20 for each square.
safe_probs <- apply(lin_preds, 2, function(col) mean(col <= 20))
# Add probabilities and safety classification to the data frame
map_data_clean$safe_prob <- safe_probs
map_data_clean$bayesian_safety <- ifelse(map_data_clean$safe_prob > 0.5, "Safe", "Unsafe")
# View the safety breakdown
table(map_data_clean$bayesian_safety)
##
## Safe Unsafe
## 15 85
Our Bayesian model has classified the grid using a 50% posterior probability threshold of being under 20 ft. It labels 85 squares "Unsafe" and 15 "Safe," giving us a clear risk-based map on which to build our pathfinder.
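The continuous probabilities themselves make a useful map. A gradient tile plot (sketch, reusing the safe_prob column just added) shows shades of risk rather than a hard Safe/Unsafe split:
# Heatmap of the posterior probability that each square is safe (sketch)
ggplot(map_data_clean, aes(x = x, y = y, fill = safe_prob)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "#d62728", high = "#2ca02c", limits = c(0, 1)) +
  coord_fixed() +
  labs(
    title = "Posterior Probability of Safe Altitude (<= 20 ft)",
    x = "X Coordinate",
    y = "Y Coordinate",
    fill = "P(safe)"
  ) +
  theme_minimal()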
# Get top 2 blocks with highest predicted safety probability
top_safe_coords <- map_data_clean %>%
arrange(desc(safe_prob)) %>%
slice(1:2) %>%
dplyr::select(x, y)
print("Top 2 Safest Coordinates (Start and End):")
## [1] "Top 2 Safest Coordinates (Start and End):"
print(top_safe_coords)
## x y
## 1 2 8
## 2 7 3
For this demonstration, our model has identified the two "safest" squares in the entire grid to act as our origin (2_8) and destination (7_3).
# --- Build Edge List ---
# Helper function to get safety prob for a coordinate
get_prob <- function(x, y, data) {
prob <- data$safe_prob[data$x == x & data$y == y]
if (length(prob) == 0) { return(NA) }
return(prob)
}
# Create the edge list
edge_list <- list()
for (i in 1:nrow(map_data_clean)) {
row <- map_data_clean[i, ]
x1 <- row$x
y1 <- row$y
node1 <- paste(x1, y1, sep = "_")
# Define potential neighbors (up, down, left, right)
neighbors_coords <- data.frame(
x2 = c(x1, x1, x1 - 1, x1 + 1),
y2 = c(y1 - 1, y1 + 1, y1, y1)
)
for (j in 1:nrow(neighbors_coords)) {
x2 <- neighbors_coords$x2[j]
y2 <- neighbors_coords$y2[j]
# Check if neighbor is within the 10x10 grid
if (x2 >= 1 && x2 <= 10 && y2 >= 1 && y2 <= 10) {
node2 <- paste(x2, y2, sep = "_")
neighbor_prob <- get_prob(x2, y2, map_data_clean)
# Calculate weight (cost): 1 + (1 - safety_prob_neighbor)
cost <- 1 + (1 - neighbor_prob)
if (node1 < node2) {
edge_list[[length(edge_list) + 1]] <- data.frame(from = node1, to = node2, weight = cost)
}
}
}
}
edges <- do.call(rbind, edge_list)
head(edges)
## from to weight
## 1 1_1 1_2 2
## 2 1_1 2_1 2
## 3 1_2 1_3 2
## 4 1_2 2_2 2
## 5 1_3 1_4 2
## 6 1_3 2_3 2
This is the core of our application. We build a graph where the “cost” to move to a new square is inversely proportional to its safety. Moving to a very safe square (prob = 0.9) costs little (1.1), while moving to an unsafe square (prob = 0.1) is very expensive (1.9). This penalizes unsafe routes.
# 1) Build the graph from the edge list
grid_graph <- graph_from_data_frame(d = edges, directed = FALSE)
# 2) Get our start and end nodes from Step 2
start_node <- paste(top_safe_coords$x[1], top_safe_coords$y[1], sep = "_")
end_node <- paste(top_safe_coords$x[2], top_safe_coords$y[2], sep = "_")
# 3) Compute the safe path
# weights = NULL (the default) tells igraph to use the "weight" edge attribute
# we just built; weights = NA would instead ignore the weights entirely
shortest_path <- shortest_paths(
  grid_graph,
  from = start_node,
  to = end_node,
  weights = NULL,
output = "vpath"
)$vpath[[1]]
# 4) Print the path
cat("Safest path from", start_node, "to", end_node, ":\n")
## Safest path from 2_8 to 7_3 :
print(names(shortest_path))
## [1] "2_8" "2_7" "2_6" "2_5" "2_4" "2_3" "3_3" "4_3" "5_3" "6_3" "7_3"
Success! Our algorithm finds the path from 2_8 to 7_3 with the lowest cumulative safety cost. This route, derived directly from our Bayesian model, provides a truly "safety-aware" path that intelligently avoids the grid's most dangerous areas, completing our project's objective.
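To close the loop visually, the computed route can be overlaid on the safety map. A sketch that parses the node names back into coordinates and reuses the safe_prob tiles:
# Overlay the computed route on the safety probability map (sketch)
path_coords <- do.call(rbind, strsplit(names(shortest_path), "_"))
path_df <- data.frame(
  x = as.integer(path_coords[, 1]),
  y = as.integer(path_coords[, 2])
)
ggplot(map_data_clean, aes(x = x, y = y, fill = safe_prob)) +
  geom_tile(color = "white") +
  geom_path(data = path_df, aes(x = x, y = y), inherit.aes = FALSE,
            color = "blue", size = 1.2) +
  geom_point(data = path_df[c(1, nrow(path_df)), ], aes(x = x, y = y),
             inherit.aes = FALSE, color = "blue", size = 3) +
  coord_fixed() +
  labs(title = "Safety-Aware Drone Path Over P(safe)", fill = "P(safe)") +
  theme_minimal()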