Machine Learning | Alex Baecher

Machine Learning the 'Tidy' Way

Fri, 22 Oct 2021 00:40:04 -0700

Introduction to machine learning with tidymodels

Tidymodels provides a clean, organized, and–most importantly–consistent programming syntax for data pre-processing, model specification, model fitting, model evaluation, and prediction.

Anatomy of tidymodels

A meta-package that installs and loads the core packages listed below, which you need for modeling and machine learning:

rsample: provides infrastructure for efficient data splitting and resampling
parsnip: a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages
recipes: a tidy interface to data pre-processing tools for feature engineering
workflows: workflows bundle your pre-processing, modeling, and post-processing together
tune: helps you optimize the hyperparameters of your model and pre-processing steps
yardstick: measures the effectiveness of models using performance metrics
dials: contains tools to create and manage values of tuning parameters and is designed to integrate well with the parsnip package
broom: summarizes key information about models in tidy tibble()s

First, let's load the tidymodels meta-package:

library(tidymodels)
library(tidyverse)

Data

I’ll demonstrate its features using an existing data set from Bruno Oliveira, AmphiBIO:

Link to publication: https://www.nature.com/articles/sdata2017123
Link to data: https://ndownloader.figstatic.com/files/8828578

Amphibio data

Download data:

# install.packages("downloader")
# library(downloader)
# 
# url <- "https://ndownloader.figstatic.com/files/8828578"
# download(url, dest="dial_broom/amphibio.zip", mode="wb") 
# unzip("dial_broom/amphibio.zip", exdir = "./dial_broom")

library(readr)

amphibio_raw <- read_csv("AmphiBIO_v1.csv")

The data consist of natural history information of amphibians, including habitat types, diet, size, etc.

Here’s the breakdown of taxonomic spread in the data:

Order: N = 3
Family: N = 61
Genera: N = 531
Species: N = 6776

There are also a lot of missing data, and the data that do exist are on wildly different scales. We’ll clean this up:

# Keep the traits of interest, drop rows with missing values, log-transform, and remove Gymnophiona
amphibio <- amphibio_raw %>%
  select("Order"
         ,"Body_mass_g"
         ,"Body_size_mm"
         ,"Litter_size_min_n"
         ,"Litter_size_max_n"
         ,"Reproductive_output_y"
         ) %>%
  na.omit %>%
  mutate(Body_mass_g = log(Body_mass_g),
         Body_size_mm = log(Body_size_mm),
         Litter_size_min_n = log(Litter_size_min_n),
         Litter_size_max_n = log(Litter_size_max_n),
         Reproductive_output_y = log(Reproductive_output_y)) %>%
  filter(Order != "Gymnophiona")
  
amphibio %>%
  group_by(Order) %>%
  summarize(n = n())

Now let’s have a peek at the data:

amphibio %>% 
  pivot_longer(!Order, names_to = "Metric", values_to = "Value") %>%
  ggplot(aes(Order, Value, col = Order)) + 
    geom_boxplot() + 
    facet_wrap(~Metric)

There are some trends in the data:

caudates are longer
anura have larger litter sizes

Given the data, one possible modeling application is to use these traits to predict taxonomic order using two models: k-nearest neighbors (knn) and boosted regression trees.

To start the modeling process, we’ll use rsample to split the data into training and testing sets.

set.seed(42)

tidy_split <- initial_split(amphibio, prop = 0.95)
tidy_train <- training(tidy_split)
tidy_test <- testing(tidy_split)
tidy_kfolds <- vfold_cv(tidy_train)
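
For intuition, the split above can be imitated in base R. This is only a rough sketch of the idea, not how rsample is implemented; `toy`, `train_idx`, and `fold_id` are illustrative names, and the data are simulated rather than taken from AmphiBIO.

```r
set.seed(42)

# Toy stand-in for the amphibio tibble (hypothetical data)
toy <- data.frame(id = 1:200, value = rnorm(200))

# Hold out 5% of rows for testing, mirroring prop = 0.95
n_train   <- floor(0.95 * nrow(toy))
train_idx <- sample(nrow(toy), n_train)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Assign each training row to one of 10 folds (vfold_cv() defaults to v = 10)
fold_id <- sample(rep(1:10, length.out = nrow(toy_train)))

# For fold 1: fit on the other nine folds, assess on the held-out fold
analysis_1   <- toy_train[fold_id != 1, ]
assessment_1 <- toy_train[fold_id == 1, ]
```

Each fold takes a turn as the assessment set, so every training row is used for assessment exactly once.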

We can use recipes to preprocess the data:

# Recipes package 
## For preprocessing, feature engineering, and feature elimination 
tidy_rec <- recipe(Order ~ ., data = tidy_train) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_predictors()) %>%
  prep()
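
To make the normalization step concrete, here is a small base-R sketch of the same idea: the centering and scaling constants are learned from the training rows only and then reused on new data. The `train` and `test` data frames are made-up values, and this is an analogue of `step_normalize()`, not its implementation.

```r
# Hypothetical toy predictors on very different scales
train <- data.frame(mass = c(1, 10, 100, 1000), length = c(20, 25, 30, 35))
test  <- data.frame(mass = c(50, 500), length = c(22, 33))

# Learn the centering and scaling constants from the training data only
centers <- colMeans(train)
scales  <- apply(train, 2, sd)

# Apply the same constants to both sets
train_norm <- scale(train, center = centers, scale = scales)
test_norm  <- scale(test,  center = centers, scale = scales)

round(colMeans(train_norm), 10)  # training columns now have mean 0
apply(train_norm, 2, sd)         # and standard deviation 1
```

Reusing the training constants on the test set keeps information from the test rows out of the preprocessing.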

Now that we’ve created a recipe to process the data for modeling, we can use parsnip to model the data:

First, let’s have a look at the model’s description

library("webshot")
# ?boost_tree

boost_tree()

Description

boost_tree() defines a model that creates a series of decision trees forming an ensemble. Each tree depends on the results of previous trees. All trees in the ensemble are combined to produce a final prediction.

There are different ways to fit this model. See the engine-specific pages for more details:

xgboost (default)
C5.0
spark

?nearest_neighbor

nearest_neighbor():

defines a model that uses the K most similar data points from the training set to predict new samples.

There are different ways to fit this model. See the engine-specific pages for more details:

kknn (default)
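
As a minimal sketch of the nearest-neighbor idea described above, the following base-R function classifies a new point by an unweighted majority vote among its k closest training points. `knn_predict` and the toy data are hypothetical; the kknn engine uses kernel-weighted votes, so this illustrates the concept rather than reproducing that package.

```r
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Majority vote among their labels
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Two well-separated toy clusters (made-up values)
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c("Anura", "Anura", "Anura", "Caudata", "Caudata", "Caudata")

pred_a <- knn_predict(train_x, train_y, new_x = c(0.5, 0.5), k = 3)
pred_b <- knn_predict(train_x, train_y, new_x = c(5.5, 5.5), k = 3)
```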

Now, let’s specify the models:

# Parsnip package 
## Standardized api for creating models 
tidy_boosted_model <- boost_tree(trees = tune(),
                                min_n = tune(),
                                learn_rate = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")

tidy_knn_model <- nearest_neighbor(neighbors = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("kknn")

Our basic model recipe is complete, but now we want to use dials to tune parameters.

dials

For boosted regression trees, there are 3 basic parameters:

parameters(tidy_boosted_model)

## Collection of 3 parameters for tuning
## 
##  identifier       type    object
##       trees      trees nparam[+]
##       min_n      min_n nparam[+]
##  learn_rate learn_rate nparam[+]

trees: An integer for the number of trees contained in the ensemble.
min_n: An integer for the minimum number of data points in a node that is required for the node to be split further.
learn_rate: A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only).

Knn has a single parameter to tune: the number of neighbors.

parameters(tidy_knn_model)

## Collection of 1 parameters for tuning
## 
##  identifier      type    object
##   neighbors neighbors nparam[+]

neighbors: A single integer for the number of neighbors to consider (often called k). For kknn, a value of 5 is used if neighbors is not specified.

So, we can use dials to set the possible parameter values, which can then be tuned using tune.

# Dials creates the parameter grids 
# Tune applies the parameter grid to the models 
# Dials package 
boosted_params <- 5
knn_params <- 10

?grid_regular

boosted_grid <- grid_regular(parameters(tidy_boosted_model), levels = boosted_params)
boosted_grid

## # A tibble: 125 x 3
##    trees min_n   learn_rate
##    <int> <int>        <dbl>
##  1     1     2 0.0000000001
##  2   500     2 0.0000000001
##  3  1000     2 0.0000000001
##  4  1500     2 0.0000000001
##  5  2000     2 0.0000000001
##  6     1    11 0.0000000001
##  7   500    11 0.0000000001
##  8  1000    11 0.0000000001
##  9  1500    11 0.0000000001
## 10  2000    11 0.0000000001
## # ... with 115 more rows
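
A regular grid is every combination of the candidate values, which is why 5 levels for each of 3 parameters gives 5 × 5 × 5 = 125 rows. A base-R analogue with `expand.grid()` is sketched below; the `trees` values and the first two `min_n` values follow the output above, while the remaining `min_n` values and all `learn_rate` values are illustrative assumptions, not the dials defaults.

```r
# Five candidate values per parameter
trees      <- c(1, 500, 1000, 1500, 2000)
min_n      <- c(2, 11, 21, 30, 40)             # last three values are illustrative
learn_rate <- 10^seq(-10, -1, length.out = 5)  # illustrative range

# Cross all values; the first column varies fastest, as in the output above
manual_grid <- expand.grid(trees = trees, min_n = min_n, learn_rate = learn_rate)

nrow(manual_grid)  # 5 * 5 * 5 = 125
head(manual_grid)
```

The row count grows multiplicatively with each added parameter, which is the main motivation for the random and space-filling grids tried later.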

knn_grid <- grid_regular(parameters(tidy_knn_model), levels = knn_params)
knn_grid

## # A tibble: 10 x 1
##    neighbors
##        <int>
##  1         1
##  2         2
##  3         4
##  4         5
##  5         7
##  6         8
##  7        10
##  8        11
##  9        13
## 10        15

Implement tuning grid using tune:

tune

# install.packages(c("xgboost", "kknn"))
library(xgboost)
library(kknn)

# Tune package 
# system.time(
#   boosted_tune <- tune_grid(tidy_boosted_model,
#                             tidy_rec,
#                             resamples = tidy_kfolds,
#                             grid = boosted_grid)
# )
# write_rds(boosted_tune, "boosted_tune.rds")
boosted_tune <- read_rds("boosted_tune.rds")

# system.time(
#   knn_tune <- tune_grid(tidy_knn_model,
#                         tidy_rec,
#                         resamples = tidy_kfolds,
#                         grid = knn_grid)
# ) 
# write_rds(knn_tune, "knn_tune.rds")
knn_tune <- read_rds("knn_tune.rds")

# Use the tune package to extract the best parameters by ROC AUC
boosted_param <- boosted_tune %>% select_best(metric = "roc_auc")
knn_param <- knn_tune %>% select_best(metric = "roc_auc")
#Apply parameters to the models
tidy_boosted_model_final <- finalize_model(tidy_boosted_model, boosted_param)
tidy_knn_model_final <- finalize_model(tidy_knn_model, knn_param)

Now, we'll try different options from dials for parameter tuning, using two additional methods for grid specification:

random grid with dials::grid_random
maximum entropy grid with dials::grid_max_entropy

grid_random

boosted_grid_rand <- grid_random(parameters(tidy_boosted_model), size = boosted_params)
boosted_grid_rand

## # A tibble: 5 x 3
##   trees min_n learn_rate
##   <int> <int>      <dbl>
## 1   190    21   2.32e- 5
## 2  1816    12   3.60e- 8
## 3   293    28   3.14e-10
## 4   314     8   2.52e- 7
## 5  1363     5   5.92e- 6

knn_grid_rand <- grid_random(parameters(tidy_knn_model), size = knn_params)
knn_grid_rand

## # A tibble: 7 x 1
##   neighbors
##       <int>
## 1         1
## 2        10
## 3         5
## 4         3
## 5        11
## 6         8
## 7         2

# system.time(
#   boosted_tune_rand <- tune_grid(tidy_boosted_model,
#                                  tidy_rec,
#                                  resamples = tidy_kfolds,
#                                  grid = boosted_grid_rand)
# )
# write_rds(boosted_tune_rand, "boosted_tune_rand.rds")
boosted_tune_rand <- read_rds("boosted_tune_rand.rds")

# system.time(
#   knn_tune_rand <- tune_grid(tidy_knn_model,
#                              tidy_rec,
#                              resamples = tidy_kfolds,
#                              grid = knn_grid_rand)
# )
# write_rds(knn_tune_rand, "knn_tune_rand.rds")
knn_tune_rand <- read_rds("knn_tune_rand.rds")

# Use the tune package to extract the best parameters by ROC AUC
boosted_param_rand <- boosted_tune_rand %>% select_best(metric = "roc_auc")
knn_param_rand <- knn_tune_rand %>% select_best(metric = "roc_auc")

grid_max_entropy

boosted_grid_maxent <- grid_max_entropy(parameters(tidy_boosted_model), size = boosted_params)
boosted_grid_maxent

## # A tibble: 5 x 3
##   trees min_n learn_rate
##   <int> <int>      <dbl>
## 1   433    25   4.27e-10
## 2  1671    13   3.28e-10
## 3  1520     3   3.21e- 6
## 4   672     3   3.06e-10
## 5  1371    22   2.32e- 5

knn_grid_maxent <- grid_max_entropy(parameters(tidy_knn_model), size = knn_params)
knn_grid_maxent

## # A tibble: 10 x 1
##    neighbors
##        <int>
##  1         3
##  2        10
##  3         1
##  4        15
##  5        13
##  6         4
##  7         6
##  8         8
##  9         9
## 10        11

# system.time(
#   boosted_tune_maxent <- tune_grid(tidy_boosted_model,
#                                    tidy_rec,
#                                    resamples = tidy_kfolds,
#                                    grid = boosted_grid_maxent)
# )
# write_rds(boosted_tune_maxent, "boosted_tune_maxent.rds")
boosted_tune_maxent <- read_rds("boosted_tune_maxent.rds")

# system.time(
#   knn_tune_maxent <- tune_grid(tidy_knn_model,
#                                tidy_rec,
#                                resamples = tidy_kfolds,
#                                grid = knn_grid_maxent)
# )
# write_rds(knn_tune_maxent, "knn_tune_maxent.rds")
knn_tune_maxent <- read_rds("knn_tune_maxent.rds")

# Use the tune package to extract the best parameters by ROC AUC
boosted_param_maxent <- boosted_tune_maxent %>% select_best(metric = "roc_auc")
knn_param_maxent <- knn_tune_maxent %>% select_best(metric = "roc_auc")

workflows

For combining model, parameters, and preprocessing

boosted_wf <- workflow() %>% 
  add_model(tidy_boosted_model_final) %>% 
  add_recipe(tidy_rec)

knn_wf <- workflow() %>% 
  add_model(tidy_knn_model_final) %>% 
  add_recipe(tidy_rec)

yardstick

For extracting metrics from the model

boosted_res <- last_fit(boosted_wf, tidy_split)
knn_res <- last_fit(knn_wf, tidy_split)

mods <- bind_rows(
  boosted_res %>% mutate(model = "xgb"),
  knn_res %>% mutate(model = "knn")) %>% 
  unnest(.metrics)

ggplot(bind_rows(mods$.predictions), aes(Order, .pred_Anura)) + 
  geom_boxplot()

ggplot(bind_rows(mods$.predictions), aes(Order, .pred_Caudata)) + 
  geom_boxplot()

ggplot(mods, aes(x = model, y = .estimate, col = model)) + 
  geom_point() + 
  facet_wrap(~.metric)

Confusion matrix to visualize model predictions against truth

boosted_res %>% unnest(.predictions) %>% 
  conf_mat(truth = Order, estimate = .pred_class) %>%
  autoplot()
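
For readers new to confusion matrices, the same tabulation can be sketched in base R with `table()`. The `truth` and `estimate` vectors here are made-up labels, not the model's actual predictions.

```r
# Hypothetical true and predicted classes for six animals
truth    <- factor(c("Anura", "Anura", "Anura", "Caudata", "Caudata", "Anura"),
                   levels = c("Anura", "Caudata"))
estimate <- factor(c("Anura", "Anura", "Caudata", "Caudata", "Anura", "Anura"),
                   levels = c("Anura", "Caudata"))

# Rows are predictions, columns are the truth
cm <- table(Prediction = estimate, Truth = truth)
cm

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Off-diagonal cells show which class is being mistaken for which, something a single accuracy number hides.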

Fit the final workflows to the entire data set:

final_boosted_model <- fit(boosted_wf, amphibio)

## [15:25:37] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

final_knn_model <- fit(knn_wf, amphibio)

broom

Now we can use broom to tidy the results from these models, and provide an intuitive view of their meaning!

augment()

First, we’ll use augment to obtain predictions, residuals, and other items from the model, which auto-binds them to the original dataset.

boosted_aug <- augment(final_boosted_model, new_data = amphibio[,-1])
knn_aug <- augment(final_knn_model, new_data = amphibio[,-1])

boosted_aug_long <- boosted_aug %>%
  pivot_longer(-c(.pred_class, .pred_Anura, .pred_Caudata), names_to = "predictor", values_to = "value")

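Now we can evaluate the models using yardstick! Two caveats apply to the numbers below: `final_boosted_model` and `final_knn_model` are fitted workflows that already contain the recipe, so passing them data that has been through `bake()` applies the normalization twice; and both workflows were refit on the full data set, so the test rows are no longer held out.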

yardstick

final_boosted_model %>%
  predict(bake(tidy_rec, new_data = tidy_test), type = "prob") %>%
  bind_cols(tidy_test) %>%
  roc_auc(factor(Order), .pred_Anura)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.759
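
For intuition about what a binary ROC AUC measures: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R; `auc_rank` is a hypothetical helper written for this illustration (not a yardstick function), and the scores are made up.

```r
auc_rank <- function(truth, score, positive) {
  is_pos <- truth == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(score)  # average ranks handle tied scores
  # Mann-Whitney U statistic for the positives, scaled to [0, 1]
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c("Anura", "Anura", "Caudata", "Anura", "Caudata", "Caudata")
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)  # hypothetical .pred_Anura values

auc_informative <- auc_rank(truth, score, positive = "Anura")      # 8 of 9 pairs ordered correctly
auc_constant    <- auc_rank(truth, rep(0.5, 6), positive = "Anura")  # all scores tied
```

A model that gives every case the same score gets exactly 0.5, the value expected from ranking at random.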

final_boosted_model %>%
  predict(bake(tidy_rec, new_data = tidy_test), type = "prob") %>%
  bind_cols(tidy_test) %>%
  roc_curve(factor(Order), .pred_Anura) %>%
  autoplot()

Evaluating knn model

final_knn_model %>%
  predict(bake(tidy_rec, new_data = tidy_test), type = "prob") %>%
  bind_cols(tidy_test) %>%
  roc_auc(factor(Order), .pred_Anura)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary           0.5

final_knn_model %>%
  predict(bake(tidy_rec, new_data = tidy_test), type = "prob") %>%
  bind_cols(tidy_test) %>%
  roc_curve(factor(Order), .pred_Anura) %>%
  autoplot()

Visualizing predictions:

library(viridis)

## Loading required package: viridisLite

## 
## Attaching package: 'viridis'

## The following object is masked from 'package:scales':
## 
##     viridis_pal

ggplot(boosted_aug_long, aes(x = value, y = .pred_Anura, col = .pred_class)) + 
  geom_point() + 
  facet_wrap(~predictor) + 
  scale_color_viridis_d("Predicted class", option = "D") +
  theme_bw()

ggplot(boosted_aug_long, aes(x = value, y = .pred_Caudata, col = .pred_class)) + 
  geom_point() + 
  facet_wrap(~predictor) + 
  scale_color_viridis_d("Predicted class", option = "D") +
  theme_bw()

Linear regression with gradient descent

Wed, 22 Sep 2021 00:40:04 -0700

Introduction to linear regression with gradient descent

This tutorial is a rough introduction to using gradient descent to estimate the parameters (slope and intercept) of a standard linear regression, as an alternative to the closed-form ordinary least squares (OLS) solution, which coincides with the maximum likelihood estimate when the errors are Gaussian. To begin, I simulate data and fit a standard OLS regression, first with `lm()` and then by hand using sums of squares. I then demonstrate how to substitute gradient descent and interpret the results.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'readr' was built under R version 4.1.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Ordinary Least Squares Regression

Simulate data

Generate random data in which y is a noisy function of x

set.seed(123)

x <- runif(1000, -5, 5)
y <- x + rnorm(1000) + 3

Fit a linear model

mod <- lm(y ~ x) # Ordinary Least Squares regression fit with lm()
print(mod)

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      3.0118       0.9942

mod

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      3.0118       0.9942

Plot the data and the model

plot(x,y, col = "grey80", main='Regression using lm()', xlim = c(-2, 5), ylim = c(0,10)); 
text(0, 8, paste("Intercept = ", round(mod$coefficients[1], 2), sep = ""));
text(4, 2, paste("Slope = ", round(mod$coefficients[2], 2), sep = ""));
abline(v = 0, col = "grey80"); # line for y-intercept
abline(h = mod$coefficients[1], col = "grey80") # plot horizontal line at intercept value
abline(a = mod$coefficients[1], b = mod$coefficients[2], col='blue', lwd=2) # use slope and intercept to plot best fit line

Calculate intercept and slope using sum of squares

x_bar <- mean(x) # calculate mean of independent variable
y_bar <- mean(y) # calculate mean of dependent variable

slope <- sum((x - x_bar)*(y - y_bar))/sum((x - x_bar)^2) # sum of cross-products of deviations divided by the sum of squared deviations of x
slope

## [1] 0.9941662

intercept <- y_bar - (slope * x_bar) # the fitted line passes through (x_bar, y_bar), so solve for the intercept
intercept

## [1] 3.011774
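
As a quick check of the manual calculation, the sketch below recomputes the slope as `cov(x, y) / var(x)` (the n - 1 terms cancel, so this is the same ratio of sums) and compares both parameters with the coefficients from `lm()`. It regenerates the simulated data so that it runs on its own; `manual` and `fitted_coefs` are illustrative names.

```r
set.seed(123)
x <- runif(1000, -5, 5)
y <- x + rnorm(1000) + 3

# Closed-form OLS estimates
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

manual       <- c(intercept, slope)
fitted_coefs <- unname(coef(lm(y ~ x)))

all.equal(manual, fitted_coefs)  # agreement up to floating-point error
```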

Plot data using manually calculated parameters

plot(x,y, col = "grey80", main='Regression with manual calculations', xlim = c(-2, 5), ylim = c(0,10)); 
abline(a = intercept, b = slope, col='blue', lwd=2)

Gradient Descent:

Using the same simulated data as before, we will estimate parameters using a machine learning algorithm.

Here are some figures I found helpful while trying to understand how gradient descent works:

To determine the goodness of fit for a given set of parameters, we will implement a squared error cost function (a way to calculate the degree of error for a guess for slope and intercept):

cost <- function(X, y, theta) {
  sum( (X %*% theta - y)^2 ) / (2*length(y))
}
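
Written out, with $m$ observations, a design matrix $X$ whose first column is all ones, and $\theta$ holding the intercept and slope, the cost function above is

$$\Large J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x_i^\top \theta - y_i \right)^2$$

Differentiating with respect to $\theta$ gives the gradient that gradient descent follows:

$$\Large \nabla J(\theta) = \frac{1}{m} X^\top (X\theta - y)$$

Each iteration then moves $\theta$ a small step against the gradient, with the step size set by the learning rate $\alpha$:

$$\Large \theta \leftarrow \theta - \alpha \nabla J(\theta)$$

The factor of $\frac{1}{2}$ in the cost is a convenience: it cancels the 2 produced by differentiating the square.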

We must also set two additional parameters: learning rate and iteration limit

alpha <- 0.01
num_iters <- 1000

# keep history
cost_history <- double(num_iters)
theta_history <- vector("list", num_iters) # pre-allocate one slot per iteration

# initialize coefficients
theta <- matrix(c(0,0), nrow=2)

# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))

# gradient descent
for (i in 1:num_iters) {
  error <- (X %*% theta - y)
  delta <- t(X) %*% error / length(y)
  theta <- theta - alpha * delta
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

print(theta)

##           [,1]
## [1,] 3.0116439
## [2,] 0.9941657
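
One way to gain confidence in the update step is to compare the analytic gradient used in the loop with a numerical one. The sketch below does this with central finite differences at an arbitrary test point; `grad`, `numeric_grad`, and the chosen `theta` are illustrative and not part of the analysis above.

```r
cost <- function(X, y, theta) sum((X %*% theta - y)^2) / (2 * length(y))

# Analytic gradient, the same expression as `delta` in the loop
grad <- function(X, y, theta) as.vector(t(X) %*% (X %*% theta - y)) / length(y)

set.seed(123)
x <- runif(1000, -5, 5)
y <- x + rnorm(1000) + 3
X <- cbind(1, x)
theta <- c(0.5, -0.2)  # arbitrary test point

# Central finite differences, one coordinate at a time
h <- 1e-6
numeric_grad <- sapply(seq_along(theta), function(j) {
  step <- replace(numeric(length(theta)), j, h)
  (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * h)
})

all.equal(grad(X, y, theta), numeric_grad, tolerance = 1e-6)
```

If the two disagree, the gradient expression (or the cost function) has a bug, which is much easier to catch here than by staring at a cost curve.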

Plot data and converging fit

iters <- c((1:31)^2, 1000)
cols <- rev(terrain.colors(num_iters))
library(gifski)
png("frame%03d.png")
par(ask = FALSE)

for (i in iters) {
  plot(x,y, col="grey80", main='Linear regression using Gradient Descent')
  text(x = -3, y = 10, paste("slope = ", round(theta_history[[i]][2], 3), sep = " "), adj = 0)
  text(x = -3, y = 8, paste("intercept = ", round(theta_history[[i]][1], 3), sep = " "), adj = 0)
  abline(coef=theta_history[[i]], col=cols[i], lwd = 2)
}

dev.off()

## png 
##   2

png_files <- sprintf("frame%03d.png", 1:32)
gif_file <- gifski(png_files, delay = 0.1)
unlink(png_files)
utils::browseURL(gif_file)

Plot the cost at each iteration to check that gradient descent has converged:

plot(cost_history, type='l', col='blue', lwd=2, main='Cost function', ylab='cost', xlab='Iterations')

Using gradient descent with real data

I’ll demonstrate gradient descent using an existing dataset from Bruno Oliveira, “AmphiBIO”:
• Link to publication: https://www.nature.com/articles/sdata2017123
• Link to data: https://ndownloader.figstatic.com/files/8828578

Load amphibio data!

install.packages("downloader")
library(downloader)

url <- "https://ndownloader.figstatic.com/files/8828578"
download(url, dest="lrgb/amphibio.zip", mode="wb") 
unzip("lrgb/amphibio.zip", exdir = "./lrgb")

df <- read_csv("AmphiBIO_v1.csv") %>%
  select("Order",
         "Body_mass_g",
         "Body_size_mm",
         "Size_at_maturity_min_mm",
         "Size_at_maturity_max_mm",
         "Litter_size_min_n",
         "Litter_size_max_n",
         "Reproductive_output_y") %>%
  na.omit %>%
  mutate_if(is.numeric, ~ log(.))

## Rows: 6776 Columns: 38

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (6): id, Order, Family, Genus, Species, OBS
## dbl (31): Fos, Ter, Aqu, Arb, Leaves, Flowers, Seeds, Arthro, Vert, Diu, Noc...
## lgl  (1): Fruits

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

plot(df$Body_size_mm, df$Size_at_maturity_max_mm, col = "grey80", main='Correlation of amphibian traits', xlab = "Body size (mm)", ylab = "Max size at maturity (mm)");

Fit a linear model

mod <- lm(Size_at_maturity_max_mm ~ Body_size_mm, data = df) # Ordinary Least Squares regression fit with lm()
print(mod)

## 
## Call:
## lm(formula = Size_at_maturity_max_mm ~ Body_size_mm, data = df)
## 
## Coefficients:
##  (Intercept)  Body_size_mm  
##       0.6237        0.7265

mod

## 
## Call:
## lm(formula = Size_at_maturity_max_mm ~ Body_size_mm, data = df)
## 
## Coefficients:
##  (Intercept)  Body_size_mm  
##       0.6237        0.7265

Plot the data and the model

plot(df$Body_size_mm, df$Size_at_maturity_max_mm, col = "grey80", main='Linear Regression using Sum of Squares', xlab = "Body size (mm)", ylab = "Max size at maturity (mm)"); 
text(4, 5, paste("Intercept = ", round(mod$coefficients[1], 2), sep = ""));
text(6, 3, paste("Slope = ", round(mod$coefficients[2], 2), sep = ""));
abline(a = mod$coefficients[1], b = mod$coefficients[2], col='blue', lwd=2) # use slope and intercept to plot best fit line

Calculate intercept and slope using sum of squares

x <- df$Body_size_mm
y <- df$Size_at_maturity_max_mm
x_bar <- mean(x) # calculate mean of independent variable
y_bar <- mean(y) # calculate mean of dependent variable

slope <- sum((x - x_bar)*(y - y_bar))/sum((x - x_bar)^2) # sum of cross-products of deviations divided by the sum of squared deviations of x
slope

## [1] 0.7264703

intercept <- y_bar - (slope * x_bar) # the fitted line passes through (x_bar, y_bar), so solve for the intercept
intercept

## [1] 0.6237047

### plot data using manually calculated parameters
plot(x,y, col = "grey80", main='Linear Regression using Ordinary Least Squares', xlab = "Body size (mm)", ylab = "Max size at maturity (mm)"); 
abline(a = intercept, b = slope, col='blue', lwd=2)

Calculate intercept and slope using gradient descent (Machine Learning)

We reuse the squared error cost function defined earlier (a way to calculate the degree of error for a guess for slope and intercept).

### learning rate and iteration limit
alpha <- 0.001
num_iters <- 1000

### keep history
cost_history <- double(num_iters)
theta_history <- vector("list", num_iters) # pre-allocate one slot per iteration

### initialize coefficients
theta <- matrix(c(0,0), nrow=2)

### add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))

# gradient descent
for (i in 1:num_iters) {
  error <- (X %*% theta - y)
  delta <- t(X) %*% error / length(y)
  theta <- theta - alpha * delta
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

print(theta)

##           [,1]
## [1,] 0.1816407
## [2,] 0.8175962
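
Gradient descent tends to converge slowly when the predictor is not centered near zero, because the intercept and slope columns of `X` become strongly correlated. One common remedy, sketched here on simulated data as an assumption rather than as part of the original analysis, is to center the predictor, run gradient descent, and then convert the intercept back to the original scale. The function `gradient_descent`, the simulated values, and the learning rates are all illustrative.

```r
set.seed(1)
# Simulated stand-in for log-transformed traits (not the AmphiBIO data)
x <- rnorm(500, mean = 4, sd = 0.5)
y <- 0.6 + 0.7 * x + rnorm(500, sd = 0.2)

gradient_descent <- function(x, y, alpha, num_iters) {
  X <- cbind(1, x)
  theta <- c(0, 0)
  for (i in seq_len(num_iters)) {
    grad <- as.vector(t(X) %*% (X %*% theta - y)) / length(y)
    theta <- theta - alpha * grad
  }
  theta
}

ols <- unname(coef(lm(y ~ x)))

# Uncentered predictor with a small learning rate: still far from OLS
theta_raw <- gradient_descent(x, y, alpha = 0.001, num_iters = 1000)

# Centered predictor tolerates a larger learning rate and converges quickly
x_c     <- x - mean(x)
theta_c <- gradient_descent(x_c, y, alpha = 0.1, num_iters = 2000)

# Convert the intercept back to the uncentered scale; the slope is unchanged
theta_centered <- c(theta_c[1] - theta_c[2] * mean(x), theta_c[2])

rbind(ols = ols, raw = theta_raw, centered = theta_centered)
```

Centering changes only the meaning of the intercept (it becomes the mean of `y`), which is why a single back-conversion recovers the original parameters.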

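Plot data and converging fit. Note that after 1,000 iterations at this learning rate the estimates (intercept 0.18, slope 0.82) have not yet reached the OLS values (intercept 0.62, slope 0.73); the fit is still converging.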

plot(x,y, col="grey80", main='Linear regression using Gradient Descent', xlab = "Body size (mm)", ylab = "Max size at maturity (mm)")
for (i in c((1:31)^2, 1000)) {
  abline(coef=theta_history[[i]], col="red")
}
abline(coef=theta, col="blue", lwd = 2)

plot(cost_history, type='l', col='blue', lwd=2, main='Cost function', ylab='cost', xlab='Iterations')

Empirical Dynamic Models for Forecasting

Fri, 12 Aug 2016 00:40:04 -0700

Introduction to EDMs for Forecasting Non-stationary data

EDMs are a data-driven approach for uncovering hidden dynamic behavior in natural systems, which are often complex and dynamic (referred to as “non-stationarity” or “non-linearity”). This non-linearity means that the sign and magnitude of relationships within a system change with time, so linear statistical approaches fail to represent such changes properly. Rather than assuming that the system is governed by a known set of equations (as is done for meteorological systems, for example), EDMs reconstruct the dynamics of the system from time series data (hence “data-driven”) and provide a mechanistic understanding of the system. Under EDMs, the dynamics of a system are encoded in the temporal ordering of the time series, and the behavior of the system can be explained by relating its states using time lags (i.e. estimating the mathematical relationship of one variable at time $X(t)$ to the same variable at other times, $X(t+1)$ and $X(t+2)$). By relating states of a system using such lags, causal relationships between variables in the original system may be uncovered, providing a number of ecologically relevant applications, including forecasting.

To reiterate, EDMs are driven by non-linear dynamics in a system (the relationships of a variable, or state, across time lags vary in sign and magnitude). Takens' theorem, the basis of EDM, states that an original system’s dynamics can be reconstructed from the lagged historical record of a single variable. The reconstructed states map 1-to-1 onto the states of the original system; the Lorenz attractor (also known as the butterfly attractor) is the classic illustration.
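
The lagged reconstruction described above can be sketched in base R with `embed()`, which turns a single series into a matrix whose columns are the series at successive lags. The sine wave and the choice of three lags are illustrative assumptions; rEDM builds its embeddings internally, so this only shows the idea.

```r
# Toy time series (hypothetical data)
x <- sin(seq(0, 8 * pi, length.out = 200))

E <- 3  # embedding dimension: the number of lagged coordinates
lagged <- embed(x, dimension = E)
colnames(lagged) <- c("x_t", "x_t_minus_1", "x_t_minus_2")

dim(lagged)      # 198 rows (200 - E + 1) and 3 columns
head(lagged, 3)  # each row is one reconstructed state of the system
```

Each row is a point in the reconstructed state space; nearby rows correspond to moments when the system was in a similar state.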

Tutorial on forecasting with stationary and non-stationary time series

Load libraries

library(astsa)
library(rEDM)
library(tidyverse)
library(forecast)
library(ggpubr)

Set time series parameters, where time = hrs and the temporal range is 4 days

set.seed(1)

time = 1:96

Stationary time series

Simulate autocorrelated timeseries data with stationarity (linear data, with cyclical autocorrelation) using `arima.sim`

Arima, or AutoRegressive Integrated Moving Average, models necessarily assume linearity, because they rely on a linear relationship to predict values from one time step to another.

stationary_y_arima <- arima.sim(n = length(time), list(ar = c(0.9, -0.8), ma = c(-0.41, 0.2)),
                                sd = sqrt(0.1))

df_ts <- data.frame(x = time, y = stationary_y_arima)

autoplot(stationary_y_arima) + ylab("Stationary Time Series")

Visualize the autocorrelation structure using the autocorrelation and partial autocorrelation functions (`acf()` and `pacf()`, both from base R's `stats` package)

acf(stationary_y_arima)

pacf(stationary_y_arima)

Partition data into training and predicting subsets:

train <- 1:(length(time)/2)             # indices for the first half of the time series

Arima models for forecasting:

Run a standard Arima model, with no lag dependencies

This model is mathematically identical to an intercept-only linear model:

$$\Large \hat{y}_t = \mu + \epsilon_{t}$$

Where, the intercept is equal to the mean of the response variable:

$$\Large \mu = \frac{1}{n} \sum_{t=1}^{n} y_{t}$$

a <- Arima(stationary_y_arima[train])

#plot the fitted values from Arima model
autoplot(fitted(a), col = "blue") + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")

Perform forecast of prediction data using a no-lag Arima model

autoplot(forecast(a, h = 48)) + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")

Autoregressive model, with one time dependency–an hourly lag term:

$$\Large \hat{y}_{t} = \mu + \phi_{1}y_{t-1} + \epsilon_{t}$$

Where, $\Large \phi_1$ is a coefficient of lag

a1 <- Arima(stationary_y_arima[train], c(1,0,0))

#plot the fitted values from Arima model
autoplot(fitted(a1), col = "blue") + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")

#plot the forecasted values from Arima model
autoplot(forecast(a1, h = 48)) + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")
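
For intuition about the lag coefficient $\phi_1$ in the one-lag model, it can be approximated by simply regressing each value on the one before it. The sketch below simulates a long AR(1) series with a known coefficient of 0.7 and recovers it by least squares; this is an illustration under those assumptions, not the likelihood-based estimator that `Arima()` uses.

```r
set.seed(1)
# Simulated AR(1) series with a known lag coefficient (hypothetical data)
y <- arima.sim(n = 5000, model = list(ar = 0.7))

# Regress y_t on y_{t-1}
n   <- length(y)
fit <- lm(y[2:n] ~ y[1:(n - 1)])

phi_hat <- unname(coef(fit)[2])
phi_hat  # should land close to the true value of 0.7
```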

Autoregressive model, with two hourly lags:

$$\Large \hat{y}_{t} = \mu + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \epsilon_{t}$$

a2 <- Arima(stationary_y_arima[train], c(2,0,0))

#plot the fitted values from Arima model
autoplot(fitted(a2), col = "blue") + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")

#plot the forecasted values from Arima model
autoplot(forecast(a2, h = 48)) + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")

Autoregressive models, with up to 5 hourly lags:

$$\Large \hat{y}_t = \mu + \phi_{1}y_{t-1} + \dots + \phi_{5}y_{t-5} + \epsilon_{t}$$

a3 <- Arima(stationary_y_arima[train], c(3,0,0))
a4 <- Arima(stationary_y_arima[train], c(4,0,0))
a5 <- Arima(stationary_y_arima[train], c(5,0,0))

a3_gg <- autoplot(forecast(a3, h = 48)) + ggtitle("Arima Model Forecast: 3 hourly lags") +
  geom_path(data = df_ts, aes(x = x, y = y)) + 
  geom_path(aes(x = time[train], y = fitted(a3)[train]), col = "blue") + 
  ylab(" ")

a4_gg <- autoplot(forecast(a4, h = 48)) + ggtitle("Arima Model Forecast: 4 hourly lags") +
  geom_path(data = df_ts, aes(x = x, y = y)) + 
  geom_path(aes(x = time[train], y = fitted(a4)[train]), col = "blue") + 
  ylab("Stationary Time Series")

a5_gg <- autoplot(forecast(a5, h = 48)) + ggtitle("Arima Model Forecast: 5 hourly lags") +
  geom_path(data = df_ts, aes(x = x, y = y)) + 
  geom_path(aes(x = time[train], y = fitted(a5)[train]), col = "blue") + 
  ylab(" ")

ggarrange(a3_gg, a4_gg, a5_gg, ncol = 1)

Now, we can move into models with different cycle structures. For this, we will consider half day lags (12 hr periods)

Autoregressive models, with an hourly- and half-day-time dependency:

$$\Large \hat{y}_t = \mu + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \phi_{3}y_{t-3} + \phi_{4}y_{t-4} + \phi_{5}y_{t-12} + \epsilon_{t}$$

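# The series has no frequency attribute, so the 12-hour period must be given explicitly
a41 <- Arima(stationary_y_arima[train], order = c(4,0,0), seasonal = list(order = c(1,0,0), period = 12))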

autoplot(forecast(a41, h = 48)) + ggtitle("Arima Model Forecast: 4 hourly cycle lag") +
  geom_path(data = df_ts, aes(x = x, y = y)) + 
  geom_path(aes(x = time[train], y = fitted(a41)[train]), col = "blue") +
  ylab("Stationary Time Series")

Now, we will let the Arima algorithm choose the time lag parameters, using `auto.arima`:

aa <- auto.arima(stationary_y_arima[train])
summary(aa)

## Series: stationary_y_arima[train] 
## ARIMA(3,0,0) with zero mean 
## 
## Coefficients:
##          ar1      ar2      ar3
##       0.4728  -0.1068  -0.5655
## s.e.  0.1272   0.1513   0.1384
## 
## sigma^2 estimated as 0.08692:  log likelihood=-9.02
## AIC=26.04   AICc=26.97   BIC=33.52
## 
## Training set error measures:
##                      ME      RMSE       MAE      MPE     MAPE      MASE
## Training set 0.02152847 0.2854554 0.2289932 187.4472 335.9332 0.6855497
##                     ACF1
## Training set -0.06089878

# Auto-arima chose a 3-hour lag structure, with no half-day effects

autoplot(forecast(aa, h = 48)) + geom_path(data = df_ts, aes(x = x, y = y)) + 
  ylab("Stationary Time Series")

Non-stationary time series

Now we will simulate non-stationary data, where relationships change through time, using `diffinv`, which cumulatively sums white noise into a random walk:

## non-stationary data
set.seed(44)
nonstationary_y <- diffinv(rnorm(length(time))) %>% ts()

autoplot(nonstationary_y) + ylab("Non-stationary Time Series")

Let’s see what the auto Arima algorithm estimates with non-stationary data:

aa_ns <- auto.arima(nonstationary_y[train])

summary(aa_ns)

## Series: nonstationary_y[train] 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 1.137:  log likelihood=-69.71
## AIC=141.42   AICc=141.51   BIC=143.27
## 
## Training set error measures:
##                      ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.01182676 1.055224 0.7741009 0.9130602 36.00029 0.9791667
##                    ACF1
## Training set 0.08409507
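An ARIMA(0,1,0) model is a random walk: the first difference is treated as white noise, so every point forecast equals the last observed value and the forecast standard error grows with the square root of the horizon. A small check with base R's `arima` and `predict`, on a hypothetical simulated walk rather than the series above:

```r
set.seed(2)
walk <- cumsum(rnorm(100))                 # hypothetical random walk
fit  <- arima(walk, order = c(0, 1, 0))    # no AR or MA terms, one difference
fc   <- predict(fit, n.ahead = 5)

as.numeric(fc$pred)                        # flat: repeats the last observation
tail(walk, 1)
as.numeric(fc$se)                          # compare with the line below
sqrt(fit$sigma2 * 1:5)
```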

Now, visualize the forecast of a linear model with non-stationary data!

df_ts_st <- data.frame(x = time, y = nonstationary_y[1:96])

aa_ns <- autoplot(forecast(aa_ns, h = 48)) + 
  geom_path(data = df_ts_st, aes(x = x, y = y)) + 
  ylab("Non-stationary Time Series"); aa_ns

Not a very good prediction… Let’s try empirical dynamic models!

Empirical Dynamic Models for forecasting:

The Lorenz model is a system of three ordinary differential equations, now known as the Lorenz equations:

$$\frac{dx}{dt} = \sigma(y - x)$$ $$\frac{dy}{dt} = x(\rho - z) - y$$ $$\frac{dz}{dt} = xy - \beta z$$
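For readers who want to see these dynamics, here is a minimal base-R sketch that integrates the standard form of the Lorenz system with a fourth-order Runge-Kutta step. The parameter values (sigma = 10, rho = 28, beta = 8/3), the step size, and the starting point are assumptions chosen because they are the commonly used chaotic settings. The function names are hypothetical.

```r
# Right-hand side of the Lorenz system for state s = c(x, y, z)
lorenz <- function(s, sigma = 10, rho = 28, beta = 8 / 3) {
  c(sigma * (s[2] - s[1]),
    s[1] * (rho - s[3]) - s[2],
    s[1] * s[2] - beta * s[3])
}

# One classical Runge-Kutta (RK4) step of size dt
rk4_step <- function(f, s, dt) {
  k1 <- f(s)
  k2 <- f(s + dt / 2 * k1)
  k3 <- f(s + dt / 2 * k2)
  k4 <- f(s + dt * k3)
  s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
}

n  <- 2000
dt <- 0.01
traj <- matrix(NA_real_, n, 3, dimnames = list(NULL, c("x", "y", "z")))
traj[1, ] <- c(1, 1, 1)                       # assumed starting point
for (i in 2:n) traj[i, ] <- rk4_step(lorenz, traj[i - 1, ], dt)

head(traj)
```

Any single column of `traj` is a time series of the kind EDM tries to reconstruct from its own lags.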

We will use the `simplex` function to determine how many dimensions (time lags) are needed to effectively develop a data-driven mechanistic formulation of the time series:

# set data for historical record (library) and prediction
lib <- c(1, 48)
pred <- c(49, 96)

simplex_output <- simplex(nonstationary_y, lib, pred)
str(simplex_output)

## 'data.frame':    10 obs. of  16 variables:
##  $ E                  : int  1 2 3 4 5 6 7 8 9 10
##  $ tau                : num  1 1 1 1 1 1 1 1 1 1
##  $ tp                 : num  1 1 1 1 1 1 1 1 1 1
##  $ nn                 : num  2 3 4 5 6 7 8 9 10 11
##  $ num_pred           : num  47 46 45 44 43 42 41 40 39 38
##  $ rho                : num  0.768 0.796 0.682 0.716 0.515 ...
##  $ mae                : num  2.81 2.76 3.03 3.1 3.38 ...
##  $ rmse               : num  3.55 3.46 3.89 3.88 4.21 ...
##  $ perc               : num  0.979 0.978 1 1 1 ...
##  $ p_val              : num  7.73e-12 5.15e-13 3.37e-08 4.22e-09 1.56e-04 ...
##  $ const_pred_num_pred: num  47 46 45 44 43 42 41 40 39 38
##  $ const_pred_rho     : num  0.954 0.954 0.947 0.944 0.939 ...
##  $ const_pred_mae     : num  1.008 0.988 0.989 0.951 0.966 ...
##  $ const_pred_rmse    : num  1.23 1.21 1.22 1.17 1.18 ...
##  $ const_pred_perc    : num  0.979 0.978 0.978 1 1 ...
##  $ const_p_val        : num  8.26e-36 6.46e-35 1.10e-31 2.88e-30 3.02e-28 ...

Let’s visualize the forecasting skill (rho)

par(mar = c(4, 4, 1, 1), mgp = c(2.5, 1, 0))  # set margins for plotting
plot(simplex_output$E, simplex_output$rho, type = "l", lwd = 5, col = "light blue", xlab = "Embedding Dimension (E)", 
     ylab = "Forecast Skill (rho)")

simplex_output <- simplex(nonstationary_y, lib, pred, E = 2, tp = 1:10)
plot(simplex_output$tp, simplex_output$rho, type = "l", lwd = 5, col = "light blue", xlab = "Time to Prediction (tp)", 
     ylab = "Forecast Skill (rho)")
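To show what the embedding dimension E controls, here is a simplified base-R sketch of the idea behind simplex projection: embed the series in E lags, find the E + 1 nearest library points for each prediction point, and average where those neighbours went one step later. `simplex_sketch` is a hypothetical helper written for illustration. It is not rEDM's implementation and will not reproduce its numbers exactly.

```r
simplex_sketch <- function(y, lib, pred, E) {
  emb   <- embed(y, E)          # row i = (y[t], ..., y[t-E+1]), t = i + E - 1
  t_idx <- E:length(y)          # time index t of each row
  # rows whose lags and one-step-ahead target both fall inside range r
  in_range <- function(r) which(t_idx >= r[1] + E - 1 & t_idx + 1 <= r[2])
  lib_rows  <- in_range(lib)
  pred_rows <- in_range(pred)

  forecast_one <- function(i) {
    diffs <- sweep(emb[lib_rows, , drop = FALSE], 2, emb[i, ])
    d  <- sqrt(rowSums(diffs^2))                 # distance to library points
    nn <- order(d)[seq_len(E + 1)]               # E + 1 nearest neighbours
    w  <- exp(-d[nn] / max(d[nn[1]], 1e-12))     # closer neighbours weigh more
    sum(w * y[t_idx[lib_rows[nn]] + 1]) / sum(w) # weighted next-step value
  }

  fc  <- vapply(pred_rows, forecast_one, numeric(1))
  obs <- y[t_idx[pred_rows] + 1]
  cor(obs, fc)                                   # forecast skill (rho)
}

set.seed(3)
y   <- cumsum(rnorm(97))                         # hypothetical random walk
rho <- sapply(1:5, function(E) simplex_sketch(y, c(1, 48), c(49, 96), E))
round(rho, 2)
```

Comparing `rho` across values of E mirrors the plot of forecast skill against embedding dimension.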

Run `simplex` with `stats_only = FALSE` to create an EDM model for forecasting:

smap_output <- simplex(nonstationary_y, lib, pred, E = 2, stats_only = FALSE)

predictions <- na.omit(smap_output$model_output[[1]])

df_ts_st_pred <- data.frame(x = time[51:96], y = nonstationary_y[51:96], predictions)

plot(df_ts_st$y~df_ts_st$x, type = "l")

edm <- ggplot(data = df_ts_st_pred) + ggtitle("Forecasts from EDM") + xlab("Time") + ylab(" ") + 
  geom_ribbon(aes(x = x, y = y, ymin = y - 1.96*sqrt(pred_var), ymax = y + 1.96*sqrt(pred_var)), fill = "blue", alpha = 0.2) +
  geom_ribbon(aes(x = x, y = y, ymin = y-sqrt(pred_var), ymax = y+sqrt(pred_var)), fill = "blue", alpha = 0.4) + 
  geom_path(aes(x = x, y = y)) + 
  geom_path(data = df_ts_st, aes(x = x, y = y)) + 
  ylab("Non-stationary Time Series"); edm
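The ribbons are normal-approximation bands: ±1.96 standard deviations for roughly 95% coverage and ±1 standard deviation for roughly 68%. Here is a small sketch of where the multiplier comes from, with the band centered on the forecast. The forecast values and variances are made-up numbers, not `simplex` output, and treating the variance as normally distributed error is an assumption.

```r
pred     <- c(-3.2, -2.9, -3.5)      # hypothetical point forecasts
pred_var <- c(0.8, 1.1, 0.9)         # hypothetical forecast variances

z     <- qnorm(0.975)                # about 1.96 for a two-sided 95% band
lower <- pred - z * sqrt(pred_var)
upper <- pred + z * sqrt(pred_var)
data.frame(pred, lower, upper)
```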

ggarrange(aa_ns + coord_cartesian(ylim = c(-20,8)) + ggtitle("Forecast with ARIMA"),
          edm + coord_cartesian(ylim = c(-20,8)) + ggtitle("Forecast with EDM")) + theme_bw()

Image 18.3

ggsave("forecasts.jpeg", dpi = 300)

## Saving 7 x 5 in image
