<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>KNN | Alex Baecher</title>
    <link>https://questlab.eco/tag/knn/</link>
      <atom:link href="https://questlab.eco/tag/knn/index.xml" rel="self" type="application/rss+xml" />
    <description>KNN</description>
    <generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><copyright>© 2026 Alex Baecher</copyright><lastBuildDate>Fri, 22 Oct 2021 00:40:04 -0700</lastBuildDate>
    <image>
      <url>https://questlab.eco/media/icon_hu16270048066519736882.png</url>
      <title>KNN</title>
      <link>https://questlab.eco/tag/knn/</link>
    </image>
    
    <item>
      <title>Machine Learning the &#39;Tidy&#39; Way</title>
      <link>https://questlab.eco/post/ml-tidymodels/</link>
      <pubDate>Fri, 22 Oct 2021 00:40:04 -0700</pubDate>
      <guid>https://questlab.eco/post/ml-tidymodels/</guid>
      <description>&lt;h1 id=&#34;introduction-to-machine-learning-with-tidymodels&#34;&gt;Introduction to machine learning with &lt;em&gt;tidymodels&lt;/em&gt;&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Tidymodels&lt;/strong&gt;&lt;/em&gt; provides a clean, organized, and&amp;ndash;most importantly&amp;ndash;consistent programming syntax for data pre-processing, model specification, model fitting, model evaluation, and prediction.&lt;/p&gt;
&lt;h2 id=&#34;anatomy-of-tidymodels&#34;&gt;Anatomy of &lt;em&gt;tidymodels&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;tidymodels&lt;/em&gt; is a meta-package that installs and loads the core packages listed below, which cover the main steps of modeling and machine learning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;rsample&lt;/strong&gt;&lt;/em&gt;: provides infrastructure for efficient data splitting and resampling&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;parsnip&lt;/strong&gt;&lt;/em&gt;: a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;recipes&lt;/strong&gt;&lt;/em&gt;: a tidy interface to data pre-processing tools for feature engineering&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;workflows&lt;/strong&gt;&lt;/em&gt;: workflows bundle your pre-processing, modeling, and post-processing together&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;tune&lt;/strong&gt;&lt;/em&gt;: helps you optimize the hyperparameters of your model and pre-processing steps&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;yardstick&lt;/strong&gt;&lt;/em&gt;: measures the effectiveness of models using performance metrics&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;dials&lt;/strong&gt;&lt;/em&gt;: contains tools to create and manage values of tuning parameters and is designed to integrate well with the parsnip package&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;broom&lt;/strong&gt;&lt;/em&gt;: summarizes key information about models in tidy tibble()s&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;First, let&amp;rsquo;s load the &lt;em&gt;&lt;strong&gt;tidymodels&lt;/strong&gt;&lt;/em&gt; meta-package, along with &lt;em&gt;tidyverse&lt;/em&gt; for data wrangling and plotting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(tidymodels)
library(tidyverse)
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&#34;package-tutorials&#34;&gt;Package tutorials:&lt;/h1&gt;
&lt;h2 id=&#34;data&#34;&gt;Data&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ll demonstrate its features using an existing data set from Bruno Oliveira, &lt;em&gt;AmphiBIO&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Link to publication: &lt;a href=&#34;https://www.nature.com/articles/sdata2017123&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://www.nature.com/articles/sdata2017123&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Link to data: &lt;a href=&#34;https://ndownloader.figstatic.com/files/8828578&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://ndownloader.figstatic.com/files/8828578&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;amphibio-data&#34;&gt;Amphibio data&lt;/h3&gt;
&lt;p&gt;Download data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# install.packages(&amp;quot;downloader&amp;quot;)
# library(downloader)
# 
# url &amp;lt;- &amp;quot;https://ndownloader.figstatic.com/files/8828578&amp;quot;
# download(url, dest=&amp;quot;dial_broom/amphibio.zip&amp;quot;, mode=&amp;quot;wb&amp;quot;) 
# unzip(&amp;quot;dial_broom/amphibio.zip&amp;quot;, exdir = &amp;quot;./dial_broom&amp;quot;)

library(readr)

amphibio_raw &amp;lt;- read_csv(&amp;quot;AmphiBIO_v1.csv&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data consist of natural history information on amphibians, including
habitat types, diet, size, etc.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the breakdown of taxonomic spread in the data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Order: N = 3&lt;/li&gt;
&lt;li&gt;Family: N = 61&lt;/li&gt;
&lt;li&gt;Genera: N = 531&lt;/li&gt;
&lt;li&gt;Species: N = 6776&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are also a lot of missing data, and the data that do exist are on wildly
different scales. We&amp;rsquo;ll clean this up:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Keep Order plus five traits, drop rows with missing values,
# log-transform the traits, and exclude the order Gymnophiona
amphibio &amp;lt;- amphibio_raw %&amp;gt;%
  select(&amp;quot;Order&amp;quot;
         ,&amp;quot;Body_mass_g&amp;quot;
         ,&amp;quot;Body_size_mm&amp;quot;
         ,&amp;quot;Litter_size_min_n&amp;quot;
         ,&amp;quot;Litter_size_max_n&amp;quot;
         ,&amp;quot;Reproductive_output_y&amp;quot;
         ) %&amp;gt;%
  na.omit %&amp;gt;%
  mutate(Body_mass_g = log(Body_mass_g),
         Body_size_mm = log(Body_size_mm),
         Litter_size_min_n = log(Litter_size_min_n),
         Litter_size_max_n = log(Litter_size_max_n),
         Reproductive_output_y = log(Reproductive_output_y)) %&amp;gt;%
  filter(!Order == &amp;quot;Gymnophiona&amp;quot;)
  
amphibio %&amp;gt;%
  group_by(Order) %&amp;gt;%
  summarize(n = n())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let&amp;rsquo;s have a peek at the data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  amphibio %&amp;gt;% 
  pivot_longer(!Order, names_to = &amp;quot;Metric&amp;quot;, values_to = &amp;quot;Value&amp;quot;) %&amp;gt;%
  ggplot(aes(Order, Value, col = Order)) + 
    geom_boxplot() + 
    facet_wrap(~Metric)
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-4-1_hu16402664525895876493.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-4-1_hu16502879561273140279.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-4-1_hu16367533267459427282.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-4-1_hu16402664525895876493.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;There are some trends in the data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;caudates are longer&lt;/li&gt;
&lt;li&gt;anura have larger litter sizes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Given the data, one possible modeling application is to use the trait data
to predict order (Anura or Caudata) with two models: k-nearest neighbors (KNN) and boosted trees.&lt;/p&gt;
&lt;p&gt;To start the modeling process, we&amp;rsquo;ll use &lt;em&gt;rsample&lt;/em&gt; to split the data
into training and testing sets.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(42)

tidy_split &amp;lt;- initial_split(amphibio, prop = 0.95)
tidy_train &amp;lt;- training(tidy_split)
tidy_test &amp;lt;- testing(tidy_split)
tidy_kfolds &amp;lt;- vfold_cv(tidy_train)
&lt;/code&gt;&lt;/pre&gt;
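&lt;p&gt;As a minimal sketch of what these calls return, the same steps can be run on the built-in &lt;em&gt;iris&lt;/em&gt; data. The &lt;em&gt;toy_&lt;/em&gt; object names are hypothetical and &lt;em&gt;iris&lt;/em&gt; is only a stand-in for the amphibian table; &lt;em&gt;v = 10&lt;/em&gt; is written out here, although it is also the default for &lt;em&gt;vfold_cv()&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(rsample)

set.seed(42)

# Stand-in data: iris replaces amphibio here (illustration only)
toy_split &amp;lt;- initial_split(iris, prop = 0.95)
toy_train &amp;lt;- training(toy_split)
toy_test &amp;lt;- testing(toy_split)

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(iris)

# Cross-validation folds are built from the training rows only
toy_folds &amp;lt;- vfold_cv(toy_train, v = 10)
nrow(toy_folds)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;em&gt;prop = 0.95&lt;/em&gt; only about 5% of the rows are held out, so any test-set metric rests on very few observations.&lt;/p&gt;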
&lt;p&gt;We can use &lt;em&gt;recipes&lt;/em&gt; to preprocess the data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Recipes package 
## For preprocessing, feature engineering, and feature elimination 
tidy_rec &amp;lt;- recipe(Order ~ ., data = tidy_train) %&amp;gt;% 
  step_dummy(all_nominal(), -all_outcomes()) %&amp;gt;% 
  step_normalize(all_predictors()) %&amp;gt;%
  prep()
&lt;/code&gt;&lt;/pre&gt;
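&lt;p&gt;&lt;em&gt;step_normalize()&lt;/em&gt; centers and scales each predictor with the mean and standard deviation estimated from the training data, and reuses those same values for any new data. Here is a base R sketch of that idea; the numbers are made up and the object names are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Made-up values standing in for one log-transformed predictor
train_x &amp;lt;- c(1.2, 2.5, 3.1, 4.8, 5.4)
test_x &amp;lt;- c(2.0, 6.0)

# Statistics are estimated from the training rows only
train_mean &amp;lt;- mean(train_x)
train_sd &amp;lt;- sd(train_x)

# The same training statistics are applied to both sets
train_z &amp;lt;- (train_x - train_mean) / train_sd
test_z &amp;lt;- (test_x - train_mean) / train_sd

mean(train_z)  # approximately 0
sd(train_z)    # 1
test_z         # test values expressed on the training scale
&lt;/code&gt;&lt;/pre&gt;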
&lt;p&gt;Now that we&amp;rsquo;ve created a recipe to process the data for modeling, we can
use &lt;em&gt;&lt;strong&gt;parsnip&lt;/strong&gt;&lt;/em&gt; to model the data:&lt;/p&gt;
&lt;p&gt;First, let&amp;rsquo;s have a look at the model&amp;rsquo;s description:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(&amp;quot;webshot&amp;quot;)
# ?boost_tree
&lt;/code&gt;&lt;/pre&gt;
&lt;iframe src=&#34;https://parsnip.tidymodels.org/reference/boost_tree.html&#34; width=&#34;100%&#34; height=&#34;400px&#34;&gt;
&lt;/iframe&gt;
&lt;h2 id=&#34;boost_tree&#34;&gt;&lt;em&gt;boost_tree()&lt;/em&gt;&lt;/h2&gt;
&lt;h3 id=&#34;description&#34;&gt;Description&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;boost_tree()&lt;/strong&gt;&lt;/em&gt; defines a model that creates a series of decision trees
forming an ensemble. Each tree depends on the results of previous trees.
All trees in the ensemble are combined to produce a final prediction.&lt;/p&gt;
&lt;p&gt;There are different ways to fit this model. See the engine-specific
pages for more details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;xgboost (default)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;C5.0&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;spark&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p id=&#34;nearest_neighbors&#34;&gt;Next, the description of &lt;em&gt;nearest_neighbor()&lt;/em&gt;:&lt;/p&gt;
&lt;iframe src=&#34;https://parsnip.tidymodels.org/reference/nearest_neighbor.html&#34; width=&#34;100%&#34; height=&#34;400px&#34;&gt;
&lt;/iframe&gt;
&lt;h2 id=&#34;nearest_neighbor&#34;&gt;&lt;em&gt;nearest_neighbor()&lt;/em&gt;&lt;/h2&gt;
&lt;p id=&#34;defines-a-model-that-uses-the-k-most-similar-data-points-from-the-training-set-to-predict-new-samples&#34;&gt;&lt;em&gt;&lt;strong&gt;nearest_neighbor()&lt;/strong&gt;&lt;/em&gt; defines a model that uses the K most similar data points from the training set to predict new samples.&lt;/p&gt;
&lt;p id=&#34;there-are-different-ways-to-fit-this-model-see-the-engine-specific-pages-for-more-details&#34;&gt;There are different ways to fit this model. See the engine-specific pages for more details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;kknn (default)&lt;/li&gt;
&lt;/ul&gt;
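&lt;p&gt;To make the idea concrete, here is a minimal base R sketch of nearest-neighbor classification. It is a simplification: &lt;em&gt;knn_predict()&lt;/em&gt; is a hypothetical helper that uses Euclidean distance and a plain majority vote, whereas the kknn engine also supports distance-weighted voting. The data are made up.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: plain majority vote among the k nearest training rows
knn_predict &amp;lt;- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row
  dists &amp;lt;- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest &amp;lt;- order(dists)[seq_len(k)]
  # Most common class among the k nearest neighbors
  names(which.max(table(train_y[nearest])))
}

# Made-up, already-scaled predictors for six animals
train_x &amp;lt;- matrix(c(1, 1,  1, 2,  2, 1,
                    8, 8,  8, 9,  9, 8), ncol = 2, byrow = TRUE)
train_y &amp;lt;- c(&amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Caudata&amp;quot;, &amp;quot;Caudata&amp;quot;, &amp;quot;Caudata&amp;quot;)

knn_predict(train_x, train_y, new_x = c(2, 2), k = 3)  # nearest rows are all Anura
knn_predict(train_x, train_y, new_x = c(7, 9), k = 3)  # nearest rows are all Caudata
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the vote depends on distances, predictors on very different scales would let one variable dominate, which is why the recipe normalizes them first.&lt;/p&gt;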
&lt;p&gt;Now, let&amp;rsquo;s fit the models:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Parsnip package 
## Standardized api for creating models 
tidy_boosted_model &amp;lt;- boost_tree(trees = tune(),
                                min_n = tune(),
                                learn_rate = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;xgboost&amp;quot;)

tidy_knn_model &amp;lt;- nearest_neighbor(neighbors = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;kknn&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our model specifications are complete, but now we want to use &lt;em&gt;dials&lt;/em&gt; to
set up candidate values for the parameters we marked with &lt;em&gt;tune()&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id=&#34;dials&#34;&gt;&lt;em&gt;&lt;strong&gt;dials&lt;/strong&gt;&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;For the boosted trees, we marked three parameters for tuning:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;parameters(tidy_boosted_model)

## Collection of 3 parameters for tuning
## 
##  identifier       type    object
##       trees      trees nparam[+]
##       min_n      min_n nparam[+]
##  learn_rate learn_rate nparam[+]
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;trees&lt;/em&gt;: An integer for the number of trees contained in the ensemble.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;min_n&lt;/em&gt;: An integer for the minimum number of data points in a node that is required for the node to be split further.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;learn_rate&lt;/em&gt;: A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;KNN has a single parameter to tune, the number of neighbors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;parameters(tidy_knn_model)

## Collection of 1 parameters for tuning
## 
##  identifier      type    object
##   neighbors neighbors nparam[+]
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;neighbors&lt;/em&gt;: A single integer for the number of neighbors to consider
(often called k). For kknn, a value of 5 is used if neighbors is not
specified.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, we can use &lt;em&gt;dials&lt;/em&gt; to set the possible parameter values, which can
then be tuned using &lt;em&gt;tune&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# dials creates the parameter grids
# tune applies the parameter grids to the models
# dials package
# Number of levels (regular grid) or grid size (random and max-entropy grids) per model
boosted_params &amp;lt;- 5
knn_params &amp;lt;- 10

# ?grid_regular

boosted_grid &amp;lt;- grid_regular(parameters(tidy_boosted_model), levels = boosted_params)
boosted_grid

## # A tibble: 125 x 3
##    trees min_n   learn_rate
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;        &amp;lt;dbl&amp;gt;
##  1     1     2 0.0000000001
##  2   500     2 0.0000000001
##  3  1000     2 0.0000000001
##  4  1500     2 0.0000000001
##  5  2000     2 0.0000000001
##  6     1    11 0.0000000001
##  7   500    11 0.0000000001
##  8  1000    11 0.0000000001
##  9  1500    11 0.0000000001
## 10  2000    11 0.0000000001
## # ... with 115 more rows

knn_grid &amp;lt;- grid_regular(parameters(tidy_knn_model), levels = knn_params)
knn_grid

## # A tibble: 10 x 1
##    neighbors
##        &amp;lt;int&amp;gt;
##  1         1
##  2         2
##  3         4
##  4         5
##  5         7
##  6         8
##  7        10
##  8        11
##  9        13
## 10        15
&lt;/code&gt;&lt;/pre&gt;
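&lt;p&gt;&lt;em&gt;grid_regular()&lt;/em&gt; crosses every level of every parameter, so 5 levels for each of 3 parameters gives 5 x 5 x 5 = 125 rows. A base R sketch of the same crossing with &lt;em&gt;expand.grid()&lt;/em&gt; follows. The lower bounds and the &lt;em&gt;trees&lt;/em&gt; range are read off the printed grid; the upper bounds for &lt;em&gt;min_n&lt;/em&gt; (40) and &lt;em&gt;learn_rate&lt;/em&gt; (0.1) are assumptions, and &lt;em&gt;toy_grid&lt;/em&gt; is a hypothetical name.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;n_levels &amp;lt;- 5

# Five evenly spaced levels per parameter
# (upper bounds for min_n and learn_rate are assumed, not taken from the output)
trees &amp;lt;- as.integer(seq(1, 2000, length.out = n_levels))
min_n &amp;lt;- as.integer(seq(2, 40, length.out = n_levels))
learn_rate &amp;lt;- 10^seq(-10, -1, length.out = n_levels)  # evenly spaced on the log10 scale

# Full factorial crossing of all levels
toy_grid &amp;lt;- expand.grid(trees = trees, min_n = min_n, learn_rate = learn_rate)

nrow(toy_grid)  # 5^3 = 125
head(toy_grid)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The row count grows multiplicatively with every extra parameter or level, which is the main reason to consider the smaller random and maximum entropy grids.&lt;/p&gt;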
&lt;p&gt;Implement tuning grid using &lt;em&gt;tune&lt;/em&gt;:&lt;/p&gt;
&lt;h2 id=&#34;tune&#34;&gt;&lt;em&gt;&lt;strong&gt;tune&lt;/strong&gt;&lt;/em&gt;&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# install.packages(c(&amp;quot;xgboost&amp;quot;, &amp;quot;kknn&amp;quot;))
library(xgboost)
library(kknn)

# tune package
# system.time(
#   boosted_tune &amp;lt;- tune_grid(tidy_boosted_model,
#                             tidy_rec,
#                             resamples = tidy_kfolds,
#                             grid = boosted_grid)
# )
# write_rds(boosted_tune, &amp;quot;boosted_tune.rds&amp;quot;)
boosted_tune &amp;lt;- read_rds(&amp;quot;boosted_tune.rds&amp;quot;)

# system.time(
#   knn_tune &amp;lt;- tune_grid(tidy_knn_model,
#                         tidy_rec,
#                         resamples = tidy_kfolds,
#                         grid = knn_grid)
# ) 
# write_rds(knn_tune, &amp;quot;knn_tune.rds&amp;quot;)
knn_tune &amp;lt;- read_rds(&amp;quot;knn_tune.rds&amp;quot;)

# Use tune to extract the best parameters by ROC AUC
boosted_param &amp;lt;- boosted_tune %&amp;gt;% select_best(metric = &amp;quot;roc_auc&amp;quot;)
knn_param &amp;lt;- knn_tune %&amp;gt;% select_best(metric = &amp;quot;roc_auc&amp;quot;)
# Apply the parameters to the models
tidy_boosted_model_final &amp;lt;- finalize_model(tidy_boosted_model, boosted_param)
tidy_knn_model_final &amp;lt;- finalize_model(tidy_knn_model, knn_param)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we&amp;rsquo;ll try different options from &lt;em&gt;dials&lt;/em&gt; for parameter tuning, using
two additional methods for grid specification:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;random grid with &lt;em&gt;dials::grid_random&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;maximum entropy grid with &lt;em&gt;dials::grid_max_entropy&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;grid_random&#34;&gt;&lt;em&gt;grid_random&lt;/em&gt;&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;boosted_grid_rand &amp;lt;- grid_random(parameters(tidy_boosted_model), size = boosted_params)
boosted_grid_rand

## # A tibble: 5 x 3
##   trees min_n learn_rate
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;      &amp;lt;dbl&amp;gt;
## 1   190    21   2.32e- 5
## 2  1816    12   3.60e- 8
## 3   293    28   3.14e-10
## 4   314     8   2.52e- 7
## 5  1363     5   5.92e- 6

# size is an upper bound: duplicate draws are dropped, so fewer than 10 rows can come back
knn_grid_rand &amp;lt;- grid_random(parameters(tidy_knn_model), size = knn_params)
knn_grid_rand

## # A tibble: 7 x 1
##   neighbors
##       &amp;lt;int&amp;gt;
## 1         1
## 2        10
## 3         5
## 4         3
## 5        11
## 6         8
## 7         2

# system.time(
#   boosted_tune_rand &amp;lt;- tune_grid(tidy_boosted_model,
#                                  tidy_rec,
#                                  resamples = tidy_kfolds,
#                                  grid = boosted_grid_rand)
# )
# write_rds(boosted_tune_rand, &amp;quot;boosted_tune_rand.rds&amp;quot;)
boosted_tune_rand &amp;lt;- read_rds(&amp;quot;boosted_tune_rand.rds&amp;quot;)

# system.time(
#   knn_tune_rand &amp;lt;- tune_grid(tidy_knn_model,
#                              tidy_rec,
#                              resamples = tidy_kfolds,
#                              grid = knn_grid_rand)
# )
# write_rds(knn_tune_rand, &amp;quot;knn_tune_rand.rds&amp;quot;)
knn_tune_rand &amp;lt;- read_rds(&amp;quot;knn_tune_rand.rds&amp;quot;)

# Use tune to extract the best parameters by ROC AUC
boosted_param_rand &amp;lt;- boosted_tune_rand %&amp;gt;% select_best(metric = &amp;quot;roc_auc&amp;quot;)
knn_param_rand &amp;lt;- knn_tune_rand %&amp;gt;% select_best(metric = &amp;quot;roc_auc&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;grid_max_entropy&#34;&gt;&lt;em&gt;grid_max_entropy&lt;/em&gt;&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;boosted_grid_maxent &amp;lt;- grid_max_entropy(parameters(tidy_boosted_model), size = boosted_params)
boosted_grid_maxent

## # A tibble: 5 x 3
##   trees min_n learn_rate
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;      &amp;lt;dbl&amp;gt;
## 1   433    25   4.27e-10
## 2  1671    13   3.28e-10
## 3  1520     3   3.21e- 6
## 4   672     3   3.06e-10
## 5  1371    22   2.32e- 5

knn_grid_maxent &amp;lt;- grid_max_entropy(parameters(tidy_knn_model), size = knn_params)
knn_grid_maxent

## # A tibble: 10 x 1
##    neighbors
##        &amp;lt;int&amp;gt;
##  1         3
##  2        10
##  3         1
##  4        15
##  5        13
##  6         4
##  7         6
##  8         8
##  9         9
## 10        11

# system.time(
#   boosted_tune_maxent &amp;lt;- tune_grid(tidy_boosted_model,
#                                    tidy_rec,
#                                    resamples = tidy_kfolds,
#                                    grid = boosted_grid_maxent)
# )
# write_rds(boosted_tune_maxent, &amp;quot;boosted_tune_maxent.rds&amp;quot;)
boosted_tune_maxent &amp;lt;- read_rds(&amp;quot;boosted_tune_maxent.rds&amp;quot;)

# system.time(
#   knn_tune_maxent &amp;lt;- tune_grid(tidy_knn_model,
#                                tidy_rec,
#                                resamples = tidy_kfolds,
#                                grid = knn_grid_maxent)
# )
# write_rds(knn_tune_maxent, &amp;quot;knn_tune_maxent.rds&amp;quot;)
knn_tune_maxent &amp;lt;- read_rds(&amp;quot;knn_tune_maxent.rds&amp;quot;)

# Use tune to extract the best parameters by ROC AUC
boosted_param_maxent &amp;lt;- boosted_tune_maxent %&amp;gt;% select_best(metric = &amp;quot;roc_auc&amp;quot;)
knn_param_maxent &amp;lt;- knn_tune_maxent %&amp;gt;% select_best(metric = &amp;quot;roc_auc&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;workflows&#34;&gt;&lt;em&gt;workflows&lt;/em&gt;&lt;/h2&gt;
&lt;h3 id=&#34;for-combining-model-parameters-and-preprocessing&#34;&gt;For combining model, parameters, and preprocessing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;boosted_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(tidy_boosted_model_final) %&amp;gt;% 
  add_recipe(tidy_rec)

knn_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(tidy_knn_model_final) %&amp;gt;% 
  add_recipe(tidy_rec)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;yardstick&#34;&gt;&lt;em&gt;yardstick&lt;/em&gt;&lt;/h2&gt;
&lt;h3 id=&#34;for-extracting-metrics-from-the-model&#34;&gt;For extracting metrics from the model&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;boosted_res &amp;lt;- last_fit(boosted_wf, tidy_split)
knn_res &amp;lt;- last_fit(knn_wf, tidy_split)

mods &amp;lt;- bind_rows(
  boosted_res %&amp;gt;% mutate(model = &amp;quot;xgb&amp;quot;),
  knn_res %&amp;gt;% mutate(model = &amp;quot;knn&amp;quot;)) %&amp;gt;% 
  unnest(.metrics)

ggplot(bind_rows(mods$.predictions), aes(Order, .pred_Anura)) + 
  geom_boxplot()
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-1_hu2799537828652431356.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-1_hu12140637065517628874.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-1_hu14879292936314113267.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-1_hu2799537828652431356.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;

&lt;pre&gt;&lt;code&gt;ggplot(bind_rows(mods$.predictions), aes(Order, .pred_Caudata)) + 
  geom_boxplot()
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-2_hu17655385560798890030.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-2_hu17734957877442164994.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-2_hu3054777465481101903.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-2_hu17655385560798890030.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;

&lt;pre&gt;&lt;code&gt;ggplot(mods, aes(x = model, y = .estimate, col = model)) + 
  geom_point() + 
  facet_wrap(~.metric)
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-3_hu2528959886499758053.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-3_hu5960965889861658502.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-3_hu8034350242501598142.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-17-3_hu2528959886499758053.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;A confusion matrix visualizes the boosted model&amp;rsquo;s predictions against the truth:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;boosted_res %&amp;gt;% unnest(.predictions) %&amp;gt;% 
  conf_mat(truth = Order, estimate = .pred_class) %&amp;gt;%
  autoplot()
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-18-1_hu11449505557213433085.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-18-1_hu1067537527602828794.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-18-1_hu17114581279966710681.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-18-1_hu11449505557213433085.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;
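&lt;p&gt;&lt;em&gt;conf_mat()&lt;/em&gt; cross-tabulates predicted and observed classes. The same kind of table can be sketched in base R with &lt;em&gt;table()&lt;/em&gt;; the labels below are made up for illustration and are not output from either model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Made-up observed and predicted orders for eight animals
truth &amp;lt;- factor(c(&amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Caudata&amp;quot;, &amp;quot;Caudata&amp;quot;, &amp;quot;Caudata&amp;quot;))
pred  &amp;lt;- factor(c(&amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Anura&amp;quot;, &amp;quot;Caudata&amp;quot;, &amp;quot;Caudata&amp;quot;, &amp;quot;Caudata&amp;quot;, &amp;quot;Anura&amp;quot;))

# Rows are predictions, columns are the truth
toy_cm &amp;lt;- table(Prediction = pred, Truth = truth)
toy_cm

# Accuracy is the share of counts on the diagonal
sum(diag(toy_cm)) / sum(toy_cm)  # 6 / 8 = 0.75
&lt;/code&gt;&lt;/pre&gt;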

&lt;h3 id=&#34;fit-the-entire-data-set-using-the-final-wf&#34;&gt;Fit the final workflows to the entire data set&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;final_boosted_model &amp;lt;- fit(boosted_wf, amphibio)

final_knn_model &amp;lt;- fit(knn_wf, amphibio)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;broom&#34;&gt;&lt;em&gt;broom&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;Now we can use &lt;em&gt;broom&lt;/em&gt; to tidy the results from these models, and
provide an intuitive view of their meaning!&lt;/p&gt;
&lt;h2 id=&#34;augment&#34;&gt;&lt;em&gt;augment()&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;First, we&amp;rsquo;ll use &lt;em&gt;augment()&lt;/em&gt; to obtain predictions from each fitted workflow; it binds the
predicted class and the class probabilities to the data passed as &lt;em&gt;new_data&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;boosted_aug &amp;lt;- augment(final_boosted_model, new_data = amphibio[,-1])
knn_aug &amp;lt;- augment(final_knn_model, new_data = amphibio[,-1])

boosted_aug_long &amp;lt;- boosted_aug %&amp;gt;%
  pivot_longer(-c(.pred_class, .pred_Anura, .pred_Caudata), names_to = &amp;quot;predictor&amp;quot;, values_to = &amp;quot;value&amp;quot;) 
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;now-we-can-evaluate-the-models-using-yardstick&#34;&gt;Now we can evaluate the models using &lt;em&gt;yardstick&lt;/em&gt;!&lt;/h2&gt;
&lt;h1 id=&#34;yardstick-1&#34;&gt;&lt;em&gt;yardstick&lt;/em&gt;&lt;/h1&gt;
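&lt;pre&gt;&lt;code&gt;# Caveat: a fitted workflow applies its recipe inside predict(), so baking tidy_test
# first preprocesses the data twice; passing tidy_test directly would avoid this.
# The final model was also fit on all of amphibio, which includes the test rows.
final_boosted_model %&amp;gt;%
  predict(bake(tidy_rec, new_data = tidy_test), type = &amp;quot;prob&amp;quot;) %&amp;gt;%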
  bind_cols(tidy_test) %&amp;gt;%
  roc_auc(factor(Order), .pred_Anura)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.759

final_boosted_model %&amp;gt;%
  predict(bake(tidy_rec, new_data = tidy_test), type = &amp;quot;prob&amp;quot;) %&amp;gt;%
  bind_cols(tidy_test) %&amp;gt;%
  roc_curve(factor(Order), .pred_Anura) %&amp;gt;%
  autoplot() 
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-21-1_hu7672928658517871079.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-21-1_hu15580196728854118655.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-21-1_hu3457540179927346600.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-21-1_hu7672928658517871079.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;

&lt;h2 id=&#34;evaluating-knn-model&#34;&gt;Evaluating knn model&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;final_knn_model %&amp;gt;%
  predict(bake(tidy_rec, new_data = tidy_test), type = &amp;quot;prob&amp;quot;) %&amp;gt;%
  bind_cols(tidy_test) %&amp;gt;%
  roc_auc(factor(Order), .pred_Anura)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary           0.5

final_knn_model %&amp;gt;%
  predict(bake(tidy_rec, new_data = tidy_test), type = &amp;quot;prob&amp;quot;) %&amp;gt;%
  bind_cols(tidy_test) %&amp;gt;%
  roc_curve(factor(Order), .pred_Anura) %&amp;gt;%
  autoplot()
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-22-1_hu18200676328988334459.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-22-1_hu18171846850144008820.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-22-1_hu11196280528281156913.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-22-1_hu18200676328988334459.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;
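&lt;p&gt;To help read these scores: ROC AUC is the share of (Anura, Caudata) pairs in which the Anura row receives the higher &lt;em&gt;.pred_Anura&lt;/em&gt;, so 0.5 means the scores rank the two orders no better than chance. Below is a minimal base R sketch using the rank (Mann-Whitney) formulation; &lt;em&gt;auc_by_rank()&lt;/em&gt; is a hypothetical helper and the scores are made up, so this is not how &lt;em&gt;yardstick&lt;/em&gt; computes it internally.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: AUC from ranks (Mann-Whitney formulation)
auc_by_rank &amp;lt;- function(is_positive, score) {
  ranks &amp;lt;- rank(score)  # ties receive their average rank
  n_pos &amp;lt;- sum(is_positive)
  n_neg &amp;lt;- sum(!is_positive)
  (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up example: TRUE marks Anura rows, scores play the role of .pred_Anura
is_anura &amp;lt;- c(TRUE, TRUE, TRUE, FALSE, FALSE)

auc_by_rank(is_anura, c(0.9, 0.8, 0.4, 0.6, 0.1))  # 5 of 6 pairs ordered correctly
auc_by_rank(is_anura, rep(0.5, 5))                 # 0.5: identical scores carry no ranking information
&lt;/code&gt;&lt;/pre&gt;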

&lt;h2 id=&#34;visualizing-predictions&#34;&gt;Visualizing predictions:&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;library(viridis)

ggplot(boosted_aug_long, aes(x = value, y = .pred_Anura, col = .pred_class)) + 
  geom_point() + 
  facet_wrap(~predictor) + 
  scale_color_viridis_d(&amp;quot;Predicted class&amp;quot;, option = &amp;quot;D&amp;quot;) +
  theme_bw()
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-23-1_hu8944076588591668619.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-23-1_hu16889084873753849499.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-23-1_hu4124343322066363760.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-23-1_hu8944076588591668619.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;

&lt;pre&gt;&lt;code&gt;ggplot(boosted_aug_long, aes(x = value, y = .pred_Caudata, col = .pred_class)) + 
  geom_point() + 
  facet_wrap(~predictor) + 
  scale_color_viridis_d(&amp;quot;Predicted class&amp;quot;, option = &amp;quot;D&amp;quot;) +
  theme_bw()
&lt;/code&gt;&lt;/pre&gt;


















&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img alt=&#34; &#34; srcset=&#34;
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-23-2_hu11046834690038990038.webp 400w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-23-2_hu6041234132259651925.webp 760w,
               /media/posts/script_files/figure-markdown_strict/unnamed-chunk-23-2_hu6316135883687380815.webp 1200w&#34;
               src=&#34;https://questlab.eco/media/posts/script_files/figure-markdown_strict/unnamed-chunk-23-2_hu11046834690038990038.webp&#34;
               width=&#34;672&#34;
               height=&#34;480&#34;
               loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      
    &lt;/figcaption&gt;&lt;/figure&gt;

</description>
    </item>
    
  </channel>
</rss>
