# Preprocessing data with the R `recipes` package: how to impute by mode in numeric columns (to fit a model with xgboost)?

## Issue

I want to use `xgboost` for a classification problem, and two predictors (out of several) are binary columns that also happen to have some missing values. Before fitting a model with `xgboost`, I want to replace those missing values by imputing the mode in each binary column.

My problem is that I want to do this imputation as part of a `tidymodels` "recipe", i.e., not with typical data-wrangling procedures such as `dplyr`/`tidyr`/`data.table`. Doing the imputation within a recipe should guard against "information leakage", since the imputation value is estimated from the training data alone and then applied to any new data.

Although the `recipes` package provides many `step_*()` functions designed for data preprocessing, I could not find a way to do the desired imputation by *mode* on numeric binary columns. While there *is* a function called `step_impute_mode()`, it accepts only nominal variables (i.e., of class `factor` or `character`). But I need my binary columns to remain numeric so they can be passed to the `xgboost` engine.

Consider the following toy example. I took it from this reference page and changed the data a bit to reflect the problem.

**create toy data**

```
# install.packages("xgboost")
library(tidymodels)
tidymodels_prefer()
# original data shipped with package
data(two_class_dat)
# simulating 2-column binary data + NAs
n_rows <- nrow(two_class_dat)
df_x1_x2 <-
data.frame(x1 = rbinom(n_rows, 1, runif(1)),
x2 = rbinom(n_rows, 1, runif(1)))
## randomly replace 25% of each column with NAs
df_x1_x2[c("x1", "x2")] <-
lapply(df_x1_x2[c("x1", "x2")], function(x) {
x[sample(seq_along(x), 0.25 * length(x))] <- NA
x
})
# bind original data & simulated data
df_to_xgboost <- cbind(two_class_dat, df_x1_x2)
# split data to training and testing
data_train <- df_to_xgboost[-(1:10), ]
data_test <- df_to_xgboost[ 1:10 , ]
```

**set up model specification & preprocessing recipe using tidymodels tools**

```
# model specification
xgb_spec <-
  boost_tree(trees = 15) %>%
  # this model can be used for classification or regression, so set the mode
  set_mode("classification") %>%
  set_engine("xgboost")

# preprocessing recipe
xgb_recipe <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_bin2factor(x1, x2) %>% # <--| these 2 lines are the heart of the problem:
  step_impute_mode(x1, x2)    # <--| I can't impute unless I first convert the
                              #   | columns from numeric to factor/character.
                              #   | But once I do, xgboost fails on non-numeric
                              #   | data, and there is no `step_*()` that
                              #   | converts back to numeric (like as.numeric()).

# bind `xgb_spec` and `xgb_recipe` into a workflow object
xgb_wflow <-
  workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_spec)
```

**fit the model**

```
fit(xgb_wflow, data_train)
#> Error in xgboost::xgb.DMatrix(x, label = y, missing = NA): 'data' has class 'character' and length 3124.
#> 'data' accepts either a numeric matrix or a single filename.
#> Timing stopped at: 0 0 0
```

The fitting fails because `data_train$x1` and `data_train$x2` become factors per `step_bin2factor(x1, x2)`. So that's my current catch: on the one hand, I can't fit an `xgboost` model unless all the data is numeric; on the other hand, I can't impute by mode unless the data is factor/character.
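
To see the type constraint directly, here is a minimal sketch (not part of the original question) of skipping the factor conversion and calling `step_impute_mode()` on the numeric columns:

```
# Sketch: step_impute_mode() applied directly to numeric columns.
# prep() runs the step's type check and rejects x1/x2 because they are
# numeric rather than factor/character (the exact message varies by
# recipes version).
recipe(Class ~ ., data = data_train) %>%
  step_impute_mode(x1, x2) %>%
  prep()
```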

Although there *is* a way to build custom `step_*()` functions, it's a bit complex. So I first wanted to reach out and see whether there's a trivial solution I might be missing. My situation, `xgboost` with binary predictors, seems pretty mainstream, and I don't want to reinvent the wheel.

## Solution

Credit to user @gus, who answered here:

```
xgb_recipe <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_num2factor(c(x1, x2),
                  transform = function(x) x + 1,
                  levels = c("0", "1")) %>%
  step_impute_mode(x1, x2) %>%
  step_mutate_at(c(x1, x2), fn = ~ as.numeric(.) - 1)
```
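
The trick: `step_num2factor()` treats the (transformed) numeric values as 1-based indices into `levels`, so `transform = function(x) x + 1` maps 0 to level "0" and 1 to level "1". After `step_impute_mode()` fills the NAs, `step_mutate_at()` converts back to numeric; `as.numeric()` on a factor returns the level index (1 or 2), so subtracting 1 restores the 0/1 coding.

As an aside, a shorter route that avoids the factor round-trip may also work here: for a 0/1 column the training-set median coincides with the mode whenever one value is in the majority (they differ only on an exact tie), and `step_impute_median()` accepts numeric columns. A minimal sketch under that assumption:

```
# Alternative sketch (not from the original answer): impute the binary
# numeric columns with the training-set median, which equals the mode
# for 0/1 data unless the two values are exactly tied.
xgb_recipe_median <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_impute_median(x1, x2)
```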

## The entire code

```
# install.packages("xgboost")
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
tidymodels_prefer()
data(two_class_dat)
n_rows <- nrow(two_class_dat)
df_x1_x2 <-
  data.frame(x1 = rbinom(n_rows, 1, runif(1)),
             x2 = rbinom(n_rows, 1, runif(1)))
df_x1_x2[c("x1", "x2")] <-
  lapply(df_x1_x2[c("x1", "x2")], function(x) {
    x[sample(seq_along(x), 0.25 * length(x))] <- NA
    x
  })
df_to_xgboost <- cbind(two_class_dat, df_x1_x2)
###
data_train <- df_to_xgboost[-(1:10), ]
data_test <- df_to_xgboost[ 1:10 , ]
xgb_spec <-
  boost_tree(trees = 15) %>%
  set_mode("classification") %>%
  set_engine("xgboost")
xgb_recipe <-
  recipe(formula = Class ~ ., data = data_train) %>%
  step_num2factor(c(x1, x2),
                  transform = function(x) x + 1,
                  levels = c("0", "1")) %>%
  step_impute_mode(x1, x2) %>%
  step_mutate_at(c(x1, x2), fn = ~ as.numeric(.) - 1)
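# verify the recipe: prep() on the training data, then bake() to inspect the result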
xgb_recipe %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 781 x 5
#>        A     B    x1    x2 Class
#>    <dbl> <dbl> <dbl> <dbl> <fct>
#>  1 1.44  1.68      1     1 Class1
#>  2 2.34  2.32      1     1 Class2
#>  3 2.65  1.88      0     1 Class2
#>  4 0.849 0.813     1     1 Class1
#>  5 3.25  0.869     1     1 Class1
#>  6 1.05  0.845     0     1 Class1
#>  7 0.886 0.489     1     0 Class1
#>  8 2.91  1.54      1     1 Class1
#>  9 3.14  2.06      1     1 Class2
#> 10 1.04  0.886     1     1 Class2
#> # ... with 771 more rows
xgb_wflow <-
  workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_spec)
fit(xgb_wflow, data_train)
#> [09:35:36] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
#> == Workflow [trained] ==========================================================
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> -- Preprocessor ----------------------------------------------------------------
#> 3 Recipe Steps
#>
#> * step_num2factor()
#> * step_impute_mode()
#> * step_mutate_at()
#>
#> -- Model -----------------------------------------------------------------------
#> ##### xgb.Booster
#> raw: 59.4 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1, objective = "binary:logistic"), data = x$data,
#> nrounds = 15, watchlist = x$watchlist, verbose = 0, nthread = 1)
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", objective = "binary:logistic", nthread = "1", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.evaluation.log()
#> # of features: 4
#> niter: 15
#> nfeatures : 4
#> evaluation_log:
#> iter training_logloss
#> 1 0.551974
#> 2 0.472546
#> ---
#> 14 0.251547
#> 15 0.245090
```

*Created on 2021-12-25 by the reprex package (v2.0.1.9000)*
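
For completeness, a quick sketch (not in the original answer) of applying the fitted workflow to the held-out rows: `data_test` also contains NAs in `x1`/`x2`, and `predict()` on a fitted workflow applies the recipe exactly as estimated on `data_train`, so the test rows are imputed with the training-set modes:

```
# Sketch: predict on the held-out rows. The recipe steps were estimated
# on data_train during fit(), so the same imputation values (the training
# modes) are applied to data_test before prediction.
xgb_fit <- fit(xgb_wflow, data_train)
predict(xgb_fit, new_data = data_test)
```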

Answered By – Emman

**This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.**