
This function perturbs the dataset by shuffling, one at a time, each of the features that appear in a population of models, and recomputes the evaluation of those models. The mean change (delta) in the chosen score gives a measure of each feature's importance. Two methods are implemented. The first (extensive) shuffles each feature multiple times and evaluates the whole population of models each time, which can be very time consuming. The second (optimized), which is the default, uses a different seed for each model when shuffling a given feature and evaluating the population, so it is not necessary to run multiple seeds on the whole dataset. This procedure is designed to be applied in cross-validation.
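The shuffling idea can be illustrated with a minimal base R sketch. It does not use the package's model objects: `toy_models`, `score_model`, and `permutation_delta` are hypothetical names, each toy "model" is just a set of feature names whose values are summed and thresholded, and accuracy stands in for the score. The features-in-rows layout of `X` and the use of one seed per model are assumptions made for the illustration.

```r
# Toy data: 3 features (rows) x 40 samples (columns); the layout is an assumption.
set.seed(42)
n <- 40
y <- rep(c(0, 1), each = n / 2)
X <- rbind(
  f1 = y + rnorm(n, sd = 0.05),  # informative feature
  f2 = rnorm(n, sd = 0.05),      # noise
  f3 = rnorm(n, sd = 0.05)       # noise
)

# Hypothetical stand-in for a population: each model is a set of feature names.
toy_models <- list(c("f1", "f2"), c("f1", "f3"), c("f1"))

# Accuracy of a toy model: sum its features per sample and threshold at 0.5.
score_model <- function(features, X, y) {
  pred <- as.numeric(colSums(X[features, , drop = FALSE]) > 0.5)
  mean(pred == y)
}

# Drop in score when a single feature is shuffled across samples.
permutation_delta <- function(feature, model, X, y, seed) {
  set.seed(seed)
  X_perm <- X
  X_perm[feature, ] <- sample(X[feature, ])
  score_model(model, X, y) - score_model(model, X_perm, y)
}

# Only features that appear in the population are shuffled.
features <- unique(unlist(toy_models))

# Mean delta over the population, using one seed per model.
importance <- sapply(features, function(f) {
  deltas <- sapply(seq_along(toy_models), function(i) {
    permutation_delta(f, toy_models[[i]], X, y, seed = i)
  })
  mean(deltas)
})
print(round(importance, 2))
```

In this sketch a model that does not contain the shuffled feature contributes a delta of exactly 0, which is a simplification; the informative feature `f1` should show a much larger mean delta than the noise features.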

Usage

evaluateFeatureImportanceInPopulation(
  pop,
  X,
  y,
  clf,
  score = "fit_",
  filter.ci = TRUE,
  method = "optimized",
  seed = c(1:10),
  aggregation = "mean",
  verbose = TRUE
)

Arguments

pop:

a population of models to be considered. This population will be filtered if filter.ci = TRUE (default) using the confidence interval computed around the best model using a binomial distribution.
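One way to picture this kind of filter is sketched below in base R. The accuracies, the sample size, and the 95% exact binomial interval from `binom.test` are all assumptions chosen for illustration; the exact rule applied by the package may differ.

```r
# Hypothetical accuracies of 5 models evaluated on n = 100 samples.
acc <- c(0.90, 0.88, 0.85, 0.80, 0.70)
n <- 100

# Exact binomial confidence interval around the best model's accuracy.
best <- max(acc)
ci <- binom.test(round(best * n), n, conf.level = 0.95)$conf.int

# Keep only the models whose accuracy is not below the lower bound.
keep <- acc >= ci[1]
print(acc[keep])
```

Models whose score falls below the lower bound are treated as clearly worse than the best model and are left out of the importance computation.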

X:

dataset used to classify

y:

variable to predict

clf:

an object containing the different parameters of the classifier

score:

the attribute of the model to be considered in the evaluation (default: fit_)

filter.ci:

whether to filter the population based on the confidence interval of the best model (default: TRUE)

method:

the shuffling strategy, either extensive or optimized (default: optimized). The extensive method shuffles each feature multiple times and evaluates the whole population of models each time, which can be very time consuming. The optimized method uses a different seed for each model when shuffling a given feature and evaluating the population.

seed:

one or more seeds to be used for shuffling in the extensive method (default: c(1:10)). For the optimized method only the first seed is used; the seeds needed for each model are incremented from there.
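The difference in workload between the two methods can be sketched as follows. This is only an assumed reading of the description above: `n_models` is a hypothetical population size, and the increment of 1 per model is an assumption rather than documented behaviour.

```r
seed <- c(1:10)
n_models <- 4  # hypothetical population size

# Extensive: every seed is applied to every model in the population.
extensive_runs <- expand.grid(model = seq_len(n_models), seed = seed)

# Optimized (assumed): one seed per model, counting up from the first seed.
optimized_runs <- data.frame(
  model = seq_len(n_models),
  seed  = seed[1] + seq_len(n_models) - 1
)

print(nrow(extensive_runs))  # shuffles per feature, extensive
print(nrow(optimized_runs))  # shuffles per feature, optimized
```

Under these assumptions the extensive method needs `length(seed)` times as many evaluations per feature as the optimized one.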

aggregation:

the method used to aggregate the evaluation over the whole population, either mean or median (default: mean).

verbose:

whether to print out information during execution (default: TRUE).

Value

a data.frame with features in rows and the population mean or median score for each model*seed combination