Error when using rfe in caret with a PLS-DA model

I am trying to use the rfe function from the caret package together with a PLS-DA model.

sessionInfo()

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
 [1] splines   grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mclust_4.4            Kendall_2.2           doBy_4.5-13           survival_2.37-7       statmod_1.4.20       
 [6] preprocessCore_1.26.1 sva_3.10.0            mgcv_1.8-4            nlme_3.1-119          corpcor_1.6.7        
[11] car_2.0-22            reshape2_1.4.1        gplots_2.16.0         DMwR_0.4.1            mi_0.09-19           
[16] arm_1.7-07            lme4_1.1-7            Matrix_1.1-5          MASS_7.3-37           randomForest_4.6-10  
[21] plyr_1.8.1            pls_2.4-3             caret_6.0-41          ggplot2_1.0.0         lattice_0.20-29      
[26] pcaMethods_1.54.0     Rcpp_0.11.4           Biobase_2.24.0        BiocGenerics_0.10.0  

loaded via a namespace (and not attached):
 [1] abind_1.4-0         bitops_1.0-6        boot_1.3-14         BradleyTerry2_1.0-5 brglm_0.5-9         caTools_1.17.1     
 [7] class_7.3-11        coda_0.16-1         codetools_0.2-10    colorspace_1.2-4    compiler_3.1.1      digest_0.6.8       
[13] e1071_1.6-4         foreach_1.4.2       foreign_0.8-62      gdata_2.13.3        gtable_0.1.2        gtools_3.4.1       
[19] iterators_1.0.7     KernSmooth_2.23-13  minqa_1.2.4         munsell_0.4.2       nloptr_1.0.4        nnet_7.3-8         
[25] proto_0.3-10        quantmod_0.4-3      R2WinBUGS_2.1-19    ROCR_1.0-5          rpart_4.1-8         scales_0.2.4       
[31] stringr_0.6.2       tools_3.1.1         TTR_0.22-0          xts_0.9-7           zoo_1.7-11   

As a practice run, I tried the following example using the iris data set.

data(iris)
subsets <- 2:4
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(Species ~., data = iris, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')

Everything works fine.

mod
Recursive feature selection

Outer resampling method: Cross-Validated (5 fold) 

Resampling performance over subset size:

 Variables Accuracy Kappa AccuracySD KappaSD Selected
         2   0.6533  0.48    0.02981 0.04472         
         3   0.8067  0.71    0.06412 0.09618        *
         4   0.7867  0.68    0.07674 0.11511         

The top 3 variables (out of 3):
   Sepal.Width, Petal.Length, Sepal.Length

However, when I try to reproduce this on data I generated myself, I get the error below. I cannot work out why! If you have any ideas, I would be glad to hear them.

x <- as.data.frame(matrix(0,10,10))
for(i in 1:9) {x[,i] <- rnorm(10,0,1)}
x[,10] <- as.factor(rbinom(10, 1, 0.5))
subsets <- 2:9
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(V10 ~., data = x, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')

Error in { : task 1 failed - "undefined columns selected"

In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

4: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

5: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Asked by NicolaV, 12.03.2015


Answers (1)


I worked out (after much effort) that the levels of the response factor have to be character labels, such as 'Positive' and 'Negative', rather than numbers like 0 and 1, in order to combine PLS-DA with RFE in caret.

For example...

x <- data.frame(matrix(rnorm(1000),100,10))
y <- as.factor(c(rep('Positive',40), rep('Negative',60)))
data <- data.frame(x,y)
subsets <- 2:9
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(y ~., data, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')
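
The same idea can be applied to the data from the question by relabelling the response before calling rfe. The sketch below uses base R only. The labels "Negative" and "Positive", the seed, and the helper variable names are illustrative assumptions. The check with make.names reflects a likely explanation, not one confirmed in this thread: levels such as "0" and "1" are not syntactically valid R names, so any code that uses the levels as column names may end up looking for columns that do not exist.

```r
# Base R only; caret is not needed to run this relabelling step.
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Same shape as the data in the question: 9 numeric predictors, 1 binary response.
x <- as.data.frame(matrix(rnorm(90), nrow = 10, ncol = 9))
response <- rbinom(10, 1, 0.5)

# Levels "0" and "1" are not syntactically valid R names,
# so make.names() would have to alter them (to "X0" and "X1").
bad_levels <- levels(as.factor(response))
names_changed <- make.names(bad_levels) != bad_levels

# Relabel while building the factor; "Negative"/"Positive" are illustrative labels.
x$V10 <- factor(response, levels = c(0, 1), labels = c("Negative", "Positive"))

# The new levels pass through make.names() unchanged.
levels_ok <- all(make.names(levels(x$V10)) == levels(x$V10))
```

With the response relabelled this way, the rfe call from the question (V10 ~ ., data = x, with the same ctrl and trctrl) can be retried unchanged.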
Answered by NicolaV, 20.03.2015