Cross Validation

"Trust your CV score in Kaggle competitions more than the public LB score."

  • Hold-out (a single train/validation split, e.g. 80/20)
  • K-fold (split the data into k folds; each fold serves once as the validation set)
  • Leave-one-out (K-fold taken to the extreme: k equals the number of samples)
  • Leave-p-out (hold out every possible subset of p samples)
  • Stratified K-fold (preserves class proportions in each fold; useful for imbalanced datasets)
  • Repeated K-fold (re-split the data randomly k times, e.g. repeated 80/20 splits; risky for imbalanced datasets)
  • Nested K-fold (inner loop tunes hyperparameters, outer loop estimates generalization; needs to be implemented manually, see the sketch after the figure below)
  • Time series CV (prevents look-ahead bias by validating only on data that comes after the training window; see the sketch after this list)
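
A minimal sketch comparing several of these splitters in scikit-learn, using a tiny synthetic dataset (the data and fold counts here are placeholders for illustration):

```python
import numpy as np
from sklearn.model_selection import (
    KFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit)

# Tiny synthetic dataset (an assumption for illustration):
# 10 samples, imbalanced binary labels (7 negatives, 3 positives).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

splitters = {
    "KFold": KFold(n_splits=3, shuffle=True, random_state=42),
    # Stratified: each fold keeps roughly the same 7:3 class ratio
    "StratifiedKFold": StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    # Leave-one-out: one sample per validation set, 10 splits total
    "LeaveOneOut": LeaveOneOut(),
    # Time series: training indices always precede validation indices
    "TimeSeriesSplit": TimeSeriesSplit(n_splits=3),
}

for name, cv in splitters.items():
    print(name)
    for train_idx, val_idx in cv.split(X, y):
        print("  train:", train_idx, "val:", val_idx)
```

Printing the indices makes the differences visible: the time-series splits never validate on data earlier than the training window, and the stratified splits keep both classes in every fold.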

Description of CV techniques

scikit-learn ships ready-made splitters for most of these schemes:

```python
from sklearn.model_selection import KFold, GroupKFold
```
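
A short sketch of how these two differ (the group ids below are made up for illustration): GroupKFold keeps every row that shares a group id on the same side of each split, which prevents leakage when rows are not independent, e.g. several rows per user.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # hypothetical group ids, e.g. one per user

# Plain K-fold may scatter one group's rows across train and validation,
# leaking information when rows within a group are correlated.
for train_idx, val_idx in KFold(n_splits=4).split(X):
    print("KFold      val:", val_idx)

# GroupKFold keeps all rows of a group together in every split.
for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups=groups):
    print("GroupKFold val:", val_idx)
```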

Nested Cross Validation (figure)
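
A minimal sketch of that manual composition in scikit-learn, assuming a placeholder dataset, estimator, and parameter grid: GridSearchCV runs the inner tuning loop, and cross_val_score wraps it in the outer evaluation loop, so the reported score is never computed on data used for tuning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # placeholder dataset for illustration

# Inner loop: hyperparameter search (estimator and grid are placeholders)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: estimate of the tuned model's generalization error
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("Nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```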