Sklearn Stratified K-Fold - Splitting Data & Saving to File

Python

This snippet reads a training set, splits it into k stratified folds (each with a training part and a validation part), and saves each part of every fold to a CSV file.

import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv('data/raw/train.csv')

# initialise a StratifiedKFold object with 5 folds and
# declare the column that we wish to stratify by, which in this
# case is the column called "label"
skf = StratifiedKFold(n_splits=5)
target = df.loc[:, 'label']

# for each fold split the data into train and validation
# sets and save the fold splits to csv
# (skf.split yields positional indices, so select rows with iloc;
# loc only works here while df keeps its default 0..n-1 index)
for fold_no, (train_index, val_index) in enumerate(skf.split(df, target), start=1):
    train = df.iloc[train_index, :]
    val = df.iloc[val_index, :]
    train.to_csv(f'data/processed/folds/train_fold_{fold_no}.csv')
    val.to_csv(f'data/processed/folds/val_fold_{fold_no}.csv')
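
As a quick sanity check on the splitting itself, the sketch below runs StratifiedKFold on a small made-up DataFrame and measures the share of the minority class in each validation fold. The `toy` DataFrame, its `feature` column, and the 80/20 class mix are assumptions for illustration only and are not part of the original snippet. The `shuffle=True` and `random_state=0` arguments are also added here and are optional.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 80 rows of class 0 and 20 rows of class 1.
toy = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 80 + [1] * 20,
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

val_positive_rates = []
seen_val_rows = set()
for train_index, val_index in skf.split(toy, toy["label"]):
    val = toy.iloc[val_index]
    # Share of class 1 in this validation fold.
    val_positive_rates.append(float(val["label"].mean()))
    # Track which rows have appeared in a validation fold so far.
    seen_val_rows.update(val_index.tolist())

print(val_positive_rates)
print(len(seen_val_rows))
```

Because both class counts divide evenly by five here, each validation fold should hold 16 rows of class 0 and 4 rows of class 1, and every row should land in exactly one validation fold. On real data the per-fold proportions will only be approximately equal.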