Sklearn Stratified K-Fold - Splitting Data & Saving to File
Python
This code snippet takes the training data, splits it into k stratified folds (each with a training set and a validation set), and saves every fold's two sets to CSV files.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv('data/raw/train.csv')

# initialise a StratifiedKFold object with 5 folds and
# declare the column that we wish to stratify by, which in this
# case is the column called "label"
skf = StratifiedKFold(n_splits=5)
target = df.loc[:, 'label']

# for each fold split the data into train and validation
# sets and save the fold splits to csv
# (split() yields positional indices, so rows are selected with iloc)
for fold_no, (train_index, val_index) in enumerate(skf.split(df, target), start=1):
    train = df.iloc[train_index, :]
    val = df.iloc[val_index, :]
    train.to_csv('data/processed/folds/train_fold_' + str(fold_no) + '.csv')
    val.to_csv('data/processed/folds/val_fold_' + str(fold_no) + '.csv')
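To check that the folds really are stratified, the same split can be run on a small in-memory frame. This is only a sketch: the toy data (80 rows of class 0, 20 rows of class 1) and the names `feature` and `val_positive_share` are made up for illustration and are not part of the snippet above.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data standing in for train.csv:
# 80 rows of class 0 and 20 rows of class 1.
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 80 + [1] * 20,
})

skf = StratifiedKFold(n_splits=5)

# Record the share of class 1 in each validation fold.
val_positive_share = []
for train_index, val_index in skf.split(df, df["label"]):
    val = df.iloc[val_index]
    val_positive_share.append(val["label"].mean())

print(val_positive_share)
```

With this toy data each validation fold should hold 16 rows of class 0 and 4 rows of class 1, so every share matches the overall class ratio of 0.2. On real data the shares will only be approximately equal, since class counts rarely divide evenly by the number of folds.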