Pandas Undersampling for Imbalanced Binary Classification
Python
An example of how to handle imbalanced data in Python with pandas, based on the Titanic dataset. Here we split the main DataFrame into separate survived and deceased DataFrames. The deceased DataFrame is the larger of the two, so we sample as many rows from it as there are in the survived DataFrame, which makes the two classes the same size. We then concatenate both DataFrames to create a balanced DataFrame. Note that undersampling discards rows from the majority class, so the balanced DataFrame is smaller than the original.
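import pandas as pd

# df is assumed to be the Titanic DataFrame with a binary 'survived' column
survived = df[df['survived'] == 1]
deceased = df[df['survived'] == 0]
# Downsample the majority class; random_state makes the sample reproducible
deceased = deceased.sample(n=len(survived), random_state=101)
df = pd.concat([survived, deceased], axis=0)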
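The same idea can be wrapped in a small helper that does not hard-code which class is larger. This is only a sketch under stated assumptions: the function name `undersample_majority`, the toy DataFrame, and its `fare` values are made up for illustration and are not part of the original snippet or of the Titanic data.

```python
import pandas as pd


def undersample_majority(df, label_col, random_state=101):
    """Hypothetical helper: downsample every class to the smallest class size."""
    # Size of the smallest class in the label column
    minority_size = df[label_col].value_counts().min()
    # Sample the same number of rows from each class
    parts = [
        group.sample(n=minority_size, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    balanced = pd.concat(parts, axis=0)
    # Shuffle so the classes are not stored in contiguous blocks
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data (invented values): 3 survived rows, 7 deceased rows
toy = pd.DataFrame({
    "survived": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "fare": [71.3, 53.1, 26.6, 7.3, 8.1, 8.5, 13.0, 7.9, 21.1, 16.7],
})

balanced = undersample_majority(toy, "survived")
print(balanced["survived"].value_counts())
```

Checking `value_counts()` before and after is a quick way to confirm that both classes end up with the same number of rows. If the data is also split into training and test sets, it is usually safer to undersample only the training portion so the test set keeps the original class ratio.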