Pandas Undersampling for Imbalanced Binary Classification

Python

An example of how to handle imbalanced data in Python. This is based on the titanic dataset. Here we split the main dataframe into separate survived and deceased dataframe. The deceased dataframe is the larger dataframe so we sample the same number of rows from this dataframe as there are in the survived dataframe to make them the same size. We then concat both data frames back together to create a dataframe that is balanced.

 1|  survived = df[df['survived']==1]
 2|  deceased = df[df['survived']==0]
 3|  deceased = deceased.sample(n=len(survived), random_state=101)
 4|  df = pd.concat([survived,deceased],axis=0)
Did you find this snippet useful?

Sign up for free to to add this to your code library