Stratified Sampling with Pandas: How to Create a Representative Sample by Category


A stratified sample is built by splitting the population into groups (strata) and sampling from each group separately, here so that every group is equally represented. For example, if we were sampling data about individuals, we might want equal representation of men and women, or of each age group.

In the example below, we want to create a sample from our df dataframe that contains equal representation of the three categories A, B and C.

import pandas as pd

# Create three strata, one for each category
stratum_A = df[df['category'] == 'A']
stratum_B = df[df['category'] == 'B']
stratum_C = df[df['category'] == 'C']

strata = [stratum_A, stratum_B, stratum_C]

# Loop through each stratum and sample 100 rows from it
# (each stratum must contain at least 100 rows)
samples = []
for stratum in strata:
    samples.append(stratum.sample(n=100, random_state=101))

# Combine the per-stratum samples into the stratified sample dataframe
stratified_sample = pd.concat(samples)
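
The same result can usually be reached without building each stratum by hand. The sketch below is one possible alternative, not part of the original snippet: it assumes pandas 1.1 or later (where groupby objects gained a sample method) and uses a made-up toy dataframe with a 'category' column and a hypothetical 'value' column so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data: an uneven mix of 600 A rows, 300 B rows and 100 C rows
df = pd.DataFrame({
    "category": ["A"] * 600 + ["B"] * 300 + ["C"] * 100,
    "value": range(1000),
})

# Group by category and draw 100 rows from every group in a single call
stratified_sample = df.groupby("category").sample(n=100, random_state=101)

# Each category should now appear the same number of times
print(stratified_sample["category"].value_counts())
```

As with the loop version, every category needs at least n rows unless sampling with replacement (replace=True) is acceptable.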