Evaluating Feature Importance Tutorial
A big part of analysing our models post training is whether the features we used for training actually helped in predicting the target and by how much. Tree based machine learning algorithms such as Random Forest and XGBoost come with a feature importance attribute that outputs an array containing a value between 0 and 100 for each feature representing how useful the model found each feature in trying to predict the target. This gives us the opportunity to analyse what contributed to the accuracy of the model and what features were just noise. With this information we can check that the model is working as we would expect, discard features if we feel they are not adding any value and use it to hypothesis about new features that we could engineer for another iteration of the model.
However, the problem with the feature importance attribute is that the output is an unlabelled, unordered array of values so looking at it in isolation won’t tell us much about our model. So we are going to take this array and create a function that plots the feature importance data on a labelled and ordered Seaborn bar chart that will give us a more intuitive understanding of which features our model has deemed useful.
IMPORT LIBRARIES
First let’s make sure we have imported all the required libraries. We are going to need Pandas, Numpy, Matplotlib and Seaborn. In this example we have already trained a Random Forest model using a data frame named “train_X” and named it “rf_model”.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
DECLARE PLOT FEATURE IMPORTANCE FUNCTION
Now to start with, we are going to declare the function “plot_feature_importance” and tell it what parameters we’re going to pass when calling. In this case we are going to pass in the feature importance values (importance), the feature names from training data (names) and also a string identifying the model type that we’ll use to title the bar chart.
def plot_feature_importance(importance,names,model_type):
CAST NUMPY ARRAYS
Next we are going to cast the feature importance and feature names as Numpy arrays. This allows us to construct a two column data frame from the two arrays.
def plot_feature_importance(importance,names,model_type):
#Create arrays from feature importance and feature names
feature_importance = np.array(importance)
feature_names = np.array(names)
CONSTRUCT DATA FRAME
To construct the data frame we will use a Dictionary containing the feature importance values and the feature names where the Dictionary key will be the column names. Once this has been created we can then sort the data frame by feature importance value giving us a labelled and ordered feature importance data frame.
def plot_feature_importance(importance,names,model_type):
#Create arrays from feature importance and feature names
feature_importance = np.array(importance)
feature_names = np.array(names)
#Create a DataFrame using a Dictionary
data={'feature_names':feature_names,'feature_importance':feature_importance}
fi_df = pd.DataFrame(data)
#Sort the DataFrame in order decreasing feature importance
fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True))
PLOT FEATURE IMPORTANCE BAR CHART
Finally we can use Matplotlib and Seaborn to plot the feature importance bar chart. Here we use the model_type parameter that we will pass to the function to give our plot it’s title.
def plot_feature_importance(importance,names,model_type):
#Create arrays from feature importance and feature names
feature_importance = np.array(importance)
feature_names = np.array(names)
#Create a DataFrame using a Dictionary
data={'feature_names':feature_names,'feature_importance':feature_importance}
fi_df = pd.DataFrame(data)
#Sort the DataFrame in order decreasing feature importance
fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True))
#Define size of bar plot
plt.figure(figsize=(10,8))
#Plot Searborn bar chart
sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
#Add chart labels
plt.title(model_type + 'FEATURE IMPORTANCE')
plt.xlabel('FEATURE IMPORTANCE')
plt.ylabel('FEATURE NAMES')
RANDOM FOREST FEATURE IMPORTANCE PLOT
Now we have created the function it’s time to call it, passing the feature importance attribute array from the model, the feature names from our training dataset and also declaring the type of model for the title.
plot_feature_importance(rf_model.feature_importances_,train.columns,'RANDOM FOREST')
We can also use the function with other algorithms that include a feature importance attribute.
XGBOOST FEATURE IMPORTANCE PLOT
plot_feature_importance(xgb_model.feature_importances_,train.columns,'XG BOOST')
CATBOOST FEATURE IMPORTANCE PLOT
plot_feature_importance(cb_model.get_feature_importance(),train.columns,'CATBOOST')
You can find the code snippet associated with this article here.
Originally published on AnalyseUp.com