Remove Stop Words from Text in DataFrame Column
Python
This code snippet shows how to remove stop words such as "the" and "at" from a text column in a Pandas DataFrame. Removing them is a common early cleaning step before transforming text data into a bag of words for NLP modelling.
Here we have a DataFrame with a column named "tweet" that contains tweet text. We use the Pandas apply method with a lambda function and a list comprehension to drop every word that appears in the English stop word list from the NLTK library. The comparison is case-sensitive and the NLTK list is lowercase, so a capitalised word such as "The" is kept unless the text is lowercased first.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
df['tweet'] = df['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
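As a minimal sketch of the same filtering step, the helper below matches words case-insensitively and looks them up in a set rather than a list. The function name `remove_stop_words` and the small `STOP_WORDS` set are illustrative assumptions, not part of the snippet above; the hand-picked set only stands in for the NLTK list so the example runs without downloading the corpus.

```python
# Hypothetical stand-in for the NLTK list, so the sketch runs without a download.
# With NLTK installed, set(stopwords.words('english')) could be used instead.
STOP_WORDS = {"the", "at", "a", "is", "on", "in"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


tweets = ["The cat is at the door", "Rain in London"]
cleaned = [remove_stop_words(t) for t in tweets]
print(cleaned)
```

With a DataFrame, the same helper could be passed directly to apply, for example df['tweet'].apply(remove_stop_words). Using a set keeps each membership check constant-time, which may matter on large columns.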