Remove Stop Words from Text in DataFrame Column

Python

This snippet shows how to remove stop words such as "the" and "at" from a text column in a Pandas DataFrame. Removing them is an important early cleaning step before transforming text data into a bag of words for NLP modelling.

Here we have a DataFrame with a column named "tweet" that contains tweet text. We use the Pandas apply method with a lambda function and a list comprehension to drop every word that appears in the English stop word list from the NLTK library.

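    import nltk
    from nltk.corpus import stopwords

    # Download the stop word list (only needed once per environment)
    nltk.download('stopwords')

    # A set makes each membership test fast
    stop_words = set(stopwords.words('english'))

    # Rebuild each tweet from the words that are not stop words
    df['tweet'] = df['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))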
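The NLTK English list is all lowercase, so a case-sensitive check leaves words like "The" in place, and calling split() on a missing value raises an error. The following is a minimal sketch of one way to handle both cases, not a definitive implementation. The small stop word set, the sample DataFrame, the remove_stop_words helper and the clean_tweet column are all hypothetical stand-ins chosen so the example runs without downloading anything; in practice you would pass set(stopwords.words('english')) instead.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english')),
# used here so the sketch runs without an NLTK download.
stop_words = {"the", "at", "a", "is", "on", "in", "and"}


def remove_stop_words(text, stop_words):
    """Drop words whose lowercase form is a stop word; keep the casing of the rest."""
    return " ".join(word for word in text.split() if word.lower() not in stop_words)


# Hypothetical sample data, including a missing value.
df = pd.DataFrame({"tweet": ["The cat sat at the door", "Pandas is great", None]})

# fillna("") replaces missing values with empty strings so split() never fails.
df["clean_tweet"] = df["tweet"].fillna("").apply(
    lambda text: remove_stop_words(text, stop_words)
)

print(df["clean_tweet"].tolist())
```

Writing the result to a new column keeps the original text available for comparison; assign back to df['tweet'] instead if you want to overwrite it as in the snippet above.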