How to Check if Text in a Dataframe Column is English

Python

To check whether the text in a pandas DataFrame column is English, we use a function that calculates the proportion of English words in each row's text.

We then filter the DataFrame to keep only rows where more than 75% of the words are English.

    import nltk
    from nltk.corpus import words

    nltk.download('words')

    # Build the vocabulary once as a set so each lookup is fast
    english_vocab = set(words.words())

    def get_english_word_rate(row):
        row_words = row.text.lower().split()
        # Avoid dividing by zero when the text is empty
        if not row_words:
            return 0.0
        english_words = sum(1 for w in row_words if w in english_vocab)
        return english_words / len(row_words)

    # df is assumed to be an existing DataFrame with a 'text' column
    df['english_word_rate'] = df.apply(get_english_word_rate, axis=1)

    # Keep only rows where more than 75% of the words are English
    df = df[df['english_word_rate'] > 0.75]
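Splitting on whitespace leaves punctuation attached to words, so a token such as "mat." would not match the dictionary entry "mat". The following is a minimal sketch of one way to handle that, not part of the original snippet: it extracts runs of letters with a regular expression before counting matches. The function name `english_word_rate` and the tiny `SAMPLE_VOCAB` set are hypothetical stand-ins chosen so the example runs without downloading anything; in practice the vocabulary would likely be a set built from `words.words()`.

```python
import re

# Hypothetical stand-in vocabulary; in practice use a set built from NLTK's words.words()
SAMPLE_VOCAB = {"the", "cat", "sat", "on", "mat", "hello", "world"}

def english_word_rate(text, vocabulary):
    """Return the fraction of alphabetic tokens in `text` found in `vocabulary`."""
    # Extract runs of ASCII letters so punctuation such as "mat." does not block a match
    tokens = re.findall(r"[a-z]+", text.lower())
    # Text with no alphabetic tokens gets a rate of 0.0 instead of raising ZeroDivisionError
    if not tokens:
        return 0.0
    matches = sum(1 for token in tokens if token in vocabulary)
    return matches / len(tokens)

print(english_word_rate("The cat sat on the mat.", SAMPLE_VOCAB))   # 1.0
print(english_word_rate("Le chat est sur le tapis", SAMPLE_VOCAB))  # 0.0
```

Because this version takes a string rather than a row, it could be applied to a single column, for example `df['text'].apply(english_word_rate, vocabulary=set(words.words()))`. Note that the `[a-z]+` pattern also splits contractions ("don't" becomes "don" and "t") and ignores accented letters, so the rate is only a rough signal.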