How to Check if Text in a Dataframe Column is English
Python
To check whether the text in a pandas DataFrame column is English, we use a function that calculates the fraction of words in each row's text that appear in the NLTK English word list.
We then filter the DataFrame to keep only the rows where more than 75% of the words are English.
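import nltk
from nltk.corpus import words

nltk.download('words')

# Build the vocabulary once as a set; calling words.words() inside the loop
# re-scans the full word list for every single word.
english_vocab = set(words.words())

def get_english_word_rate(row):
    row_words = row.text.lower().split()
    word_count = len(row_words)
    if word_count == 0:
        return 0.0  # empty text: avoid dividing by zero
    english_words = sum(1 for w in row_words if w in english_vocab)
    return english_words / word_count

# Assumes df is an existing DataFrame with a 'text' column
df['english_word_rate'] = df.apply(get_english_word_rate, axis=1)

# Keep only rows where more than 75% of the words are English
df = df[df['english_word_rate'] > 0.75]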
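Splitting on whitespace alone leaves punctuation attached to words, so a token like "mat." will not match the word list. The following is a minimal sketch of one way to handle that, not part of the original snippet. The function name `english_word_rate`, the `vocabulary` parameter, and the tiny `sample_vocab` set are all hypothetical and used only for illustration.

```python
import string

def english_word_rate(text, vocabulary):
    """Return the fraction of tokens in `text` that appear in `vocabulary`."""
    # Strip leading/trailing punctuation so "mat." is compared as "mat"
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    # Drop tokens that consisted only of punctuation
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0  # no usable words: avoid dividing by zero
    matches = sum(1 for t in tokens if t in vocabulary)
    return matches / len(tokens)

# Tiny stand-in vocabulary for illustration only (an assumption, not NLTK data)
sample_vocab = {"the", "cat", "sat", "on", "mat"}
rate = english_word_rate("The cat sat on the mat.", sample_vocab)
```

In practice the vocabulary would be a set built from the NLTK word list, and the function could be applied to the column with something like `df['text'].apply(lambda t: english_word_rate(t, vocab))`, where `vocab` is that set.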