How to Check if Text in a Dataframe Column is English
To check whether the text in a pandas DataFrame column is English, we use a function that calculates, for each row, the fraction of words in the text that appear in the NLTK `words` corpus.
We then filter the DataFrame to keep only the rows where more than 75% of the words are English.
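import nltk
nltk.download('words')
from nltk.corpus import words

# Build the vocabulary once as a set. words.words() returns a list, so
# calling it inside the loop makes every membership test a linear scan.
english_vocab = set(words.words())

def get_english_word_rate(row):
    # Lowercase the text and split it on whitespace.
    row_words = row.text.lower().split()
    word_count = len(row_words)
    # Empty text has no words; return 0 instead of dividing by zero.
    if word_count == 0:
        return 0.0
    english_words = sum(1 for w in row_words if w in english_vocab)
    return english_words / word_count

# Score each row, then keep rows where more than 75% of the words are English.
df['english_word_rate'] = df.apply(get_english_word_rate, axis=1)
df = df[df['english_word_rate'] > 0.75]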
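One limitation of splitting on whitespace is that tokens such as "hello," keep their punctuation and will not match a dictionary entry. The following is a minimal sketch of a variant that strips leading and trailing punctuation before the lookup. The names `english_word_rate`, `is_mostly_english`, and `SAMPLE_VOCAB` are hypothetical and not part of the original snippet, and the tiny vocabulary is only a stand-in so the example runs without downloading the NLTK corpus.

```python
import string

# Hypothetical stand-in vocabulary so the example runs without downloads.
# In practice this could be set(words.words()) from nltk.corpus.
SAMPLE_VOCAB = {"the", "cat", "sat", "on", "mat", "hello", "world"}


def english_word_rate(text, vocabulary):
    """Return the fraction of whitespace-separated tokens found in vocabulary."""
    # Lowercase, split on whitespace, and strip punctuation from both ends.
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    # Drop tokens that consisted only of punctuation.
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in vocabulary)
    return hits / len(tokens)


def is_mostly_english(text, vocabulary, threshold=0.75):
    """Return True when the English word rate is strictly above threshold."""
    return english_word_rate(text, vocabulary) > threshold
```

With a DataFrame that has a `text` column, this could be applied as `df['text'].apply(lambda t: english_word_rate(t, vocab))`, where `vocab` is assumed to be a set such as `set(words.words())`.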
By detro - Last Updated March 25, 2022, 11:27 p.m.