How to Check if Text in a Dataframe Column is English

Python
NLP

To check whether the text in a pandas DataFrame column is English, we use a function that calculates the fraction of English words in each row's text.

We then filter the DataFrame to keep only the rows where more than 75% of the words are English.

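import nltk
from nltk.corpus import words

nltk.download('words')

# Build the vocabulary once as a set. Calling words.words() inside the loop
# re-reads the word list and scans it linearly for every single word.
english_vocab = set(w.lower() for w in words.words())

def get_english_word_rate(row):
    # Lowercase the text and split it on whitespace
    row_words = row.text.lower().split()
    word_count = len(row_words)
    # Empty text has no words; return 0 to avoid dividing by zero
    if word_count == 0:
        return 0.0
    english_words = 0
    for w in row_words:
        if w in english_vocab:
            english_words += 1
    return english_words / word_count

# Assumes df has a string column named 'text'
df['english_word_rate'] = df.apply(get_english_word_rate, axis=1)

# Keep only the rows where more than 75% of the words are English
df = df[df['english_word_rate'] > 0.75]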
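
Splitting on whitespace leaves punctuation attached to words, so a token like "mat." will not match a dictionary entry. Below is a minimal sketch of the same rate calculation that strips surrounding punctuation first and returns 0 for empty text. The function name `english_word_rate`, its `vocabulary` parameter, and the tiny `sample_vocabulary` are assumptions made for illustration and are not part of the snippet above.

```python
import string

def english_word_rate(text, vocabulary):
    """Return the fraction of tokens in `text` that appear in `vocabulary`."""
    # Lowercase, split on whitespace, then strip punctuation from both ends
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    # Drop tokens that consisted only of punctuation
    tokens = [t for t in tokens if t]
    if not tokens:
        # No words at all: report 0 rather than dividing by zero
        return 0.0
    hits = sum(1 for t in tokens if t in vocabulary)
    return hits / len(tokens)

# Hypothetical toy vocabulary, used here only to keep the example self-contained
sample_vocabulary = {"the", "cat", "sat", "on", "mat"}

print(english_word_rate("The cat sat on the mat.", sample_vocabulary))  # 1.0
```

With NLTK, the vocabulary could be built once as `set(w.lower() for w in words.words())` and the function applied to the column with something like `df['text'].apply(english_word_rate, vocabulary=english_vocab)`.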

By detro - Last Updated March 25, 2022, 11:27 p.m.