How to Return the Most Frequent Bigrams from Text Using NLTK

Python

This snippet returns the top-ranked bigram, scored by pointwise mutual information (PMI), among the bigrams that appear at least twice in the string variable text.

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_assoc_measures = BigramAssocMeasures()

text = 'One Two One Two Three Four Five Six'

#1. Split text into words
text = text.split()

#2. Set the minimum number of times a bigram must appear
#and how many of the remaining bigrams to return
minimum_number_of_bigrams = 2
top_bigrams_to_return = 1

#3. Get bigrams contained in text variable
finder = BigramCollocationFinder.from_words(text)

#4. Filter bigrams to those that appear at least twice
finder.apply_freq_filter(minimum_number_of_bigrams)

#5. Return the top bigram, ranked by PMI
print(finder.nbest(bigram_assoc_measures.pmi, top_bigrams_to_return))
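
To see what the frequency filter is doing, the same counting can be sketched with only the standard library. This is a minimal illustration, not NLTK's implementation: the function name most_frequent_bigrams and its parameters min_count and top_n are hypothetical, and it ranks by raw count rather than by PMI.

```python
from collections import Counter


def most_frequent_bigrams(words, min_count=2, top_n=1):
    """Return up to top_n (bigram, count) pairs seen at least min_count times.

    Hypothetical helper for illustration; ranks by raw count, not PMI.
    """
    # Pair each word with its successor to form adjacent bigrams.
    counts = Counter(zip(words, words[1:]))
    # Keep only bigrams that meet the minimum count, most frequent first.
    frequent = [(bigram, n) for bigram, n in counts.most_common() if n >= min_count]
    return frequent[:top_n]


words = 'One Two One Two Three Four Five Six'.split()
result = most_frequent_bigrams(words)
print(result)  # [(('One', 'Two'), 2)]
```

With this sample text only ('One', 'Two') occurs twice, so it is the only bigram that survives a minimum count of 2.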