Python Text Processing Useful Resources

Python Text Processing - Frequency Distribution



Counting the frequency of occurrence of a word in a body of text is often needed during text processing. This can be achieved by applying the word_tokenize() function and appending the result to a list to keep count of the words as shown in the below program.

Example - Getting Frequencies of Words

from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")

token = word_tokenize(sample)
wlist = []

for i in range(50):
    wlist.append(token[i])

wordfreq = [wlist.count(w) for w in wlist]
print("Pairs\n" + str(zip(token, wordfreq)))

Output

When we run the above program, we get the following output −

[([', 1), (Poems', 1), (by', 1), (William', 1), (Blake', 1)
...]

Conditional Frequency Distribution

Conditional Frequency Distribution is used when we want to count words meeting specific crteria satisfying a set of text.

main.py

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))
categories = ['hobbies', 'romance','humor']
searchwords = [ 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=categories, samples=searchwords)

Output

When we run the above program, we get the following output −

          may might  must  will 
hobbies   131    22    83   264 
romance    11    51    45    43 
  humor     8     8     9    13 
Advertisements