Python Text Processing Useful Resources

Selected Reading

Python Text Processing - Text Classification

Quiz

Many times, we need to categorise the available text into various categories by some pre-defined criteria. nltk provides such feature as part of various corpora. In the below example we look at the movie review corpus and check the categorization available.

Example - Categorising Data

main.py

# Lets See how the movies are classified
from nltk.corpus import movie_reviews

all_cats = []
for w in movie_reviews.categories():
    all_cats.append(w.lower())
print(all_cats)

Output

When we run the above program, we get the following output −

['neg', 'pos']

Example - Tokenizing Data

Now let's look at the content of one of the files with a positive review. The sentences in this file are tokenized and we print the first four sentences to see the sample.

main.py

from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize
fields = movie_reviews.fileids()

sample = movie_reviews.raw("pos/cv944_13521.txt")

token = sent_tokenize(sample)
for lines in range(4):
    print(token[lines])

Output

When we run the above program we get the following output −

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade 
with hollywood churning out films 
like deep impact , = godzilla , the x-files , armageddon , the truman show , 
all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film 
releases from the = spielberg-katzenberg-geffen's dreamworks production company .

Example - Tokenizing words

Next, we tokenize the words in each of these files and find the most common words by using the FreqDist function from nltk.

main.py

import nltk
from nltk.corpus import movie_reviews
fields = movie_reviews.fileids()

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(10))

Output

When we run the above program we get the following output −

[(,', 77717), (the', 76529), (.', 65876), (a', 38106), (and', 35576), 
(of', 34123), (to', 31937), (u"'", 30585), (is', 25195), (in', 21822)]

Previous Quiz Next