
Python Text Processing - Chunk Classification



Classification-based chunking assigns labels to groups of words (chunks), such as noun phrases, rather than to individual words. A simple scenario is tagging the phrases within sentences. To demonstrate the classification, we use the conll2000 corpus, which contains data from the Wall Street Journal (WSJ) corpus used for noun phrase-based chunking.
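
To make the idea concrete, here is a minimal sketch of how per-word chunk labels can be grouped into phrases. It assumes the IOB labelling scheme used by CoNLL-2000 (B- begins a chunk, I- continues it, O is outside any chunk). The sample triples and the helper name group_chunks are written by hand for illustration; they are not read from the corpus and are not part of NLTK.

```python
# Hand-written (word, POS tag, IOB chunk tag) triples in the CoNLL-2000 style.
# These are illustrative assumptions, not loaded from the corpus.
sample = [
    ("Confidence", "NN", "B-NP"),
    ("in", "IN", "B-PP"),
    ("the", "DT", "B-NP"),
    ("pound", "NN", "I-NP"),
    ("is", "VBZ", "B-VP"),
    ("widely", "RB", "I-VP"),
    ("expected", "VBN", "I-VP"),
    (".", ".", "O"),
]


def group_chunks(triples):
    """Group IOB-tagged words into (chunk_type, [words]) pairs."""
    chunks = []
    current_type, current_words = None, []
    for word, _pos, iob in triples:
        prefix, _, chunk_type = iob.partition("-")
        # An I- tag extends the open chunk only if the chunk types match.
        if prefix == "I" and chunk_type == current_type:
            current_words.append(word)
            continue
        # Any other tag closes the open chunk, if there is one.
        if current_words:
            chunks.append((current_type, current_words))
        if prefix == "O":
            current_type, current_words = None, []
        else:
            # B- (or a stray I-) starts a new chunk.
            current_type, current_words = chunk_type, [word]
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


chunks = group_chunks(sample)
for chunk_type, words in chunks:
    print(chunk_type, " ".join(words))
```

Each printed line is one chunk: the chunk type followed by the words it covers. A classification-based chunker predicts the per-word IOB labels, and a grouping step like this one turns them back into phrases.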

First, we add the corpus to our environment using the following command.

>>> import nltk
>>> nltk.download('conll2000')

Let's have a look at the first few sentences in this corpus.

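from nltk.corpus import conll2000

# Load the corpus as a sequence of tokenized sentences
sentences = conll2000.sents()

# Print the first three sentences, with a blank line after each
for i in range(3):
   print(sentences[i])
   print()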

Output

When we run the above program we get the following output −

['Confidence', 'in', 'the', 'pound', 'is', 'widely',...]
['Chancellor', 'of', 'the', 'Exchequer', 'Nigel', 'Lawson', ...]
['But', 'analysts', 'reckon', 'underlying', 'support', 'for', ...]

Next, we use the function tagged_sents() to get the same sentences with each word paired with its part-of-speech tag.

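from nltk.corpus import conll2000

# Load the sentences as lists of (word, part-of-speech tag) pairs
tagged_sentences = conll2000.tagged_sents()

# Print the first three tagged sentences, with a blank line after each
for i in range(3):
   print(tagged_sentences[i])
   print()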

Output

When we run the above program we get the following output −

[('Confidence', 'NN'), ('in', 'IN'), ...]
[('Chancellor', 'NNP'), ('of', 'IN'), ...]
[('But', 'CC'), ('analysts', 'NNS'), ...]
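
The (word, tag) pairs above are the raw input a classification-based chunker works from: for each word, it predicts a chunk label from features of that word and its neighbours. The sketch below shows one possible feature extractor. The function name chunk_features, the feature names, and the <START>/<END> padding values are assumptions made for illustration; they are not part of NLTK or the corpus.

```python
def chunk_features(tagged_sent, i):
    """Build a feature dict for the word at position i of a tagged sentence."""
    word, pos = tagged_sent[i]
    # Padding markers at the sentence boundaries (an assumed convention).
    prev_pos = tagged_sent[i - 1][1] if i > 0 else "<START>"
    next_pos = tagged_sent[i + 1][1] if i < len(tagged_sent) - 1 else "<END>"
    return {
        "word": word.lower(),
        "pos": pos,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
    }


# The first two pairs of the first tagged sentence, as printed above.
tagged = [("Confidence", "NN"), ("in", "IN")]

features = [chunk_features(tagged, i) for i in range(len(tagged))]
for feature_dict in features:
    print(feature_dict)
```

Feature dicts like these could be paired with per-word chunk labels and passed to a classifier trainer such as nltk.NaiveBayesClassifier.train, though the right feature set depends on the data and is a matter of experimentation.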