Python Text Processing - Quick Guide

Python Text Processing - Introduction

Text processing has a direct application to Natural Language Processing (NLP). NLP deals with the languages that humans speak or write when they communicate with one another. This differs from communication between a human and a computer, where the input is either a program written by a human or a gesture such as clicking the mouse at some position. NLP tries to understand natural language, classify it, analyse it and, if required, respond to it. Python has a rich set of libraries which cater to the needs of NLP. The Natural Language Toolkit (NLTK) is a suite of such libraries which provides the functionality required for NLP.

Below are some applications that use NLP and for which Python's NLTK can be used.

Summarization

Many times, we need a summary of a news article, a movie plot or a big story. They are all written in human language, and without NLP we have to rely on another human's interpretation and presentation of such a summary. With the help of NLP we can write programs that use NLTK to summarize a long text according to various parameters, such as the percentage of text we want in the final output or the choice of positive and negative words for the summary. Online news feeds rely on such summarization techniques to present news insights.

Voice Based Tools

Voice-based tools like Apple's Siri or Amazon Alexa rely on NLP to understand the interaction made with humans. They have a large training data set of words, sentences and grammar to interpret the question or command coming from a human and process it. Though the input is voice, it is translated to text, and the resulting text from the voice is taken through the NLP system to produce the result.

Information Extraction

Web scraping is a common example of extracting data from web pages using Python code. It may not be strictly NLP based, but it does involve text processing. For example, if we need to extract only the headers present in an HTML page, then we look for the h1 tag in the page structure and find a way to extract only the text between those tags. This needs a text processing program written in Python.
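
The h1 extraction described above can be sketched with the standard library alone. This is a minimal illustration, not a full scraper: the class name H1Extractor and the sample page string are assumptions made for this example.

```python
from html.parser import HTMLParser


class H1Extractor(HTMLParser):
    """Collect the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Only keep text that appears between <h1> and </h1>
        if self.in_h1:
            self.headings[-1] += data


# Hypothetical page content used only for illustration
page = "<html><body><h1>Text Processing</h1><p>Intro</p><h1>NLTK</h1></body></html>"

parser = H1Extractor()
parser.feed(page)
print(parser.headings)
```

For real pages the HTML would first have to be downloaded, and a dedicated parsing library may be more convenient than this hand-written class.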

Spam Filtering

Spam in emails can be identified and eliminated by analysing the text in the subject line as well as in the content of the message. As spam emails are usually sent in bulk to many recipients, their subjects and contents vary only a little, so they can be matched and tagged as spam. Here again, libraries such as NLTK can be used.

Language Translation

Computerized language translation relies heavily on NLP. As more and more languages are used in the online platform, it becomes a necessity to automate the translation from one human language to another. This will involve programming to handle the vocabulary, grammar and context tagging of the languages involved in translation. Again, NLTK is used to handle such requirements.

Sentiment Analysis

To find out the overall reaction to a movie, we may have to read thousands of feedback posts from the audience. But that too can be automated by classifying feedback as positive or negative through analysis of words and sentences, and then measuring the frequency of positive and negative reviews to find the overall sentiment of the audience. This needs analysis of the human language written by the audience, and NLTK is used heavily here for processing the text.

Python Text Processing - Environment Setup

To successfully create and run the example code in this tutorial we need an environment that has both general-purpose Python and the special packages required for Data science. We will first look at installing general-purpose Python. The examples in this tutorial use Python 3 syntax, such as the print() function and str.maketrans(), so we will use Python 3 throughout.

Getting Python

The most up-to-date and current source code, binaries, documentation, news, etc., is available on the official website of Python https://www.python.org/

You can download Python documentation from https://www.python.org/doc/. The documentation is available in HTML, PDF, and PostScript formats.

Installing Python

Python distribution is available for a wide variety of platforms. You need to download only the binary code applicable for your platform and install Python.

If the binary code for your platform is not available, you need a C compiler to compile the source code manually. Compiling the source code offers more flexibility in terms of choice of features that you require in your installation.

Here is a quick overview of installing Python on various platforms −

Unix and Linux Installation

Here are the simple steps to install Python on Unix/Linux machine.

Open a Web browser and go to https://www.python.org/downloads/.
Follow the link to download the zipped source code available for Unix/Linux.
Download and extract the files.
Edit the Modules/Setup file if you want to customize some options.
Run the ./configure script.
Run make.
Run make install.

This installs Python at standard location /usr/local/bin and its libraries at /usr/local/lib/pythonXX where XX is the version of Python.

Windows Installation

Here are the steps to install Python on Windows machine.

Open a Web browser and go to https://www.python.org/downloads/.
Follow the link for the Windows installer python-XYZ.msi file where XYZ is the version you need to install.
To use this installer python-XYZ.msi, the Windows system must support Microsoft Installer 2.0. Save the installer file to your local machine and then run it to find out if your machine supports MSI.
Run the downloaded file. This brings up the Python install wizard, which is really easy to use. Just accept the default settings, wait until the install is finished, and you are done.

Macintosh Installation

Recent Macs come with Python installed, but it may be several years out of date. See http://www.python.org/download/mac/ for instructions on getting the current version along with extra tools to support development on the Mac. For older Mac OS's before Mac OS X 10.3 (released in 2003), MacPython is available.

Jack Jansen maintains it and you can have full access to the entire documentation at his website − https://homepages.cwi.nl/~jack/macpython/index.html. You can find complete installation details for Mac OS installation.

Setting up PATH

Programs and other executable files can be in many directories, so operating systems provide a search path that lists the directories that the OS searches for executables.

The path is stored in an environment variable, which is a named string maintained by the operating system. This variable contains information available to the command shell and other programs.

The path variable is named as PATH in Unix or Path in Windows (Unix is case sensitive; Windows is not).

In Mac OS, the installer handles the path details. To invoke the Python interpreter from any particular directory, you must add the Python directory to your path.

Setting path at Unix/Linux

To add the Python directory to the path for a particular session in Unix −

In the csh shell − type setenv PATH "$PATH:/usr/local/bin/python" and press Enter.
In the bash shell (Linux) − type export PATH="$PATH:/usr/local/bin/python" and press Enter.
In the sh or ksh shell − type PATH="$PATH:/usr/local/bin/python" and press Enter.
Note − /usr/local/bin/python is the path of the Python directory

Setting path at Windows

To add the Python directory to the path for a particular session in Windows −

At the command prompt − type path %path%;C:\Python and press Enter.

Note − C:\Python is the path of the Python directory
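
To check the effect of these settings, the search path can also be inspected from inside Python. The sketch below only reads the environment; the executable names python3 and python are assumptions about what is installed on the machine.

```python
import os
import shutil

# Split PATH into its individual directories using the platform separator
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
for directory in path_dirs:
    print(directory)

# shutil.which returns the full path of an executable found on PATH, or None
print(shutil.which("python3") or shutil.which("python"))
```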

Python Environment Variables

Here are important environment variables, which can be recognized by Python −

Sr.No.	Variable	Description
1	PYTHONPATH	It has a role similar to PATH. This variable tells the Python interpreter where to locate the module files imported into a program. It should include the Python source library directory and the directories containing Python source code. PYTHONPATH is sometimes preset by the Python installer.
2	PYTHONSTARTUP	It contains the path of an initialization file containing Python source code. It is executed every time you start the interpreter. It is named as .pythonrc.py in Unix and it contains commands that load utilities or modify PYTHONPATH.
3	PYTHONCASEOK	It is used in Windows to instruct Python to find the first case-insensitive match in an import statement. Set this variable to any value to activate it.
4	PYTHONHOME	It is an alternative module search path. It is usually embedded in the PYTHONSTARTUP or PYTHONPATH directories to make switching module libraries easy.
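
As a quick check, the following sketch shows how a running interpreter sees PYTHONPATH and which directories it actually searches for imports. It only prints values and assumes nothing about how the variables are set on your machine.

```python
import os
import sys

# PYTHONPATH as seen by this process (None when the variable is not set)
print(os.environ.get("PYTHONPATH"))

# Directories that the interpreter searches when importing modules
for entry in sys.path:
    print(entry)
```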

Running Python

There are three different ways to start Python −

Interactive Interpreter

You can start Python from Unix, DOS, or any other system that provides you a command-line interpreter or shell window.

Enter python at the command line (the command is often python3 on Unix/Linux and py on Windows).

Start coding right away in the interactive interpreter.

$ python3 # Unix/Linux
or
% python3 # Unix/Linux (csh)
or
C:\> py # Windows/DOS

Here is a list of commonly used command line options −

Sr.No.	Option	Description
1	-d	It provides debug output.
2	-O	It generates optimized bytecode (assert statements are removed).
3	-S	Do not run import site to look for Python paths on startup.
4	-v	Verbose output (detailed trace on import statements).
5	-X opt	Sets an implementation-specific option; for example, -X dev enables development mode in Python 3.
6	-c cmd	Run the Python script sent in as the cmd string.
7	file	Run the Python script from the given file.

Script from the Command-line

A Python script can be executed at command line by invoking the interpreter on your application, as in the following −

$ python3 script.py # Unix/Linux

or

% python3 script.py # Unix/Linux (csh)

or

C:\> py script.py # Windows/DOS

Note − Be sure the file permission mode allows execution.

Integrated Development Environment

You can run Python from a Graphical User Interface (GUI) environment as well, if you have a GUI application on your system that supports Python.

Unix − IDLE is the very first Unix IDE for Python.
Windows − PythonWin is the first Windows interface for Python and is an IDE with a GUI.
Macintosh − The Macintosh version of Python along with the IDLE IDE is available from the main website, downloadable as either MacBinary or BinHex'd files.

Installing NLTK Pack

NLTK is very straightforward to integrate into the Python environment. Use the below command to add NLTK to the environment.

sudo pip install -U nltk

The addition of other libraries will be discussed in each chapter as and when the Python programs need them.

Python Text Processing - String Immutability

In Python, the string data type is immutable, which means a string value cannot be updated in place. We can verify this by trying to update a part of the string, which leads to an error.

Checking Immutability of a String

main.py

# Can not reassign 
t= "Tutorialspoint"
print(type(t))
t[0] = "M"

Output

When we run the above program, we get the following output −

<class 'str'>

Warnings/Errors:
Traceback (most recent call last):
  File "/home/cg/root/31c1433c/main.py", line 4, in <module>
    t[0] = "M"
    ~^^^
TypeError: 'str' object does not support item assignment
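
Since item assignment is not allowed, the usual way to "change" a string is to build a new one. The following sketch shows two common options, slicing with concatenation and str.replace(); the variable names m and r are chosen only for this example.

```python
t = "Tutorialspoint"

# Slicing and concatenation create a new string object
m = "M" + t[1:]
print(m)   # Mutorialspoint
print(t)   # Tutorialspoint (the original is unchanged)

# str.replace also returns a new string; 1 limits it to the first match
r = t.replace("T", "M", 1)
print(r)   # Mutorialspoint
```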

Checking Memory Location of Letters of a String

We can further verify this by checking the memory location address of the position of the letters of the string.

main.py

x = 'banana'

for idx in range (0,5):
    print(x[idx], "=", id(x[idx]))

Output

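When we run the above program, we get the following output. As the output below shows, both occurrences of a refer to the same memory location, and so do both occurrences of n. This sharing is safe only because strings cannot be modified in place; the exact id values differ from run to run and are a detail of the CPython interpreter.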

b = 11817208
a = 11817160
n = 11817784
a = 11817160
n = 11817784

Python Text Processing - Sorting Lines

Many times, we need to sort the content of a file for analysis. For example, we may want the sentences written by different students arranged in the alphabetical order of their names. That involves sorting not just by the first character of each line but by all the characters, compared from left to right. In the below program we first read the lines from a file and print them; we then sort them using the list sort() method, which is built into Python.

Printing the File Content

main.py

fileName = "poem.txt"
# readlines() keeps the trailing newline of each line
with open(fileName) as f:
    data = f.readlines()
for line in data:
    print(line, end="")

Output

When we run the above program, we get the following output −

Summer is here.
Sky is bright.
Birds are gone.
Nests are empty.
Where is Rain?

Sorting Lines in the File

Now we apply the sort function before printing the content of the file. The lines get sorted alphabetically, compared character by character from the left.

main.py

fileName = "poem.txt"
with open(fileName) as f:
    data = f.readlines()
# sort() orders the list in place, comparing lines character by character
data.sort()
for line in data:
    print(line, end="")

Output

When we run the above program, we get the following output −

Birds are gone.
Nests are empty.
Sky is bright.
Summer is here.
Where is Rain?
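
The default sort compares characters by their code points, so uppercase letters come before lowercase ones. The sketch below uses an in-memory list (a variation on the poem with some lowercase first letters, made up for this example) to show how a key function gives a case-insensitive or a reversed order.

```python
lines = [
    "Summer is here.",
    "sky is bright.",
    "Birds are gone.",
    "nests are empty.",
    "Where is Rain?",
]

# Default sort: uppercase letters come before lowercase ones
print(sorted(lines))

# Case-insensitive sort using a key function
print(sorted(lines, key=str.casefold))

# Reverse alphabetical order, still ignoring case
print(sorted(lines, key=str.casefold, reverse=True))
```

Unlike data.sort(), the sorted() function returns a new list and leaves the original list untouched.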

Python Text Processing - Counting Tokens in Paragraphs

While reading the text from a source, sometimes we also need to find out some statistics about the type of words used. That makes it necessary to count the number of words as well as lines with a specific type of words in a given text. In the below example we show programs to count the words in a paragraph using two different approaches. We consider a text file for this purpose which contains the summary of a Hollywood movie.

Reading the File

main.py

fileName = "GodFather.txt"

with open(fileName, 'r') as file:
    lines_in_file = file.read()
    print(lines_in_file)

Output

When we run the above program we get the following output −

Vito Corleone is the aging don (head) of the Corleone Mafia Family. 
His youngest son Michael has returned from WWII just in time to see 
the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. 
...

Counting Words Using nltk

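Next we use the nltk module to count the words in the text. Please note that nltk treats punctuation marks as separate tokens, so '(head)' is counted as 3 tokens and not one. The word_tokenize function needs the Punkt tokenizer data, which can be downloaded once with nltk.download('punkt'); recent NLTK releases may ask for 'punkt_tab' instead.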

main.py

import nltk

fileName = "GodFather.txt"

with open(fileName, 'r') as file:
    lines_in_file = file.read()
    
    nltk_tokens = nltk.word_tokenize(lines_in_file)
    print(nltk_tokens)
    print("\n")
    print("Number of Words: " , len(nltk_tokens))

Output

When we run the above program we get the following output −

['Vito', 'Corleone', 'is', 'the', 'aging', 'don',
...
]

Number of Words:  167

Counting Words Using Split

Next we count the words using the split() method of the string. Here '(head)' is counted as a single word and not as 3 tokens as in the case of nltk.

fileName = "GodFather.txt"

with open(fileName, 'r') as file:
    lines_in_file = file.read()

    print(lines_in_file.split())
    print("\n")
    print("Number of Words: ", len(lines_in_file.split()))

Output

When we run the above program we get the following output −

['Vito', 'Corleone', 'is', 'the', 'aging', 'don', 
...
]

Number of Words:  146
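
The difference between the two counts comes from how punctuation is handled. The sketch below reproduces it on the first sentence of the file using only the standard library; the regular expression is a rough stand-in for a tokenizer (an assumption of this example), not what nltk does internally.

```python
import re
from collections import Counter

text = "Vito Corleone is the aging don (head) of the Corleone Mafia Family."

# Whitespace splitting keeps punctuation attached to the neighbouring word
split_tokens = text.split()

# Regex: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(split_tokens))
print(len(regex_tokens))

# Frequency of each lowercased word, ignoring the punctuation tokens
counts = Counter(tok.lower() for tok in regex_tokens if tok.isalnum())
print(counts.most_common(2))
```

Here '(head)' is one item in split_tokens but three items in regex_tokens, which mirrors the behaviour described above.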

Python Text Processing - Binary ASCII Conversion

The conversion between binary data and ASCII text is carried out by the built-in binascii module. Its usage is straightforward: each function takes the input data and returns the converted data. The below program shows the use of the binascii functions b2a_uu and a2b_uu. The uu stands for "UNIX-to-UNIX encoding" (uuencode), which represents binary data as printable ASCII characters; b2a_uu encodes a bytes object and a2b_uu decodes it again.

Binary ASCII Conversion

main.py

import binascii

text = b"Simply Easy Learning"

# Converting binary to ascii
data_b2a = binascii.b2a_uu(text)
print("**Binary to Ascii** \n")
print(data_b2a)

# Converting back from ascii to binary 
data_a2b = binascii.a2b_uu(data_b2a)
print("**Ascii to Binary** \n")
print(data_a2b)

Output

When we run the above program we get the following output −

**Binary to Ascii** 

b'44VEM<&QY($5A<WD@3&5A<FYI;F< \n'
**Ascii to Binary** 

b'Simply Easy Learning'
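
The binascii module offers other encodings that follow the same encode and decode pattern. As a brief sketch, the same text can be round-tripped through hexadecimal and Base64; these functions are shown as related examples and are not part of the uu conversion above.

```python
import binascii

text = b"Simply Easy Learning"

# Hexadecimal representation: two hex digits per byte
hex_data = binascii.hexlify(text)
print(hex_data)
print(binascii.unhexlify(hex_data))

# Base64 representation (b2a_base64 appends a newline by default)
b64_data = binascii.b2a_base64(text)
print(b64_data)
print(binascii.a2b_base64(b64_data))
```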

Python Text Processing - File as String

When a file is read with readlines(), its content is returned as a list in which each line is one element. So, we can access each line of the file using the index of the element. In the below example we have a file with multiple lines, and those lines become individual elements of the list.

Example - Reading a File line by line

main.py

with open("GodFather.txt", "r") as BigFile:
    data = BigFile.readlines()

# Print each line
for i in range(len(data)):
    print("Line No -", i)
    print(data[i])

When we run the above program, we get the following output −

Line No - 0
Vito Corleone is the aging don (head) of the Corleone Mafia Family. 
...

File as a String

But the entire file content can be read as a single string by removing the new line character and using the read function as shown below. In the result there are no multiple lines.

main.py

with open("GodFather.txt", 'r') as BigFile:
    data = BigFile.read().replace('\n', '')

# Verify the string type
print(type(data))

# Print the file content as a single string
print(data)

Output

When we run the above program, we get the following output −

<class 'str'>
Vito Corleone is the aging don (head) of the Corleone Mafia Family...

Python Text Processing - Backward File Reading

When we normally read a file, the contents are read line by line from the beginning of the file. But there may be scenarios where we want to read the last line first. For example, the file may have the latest records at the bottom, and we want to read the latest records first. To achieve this we install the file-read-backwards package by using the command below.

pip3 install file-read-backwards

Example - Reading File Line By Line

But before reading the file backwards, let's read the content of the file line by line so that we can compare the result after backward reading.

main.py

with open("GodFather.txt", "r") as BigFile:
    data = BigFile.readlines()

# Print each line
for i in range(len(data)):
    print("Line No- ", i)
    print(data[i])

Output

When we run the above program, we get the following output −

Line No-  0
Vito Corleone is the aging don (head) of the Corleone Mafia Family. 

Line No-  1
His youngest son Michael has returned from WWII just in time to ...

Example - Reading Lines Backward

Now to read the file backwards we use the installed module.

main.py

from file_read_backwards import FileReadBackwards

with FileReadBackwards("GodFather.txt", encoding="utf-8") as BigFile:
    # Read line by line, starting from the last line and moving up
    for line in BigFile:
        print(line)

Output

When we run the above program, we get the following output −

The Don barely survives, which leads his son Michael to begin a violent...

You can verify the lines have been read in a reverse order.
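
For small files the same effect can be had without an extra package, by loading all lines and iterating over them in reverse. The sketch below writes its own temporary sample file so that it is self-contained; unlike the package above, it holds the whole file in memory, so it is an alternative for modest file sizes only.

```python
import os
import tempfile

# Write a small sample file so that the sketch is self-contained
sample = "first line\nsecond line\nthird line\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
    path = tmp.name

# Load every line, then walk through the list from the end
with open(path, encoding="utf-8") as handle:
    backward = [line.rstrip("\n") for line in reversed(handle.readlines())]

print(backward)

# Remove the temporary file again
os.remove(path)
```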

Reading Words Backward

We can also read the words in the file backward. For this we first read the lines backwards and then tokenize the words in each line and apply the reverse() function. In the below example we have word tokens printed backwards from the same file using both the package and the nltk module.

main.py

import nltk
from file_read_backwards import FileReadBackwards

with FileReadBackwards("GodFather.txt", encoding="utf-8") as BigFile:
    # Read line by line, starting from the last line and moving up,
    # then tokenize each line and reverse the order of its tokens
    for line in BigFile:
        nltk_tokens = nltk.word_tokenize(line)
        nltk_tokens.reverse()
        print(nltk_tokens)

Output

When we run the above program we get the following output −

['.', 'apart', 'family', 'Corleone'..., 'The']
['.', 'men', 'hit', 'his', 'of', 'some', ...'This']
...

Python Text Processing - Filter Duplicate Words

Many times, we need to analyse the text only for the unique words present in it. So, we need to eliminate the duplicate words from the text. This is achieved by combining the word tokenization available in nltk with Python's built-in set type.

Example - Without preserving the order

In the below example we first tokenize the sentence into words. Then we apply set() function which creates an unordered collection of unique elements. The result has unique words which are not ordered.

main.py

from nltk.tokenize import word_tokenize
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour." 

# First Word tokenization
nltk_tokens = word_tokenize(word_data)

# Applying Set
no_order = list(set(nltk_tokens))

print(no_order)

Output

When we run the above program, we get the following output −

['is', 'Rainbow', 'ocean', 'the', 'has', 'The', 'Sky', '.', 'a', 'also', 'colour', 'blue']

Preserving the Order

To remove the duplicates while still preserving the order of the words in the sentence, we read the words one by one and append each word to a list only if it has not been seen before.

main.py

from nltk.tokenize import word_tokenize
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour." 

# First Word tokenization
nltk_tokens = word_tokenize(word_data)

ordered_tokens = set()
result = []
for word in nltk_tokens:
    if word not in ordered_tokens:
        ordered_tokens.add(word)
        result.append(word)
     
print(result)

Output

When we run the above program, we get the following output −

['The', 'Sky', 'is', 'blue', 'also', 'the', 'ocean', 'Rainbow', 'has', 'a', 'colour', '.']
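
A shorter way to get the same order-preserving result relies on the fact that dictionary keys are unique and keep their insertion order (Python 3.7 and later). In this sketch a plain whitespace split stands in for word_tokenize, which is an assumption made to keep the example free of external packages, so the last token keeps its full stop.

```python
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

# Plain whitespace split stands in for word_tokenize here,
# so the final token is 'colour.' with the full stop attached.
tokens = word_data.split()

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(tokens))
print(unique_in_order)
```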

Python Text Processing - Extract Emails from Text

To extract emails from text, we can make use of regular expressions. In the below example we take the help of the regular expression package to define the pattern of an email ID and then use the findall() function to retrieve the text which matches this pattern.

Example - Extracting Email

main.py

import re
text = "Please contact us at [email protected] for further information." + \
        " You can also give feedback at [email protected]"

emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
print(emails)

Output

When we run the above program, we get the following output −

['[email protected]', '[email protected]']
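
The same pattern can be tried on any text. The sketch below uses two made-up addresses on reserved example domains (they are assumptions for illustration, not the addresses from the listing above) and adds re.IGNORECASE so that uppercase letters are matched as well.

```python
import re

# Hypothetical addresses on reserved example domains
text = ("Please contact us at contact@example.com for further information. "
        "You can also give feedback at Feedback@example.org.")

# Same pattern as above, compiled once and made case-insensitive
pattern = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)

emails = pattern.findall(text)
print(emails)
```

The pattern is deliberately simple and will not cover every valid email address.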

Python Text Processing - Extract URL from Text

URL extraction from a text file is achieved by using a regular expression. The expression fetches the text wherever it matches the pattern. Only the re module is used for this purpose.

Example - Reading URLs from a file

We can take an input file containing some URLs and process it through the following program to extract the URLs. The findall() function is used to find all instances matching the regular expression.

Input File - url_example.txt

The input file is shown below. It contains two URLs.

Nowadays you can learn almost anything by just visiting http://www.google.com. But if you are completely new to computers or internet then first you need to learn those fundamentals. Next
you can visit a good e-learning site like - https://www.tutorialspoint.com to learn further on a variety of subjects.

Now, when we take the above input file and process it through the following program, we get the required output, which gives only the URLs extracted from the file. Note that the pattern also accepts dots, so the full stop that follows the first URL becomes part of the match.

main.py

import re
 
with open("url_example.txt") as file:
   for line in file:
      urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
      print(urls)

Output

When we run the above program we get the following output −

['http://www.google.com.']
['https://www.tutorialspoint.com']
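
The first result ends with a full stop that belongs to the sentence rather than to the URL. One possible clean-up, sketched below on an in-memory sample string (an assumption, in place of the input file), is to strip trailing punctuation from each match.

```python
import re

# Hypothetical sample text standing in for the input file
text = ("Visit http://www.google.com. Then continue at "
        "https://www.tutorialspoint.com to learn further.")

pattern = re.compile(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+")

# Strip punctuation that belongs to the sentence rather than to the URL
urls = [match.rstrip(".,;:!?") for match in pattern.findall(text)]
print(urls)
```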

Python Text Processing - Pretty Printing

The python module pprint is used for giving proper printing formats to various data objects in python. Those data objects can represent a dictionary data type or even a data object containing the JSON data. In the below example we see how that data looks before applying the pprint module and after applying it.

Pretty Print a Dictionary

main.py

import pprint

student_dict = {'Name': 'Tusar', 'Class': 'XII', 
     'Address': {'FLAT ':1308, 'BLOCK ':'A', 'LANE ':2, 'CITY ': 'HYD'}}

print(student_dict)
print("\n")
print("***With Pretty Print***")
print("-----------------------")
pprint.pprint(student_dict,width=-1)

Output

When we run the above program, we get the following output −

{'Name': 'Tusar', 'Class': 'XII', 'Address': {'FLAT ': 1308, 'BLOCK ': 'A', 'LANE ': 2, 'CITY ': 'HYD'}}


***With Pretty Print***
-----------------------
{'Address': {'BLOCK ': 'A',
             'CITY ': 'HYD',
             'FLAT ': 1308,
             'LANE ': 2},
 'Class': 'XII',
 'Name': 'Tusar'}

Example - Handling JSON Data

pprint can also present JSON-style data, once it is held in Python dictionaries and lists, in a more readable format.

main.py

import pprint

emp = {"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
   "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],   
   "StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
      "7/30/2013","6/17/2014"],
   "Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"] }

x= pprint.pformat(emp, indent=2)
print(x)

Output

When we run the above program, we get the following output −

{ 'Dept': [ 'IT',
            'Operations',
            'IT',
            'HR',
            'Finance',
            'IT',
            'Operations',
            'Finance'],
  'Name': ['Rick', 'Dan', 'Michelle', 'Ryan', 'Gary', 'Nina', 'Simon', 'Guru'],
  'Salary': [ '623.3',
              '515.2',
              '611',
              '729',
              '843.25',
              '578',
              '632.8',
              '722.5'],
  'StartDate': [ '1/1/2012',
                 '9/23/2013',
                 '11/15/2014',
                 '5/11/2014',
                 '3/27/2015',
                 '5/21/2013',
                 '7/30/2013',
                 '6/17/2014']}
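
When the data arrives as JSON text rather than as a Python dictionary, it has to be parsed first. The sketch below uses a small made-up JSON string (an assumption for illustration) and shows both pprint on the parsed object and json.dumps with indentation, which keeps the output valid JSON.

```python
import json
import pprint

# Hypothetical JSON text standing in for data read from a file or an API
raw = '{"Name": "Rick", "Dept": "IT", "Skills": ["Python", "SQL"], "Salary": 623.3}'

record = json.loads(raw)

# pprint works on the parsed Python object and sorts dictionary keys
pprint.pprint(record, width=40)

# json.dumps produces indented output that is still valid JSON
pretty_json = json.dumps(record, indent=2, sort_keys=True)
print(pretty_json)
```

pprint shows Python literals (single quotes), so json.dumps is the better fit when the formatted text must be read back as JSON.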

Python Text Processing - State Machine

A state machine is a way of designing a program to control the flow in an application. It is a directed graph, consisting of a set of nodes (the states) and a set of transition functions. Processing a text file very often consists of reading each chunk of the file in sequence and doing something in response to each chunk read. The meaning of a chunk depends on what types of chunks were present before it and what chunks come after it.

Consider a scenario where the text input has to be a continuous repetition of the sequence AGC (used in protein analysis). As long as this sequence is maintained in the input string, the state of the machine remains TRUE, but as soon as the input deviates from it, the state of the machine becomes FALSE and remains FALSE afterwards. This ensures that further processing stops even though more chunks with the correct sequence may be available later.

Defining a State Machine

The below program defines a state machine which has functions to start the machine, take inputs for processing the text and step through the processing.

main.py

class StateMachine:

# Initialize 
    def start(self):
        self.state = self.startState

# Step through the input
    def step(self, inp):
        (s, o) = self.getNextValues(self.state, inp)
        self.state = s
        return o

# Loop through the input		
    def feeder(self, inputs):
        self.start()
        return [self.step(inp) for inp in inputs]

# Determine the TRUE or FALSE state
class TextSeq(StateMachine):
    startState = 0
    def getNextValues(self, state, inp):
        if state == 0 and inp == 'A':
            return (1, True)
        elif state == 1 and inp == 'G':
            return (2, True)
        elif state == 2 and inp == 'C':
            return (0, True)
        else:
            return (3, False)


InSeq = TextSeq()

x = InSeq.feeder(['A','A','A'])
print(x)

y = InSeq.feeder(['A', 'G', 'C', 'A', 'C', 'A', 'G'])
print(y)

Output

When we run the above program, we get the following output −

[True, False, False]
[True, True, True, True, False, False, False]

In the result of x, the pattern AGC fails at the second input, because G is expected after the first 'A'. The result remains False for every input after this. In the result of y, the pattern AGC holds up to the 4th input, so the result remains True up to that point. From the 5th input onwards the result changes to False, as G is expected but C is found.
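
The same behaviour can also be written as a transition table, which makes the directed graph explicit. This is an alternative sketch, not the class-based design above; the names TRANSITIONS, DEAD and run are introduced only for this example.

```python
# Transition table: (current state, input) -> next state
TRANSITIONS = {
    (0, "A"): 1,
    (1, "G"): 2,
    (2, "C"): 0,
}
DEAD = 3  # state entered after the first unexpected input


def run(inputs):
    """Return one True/False flag per input symbol."""
    state = 0
    flags = []
    for symbol in inputs:
        # Any pair missing from the table leads to the dead state,
        # and no pair leads out of it again
        state = TRANSITIONS.get((state, symbol), DEAD)
        flags.append(state != DEAD)
    return flags


print(run("AAA"))
print(run("AGCACAG"))
```

Adding a new valid step then means adding one entry to the table instead of another elif branch.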

Python Text Processing - Capitalize and Translate

Capitalizing strings is a regular need in any text processing system. Python achieves it through functions and methods in the standard library. In the below example we use the function string.capwords() and the string method upper(). While capwords() capitalizes the first letter of each word, upper() converts the entire string to upper case.

Example - Capitalization of Strings

main.py

import string

text = 'Tutorialspoint - simple easy learning.'

print(string.capwords(text))
print(text.upper())

Output

When we run the above program we get the following output −

Tutorialspoint - Simple Easy Learning.
TUTORIALSPOINT - SIMPLE EASY LEARNING.

Example - Translation of Strings

Translation in Python essentially means substituting specific characters with other characters. It can be used for simple encryption and decryption of strings.

main.py

text = 'Tutorialspoint - simple easy learning.'

transtable = str.maketrans('tpol', 'wxyz')
print(text.translate(transtable))

Output

When we run the above program we get the following output −

Tuwyriazsxyinw - simxze easy zearning.
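
The table above works in one direction only: the original text already contains a y (in "easy"), so mapping y back to o would corrupt it. For a simple reversible substitution, one option is to swap each pair of letters in both directions, so that applying the same table twice restores the text. The sketch below is an illustration of that idea and is not a secure form of encryption.

```python
text = "Tutorialspoint - simple easy learning."

# Swap each letter pair in both directions so the mapping is its own inverse
swap_table = str.maketrans("tpolwxyz", "wxyztpol")

encoded = text.translate(swap_table)
decoded = encoded.translate(swap_table)

print(encoded)
print(decoded)
```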

Python Text Processing - Tokenization

In Python, tokenization basically refers to splitting up a larger body of text into smaller units such as sentences or words, including text in a non-English language. The various tokenization functions are built into the nltk module itself and can be used in programs as shown below.

Line Tokenization

In the below example we divide a given text into sentences by using the function sent_tokenize.

main.py

import nltk
sentence_data = "The First sentence is about Python. " + \
   "The Second: about Django. You can learn Python,Django and Data Analysis here. "
nltk_tokens = nltk.sent_tokenize(sentence_data)
print (nltk_tokens)

Output

When we run the above program, we get the following output −

['The First sentence is about Python.', 
'The Second: about Django.', 
'You can learn Python,Django and Data Analysis here.']
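
To see what sentence splitting involves, here is a rough standard-library approximation that splits after a full stop, exclamation mark or question mark followed by whitespace. It is only a sketch: unlike sent_tokenize, it would wrongly split after abbreviations such as "Dr." or "e.g.".

```python
import re

text = ("The First sentence is about Python. The Second: about Django. "
        "You can learn Python, Django and Data Analysis here.")

# Split after '.', '!' or '?' when followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
```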

Non-English Tokenization

In the below example we tokenize the German text.

main.py

import nltk

german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens=german_tokenizer.tokenize('Wie geht es Ihnen?  Gut, danke.')
print(german_tokens)

Output

When we run the above program, we get the following output −

['Wie geht es Ihnen?', 'Gut, danke.']

Word Tokenization

We tokenize the words using word_tokenize function available as part of nltk.

main.py

import nltk

word_data = "It originated from the idea that there are readers " + \
   "who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)

Output

When we run the above program we get the following output −

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']

Python Text Processing - Removing Stopwords

Stopwords are the words which do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. Examples are words like the, he and have. Such words are already collected in an NLTK corpus named stopwords. We first download it to our Python environment.

import nltk
nltk.download('stopwords')

It will download the stopword lists for English and a number of other languages.

Verifying the Stopwords

main.py

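from nltk.corpus import stopwords

# words() without a language argument returns the stopwords of every
# language in file order, so the first entries are Albanian
print(stopwords.words()[0:20])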

Output

When we run the above program we get the following output −

['tyre', 'rreth', 'le', 'atyre', 'këta', 'megjithëse', 'kemi', 'per', 
'ndonëse', 'dytë', 'pse', 'tha', 'aty', 'ndaj', 'ke', 'këtë', 'duhet', 
'pa', 'perket', 'veç']

The languages for which stopword lists are available can be listed as below.

main.py

from nltk.corpus import stopwords
print(stopwords.fileids())

Output

When we run the above program we get the following output −

['albanian', 'arabic', 'azerbaijani', 'basque', 'belarusian', 'bengali', 'catalan', 
'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek',
 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 
 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish',
 'tajik', 'tamil', 'turkish', 'uzbek']

Example - Removing stopwords

We use the below example to show how the stopwords are removed from the list of words.

main.py

from nltk.corpus import stopwords
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree','near','the','river']
for word in all_words: 
    if word not in en_stops:
        print(word)

Output

When we run the above program we get the following output −

There
tree
near
river
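
Note that There survives in the output above because the comparison is case sensitive and the stopword list is in lower case. A minimal sketch of a case-insensitive filter follows; the small en_stops set here is a hand-picked stand-in for set(stopwords.words('english')), used only so the snippet runs without the corpus.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
en_stops = {"there", "is", "a", "the"}

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']

# Lower-case each word only for the comparison, keeping the original spelling
kept = [word for word in all_words if word.lower() not in en_stops]
print(kept)
```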

Python Text Processing - Synonyms and Antonyms

Synonyms and antonyms are available as part of WordNet, which is a lexical database for the English language. It is available through the nltk corpora. In WordNet, synonyms are words that denote the same concept and are interchangeable in many contexts, so they are grouped into unordered sets (synsets). We use these synsets to derive the synonyms and antonyms as shown in the below programs.

Example - Getting Synonyms

main.py

from nltk.corpus import wordnet

synonyms = []

for syn in wordnet.synsets("Soil"):
   for lm in syn.lemmas():
      synonyms.append(lm.name())
print (set(synonyms))

Output

When we run the above program we get the following output −

{'grease', 'filth', 'dirt', 'begrime', 'soil', 
'grime', 'land', 'bemire', 'dirty', 'grunge', 
'stain', 'territory', 'colly', 'ground'}

Example - Getting Antonyms

To get the antonyms we simply use the antonyms() method of each lemma.

main.py

from nltk.corpus import wordnet
antonyms = []

for syn in wordnet.synsets("ahead"):
   for lm in syn.lemmas():
      if lm.antonyms():
          antonyms.append(lm.antonyms()[0].name())

print(set(antonyms))

Output

When we run the above program, we get the following output −

{'backward', 'back'}

Python Text Processing - Text Translation

Text translation from one language to another is increasingly common for websites that cater to an international audience. The Python package which helps us do this is called translate.

This package can be installed in the following way. It provides translation for major languages.

pip3 install translate

Example - Translating a Sentence

Below is an example of translating a simple sentence from English to Spanish. The default source language is English.

main.py

from translate import Translator
translator = Translator(to_lang="es")
translation = translator.translate("Good Morning!")
print(translation)

Output

When we run the above program, we get the following output −

¡Buenos días!

Example - Translation Between Any Two Languages

If we need to specify both the source language and the target language, we can do so as in the below program.

main.py

from translate import Translator
translator= Translator(from_lang="es",to_lang="en")
translation = translator.translate("¡Buenos días!")
print(translation)

Output

When we run the above program, we get the following output −

Good Morning!

Python Text Processing - Word Replacement

Replacing a complete string or a part of a string is a very frequent requirement in text processing. The replace() method returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to count.

Following is the syntax for replace() method −

str.replace(old, new[, count])

Parameters

old − This is the old substring to be replaced.
new − This is the new substring, which replaces the old substring.
count − If this optional argument is given, only the first count occurrences are replaced.

This method returns a copy of the string; the original string is left unchanged.

Example - Usage of replace() method

The following example shows the usage of replace() method.

main.py

# Use a name other than str so the built-in type is not shadowed
text = "this is string example....wow!!! this is really string"
print(text.replace("is", "was"))
print(text.replace("is", "was", 3))

Result

When we run above program, it produces the following result −

thwas was string example....wow!!! thwas was really string
thwas was string example....wow!!! thwas is really string
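
The output above shows that replace() also changes the is inside this. When only whole words should be replaced, one option is a regular expression with word boundaries. The sketch below reuses the sample string from the example; the variable names are chosen for illustration.

```python
import re

text = "this is string example....wow!!! this is really string"

# \b marks a word boundary, so the "is" inside "this" is left alone
whole_words = re.sub(r"\bis\b", "was", text)
print(whole_words)

# count limits the number of replacements, like the third argument of replace()
first_only = re.sub(r"\bis\b", "was", text, count=1)
print(first_only)
```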

Example - Replacement Ignoring Case

main.py

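import re

# Compile a pattern that matches "Tutor" in any letter case
pattern = re.compile("Tutor", re.IGNORECASE)

# Every match, whatever its case, is replaced with "Tutor"
replaced_line = pattern.sub("Tutor", "Tutorialspoint has the best tutorials for learning.")
print(replaced_line)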

Output

When we run the above program, we get the following output −

Tutorialspoint has the best Tutorials for learning.

Python Text Processing - Spell Check

Spell checking is a basic requirement in any text processing or analysis. The Python package pyspellchecker lets us find the words that may have been misspelled and also suggests possible corrections.

First, we need to install the required package using the following command in our python environment.

pip3 install pyspellchecker

Example - Spell Check

Now we see below how the package is used to point out the wrongly spelled words as well as make some suggestions about possible correct words.

main.py

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['let', 'us', 'wlak','on','the','groun'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

Output

When we run the above program we get the following output −

group
{'group', 'ground', 'groan', 'grout', 'grown', 'groin'}
walk
{'flak', 'weak', 'walk'}

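Example - Spell Check with a Capitalised Word

If we use Let in place of let, the result stays the same. By default SpellChecker compares words in lower case, so Let is still recognised as a known word. The candidates are returned as a set, which is unordered, so only the order in which they are printed may differ.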

main.py

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['Let', 'us', 'wlak','on','the','groun'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

Output

When we run the above program we get the following output −

group
{'groan', 'group', 'groin', 'grown', 'ground', 'grout'}
walk
{'flak', 'weak', 'walk'}
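
To give a feel for where such candidates come from, the sketch below generates every string one edit away from a word (a deletion, a swap of neighbouring letters, a replacement or an insertion) and keeps those found in a dictionary. This is a simplified illustration of edit-distance matching, not the package's actual code; the WORDS set is a tiny made-up dictionary.

```python
import string

# Hypothetical tiny dictionary, for illustration only
WORDS = {"let", "us", "walk", "on", "the", "ground", "weak"}

def edits1(word):
    """Return all strings that are one edit away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    # Lower-case first, so that capitalised input is treated the same way
    return {w for w in edits1(word.lower()) if w in WORDS}

print(candidates("wlak"))
print(candidates("groun"))
```

A real spell checker also ranks the candidates, typically by how frequent each word is, which is why correction() can return a single most likely answer.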

Python Text Processing - WordNet Interface

WordNet is a dictionary of English, similar to a traditional thesaurus, and NLTK includes the English WordNet. We can use it as a reference for the meaning of words, usage examples and definitions. WordNet groups words with the same meaning into synsets, and each word in a synset is called a lemma. The words in WordNet are organized as nodes and edges, where the nodes represent the word text and the edges represent the relations between the words. Below we will see how we can use the WordNet module.

All Lemmas

main.py

from nltk.corpus import wordnet as wn
res=wn.synset('locomotive.n.01').lemma_names()
print(res)

Output

When we run the above program, we get the following output −

['locomotive', 'engine', 'locomotive_engine', 'railway_locomotive']

Word Definition

The dictionary definition of a word can be obtained by using the definition function. It describes the meaning of the word as we can find in a normal dictionary.

main.py

from nltk.corpus import wordnet as wn
resdef = wn.synset('ocean.n.01').definition()
print(resdef)

Output

When we run the above program, we get the following output −

a large body of water constituting a principal part of the hydrosphere

Usage Examples

We can get example sentences showing the usage of a word by using the examples() function.

main.py

from nltk.corpus import wordnet as wn
res_exm = wn.synset('good.n.01').examples()
print(res_exm)

Output

When we run the above program we get the following output −

['for your own good', "what's the good of worrying?"]

Opposite Words

We get all the opposite words of a lemma by using the antonyms() method.

main.py

from nltk.corpus import wordnet as wn
# get all the antonyms
res_a = wn.lemma('horizontal.a.01.horizontal').antonyms()
print(res_a)

Output

When we run the above program we get the following output −

[Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')]

Python Text Processing - Corpora Access

Corpora is the plural of corpus: a corpus is a single collection of text documents, and corpora are multiple such collections. One famous example is the Gutenberg Corpus, a small selection of texts from Project Gutenberg, which contains some 25,000 free electronic books and is hosted at http://www.gutenberg.org/. In the below example we list the names of the files in this corpus, which are plain text files with names ending in .txt.

main.py

from nltk.corpus import gutenberg
fields = gutenberg.fileids()

print(fields)

Output

When we run the above program, we get the following output −

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Accessing Raw Text

We can access the raw text of these files using the raw() method and split it into sentences with the sent_tokenize function, which is also available in nltk. In the below example we retrieve the first two sentences of the Blake poems text.

main.py

from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")

token = sent_tokenize(sample)

for para in range(2):
    print(token[para])

Output

When we run the above program we get the following output −

[Poems by William Blake 1789]

 
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL


 SONGS OF INNOCENCE
 
 
 INTRODUCTION
 
 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:
 
 "Pipe a song about a Lamb!"
So I piped with merry cheer.

Python Text Processing - Tagging Words

Tagging is an essential feature of text processing where we label each word with its grammatical category (part of speech). We use tokenization together with the pos_tag function to create the tag for each word.

main.py

import nltk

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text=nltk.pos_tag(text)
print(tagged_text)

Output

When we run the above program, we get the following output −

[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), 
('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), 
('the', 'DT'), ('nest', 'JJS')]
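
Because the result is an ordinary list of (word, tag) tuples, it can be filtered and counted with plain Python. The sketch below copies the tagged list from the output above, so it runs without the tagger; the variable names nouns and tag_counts are chosen for illustration.

```python
from collections import Counter

# Copied from the output above
tagged_text = [('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
               ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'),
               ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]

# Noun tags in this tagset all start with "NN"
nouns = [word for word, tag in tagged_text if tag.startswith('NN')]
print(nouns)

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged_text)
print(tag_counts.most_common(2))
```

Note that nest is missing from the nouns because the tagger labelled it JJS in this run, which shows that automatic tags are not always correct.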

Tag Descriptions

We can look up the meaning of each tag by using the following program, which prints the built-in descriptions.

main.py

import nltk

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')

Output

When we run the above program, we get the following output −

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those

Tagging a Corpus

We can also tag the text of a corpus and see the tagged result for each word in that corpus.

main.py

import nltk

from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg
sample = gutenberg.raw("blake-poems.txt")
tokenized = sent_tokenize(sample)
for i in tokenized[:2]:
   words = nltk.word_tokenize(i)
   tagged = nltk.pos_tag(words)
   print(tagged)

Output

When we run the above program we get the following output −

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'), 
(']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'), 
('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'), 
('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'), 
('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'), 
(',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'),
 (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'), 
 ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'), 
 ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'),
 ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]

Python Text Processing - Chunks and Chinks

Chunking is the process of grouping words into phrases based on their part-of-speech tags. In the below example we define a grammar by which the chunks must be generated. The grammar gives the sequence of tags, such as determiners, adjectives and nouns, that makes up a chunk. The resulting chunks are printed as a tree.

Example - Chunking

main.py

import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"),("flower", "NN"), 
("flew", "VBD"), ("through", "IN"),  ("the", "DT"), ("window", "NN")]
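# An NP chunk is an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence) 
print(result)
# draw() opens the same tree in a separate window (requires a display)
result.draw()

Output

When we run the above program, print(result) shows the following tree −

(S
  (NP The/DT small/JJ red/JJ flower/NN)
  flew/VBD
  through/IN
  (NP the/DT window/NN))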

Example - Changing the Grammar

Changing the grammar, we get a different output as shown below.

main.py

import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"),("flower", "NN"),
 ("flew", "VBD"), ("through", "IN"),  ("the", "DT"), ("window", "NN")]

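# Tag names are upper case; this pattern is an optional noun,
# any number of determiners, then an adjective
grammar = "NP: {<NN>?<DT>*<JJ>}" 

chunkprofile = nltk.RegexpParser(grammar)
result = chunkprofile.parse(sentence) 
print(result)
result.draw()

Output

When we run the above program, print(result) shows the following tree −

(S
  (NP The/DT small/JJ)
  (NP red/JJ)
  flower/NN
  flew/VBD
  through/IN
  the/DT
  window/NN)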

Chinking

Chinking is the process of removing a sequence of tokens from a chunk. If the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before.

main.py

import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"),
("flower", "NN"), ("flew", "VBD"), ("through", "IN"), 
 ("the", "DT"), ("window", "NN")]

grammar = r"""
  NP:
    {<.*>+}         # Chunk everything
    }<JJ>+{         # Chink sequences of JJ
  """
chunkprofile = nltk.RegexpParser(grammar)
result = chunkprofile.parse(sentence) 
print(result)
result.draw()

Output

When we run the above program, print(result) shows the following tree −

(S
  (NP The/DT)
  small/JJ
  red/JJ
  (NP flower/NN flew/VBD through/IN the/DT window/NN))

As you can see, the adjectives matching the chink rule are left outside the noun phrase chunks. This process of removing tokens from a chunk is called chinking.

Python Text Processing - Chunk Classification

Classification-based chunking involves classifying the text as groups of words rather than individual words. A simple scenario is tagging the text in sentences. We will use a corpus to demonstrate the classification. We choose the conll2000 corpus, which has data from the Wall Street Journal (WSJ) corpus and is used for noun phrase-based chunking.

First, we add the corpus to our environment using the following command.

import nltk
nltk.download('conll2000')

Let's have a look at the first few sentences in this corpus.

from nltk.corpus import conll2000

x = (conll2000.sents())
for i in range(3):
   print(x[i])
   print('\n')

Output

When we run the above program we get the following output −

['Confidence', 'in', 'the', 'pound', 'is', 'widely',...]
['Chancellor', 'of', 'the', 'Exchequer', 'Nigel', 'Lawson', ...]
['But', 'analysts', 'reckon', 'underlying', 'support', 'for', ...]

Next we use the function tagged_sents() to get the sentences with each word paired with its part-of-speech tag.

from nltk.corpus import conll2000

x = (conll2000.tagged_sents())
for i in range(3):
   print(x[i])
   print ('\n')

Output

When we run the above program we get the following output −

[('Confidence', 'NN'), ('in', 'IN'), ...]
[('Chancellor', 'NNP'), ('of', 'IN'), ...]
[('But', 'CC'), ('analysts', 'NNS'), ...]

Python Text Processing - Text Classification

Many times, we need to categorise the available text into various categories by some pre-defined criteria. nltk provides such categorisation as part of various corpora. In the below example we look at the movie review corpus and check the categories available.

Example - Categorising Data

main.py

# Let's see how the movie reviews are categorised
from nltk.corpus import movie_reviews

all_cats = []
for w in movie_reviews.categories():
    all_cats.append(w.lower())
print(all_cats)

Output

When we run the above program, we get the following output −

['neg', 'pos']

Example - Tokenizing Data

Now let's look at the content of one of the files with a positive review. The sentences in this file are tokenized and we print the first four sentences to see the sample.

main.py

from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize
fields = movie_reviews.fileids()

sample = movie_reviews.raw("pos/cv944_13521.txt")

token = sent_tokenize(sample)
for lines in range(4):
    print(token[lines])

Output

When we run the above program we get the following output −

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade 
with hollywood churning out films 
like deep impact , = godzilla , the x-files , armageddon , the truman show , 
all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film 
releases from the = spielberg-katzenberg-geffen's dreamworks production company .

Example - Tokenizing words

Next, we take the words from all of these files and find the most common ones by using the FreqDist class from nltk.

main.py

import nltk
from nltk.corpus import movie_reviews
fields = movie_reviews.fileids()

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(10))

Output

When we run the above program we get the following output −

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), 
('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
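
The same kind of frequency count can be tried on a small scale with the standard library's collections.Counter, which offers a most_common() method like the one used above. The sketch below uses a made-up sentence instead of the movie review corpus.

```python
from collections import Counter

# Made-up sample text, standing in for the words of a corpus
words = "the cat and the dog and the bird".split()

freq = Counter(word.lower() for word in words)
print(freq.most_common(2))
```

As the corpus output shows, punctuation and very common words dominate such counts, so stopwords are usually removed before the frequencies are interpreted.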

Python Text Processing - Bigrams

Some English words occur together more frequently, for example sky high, do or die, best performance and heavy rain. So, in a text document we may need to identify such pairs of words, which can help in sentiment analysis. First, we need to generate word pairs from the existing sentence while maintaining their current sequence. Such pairs are called bigrams. NLTK has a bigrams function which helps us generate these pairs.

Example - Bigrams

main.py

import nltk

word_data = "The best performance can bring in sky high success."
nltk_tokens = nltk.word_tokenize(word_data)  	

print(list(nltk.bigrams(nltk_tokens)))

Output

When we run the above program we get the following output −

[('The', 'best'), ('best', 'performance'), ('performance', 'can'), ('can', 'bring'), 
('bring', 'in'), ('in', 'sky'), ('sky', 'high'), ('high', 'success'), ('success', '.')]

This result can be used for statistical findings on the frequency of such pairs in a given text, which can correlate with the general sentiment of the descriptions present in the body of the text.
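
As a rough sketch of such a frequency count, the pairs can also be built with zip() and counted with collections.Counter from the standard library. The sample sentence below is made up so that one pair repeats.

```python
from collections import Counter

# Made-up sample text in which the pair ("the", "best") occurs twice
tokens = "the best performance follows the best preparation".split()

# Pair every token with its successor to form the bigrams
pairs = list(zip(tokens, tokens[1:]))
pair_counts = Counter(pairs)
print(pair_counts.most_common(1))
```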

Python Text Processing - Process PDF

Python can read PDF files and print out the content after extracting the text from it. For that we have to first install the required module which is PyPDF2. Below is the command to install the module. You should have pip already installed in your python environment.

pip install pypdf2

Example - Processing PDF

On successful installation of this module we can read PDF files using the methods available in the module.

main.py

import PyPDF2

pdfName = 'Tutorialspoint.pdf'
# PyPDF2 3.x API; older releases used PdfFileReader, getPage() and extractText()
read_pdf = PyPDF2.PdfReader(pdfName)
page = read_pdf.pages[0]
page_content = page.extract_text()
print(page_content)

Output

When we run the above program, we get the following output −

Tutorials Point originated from the idea that there exists a class of readers who respond better 
to online content and prefer to learn new skills at their own pace from the comforts of their 
drawing rooms.
 
The journey commenced with a single tutorial on HTML in 2006 and elated by the response 
it generated, we worked our way to adding fresh tutorials to our repository which now 
proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
languages to web designing to academics and much more.

Example - Reading Multiple Pages

To read a PDF with multiple pages and print each page with its page number, we loop over the pages of the reader. In the below example we read a PDF file which has two pages. The contents are printed under two separate page headings.

import PyPDF2

pdfName = 'Tutorialspoint2.pdf'
# PyPDF2 3.x API (PdfReader, pages, extract_text)
read_pdf = PyPDF2.PdfReader(pdfName)

# enumerate() supplies the zero-based page index; add 1 for display
for i, page in enumerate(read_pdf.pages):
    print('Page No - ' + str(1 + i))
    page_content = page.extract_text()
    print(page_content)

Output

When we run the above program, we get the following output −

Page No - 1
Tutorials Point originated from the idea that there exists a class of readers who respond better to 
online content and prefer to learn new skills at their own pace from the comforts of their drawing 
rooms. 

Page No - 2
 
The journey commenced with a single tutorial on HTML in 2006 and elated by the response it 
generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts 
a wealth of tutorials and allied articles on topics ranging from p
rogramming languages to web 
designing to academics and much more.

Python Text Processing - Process Word Document

To read a Word document we take the help of the module named docx, which is provided by the python-docx package. We first install python-docx as shown below. Then we write a program that uses the functions in the docx module to read the entire file by paragraphs.

We use the below command to get the docx module into our environment.

pip install python-docx

Reading a Word Document

In the below example we read the content of a Word document by appending the text of each paragraph to a list and finally printing out all the paragraph text.

main.py

import docx

def readtxt(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print (readtxt('Tutorialspoint.docx'))

Output

When we run the above program, we get the following output −

Tutorials Point originated from the idea that there exists a class of readers who respond 
better to online content and prefer to learn new skills at their own pace from the comforts 
of their drawing rooms. 

The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, 
we worked our way to adding fresh tutorials to our repository which now proudly flaunts 
a wealth of tutorials and allied articles on topics ranging from programming languages 
to web designing to academics and much more.

Reading Individual Paragraphs

We can read a specific paragraph from the Word document using the paragraphs attribute, which is a list indexed from zero. In the below example we print the number of paragraphs and then the text at index 2, which in this document is the second block of text.

main.py

import docx

doc = docx.Document('Tutorialspoint.docx')
print(len(doc.paragraphs))

print(doc.paragraphs[2].text)

Output

When we run the above program, we get the following output −

The journey commenced with a single tutorial on HTML in 2006 and elated by the response 
it generated, we worked our way to adding fresh tutorials to our repository 
which now proudly flaunts a wealth of tutorials and allied articles on topics 
ranging from programming languages to web designing to academics and much more.

Python Text Processing - Reading RSS Feed

RSS (Rich Site Summary) is a format for delivering regularly changing web content. Many news-related sites, weblogs and other online publishers syndicate their content as an RSS Feed to whoever wants it. In python we take help of the below package to read and process these feeds.

pip install feedparser

Feed Structure

In the below example we get the structure of the feed so that we can analyse further about which parts of the feed we want to process.

main.py

import feedparser
NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")
entry = NewsFeed.entries[1]

print(entry.keys())

Output

When we run the above program, we get the following output −

dict_keys(['title', 'title_detail', 'summary', 'summary_detail', 
'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 
'authors', 'author', 'author_detail'])

Feed Title and Posts

Reading Title and Head of RSS Feed

In the below example we read the number of posts in the RSS feed and the title of one post.

main.py

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")

print('Number of RSS posts :', len(NewsFeed.entries))

entry = NewsFeed.entries[1]
print('Post Title :',entry.title)

Output

When we run the above program we get the following output −

Number of RSS posts : 47
Post Title : Why Saturday? How Israel-US strikes targeted Khamenei and his inner circle

Feed Details

Based on the above entry structure we can derive the necessary details from the feed using a Python program as shown below. As entry behaves like a dictionary, we use its keys to produce the values needed.

main.py

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")

entry = NewsFeed.entries[1]

print(entry.published)
print("******")
print(entry.summary)
print("------News Link--------")
print(entry.link)

Output

When we run the above program we get the following output −

Sun, 01 Mar 2026 12:15:06 +0530
******
Iran launched retaliatory strikes across key Gulf cities, including Dubai, Doha, and Manama, 
targeting areas hosting US military bases. These attacks followed US and Israeli strikes that 
reportedly killed Iran's Supreme Leader. Major airports experienced evacuations and flight 
suspensions due to the escalating regional conflict.
------News Link--------
https://timesofindia.indiatimes.com/world/middle-east/iran-strikes-gulf-again-more-explosions-in-dubai-doha-and-manama-airports-targeted/articleshow/128908100.cms
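
Keys such as title and link correspond to child elements of each item in the feed's XML. To illustrate that structure without a network call, the sketch below parses a small hand-written RSS 2.0 document with the standard library's xml.etree.ElementTree. The feed content and the example.com links are made up, and this is not how feedparser is implemented.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document
rss_text = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss_text)

# Each <item> under <channel> is one post
items = root.findall("./channel/item")
print('Number of RSS posts :', len(items))

titles = [item.findtext("title") for item in items]
print(titles)
```

Real feeds vary in format and are often malformed, which is the main reason to prefer a dedicated package such as feedparser over hand-written parsing.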

Python Text Processing - Sentiment Analysis

Sentiment analysis is about analysing the general opinion of the audience. It may be a reaction to a piece of news, a movie or a tweet about some matter under discussion. Generally, such reactions are taken from social media and collected into a file to be analysed through NLP. We will take a simple case of defining positive words first and then checking whether sentences contain them. We use the nltk.sentiment.util module from nltk. We first carry out the analysis with single words and then with paired words, also called bigrams. Finally, we mark the words that fall within the scope of a negation using the mark_negation function.

Example - Sentiment Analysis

main.py

import nltk
from nltk.sentiment.util import extract_unigram_feats
from nltk.sentiment.util import extract_bigram_feats
from nltk.sentiment.util import mark_negation

# Analysing for single words
def OneWord(): 
   positive_words = ['good', 'progress', 'luck']
   text = 'Hard Work brings progress and good luck.'.split()                 
   analysis = extract_unigram_feats(text, positive_words) 
   print(' ** Sentiment with one word **\n')
   print(analysis) 

# Analysing for a pair of words	
def WithBigrams(): 
   word_sets = [('Regular', 'fit'), ('fit', 'fine')] 
   text = 'Regular exercise makes you fit and fine'.split() 
   analysis = extract_bigram_feats(text, word_sets) 
   print('\n*** Sentiment with bigrams ***\n') 
   print(analysis)

# Analysing the negation words. 
def NegativeWord():
   text = 'Lack of good health can not bring success to students'.split() 
   analysis = mark_negation(text) 
   print('\n**Sentiment with Negative words**\n')
   print(analysis) 
    
OneWord()
WithBigrams() 
NegativeWord()

Output

When we run the above program we get the following output −

 ** Sentiment with one word **

{'contains(luck)': False, 'contains(good)': True, 'contains(progress)': True}

*** Sentiment with bigrams ***

{'contains(fit - fine)': False, 'contains(Regular - fit)': False}

**Sentiment with Negative words**

['Lack', 'of', 'good', 'health', 'can', 'not', 'bring_NEG', 'success_NEG', 'to_NEG', 'students_NEG']
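
In the first result above, contains(luck) is False because split() leaves the full stop attached, so the token is luck. rather than luck. The sketch below reproduces that effect with plain Python; the dictionaries are built by hand to mimic the shape of the output above, not by calling extract_unigram_feats.

```python
import string

positive_words = ['good', 'progress', 'luck']
sentence = 'Hard Work brings progress and good luck.'

# Tokens exactly as split() produces them: the last one is "luck."
raw_tokens = set(sentence.split())

# Strip surrounding punctuation and lower-case before comparing
clean_tokens = {t.strip(string.punctuation).lower() for t in sentence.split()}

raw_feats = {'contains({})'.format(w): w in raw_tokens for w in positive_words}
clean_feats = {'contains({})'.format(w): w in clean_tokens for w in positive_words}

print(raw_feats)
print(clean_feats)
```

Using a proper tokenizer, or cleaning the tokens as shown, avoids missing words that sit next to punctuation.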

Python Text Processing - Search And Match

With regular expressions there are two fundamental operations which appear similar but have significant differences. re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string. This plays an important role in text processing, as we often have to write the correct regular expression to retrieve a chunk of text, for example for sentiment analysis.

Using Regular Expression

main.py

import re

if re.search("tor", "Tutorial"):
   print("1. search result found anywhere in the string")
        
if re.match("Tut", "Tutorial"):
   print("2. Match with beginning of string")
         
if not re.match("tor", "Tutorial"):
   print("3. No match with match if not beginning")
        
# Search as Match        
if not re.search("^tor", "Tutorial"):
   print("4. search as match")

Output

When we run the above program, we get the following output −

1. search result found anywhere in the string
2. Match with beginning of string
3. No match with match if not beginning
4. search as match
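
Both functions return a match object, which also tells us where the match was found. The short sketch below inspects that object and adds re.fullmatch(), a third function from the same module that succeeds only when the whole string matches.

```python
import re

# search() finds "tor" in the middle; span() gives the start and end positions
m = re.search("tor", "Tutorial")
print(m.group(), m.span())

# fullmatch() requires the pattern to cover the entire string
print(re.fullmatch("Tut", "Tutorial"))
print(re.fullmatch(r"Tut\w*", "Tutorial").group())
```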

Python Text Processing - Munging

Munging in general means cleaning up messy data by transforming it. In our case we will see how we can transform text to get a result with some desirable changes to the data. At a simple level it is only about transforming the text we are dealing with.

Example - Munging

In the below example we shuffle the letters of each word in a sentence, except the first and the last letter, to get alternative spellings of the kind a human may produce as misspelled words while writing.

main.py

import random

import re

def replace(t):
    inner_word = list(t.group(2))
    random.shuffle(inner_word)
    return t.group(1) + "".join(inner_word) + t.group(3)
text = "Hello, You should reach the finish line."
print(re.sub(r"(\w)(\w+)(\w)", replace, text))

print(re.sub(r"(\w)(\w+)(\w)", replace, text))

Output

When we run the above program we get the following output −

Hlleo, You slohud recah the fniish line.
Hello, You soulhd reach the fniish line.

Here you can see how the words are jumbled except for the first and the last letters. By taking a statistical approach to wrong spellings we can decide which words are commonly misspelled and supply the correct spelling for them.
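
As a rough illustration of supplying the correct spelling, the sketch below maps a jumbled word back to a known word when both share the first letter, the last letter and the same inner letters. The known_words list and the function names are hypothetical and stand in for a real vocabulary.

```python
from collections import Counter

# Hypothetical vocabulary; a real program would load a word list
known_words = ["hello", "should", "reach", "finish", "line"]

def is_inner_scramble(candidate, word):
    # Same length, same first and last letter, same inner letters in any order
    if len(candidate) != len(word) or len(word) < 2:
        return candidate == word
    return (candidate[0] == word[0]
            and candidate[-1] == word[-1]
            and Counter(candidate[1:-1]) == Counter(word[1:-1]))

def correct(candidate):
    # Return the first known word that the candidate is a scramble of
    for word in known_words:
        if is_inner_scramble(candidate.lower(), word):
            return word
    return candidate

print(correct("slohud"))  # should
print(correct("fniish"))  # finish
print(correct("xyz"))     # xyz (no known word matches)
```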

Python Text Processing - Text Wrapping

Text wrapping is required when the text grabbed from some source is not properly formatted to be displayed within the available screen width. This is achieved by using the parawrap package, which can be installed in our environment with the below command.

pip3 install parawrap

Example - Wrapping a long text

The below paragraph is a single continuous string of text. On applying the wrap function, the text is split into multiple lines, which are returned as a list of strings.

main.py

import parawrap

text = "In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone's daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as Godfather. He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, no Sicilian can refuse a request on his daughter's wedding day. One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men because she refused their advances; the men received minimal punishment from the presiding judge. The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to Corleone's nefarious business dealings. The Don's wife is godmother to Bonasera's shamed daughter, a relationship the Don uses to extract new loyalty from the undertaker. The Don agrees to have his men punish the young men responsible (in a non-lethal manner) in return for future service if necessary."

print(parawrap.wrap(text))

Output

When we run the above program we get the following output −

['In late summer 1945, guests are gathered for the wedding reception of',
...
]

We can also apply the wrap function with a specific width as input parameter, which will cut words where required to keep every line within that width.

Example - Wrapping with specific width

main.py

import parawrap

text = "In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone's daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as Godfather. He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, no Sicilian can refuse a request on his daughter's wedding day. One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men because she refused their advances; the men received minimal punishment from the presiding judge. The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to Corleone's nefarious business dealings. The Don's wife is godmother to Bonasera's shamed daughter, a relationship the Don uses to extract new loyalty from the undertaker. The Don agrees to have his men punish the young men responsible (in a non-lethal manner) in return for future service if necessary."

print(parawrap.wrap(text,5))

Output

When we run the above program we get the following output −

['In', 'late', 'summe', 'r', '1945,', 'guest', 's are', 'gathe', 'red'
...]
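
If installing an extra package is not an option, similar wrapping can be sketched with the textwrap module from the Python standard library. This is an alternative to parawrap, not the same implementation, and the shortened sample text and the width of 30 are chosen only for illustration.

```python
import textwrap

text = ("In late summer 1945, guests are gathered for the wedding "
        "reception of Don Vito Corleone's daughter Connie.")

# wrap() returns a list of lines, each at most `width` characters long
lines = textwrap.wrap(text, width=30)

for line in lines:
    print(line)
```

By default textwrap breaks lines at whitespace and splits a word only when the word alone is longer than the requested width.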

Python Text Processing - Frequency Distribution

Counting the frequency of occurrence of a word in a body of text is often needed during text processing. This can be achieved by applying the word_tokenize() function, collecting the resulting tokens in a list and then counting how often each word occurs in that list, as shown in the below program.

Example - Getting Frequencies of Words

main.py

from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")

token = word_tokenize(sample)

# Keep only the first 50 tokens
wlist = token[:50]

# Count how often each of those tokens occurs in the list
wordfreq = [wlist.count(w) for w in wlist]

# zip() is lazy in Python 3, so build a list before printing
print("Pairs\n" + str(list(zip(wlist, wordfreq))))

Output

When we run the above program, we get the following output −

Pairs
[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1)
...]
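
The same kind of count can be sketched without NLTK by using collections.Counter from the standard library. The sample sentence below is made up for illustration, and splitting on whitespace is a much cruder tokenizer than word_tokenize().

```python
from collections import Counter

# Made-up sample text; split() is a crude stand-in for word_tokenize()
sample = "the cat sat on the mat and the dog sat too"
tokens = sample.lower().split()

# Counter maps each distinct token to its number of occurrences
freq = Counter(tokens)

print(freq["the"])          # 3
print(freq.most_common(2))  # [('the', 3), ('sat', 2)]
```

A Counter stores each distinct word once, so it avoids the repeated list.count() calls used in the program above.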

Conditional Frequency Distribution

A conditional frequency distribution is used when we want to count words separately for each condition, for example for each category of text in which the words occur.

main.py

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))
categories = ['hobbies', 'romance','humor']
searchwords = [ 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=categories, samples=searchwords)

Output

When we run the above program, we get the following output −

          may might  must  will 
hobbies   131    22    83   264 
romance    11    51    45    43 
  humor     8     8     9    13
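
The idea behind a conditional frequency distribution can be sketched in plain Python as one counter per condition. The (genre, word) pairs below are hypothetical and much smaller than the Brown corpus; this only illustrates the data structure and is not how NLTK implements it.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, word) pairs standing in for a tagged corpus
pairs = [
    ("news", "will"), ("news", "will"), ("news", "may"),
    ("romance", "might"), ("romance", "will"), ("romance", "might"),
]

# One Counter per condition, created on first use
cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1

for genre in cfd:
    print(genre, dict(cfd[genre]))
```

Looking up cfd["romance"]["might"] then gives the count of a word under one condition, which is what each cell of the tabulated output represents.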

Python Text Processing - Text Summarization

Text summarization involves generating a summary from a large body of text which captures the main points of that text. In the below example we use the gensim module and its summarize function to achieve this. We install the below package to use it.

pip install gensim_sum_ext

Example - Usage of Summarize Function

The below paragraph is about a movie plot. The summarize function is applied to pick a few lines from the text body itself to produce the summary.

main.py

from gensim.summarization import summarize
text = "In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleones " + \
       "daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando),"  + \
       "the head of the Corleone Mafia family, is known to friends and associates as Godfather. "  + \
       "He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors "  + \
       "because, according to Italian tradition, no Sicilian can refuse a request on his daughter's wedding " + \
       " day. One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician "  + \
       "and acquaintance of the Don, whose daughter was brutally beaten by two young men because she"  + \
       "refused their advances; the men received minimal punishment from the presiding judge. " + \
       "The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to Corleone's" + \
       "nefarious business dealings. The Don's wife is godmother to Bonasera's shamed daughter, " + \
       "a relationship the Don uses to extract new loyalty from the undertaker. The Don agrees " + \
       "to have his men punish the young men responsible (in a non-lethal manner) in return for " + \
        "future service if necessary."
          
print(summarize(text))

Output

When we run the above program we get the following output −

He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, no Sicilian can refuse a request on his daughter's wedding  day.
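
To give a feel for how picking sentences from the text itself can work, here is a deliberately naive sketch that scores each sentence by the average frequency of its words and keeps the best one. This is not the algorithm used by gensim; the function name naive_summary and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def naive_summary(text, max_sentences=1):
    # Rough sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        # Average frequency of the words in this sentence
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    best = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Return the chosen sentences in their original order
    return " ".join(s for s in sentences if s in best)

demo = "Python is popular. Python code is readable and Python is fun. Rain fell."
summary = naive_summary(demo)
print(summary)  # Python is popular.
```

Averaging the scores favours short sentences made of frequent words, which is one reason real summarizers use more careful ranking.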

Extracting Keywords

We can also extract keywords from a body of text by using the keywords function from the gensim library as below.

main.py

from gensim.summarization import keywords
text = "In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleones " + \
       "daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando),"  + \
       "the head of the Corleone Mafia family, is known to friends and associates as Godfather. "  + \
       "He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors "  + \
       "because, according to Italian tradition, no Sicilian can refuse a request on his daughter's wedding " + \
       " day. One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician "  + \
       "and acquaintance of the Don, whose daughter was brutally beaten by two young men because she"  + \
       "refused their advances; the men received minimal punishment from the presiding judge. " + \
       "The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to Corleone's" + \
       "nefarious business dealings. The Don's wife is godmother to Bonasera's shamed daughter, " + \
       "a relationship the Don uses to extract new loyalty from the undertaker. The Don agrees " + \
       "to have his men punish the young men responsible (in a non-lethal manner) in return for " + \
        "future service if necessary."

print(keywords(text))

Output

When we run the above program, we get the following output −

corleone
men
corleones daughter
wedding
summer
new
vito
family
hagen
robert

Python Text Processing - Stemming Algorithms

In Natural Language Processing we come across situations where two or more words have a common root. For example, the three words agreed, agreeing and agreeable have the same root word, agree. A search involving any of these words should treat them as the same word, which is the root word. So it becomes essential to link all the words to their root word. The NLTK library has methods to do this linking and give the output showing the root word.

The three most commonly used stemming algorithms available in NLTK are Porter, Lancaster and Snowball. They give slightly different results. The below example shows the use of all three stemming algorithms and their results.

Example - Usage of Stemming Algorithms

main.py

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer 

porter_stemmer = PorterStemmer()
lanca_stemmer = LancasterStemmer()
sb_stemmer = SnowballStemmer("english")

word_data = "Aging head of famous crime family decides to transfer his position to one of his subalterns"
# First, tokenize the sentence into words
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the root of each word with every stemmer
print('***PorterStemmer****\n')
for w_port in nltk_tokens:
   print("Actual: %s  || Stem: %s"  % (w_port,porter_stemmer.stem(w_port)))

print('\n***LancasterStemmer****\n')
for w_lanca in nltk_tokens:
   print("Actual: %s  || Stem: %s"  % (w_lanca,lanca_stemmer.stem(w_lanca)))

print('\n***SnowballStemmer****\n')
for w_snow in nltk_tokens:
   print("Actual: %s  || Stem: %s"  % (w_snow,sb_stemmer.stem(w_snow)))

Output

When we run the above program we get the following output −

***PorterStemmer****

Actual: Aging  || Stem: age
Actual: head  || Stem: head
Actual: of  || Stem: of
Actual: famous  || Stem: famou
Actual: crime  || Stem: crime
Actual: family  || Stem: famili
Actual: decides  || Stem: decid
Actual: to  || Stem: to
Actual: transfer  || Stem: transfer
Actual: his  || Stem: hi
Actual: position  || Stem: posit
Actual: to  || Stem: to
Actual: one  || Stem: one
Actual: of  || Stem: of
Actual: his  || Stem: hi
Actual: subalterns  || Stem: subaltern

***LancasterStemmer****

Actual: Aging  || Stem: ag
Actual: head  || Stem: head
Actual: of  || Stem: of
Actual: famous  || Stem: fam
Actual: crime  || Stem: crim
Actual: family  || Stem: famy
Actual: decides  || Stem: decid
Actual: to  || Stem: to
Actual: transfer  || Stem: transf
Actual: his  || Stem: his
Actual: position  || Stem: posit
Actual: to  || Stem: to
Actual: one  || Stem: on
Actual: of  || Stem: of
Actual: his  || Stem: his
Actual: subalterns  || Stem: subaltern

***SnowballStemmer****

Actual: Aging  || Stem: age
Actual: head  || Stem: head
Actual: of  || Stem: of
Actual: famous  || Stem: famous
Actual: crime  || Stem: crime
Actual: family  || Stem: famili
Actual: decides  || Stem: decid
Actual: to  || Stem: to
Actual: transfer  || Stem: transfer
Actual: his  || Stem: his
Actual: position  || Stem: posit
Actual: to  || Stem: to
Actual: one  || Stem: one
Actual: of  || Stem: of
Actual: his  || Stem: his
Actual: subalterns  || Stem: subaltern
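
To see why stems such as famili or posit appear, it helps to remember that stemmers strip suffixes by rule rather than look words up in a dictionary. The toy function below shows the idea with a handful of suffixes; it is not the Porter, Lancaster or Snowball algorithm, and the suffix list and the name naive_stem are assumptions for illustration.

```python
# Toy suffix list, checked from longest to shortest
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def naive_stem(word, min_stem=3):
    # Strip the first matching suffix, provided enough of the word remains
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

words = ["connected", "connecting", "connection", "connections"]
stems = {w: naive_stem(w) for w in words}

# Every word above is reduced to the common root "connect"
print(stems)
```

The real algorithms apply many more rules in several passes, which is why their results differ from one another on words like famous or transfer.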

Python Text Processing - Constrained Search

Many times, after we get the result of a search, we need to search one level deeper into part of that result. For example, in a given body of text we aim to find the web addresses and also extract the different parts of each address, like the protocol, domain name etc. In such a scenario we use the group function, which divides the search result into groups based on the regular expression. We create such groups by placing parentheses around the parts of the pattern we want to capture, leaving the fixed text we only want to match, such as "://", outside the parentheses.

Example - Usage of Search

main.py

import re
text = "The web address is https://www.tutorialspoint.com"

# Taking "://" and "." to separate the groups 
result = re.search('([\\w.-]+)://([\\w.-]+)\\.([\\w.-]+)', text)
if result :
    print("The main web Address: ",result.group())
    print("The protocol: ",result.group(1))
    print("The domain name: ",result.group(2))
    print("The TLD: ",result.group(3))

Output

When we run the above program, we get the following output −

The main web Address:  https://www.tutorialspoint.com
The protocol:  https
The domain name:  www.tutorialspoint
The TLD:  com
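
When a pattern has several groups, numbered access can become hard to read. As a sketch, the same search can be written with named groups, a standard feature of the re module; the group names protocol, domain and tld are chosen here for illustration.

```python
import re

text = "The web address is https://www.tutorialspoint.com"

# (?P<name>...) gives each group a name in addition to its number
pattern = re.compile(
    r"(?P<protocol>[\w.-]+)://(?P<domain>[\w.-]+)\.(?P<tld>[\w.-]+)"
)

result = pattern.search(text)
if result:
    # groupdict() returns all named groups as a dictionary
    parts = result.groupdict()
    print(parts["protocol"])  # https
    print(parts["domain"])    # www.tutorialspoint
    print(parts["tld"])       # com
```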
