NLTK for Simple Article Categorization
Article categorization has been around for a while, and it improved dramatically as Machine Learning techniques matured. However, we do not always need something that complex to categorize an article. Here I will show you how to categorize an article using NLTK in Python 2.7.x with a few dozen lines of code.
We will assume that we receive the article without HTML tags and properly encoded. The main reason to use Python is that it is the de facto language for language processing, and the equivalent .NET libraries are no longer maintained. The steps to solve this problem are the following:

1. Create a couple of classes to save the needed data.
2. Create a utility class to process the data.
3. Return the results.
Create a couple of classes to save the needed data
We will create two classes: the first one called Category and the second one called Data.
class Category:
    def __init__(self, name, keywords):
        self.name = name
        self.keywords = keywords


class Data:
    def __init__(self, article):
        self.categories = None
        self.article = article
The first class saves the name of the category and the keywords that identify it; the second one saves the article text and initializes categories to None, to be set later.
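For instance, a category with its keywords and an article holder would be created like this (the values here are just placeholders):

category = Category("Sports", ["soccer", "game", "sports"])
data = Data("Some article text about a soccer game")
print data.categories  # None until the analyzer assigns them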
Create a utility class to process the data
This class will have a method which receives a Data instance and an array of Category instances. With this information we will use NLTK to determine the categories in the simplest possible way. This is quite a long class, so it is important to read all the comments for further explanation.
''' Analyzer class '''
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

nltk.download('stopwords')


class Analyzer:
    def __init__(self, data, categories):
        self.data = data
        self.categories = categories
        # Distribution threshold. How many times a keyword must be
        # repeated in the distribution to be considered part of the category
        self.threshold_distribution = 1
        # Category threshold. How many matched keywords are needed
        # for the article to be considered part of the category
        self.threshold_category_words = 3

    def analyze_content_with_nltk(self):
        # Tokenizer which will remove the extra white spaces and punctuation
        tokenizer = RegexpTokenizer(r'\w+')
        # Stop word set to remove common words like prepositions
        stop_words = set(stopwords.words('english'))
        # Stemmer to reduce the words to their stem
        stemmer = PorterStemmer()

        print "processing..."

        # First we tokenize the article
        article_tokens = tokenizer.tokenize(self.data.article)
        # Remove the stop words from the article (lowercasing each token
        # first, so capitalized words like "The" are filtered as well)
        article_tokens_no_stop_words = [w for w in article_tokens
                                        if w.lower() not in stop_words]
        print 'article tokens: ' + str(len(article_tokens))
        print 'article tokens no stop words: ' + str(len(article_tokens_no_stop_words))

        # Stem the article tokens
        article_tokens_stemming = []
        for word in article_tokens_no_stop_words:
            article_tokens_stemming.append(stemmer.stem(word.lower()))

        # Create a frequency distribution for the article's tokens
        article_dist = nltk.FreqDist(article_tokens_stemming)

        # Iterate over the categories array
        article_categories = []
        for category in self.categories:
            # Tokenize the category keywords
            keywords_tokens = list(set(tokenizer.tokenize(' '.join(category.keywords))))
            # If there are no tokens, continue with the next category
            if len(keywords_tokens) == 0:
                continue
            print 'category: ' + category.name
            print 'keywords tokens: ' + str(len(keywords_tokens))

            # Stem the category keywords
            keywords_tokens_stemming = []
            for keyword in keywords_tokens:
                keywords_tokens_stemming.append(stemmer.stem(keyword.lower()))

            # Using the first threshold, verify whether each keyword
            # appears in the article distribution. If it does,
            # save it to the keywords found array
            keywords_found = []
            for keyword in keywords_tokens_stemming:
                if article_dist[keyword] >= self.threshold_distribution:
                    keywords_found.append(keyword)
            print "keywords found: " + str(keywords_found)

            # If the number of keywords found in the article distribution
            # reaches the second threshold, we assume the article belongs
            # to this category. We then repeat the process for the next category
            if len(keywords_found) >= self.threshold_category_words:
                article_categories.append(category.name)

        print article_categories
        self.data.categories = article_categories
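To make the matching step easier to follow, here is a minimal standalone sketch (the sample sentence is just an illustration) of how stemming and FreqDist interact: "played" and "play" collapse to the same stem, so a single keyword can match several surface forms of a word.

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()

text = "He played soccer while they play games"
tokens = [stemmer.stem(w.lower()) for w in tokenizer.tokenize(text)]
dist = nltk.FreqDist(tokens)

print dist['play']    # 2, because "played" and "play" share the stem "play"
print dist['soccer']  # 1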
We used simple NLP techniques to solve this task instead of Machine Learning. This of course makes categorization faster, but less accurate. I would definitely go with a Machine Learning approach if you have the resources and time to implement it.
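For comparison, if you do go the Machine Learning route, NLTK ships a NaiveBayesClassifier. The sketch below is only illustrative; the two labeled feature sets are made up, and a real classifier would need a proper training dataset:

import nltk

# Made-up labeled feature sets; a real model needs many more examples
train = [
    ({'soccer': True, 'game': True}, 'Sports'),
    ({'cinema': True, 'oscars': True}, 'Movies'),
]
classifier = nltk.NaiveBayesClassifier.train(train)
print classifier.classify({'soccer': True})  # should print 'Sports'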
Return the results
Finally, we will create instances of our classes and call the method to get the results.
#!/usr/bin/env python
from category import Category
from data import Data
# Assuming the Analyzer class lives in analyzer.py,
# following the same convention as category.py and data.py
from analyzer import Analyzer

if __name__ == '__main__':
    # Entry point when run via the Python interpreter.
    # The sample article mentions soccer, game and sports, which is
    # enough keyword matches to pass the category threshold
    data = Data("Some article content over here about a soccer game and other sports")
    category1 = Category("Sports", ["soccer", "game", "sports", "entertainment"])
    category2 = Category("Movies", ["cinema", "oscars", "movies", "imdb"])
    # Create the analyzer class
    analyzer = Analyzer(data, [category1, category2])
    # Call the method
    analyzer.analyze_content_with_nltk()
    # Show the result
    print analyzer.data.categories
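With the thresholds above, a category needs at least three keyword matches. The sample article contains soccer, game, and sports, so analyzer.data.categories should come out as ['Sports'], while Movies gets no matches at all.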
This shows you how to implement simple article categorization using NLTK and Python. I hope it helps someone, and as always, keep learning!