NLTK for Simple Article Categorization

Article categorization has been around for a while, and it improved dramatically with the rise of Machine Learning techniques. However, we do not always need something that complex to categorize an article. Here I will show you how to categorize an article using NLTK in Python 2.7.x with a few dozen lines of code.


We will assume that we receive the article without HTML tags and properly encoded. The steps to solve this problem are the following:

  • Create a couple of classes to hold the needed data in memory
  • Create a util class that uses those classes to process the data
  • Return the results of the categorization

    The main reason to use Python is that it is the de facto language for natural language processing, and the equivalent libraries that exist in .NET are not maintained anymore.

    Create a couple of classes to save needed data

    We will create two classes: the first one called Category and the second one called Data.

    
    class Category:
        def __init__(self, name, keywords):
            self.name = name
            self.keywords = keywords
    
    class Data:
        def __init__(self, article):
            self.categories = None
            self.article = article
    
    

    The first class holds the name of the category and the keywords that identify it; the second one holds the article text and initializes categories to None, to be set later by the analyzer.

    Create a util class to process the data

    This class exposes a method that works on the Data instance and the array of Category instances received in the constructor. With this information we will use NLTK to get the categories in the simplest way possible. This is quite a long class, so it is important to read all the comments for further explanation.

    
    '''
    Analyzer class
    '''
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import RegexpTokenizer
    
    nltk.download('stopwords')
    
    class Analyzer:
        def __init__(self, data, categories):
            self.data = data
            self.categories = categories
         
            # Distribution threshold. How many times the 
            # word must be repeated in the distribution to 
            # be considered as part of the category
            self.threshold_distribution = 1
    
            # Category threshold. How many words is
            # the limit to be considered as part of the category
            self.threshold_category_words = 3
    
        def analize_content_with_nltk(self):
    
            # Tokenizer which will remove extra white space and odd characters
            tokenizer = RegexpTokenizer(r'\w+')
            # Stop word dictionary to remove common words like prepositions
            stop_words = set(stopwords.words('english'))
            # Stemmer to reduce the words to their stem
            stemmer = PorterStemmer()
    
            print "processing..."
    
            # First we will tokenize the article
            article_tokens = tokenizer.tokenize(self.data.article)
            # Remove the stop words in the article
            article_tokens_no_stop_words = [w for w in article_tokens if w.lower() not in stop_words]
    
            print 'article tokens: ' + str(len(article_tokens))
            print 'article tokens no stop words: ' + str(len(article_tokens_no_stop_words))
        
            # Stem the article tokens
            article_tokens_stemming = []
            for word in article_tokens_no_stop_words:
                article_tokens_stemming.append(stemmer.stem(word.lower()))
    
            # Create a frequency distribution for the article's tokens
            article_dist = nltk.FreqDist(article_tokens_stemming)
       
            # Iterate over the categories array
            article_categories = []
            for category in self.categories:  
                # Tokenize the category keywords          
                keywords_tokens = list(set(tokenizer.tokenize(' '.join(category.keywords))))
     
                # If there are no tokens continue with the next category
                if(len(keywords_tokens) == 0):
                    continue
    
                print 'category: ' + category.name
                print 'keywords tokens: ' + str(len(keywords_tokens))
    
                # Stem the category keywords
                keywords_tokens_stemming = []
                for keyword in keywords_tokens:
                    keywords_tokens_stemming.append(stemmer.stem(keyword.lower()))
                     
            # Using the first threshold, verify whether each of the keywords
            # appears in the article distribution. If that is the case, save
            # it to the keywords found array
                keywords_found = []
                for keyword in keywords_tokens_stemming:
                    if(article_dist[keyword] >= self.threshold_distribution):
                        keywords_found.append(keyword)
                    
                print "keywords found: " + str(keywords_found)
    
            # If the number of keywords found in the article distribution
            # meets or passes the second threshold, we will assume the article
            # belongs to this category. We then repeat the process for the next category
                if(len(keywords_found) >= self.threshold_category_words):
                    article_categories.append(category.name)            
       
            print(article_categories)
            self.data.categories = article_categories
    
    

    We used a plain NLP heuristic to solve this task instead of Machine Learning. This of course makes it faster to categorize, but not as accurate. I would definitely go with a Machine Learning approach if you have the resources and time to implement it.
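
    If you do want to try the Machine Learning route, NLTK itself ships a Naive Bayes classifier that could replace the keyword matching. The sketch below is only illustrative: the bag-of-words feature extractor and the tiny training set are invented, and a real system would need far more labeled examples.

```python
import nltk

# Hypothetical bag-of-words feature extractor (illustrative only)
def features(text):
    return {word: True for word in text.lower().split()}

# Tiny invented training set; real usage needs much more data
train_set = [
    (features("soccer game final score"), "Sports"),
    (features("the team won the match"), "Sports"),
    (features("new movie premiere at the cinema"), "Movies"),
    (features("oscars ceremony and film reviews"), "Movies"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
label = classifier.classify(features("an article about a soccer match"))
print(label)
```

    Unlike the threshold approach, the classifier learns which words matter from the data instead of requiring a hand-picked keyword list per category.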

    Return the results

    Finally, we will instantiate our classes and call the method to get the results.

    
    #!/usr/bin/env python
    from category import Category
    from data import Data
    from analyzer import Analyzer
    
    if __name__ == '__main__':
        # Entry point when run via Python interpreter.
        data = Data("Some article content over here about soccer and sports")
        category1 = Category("Sports", ["soccer", "game", "sports", "entertainment"])
        category2 = Category("Movies", ["cinema","oscars","movies", "imdb"])
        
        # Create the analyzer class
        analyzer = Analyzer(data, [category1, category2])
        
        # Call the method
        analyzer.analize_content_with_nltk()
       
        # Show result
        print analyzer.data.categories
    
    

    This shows you how to implement simple article categorization using NLTK and Python. I hope it helps someone and, as always, keep learning!
