RAISE Systems

Creating a word cloud with coronavirus news titles

By Gustavo Rodríguez
at April 7, 2020

Creating a word cloud with data from MongoDB in a Jupyter Notebook

Word clouds can be an interesting way to display data, and they look better when they have a shape. In this example we’re going to show how to make a word cloud whose shape comes from a PNG image.

Requirements

Most importantly, you’ll need a dataset with the news titles. If you don’t have one, send us an email and we’ll provide it.

Python packages

pymongo
numpy
Pillow (imported as PIL), the Python Imaging Library
matplotlib
wordcloud

We’ll also need a black and white PNG image with the shape we want to display. Note that the image should be saved in grayscale mode; otherwise the numpy array built from it may not contain the pixel values the word cloud expects.

The image should look like this (scaled to 700x700)
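
If you are not sure which mode your image was saved in, the following is a minimal sketch of a check using Pillow. The helper name load_grayscale_mask and the small demo image are made up for illustration; in the notebook you would pass the opened PNG instead.

```python
import numpy as np
from PIL import Image


def load_grayscale_mask(image):
    """Return a 2-D uint8 array, converting the image to mode "L" if needed."""
    if image.mode != "L":
        # "L" is Pillow's 8-bit grayscale mode: one value (0-255) per pixel
        image = image.convert("L")
    return np.array(image)


# Hypothetical stand-in for the shape image:
# a white RGBA canvas with a black square in the middle.
demo = Image.new("RGBA", (8, 8), (255, 255, 255, 255))
demo.paste((0, 0, 0, 255), (2, 2, 6, 6))

mask = load_grayscale_mask(demo)
print(mask.shape, mask.dtype)
```

After the conversion the array has one value per pixel, which makes it easy to inspect before handing it to the word cloud.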

Implementation

Launch jupyter notebook from your project’s folder and create a new notebook

First, add the imports

import pymongo
import numpy as np
from os import path
import os
from PIL import Image
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

That’s what we’ll need to work with the Word Cloud

Then, we’ll need to load the image and build a mask for the word cloud

d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()
cmask = np.array(Image.open(path.join(d, "corona1.png")))
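
The wordcloud library treats pure white pixels (value 255) in the mask as masked out and draws words everywhere else. Images with soft, anti-aliased edges often contain almost-white pixels, so it can help to force every pixel to either 0 or 255. The sketch below is one way to do that; binarize_mask and the threshold of 128 are assumptions for illustration, not part of the original notebook.

```python
import numpy as np


def binarize_mask(mask, threshold=128):
    """Map every pixel to 0 (words may be drawn) or 255 (masked out)."""
    return np.where(mask >= threshold, 255, 0).astype(np.uint8)


# Hypothetical 2x3 grayscale mask with some in-between values
demo = np.array([[255, 250, 130],
                 [127, 10, 0]], dtype=np.uint8)

clean = binarize_mask(demo)
print(clean)
```

In the notebook this could be applied as cmask = binarize_mask(cmask) right after loading the image.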

Now we’ll set the STOPWORDS, words we don’t want to show

stopwords = set(STOPWORDS)
stopwords.add("coronavirus")
stopwords.add("will")
stopwords.add("say")
stopwords.add("says")
stopwords.add("now")
stopwords.add("covid")

As you can see, I added some words that I didn’t want to display because they’re not very relevant
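
To get a feel for what the stopwords do, here is a simplified sketch that counts words while skipping the ones in a stopword set. It uses only the standard library, and the sample text and stopword set are made up. The real WordCloud tokenizer is more involved, so treat this as an illustration of the idea rather than a description of the library's internals.

```python
from collections import Counter


def count_words(text, stopwords):
    """Count lowercase words in text, ignoring anything in stopwords."""
    words = text.lower().split()
    return Counter(w for w in words if w not in stopwords)


# Hypothetical stopwords and headline text
demo_stopwords = {"the", "will", "says", "coronavirus"}
demo_text = "Coronavirus cases rise WHO says cases will rise"

counts = count_words(demo_text, demo_stopwords)
print(counts.most_common(2))
```

Words that appear in almost every title, such as "coronavirus" in this dataset, would otherwise dominate the image and hide the more informative terms.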

Now we’ll define the class that we’ll use for this task

class TextUtil:
    """Class for working with text stored in MongoDB"""

    def __init__(self, load_amount):
        """Initialize the class with the amount of posts we want to load,
           the amount of text and the MongoDB connection"""
        self.load_amount = load_amount
        self.text = ""
        self.client = pymongo.MongoClient(
            "mongodb://ADDRESS:27017/admin",
            username='USER',
            password='PASSWORD'
        )

Note: You need to edit the connection parameters to be able to connect to the MongoDB database

The constructor has one parameter, load_amount, which sets the number of posts we want to load from the database.

We also set a new field for this class which is the text, for storing the loaded text from the database, and the pymongo client connection
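
If you would rather not keep credentials in the notebook, one option is to read them from environment variables. This is only a sketch: the variable names MONGO_ADDRESS, MONGO_USER and MONGO_PASSWORD and the helper mongo_settings are hypothetical, and the demo values are placeholders.

```python
import os


def mongo_settings(env=os.environ):
    """Build keyword arguments for pymongo.MongoClient from environment variables."""
    address = env.get("MONGO_ADDRESS", "localhost")  # hypothetical variable name
    return {
        "host": "mongodb://{}:27017/admin".format(address),
        "username": env.get("MONGO_USER"),        # hypothetical variable name
        "password": env.get("MONGO_PASSWORD"),    # hypothetical variable name
    }


# Demo with a plain dict instead of the real environment
settings = mongo_settings({
    "MONGO_ADDRESS": "db.example.org",
    "MONGO_USER": "reader",
    "MONGO_PASSWORD": "secret",
})
print(settings["host"])
```

Inside the constructor, the client could then be created with pymongo.MongoClient(**mongo_settings()).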

Now we need to add some methods to the class to work with the data, starting with get_posts. In the notebook these methods go inside the body of TextUtil, indented one level.

def get_posts(self):
    """Set the posts from the database"""
    db = self.client["chat"]
    posts = db["posts"].find({}, {"title": 1}).sort(
        "timestamp", pymongo.ASCENDING).limit(self.load_amount)
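    for p in posts:
        # Normalize whitespace, lowercase, and add a trailing space so the
        # last word of one title doesn't merge with the first of the next
        self.text += " ".join(p["title"].split()).lower() + " "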
    print("Loaded posts") # To show the progress

That method will get the posts from the MongoDB database and set them on the text class field for processing by the word cloud
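
The text-building step can be tried without a database. The sketch below uses plain dictionaries as stand-ins for MongoDB documents and collects the titles in a list before joining them once, which also keeps a space between consecutive titles. The function name titles_to_text and the sample posts are made up for illustration.

```python
def titles_to_text(posts):
    """Join the titles of an iterable of post documents into one lowercase string."""
    titles = []
    for post in posts:
        title = post.get("title")
        if title:  # skip documents that have no title
            # split() without arguments also collapses repeated whitespace
            titles.append(" ".join(title.split()).lower())
    return " ".join(titles)


# Hypothetical documents, shaped like the ones returned by find()
demo_posts = [
    {"title": "Cases  rise in Spain"},
    {"_id": 2},
    {"title": "Lockdown extended"},
]

text = titles_to_text(demo_posts)
print(text)
```

Skipping documents without a title avoids a KeyError if some posts in the collection are incomplete.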

We’re going to need a final method to process the data and draw the image

def draw_image(self, filename):
    """Draws an image with the text"""
    print("Drawing image")
    wc = WordCloud(
        stopwords=stopwords,
        background_color="white",
        max_words=1000,
        mask=cmask,
        contour_width=3,
        contour_color='steelblue')
    wc.generate(self.text)
    wc.to_file(filename)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.figure()
    plt.imshow(cmask, cmap=plt.cm.gray, interpolation='bilinear')
    plt.axis("off")
    plt.show()

Finally we can create a new instance of our class with the desired amount of posts we want from the database

t_handler = TextUtil(10000)  # Amount of posts

And use it for getting the posts and drawing the image

t_handler.get_posts()
t_handler.draw_image("coronares.png")

Now, in the same directory, we should have a new file coronares.png which should be our desired image

You can find the code of the whole notebook here
