Using Vectorized Operations to Analyze Data Fast
The core building blocks in every programmer's toolbox are loops and conditionals – usually materialized as a for loop or an if statement, respectively. Almost any programming problem, in its most fundamental form, can be broken down into a series of conditional operations (do something only if a specific condition is met) and a series of iterative operations (keep doing the same thing until a condition is met).
In machine learning, vectors, matrices, and tensors become the basic building blocks, taking over from arrays and linked lists. When we are manipulating and analyzing matrices, we often want to apply a single operation or function to the entire matrix.
Programmers coming from a traditional computer science background will often reach for a for loop or a while loop to do this kind of analysis or manipulation, but these loops are inefficient for the task.
Instead, it is important to become comfortable with vectorized operations. Nearly all modern processors can modify matrices and vectors efficiently in parallel, applying the same operation to many elements simultaneously.
Similarly, many software packages are optimized for exactly this use case: applying the same operator to many rows of a matrix.
But if you are used to writing for loops, the habit can be hard to break. So, we will compare the for loop with the vectorized operation to show why the for loop is worth avoiding. In the next exercise, we'll use our headlines dataset again and do some basic analysis. We'll do each piece of analysis twice: first using a for loop, and then again using a vectorized operation. You'll see the speed differences even on this relatively small dataset, and these differences will matter even more on the larger datasets that we previously discussed.
While some languages have great support for vectorized operations out of the box, Python relies mainly on third-party libraries to take advantage of them. We'll be using pandas in the upcoming exercise.
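To see the difference on a tiny, made-up example before working with the real dataset, here is a minimal sketch (the Series and its values are purely illustrative) contrasting a for loop with the equivalent vectorized operation:
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5])

# Loop-based approach: visit each element one at a time
squares_loop = []
for value in numbers:
    squares_loop.append(value ** 2)

# Vectorized approach: pandas applies the operation to every element at once
squares_vectorized = numbers ** 2

print(squares_loop)                 # [1, 4, 9, 16, 25]
print(squares_vectorized.tolist())  # [1, 4, 9, 16, 25]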
Exercise 1.02: Applying Vectorized Operations to Entire Matrices
In this exercise, we'll use the pandas library to load the same clickbait dataset and we'll carry out some descriptive analysis. We'll do each piece of analysis twice to see the efficiency gains of using vectorized operations compared to for loops.
Perform the following steps to complete the exercise:
- Create a new directory, Exercise01.02, in the Chapter01 directory to store the files for this exercise.
- Open your Terminal (macOS or Linux) or Command Prompt (Windows), navigate to the Chapter01 directory, and type jupyter notebook.
- In the Jupyter notebook, click the Exercise01.02 directory and create a new notebook file with a Python3 kernel.
- Import the pandas library and use it to read the dataset file into a DataFrame, as shown in the following code:
import pandas as pd
df = pd.read_csv("../Datasets/clickbait-headlines.tsv", \
                 sep="\t", names=["Headline", "Label"])
df
You should get the following output:
We import the pandas library and then use the read_csv() function to read the file into a DataFrame called df. We pass the sep argument to indicate that the file uses tab (\t) characters as separators and then pass in the column names as the names argument. The output is summarized to show only the first few entries and the last few, followed by a description of how many rows and columns there are in the entire DataFrame.
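If you want to inspect the DataFrame further, the following optional checks (not required for the exercise) show its dimensions and the first few rows:
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows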
- Calculate the length of each headline and print out the first 10 lengths using a for loop, along with the total performance timing, as shown in the following code:
%%time
lengths = []
for i, row in df.iterrows():
    lengths.append(len(row[0]))
print(lengths[:10])
You should get the following output:
[42, 60, 72, 49, 66, 51, 51, 58, 57, 76]
CPU times: user 1.82 s, sys: 50.8 ms, total: 1.87 s
Wall time: 1.95 s
We declare an empty array to store the lengths, then loop through each row in our DataFrame using the iterrows() method. We append the length of the first item of each row (the headline) to our array, and finally, print out the first 10 results.
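As an aside, if you do need a loop, itertuples() is usually faster than iterrows() because it avoids building a full Series for every row. The following sketch (still loop-based, and still slower than the vectorized version in the next step) performs the same calculation:
# Loop-based alternative using itertuples(); row.Headline is the headline text
lengths = [len(row.Headline) for row in df.itertuples(index=False)]
print(lengths[:10])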
- Now re-calculate the length of each row, but this time using vectorized operations, as shown in the following code:
%%time
lengths = df['Headline'].apply(len)
print(lengths[:10])
You should get the following output:
0 42
1 60
2 72
3 49
4 66
5 51
6 51
7 58
8 57
9 76
Name: Headline, dtype: int64
CPU times: user 6.31 ms, sys: 1.7 ms, total: 8.01 ms
Wall time: 7.76 ms
We use the apply() function to apply len to every row in our DataFrame, without a for loop. Then we print the results to verify they are the same as when we used the for loop. From the output, we can see that the results are indeed the same, but this time it took only around 8 milliseconds instead of nearly 2 seconds to carry out all of these calculations.
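As an optional alternative, pandas' string accessor offers an equivalent vectorized way to get the same lengths:
# Equivalent vectorized calculation using the .str accessor
lengths = df['Headline'].str.len()
print(lengths[:10])
Now, let's try a different calculation.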
- This time, find the average length of all clickbait headlines and compare this average to the length of normal headlines, as shown in the following code:
%%time
from statistics import mean
normal_lengths = []
clickbait_lengths = []
for i, row in df.iterrows():
    if row[1] == 1:  # clickbait
        clickbait_lengths.append(len(row[0]))
    else:
        normal_lengths.append(len(row[0]))
print("Mean normal length is {}"\
      .format(mean(normal_lengths)))
print("Mean clickbait length is {}"\
      .format(mean(clickbait_lengths)))
Note
The # symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic.
You should get the following output:
Mean normal length is 52.0322
Mean clickbait length is 55.6876
CPU times: user 1.91 s, sys: 40.7 ms, total: 1.95 s
Wall time: 2.03 s
We import the mean function from the statistics library. This time, we set up two empty arrays, one for the lengths of normal headlines and one for the lengths of clickbait headlines. We use the iterrows() function again to check every row and calculate the length, but this time store the result in one of our two arrays, based on whether the headline is clickbait or not. We then take the average of each array and print it out.
- Now recalculate this output using vectorized operations, as shown in the following code:
%%time
print(df[df["Label"] == 0]['Headline'].apply(len).mean())
print(df[df["Label"] == 1]['Headline'].apply(len).mean())
You should get the following output:
52.0322
55.6876
CPU times: user 10.5 ms, sys: 3.14 ms, total: 13.7 ms
Wall time: 14 ms
In each line, we look at only a subset of the DataFrame: first when the label is 0, and second when it is 1. We again apply the len function to each row that matches the condition and then take the average of the entire result. We confirm that the output is the same as before, but the overall time is in milliseconds in this case.
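The same comparison can also be written as a single groupby, which computes each headline's length once and then averages the lengths per label. This is an optional alternative, not part of the exercise:
# Average headline length per label (0 = normal, 1 = clickbait)
print(df['Headline'].str.len().groupby(df['Label']).mean())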
- As a final test, calculate how often the word "you" appears in each kind of headline, as shown in the following code:
%%time
normal_yous = 0
clickbait_yous = 0
for i, row in df.iterrows():
    num_yous = row[0].lower().count("you")
    if row[1] == 1:  # clickbait
        clickbait_yous += num_yous
    else:
        normal_yous += num_yous
print("Total 'you's in normal headlines {}".format(normal_yous))
print("Total 'you's in clickbait headlines {}".format(clickbait_yous))
You should get the following output:
Total 'you's in normal headlines 43
Total 'you's in clickbait headlines 2527
CPU times: user 1.48 s, sys: 8.84 ms, total: 1.49 s
Wall time: 1.53 s
We define two variables, normal_yous and clickbait_yous, to count the total occurrences of the word "you" in each class of headline. We loop through the entire dataset again using a for loop and the iterrows() function. For each row, we use the count() function to count how often the word "you" appears and add that number to the relevant total. Finally, we print out both results, seeing that "you" appears very often in clickbait headlines, but hardly ever in normal headlines.
- Rerun the same analysis without using a for loop and compare the time, as shown in the following code:
%%time
print(df[df["Label"] == 0]['Headline']\
      .apply(lambda x: x.lower().count("you")).sum())
print(df[df["Label"] == 1]['Headline']\
      .apply(lambda x: x.lower().count("you")).sum())
You should get the following output:
43
2527
CPU times: user 20.8 ms, sys: 1.32 ms, total: 22.1 ms
Wall time: 27.9 ms
We break the dataset into two subsets and apply the same operation to each. This time, our function is a bit more complicated than the len function we used before, so we define an anonymous function inline using lambda. We lowercase each headline, count how often "you" appears, and then sum the results. Once again, the runtime is measured in milliseconds.
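An optional groupby-based alternative produces the same totals in one expression: count "you" in every lowercased headline, then sum the counts per label:
# Total occurrences of "you" per label (0 = normal, 1 = clickbait)
you_counts = df['Headline'].str.lower().str.count("you")
print(you_counts.groupby(df['Label']).sum())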
Note
To access the source code for this specific section, please refer to https://packt.live/2OmyEE2.
In this exercise, the main takeaway is that vectorized operations can be many times faster than for loops. We also learned some interesting things about clickbait along the way. For example, the word "you" appears very often in clickbait headlines (2,527 times), but hardly ever in normal headlines (43 times). Clickbait headlines are also, on average, slightly longer than non-clickbait headlines.
Let's implement the concepts learned so far in the next activity.
Activity 1.01: Creating a Text Classifier for Movie Reviews
In this activity, we will create another text classifier. Instead of training a machine learning model to discriminate between clickbait headlines and normal headlines, we will train a similar classifier to discriminate between positive and negative movie reviews.
The objectives of our activity are as follows:
- Vectorize the text of IMDb movie reviews and label these as positive or negative.
- Train an SVM classifier to predict whether a movie review is positive or negative.
- Check how accurate our classifier is on a held-out test set.
- Evaluate our classifier on out-of-context data.
Note
We will be using some randomizers in this activity. It is helpful to set the global random seeds to ensure that the results you see are the same as in the examples. Sklearn uses the NumPy random seed, and we will also use the shuffle function from the built-in random library. You can ensure you see the same results by adding the following code:
import random
import numpy as np
random.seed(1337)
np.random.seed(1337)
We'll use the aclImdb dataset of 100,000 movie reviews from the Internet Movie Database (IMDb). Of these, 50,000 are labeled – 25,000 for training and 25,000 for testing – and each of those sets is split evenly between positive and negative reviews, so this is a larger dataset than our headlines one. The dataset can be found in our GitHub repository at the following location: https://packt.live/2C72sBN
You need to download the aclImdb folder from the GitHub repository.
Dataset Citation: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
In Exercise 1.01, Training a Machine Learning Model to Identify Clickbait Headlines, we had one file, with each line representing a different data item. Now we have a file for each data item, so keep in mind that we'll need to restructure some of our training code accordingly.
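As a rough illustration of the "one file per item" pattern (the function name and directory argument here are hypothetical; the real paths come from the aclImdb folder you downloaded), reading a directory of text files might look something like this:
import os

def read_files_in_directory(directory):
    """Read every file in a directory into a list of strings."""
    texts = []
    for filename in os.listdir(directory):
        # Build the full path and read the file's contents
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            texts.append(f.read())
    return texts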
Note
The code and the resulting output for this exercise have been loaded in a Jupyter notebook that can be found at https://packt.live/3iWYZGH.
Perform the following steps to complete the activity:
- Import the os library and the random library, and define where our training and test data is stored using four variables: one for training_positive, one for training_negative, one for test_positive, and one for test_negative, each pointing at the respective dataset subdirectory.
- Define a read_dataset function that takes a path to a dataset directory and a label (either pos or neg), reads the contents of each file in that directory, and adds these contents to a data structure that is a list of tuples. Each tuple contains the text of one file and the label, pos or neg. An example is shown as follows. The actual data should be read from disk instead of being defined in code:
contents_labels = [('this is the text from one of the files', 'pos'), ('this is another text', 'pos')]
- Use the read_dataset function to read each dataset into its own variable. You should have four variables in total: train_pos, train_neg, test_pos, and test_neg, each of which is a list of tuples containing the relevant texts and labels.
- Combine the train_pos and train_neg datasets. Do the same for the test_pos and test_neg datasets.
- Use the random.shuffle function to shuffle the train and test datasets separately. This gives us datasets where the training data is mixed up, instead of feeding all the positive and then all the negative examples to the classifier in order.
- Split each of the train and test datasets back into data and labels respectively. You should have four variables again called train_data, y_train, test_data, and y_test where the y prefix indicates that the respective array contains labels.
- Import TfidfVectorizer from sklearn, initialize an instance of it, fit the vectorizer on the training data, and vectorize both the training and testing data into the X_train and X_test variables respectively. Time how long this takes and print out the shape of the training vectors at the end.
- Again timing the execution, import LinearSVC from sklearn and initialize an instance of it. Fit the SVM on the training data and training labels, and then generate predictions on the test data (X_test).
- Import accuracy_score and classification_report from sklearn and calculate the results of your predictions. You should get the following output:
- See how your classifier performs on data from a different domain. Create two restaurant reviews as follows:
good_review = "The restaurant was really great! "\
"I ate wonderful food and had a very good time"
bad_review = "The restaurant was awful. "\
"The staff were rude and "\
"the food was horrible. "\
"I hated it"
- Now vectorize each using the same vectorizer and generate predictions for whether each one is negative or positive. Did your classifier guess correctly?
Now that we've built two machine learning models and gained some hands-on experience with vectorized operations, it's time to recap.
Note
The solution to this activity can be found on page 580.