Loading the dataset

As we did previously, we load our dataset by defining our training instances and labels, along with our test instances and labels. The load_data function that Keras provides for the IMDB dataset loads the pre-processed data in a 50/50 train–test split. Its num_words argument lets us specify how many of the most frequently occurring words to keep, which controls the complexity of the task and keeps our review vectors at a reasonable size. It is safe to assume that rare words in a review relate more to the specific subject matter of the movie than to the reviewer's sentiment, so they have little influence on the sentiment of the review in question. For this reason, we limit the vocabulary to the 12,000 most frequent words.
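A minimal sketch of this step, using the load_data function from keras.datasets.imdb (the variable names train_data, train_labels, test_data, and test_labels are our own choices):

```python
from tensorflow.keras.datasets import imdb

# Keep only the 12,000 most frequent words; rarer words (which tend to
# reflect a movie's specific subject matter rather than sentiment) are
# replaced by an out-of-vocabulary token.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=12000
)

print(len(train_data), len(test_data))  # 25,000 reviews each: a 50/50 split
print(train_labels[:5])                 # labels are 0 (negative) or 1 (positive)
```

Each review comes back as a list of integer word indices, all of which stay below the num_words cutoff, so downstream vectorization can work with fixed-size vocabularies.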