Scikit-Learn API

The objective of the scikit-learn API is to provide an efficient and unified syntax to make ML accessible to non-ML experts, as well as to facilitate and popularize its use among several industries.

How Does It Work?

Although it has many contributors, the scikit-learn API has been built and maintained around a set of principles that prevent framework code proliferation, in which different pieces of code perform similar functions. Instead, it promotes simple conventions and consistency. As a result, the scikit-learn API is consistent across all models: once the main functionalities have been learned, they can be applied widely.

The scikit-learn API is divided into three complementary interfaces that share a common syntax and logic: the estimator, the predictor, and the transformer. The estimator interface is used for creating models and fitting them to the data; the predictor, as its name suggests, is used to make predictions based on models that were trained previously; and finally, the transformer is used for converting data.

Estimator

This is considered to be the core of the entire API, as it is the interface in charge of fitting the models to the input data. It works by instantiating the model to be used and then calling its fit() method, which triggers the learning process so that it builds a model based on the data.

The fit() method receives the training data as two separate arguments: the features matrix and the target values (conventionally called X_train and Y_train). For unsupervised models, this method only takes the first argument (X_train).
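
The difference between the two calls can be sketched as follows. The toy arrays and the choice of KMeans as the unsupervised model are assumptions made purely for illustration; they are not part of the original example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 6 samples, 2 features, two well-separated groups.
X_train = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]])
Y_train = np.array([0, 0, 0, 1, 1, 1])

# Supervised model: fit() takes both the features and the targets.
classifier = GaussianNB()
classifier.fit(X_train, Y_train)

# Unsupervised model: fit() takes only the features.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(X_train)

print(X_train.shape, Y_train.shape)
```

Note that X_train has one row per instance and one column per feature, while Y_train holds one target value per instance.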

This method creates the model trained to the input data, which can later be used for predicting.

Besides the training data, models also take other arguments, called hyperparameters, which are set when the model is instantiated. These hyperparameters are initially set to their default values but can be tuned to improve the performance of the model, which will be discussed in later sections.
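
As a minimal sketch of how hyperparameters are set, the snippet below instantiates the same algorithm twice, once with defaults and once with one value overridden. The value 1e-8 is arbitrary and chosen only for illustration:

```python
from sklearn.naive_bayes import GaussianNB

# Model created with its default hyperparameters.
default_model = GaussianNB()
print(default_model.get_params())

# Same algorithm with one hyperparameter overridden (arbitrary value).
tuned_model = GaussianNB(var_smoothing=1e-8)
print(tuned_model.get_params()["var_smoothing"])
```

The get_params() method returns the current hyperparameters as a dictionary, which is a convenient way to check which values a model is using.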

The following is an example of a model being trained:

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(X_train, Y_train)

First, you need to import the type of algorithm to be used from scikit-learn; for example, a Gaussian Naïve Bayes algorithm (which will be further explored in Chapter 4, Supervised Learning Algorithms: Predicting Annual Income) for classification. It is good practice to import only the algorithm to be used, and not the entire library, as this keeps the namespace small and makes it clear which parts of the library the code depends on.

Note

To find the syntax for importing a different model, use the documentation of scikit-learn. Go to the following link, click the algorithm that you wish to implement, and you will find the instructions there: http://scikit-learn.org/stable/user_guide.html.

The second line of code instantiates the model and stores it in a variable. Lastly, the model is fitted to the input data.
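
The snippet above assumes that X_train and Y_train already exist. A self-contained version might look like the following sketch, where the Iris dataset and the 80/20 split are assumptions used only to make the code runnable:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed data source: the Iris dataset bundled with scikit-learn.
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

model = GaussianNB()
model.fit(X_train, Y_train)

# After fitting, learned values are stored in attributes ending in "_".
print(model.classes_)      # class labels seen during training
print(model.theta_.shape)  # per-class feature means: (n_classes, n_features)
```

Attributes with a trailing underscore only exist after fit() has been called, which is how scikit-learn distinguishes learned values from hyperparameters.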

In addition to this, the estimator also offers other complementary tasks, as follows:

  • Feature extraction, which involves transforming input data into numerical features that can be used for ML purposes.
  • Feature selection, which selects the features in your data that contribute to the prediction output of the model.
  • Dimensionality reduction, which takes high-dimensional data and converts it into a lower dimension.
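
Two of these tasks can be sketched with the same fit() convention. The choice of SelectKBest and PCA, the Iris dataset, and the number of features kept are all assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Assumed data source: the Iris dataset (150 samples, 4 features).
X, Y = load_iris(return_X_y=True)

# Feature selection: keep the 2 features most related to the target.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X, Y)
X_selected = selector.transform(X)

# Dimensionality reduction: project the 4 features onto 2 components.
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)

print(X.shape, X_selected.shape, X_reduced.shape)
```

Both objects are fit first and then used to convert the data, so they follow the same pattern as any other estimator.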

Predictor

As explained previously, the predictor takes the model created by the estimator and uses it to perform predictions on unseen data. In general terms, for supervised models, it feeds the model a new set of data, usually called X_test, to get a corresponding target or label based on the parameters that were learned while training the model.

Moreover, some unsupervised models can also benefit from the predictor. While this method does not output a specific target value, it can be useful to assign a new instance to a cluster.
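
For the unsupervised case, a minimal sketch could use KMeans to assign new instances to clusters. The model choice and the toy data are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training data forming two well-separated groups.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                    [9.0, 9.0], [9.1, 8.8], [8.9, 9.2]])

clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(X_train)

# New, unseen instances: one near each group.
X_new = np.array([[1.1, 1.0], [9.0, 9.1]])
clusters = clusterer.predict(X_new)

print(clusters)  # cluster index assigned to each new instance
```

The returned values are cluster indices rather than target labels, so their numbering carries no meaning beyond identifying the group.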

Following the preceding example, the implementation of the predictor can be seen as follows:

Y_pred = model.predict(X_test)

We apply the predict() method to the previously trained model and input the new data as an argument to the method.

In addition to predicting, the predictor can also implement methods that quantify the quality of the predictions, such as score(), which returns a numeric value representative of the level of performance of the model. These performance measures vary from model to model, but their main objective is to determine how far the predictions are from reality. This is done by taking an X_test with its corresponding Y_test and comparing Y_test to the predictions made with the same X_test.
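
As a sketch of this comparison for a classifier, the snippet below computes the accuracy in two equivalent ways and also retrieves per-class probabilities. The Iris dataset and the split are assumptions used to make the example runnable:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

model = GaussianNB()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Compare the predictions against the known targets.
accuracy = accuracy_score(Y_test, Y_pred)

# For classifiers, score() predicts on X_test and returns the mean accuracy.
score = model.score(X_test, Y_test)

# Probability assigned to each class, one row per test instance.
probabilities = model.predict_proba(X_test)

print(accuracy, score, probabilities.shape)
```

Not every model offers predict_proba(), and score() returns a different measure for regressors, so it is worth checking the documentation of the specific model.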

Transformer

As we saw previously, data is usually transformed before being fed to a model. Considering this, the API contains a transform() method that allows you to perform some preprocessing techniques.

It can be used both as a starting point to transform the input data of the model (X_train) and further along to modify data that will be fed to the model for predictions. This latter application is crucial to get accurate results, as it ensures that the new data is transformed in the same way as the data that was used to train the model.

The following is an example of a transformer that standardizes the values of the training data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)

The StandardScaler class standardizes the data that it receives as an argument. As you can see, after importing and instantiating the transformer (that is, StandardScaler), it needs to be fit to the data before it can transform it. The same fitted scaler can then be applied to the test data:

X_test = scaler.transform(X_test)

The advantage of the transformer is that once it has been fit to the training dataset, it stores the values used for transforming the training data (for StandardScaler, the mean and standard deviation of each feature); these can be used to transform the test dataset in the same way, as seen in the preceding snippet.
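
The stored values can be inspected directly. In the following sketch, the toy arrays are hypothetical, and the manual calculation is included only to show what transform() does with the stored statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 training samples and 1 test sample, 2 features each.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)

# Statistics learned from the training data only.
print(scaler.mean_)   # per-feature mean
print(scaler.scale_)  # per-feature standard deviation

# transform() reuses those training statistics on the test data.
X_test_scaled = scaler.transform(X_test)
manual = (X_test - scaler.mean_) / scaler.scale_

print(np.allclose(X_test_scaled, manual))
```

Because the test data is scaled with the training mean and standard deviation, the scaler should never be fit again on X_test.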

In conclusion, we discussed one of the main benefits of using scikit-learn, which is its API. This API follows a consistent structure that makes it easy for non-experts to apply ML algorithms.

To train a model with scikit-learn, the first step is to instantiate the model's class and fit it to the input data using the estimator interface, which is done by calling the fit() method of the class. Once the model has been trained, it is possible to predict new values using the predictor by calling the predict() method of the class.

Additionally, scikit-learn has a transformer interface that allows you to transform data as needed. This is useful for applying preprocessing techniques to the training data, which can then also be applied to the testing data so that both are transformed in the same way.
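
Putting the three interfaces together, a minimal end-to-end sketch might look as follows. The Iris dataset and the 80/20 split are assumptions; the scaler and the model are the ones used earlier in this section:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Transformer: fit on the training data, then apply to both sets.
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Estimator: instantiate the model and fit it to the training data.
model = GaussianNB()
model.fit(X_train, Y_train)

# Predictor: obtain predictions for the unseen data.
Y_pred = model.predict(X_test)

print(Y_pred.shape)
```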