Machine Learning for Mobile
Revathi Gopalakrishnan, Avinash Venkateswarlu
Evaluation of the model
There should be a process to test the machine learning algorithms we have chosen, to discover whether they are the right ones, and to validate the output they produce against the problem statement.
This is the last step in the machine learning process, where we check the model's accuracy against the threshold defined in the success criteria. If the accuracy is greater than or equal to the threshold, we are done. If not, we need to start over with a different machine learning algorithm, different parameter settings, more data, or changed data transformations. The entire machine learning process, or any subset of its steps, can be repeated until we meet the definition of "done" and are satisfied with the results.
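As a rough illustration of this check-and-iterate loop, the sketch below fits a couple of candidate algorithms and compares each one's accuracy against a threshold. It is a minimal sketch only: the scikit-learn estimators, the breast cancer dataset, and the 0.85 success threshold are assumptions chosen for the example, not values prescribed by the process.

```python
# Minimal sketch: evaluate candidate models against a success threshold.
# The dataset, the candidate algorithms, and the 0.85 threshold are
# illustrative assumptions, not part of the process definition.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

SUCCESS_THRESHOLD = 0.85  # hypothetical "definition of done"

candidates = [
    LogisticRegression(max_iter=5000),
    DecisionTreeClassifier(random_state=42),
]

for model in candidates:
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: accuracy = {accuracy:.3f}")
    if accuracy >= SUCCESS_THRESHOLD:
        print("Threshold met: evaluation is done.")
        break
else:
    # No candidate met the threshold: revisit the algorithm choice,
    # parameters, data volume, or data transformations and repeat.
    print("No candidate met the threshold; iterate again.")
```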
Each step may also require several iterations; in particular, the data preparation and model selection steps often go through many rounds. Any activity in the sequence described for performing machine learning can be repeated any number of times. For example, it is common to try different machine learning algorithms to train the model before moving on to testing it. It is therefore important to recognize that this is a highly iterative process, not a linear one.
Test set creation: We have to define the test dataset clearly. The goals of the test dataset are as follows (a split sketch follows this list):
- Quickly and consistently test the algorithm that has been selected to solve the problem
- Test a variety of algorithms to determine whether they are able to solve the problem
- Determine which algorithm would be worth using to solve the problem
- Determine whether there is a problem with the data used for evaluation: if all algorithms consistently fail to produce proper results, the data itself might need to be revisited
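The following is a minimal sketch of carving out such a test set with a stratified hold-out split; the iris data, the 80/20 split, and stratification on the target column are illustrative assumptions rather than requirements.

```python
# Minimal sketch: create a held-out test set with a stratified split
# so the test set keeps the same class proportions as the full data.
# The iris dataset and the 20% test size are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

df = load_iris(as_frame=True).frame  # features plus a "target" column

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

print(f"Training rows: {len(train_df)}, test rows: {len(test_df)}")
print("Test set class balance:")
print(test_df["target"].value_counts(normalize=True))
```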
Performance measure: The performance measure is a way to evaluate the model created. Different machine learning models need to be evaluated with different performance metrics. These are standard measures from which we can choose to test our model; there is usually no need to customize them for our model.
The following are some of the important terms you need to know to understand how the performance of an algorithm is measured:
- Overfitting: The machine learning model is overfitting the training data when we see that the model performs well on the training data but does not perform well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.
- Underfitting: The machine learning model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).
- Cross-validation: Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equally sized subsamples; each subsample is used once as the test set while the remaining k-1 subsamples form the training set, and the k results are averaged (the cross-validation sketch at the end of this list demonstrates this, along with a simple overfitting check).
- Confusion matrix: In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a table layout that allows visualization of the performance of an algorithm; typically, each row represents the instances of an actual class and each column the instances of a predicted class.
- Bias: Bias is the tendency of a model to make the same kind of systematic error regardless of the training data it sees; a high-bias model makes overly simple assumptions about the relationship between the inputs and the target and tends to underfit.
- Variance: Variance is the tendency of a model's predictions to change when it is trained on different samples of the data; a high-variance model is overly sensitive to the particular training set it saw and tends to overfit.
- Accuracy: The number of correct results divided by the total number of results.
- Error: The number of incorrect results divided by the total number of results.
- Precision: The number of correct results returned by a machine learning algorithm divided by the total number of results it returned.
- Recall: The number of correct results returned by a machine learning algorithm divided by the number of results that should have been returned (the metrics sketch at the end of this list computes these quantities, along with a confusion matrix).
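The cross-validation sketch referenced above is given here. It runs k-fold cross-validation and compares the mean training score with the mean validation score as a rough overfitting/underfitting signal; the 5-fold setup, the decision-tree model, and the 0.05 gap used to flag overfitting are assumptions made for the example.

```python
# Minimal sketch: k-fold cross-validation plus a rough check for
# overfitting (large train/validation gap) or underfitting (low scores).
# The dataset, model, fold count, and 0.05 gap are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

scores = cross_validate(
    model, X, y, cv=5, scoring="accuracy", return_train_score=True
)

train_acc = scores["train_score"].mean()
val_acc = scores["test_score"].mean()
print(f"Mean training accuracy:   {train_acc:.3f}")
print(f"Mean validation accuracy: {val_acc:.3f}")

if train_acc - val_acc > 0.05:
    print("Possible overfitting: the model memorizes the training folds.")
elif train_acc < 0.80:
    print("Possible underfitting: the model misses the input-target relationship.")
else:
    print("Training and validation scores are reasonably close.")
```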
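The metrics sketch referenced above follows. It prints a confusion matrix and computes accuracy, error, precision, and recall with scikit-learn; the binary classification dataset and the logistic regression model are assumptions chosen only to make the example runnable.

```python
# Minimal sketch: confusion matrix, accuracy, error, precision, and recall
# for a binary classifier. The dataset and model are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Confusion matrix (rows = actual, columns = predicted):")
print(confusion_matrix(y_test, y_pred))

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)  # correct positives / all returned positives
recall = recall_score(y_test, y_pred)        # correct positives / positives that should be returned

print(f"Accuracy:  {accuracy:.3f}")      # correct results / total results
print(f"Error:     {1 - accuracy:.3f}")  # incorrect results / total results
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```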