Introduction to Machine Learning

Machine learning (ML) is a subset of Artificial Intelligence (AI) that consists of a wide variety of algorithms capable of learning from the data that is being fed to them, without being specifically programmed for a task. This ability to learn from data allows the algorithms to create models that are capable of solving complex data problems by finding patterns in historical data and improving them as new data is fed to the models.

These different ML algorithms use different approximations to solve a task (such as probability functions), but the key element is that they are able to consider a countless number of variables for a particular data problem, making the final model better at solving the task than humans are. The models that are created using ML algorithms are created to find patterns in the input data so that those patterns can be used to make informed predictions in the future.

Applications of ML

Some of the popular tasks that can be solved using ML algorithms are price/demand predictions, product/service recommendation, and data filtering, among others. The following is a list of real-life examples of such tasks:

  • On-demand price prediction: Companies whose services vary in price according to demand can use ML algorithms to predict future demand and determine whether they will have the capability to meet it. For instance, in the transportation industry, if future demand is low (low season), the price for flights will drop. On the other hand, is demand is high (high season), flights are likely to increase in price.
  • Recommendations in entertainment: Using the music that you currently use, as well as that of the people similar to you, ML algorithms can construct models capable of suggesting new records that you may like. That is also the case of video streaming applications, as well as online bookstores.
  • Email filtering: ML has been used for a while now in the process of filtering incoming emails in order to separate spam from your desired emails. Lately, it also has the capability to sort unwanted emails into more categories, such as social and promotions.

Choosing the Right ML Algorithm

When it comes to developing ML solutions, it is important to highlight that, more often than not, there is no one solution for a data problem, much like there is no algorithm that fits all data problems. According to this and considering that there is a large quantity of algorithms in the field of ML, choosing the right one for a certain data problem is often the turning point that separates outstanding models from mediocre ones.

The following steps can help narrow down the algorithms to just a few:

  1. Understand your data: Considering that data is the key to being able to develop any ML solutions, the first step should always be to understand it in order to be able to filter out any algorithm that is unable to process such data.

    For instance, considering the quantity of features and observations in your dataset, it is possible to determine whether an algorithm capable of producing outstanding results with a small dataset is required. The number of instances/features to consider a dataset small depends on the data problem, the quantity of the outputs, and so on. Moreover, by understanding the types of fields in your dataset, you will also be able to determine whether you need an algorithm capable of working with categorical data.

  2. Categorize the data problem: As per the following diagram, in this step, you should analyze your input data to determine if it contains a target feature (a feature whose values you want to be modeled and predicted) or not. Datasets with a target feature are also known as labeled data and are solved using supervised learning (A) algorithms. On the other hand, datasets without a target feature are known as unlabeled data and are solved using unsupervised learning algorithms (B).

    Moreover, the output data (the form of output that you expect from the model) also plays a key role in determining the algorithms to be used. If the output from the model needs to be a continuous number, the task to be solved is a regression problem (C). On the other hand, if the output is a discrete value (a set of categories, for instance), the task at hand is a classification problem (D). Finally, if the output is a subgroup of observations, the process to be performed is a clustering task (E):

    Figure 1.1: Demonstrating the pision of tasks

    This pision of tasks will be explored in more detail in the Supervised and Unsupervised Learning section of this chapter.

  3. Choose a set of algorithms: Once the preceding steps have been performed, it is possible to filter out the algorithms that perform well over the input data and that are able to arrive at the desired outcome. Depending on your resources and time limitations, you should choose from this list of apt algorithms the ones that you want to test out over your data problem, considering that it is always a good practice to try more than one algorithm.

These steps will be explained in more detail in the next chapter using a real-life data problem as an example.