Learning from Big Data

In the first two chapters, we set the context for intelligent machines by introducing the big data revolution and showing how big data is fueling rapid advances in artificial intelligence. We also emphasized the need for a global vocabulary for universal knowledge representation, and saw how ontologies fulfill that need and help construct a semantic view of the world.

The quest is for knowledge, which is derived from information, which is in turn derived from the vast amounts of data that we are generating. Knowledge facilitates a rational decision-making process for machines that complements and augments human capabilities. We have seen how the Resource Description Framework (RDF) provides the schematic backbone for knowledge assets, along with the fundamentals of the Web Ontology Language (OWL) and SPARQL, the query language for RDF.

In this chapter, we are going to look at some of the basic concepts of machine learning and take a deep dive into some of the algorithms. We will use Spark's machine learning libraries. Spark is one of the most popular computing frameworks, both for implementing algorithms and as a generic computation engine on big data. Spark fits well into the big data ecosystem, offers a simple programming interface, and very effectively leverages the power of distributed and resilient computing frameworks. Although this chapter does not assume any background in statistics or mathematics, some programming background will greatly help the reader understand the code snippets and experiment with the examples.

In this chapter, we will look at the two broad categories of machine learning, supervised and unsupervised learning, before taking a deep dive, with examples, into:

  • Regression analysis
  • Data clustering
  • K-means
  • Data dimensionality reduction
  • Singular value decomposition
  • Principal component analysis (PCA)

At the end, we will give an overview of the Spark programming model and Spark's machine learning library (Spark MLlib). With all this background knowledge at our disposal, we will implement a recommendation system to conclude this chapter.