Introduction

This course covers machine learning using the R programming language. Machine learning (ML) is a sub-discipline of artificial intelligence (AI) that focuses on learning patterns and building models from existing training data. In that respect machine learning is closely related to statistics (statistical learning), but it is also an engineering discipline because increasing data volumes demand more sophisticated algorithms and infrastructure.

Machine learning has emerged from computer science and artificial intelligence to create and apply sophisticated algorithms and models without the need for explicit (hard-coded) rules and instructions. Instead, the models extract rules and patterns from data and can be applied to problems that were long thought to be solvable only by humans. Compared to other classical fields of artificial intelligence, which aim to model human intelligence and general problem solving, machine learning algorithms are typically applied to much narrower domains.

Deep Learning solely covers neural networks - especially ones with multiple hidden layers and novel architectures. Deep Learning is not part of this course but will be covered in the module Deep Learning with R.

Applications

The application of and research into machine learning techniques have grown steadily over the past decades and have accelerated sharply over the past five years. The state of the art has typically been measured through human vs machine competitions. The first breakthrough was reported when Deep Blue won against the world chess champion Garry Kasparov in 1997, after having lost the first match in 1996.

Garry Kasparov playing against Deep Blue (Source: Wikipedia)

The latest human vs machine competition to draw significant media attention was the win of AlphaGo (Google DeepMind) against the 9-dan professional player Lee Sedol in 2016, the first time a computer had beaten such a strong human Go player.

First 99 moves (Source: Wikipedia)

However, despite the significant achievements of machine/statistical learning in playing games like chess and Go, we often overlook how many popular products of our daily life are based on machine learning, such as the Google search engine or the recommendation systems of Amazon and Netflix. One of the hardest problems in this field nowadays is building self-driving cars that navigate fully autonomously, without driver intervention, and more safely than human drivers. This is referred to as Level 5 autonomy and is being researched by major car companies globally, with Tesla considered by some to be in the lead.

Waymo/Google Chrysler Pacifica Hybrid undergoing testing in the San Francisco Bay Area (Source: Wikipedia)

Supervised vs Unsupervised Learning

We generally differentiate between two types of learning: supervised and unsupervised.

The crucial point for supervised learning is that the output to be predicted needs to be known for model training. Typical supervised models are linear/logistic regression, trees, neural networks and support vector machines. If numerical outputs are predicted we generally speak of regression, whereas categorical outputs (aka factors) are used for classification.

By contrast, unsupervised learning does not involve a specified output. The main goal is to learn the structure and statistical patterns present in the data. Examples of such models include clustering and principal component analysis (PCA).
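To make the idea of learning without a specified output more concrete, here is a minimal sketch using base R only. The built-in `iris` data set, the choice of three clusters and the variable names are assumptions made for illustration and are not part of the course material.

```r
# Unsupervised learning sketch on the built-in iris data set (an assumption).
# The Species column is dropped, so no output variable is available to the models.
features <- iris[, 1:4]

# Clustering: k-means with three clusters (the value 3 is an illustrative choice).
set.seed(42)  # k-means starts from randomly chosen centres
clusters <- kmeans(scale(features), centers = 3, nstart = 20)
table(clusters$cluster)  # number of observations assigned to each cluster

# Principal component analysis on the standardised features.
pca <- prcomp(features, center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component
```

Neither call is given the species labels: both only describe structure in the four measurement columns.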

Within the supervised learning domain we additionally differentiate between regression and classification. If the variable to be predicted is numeric we speak of a regression problem; if the variable is categorical (a factor) we have a classification problem.
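The distinction between regression and classification can be sketched with two small base R models. The built-in `mtcars` data set, the chosen predictors and the 0.5 probability threshold are assumptions made for illustration, not the course's prescribed approach.

```r
# Supervised learning sketch on the built-in mtcars data set (an assumption).
# In both cases the output is known for every training observation.

# Regression: the output mpg is numeric.
reg_model <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_model, newdata = data.frame(wt = 2.5, hp = 110))

# Classification: the output am (transmission type) is converted to a factor.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Logistic regression models the probability of the second factor level ("manual").
clf_model <- glm(am ~ wt, data = cars, family = binomial)
prob_manual <- predict(clf_model, newdata = data.frame(wt = 2.5), type = "response")

# Turn the probability into a class label (threshold 0.5 is an illustrative choice).
clf_pred <- ifelse(prob_manual > 0.5, "manual", "automatic")
```

Here `reg_pred` is a number (miles per gallon), whereas `clf_pred` is one of the two category labels.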

Exercise

Below you can find different descriptions of learning problems. Try to determine whether each problem is supervised or unsupervised.