Data Mining and Exploration [INFR11007]

Lab 4: Performance evaluation and model selection

In this lab, we look at various performance metrics for classification. We then turn our attention to cross-validation and hyper-parameter tuning. Finally, we touch on Bayesian optimisation for hyper-parameter tuning.

As always, let's start by importing the basic packages and modules we will need :

In [1]:
# Import required packages 
from __future__ import division, print_function # Imports from __future__ since we're running Python 2
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
random_state = 10 # Ensure reproducible results
%matplotlib inline

Classification performance evaluation

In this lab, we will look at the following classification metrics:

  • classification accuracy
  • logarithmic loss
  • confusion matrices

As always, we will make use of the familiar landsat satellite dataset which is 36-dimensional and comprises 6 classes.

Landsat sattelite data pre-processing

========== Question 1 ==========

Load the landsat_train.csv dataset into a pandas DataFrame called landsat_train_full, and the landsat_test.csv dataset into a DataFrame called landsat_test. Display the shapes of the two DataFrames.

Hint: The DataFrames should have 37 columns including the class labels, and 4435 and 2000 entries for the training and testing datasets, respectively.

In [2]:
# Your code goes here
path_train_full = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_train.csv')
path_test = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_test.csv')
landsat_train_full = pd.read_csv(path_train_full, delimiter = ',')
landsat_test = pd.read_csv(path_test, delimiter = ',')
print("There are {} entries and {} columns in the landsat_train DataFrame"\
      .format(landsat_train_full.shape[0], landsat_train_full.shape[1]))
print("There are {} entries and {} columns in the landsat_test DataFrame"\
      .format(landsat_test.shape[0], landsat_test.shape[1]))
There are 4435 entries and 37 columns in the landsat_train DataFrame
There are 2000 entries and 37 columns in the landsat_test DataFrame

========== Question 2 ==========

Load the dataset class names stored in landsat_classes.csv' into a dictionary. You are free to choose whatever method you wish.

Replace the label numbers in both the landsat_train_full and landsat_test DataFrames with the corresponding class names.

Hint: If unsure, check out the provided solutions for Lab 3.

In [3]:
# Your code goes here
labels_path = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_classes.csv') # Load data
landsat_labels = pd.read_csv(labels_path, delimiter = ',', index_col=0) # Load csv file
landsat_labels_dict = landsat_labels.to_dict()["Class"] # Convert to dictionary
print("Labels dictionary:\n{}".format(landsat_labels_dict)) # Print dictionary
landsat_train_full.replace({'label' : landsat_labels_dict}, inplace=True) # Perform replacement (train)
landsat_test.replace({'label' : landsat_labels_dict}, inplace=True) # Perform replacement (test)
Labels dictionary:
{1: 'red soil', 2: 'cotton crop', 3: 'grey soil', 4: 'damp grey soil', 5: 'soil with vegetation stubble', 6: 'mixture class (all types present)', 7: 'very damp grey soil'}

========== Question 3 ==========

Store the training features, training labels, testing features and testing labels into numpy arrays X_train_full, y_train_full, X_test, and y_test respectively.

In [4]:
# Your code goes here
X_train_full = landsat_train_full.drop('label', axis=1).values.astype(np.float) # Training features
y_train_full = landsat_train_full['label'].values # Training labels
X_test = landsat_test.drop('label', axis=1).values.astype(np.float) # Training features
y_test = landsat_test['label'].values # Training labels

Hold-out validation

We currently have two datasets, namely X_train_full and X_test. If we just wanted to train a simple classifier with default settings and evaluate performance on the test subset, then our current approach should be good enough.

Even simple classifiers, however, have hyper-parameters that need to be carefully tuned. In order to do so, we need a separate validation subset of the data which should be different from the training set. We should never perform model (i.e. hyper-parameter) selection by using the test set, because if we do so, we won't be able to evaluate the generalisation of our model on unseen data (and yes, in case you are wondering, that means that by performing model selection we might in a way overfit on the validation set).

The simplest approach we can follow is to split our data three-way, that is, have a training, a validation, and a test set. We are already given the test set, so all we need to do is to split our training set (which we have called train_full) into training and validation subsets.

Thankfully, sklearn offers an implementation of this operation which is called train_test_split. By default, this function will shuffle our data before splitting it. The test_size input parameter indicates the relative size of the test set (which will be used as the validation set in our case) and the random_state parameter can be used to ensure we can get reproducible results if we call this function multiple times.

Let's transform our full training set into two subsets called train and val. We will feed in our X_train_full and y_train_full arrays and call the new arrays X_train, X_val, y_train, and y_val, respectively.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, 
                                                  test_size=0.33, random_state=random_state)

Feature standardisation

We have already seen that feature standardisation (i.e. transforming the data so that they have zero mean and unit variance) is important for some unsupervised dimensionality reduction methods. It turns out that it is also crucial for the training efficiency of many supervised algorithms, especially those that use some form of optimisation (e.g. logistic regression trained via gradient descent).

Feature standardisation can hardly ever do any harm when it comes to training classifiers, so it is generally good practice to deploy it as a first step in our data processing pipeline.

It is essential, however, to perform feature standardisation by using the training data only. If we use the whole dataset for estimating feature means and variances, we will have information leakage from the test set to the training set, and our results might be over-optimistic.

========== Question 4 ==========

Create a StandardScaler instance and fit it by using the training features only (X_train).

Then by using the object you just fit, standardise (i.e. call the transform() method) the training, validation and test input features and save the results into three new numpy arrays, X_train_sc, X_val_sc and X_test_sc.

Hint: If unsure how to perform this step, check out Lab 3.

In [6]:
# Your code goes here
from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(X_train)
X_train_sc = sc.transform(X_train)
X_val_sc = sc.transform(X_val)
X_test_sc = sc.transform(X_test)

Gaussian Naive Bayes classification

Now we want to use a simple Gaussian Naive Bayes classifier to get a feel for the performance baseline we can achieve on our validation set. Read about the Naive Bayes classifier and the underlying assumption if you are not already familiar with it.

We will make use of the GaussianlNB class in sklearn. Check out the user guide description and documentation to familiarise yourself with this class.

All classifier objects in sklearn implement a fit() and predict() method. The first learns the parameters of the model and the latter classifies inputs. For a Naive Bayes classifier, the fit() method takes at least two input arguments X and y, where X are the input features and y are the labels associated with each example in the training dataset (i.e. targets).

Let's train our classifier by calling its fit() method:

In [7]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(X_train_sc,y_train)

Classification accuracy

Scikit-learn model objects have built in scoring methods. The default score() method for GaussianNB estimates classification accuracy. Alternatively, we can compute the prediction for the test data and make use of the accuracy_score() function (that is in fact what the classifier's score() method does under the hood).

In [8]:
from sklearn.metrics import accuracy_score
# By using the predict() method and accuracy_score metric
gnb_prediction = gnb.predict(X_val_sc)
gnb_accuracy = accuracy_score(y_val, gnb_prediction) # The accuracy_score() function takes as inputs
                                                 # the true labels and the predicted ones

# By using the score() method
gnb_accuracy_alt = gnb.score(X_val_sc, y_val) # The score() method takes as inputs 
                                              # the test input features and the associated (true) labels

# Print results
print("GNB classification accuracy on validation set (by using the accuracy_score() function): {:.3f}"
      .format(gnb_accuracy))
print("GNB classification accuracy on validation set (by using the model's score() method): {:.3f}"
      .format(gnb_accuracy_alt))
GNB classification accuracy on validation set (by using the accuracy_score() function): 0.791
GNB classification accuracy on validation set (by using the model's score() method): 0.791

========== Question 5 [optional] ==========

Write your own function for computing classification accuracy by taking as inputs the vector of the true labels y_true and the vector of predicted labels y_pred. Compare its outcome to the results from using the scikit-learn accuracy_score metric.

In [9]:
# Your code goes here
def my_classification_accuracy(y_true, y_pred):
    """Computes classification accuracy.
    
    y_true : list or array
        vector with true labels
        
    y_pred : list or array
        vector with predicted labels
        
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    assert y_true.shape == y_pred.shape, "Arrays must be of equal shape."
    return np.sum(y_true == y_pred) / y_true.size
In [10]:
# Double-check that results come out as expected
print("GNB classification accuracy on test set (by using the custom function): {:.3f}"
      .format(my_classification_accuracy(y_val, gnb_prediction)))
GNB classification accuracy on test set (by using the custom function): 0.791

Baseline classification [optional]

How can we know if the performance of our classifier is vaguely good or bad? One simple idea is to try to compare its performance to a baseline.

========== Question 6 [optional] ==========

What is the simplest classifier you can think of? (Try for a moment to forget everything you know about machine learning and think how you would classify test inputs in the simplest, and perhaps dumbest, way).

Implement the baseline classifier of your choice and compute its classification accuracy on the validation set. Does the GNB model perform better than the baseline?

This might seem as an unnecessary hassle at this point, but it is important to always check what the baseline performance level is for a given task. There are cases where you can spend hours optimising a classifier, only to find out later on, that its performance does not exceed the baseline (i.e. because the input features, for instance, are not informative of the labels).

In [11]:
# Your code goes here

# Predict always the most frequent class
from scipy.stats import mode # Computes the mode of a signal
from sklearn.metrics import accuracy_score
dominant_class = mode(y_train).mode[0]
y_my_dummy = [dominant_class] * X_val_sc.shape[0]

# Make uniformly random predictions
np.random.seed(random_state) # Set random seed to ensure reproducibility
labels = np.unique(y_train)
random_prediction_int = np.random.randint(0, labels.size, y_val.size)
random_prediction_cat = np.zeros((y_val.size,), dtype='O')
for sample in np.arange(y_val.size):
    random_prediction_cat[sample] = labels[random_prediction_int[sample]]


print("Baseline classification accuracy on validation set (most frequent class): {:.3f}".
      format(accuracy_score(y_val, y_my_dummy)))
print("Baseline classification accuracy on validation set (uniformly random prediction): {:.3f}".
      format(accuracy_score(y_val, random_prediction_cat)))
Baseline classification accuracy on validation set (most frequent class): 0.230
Baseline classification accuracy on validation set (uniformly random prediction): 0.154
/afs/inf.ed.ac.uk/user/s10/s1043682/.local/lib/python3.4/site-packages/scipy/stats/stats.py:245: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored.
  "values. nan values will be ignored.", RuntimeWarning)

Your answer goes here

The simplest classifier is to classify everything as the most frequent class in the training set. The accuracy of this classifier is 0.23 on the validation set, therefore we can be certain that our GNB classifier performs better than chance.

Another option is to make uniformly random predictions. This classifier will yield an even lower accuracy score on the validation set.

========== Question 7 [optional] ==========

It turns out that sklearn implements the DummyClassifier class which offers sever choices for dummy classifiers (strategy parameter). Is the baseline classifier you came up with in the previous question included in this class? If so, double-check that your estimate about baseline performance matches the one returned by the DummyClassifier in sklearn.

In [12]:
from sklearn.dummy import DummyClassifier
# Your code goes here

# most_frequent strategy
dcl_mf = DummyClassifier(strategy='most_frequent')
dcl_mf.fit(X_train_sc, y_train) # Clf user guide for alternative strategy options
y_most_frequent = dcl_mf.predict(X_val)

# uniformly random prediction strategy
# Set random_state parameter to ensure reproducibility
dcl_rnd = DummyClassifier(strategy='uniform', random_state=10).fit(X_train_sc, y_train) 
y_random = dcl_rnd.predict(X_val)

print("Baseline classification accuracy on validation set (most frequent class): {:.3f}".
      format(accuracy_score(y_val, y_most_frequent)))
print("Baseline classification accuracy on validation set (random prediction): {:.3f}".
      format(accuracy_score(y_val, y_random)))
Baseline classification accuracy on validation set (most frequent class): 0.230
Baseline classification accuracy on validation set (random prediction): 0.154

Confusion matrix

Scikit-learn also has a confusion_matrix implementation which returns a numpy array (square matrix) of dimensionality K, where K is the number of classes (6 in our case).

========== Question 8 ==========

By using the prediction of the Gaussian Naive Bayes model, compute and display the confusion matrix on the validation set.

In [14]:
from sklearn.metrics import confusion_matrix
# Your code goes here
cm = confusion_matrix(y_val, gnb_prediction)
print('Confusion matrix\n{}'.format(cm))
Confusion matrix
[[150   0   0   4  11   1]
 [  0  85  23   2   1  19]
 [  0  25 291   7   0   3]
 [  1   0  10 264  62   0]
 [  2   8   0  14  95  20]
 [  0  66   3   0  24 273]]

You are provided with the following function which uses seaborn's heatmap function to visualise a confusion matrix. The parameter normalize can be used to convert the rows of the confusion matrix into normalised prediction scores/probabilities (i.e. the sum of each row will be equal to 1).

In [15]:
# Plot confusion matrix by using seaborn heatmap function
def plot_confusion_matrix(cm, normalize=False, classes=None, title='Confusion matrix'):
    """Plots a confusion matrix.
    
    If normalize is set to True, the rows of the confusion matrix are normalized so that they sum up to 1.
    
    """
    if normalize is True:
        cm = cm/cm.sum(axis=1)[:, np.newaxis]
        vmin, vmax = 0., 1.
        fmt = '.2f'
    else:
        vmin, vmax = None, None
        fmt = 'd'
    if classes is not None:
        sns.heatmap(cm, xticklabels=classes, yticklabels=classes, vmin=vmin, vmax=vmax, 
                    annot=True, annot_kws={"fontsize":9}, fmt=fmt)
    else:
        sns.heatmap(cm, vmin=0., vmax=1.)
    plt.title(title)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

========== Question 9 ==========

Use the provided plot_confusion_matrix to visualise the confusion matrix on the test dataset. Make two calls of the function, one with the normalize parameter set to True, and another one with the same parameter set to False. Use a single figure with two subplots. Inspect the outcome.

In [16]:
# Your code goes here
fig = plt.figure(figsize=(9,5))
ax1 = fig.add_subplot(121)
plot_confusion_matrix(cm, normalize=False, \
                      classes=gnb.classes_) # un-normalized
ax2 = fig.add_subplot(122)
plot_confusion_matrix(cm, normalize=True, \
                      classes=gnb.classes_, title='Normalised confusion matrix') # normalized
ax2.get_yaxis().set_visible(False)
fig.tight_layout()

Cross-entropy (logistic) loss

Classification accuracy is not the only metric used to evaluate classification performance, and in many cases, it can be very misleading. One such case is when we deal with unbalanced datasets in binary classification tasks. In this case, we might want to use different metrics such as precision, recall, F1-score and/or area under the ROC curve. Unfortunately, we won't have time to cover all these in the current lab, but you are free to experiment with them if you are keen.

Another particular case where the accuracy metric is not very useful, is when we care not only about the classification predictions, but also about the associated probability distributions. Sometimes, for instance, we might consider that it is not catastrophic to do a misclassification, as long the associated prediction probability is not too high. In other words, we might want to penalise wrong classifications that are made with high confidence more than misclassifications that are made with low confidence.

One metric that takes into account the prediction probabilities is the logarithmic loss or cross-entropy. This metric is very often used as the evaluation score in data science competitions. You might recognise this metric as the loss funcion used while training logistic regression models and neural networks.

========== Question 10 ==========

By using the log-loss implementation in sklearn, compute the log-loss scores for the Dummy and Gaussian Naive Bayes classifers from Questions 6 and 8.

In [17]:
from sklearn.metrics import log_loss
# Your code goes here
pred_proba_dummy = dcl_mf.predict_proba(X_val_sc)
pred_proba_gnb = gnb.predict_proba(X_val_sc)
print("Log-loss scores:\nDummy classifier {}\nGaussian Naive Bayes classifier {}".
      format(log_loss(y_val, pred_proba_dummy), log_loss(y_val, pred_proba_gnb)))
Log-loss scores:
Dummy classifier 26.588252047175107
Gaussian Naive Bayes classifier 4.057455891821185

Classifier comparison

Now we want to compare the performance of different types of classifiers on the validation set. This is a common first step to decide which classifier might be most suitable for the task at hand.

========== Question 11 ==========

Compare the classification accuracy of various types of classifiers of your choice (at least three different types) on the landsat dataset. Use the training set for training classifiers and the validation set to evaluate performance.

Log and print the classification accuracy and cross-entropy (aka logarithmic loss) scores for each classifier.

Reminder: Make sure you make use of the standardised versions of the data, i.e. X_train_sc and X_val_sc.

Hint: You might find this sklearn tutorial very useful.

In [19]:
# Your code goes here
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
names = ["Logistic Regression", "Nearest Neighbors", "Linear SVM", "RBF SVM",
         "Decision Tree", "Random Forest", "Neural Net (Multi-layer perceptron)",
         "Naive Bayes", "LDA", "QDA"]
classifiers = [
    LogisticRegression(),
    KNeighborsClassifier(n_neighbors=9),
    SVC(kernel="linear", probability=True, random_state=random_state),
    SVC(kernel='rbf', probability=True, random_state=random_state),
    DecisionTreeClassifier(max_depth=10),
    RandomForestClassifier(max_depth=10, n_estimators=50,random_state=random_state),
    MLPClassifier(random_state=random_state),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis()]
ca_score = {} # Classification accuracy
ce_score = {} # Cross-entropy
for name, clf in zip(names, classifiers):
    clf.fit(X_train_sc, y_train)
    ca_score[name] = clf.score(X_val_sc, y_val)
    ce_score[name] = log_loss(y_val, clf.predict_proba(X_val_sc))
print('Classification performance on validation set:')
for clf in names:
    print ("{}, accuracy: {:.3f}, log-loss: {:.3f}".format(clf, ca_score[clf], ce_score[clf]))
Classification performance on validation set:
Logistic Regression, accuracy: 0.837, log-loss: 0.490
Nearest Neighbors, accuracy: 0.889, log-loss: 0.423
Linear SVM, accuracy: 0.867, log-loss: 0.337
RBF SVM, accuracy: 0.893, log-loss: 0.282
Decision Tree, accuracy: 0.854, log-loss: 2.944
Random Forest, accuracy: 0.898, log-loss: 0.284
Neural Net (Multi-layer perceptron), accuracy: 0.902, log-loss: 0.249
Naive Bayes, accuracy: 0.791, log-loss: 4.057
LDA, accuracy: 0.833, log-loss: 0.784
QDA, accuracy: 0.855, log-loss: 0.980

========== Question 12 ==========

Which classifier seems to perform best on the validation set?

Are your observations as expected, or do you find them rather suprising?

Would you trust the results of this comparison? If not, explain why.

Your answer goes here

In terms of classification accuracy, it seems that the Neural Net classifier performs best at this point (although results might depend on how we set the random_state parameter). The performances of the Random Forest and SVM classifiers (with RBF kernel) also rank very highly.

In terms of cross-entropy loss, the neural net again yields the highest performance (i.e lowest log-loss). This should not be suprising, since it is exactty this score that is minimised during the training of a neural network (and a logistic regression classifier of course, but a neural network with hidden layers provides more flexibility in fitting models due to making use of non-linearities in the data).

We should not, however, trust the results of this analysis too much. The first reason is that, with few exceptions, most of the classifiers deployed here have many hyper-parameters, the setting thereof dramatically affects the classifier's performance. Since we have not tuned these hyper-parameters in a systematic way, we should not draw any generic conclusions about the performance of the classifiers.

Another reason is that we have compared the performance of the various classifiers on a specific subset of the data. Hence, we have only computed an approximation of the generalisation error on unseen data. Thankfully, in our case, the validation set is relatively large (~1460 data points), so we have reasons to believe that this might be a relatively good approximation. If we wanted this approximation to be as accurate as possible, we should have used K-fold cross-validation.

Hyper-parameter tuning

If we want to optimise the hyper-parameters of classifiers we should never do so by using the test set. If we compare the performance of various classifiers -after we have tuned their hyper-parameters- by using a validation set, we should equally not use the same set for parameter optimisation.

Instead, we should use two independent procedures for hyper-parameter tuning and classifier comparison. For instance, we could split again our training set into two subsets and use the former to train classifiers under various parameter configurations and the latter to assess performance. Then by picking the parameter settings which gave optimal results for each classifier, we can compare the performance of the best-performing models on our original validation set.

Alternatively, we can use K-fold cross-validation on the training set for setting hyper-parameters; this approach will yield slightly more reliable results, since hyper-parameter selection will be based on averaging across a few runs.

Sklearn offers a very convenient tool for fitting model hyperparameters called GridSearchCV. Spend a few minutes reading the documentation of this class and make sure you understand how it works.

A GridSearchCV classifier is constructed by defining the following parameters, among others:

  • estimator: this is another (base) classifier which has been constructed but not fitted yet.
  • param_grid: this is a dictionary, where the different hyper-parameters to be optimised are defined, as well as the search space for each hyper-parameter, which can be either discrete or continuous.
  • scoring: this is a string defining the objective function (i.e. scoring method) to be used for hyper-parameter optimisation. Some options are accuracy for classification accuracy, neg_log_loss for log-loss, or user-defined metrics. A list of all the available metrics can be found here.
  • cv: cross-validation to be used for optimising the hyper-parameters. This can be either an integer K or a sklearn cross-validation generator. If an integer is provided, then KFold cross-validation will be used. By default sklearn will use 3-fold CV.

Let's now make a first attempt to optimise the regularisation parameter C of an SVM classifier with an RBF kernel.

Spend a few moments to understand what the following piece of code does. You can hopefully see that after we have fitted a GridSearchCV classifier, we can access the best scores achieved and best-scoring parameter configurations by looking at the best_params_ and best_score_ attributes, respectively.

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

cv = KFold(n_splits=3, shuffle=True, random_state=random_state)
svc = SVC(kernel='rbf')
parameters = {'C': np.logspace(-3,3,7)}
svc_clf = GridSearchCV(estimator=svc, cv=cv, param_grid=parameters, scoring='accuracy')
svc_clf.fit(X_train_sc, y_train)
print("Best setting of C parameter for SVC with RBF kernel: {}".format(svc_clf.best_params_["C"]))
print("Best cross-validated score: {:.3f}".
      format(svc_clf.best_score_))
print("Classification accuracy on validation set: {:.3f}".format(svc_clf.score(X_val_sc,y_val)))
Best setting of C parameter for SVC with RBF kernel: 100.0
Best cross-validated score: 0.898
Classification accuracy on validation set: 0.911

========== Question 13 ==========

By adapting the above code provided, optimise both the regularisation parameter C and the kernel coefficient gamma of an SVM classifier with RBF kernel.

For C, you can use the previous grid, and for gamma you can use a logarithmic range between $10^{-4}$ to $10^{1}$.

Print the best scoring parameter configuration and the classification accuracy score on the validation set.

In [21]:
# Your code goes here
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

cv = KFold(n_splits=3, shuffle=True, random_state=random_state)
svc = SVC(kernel='rbf')
parameters = {'C': np.logspace(-3,3,7), 'gamma' : np.logspace(-4, 1, 6)}
svc_clf = GridSearchCV(estimator=svc, cv=cv, param_grid=parameters, scoring='accuracy')
svc_clf.fit(X_train_sc, y_train)
print("Best parameters for SVC with RBF kernel, C: {}, gamma: {}".
      format(svc_clf.best_params_["C"], svc_clf.best_params_["gamma"]))
print("Best cross-validated score: {}".
      format(svc_clf.best_score_))
print("Classification accuracy on validation set: {:.3f}".format(svc_clf.score(X_val_sc,y_val)))
Best parameters for SVC with RBF kernel, C: 10.0, gamma: 0.1
Best cross-validated score: 0.9074385728710872
Classification accuracy on validation set: 0.917

========== Question 14 ==========

How does the performance on the validation set compare to that achieved previously with other classifers?

If you did not want to use the provided GridSearchCV and KFold modules from sklearn but wanted to write your own code from scratch, how many for loops would you have included in your code?

Your answer goes here

The performance attained by the SVM classifier on the validation set after tuning the C and gamma parameters via cross-validation is the highest achieved so far. Once again, results will depend on how we set the random seed, so they might be slightly different if you set it to a different number.

If we were to write the code from scratch, we would need three for-loops; one for looping over the possible C values, another one for looping over the gamma values, and a final one for looping over the different training/validation folds, as part of the cross-validation procedure. You will appreciate that, among others, one particular advantage of using the sklearn modules is that we can write cleaner code.

========== Question 15 [Optional] ==========

Optimise the number of hidden units and regularisation constant alpha of an MLP classifier with one hidden layer. Use the neg_log_loss as the scoring function, which will perform hyper-parameter optimisation by minimising the log-loss (i.e. cross-entropy). Use default settings for the other parameters.

In [22]:
# Your code goes here

# Ignore ConvergenceWarnings from sklearn
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning) 

# MLP with variable hidden layer size and alpha, score: log-loss
cv = KFold(n_splits=3, shuffle=True, random_state=random_state)
mlp = MLPClassifier(random_state=random_state)
parameters = {'hidden_layer_sizes' : [(10,), (100,), (1000,)], 'alpha' : np.logspace(-8,0,9)}
mlp_clf = GridSearchCV(mlp, param_grid=parameters, scoring='neg_log_loss')
mlp_clf.fit(X_train_sc, y_train)
print("Best parameters for MLP classifier: hidden layer size: {}, alpha: {}, best cross-validated score: {:.3f}".
      format(mlp_clf.best_params_["hidden_layer_sizes"], mlp_clf.best_params_["alpha"], mlp_clf.best_score_))
print("Classification accuracy on validation set: {:.3f}".format(accuracy_score(y_val, mlp_clf.predict(X_val_sc))))
Best parameters for MLP classifier: hidden layer size: (100,), alpha: 0.1, best cross-validated score: -0.291
Classification accuracy on validation set: 0.889

========== Question 16 [Optional] ==========

Why is the classification accuracy not better than the one achieved in Question 11? Did we do something wrong?

Your answer goes here

We did not do anything wrong. The reason why our classification accuracy did not improve is that we we have been optimising for the negative log-loss score and not for classification accuracy, hence our algorithm selected those hyper-parameters that minimised the log-loss and not the ones that would have maximised classification accuracy.

The selection of the objective function (aka score) is extremely important for hyper-parameter optimisation, and depends upon the task at hand. There are no hard rules here. But it is generally a good idea to optimise your models with regards to the same objective function that will be finally used to assess performance.

Let's now have a look at what hyper-parameter configuration we would have ended up with, should we have chosen to maximise classification accuracy instead.

In [21]:
# MLP with variable hidden layer size and alpha, score: accuracy
cv = KFold(n_splits=3, shuffle=True, random_state=random_state)
mlp = MLPClassifier(random_state=random_state)
parameters = {'hidden_layer_sizes' : [(10,), (100,), (1000,)], 'alpha' : np.logspace(-8,0,9)}
mlp_clf = GridSearchCV(mlp, param_grid=parameters, scoring='accuracy')
mlp_clf.fit(X_train_sc, y_train)
print("Best parameters for MLP classifier: hidden layer size: {}, alpha: {}, best cross-validated score: {:.3f}".
      format(mlp_clf.best_params_["hidden_layer_sizes"], mlp_clf.best_params_["alpha"], mlp_clf.best_score_))
print("Classification accuracy on validation set: {:.3f}".format(accuracy_score(y_val, mlp_clf.predict(X_val_sc))))
Best parameters for MLP classifier: hidden layer size: (1000,), alpha: 0.01, best cross-validated score: 0.900
Classification accuracy on validation set: 0.904

Classification accuracy has now improved as compared to using default settings. Note that the selection of hyper-parameters can significantly differ, depending upon the objective function used.

Bayesian optimisation

Bayesian optimisation is a strategy for optimising black-box functions without the need for computing derivatives. Bayesian optimisation methods can prove very useful for optimising the hyper-parameters of machine learning models in arbitrary search spaces, for both continuous and discrete-valued hyper-parameters.

In this final section, we will use the scikit-optimize package to tune the C and gamma hyper-parameters of an SVM classifier with RBF kernel (see Question 13). We will use a Gaussian Process prior over the average cross-validated classification accuracy score, and will try to maximise this metric by tuning the SVM hyper-parameters.

Spend a few moments trying to understand what the following piece of code does. If something is not clear, you can consult the skopt documentation and examples.

In [23]:
from sklearn.model_selection import cross_val_score
cv = KFold(n_splits=3, shuffle=True, random_state=random_state) # We want to use 3-fold CV to compute the loss in each iteration
svc_clf = SVC(kernel='rbf') # Constructor without defining hyper-parameters just yet

def objective_svc(params): # Here we define the metric we want to minimise
    C, gamma = params
    svc_clf.set_params(C=C, 
                      gamma=gamma) # Set the parameters of the clf
    
    return -np.mean(cross_val_score(svc_clf, X_train_sc, y_train, cv=cv, n_jobs=-1,
                                    scoring="accuracy")) # We want to maximise average accuracy, i.e. minimise minus average accuracy

# Search space for the two parameters
space  = [(10**-3, 10**3, "uniform"), # C
          (10**-4, 10**1, "uniform")] # gamma

# Initial values (optional)
x0 = [1, 10**-2]

The following piece of code uses the above defined cost function (objective_scv) and the skopt gp_minimise function to select the best hyper-parameter configuration in the defined search space. The input parameters are as follows:

  • func: function that we wish to minimise
  • dimensions: the search space for the hyper-parameters
  • x0: inital values for the hyper-parameters
  • n_calls: number of times the function will be evaluated
  • random_state: random seed
  • n_random_starts: before we start modelling the optimised function with a GP Regression model, we want to try a few random choices for the hyper-parameters.
  • kappa: trade-off between exploration vs. exploitation.
In [25]:
from skopt import gp_minimize
res_gp = gp_minimize(func=objective_svc, dimensions=space, x0=x0, 
                     n_calls=25, random_state=random_state, n_random_starts=5, kappa=1.9)
print("Best score with Bayesian optimisation: {:.3f}".format(-res_gp.fun))
print("Best parameters with Bayesian optimisation:\nC: {}\ngamma: {}"
      .format(res_gp.x[0],res_gp.x[1]))
/afs/inf.ed.ac.uk/user/s10/s1043682/.local/lib/python3.4/site-packages/skopt/optimizer/optimizer.py:366: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
Best score with Bayesian optimisation: 0.908
Best parameters with Bayesian optimisation:
C: 18.233775965744574
gamma: 0.12201861484165527

By using the skopt plot_convergence function, we can plot the objective function (i.e. evaluated cross-validated accuracy score) against the number of iterations of the Bayesian optimisation process.

In [26]:
from skopt.plots import plot_convergence
plt.figure(figsize=(5,2))
plot_convergence(res_gp)
plt.grid()
plt.show()

It is important to clarify here, that the plot above shows the minimum value of the optimised function achieved at step $n$. Nevertheless, this does not mean that at a certain step the objective function should always decrease as compared to the previous step (as is the case with e.g. with gradient descent).

If we wish to plot the value of the evaluated function $f \left (x \right)$ vs. the iteration number $n$, we can do the following:

In [27]:
plt.figure(figsize=(5,2))
plt.plot(res_gp.func_vals)
plt.scatter(range(len(res_gp.func_vals)), res_gp.func_vals)
plt.ylabel(r'$f(x)$')
plt.xlabel('Number of calls $n$')
plt.xlim([0, len(res_gp.func_vals)])
plt.show()

Now, let's compute the classification accuracy on the validation set by using the selected hyper-parameter configuration. Since we are not performing any further comparisons or hyper-parameter optimisation, we can finally use the test set to report the performance of our selected method.

In [28]:
svc_opt = SVC(kernel='rbf',
             C=res_gp.x[0],
             gamma=res_gp.x[1]).fit(X_train_sc,y_train)
print("Classification accuracy on validation set: {:.3f}".format(accuracy_score(y_val, svc_opt.predict(X_val_sc))))
print("Classification accuracy on test set: {:.3f}".format(accuracy_score(y_test, svc_opt.predict(X_test_sc))))
Classification accuracy on validation set: 0.917
Classification accuracy on test set: 0.905

Let's take a moment to reflect on what we have achieved here. By running only 25 function evaluations (i.e. iterations), we have achieved 0.916 accuracy on the validation set. This is only 0.1% smaller than the best score achieved with grid search optimisation, where we had to evaluate our cost function 42 times (since we used a 7 X 6 grid). If we allow the GP optimiser to run for more iterations, we might even get higher performance than we can achieve with grid-search, since we are not restricting our hyper-parameters to discrete values, as opposed to grid search optimisation.

Finally, it is worth noting that the classification performance scores are slightly different for the validation and test sets. In this part we have not used the validation set to perform hyper-parameter optimisation (remember we used K-fold CV within the training set to do so), hence, we can view y_val and y_test two as two separate test sets. The computed classification accuracy scores are only approximations of the performance of the classifier on unseen data. The observed difference in performance for the two sets is due to the fluctuation of the approximation when using finite size datasets.

========== Question 17 [Optional] ==========

Use the skopt package to tune the hyper-parameters of an MLPClassifier with one hidden layer (i.e. number of hidden units and alpha) via Bayesian optimisation.

Finally, report the accuracy of the best parameter configuration on the validation and test sets.

In [29]:
# Your code goes here
cv = KFold(n_splits=3, shuffle=True, random_state=random_state)
mlp_clf = MLPClassifier(random_state=random_state)

def objective_mlp(params):
    number_hidden_units, alpha = params

    mlp_clf.set_params(hidden_layer_sizes = (number_hidden_units,),
                      alpha=alpha)

    return -np.mean(cross_val_score(mlp_clf, X_train_sc, y_train, cv=cv, n_jobs=-1,
                                    scoring="accuracy"))

space  = [(10, 1000),                       # number of hidden units
          (10**-8, 1)]                      # alpha
x0 = [100, 10**-4]
In [30]:
res_gp = gp_minimize(objective_mlp, space, x0=x0, n_calls=25, random_state=random_state, n_random_starts=5)
print("Best score with Bayesian optimisation: {:.3f}".format(res_gp.fun))
print("Best parameters with Bayesian optimisation:\n-hidden layer size: ({},)\n-alpha: {}"
      .format(res_gp.x[0],res_gp.x[1]))
/afs/inf.ed.ac.uk/user/s10/s1043682/.local/lib/python3.4/site-packages/skopt/optimizer/optimizer.py:366: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/afs/inf.ed.ac.uk/user/s10/s1043682/.local/lib/python3.4/site-packages/skopt/optimizer/optimizer.py:366: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/afs/inf.ed.ac.uk/user/s10/s1043682/.local/lib/python3.4/site-packages/skopt/optimizer/optimizer.py:366: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
Best score with Bayesian optimisation: -0.900
Best parameters with Bayesian optimisation:
-hidden layer size: (1000,)
-alpha: 1e-08
In [31]:
# Convergence plot
plt.figure(figsize=(5,5))
plot_convergence(res_gp)
plt.grid()
plt.show()
In [32]:
# Train final model and report accuracy on validation and test sets
mlp_opt = MLPClassifier(random_state=random_state,
                        hidden_layer_sizes = (res_gp.x[0],),
                        alpha=res_gp.x[1])
mlp_opt.fit(X_train_sc,y_train)
print("Classification accuracy on validation set: {:.3f}".format(accuracy_score(y_val, mlp_opt.predict(X_val_sc))))
print("Classification accuracy on test set: {:.3f}".format(accuracy_score(y_test, mlp_opt.predict(X_test_sc))))
Classification accuracy on validation set: 0.903
Classification accuracy on test set: 0.895

Again, note that classification accuracy is slightly worse on the test set.

It is also worth noting that, skopt can deal with both continuous and discrete hyper-parameters in the same way. In our example, only integer values make sense for the number_hidden_units parameter.