Data Mining and Exploration [INFR11007]

Python environment set-up

Part of the first lab will be spent setting up a Python working environment. Follow the instructions provided and make sure your environment is set up correctly. If you run into trouble, please seek advice in the lab sessions.

Lab 1: Data exploratory analysis and visualisations

In this lab we introduce the Landsat satellite dataset that we will be using throughout the course and perform some exploratory data analysis and visualisations on it.

Landsat satellite dataset

Description

The database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values. One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels.

Our database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood and a number indicating the classification label of the central pixel.

Classes

The number is a code for the following classes:

  1. red soil
  2. cotton crop
  3. grey soil
  4. damp grey soil
  5. soil with vegetation stubble
  6. mixture class (all types present)
  7. very damp grey soil

NB. There are no examples with class 6 in this dataset, hence there are actually 6 classes, i.e. 1,2,3,4,5 and 7.

Data pre-processing

The data is given in random order and certain lines of data have been removed so that the original image cannot be reconstructed from this dataset.

In each line of data, the four spectral values for the top-left pixel are given first followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom.

Number of examples

  • training set: 4435
  • test set: 2000

Number of attributes

36 (= 4 spectral bands x 9 pixels per neighbourhood)

Attributes

The attributes are numerical, in the range 0 to 255.

Execute the cell below to import the packages we will be using throughout this lab. The last line ensures that matplotlib figures are rendered inline within the notebook.

In [1]:
# Import required packages 
from __future__ import division, print_function # Imports from __future__ for Python 2 compatibility
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Pandas

Pandas is a library for data manipulation and analysis. There are two fundamental data structures in pandas: the Series and DataFrame structures which are built on top of NumPy arrays. DataFrame structures are very similar to Data Frames used in the R programming language.
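
A minimal, self-contained illustration of the two structures (toy data, not the lab dataset):

In [ ]:
# Toy example: a Series is a one-dimensional labelled array,
# a DataFrame is a table of (potentially differently-typed) columns
s = pd.Series([1, 3, 5], name='example')
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
print(df.dtypes)  # each column has its own dtype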

Let's start by loading the training and test datasets into pandas DataFrames.

In [2]:
train_path = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_train.csv')
test_path = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_test.csv')
landsat_train = pd.read_csv(train_path, delimiter = ',')
landsat_test = pd.read_csv(test_path, delimiter = ',')

Preliminary inspections

We can use the pandas head() method to inspect the first n entries in the DataFrame. Note that the last column in the DataFrame contains the label for each data point.

In [3]:
# Show first 5 instances of train set
landsat_train.head(n=5)
Out[3]:
pixel_1_1 pixel_1_2 pixel_1_3 pixel_1_4 pixel_2_1 pixel_2_2 pixel_2_3 pixel_2_4 pixel_3_1 pixel_3_2 ... pixel_7_4 pixel_8_1 pixel_8_2 pixel_8_3 pixel_8_4 pixel_9_1 pixel_9_2 pixel_9_3 pixel_9_4 label
0 92 115 120 94 84 102 106 79 84 102 ... 104 88 121 128 100 84 107 113 87 3
1 84 102 106 79 84 102 102 83 80 102 ... 100 84 107 113 87 84 99 104 79 3
2 84 102 102 83 80 102 102 79 84 94 ... 87 84 99 104 79 84 99 104 79 3
3 80 102 102 79 84 94 102 79 80 94 ... 79 84 99 104 79 84 103 104 79 3
4 84 94 102 79 80 94 98 76 80 102 ... 79 84 103 104 79 79 107 109 87 3

5 rows × 37 columns
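
Given the pixel ordering described in the dataset description (pixels read left-to-right, top-to-bottom, four bands each), a single row can be unpacked back into its 3x3 neighbourhood. A quick sketch, assuming the CSV columns follow exactly that order:

In [ ]:
# Sketch: reconstruct the 3x3x4 neighbourhood of the first training example
# (assumes the pixel_*_* columns follow the documented ordering)
row = landsat_train.iloc[0, :36].values  # drop the label column
neighbourhood = row.reshape(3, 3, 4)     # (row, column, spectral band)
print(neighbourhood[1, 1, :])            # the central pixel in all four bands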

Alternatively, we can use the pandas sample() method to inspect n random entries in the DataFrame. We can set the random_state parameter to ensure reproducible results.

========== Question 1 ==========

Inspect 7 random entries in the test dataset. Set the random_state parameter to a number of your choice (e.g. 10) to ensure reproducible results.

In [4]:
# Your code goes here
landsat_test.sample(7, random_state=10)
Out[4]:
pixel_1_1 pixel_1_2 pixel_1_3 pixel_1_4 pixel_2_1 pixel_2_2 pixel_2_3 pixel_2_4 pixel_3_1 pixel_3_2 ... pixel_7_4 pixel_8_1 pixel_8_2 pixel_8_3 pixel_8_4 pixel_9_1 pixel_9_2 pixel_9_3 pixel_9_4 label
115 46 34 119 131 52 48 110 105 71 77 ... 135 41 31 131 139 44 40 120 120 2
1779 44 29 114 133 47 37 114 122 50 63 ... 88 56 61 91 85 56 64 91 85 2
1449 66 109 122 100 66 109 122 100 66 109 ... 90 67 102 114 94 67 102 114 90 1
145 89 111 110 87 89 106 110 87 89 106 ... 88 88 107 108 88 88 107 113 92 3
1009 67 72 77 62 75 87 96 79 79 99 ... 57 67 66 72 60 63 66 68 57 7
1840 84 99 104 79 84 99 104 79 88 99 ... 90 92 103 110 86 87 103 110 83 3
579 82 96 100 78 82 96 96 78 82 96 ... 80 85 97 101 80 85 97 105 80 4

7 rows × 37 columns

========== Question 2 ==========

The pandas info() method provides basic information (e.g. number of rows/columns, variable data types) about a DataFrame.

Display the basic information about the landsat_train dataframe. How many attributes/samples are there in the dataset?

In [5]:
# Your code goes here
landsat_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4435 entries, 0 to 4434
Data columns (total 37 columns):
pixel_1_1    4435 non-null int64
pixel_1_2    4435 non-null int64
pixel_1_3    4435 non-null int64
pixel_1_4    4435 non-null int64
pixel_2_1    4435 non-null int64
pixel_2_2    4435 non-null int64
pixel_2_3    4435 non-null int64
pixel_2_4    4435 non-null int64
pixel_3_1    4435 non-null int64
pixel_3_2    4435 non-null int64
pixel_3_3    4435 non-null int64
pixel_3_4    4435 non-null int64
pixel_4_1    4435 non-null int64
pixel_4_2    4435 non-null int64
pixel_4_3    4435 non-null int64
pixel_4_4    4435 non-null int64
pixel_5_1    4435 non-null int64
pixel_5_2    4435 non-null int64
pixel_5_3    4435 non-null int64
pixel_5_4    4435 non-null int64
pixel_6_1    4435 non-null int64
pixel_6_2    4435 non-null int64
pixel_6_3    4435 non-null int64
pixel_6_4    4435 non-null int64
pixel_7_1    4435 non-null int64
pixel_7_2    4435 non-null int64
pixel_7_3    4435 non-null int64
pixel_7_4    4435 non-null int64
pixel_8_1    4435 non-null int64
pixel_8_2    4435 non-null int64
pixel_8_3    4435 non-null int64
pixel_8_4    4435 non-null int64
pixel_9_1    4435 non-null int64
pixel_9_2    4435 non-null int64
pixel_9_3    4435 non-null int64
pixel_9_4    4435 non-null int64
label        4435 non-null int64
dtypes: int64(37)
memory usage: 1.3 MB

Your answer goes here

There are 4435 entries and 37 columns in the landsat_train dataframe.

Alternatively, we can use the shape attribute of a DataFrame to get the number of entries (i.e. rows/samples) and columns (i.e. attributes).

In [6]:
print("There are {} entries and {} columns in the landsat_train DataFrame"\
      .format(landsat_train.shape[0], landsat_train.shape[1]))
There are 4435 entries and 37 columns in the landsat_train DataFrame

Another very useful pandas method is describe() which generates summary statistics about the columns in a DataFrame.

In [7]:
# Summary statistics in train set
landsat_train.describe()
Out[7]:
pixel_1_1 pixel_1_2 pixel_1_3 pixel_1_4 pixel_2_1 pixel_2_2 pixel_2_3 pixel_2_4 pixel_3_1 pixel_3_2 ... pixel_7_4 pixel_8_1 pixel_8_2 pixel_8_3 pixel_8_4 pixel_9_1 pixel_9_2 pixel_9_3 pixel_9_4 label
count 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 ... 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000 4435.000000
mean 69.473957 83.855242 99.322886 82.555581 69.208794 83.499436 99.174521 82.480496 68.963472 83.131003 ... 82.673957 69.033145 83.434273 99.183766 82.638106 68.797069 83.153326 99.061556 82.584442 3.650282
std 13.649635 22.725079 16.672166 18.695450 13.552423 22.811809 16.627583 18.714272 13.497858 22.813312 ... 18.884322 13.533934 22.812090 16.741240 18.939820 13.437785 22.762684 16.660735 18.895513 2.217055
min 40.000000 27.000000 56.000000 33.000000 40.000000 27.000000 50.000000 29.000000 40.000000 27.000000 ... 34.000000 39.000000 27.000000 50.000000 29.000000 40.000000 27.000000 50.000000 29.000000 1.000000
25% 60.000000 71.000000 85.000000 70.000000 60.000000 71.000000 85.000000 69.000000 60.000000 71.000000 ... 70.000000 60.000000 71.000000 85.000000 69.000000 60.000000 71.000000 85.000000 69.000000 2.000000
50% 68.000000 87.000000 101.000000 83.000000 68.000000 85.000000 101.000000 81.000000 67.000000 85.000000 ... 81.000000 68.000000 85.000000 101.000000 81.000000 67.000000 85.000000 100.000000 81.000000 3.000000
75% 80.000000 103.000000 113.000000 92.000000 80.000000 103.000000 113.000000 92.000000 79.000000 103.000000 ... 92.000000 79.500000 103.000000 113.000000 92.000000 79.000000 103.000000 113.000000 92.000000 5.000000
max 104.000000 137.000000 140.000000 154.000000 102.000000 137.000000 145.000000 157.000000 104.000000 130.000000 ... 154.000000 104.000000 128.000000 145.000000 157.000000 104.000000 130.000000 145.000000 157.000000 7.000000

8 rows × 37 columns

Sometimes we might want to compute further statistics about the columns in a DataFrame. Since pandas structures are built on top of numpy arrays, each DataFrame is essentially a two-dimensional numpy array, so we can use numpy or scipy functions to perform any sort of transformation or compute statistics. The next cell computes higher-order statistics (i.e. skewness and kurtosis) using functions imported from scipy.stats.

In [9]:
from scipy.stats import skew, kurtosis
print('Skewness:\n{}'.format(skew(landsat_train)[:-1])) # Get rid of label column 
print('Kurtosis:\n{}'.format(kurtosis(landsat_train)[:-1])) # Get rid of label column
Skewness:
[ 0.03897976 -0.66114747 -0.11532597  0.88180938  0.05867339 -0.64255423
 -0.11676827  0.89154795  0.06286078 -0.64115429 -0.10967779  0.91558233
  0.03864269 -0.6650612  -0.12403743  0.87217058  0.03516946 -0.65737898
 -0.11111385  0.90241981  0.056391   -0.64982478 -0.10552114  0.90049647
  0.03864561 -0.67319444 -0.12702005  0.88087324  0.04238649 -0.66352782
 -0.13171674  0.88438801  0.05262999 -0.66424573 -0.12624233  0.88873574]
Kurtosis:
[-0.73962977 -0.21807077 -0.89152972  1.30588852 -0.72693453 -0.27374012
 -0.91441639  1.28448304 -0.72192199 -0.29861851 -0.91068177  1.34217389
 -0.74918916 -0.20644875 -0.91032411  1.1945522  -0.73138799 -0.2564044
 -0.93930816  1.26275161 -0.70827217 -0.2865319  -0.92976268  1.26566877
 -0.71984675 -0.17613722 -0.90109915  1.21253805 -0.72119253 -0.22213609
 -0.91358958  1.2209361  -0.70923185 -0.24331318 -0.91402879  1.27599748]
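
Note that pandas DataFrames also provide built-in skew() and kurt() methods, so the same quantities can be computed without scipy. The values differ slightly from the scipy output above because pandas applies a bias correction by default, while scipy's skew() and kurtosis() do not:

In [ ]:
# Equivalent statistics via pandas built-ins (bias-corrected estimators,
# so the values differ slightly from scipy's defaults)
print(landsat_train.iloc[:, :-1].skew())
print(landsat_train.iloc[:, :-1].kurt())  # excess kurtosis, as with scipy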

Class labels

The class label names are stored in a separate file. Let's load them in another DataFrame called landsat_labels and inspect.

In [9]:
# Read classes and display
labels_path = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_classes.csv')
landsat_labels = pd.read_csv(labels_path, delimiter = ',', index_col=0)
landsat_labels
Out[9]:
Class
1 red soil
2 cotton crop
3 grey soil
4 damp grey soil
5 soil with vegetation stubble
6 mixture class (all types present)
7 very damp grey soil

Pandas provides the to_dict() method which can be used to transform a DataFrame into a Python dictionary. It normally returns a dictionary of dictionaries, one for each column in the DataFrame. Since we only have one column in landsat_labels, we can use its name to access it and end up with a single dictionary.

In [10]:
# Turn the labels dataframe into a dictionary
# We only have one column in the DataFrame
landsat_labels_dict = landsat_labels.to_dict()["Class"]
landsat_labels_dict
Out[10]:
{1: 'red soil',
 2: 'cotton crop',
 3: 'grey soil',
 4: 'damp grey soil',
 5: 'soil with vegetation stubble',
 6: 'mixture class (all types present)',
 7: 'very damp grey soil'}
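
A handy use for this dictionary is translating the numeric labels in the training set into readable class names, e.g. with the pandas map() method. A small sketch (class_names is our own illustrative helper, not part of the lab):

In [ ]:
# Sketch: map numeric labels to class names and count examples per class
# (class_names is an illustrative helper, not part of the original lab)
class_names = landsat_train['label'].map(landsat_labels_dict)
print(class_names.value_counts())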

Seaborn

Seaborn is a visualisation library built on top of matplotlib which offers aesthetic enhancements and, more importantly, provides high-level functions for "exploring and understanding data". Seaborn is tightly integrated with pandas and supports both numpy and pandas data structures.

Visualising distributions

As a first visualisation step, we want to get a feel for the distribution of the various features in the dataset.

For this purpose, we can use the seaborn distplot() function which combines a histogram with a kernel density estimate plot. Make sure you read the documentation of this function to understand how it can be used.
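
As a minimal example of the call signature (a single feature, 25 bins, with the kernel density estimate overlaid):

In [ ]:
# Minimal single-feature example; the full 9x4 grid is produced below
sns.distplot(landsat_train['pixel_1_1'], bins=25, kde=True)
plt.show()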

We have a total of 36 features (9 pixels $\times$ 4 spectral bands) and we will produce one distplot for each feature.

Execute the cell below to visualise the univariate distributions and kernel density estimates.

You should spend some time here to make sure you understand what each line of code does in this cell.

In [11]:
fig, ax = plt.subplots(9,4, figsize=(17,17)) # Figure with 9 rows and 4 columns
pixels = np.arange(1,10) # Pixel values (1-9)
bands = np.arange(1,5) # Spectral band values (1-4)
for ii, pixel in enumerate(pixels):
    for jj, band in enumerate(bands):
        variable_name = 'pixel_' + str(pixel) + '_' + str(band) # Get the variable name of interest
        sns.distplot(landsat_train[variable_name], ax=ax[ii][jj], kde=True) # Use a single feature at a time
        ax[ii][jj].xaxis.label.set_visible(False)
[ax[0][ii].set_title("Band {}".format(band)) for ii, band in enumerate(bands)] # Set band titles for top plots
[ax[ii][0].set_ylabel("Pixel {}".format(pixel)) for ii, pixel in enumerate(pixels)] # Set pixel titles for left-most plots
fig.tight_layout()
plt.show()

========== Question 3 ==========

It seems like intensity distributions for the different pixels are similar within the same spectral band. Is this surprising? If not, explain why.

Your answer goes here

This is somewhat expected given the procedure followed to generate the dataset: the same pixels can appear in different positions (1-9) across the entries in our dataset (remember that many overlapping 3 $\times$ 3 neighbourhoods were extracted from the same image).

========== Question 4 ==========

Given the observation made above, we now want to visualise the pixel intensity distributions by pooling all pixels together for each entry in the dataset. We still want to do this separately for each spectral band.

Modify the code provided above to produce a figure with 4 subplots (one for each spectral band), and within each subplot show the distribution and kernel density estimate for the pooled pixel intensity values. For each distplot set the number of bins equal to 25.

Hint: the distplot() function accepts one-dimensional arrays. To be able to use it, you will need to transform the data. For this purpose, you might find the reshape() numpy function useful.

In [12]:
# Your code goes here
fig, ax = plt.subplots(1,4, figsize=(17,3))
for ii, band in enumerate(bands):
    variable_names = ['pixel_' + str(pixel) + '_' + str(band) for pixel in pixels] # All pixels for the specified band
    sns.distplot(landsat_train[variable_names].values.reshape(-1,), ax=ax[ii], kde=True, bins=25) # Reshape into 1D array
    ax[ii].set_title('Band {}'.format(band)) # Subplot titles
ax[0].set_ylabel('Pooled pixels') # ylabel for left-most subplot
fig.tight_layout()
plt.show()

Now, suppose we want to visualise the pooled pixel distributions separately for every spectral band, as well as for every class in the dataset. We can do this by filtering the data according to their corresponding label, one class at a time.

You are provided with sample code to achieve this. Once again, make sure you understand what every line of code does in the following cell.

In [13]:
# Show distributions separately for each class and spectral band
labels = np.sort(landsat_train.label.unique()) # Get the labels in the dataset, by looking at possible values of "label" attribute
fig, ax = plt.subplots(labels.size,4, figsize=(17,14))
for ii, label in enumerate(labels):
    for jj, band in enumerate(bands):
        variable_names = ['pixel_' + str(pixel) + '_' + str(band) for pixel in pixels] # Pool pixels together
        sns.distplot(landsat_train[landsat_train["label"]==label][variable_names].values.reshape(-1,), \
                     ax=ax[ii][jj], kde=True, bins=25) # Filter by label
[ax[0][ii].set_title("Band {}".format(band)) for ii, band in enumerate(bands)] # Set band titles on top plots
[ax[ii][0].set_ylabel("{}".format(landsat_labels_dict[label])) for ii, label in enumerate(labels)] # Set label titles in left-most plots
fig.tight_layout()
plt.show()

It looks like the different classes can be discriminated by looking at the distribution of the pooled pixel intensities. This is good news, as it means that a classifier would hopefully be able to predict the right labels from pixel intensity values.

In some cases we might want to plot the kernel density estimates for each class on top of one another to be able to roughly tell whether the classes are distinguishable. If we use the distplot() function as we have done so far, things might get a bit too messy. Instead, we can use seaborn's kdeplot() function which only plots the kernel density estimate of a variable.

========== Question 5 ==========

Produce a figure with four subplots (one for each spectral band) and within each subplot use the kdeplot() function to show the kernel density estimate for the pooled pixel intensity values, separately for each class. Pay special attention in setting the legend(s) in your figure correctly.

In [16]:
# Your code goes here
fig, ax = plt.subplots(1,4, figsize=(17,4))
for ii, band in enumerate(bands):
    for label in labels:
        variable_names = ['pixel_' + str(pixel) + '_' + str(band) for pixel in pixels]
        sns.kdeplot(landsat_train[landsat_train["label"]==label][variable_names].values.reshape(-1,), \
                    ax=ax[ii], label=landsat_labels_dict[label])
[ax[ii].set_title("Band {}".format(band)) for ii, band in enumerate(bands)] # Set band titles
ax[0].legend(loc='upper center', bbox_to_anchor=(1.6, 1.25),
          ncol=6) # Put legend outside plot
[ax[ii].legend_.remove() for ii in np.arange(1,4)] # Remove all legends except the first one
plt.show()

========== Question 6 ==========

By observing the above kernel density estimate plots, which classes do you think are easy/difficult to separate when using pixel intensity values only?

Your answer goes here

Examples of classes which are easy to separate:

  • Cotton crop vs. grey soil
  • Soil with vegetation stubble vs. grey soil
  • Cotton crop vs. damp grey soil

Examples of classes which are hard to separate:

  • Damp grey soil vs. grey soil
  • Red soil vs. soil with vegetation stubble
  • Red soil vs. grey soil

Correlations

So far, we have focused on univariate feature distributions. Now, we want to get a feel for the correlations between different features. Seaborn offers the pairplot() function, which is an excellent tool for visualising pair-wise relationships between variables.

The following example shows the pairwise relationship between the features pixel_1_1 and pixel_1_2. Refer to the pairplot documentation to understand how this function can be used. Feel free to experiment with other pairs of variables.

In [15]:
sns.pairplot(landsat_train, vars = ["pixel_1_1", "pixel_1_2"], \
             plot_kws={'s' : 6}, diag_kws={'bins' : 25}) # Set variables of interest, marker size and bins for histograms
plt.show()

The above plot shows the pair-wise relationship between only two variables. Our feature space is 36-dimensional, so if we wanted to repeat the same procedure for each possible pair of variables we would end up with a 36 $\times$ 36 figure which would not be very meaningful (also it would be fairly computationally expensive to produce).

Instead, we can pool pixels together again, as we did in the previous part. This time, instead of combining all raw pixel values, we can compute the average pixel value within each spectral band.

The following bit of code computes the average pixel value within each spectral band, separately for each observation, and saves the result in a new column.

In [17]:
for band in bands:
    variable_names = ['pixel_' + str(pixel) + '_' + str(band) for pixel in pixels]
    landsat_train['avg_' + str(band)] = landsat_train[variable_names].mean(axis=1)
landsat_train.head(5) # Show the first 5 observations in the updated dataframe
Out[17]:
pixel_1_1 pixel_1_2 pixel_1_3 pixel_1_4 pixel_2_1 pixel_2_2 pixel_2_3 pixel_2_4 pixel_3_1 pixel_3_2 ... pixel_8_4 pixel_9_1 pixel_9_2 pixel_9_3 pixel_9_4 label avg_1 avg_2 avg_3 avg_4
0 92 115 120 94 84 102 106 79 84 102 ... 100 84 107 113 87 3 90.111111 112.666667 117.555556 90.666667
1 84 102 106 79 84 102 102 83 80 102 ... 87 84 99 104 79 3 84.888889 105.222222 109.000000 83.444444
2 84 102 102 83 80 102 102 79 84 94 ... 79 84 99 104 79 3 83.555556 100.444444 104.333333 80.666667
3 80 102 102 79 84 94 102 79 80 94 ... 79 84 103 104 79 3 82.222222 98.666667 102.888889 79.000000
4 84 94 102 79 80 94 98 76 80 102 ... 79 79 107 109 87 3 80.777778 99.555556 103.888889 80.666667

5 rows × 41 columns

========== Question 7 ==========

By using the seaborn pairplot() function, show the pairwise correlations between the mean pixel values in each spectral band for the training set (landsat_train).

Hint: make appropriate use of the vars argument of the function.

Which spectral band pairs exhibit the strongest correlations? Are these correlations expected?

In [18]:
# Your code goes here
g = sns.pairplot(landsat_train, vars=['avg_' + str(band) for band in bands], \
                 plot_kws={'s' : 6}, diag_kws={'bins' : 25}) # Set marker size and number of bins for histograms

Your answer goes here

The strongest correlations appear in pairs (1,2) and (3,4). This is somewhat expected, since bands 1 and 2 correspond to the visible region, whereas bands 3 and 4 correspond to the near-infrared region.
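
We can back this observation up numerically with the pandas corr() method:

In [ ]:
# Quick numerical check of the pairwise correlations between band averages
avg_cols = ['avg_' + str(band) for band in bands]
print(landsat_train[avg_cols].corr())  # Pearson correlation by default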

========== Question 8 ==========

The pairplot function can also be used to visualise pair-wise relationships between variables, conditioned on the label, that is, separately for each class.

Modify your code from the previous question to visualise pair-wise relationships between spectral bands, separately for each class. For the diagonal plots, show kernel density estimates instead of histograms which are shown by default. Do not worry about changing the legend entries at this point.

Hint: make appropriate use of the hue and diag_kind parameters of the pairplot function.

In [19]:
# Your code goes here
g = sns.pairplot(landsat_train, vars=['avg_' + str(band) for band in bands], \
                 hue='label', diag_kind = 'kde', plot_kws={'s' : 6})

========== Question 9 ==========

Do you think that feature interactions can help in discriminating the different classes? Would it make sense to use a classifier that makes use of such correlations to predict labels? For instance, would you expect a Quadratic Discriminant Analysis (QDA) classifier to perform better or worse than a Gaussian Naive Bayes (GNB) model?

Your answer goes here

From the above visualisation, it is clear that feature interactions (i.e. correlations) exhibit patterns characteristic of the different classes, so we would expect a classifier that exploits these correlations to perform better. The key difference between QDA and GNB is that the latter assumes the features are conditionally independent given the label, which the visualisation shows is not the case for our data. Hence, we would expect QDA to outperform GNB in this setting.
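
As a rough empirical check (not part of the lab itself), one could fit both classifiers with scikit-learn and compare held-out accuracy on the raw 36 pixel features; a sketch, assuming scikit-learn is installed:

In [ ]:
# Hedged sketch: compare GNB and QDA test accuracy on the raw pixel features
# (assumes scikit-learn is available; not part of the original lab)
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

feature_cols = [c for c in landsat_test.columns if c.startswith('pixel_')]
X_train, y_train = landsat_train[feature_cols], landsat_train['label']
X_test, y_test = landsat_test[feature_cols], landsat_test['label']

for model in (GaussianNB(), QuadraticDiscriminantAnalysis()):
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print('{}: test accuracy {:.3f}'.format(type(model).__name__, acc))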