Part of the first lab will be spent setting up a Python working environment. Follow the instructions provided and make sure your environment is set up correctly. If you run into trouble, please seek advice in the lab sessions.
In this lab we introduce the Landsat satellite dataset that we will be using throughout the course and perform some exploratory data analysis and visualisations on it.
The database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values. One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels.
Our database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood and a number indicating the classification label of the central pixel.
The number is a code for the following classes:
1. red soil
2. cotton crop
3. grey soil
4. damp grey soil
5. soil with vegetation stubble
6. mixture class (all types present)
7. very damp grey soil
NB. There are no examples with class 6 in this dataset, so there are in fact only six classes, i.e. 1, 2, 3, 4, 5 and 7.
The data is given in random order and certain lines of data have been removed so that the original image cannot be reconstructed from this dataset.
In each line of data, the four spectral values for the top-left pixel are given first followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom.
Each line therefore contains 36 attributes (= 4 spectral bands x 9 pixels in the neighbourhood), plus the class label. The attributes are numerical, in the range 0 to 255.
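To make this ordering concrete, the following snippet generates the 36 attribute names in read-out order, assuming the pixel_<pixel>_<band> naming convention used by the CSV files in this lab (a sketch for illustration only):
# Generate the 36 attribute names in read-out order:
# pixels 1-9 (left-to-right, top-to-bottom), with 4 spectral bands each
column_names = ['pixel_{}_{}'.format(pixel, band)
                for pixel in range(1, 10)
                for band in range(1, 5)]
print(column_names[:8])  # the first two pixels' worth of attributes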
Execute the cell below to import the packages we will be using throughout this lab. The last line ensures that matplotlib figures are rendered within the notebook.
# Import required packages
from __future__ import division, print_function # Imports from __future__ since we're running Python 2
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Let's start by loading the training and test datasets into pandas DataFrames.
train_path = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_train.csv')
test_path = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_test.csv')
landsat_train = pd.read_csv(train_path, delimiter=',')
landsat_test = pd.read_csv(test_path, delimiter=',')
# Show first 5 instances of train set
landsat_train.head(n=5)
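As a quick sanity check of the note above (that class 6 is absent), we can count the instances of each class in the training set; a minimal sketch using the pandas value_counts() method on the label column shown by head() above:
# Count instances per class; class 6 should be absent
print(landsat_train['label'].value_counts().sort_index())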
As an alternative to head(), we can use the pandas sample() method to inspect n random entries in the DataFrame. We can set the random_state parameter to ensure reproducible results.
Inspect 7 random entries in the test dataset. Set the random_state parameter to a number of your choice (e.g. 10) to ensure reproducible results.
# Your code goes here
landsat_test.sample(7, random_state=10)
# Your code goes here
landsat_train.info()
Your answer goes here
There are 4435 entries and 37 columns in the landsat_train DataFrame.
Alternatively, we can use the shape attribute of a DataFrame to get the number of entries (i.e. rows/samples) and columns (i.e. attributes).
print("There are {} entries and {} columns in the landsat_train DataFrame"\
.format(landsat_train.shape[0], landsat_train.shape[1]))
Another very useful pandas method is describe(), which generates summary statistics for the columns in a DataFrame.
# Summary statistics in train set
landsat_train.describe()
Sometimes we might want to compute further statistics about the columns in a DataFrame. Since pandas structures are built on top of numpy arrays, each DataFrame is essentially a two-dimensional numpy array. Hence, we can use numpy or scipy functions to perform any sort of transformation or to compute statistics.
The next cell computes higher-order statistics (i.e. skewness and kurtosis) by using functions imported from scipy.stats.
from scipy.stats import skew, kurtosis
print('Skewness:\n{}'.format(skew(landsat_train)[:-1])) # Get rid of label column
print('Kurtosis:\n{}'.format(kurtosis(landsat_train)[:-1])) # Get rid of label column
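Equivalently, pandas DataFrames expose their own skew() and kurt() methods, which operate column-wise; note that pandas uses bias-corrected estimators, so the values may differ slightly from the scipy ones above:
# Column-wise skewness and kurtosis via pandas, dropping the label column
print(landsat_train.drop('label', axis=1).skew())
print(landsat_train.drop('label', axis=1).kurt())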
The class label names are stored in a separate file. Let's load them into another DataFrame called landsat_labels and inspect it.
# Read classes and display
labels_path = os.path.join(os.getcwd(), 'datasets', 'landsat', 'landsat_classes.csv')
landsat_labels = pd.read_csv(labels_path, delimiter=',', index_col=0)
landsat_labels
Pandas provides the to_dict() method, which can be used to transform a DataFrame into a python dictionary. It normally returns a dictionary of dictionaries, one for each column in the DataFrame. Since we only have one column in landsat_labels, we can use its name to access it and end up with a single dictionary.
# Turn the labels dataframe into a dictionary
# We only have one column in the DataFrame
landsat_labels_dict = landsat_labels.to_dict()["Class"]
landsat_labels_dict
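One handy use of this dictionary is to translate the numeric labels in the training set into human-readable class names, for instance with the pandas map() method (a brief illustration):
# Map numeric labels to class names (returns a new Series)
label_names = landsat_train['label'].map(landsat_labels_dict)
print(label_names.head())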
Seaborn is a visualisation library built on top of matplotlib which offers some aesthetic enhancement and, more importantly, provides some high-level functions for "exploring and understanding data". Seaborn is also tightly integrated with pandas and provides support for both numpy and pandas data structures.
As a first visualising step, we want to get a feel for the distribution of the various features in the dataset.
For this purpose, we can use the seaborn distplot() function, which combines a histogram with a kernel density estimate plot. Make sure you read the documentation of this function to understand how it can be used.
We have a total of 36 features (9 pixels $\times$ 4 spectral bands) and we will produce one distplot for each feature.
Execute the cell below to visualise the univariate distributions and kernel density estimates.
You should spend some time here to make sure you understand what each line of code does in this cell.
fig, ax = plt.subplots(9, 4, figsize=(17,17)) # Figure with 9 rows and 4 columns
pixels = np.arange(1, 10) # Pixel values (1-9)
bands = np.arange(1, 5) # Spectral band values (1-4)
for ii, pixel in enumerate(pixels):
    for jj, band in enumerate(bands):
        variable_name = 'pixel_' + str(pixel) + '_' + str(band) # Get the variable name of interest
        sns.distplot(landsat_train[variable_name], ax=ax[ii][jj], kde=True) # Use a single feature at a time
        ax[ii][jj].xaxis.label.set_visible(False)
[ax[0][ii].set_title("Band {}".format(band)) for ii, band in enumerate(bands)] # Set band titles for top plots
[ax[ii][0].set_ylabel("Pixel {}".format(pixel)) for ii, pixel in enumerate(pixels)] # Set pixel labels for left-most plots
fig.tight_layout()
plt.show()
It seems like intensity distributions for the different pixels are similar within the same spectral band. Is this surprising? If not, explain why.
Your answer goes here
This is somewhat expected given the procedure followed to generate the dataset; the same pixels might appear in different positions (1-9) across the entries in our dataset (remember that many 3 $\times$ 3 neighbourhoods were extracted from the same image).
Given the observation made above, we now want to visualise the pixel intensity distributions by pooling all pixels together for each entry in the dataset. We still want to do this separately for each spectral band.
Modify the code provided above to produce a figure with 4 subplots (one for each spectral band), and within each subplot show the distribution and kernel density estimate for the pooled pixel intensity values. For each distplot, set the number of bins equal to 25.
Hint: the distplot() function accepts one-dimensional arrays. To be able to use it, you will need to transform the data. For this purpose, you might find the numpy reshape() function useful.
# Your code goes here
fig, ax = plt.subplots(1, 4, figsize=(17,3))
for ii, band in enumerate(bands):
    variable_names = ['pixel_' + str(pixel) + '_' + str(band) for pixel in pixels] # All pixels for the specified band
    sns.distplot(landsat_train[variable_names].values.reshape(-1,), ax=ax[ii], kde=True, bins=25) # Reshape into 1D array
    ax[ii].set_title('Band {}'.format(band)) # Subplot titles
ax[0].set_ylabel('Pooled pixels') # ylabel for left-most subplot
fig.tight_layout()
plt.show()
Now, suppose we want to visualise the pooled pixel distributions separately for every spectral band, as well as for every class in the dataset. We can do this by filtering the data according to their corresponding label, one class at a time.
You are provided with sample code to achieve this. Once again, make sure you understand what every line of code does in the following cell.
# Show distributions separately for each class and spectral band
labels = np.sort(landsat_train.label.unique()) # Get the labels in the dataset, by looking at possible values of "label" attribute
fig, ax = plt.subplots(labels.size,4, figsize=(17,14))
for ii, label in enumerate(labels):
    for jj, band in enumerate(bands):
        variable_names = ['pixel_' + str(pixel) + '_' + str(band) for pixel in pixels] # Pool pixels together
        sns.distplot(landsat_train[landsat_train["label"]==label][variable_names].values.reshape(-1,), \
                     ax=ax[ii][jj], kde=True, bins=25) # Filter by label
[ax[0][ii].set_title("Band {}".format(band)) for ii, band in enumerate(bands)] # Set band titles on top plots
[ax[ii][0].set_ylabel("{}".format(landsat_labels_dict[label])) for ii, label in enumerate(labels)] # Set label titles in left-most plots
fig.tight_layout()
plt.show()
It looks like the different classes can be discriminated by looking at the distribution of the pooled pixel intensities. This is good news, as it means that a classifier would hopefully be able to predict the right labels from pixel intensity values.
In some cases we might want to plot the kernel density estimates for each class on top of one another, to be able to roughly tell whether the classes are distinguishable. If we use the distplot() function as we have done so far, things might get a bit too messy. Instead, we can use seaborn's kdeplot() function, which only plots the kernel density estimate of a variable.
Produce a figure with four subplots (one for each spectral band) and within each subplot use the kdeplot() function to show the kernel density estimate for the pooled pixel intensity values, separately for each class. Pay special attention to setting the legend(s) in your figure correctly.
# Your code goes here
fig, ax = plt.subplots(1, 4, figsize=(17,4))
for ii, band in enumerate(bands):
    for label in labels:
        variable_names = ['pixel_' + str(pixel) + '_' + str(band) for pixel in pixels]
        sns.kdeplot(landsat_train[landsat_train["label"]==label][variable_names].values.reshape(-1,), \
                    ax=ax[ii], label=landsat_labels_dict[label])
[ax[ii].set_title("Band {}".format(band)) for ii, band in enumerate(bands)] # Set band titles
ax[0].legend(loc='upper center', bbox_to_anchor=(1.6, 1.25), ncol=6) # Put legend outside the plot
[ax[ii].legend_.remove() for ii in np.arange(1, 4)] # Remove all legends except the first one
plt.show()
By observing the above kernel density estimate plots, which classes do you think are easy/difficult to separate when using pixel intensity values only?
Your answer goes here
Examples of classes which are easy to separate:
Examples of classes which are hard to separate:
So far, we have focused on univariate feature distributions. Now, we want to get a feel for the correlations between different features. Seaborn offers the pairplot() function, which is an excellent tool for visualising pair-wise relationships between variables.
The following example shows the pairwise relationship between the features pixel_1_1 and pixel_1_2. Refer to the pairplot documentation to understand how this function can be used. Feel free to experiment with other pairs of variables.
sns.pairplot(landsat_train, vars=["pixel_1_1", "pixel_1_2"], \
             plot_kws={'s' : 6}, diag_kws={'bins' : 25}) # Set variables of interest, marker size and bins for histograms
plt.show()
The above plot shows the pair-wise relationship between only two variables. Our feature space is 36-dimensional, so if we wanted to repeat the same procedure for each possible pair of variables we would end up with a 36 $\times$ 36 figure which would not be very meaningful (also it would be fairly computationally expensive to produce).
Instead, we can pool pixels together again, as we did in the previous part. This time, instead of treating each pixel in the same way and combining all pixel values, we can compute the average pixel value in each spectral band.
The following bit of code computes the average pixel value within each spectral band, separately for each observation, and saves the result in a new column.
for band in bands:
    variable_names = ['pixel_' + str(pixel) + '_' + str(band) for pixel in pixels]
    landsat_train['avg_' + str(band)] = landsat_train[variable_names].mean(axis=1)
landsat_train.head(5) # Show the first 5 observations in the updated dataframe
By using the seaborn pairplot() function, show the pairwise correlations between the mean pixel values in each spectral band for the training set (landsat_train).
Hint: make appropriate use of the vars argument of the function.
Which spectral band pairs exhibit the strongest correlations? Are these correlations expected?
# Your code goes here
g = sns.pairplot(landsat_train, vars=['avg_' + str(band) for band in bands], \
                 plot_kws={'s' : 6}, diag_kws={'bins' : 25}) # Set marker size and number of bins for histograms
Your answer goes here
The strongest correlations appear in pairs (1,2) and (3,4). This is somewhat expected, since bands 1 and 2 correspond to the visible region, whereas bands 3 and 4 correspond to the near-infrared region.
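If we want to quantify these relationships rather than judge them by eye, we can also compute the correlation matrix of the band averages; a brief sketch using the pandas corr() method:
# Pearson correlations between the per-band average pixel values
avg_columns = ['avg_' + str(band) for band in bands]
print(landsat_train[avg_columns].corr())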
The pairplot function can also be used to visualise pair-wise relationships between variables, conditioned on the label, that is, separately for each class.
Modify your code from the previous question to visualise pair-wise relationships between spectral bands, separately for each class. For the diagonal plots, show kernel density estimates instead of histograms which are shown by default. Do not worry about changing the legend entries at this point.
Hint: make appropriate use of the hue and diag_kind parameters of the pairplot function.
# Your code goes here
g = sns.pairplot(landsat_train, vars=['avg_' + str(band) for band in bands], \
                 hue='label', diag_kind='kde', plot_kws={'s' : 6})
Do you think that feature interactions can help in discriminating between the different classes? Would it make sense to use a classifier that makes use of such correlations to predict labels? For instance, would you expect a Quadratic Discriminant Analysis (QDA) classifier to perform better or worse than a Gaussian Naive Bayes (GNB) model?
Your answer goes here
From the above visualisation, it is clear that feature interactions (i.e. correlations) exhibit patterns which are characteristic of the different classes. Thus, we would expect that making use of such correlations would improve classification performance. The difference between QDA and GNB is that the latter assumes conditional independence of the features given the label; the above visualisation shows that this assumption does not hold for our data. Hence, we would expect QDA to outperform GNB in this setting.
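To check this intuition empirically, one could fit both classifiers on the 36 raw pixel features and compare their test accuracy; a minimal sketch, assuming scikit-learn is installed and that landsat_test also contains a label column:
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Use the 36 raw pixel features (the avg_ columns were only added to the train set)
feature_names = ['pixel_{}_{}'.format(pixel, band) for pixel in pixels for band in bands]
X_train, y_train = landsat_train[feature_names].values, landsat_train['label'].values
X_test, y_test = landsat_test[feature_names].values, landsat_test['label'].values

# Fit each classifier on the training set and report accuracy on the test set
for clf in [GaussianNB(), QuadraticDiscriminantAnalysis()]:
    clf.fit(X_train, y_train)
    print('{}: test accuracy = {:.3f}'.format(type(clf).__name__, clf.score(X_test, y_test)))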