Land Use Classification with Random Forests Classifier

Written by Men Vuthy, 2022


You can also run the code here in Google Colab. Try clicking button below:

Open In Colab

Objective

The objective is to classify the land use into 5 classes using supervised learning method. The classifier used in this classification is Random Forests Classifier with 500 ensembles. The target classes of land use are: Forest, Water, Herbaceous, Barren and Urban.

Code

Importing data from GitHub

[1]:
!git clone https://github.com/menvuthy/sample_images.git
Cloning into 'sample_images'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 16 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (16/16), done.

Importing modules

[2]:
# Import Python 3's print function and division
from __future__ import print_function, division

# Import GDAL, NumPy, Pandas and matplotlib
from osgeo import gdal, gdal_array
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Import scikit-learn modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import joblib
[3]:
# Tell GDAL to throw Python exceptions, and register all drivers
gdal.UseExceptions()
gdal.AllRegister()
[4]:
# Read in our image and ROI image
img_ds = gdal.Open('/content/sample_images/data/LE70220491999322EDC01_stack.gtif', gdal.GA_ReadOnly)
roi_ds = gdal.Open('/content/sample_images/data/training_data.gtif', gdal.GA_ReadOnly)
[5]:
img = np.zeros((img_ds.RasterYSize, img_ds.RasterXSize, img_ds.RasterCount),
               gdal_array.GDALTypeCodeToNumericTypeCode(img_ds.GetRasterBand(1).DataType))
for b in range(img.shape[2]):
    img[:, :, b] = img_ds.GetRasterBand(b + 1).ReadAsArray()

roi = roi_ds.GetRasterBand(1).ReadAsArray().astype(np.uint8)

# Display them
plt.figure(figsize=(15,15))
plt.subplot(121)
plt.imshow(img[:, :, 4], cmap=plt.cm.Greys_r)
plt.title('SWIR1')

plt.subplot(122)
plt.imshow(roi, cmap=plt.cm.Spectral)
plt.title('ROI Training Data')

plt.show()
../../../_images/Content_Documentation_machine-learning_1-land_use_classification_with_random_forests_11_0.png
[6]:
# Find how many non-zero entries we have -- i.e. how many training data samples?
n_samples = (roi > 0).sum()
print('We have {n} samples'.format(n=n_samples))

# What are our classification labels?
labels = np.unique(roi[roi > 0])
print('The training data include {n} classes: {classes}'.format(n=labels.size,
                                                                classes=labels))
# We will need a "X" matrix containing our features, and a "y" array containing our labels
#     These will have n_samples rows
#     In other languages we would need to allocate these and them loop to fill them, but NumPy can be faster

X = img[roi > 0, :]  # include 8th band, which is Fmask, for now
y = roi[roi > 0]

print('Our X matrix is sized: {sz}'.format(sz=X.shape))
print('Our y array is sized: {sz}'.format(sz=y.shape))

# Mask out clouds, cloud shadows, and snow using Fmask
clear = X[:, 7] <= 1

X = X[clear, :7]  # we can ditch the Fmask band now
y = y[clear]

print('After masking, our X matrix is sized: {sz}'.format(sz=X.shape))
print('After masking, our y array is sized: {sz}'.format(sz=y.shape))
We have 718 samples
The training data include 5 classes: [1 2 3 4 5]
Our X matrix is sized: (718, 8)
Our y array is sized: (718,)
After masking, our X matrix is sized: (718, 7)
After masking, our y array is sized: (718,)
[7]:
# Create a dataframe with the four feature variables
df = pd.DataFrame(X, columns = [1, 2, 3, 4, 5, 6, 7])
df['class'] = y
df.head()
[7]:
1 2 3 4 5 6 7 class
0 348 507 346 3001 1611 646 2587 1
1 369 529 365 3030 1662 726 2587 1
2 348 486 365 3143 1662 646 2637 1
3 327 441 327 3002 1561 646 2537 1
4 327 463 346 2887 1486 567 2537 1
[8]:
# Create Test and Train Data
df['train'] = np.random.uniform(0, 1, len(X)) <= .75
df.head()
[8]:
1 2 3 4 5 6 7 class train
0 348 507 346 3001 1611 646 2587 1 True
1 369 529 365 3030 1662 726 2587 1 False
2 348 486 365 3143 1662 646 2637 1 False
3 327 441 327 3002 1561 646 2537 1 False
4 327 463 346 2887 1486 567 2537 1 True
[9]:
# Create dataframes with test rows and training rows
train, test = df[df['train']==True], df[df['train']==False]

# show the number of observations for test and train dataframe
print('Training data:', len(train))
print('Testing data:', len(test))
Training data: 539
Testing data: 179
[10]:
# Create a list of the feature column's names
features = df.columns[:7]

features
[10]:
Index([1, 2, 3, 4, 5, 6, 7], dtype='object')
[11]:
# Converting each land use class name into digits
# classes = pd.factorize(train['class'])[0]

# Since our land use is already in digits, there's no need to factorize
classes = train['class']

Training model with Random Forest Classifier

[12]:
# Initialize our model with 500 trees
clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)

# Fit our model to training data
clf = clf.fit(train[features], classes)

Saving and loading model

[13]:
# save
joblib.dump(clf, "./random_forest.joblib")

# load, no need to initialize the loaded_rf
rfc_model = joblib.load("./random_forest.joblib")
[14]:
bands = [1, 2, 3, 4, 5, 7, 6]

for b, imp in zip(bands, rfc_model.feature_importances_):
    print('Band {b} importance: {imp}'.format(b=b, imp=imp))
Band 1 importance: 0.14429193686579558
Band 2 importance: 0.19165129431923916
Band 3 importance: 0.11280886832281747
Band 4 importance: 0.10120422287031725
Band 5 importance: 0.18964449880307316
Band 7 importance: 0.16422821708284654
Band 6 importance: 0.09617096173591086
[15]:
# View Out-of-Bag accuracy score
print('Our OOB prediction of accuracy is: {oob}%'.format(oob=rfc_model.oob_score_ * 100))
Our OOB prediction of accuracy is: 97.77365491651206%

Predicting test dataset

[16]:
# Apply the trained Classifier to the test
preds = rfc_model.predict(test[features])
preds
[16]:
array([1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 4, 3,
       3, 5, 3, 3, 4, 4, 4, 3, 5, 4, 4, 4, 4, 5, 5, 5, 4, 4, 4, 4, 4, 4,
       4, 5, 5, 4, 5, 4, 4, 5, 4, 4, 5, 5, 4, 4, 5, 5, 5, 5, 5, 5, 1, 1,
       4, 3, 3, 1, 5, 3, 1, 5, 3, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 5, 1, 4, 4, 1, 1, 1, 1, 2, 4, 3, 3, 1, 3, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 4, 1, 1, 1, 4, 4, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1], dtype=uint8)
[17]:
# View accuracy classification (cross-validation) score
print('Our classification accuracy is: {cv}%'.format(cv=accuracy_score(test['class'], preds)* 100))
Our classification accuracy is: 97.20670391061452%

Visualizing confusion matrix

[18]:
# Get and reshape confusion matrix data
Matrix = confusion_matrix(test['class'], preds)
matrix = Matrix.astype('float') / Matrix.sum(axis=1)[:, np.newaxis]
[19]:
# Build the plot
plt.figure(figsize=(15,5))
sns.heatmap(Matrix, annot=True, annot_kws={'size':10},
            cmap=plt.cm.Greens, linewidths=0.2)

# Add labels to the plot
class_names = ['Forest', 'Water', 'Herbaceous', 'Barren', 'Urban']
tick_marks = np.arange(len(class_names))
tick_marks2 = tick_marks + 0.5
plt.xticks(tick_marks, class_names, rotation=25, fontsize=10)
plt.yticks(tick_marks2, class_names, rotation=0, fontsize=10)
plt.xlabel('Predicted label', fontsize=12)
plt.ylabel('True label', fontsize=12)
plt.title('Confusion Matrix for Random Forest Model')
plt.show()
../../../_images/Content_Documentation_machine-learning_1-land_use_classification_with_random_forests_29_0.png
[20]:
# Build the plot
plt.figure(figsize=(15,5))
sns.heatmap(matrix, annot=True, annot_kws={'size':10},
            cmap=plt.cm.Greens, linewidths=0.2)

# Add labels to the plot
class_names = ['Forest', 'Water', 'Herbaceous', 'Barren', 'Urban']
tick_marks = np.arange(len(class_names))
tick_marks2 = tick_marks + 0.5
plt.xticks(tick_marks, class_names, rotation=25, fontsize=10)
plt.yticks(tick_marks2, class_names, rotation=0, fontsize=10)
plt.xlabel('Predicted label', fontsize=12)
plt.ylabel('True label', fontsize=12)
plt.title('Confusion Matrix for Random Forest Model')
plt.show()
../../../_images/Content_Documentation_machine-learning_1-land_use_classification_with_random_forests_30_0.png

Predicting the whole image

[21]:
# Take our full image, ignore the Fmask band, and reshape into long 2d array (nrow * ncol, nband) for classification
new_shape = (img.shape[0] * img.shape[1], img.shape[2] - 1)

img_as_array = img[:, :, :7].reshape(new_shape)
print('Reshaped from {o} to {n}'.format(o=img.shape,
                                        n=img_as_array.shape))

# Now predict for each pixel
class_prediction = rfc_model.predict(img_as_array)

# Reshape our classification map
class_prediction = class_prediction.reshape(img[:, :, 0].shape)
Reshaped from (250, 250, 8) to (62500, 7)
[22]:
# Visualize

# First setup a 5-4-3 composite
def color_stretch(image, index, minmax=(0, 10000)):
    colors = image[:, :, index].astype(np.float64)

    max_val = minmax[1]
    min_val = minmax[0]

    # Enforce maximum and minimum values
    colors[colors[:, :, :] > max_val] = max_val
    colors[colors[:, :, :] < min_val] = min_val

    for b in range(colors.shape[2]):
        colors[:, :, b] = colors[:, :, b] * 1 / (max_val - min_val)

    return colors

img543 = color_stretch(img, [4, 3, 2], (0, 8000))

# See https://github.com/matplotlib/matplotlib/issues/844/
n = class_prediction.max()
# Next setup a colormap for our map
colors = dict((
    (0, (0, 0, 0, 255)),  # Nodata
    (1, (0, 150, 0, 255)),  # Forest
    (2, (0, 0, 255, 255)),  # Water
    (3, (0, 255, 0, 255)),  # Herbaceous
    (4, (160, 82, 45, 255)),  # Barren
    (5, (255, 0, 0, 255))  # Urban
))
# Put 0 - 255 as float 0 - 1
for k in colors:
    v = colors[k]
    _v = [_v / 255.0 for _v in v]
    colors[k] = _v

index_colors = [colors[key] if key in colors else
                (255, 255, 255, 0) for key in range(1, n + 1)]
cmap = plt.matplotlib.colors.ListedColormap(index_colors, 'Classification', n)

# Now show the classmap next to the image
plt.figure(figsize=(15,15))
plt.subplot(121)
plt.imshow(img543)
plt.title('Composite Image')


plt.subplot(122)
plt.imshow(class_prediction, cmap=cmap, interpolation='none')
plt.title('Classification with Random Forests')

plt.show()
../../../_images/Content_Documentation_machine-learning_1-land_use_classification_with_random_forests_33_0.png