Part 2: Principal Component Analysis

This part of the lab exercise is designed to familiarize students with Principal Component Analysis (PCA) for multispectral imagery.

Readings: Mather and Koch (2011), Computer Processing of Remotely-Sensed Images, pp 160-168

Correlation Among Multispectral Images

Correlation is a statistical technique that is used to evaluate the degree of association between two variables. Correlation, usually designated by the Pearson r value, can be positive (indicating that as variable X increases in value variable Y also tends to increase) or negative (as X increases, Y tends to decrease in value). Correlation values vary from -1, indicating a strong negative association between variables, and 1, indicating a strong positive association. An r-value of 0 indicates that there is no statistical association between the two test variables.

Correlation provides a valuable tool for assessing the degree to which the brightness values in different bands of multispectral imagery are associated with one another. Use WhiteboxTools' ImageCorrelation tool to generate the correlation matrix for each of the bands ofthe K-W/Cambridge/Guelph Landsat 8 scene.

Include the correlation matrix with your final lab report (1 mark).

2.1. Using the generated correlation matrix identify any groups of images that exhibit strong correlation, i.e. r-values of 0.9 or greater and -0.9 or less? (3 marks)

2.2. Which of the seven bands is the most unique, i.e. does not exhibit a high degree of correlation with any other bands in the data set? (1 mark)

Principal Component Analysis (PCA)

It is important that you read the following preamble carefully.

While we may have seven bands of multispectral data in our test data set, the correlation analysis above shows that we do not have seven bands worth of information within the imagery. That is, correlation among the individual bands represents redundancy in the data set. As such, we should be able to use a data reduction technique to eliminate this redundancy and reduce the total number of images that need to be analyzed. For particularly intensive remote sensing analyses, this can be an important, or even necessary, step. For example, if we can use fewer band images to perform image classification, while still retaining the same amount of information in the data set, we will greatly improve the efficiency of the analysis.

PCA is one such data reduction technique that is widely used in remote sensing applications. PCA is used to reduce the number of band images necessary for classification (i.e. as a data reduction technique), for noise reduction, for change detection applications, and in many other areas. It is one of the most useful data transformations that we encounter in remote sensing.

As we've seen, any multi- or hyper-spectral imagery is likely to contain a substantial amount of redundancy owing to the correlation among the images. That is, the actual dimensionality of a multi-spectral data set is likely less than the number of bands. PCA transforms the original image data set into fewer, uncorrelated images. The technique works by transforming the axes (i.e. plural of axis) of the multispectral space (a 7-dimensional space in the case of our Landsat data) such that it coincides with the directions of greatest correlation. Each of these new axes are orthogonal (right angle) to one another.

Use WhiteboxTools' PrincipalComponentAnalysis tool to run a PCA on the Landsat data set. Input all seven available bands. Do not standardized the PCA. This is only done when the variances in your input images differ substantially, such as would be the case if they contained values that were recorded in different units (e.g. feet and meters). Each of our input images have the same bit depth so we do not need to standardize the analysis. We want to create all of the component images, i.e. num_comp=7.

You may run the PCA in a Python script (e.g. pca.py) that has been set up in the usual manner (i.e. importing WhiteboxTools, creating a wbt object, etc.), by using the following code:

from WBT.whitebox_tools import WhiteboxTools


wbt = WhiteboxTools()
wbt.work_dir = "/path/to/data/" # Update this

print("Performing PCA...")
wbt.verbose = True  # We would like the PCA report to be automatically displayed
wbt.principal_component_analysis(
    inputs="band1_clipped.tif;band2_clipped.tif;band3_clipped.tif;band4_clipped.tif;band5_clipped.tif;band6_clipped.tif;band7_clipped.tif",
    output="pca_report.html",
    num_comp=7,
    standardized=False
)

Several outputs will be generated when the tool has completed. An HTML PCA report will be created and, hopefully, automatically displayed. This report contains useful data to help us interpret the results of the analysis. The first table that is in the PCA report lists the amount of explained variance (in non-cumulative and cumulative form), the eigenvalue, and the eigenvector for each component. Yikes, that's a lot of jargon that you're probably unfamiliar with! Okay, take a deep breath and let's look at it more closely. First of all, each of the seven components refer to the seven newly created, transformed images that we have created by running this tool. You can think of the amount of explained variance associated with each component as a measure of how much information content within the original multi-spectral data set that a component has. The higher this value is, the more important the component is. In fact, this same information is presented in graphical form in the Scree Plot that was also output when the tool completed.

Include the PCA report table and the scree plot in your final report (2 marks).

2.3. How does the amount of information, i.e. explained variance, vary by component number? (1 mark)

The eigenvalue is really just a related measure of information content and the eigenvector simply describes the mathematical transformation (rotation coordinates) that correspond to a particular component image. Neither of these two things are all that important for us now. They are necessary if you ever want to perform an inverse PCA, taking the components, or a subset of them, and transforming them back into the original coordinate system. This is sometimes a useful thing to do if you are using PCA for noise reduction applications.

Now then, you might have noticed something a bit strange about the results. I said previously that PCA is often used to reduce the number of images that we need to analyze (e.g. for image classification applications), but in fact the PCA has spat out seven new images. This is always the case; PCA will produce as many components as there are input images, unless you specify for it to create fewer (in this case, the images are still created, just not saved). The idea is that in data reduction applications the user is able to leave out several of the less important components from further analyses. Importantly, leaving out some components does not significantly affect the amount of information content in the overall data set. Is it just me or is that not a really nifty trick? I thought so too.

2.4. Based on the cumulative amount of explained variance, what is the actual dimensionality of this data set? How many, and which of the components would you need to include in any subsequent analysis to ensure that the vast majority of the information (variance) in the data set is not lost? (2 mark)

2.5. How would the shape of the scree-plot change if our original data set contained less correlation among images than we observed? That is, would the slope of the plot be steepened or flattened if there were less correlation in the data set and why? (2 marks)

Now it's time to take a look at the actual component images. Using your data visalization software of choice, display each of the component images generated by the PCA tool (i.e PCA_component1, PCA_component2, PCA_component3...).

2.6. Examine each of the seven PCA component images carefully. Prepare a table in which you describe each component with respect to the scene/landscape characteristics that are included. For example, which component(s) contain information about water depth, atmospheric haze, vegetation, image noise, etc. You will be graded based on the level of detail you provide. (7 marks)

Now examine the Factor Loadings table within the PCA text report. These loadings values describe the correlation (i.e. r values) between each of the PCA components (table columns) and the original seven Landsat band images (table rows). These values tell you how the information contained in an image is spread among the PCA components. An analysis of factor loadings can be reveal very useful information about the data set. For example, it can help you to identify groups of similar images.

2.7. Do the factor loadings reveal any natural groupings of similar images? If so, identify the groupings. (2 marks)

You may recall that near the start of this section I said that PCA transforms the original image data set into a set of uncorrelated images.

You will want to run the ImageCorrelation tool on the seven PCA components to confirm that this is the case and include the correlation results in your final report (1 mark).

PCA is one heck of a nifty trick, isn't it?

Creating a PCA Composite Image

By creating a colour-composite image of the first three PCA components (i.e PCA_component1, PCA_component2, PCA_component3), we are able to create a colour image that contains almost the same amount of information as the entire original data set of seven bands. Similarly, a colour composite of three of the higher PCA components (i.e PCA_component5, PCA_component6, PCA_component7), alows us to exam the noise parts of the data set. Use the following script to generate these two colour-composite images.

from WBT.whitebox_tools import WhiteboxTools


wbt = WhiteboxTools()
wbt.work_dir = "/path/to/data/" # Update this

wbt.verbose = False # We don't need the progress to be updated for each operation.

# First, perform contrast stretches on the individual PCA components
for band_num in range(1, 8):
    print("Performing stretch on component {}".format(band_num))
    in_file = "PCA_component{}.tif".format(
        band_num)
    out_file = "PCA_component{}_cs.tif".format(
        band_num)
    wbt.percentage_contrast_stretch(
        i=in_file, output=out_file, clip=5.0, num_tones=1024)

print("Creating colour composites...")
# PCA1-PCA2-PCA3 RGB
wbt.create_colour_composite(
    red="PCA_component1_cs.tif",
    green="PCA_component2_cs.tif",
    blue="PCA_component3_cs.tif",
    output="PCA123RGB.tif",
    enhance=False,
)

# PCA5-PCA6-PCA7 RGB
wbt.create_colour_composite(
    red="PCA_component5_cs.tif",
    green="PCA_component6_cs.tif",
    blue="PCA_component7_cs.tif",
    output="PCA567RGB.tif",
    enhance=False,
)

print("All done!")

Once the script has completed, display PCA123RGB.tif and PCA567RGB.tif in the data visualization software.

Include screenshots of these two images with your final Lab report (2 marks).

2.8. What colours are the various common land-covers in the scene, including pavement (urban), bare soil, crop cover, forest, and water, displayed with in the 'signal-component' image (PCA123RGB.tif)? Hint: it may help to look at your enhanced natural-colour composite image to pick out sites with these various land-covers. (5 marks)

2.9. Examining the noise-component image (PCA567RGB.tif), describe the relative noise content within urban vs rural areas within the scene. (2 marks)