PCA for attributes

This tool can be used to perform a principal component analysis (PCA) on a group of specified attributes from a vector file. PCA is a common data reduction technique that is used to reduce the dimensionality of multi-dimensional data space. Correlation among attributes in the original data set represents data redundancy, i.e. fewer attributes than are present are required to represent the same information, where the information is related to variance within the attributes. PCA transforms the original data set of n attributes into n 'components' variables, where each component is uncorrelated with all other components. The technique works by transforming the axes of the multi-spectral space such that they coincides with the directions of greatest correlation. Each of these new axes are orthogonal to one another. PCA is therefore a type of coordinate system transformation. The PCA components are arranged such that the greatest amount of variance (or information) within the original data set, is contained within the first component and the amount of variance decreases with each component. It is often the case that the majority of the information contained in a multi-dimensional data set can be represented by the first three or four PCA components. The higher-order components are often associated with noise in the original data set.

The user must specify the name of the input shapefile and the names of the input attributes. Additionally, the user must specify whether to perform a standardized PCA and the number of output components to generate (all components will be output unless otherwise specified; there can be as many components generated as there are input attributes). A standardized PCA is performed using the correlation matrix rather than the variance-covariance matrix. This is appropriate when the variances in the input attributes differ substantially, such as would be the case if they contained values that were recorded in different units (e.g. feet and meters) or on different scales. The tool can output a specified number of components (must be less than the number of variables). If this parameter is not specified (left blank in dialog) all components will be recorded in the output.

Several outputs will be generated when the tool has completed. A text report will output into the text area at the bottom of the Whitebox user-interface. This report contains useful data and it is advisable to save it for later reference by right-clicking over the text area and selecting 'Save'. The first table that is in the PCA report lists the amount of explained variance (in non-cumulative and cumulative form), the eigenvalue, and the eigenvector for each component. Each of the PCA components refer to the newly created, transformed attribute stored in the shapefile's database file. The amount of explained variance associated with each component can be thought of as a measure of how much information content within the original multi-spectral data set that a component has. The higher this value is, the more important the component is. This same information is presented in graphical form in the 'Scree Plot' that is also output by the tool. Note that you can save the scree plot by right-clicking over the plot and selecting 'Save'. The eigenvalue is another measure of the information content of a component and the eigenvector describes the mathematical transformation (rotation coordinates) that correspond to a particular component image.

Factor Loadings are also output in a table within the PCA text report. These loading values describe the correlation (i.e. r values) between each of the PCA components (columns) and the original attributes (rows). These values show you how the information contained in an attribute is spread among the components. An analysis of factor loadings can be reveal useful information about the data set. For example, it can help to identify groups of similar attributes.

PCA is used as a data reduction technique and for noise reduction. While this tool is intended to be applied to the attributes of a vector file, PCA can also be performed on imagery data using the Principal Component Analysis tool.

See Also:


The following is an example of a Python script that uses this tool:

wd = pluginHost.getWorkingDirectory()
# Input data has the shapefile name followed
# by each attribute included in the analysis,
# separated by semicolons.
inputData = wd + "neighbourhoods.shp" + ";" + "NUM_SCHOOLS" + ";" + "INCOME" + ";" + "POP_DENSITY"
standardized = "true"
numOutputComponents = "not specified"
args = [inputData, standardized, numOutputComponents]
pluginHost.runPlugin("PCAForAttributes", args, False)

This is a Groovy script also using this tool:

def wd = pluginHost.getWorkingDirectory()
// Input data has the shapefile name followed
// by each attribute included in the analysis,
// separated by semicolons.
def inputData = wd + "neighbourhoods.shp" + ";" + "NUM_SCHOOLS" + ";" + "INCOME" + ";" + "POP_DENSITY"
def standardized = "true"
def numOutputComponents = "2"
String[] args = [inputData, standardized, numOutputComponents]
pluginHost.runPlugin("PCAForAttributes", args, False)