More Information about Principal Component Analysis in REDFIN
The first two sections below contain a very brief introduction to Principal Component Analysis (PCA) and its application to 2DGE data. More in-depth information can be found in textbooks on statistical methods, or by searching the Internet. If you are already familiar with the general ideas of PCA, you may want to skip ahead to the sections describing the PCA plot in REDFIN.
Introduction to PCA
Principal component analysis (PCA) involves a mathematical procedure to find a smaller set of synthetic variables that capture the variance in an original data set. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
The results of a PCA are usually discussed in terms of component scores and loadings. The loadings tell how the principal components are related to the original variables, and the scores show how much of the variance of each data point (sample) is associated with a particular principal component. PCA can be used as an exploratory tool to identify unknown trends in a multidimensional data set and to find samples or variables that tend to vary in the same way. If a multivariate data set is visualized as a set of coordinates in a high-dimensional data space (one axis per variable), PCA supplies the user with a lower-dimensional picture, a "shadow" of this object when viewed from its (in some sense) most informative viewpoint.
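As a rough illustration of loadings and scores, the sketch below finds the first principal component of a tiny, already centered data set. It uses power iteration only because it is short enough to write in plain Python; this is an illustrative choice and not a description of how REDFIN computes its components. The function name `first_component` and the numbers in `rows` are made up for the example.

```python
import math

def first_component(rows, iterations=100):
    """Return (loadings, scores) for the first principal component.

    `rows` holds one data point (sample) per row, with every variable
    already centered to mean zero. Power iteration is an illustrative
    choice, not necessarily the method used by any particular software.
    """
    n_vars = len(rows[0])
    # Start from a unit vector that weights all variables equally.
    loadings = [1.0 / math.sqrt(n_vars)] * n_vars
    for _ in range(iterations):
        # Project every sample onto the current direction.
        scores = [sum(x * w for x, w in zip(row, loadings)) for row in rows]
        # Multiply by the transposed data matrix and renormalize.
        updated = [sum(row[j] * s for row, s in zip(rows, scores))
                   for j in range(n_vars)]
        norm = math.sqrt(sum(u * u for u in updated))
        loadings = [u / norm for u in updated]
    scores = [sum(x * w for x, w in zip(row, loadings)) for row in rows]
    return loadings, scores

# Hypothetical centered data: four samples, two strongly correlated variables.
rows = [[2.0, 1.1], [-2.0, -0.9], [1.0, 0.4], [-1.0, -0.6]]
loadings, scores = first_component(rows)
```

Here `loadings` gives the weight of each original variable in the component, and `scores` gives the position of each sample along it. The sum of squared scores is larger than the sum of squares along either original axis, which is what "accounts for as much of the variability as possible" means in practice.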
PCA has many other names. For instance, it is equivalent to the discrete Karhunen-Loève transform (KLT), the Hotelling transform, the proper orthogonal decomposition (POD), or the Singular Value Decomposition (SVD) of the covariance matrix of the data set.
Redfin uses Standardized (or Auto-scaled) PCA, meaning that the variance of each variable is normalized before the principal components are calculated. This is appropriate when the variables are not directly comparable, e.g. if they have different units.
PCA for the analysis of 2DGE data
For the purposes of PCA, each sample (gel image) constitutes one data point, described by the volumes of the protein spots found in the image. So each sample is described by several thousand variables, each with its own variance (gel-to-gel spot volume difference). Finding the major principal components means finding the combinations of spot volumes that best describe the gel-to-gel differences. The loadings of the major principal components, i.e. the information about which spots/proteins contribute to each principal component, can indicate which proteins vary the most between samples. The scores for each sample characterize that sample in terms of the principal components. Thus, samples with similar scores are in some sense similar. If the samples are shown in a 2- or 3-dimensional graph with the first two (or three) principal components on the axes, samples with similar behaviour will tend to "sit together".
The PCA plot in REDFIN
The PCA in REDFIN is performed using the samples in the currently selected comparison and the spots in the current list as input. This means that if not all gels are included in the groups in the current comparison, you will be studying a subset of your data. Applying filters to the list of proteins means that you select a subset of the possible variables (spot volumes) describing your samples. This will of course affect the results shown in the PCA plot and their interpretation. Sorting the list of proteins has no effect on the PCA. No direct information about the groups is used for the PCA.
The axes in the 3D-plot represent the three most important principal components. Each principal component is a linear combination of the volumes of the proteins, i.e. the score for a particular sample (its position along the principal component axis) is calculated by summing the spot volumes found in that gel image, weighted by constant (positive or negative) factors that are different for each protein ID. By moving the mouse over each axis, you can see how much of the total variance this axis accounts for, and the IDs of the proteins with the largest weights in the principal component. Note that distances along different axes cannot be directly compared, as they represent completely uncorrelated variables, i.e. they represent changes in the volumes of different proteins. The plot shows the average position of the samples in each group, with an option to show all the individual samples. Things to look for could be outliers, or samples that cluster in the graph, which means the samples are similar or vary in a correlated way.
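The weighted sum behind a sample's position on an axis, and the information shown when hovering over an axis, can be sketched as follows. All names (`score`, `top_proteins`, `variance_fractions`), the protein IDs and the numbers are hypothetical and chosen only for illustration. The spot volumes are assumed to have been scaled already, and the list of component variances is assumed to cover all components, not just the three that are plotted.

```python
def score(volumes, weights):
    """Position of one sample along one principal component axis:
    the sum of its (scaled) spot volumes, each multiplied by the
    constant weight of that protein in the component."""
    return sum(weights[protein] * volumes[protein] for protein in weights)

def top_proteins(weights, n=3):
    """IDs of the n proteins with the largest absolute weight."""
    return sorted(weights, key=lambda p: abs(weights[p]), reverse=True)[:n]

def variance_fractions(component_variances):
    """Share of the total variance accounted for by each component."""
    total = sum(component_variances)
    return [v / total for v in component_variances]

# Hypothetical weights of three proteins in one principal component.
weights = {"P1": 0.8, "P2": -0.5, "P3": 0.1}
# Hypothetical scaled spot volumes of the same proteins in one gel.
volumes = {"P1": 0.5, "P2": 0.2, "P3": -1.0}

sample_score = score(volumes, weights)
largest = top_proteins(weights, n=2)
fractions = variance_fractions([6.0, 3.0, 1.0])
```

In this toy case the sample's score is about 0.2, the two proteins with the largest weights are P1 and P2, and the first component accounts for 60% of the total variance.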
Comments
Standardization (data centering and scaling):
Redfin uses Standardized (or Auto-scaled) PCA, meaning that the variance of each variable is normalized before the principal components are calculated. This is appropriate when the variables are not directly comparable, e.g. if they have different units. We do not want to directly compare the sample-to-sample variation in a rare protein with the variation in an abundant protein. The assumption is that the relative variance is comparable for different proteins, rather than the absolute variance (e.g. ±10% rather than ±10 pmol). In effect, we are giving rare proteins a higher weight, to put them on an equal footing with abundant proteins. The results will be different, and arguably less informative, if you use PCA without scaling: in most cases you will then find that the most abundant proteins are the most important in describing the differences between samples. The scaling is performed as follows: for each variable (protein), the mean value (the mean of the spot volume in all samples) is subtracted, and the values are then divided by the norm of the centered values.
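The scaling step described above can be sketched in a few lines of Python. This is a minimal illustration, not REDFIN's own code: the function name `autoscale` and the spot volumes are invented, and "the norm" is assumed here to be the Euclidean norm.

```python
import math

def autoscale(columns):
    """Center each variable and divide by the norm of the centered values.

    `columns` is a list of variables, each a list with one spot volume
    per sample. The Euclidean norm is an assumption made for this sketch.
    """
    scaled = []
    for values in columns:
        mean = sum(values) / len(values)
        centered = [v - mean for v in values]
        norm = math.sqrt(sum(c * c for c in centered))
        if norm == 0.0:
            # A constant variable has no variance; leave it at zero.
            scaled.append([0.0] * len(values))
        else:
            scaled.append([c / norm for c in centered])
    return scaled

# Hypothetical volumes of an abundant and a rare protein in four gels.
# Both vary by the same relative amount (plus or minus 10%).
abundant = [1000.0, 1100.0, 900.0, 1000.0]
rare = [10.0, 11.0, 9.0, 10.0]
scaled_abundant, scaled_rare = autoscale([abundant, rare])
```

After scaling, the two proteins have practically identical values, so the rare protein carries the same weight in the PCA as the abundant one even though its absolute variation is a hundred times smaller.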
A priori information:
You do not usually want to include any knowledge about sample relations in the PCA, since this will limit its exploratory power. Although Redfin will not use the group setup information directly, you can include such information indirectly if you filter your protein list on fold change or ANOVA p-value. You then choose to use only proteins with a known difference between the groups, which will tend to separate the groups in a PCA plot. Similarly, you need to be very careful if you choose to use only the protein spots you have marked as interesting (“star rating” in Redfin). In the PCA, you will tend to discover only what you already knew, and the results will have no power to confirm/corroborate your findings.
"Missing values":
The standard Ludesi protocol for analyzing gel images does not add a spot to the data set if it is not possible to correctly measure the spot volume in the image. This commonly happens for technical reasons, e.g. if the gel is damaged in that region, if the spot staining is too weak for detection, or if the spot has not separated from nearby spots. In this way we avoid including a value that we know is the result of a spurious measurement, which would only add noise to the data set. The downside is that we will not have a spot value for every possible protein spot in every gel, and these “missing values” must be handled with care in some types of statistical analysis. Most multivariate methods, such as PCA, assume that all samples (gels) are described by the same set of variables (spot volumes). If the PCA plot is created from a list of spots that has not been filtered on “Presence”, all “missing values” will be replaced by zeroes. This makes it likely that these spots will show up as being highly significant in the PCA, which is not unreasonable since these spots are certainly something that is very different between the gels. However, you will usually want to exclude these spots (by setting a “Presence” cut-off of 100%), since spots are most often missing due to technical deficiencies rather than a biological change.
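The two ways of handling missing values described above can be sketched as follows. The data layout (a dictionary from spot ID to a list of volumes, with `None` marking a missing value) and the function names `presence`, `filter_by_presence` and `zero_fill` are assumptions made for this example, not REDFIN's internal representation.

```python
def presence(values):
    """Fraction of gels in which the spot was measured."""
    return sum(v is not None for v in values) / len(values)

def filter_by_presence(spots, cutoff=1.0):
    """Keep only spots whose presence is at least `cutoff`
    (1.0 corresponds to a cut-off of 100%)."""
    return {spot_id: values for spot_id, values in spots.items()
            if presence(values) >= cutoff}

def zero_fill(spots):
    """Replace every missing value by zero, as happens when the
    list has not been filtered on presence."""
    return {spot_id: [0.0 if v is None else v for v in values]
            for spot_id, values in spots.items()}

# Hypothetical spot volumes in four gels; S2 is missing in the second gel.
spots = {
    "S1": [120.0, 130.0, 125.0, 118.0],
    "S2": [40.0, None, 42.0, 39.0],
}

complete_only = filter_by_presence(spots, cutoff=1.0)
filled = zero_fill(spots)
```

With the 100% cut-off only S1 remains. With zero filling, S2 drops from around 40 to 0 in one gel, a change far larger than its ordinary gel-to-gel variation, which is why such spots tend to dominate the principal components.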
Technical replicates:
Technical replicates are samples that are taken from the same biological specimen. The variance between these samples is then due only to technical factors (e.g. gel preparation and image analysis). In most statistical methods, technical replicates should not be treated as independent samples. However, in order to be able to explore technical variance as well as biological differences, the PCA plot in REDFIN shows technical replicates as separate samples. Typically you would like the technical replicates to cluster in the PCA plot, which means that the technical variance is smaller than the biological variance.
