Pooling samples: some considerations

Posted by ludesi in 2DE Knowledge Base

A question that keeps coming up concerns the pooling of samples, especially when pooling results in an experiment where each group for a given condition consists only of technical replicates.

Pooling samples is especially common when the amount of available protein material is very small. Out of necessity, samples are pooled and then split into different aliquots prior to separation by 2D gel electrophoresis. If you don’t have several pools of the same condition, you will end up with only technical replicates in each group. Of course, an ideal experimental design would always consist of a mix of biological and technical replicates, of which the biological replicates bear more weight for answering the underlying biological question. However, the reality of budget and sample limitations often forces us to compromise.

So if you are pooling your samples, here are some things to be aware of.

When you pool your samples, you lose information about the biological variation. To understand why this is a big disadvantage, we should take a step back and consider what it is we would like to achieve and how we propose to do that.

In most proteomics experiments we try to find proteins that are differentially expressed between specific pre-defined conditions, for example proteins that are up- or down-regulated in diseased mice relative to healthy mice. The underlying biological population we are using to study the response would be a certain strain of mice. Of course it is unrealistic to test every single mouse in that population. So, in order to have statistical confidence in what we are seeing in our experiment, we randomly sample from the underlying biological population (biological replicates) and use the variation within our sample (the biological variation) to statistically validate our differentially expressed proteins.

Thus, the inherent biological variance is an important factor in drawing statistically sound conclusions. When you pool samples, you average out this variance. Even when you split your pool into different aliquots and run these on individual gels, the variance you are left with is purely technical and no longer biological. Your replicates are no longer independent, and this also needs to be taken into account when you do the statistics.

Of course, one could argue that having technical replicates stemming from a pooled sample is better than having conventional technical replicates of a single biological sample. After all, the average volume for a given protein in our pool is made up of the different individual volumes, and therefore this average value can be assumed to be closer to the “truth”.

Let's look at a more specific example. Say we pool 5 biological samples for group A and 5 for group B, and the volume for a given protein (which is now of course the average volume of our 5 pooled samples) is 5 in group A and 10 in group B. Just looking at the numbers, this looks like a nice and clear change in expression for our protein. However, what we don’t know is what the variation amongst the biological samples for this protein really was.

Were the volumes of that protein distributed like this in the first group:

1, 1, 4, 6, 13 (very large biological variance but average value 5)

Or like this:

4, 4, 5, 6, 6 (very small biological variance with average value 5)

When you are looking, for example, for biological markers, it matters if the expression of that protein fluctuates a lot in the underlying biological population or if it is very tight. That is why we need to estimate the biological variation based on our randomly sampled observations to draw statistically sound conclusions about the underlying biological population.
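To put numbers on the two scenarios above, here is a minimal sketch in Python that computes the mean and the sample variance of each list of volumes. The variable and function names are our own for illustration; the volumes are the ones quoted above.

```python
from statistics import mean, variance

# Individual spot volumes from the two scenarios in the post.
wide = [1, 1, 4, 6, 13]    # large biological variance
tight = [4, 4, 5, 6, 6]    # small biological variance

def summarize(volumes):
    """Return (mean, sample variance) for a list of spot volumes."""
    return mean(volumes), variance(volumes)

wide_mean, wide_var = summarize(wide)      # mean 5, sample variance 24.5
tight_mean, tight_var = summarize(tight)   # mean 5, sample variance 1

print("wide: ", wide_mean, wide_var)
print("tight:", tight_mean, tight_var)
```

Both lists give a pooled volume of 5, but their variances differ by a factor of more than 20. Once the samples are pooled, only the 5 survives.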

Figure: When pooling samples we are no longer able to see if the biological variance is high or low

But back to our real life scenario, where pooling was simply necessary.

If we have pooled our samples, and now only have technical replicates in a given group, there is no biological variance that we can use in the statistics to calculate a p-value. We can of course use the technical variation, but in that case it is important to know that we have created an artificially low variation, and consequently the p-values will look “better” than they would if we had also included the biological variation. Without being based on the biological variation, the p-values simply don’t have the same relevance.
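The effect can be sketched with a small calculation. The numbers below are invented purely for illustration: group A uses the high-variance volumes from the example above, group B is a made-up set with mean 10, and the technical replicates are hypothetical aliquots of each pool with a small measurement spread. The sketch computes Welch's t-statistic by hand; it is only one of several tests one might use, and `welch_t` is our own helper, not part of any analysis package.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for the difference in means between b and a."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error

# Hypothetical individual (biological) volumes.
# Group A: the high-variance example from the post; group B: invented, mean 10.
bio_a = [1, 1, 4, 6, 13]
bio_b = [2, 6, 10, 12, 20]

# Hypothetical technical replicates of each pool: same means, tiny spread.
tech_a = [4.8, 4.9, 5.0, 5.1, 5.2]
tech_b = [9.8, 9.9, 10.0, 10.1, 10.2]

t_bio = welch_t(bio_a, bio_b)     # roughly 1.3
t_tech = welch_t(tech_a, tech_b)  # roughly 50

print(f"t from biological replicates: {t_bio:.2f}")
print(f"t from technical replicates:  {t_tech:.2f}")
```

The difference in means is 5 in both cases, yet the t-statistic from the technical replicates is far larger, and its p-value would be correspondingly smaller. That p-value only says that the two pools differ reproducibly on the gels. It says nothing about how consistently the protein changes between individuals.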

By knowing about the consequences of pooling, we can be more aware of what the statistics are telling us and how to best interpret them.

Finally, just because you have pooled your samples doesn’t mean that the up- or down-regulated proteins you are seeing are not interesting with respect to the underlying biological population. It just means that you cannot claim this with statistical confidence, which makes the use of additional methods for validating your findings even more important.

We’d love to hear about your thoughts or experiences on this. Feel free to comment on this post.