Multivariate data analysis easily retrieves insight from a wealth of data

Pharmaceutical Technology Europe

Pharmaceutical Technology Europe, 02-01-2010, Volume 22, Issue 2

Statistics are often viewed as confusing and complicated, but multivariate data analysis (MVA) methods can be used to amass knowledge simply.

Traditionally, studies are performed and analysed following a univariate (one variable at a time) approach. The big advantage of this is that simple 2D plots can be used to assess cause and effect relationships, and the corresponding statistics are straightforward too. The production environment, however, is never univariate, and interactions between parameters should be expected. In the pharmaceutical arena, this situation has been well recognised: guidelines such as ICH Q8 on Pharmaceutical Development and ICH Q10 on Pharmaceutical Quality System explicitly mention the multidimensional design space in which product performance should be tested to assure quality (1, 2). In this context, trend analyses of the manufacturing process performance and its products have been mentioned as an important tool for innovation and continuous improvement (3). A potential complicating factor with multidimensional data, however, is that such data cannot be inspected visually in their raw form, so other ways are needed to represent the results.

This is where multivariate statistics can help: they allow powerful analysis of multidimensional data and, at the same time, present the resulting knowledge at a single glance.

From dull tables to essential information

For a process running for several years, a wealth of data is usually stored in databases containing continuously measured data and routine-based quality control (QC) data, but it is extremely difficult to obtain useful information from such an intimidating volume of numbers and other data. Multivariate data analysis (MVA) is an approach that converts data into knowledge by using data exploration techniques, without narrowing the focus solely to allegedly unknown aspects. This knowledge can then be represented visually for human interpretation.

Historical data can be analysed using MVA to learn from the past, which can help to solve current problems, avoid future ones, or make a validation study of a similar production process or compound quicker and cheaper. Analysing historical data can also avoid, or shorten, new studies, which are often expensive. When visualised properly, extended sets of data, such as dull and perhaps confusing tables, can be changed into spatial representations that clearly depict essential information that is not visible a priori. The methods are widely applicable and can be used, for instance, for measuring the quality and authenticity of samples, or for monitoring a production process.

MVA for a pharmaceutical quality system

MVA helps to transform data into knowledge, which is very useful in a pharmaceutical quality system, as mentioned in ICH Q10. In this scenario, the analysis of historical data can play an important role: these data are already available and very often contain useful information that can be used for innovation and quality improvement. As this has not yet been commonly recognised, the following case study demonstrates its advantage.

Case study: exploiting historical data for a pharmaceutical formulation


An obvious goal in pharmaceutical formulation is to maximise the yield of production. With time, however, the maximum yield of production processes is usually not maintained because of the dropout of processing units or unknown deviations from target operation. A better understanding of the production process can help define well-considered optimising steps to reduce dropout, and maintain a high level of quality and yield.

Figure 1: Schematic overview of the studied tablet production process.

For this case study, data on the tablet production process were stored in a database for several years. The process consisted of a high-shear wet granulation process, followed by drying, screening, tablet production and film coating (Figure 1). Principal Component Analysis (PCA) was used as an MVA tool to analyse these data. The principles of PCA, including useful references, are described in the sidebar.

Sidebar: Principal Component Analysis

If data are highly correlated, only a few principal components (PCs; linear combinations of the original variables) are needed to reproduce the original data sufficiently. In this example, the first two PCs (PC1 and PC2) describe 38% of the total variance in a data set that was originally described by 25 variables: PC1 explains 25% of the total variance and PC2 explains 13%. The PCA results are visualised in a biplot, in which both scores and loadings are plotted. Figure 2 presents these results for three selected variables: thickness, yield and water content. The dots reflect the scores and the red triangles reflect the loadings. The first step is to look at the scores.
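The quantities named above (scores, loadings and the fraction of variance explained per PC) can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are synthetic stand-ins for the 25 process variables, the function name `pca` is hypothetical, and the autoscaling step (zero mean, unit variance per variable) is an assumption, since the preprocessing used in the case study is not described.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via singular value decomposition of autoscaled data.

    Autoscaling (zero mean, unit variance per column) is an assumption
    made for this sketch; the case study does not state its scaling.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)                    # variance fraction per PC
    scores = U[:, :n_components] * s[:n_components]    # one row per sample (batch)
    loadings = Vt[:n_components].T                     # one row per variable
    return scores, loadings, explained

# Synthetic stand-in for a historical data set: 120 batches, 25 variables,
# driven by two hidden factors plus noise (all values invented).
rng = np.random.default_rng(0)
n_batches, n_variables = 120, 25
latent = rng.normal(size=(n_batches, 2))
mixing = rng.normal(size=(2, n_variables))
X = latent @ mixing + rng.normal(scale=2.0, size=(n_batches, n_variables))

scores, loadings, explained = pca(X)
print("Variance explained by PC1 and PC2:", explained[:2].round(2))
```

Plotting the two columns of `scores` as dots and the rows of `loadings` as markers in the same axes gives a biplot of the kind described for Figure 2.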

The higher the similarity between samples, the smaller the spatial separation between the scores in Figure 2. In the direction of PC1, three separated groups can be distinguished. These groups can be identified as the three dose strength groups in the data set (LD = low dosage; MD = middle dosage; HD = high dosage). Variation within samples of the same dose is much lower than the variation between samples of different doses. In the direction of PC2, some grouping is also seen, which can be attributed to the effect of wet kneading time during granulation. Although there is a difference between doses (as is seen in the first PC), the effect of kneading time is similar for each dose: higher scores are found for long kneading times than for short kneading times.

Figure 2: Biplot of PCA scores and loadings based on all available historical data. LD=low dosage; MD=middle dosage; HD=high dosage. S=short kneading time; L=long kneading time.

The biplot can help interpret the trends seen. Variables that point in the same direction show a high positive correlation, whereas variables that point in opposite directions show a high negative correlation. Variables that are plotted perpendicular (orthogonal) to each other are uncorrelated. The higher the loading, the more the variable contributes to the PC. For instance, a high loading is given to the variable yield on PC1; the loading for yield points in the direction of the LD samples, which means that LD samples have higher values for yield than HD samples. Therefore, yield is negatively correlated with dose.
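The link between the angle of two loadings and the correlation of the corresponding variables can be checked numerically. The sketch below uses invented data in which a hypothetical hidden driver (`dose`) raises thickness and lowers yield, while water content is unrelated. It assumes autoscaled data and loadings scaled by the singular values; the angle then only approximates the correlation, and the approximation is good only when the plotted PCs capture most of the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
dose = rng.normal(size=n)                       # hypothetical hidden driver
thickness = dose + 0.3 * rng.normal(size=n)     # rises with dose
yield_ = -dose + 0.3 * rng.normal(size=n)       # falls with dose
water = rng.normal(size=n)                      # unrelated to dose
X = np.column_stack([yield_, thickness, water])
names = ["yield", "thickness", "water content"]

# Autoscale, then take the first two PCs.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
# Loadings scaled by the singular values, so that inner products between
# rows approximate the correlations between the variables.
loadings = (Vt[:2].T * s[:2]) / np.sqrt(n - 1)

def angle_deg(a, b):
    """Angle between two loading vectors in degrees."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

corr = np.corrcoef(X, rowvar=False)
angles = {}
for i in range(3):
    for j in range(i + 1, 3):
        angles[(names[i], names[j])] = angle_deg(loadings[i], loadings[j])
        print(f"{names[i]} vs {names[j]}: "
              f"angle {angles[(names[i], names[j])]:.0f} deg, "
              f"correlation {corr[i, j]:+.2f}")
```

In this construction, yield and thickness point in nearly opposite directions, and water content lies roughly at a right angle to both, which mirrors the reading rules given above.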

In the opposite direction, high loadings are found for thickness: the higher the dosage, the thicker the tablet. Therefore, thickness is positively correlated with dose. As a result, yield and thickness are negatively correlated.

Water content is located right in the middle, meaning that it contributes neither to the separation between doses nor to the grouping by kneading time. It also shows no correlation with yield or thickness. Therefore, alterations in water content do not contribute to yield.

So the visualised PCA results provide information on structures in the data and on correlations between variables such as yield, thickness and water content, as well as sample properties such as dosage or kneading time. This can help formulate new strategies to improve the production process. However, there may be a risk of jumping to conclusions too quickly.

Figure 2 describes the differences between doses and kneading times, but it is not focused on correlations within each dose group. There might be another correlation structure that explains the variation within each dosage group. It could be possible to investigate this within-dosage correlation by looking at the higher order PCs, but in this case it is easier and clearer to perform a separate PCA on each dose group [Figure 3(a), (b) and (c)]. Focusing again on the example variables yield, thickness and water content, it can be seen that the correlation structure differs between the three dosages.
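Why a pooled PCA can hide the within-group structure is easy to show with invented numbers. In the sketch below, the group means make yield fall and thickness rise with dose, while inside each group a shared factor moves both variables in the same direction. All values, and the helper name `first_pc_loadings`, are hypothetical; the example is not meant to reproduce Figure 3, only the reasoning for analysing each dose group separately.

```python
import numpy as np

def first_pc_loadings(X):
    """Loadings of PC1 for autoscaled data (columns: yield, thickness, water)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(2)
# Invented group means: (yield, thickness) per dose group.
group_means = {"LD": (95.0, 3.0), "MD": (90.0, 4.0), "HD": (85.0, 5.0)}
blocks, labels = [], []
for dose, (mean_yield, mean_thickness) in group_means.items():
    shared = rng.normal(size=60)   # within-group factor moving both variables
    yield_ = mean_yield + 0.5 * shared + 0.2 * rng.normal(size=60)
    thickness = mean_thickness + 0.1 * shared + 0.04 * rng.normal(size=60)
    water = 2.0 + 0.1 * rng.normal(size=60)
    blocks.append(np.column_stack([yield_, thickness, water]))
    labels += [dose] * 60
X = np.vstack(blocks)
labels = np.array(labels)

# One PCA on all batches, then one PCA per dose group.
pooled = first_pc_loadings(X)
per_group = {dose: first_pc_loadings(X[labels == dose]) for dose in group_means}

print("pooled PC1 loadings (yield, thickness):", pooled[:2].round(2))
for dose, load in per_group.items():
    print(f"{dose} PC1 loadings (yield, thickness):", load[:2].round(2))
```

In the pooled analysis the PC1 loadings of yield and thickness have opposite signs, whereas within each group they share the same sign. The sign of an individual loading is arbitrary in PCA; only the relative signs carry meaning.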

Figure 3: Biplot of PCA scores and loadings based on available historical data per dose group (S=short kneading time, L=long kneading time).

For the lowest dose [LD; Figure 3(a)], the variables yield, water content and thickness point in the same direction, indicating that there is a very high correlation between the three. However, the loadings of these variables are orthogonal to PC1, meaning the variables have no contribution to the separation between short and long kneading time in PC1.

A different correlation structure is revealed for the mid-dose group [MD; Figure 3(b)]. Water content is orthogonal to yield and thickness, meaning there is no correlation between water content and the other two parameters. Yield has a positive loading on PC1, which is the direction of difference in kneading time. Therefore, for MD, it can be seen that long kneading times correspond to a high yield, which was not seen for LD.

Finally, a different correlation is seen for the highest dosage [HD; Figure 3(c)]. As with LD, yield and thickness point in the same direction, which means there is a positive correlation between the two parameters. Conversely, water content is partly negatively correlated with yield and thickness, as it points in roughly the opposite direction. The angle is between 90° and 180°, so water content is neither orthogonal nor exactly opposite to yield and thickness. The relation of thickness to the separation in kneading time is opposite to its relation for MD: a longer kneading time correlates with thicker tablets than a short kneading time does.

Straightforward, understandable and cheap

The above case study demonstrates the advantage of using historical data to better understand the correlation between variables in the process of pharmaceutical tablet production using PCA as an analysis tool. Based on only a few plots, which were generated in close cooperation between the statistician and the technological expert, insight was generated into the behaviour of the production process for the different doses and leads were identified as to how to increase the yield of production.

By enabling the visualisation of complex relationships, MVA allows processes to be better understood. As a consequence, new development strategies or adapted processing can be identified for product, process or quality improvement (corresponding to ICH Q10).

PCA is not the only possible tool that can be used. Regression and/or classification analyses using, for instance, Partial Least Squares (PLS) regression, are extremely useful in cases where process measurements have to be related to product quality (such as tablet dissolution). These MVA analyses can be performed easily and quickly, with only minor data requirements.
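As an illustration of how process measurements can be related to a quality attribute, the sketch below implements PLS regression for a single response (often called PLS1) with a NIPALS-style deflation. It is a minimal sketch under assumptions, not the method used in any particular study: the function names `pls1_fit` and `pls1_predict` are hypothetical, the six process variables are synthetic, and the response `y` merely stands in for a quality attribute such as dissolution.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS regression for one response using NIPALS-style deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)       # direction in X that covaries most with y
        t = Xk @ w                   # scores for this component
        tt = t @ t
        p = Xk.T @ t / tt            # X loadings
        c = yk @ t / tt              # y loading
        Xk = Xk - np.outer(t, p)     # remove the explained part from X
        yk = yk - c * t              # and from y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients for centred X
    return coef, x_mean, y_mean

def pls1_predict(X, coef, x_mean, y_mean):
    return (X - x_mean) @ coef + y_mean

# Synthetic example: six correlated process variables driven by two hidden
# factors, and a quality attribute that depends on those factors.
rng = np.random.default_rng(3)
n = 80
T = rng.normal(size=(n, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
X = T @ mixing + 0.1 * rng.normal(size=(n, 6))
y = T @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=n)

coef, x_mean, y_mean = pls1_fit(X, y, n_components=2)
y_hat = pls1_predict(X, coef, x_mean, y_mean)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print("R^2 on the training data with two components:", round(r2, 3))
```

Two components are used here because the synthetic data were built from two hidden factors. With real data, the number of components is usually chosen by cross-validation, and the fit on training data overstates predictive performance.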

The advantage of using historical data or routinely measured QC data is that they are readily available, and often mean that new studies are not required or can take place on a smaller scale. In addition, the outcomes are straightforward and intuitively understandable. Although in practice historical data sets may be missing data entries for specific batches or parameters, a great deal of information can still be obtained if this is accounted for.


This work has been performed under the framework of the Dutch Top Institute Pharma (project D6-203).

Carina Rubingh is Biostatistician in the group of Analytical Information Sciences at the Department of Analytical Research, TNO Quality of Life (The Netherlands).

Kees van de Voort Maarschalk is Director Oral and Polymeric Product Development at Schering Plough (The Netherlands) and Professor Industrial Pharmacy at the University of Groningen (The Netherlands).

Uwe Thissen is Project Manager and Senior Scientist in the group of Analytical Information Sciences, Department of Analytical Research, TNO Quality of Life, Business Unit Quality and Safety, PO Box 360, NL-3700 AJ Zeist (The Netherlands). Tel. +31 30 694 4002


1. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, ICH Q8 Pharmaceutical Development.

2. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, ICH Q10 Pharmaceutical Quality System.

3. W.R. Dillon and M. Goldstein, Multivariate Analysis, Methods and Applications (John Wiley & Sons, NY, USA, 1984).

4. H. Martens and T. Naes, Multivariate Calibration (John Wiley & Sons, Chichester, UK, 1989).

5. D.L. Massart et al., Handbook of Chemometrics and Qualimetrics: Part A (Elsevier, Amsterdam, The Netherlands, 1997).

6. B.G.M. Vandeginste et al., Handbook of Chemometrics and Qualimetrics: Part B (Elsevier, Amsterdam, The Netherlands, 1998).

7. D.L. Massart, Y. Vander Heyden, LCGC Europe, 17(11), 586–591 (2004).