Making data analysis lean

March 1, 2008
Malcolm Moore

Ian Cox

Pharmaceutical Technology Europe

Pharmaceutical Technology Europe, Volume 20, Issue 3

Six sigma, process analytical technology (PAT) and related initiatives are driving greater use of statistical analysis methods to increase process understanding and improve manufacturing capabilities.

Six sigma, in particular, drives broad deployment of statistical thinking to a wider community of users; specifically, the principle that variation in product quality is a function of variation in material, process and equipment parameters. Historically, however, the approach to statistical thinking that has been promoted is theoretical and too complicated for mainstream users. Liddle analysed data from approximately 200 six sigma projects and found a low correlation between statistical complexity and value delivered.1 Typical comments from engineers deploying statistical methods include:

  • "Too much time is spent worrying about the correct application of the method rather than focusing on what the data is telling us."

  • "Our guys run a mile rather than use statistics."

  • "Our managers glaze over when stats methods and results are presented."

  • "Training overheads are huge and we forget how to use most of what we are taught."

  • "Prescribed problem solving approaches curb creativity."

A key consequence is that data analysis is not as efficient and effective as it could be, leading to higher project costs, longer project cycles and frequent push-back.

Liddle proposed a tiered approach to statistical analysis with a pragmatic toolset for mainstream users (also known as six sigma green belts) that is visual and light on numbers, while increasing knowledge about the key material, process and equipment inputs that impact product quality and how they drive variation in product quality.

A lean data analysis process is proposed and a tiered pragmatic toolset is examined via a case study. This approach is likely to have wide appeal to a broad community of engineering and quality groups.

Lean data analysis process

Figure 1 presents a lean data analysis process. We start by identifying the process inputs and outputs that need to be measured; data on these are then collected and managed using measurement system analysis and data management methods. Once the data are clean and free of large measurement errors, visualization methods are used to identify the hot Xs (the process inputs that drive, or are associated with, variation in product quality). Statistical models can be developed for more complicated problems if necessary. Our process knowledge is then revised using the visual and statistical models developed in steps 3 and 4. This increased understanding is then utilized to improve product quality. This paper presents a lean toolset for visual exploration, statistical modelling, knowledge revision and utilization.

Figure 1

Case study

A fictional case study based on simulated data is presented; a copy of the data is available on request from the authors. The scenario is fairly typical of pharmaceutical manufacturing. It is not based on any particular case, but it does try to reflect the realities of analysing and improving pharmaceutical manufacturing processes.

The case study concerns a facility that has been manufacturing an established solid dose product at various concentrations for several years. Current measurement systems are based on storing finished material while offline quality assurance (QA) tests are performed to ensure the finished product meets performance specifications.

The case study focuses on investigating the process for tablet production at a single concentration. The key performance metric is 60 min mean dissolution, which must be no less than 70%. Historically, 15% of production batches fail to meet the 60 min dissolution requirement. Investigations into these failures rarely find an assignable cause.

A team was commissioned to investigate the process and dramatically improve sigma capability. They applied the lean data analysis process in Figure 1 and used a process map to frame the problem, as illustrated in Figure 2. Process mapping was used to identify the key process steps and the set of inputs that were most relevant to the problem and about which information was easy to collect retrospectively. These inputs are identified in Figure 2.

Figure 2

Data on these inputs, along with mean dissolution, were collated for the last 2 years of production batches, which resulted in a data set consisting of 153 rows and 23 columns.

Where appropriate, measurement systems were checked to ensure that measurement variability did not mask patterns resulting from process variation. Simple visualization methods were then used to investigate the relationships between dissolution and the 22 inputs, as illustrated in Figure 3, which shows simple histograms of each variable with the failing batches identified by darker shading. A process input with darker shading concentrated at one end of its range is a potential cause of failures. In our case, shading the processing conditions of lots with a 60 min mean dissolution of less than 70% suggests avoiding:

  • High values of granulation mix speed.

  • High values of granulation spray rate.

  • Low values of drying temperature.

  • High values of drying relative humidity (RH).

  • High values of milling screen size.

  • Low values of blending time.

Figure 3
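The shaded histograms described above can also be summarized numerically by counting failing batches in each bin of an input. The following is a minimal sketch of that idea, not the authors' software: the function name `failure_share_by_bin`, the equal-width binning and the eight batch values are all hypothetical and are not taken from the case study data.

```python
def failure_share_by_bin(values, failed, n_bins=5):
    """Split the range of one input into equal-width bins and return, for each
    bin, a tuple (lower_edge, upper_edge, batch_count, failing_count)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = [[0, 0] for _ in range(n_bins)]
    for value, is_fail in zip(values, failed):
        # The maximum value falls in the last bin rather than a bin of its own.
        idx = min(int((value - lo) / width), n_bins - 1)
        counts[idx][0] += 1
        counts[idx][1] += int(is_fail)
    return [(lo + i * width, lo + (i + 1) * width, n, k)
            for i, (n, k) in enumerate(counts)]


# Hypothetical batches: granulation mix speed (rpm) and 60 min mean dissolution (%).
mix_speed = [210, 225, 240, 250, 260, 270, 280, 290]
dissolution = [82, 80, 78, 75, 72, 69, 66, 64]
failed = [d < 70 for d in dissolution]  # specification: no less than 70%

bins = failure_share_by_bin(mix_speed, failed, n_bins=4)
for lower, upper, n_batches, n_failed in bins:
    print(f"{lower:.0f}-{upper:.0f} rpm: {n_failed} of {n_batches} batches failed")
```

In this made-up data the failures sit entirely in the highest bin, which is the numerical counterpart of darker shading at one end of a histogram.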

Another useful visual exploratory tool is recursive partitioning. This method repeatedly partitions data according to a relationship between the input variables and an output variable, creating a tree of partitions. It finds the critical input variables and a set of cuts (or groupings) of each that best predict the variation in batch failures. Variations of this technique are many and include decision trees, CART, CHAID, C4.5, C5 and others.2
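To make the idea of repeated partitioning concrete, the sketch below grows a small tree on binary pass/fail labels. It is an illustration under stated assumptions only: the article does not say which splitting criterion was used, so Gini impurity is assumed here, and the function names, input names and batch values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = failing batch)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels, names):
    """Return (weighted_gini, input_name, threshold) for the best cut of the
    form 'input < threshold', or None if no input has two distinct values."""
    best = None
    for j, name in enumerate(names):
        distinct = sorted({row[j] for row in rows})
        for a, b in zip(distinct, distinct[1:]):
            t = (a + b) / 2  # candidate cut halfway between neighbouring values
            left = [y for row, y in zip(rows, labels) if row[j] < t]
            right = [y for row, y in zip(rows, labels) if row[j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, name, t)
    return best


def grow_tree(rows, labels, names, max_depth=2):
    """Recursively partition the batches until a node is pure or depth runs out."""
    leaf = {"batches": len(labels), "failures": sum(labels)}
    if max_depth == 0 or sum(labels) in (0, len(labels)):
        return leaf
    split = best_split(rows, labels, names)
    if split is None:
        return leaf
    _, name, t = split
    j = names.index(name)
    below = [(r, y) for r, y in zip(rows, labels) if r[j] < t]
    above = [(r, y) for r, y in zip(rows, labels) if r[j] >= t]
    return {
        "split": f"{name} < {t}",
        "below": grow_tree([r for r, _ in below], [y for _, y in below], names, max_depth - 1),
        "at_or_above": grow_tree([r for r, _ in above], [y for _, y in above], names, max_depth - 1),
    }


# Hypothetical batches: (granulation mix speed, milling impeller speed) in rpm.
names = ["mix_speed", "impeller_speed"]
rows = [(230, 480), (240, 500), (245, 530), (250, 510),
        (260, 490), (265, 540), (270, 520), (280, 550)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = batch failed dissolution

tree = grow_tree(rows, labels, names)
print(tree)
```

Each leaf reports how many batches fell into it and how many of them failed, which mirrors the way the branches of a decision tree are read in terms of batch counts and failures.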

Figure 4 shows the resulting decision tree using recursive partitioning to explore the main drivers of batch failures. The hot Xs are confirmed as:

  • granulation mix speed

  • milling screen size

  • milling impeller speed

  • granulation chopper speed

  • blending rate

  • granulation mix time

  • blending lubricant addition

  • compression force.

Figure 4

Five of the decision tree's 10 branches identify processing conditions with no associated batch failures. Starting from the rightmost branch and working inwards, these five branches are:

  • 72 batches were processed with a granulation mix speed of less than 254 rpm and milling impeller speed of less than 519 rpm.

  • 12 batches were processed with a granulation mix speed of less than 254 rpm, milling impeller speed of greater than or equal to 519 rpm, and blending rate of greater than or equal to 10.1.

  • 5 batches were processed with a granulation mix speed of less than 254 rpm, milling impeller speed of greater than or equal to 519 rpm, blending rate of less than 10.1, and compression force of greater than or equal to 45.5.

  • 21 batches were processed with a granulation mix speed of greater than or equal to 254 rpm, milling screen size of 5 or 6 mm, and a blending rate of less than 10.1 rpm.

  • 8 batches were processed with a granulation mix speed of greater than or equal to 254 rpm, milling screen size of 5 or 6 mm, a blending rate of greater than or equal to 10.1 rpm, and blending lubrication of less than 20.1 min.

These two visual data exploration methods collectively identify a subset of inputs worthy of further investigation: granulation mix speed, milling screen size, milling impeller speed, granulation chopper speed, blending rate, granulation mix time, blending lubricant addition and compression force. These methods help bring the principles of statistical thinking to the mainstream, particularly those of modelling variation in process outputs and identifying the key drivers of process variation. They have advantages over conventional statistical approaches because they simplify the identification of key process variables and aid communication, thereby enabling a wider community to gain insight into the potential key relationships in data. If a solution were to be selected at this stage of the analysis, and in the absence of other criteria, we would choose the rightmost branch of the decision tree because considerably more batches have been processed under these conditions, which lends greater support to the robustness of this solution.

The effects of this subset of input variables upon 60 min mean dissolution were investigated in more detail using multiple regression modelling, a technique that enables one or more quality characteristics, such as dissolution, to be modelled as a function of several process inputs.3,4 The results from the multiple regression modelling are summarized in Figure 5. The model explains 66% of the variation in mean dissolution and the effect tests summary ranks the hot Xs in the following order of importance:

  • granulation mix speed

  • milling screen size

  • drying relative humidity

  • blending time

  • drying temperature

  • granulation spray rate

  • granulation binder addition.

Figure 5

Furthermore, RH has a second-order (quadratic) effect on 60 min mean dissolution.
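A regression model with a quadratic term can be fitted by ordinary least squares once the squared input is added as an extra column. The sketch below shows this with the standard library only. It is not the model in Figure 5: the coefficients, the centred and scaled inputs `s` (mix speed) and `h` (RH), and the noise-free synthetic responses are all hypothetical, chosen so that the fit can be checked against known values.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_least_squares(design, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(design, y, coefs):
    """Proportion of the variation in y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    fitted = [sum(c * v for c, v in zip(coefs, row)) for row in design]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical centred, scaled settings: s = mix speed, h = relative humidity.
settings = [(-3, 0), (-2, 2), (-1, -2), (0, 1), (1, -1),
            (2, 0), (3, 2), (0, -2), (1, 1), (-2, -1)]
true_coefs = [75.0, -2.0, -1.5, -0.5]  # intercept, s, h, h squared (made up)

# Design matrix columns: intercept, s, h and the quadratic term h*h.
design = [[1.0, s, h, h * h] for s, h in settings]
y = [sum(c * v for c, v in zip(true_coefs, row)) for row in design]

coefs = fit_least_squares(design, y)
r2 = r_squared(design, y, coefs)
print([round(c, 3) for c in coefs], round(r2, 3))
```

Because the synthetic responses contain no noise, the fit reproduces the assumed coefficients almost exactly; real batch data would leave unexplained variation, as the 66% figure reported for the case study indicates.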

The bottom part of Figure 5 illustrates the results of Monte Carlo simulations performed using the regression model as the transfer function between the key process inputs and 60 min mean dissolution.5 Predictions of the distribution of 60 min mean dissolution were made using 5000 simulated batches, with each key process variable set at a proposed optimum target value with a control tolerance about that target. For example, in the case of granulation mix speed, the target was set to 220 rpm with a control tolerance defined by a standard deviation of 5 rpm. Assuming the 5000 values of granulation mix speed follow a normal distribution, the effective control range for granulation mix speed is 220 ± 15 rpm (220 ± 3 standard deviations). Distributions other than the normal can also be used to simulate the control range of the key process inputs. The optimum control ranges, assuming a normal distribution for each key process input, are illustrated graphically at the bottom of Figure 5. This made it possible to investigate the robustness of 60 min mean dissolution to the proposed variation in the settings of the key process variables. A dissolution failure free process is predicted by operating the process under the conditions indicated in Table 1 (a dissolution defect rate of 0 dpmo [defects per million opportunities] is predicted and illustrated in the bottom right corner of Figure 5).
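The simulation described above can be sketched in a few lines: draw each input from a normal distribution about its target, pass the draws through a transfer function, and count the simulated batches that fall below the specification. The 220 rpm target, 5 rpm standard deviation, 5000 batches and 70% limit come from the text. Everything else is an assumption for illustration, including the function names, the RH target and tolerance, and the linear `hypothetical_transfer` function, which is not the regression model in Figure 5.

```python
import random


def control_range(target, sd, k=3):
    """Effective control range taken as target +/- k standard deviations."""
    return (target - k * sd, target + k * sd)


def simulate_dpmo(transfer, inputs, lower_spec, n=5000, seed=1):
    """Simulate n batches and return the predicted defects per million opportunities.

    inputs maps each input name to a (target, standard_deviation) pair; every
    input is drawn independently from a normal distribution.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    failures = 0
    for _ in range(n):
        settings = {name: rng.gauss(target, sd)
                    for name, (target, sd) in inputs.items()}
        if transfer(**settings) < lower_spec:
            failures += 1
    return failures / n * 1_000_000


def hypothetical_transfer(mix_speed, drying_rh):
    """Made-up linear transfer function returning 60 min mean dissolution (%)."""
    return 85.0 - 0.15 * (mix_speed - 220.0) - 0.4 * (drying_rh - 40.0)


inputs = {
    "mix_speed": (220.0, 5.0),  # target and standard deviation from the text
    "drying_rh": (40.0, 2.0),   # hypothetical target and tolerance
}

print("Mix speed control range (rpm):", control_range(220, 5))
dpmo = simulate_dpmo(hypothetical_transfer, inputs, lower_spec=70.0)
print("Predicted dissolution defect rate (dpmo):", dpmo)
```

Tightening or widening a standard deviation in `inputs` and rerunning shows how sensitive the predicted defect rate is to the control tolerance of that input, which is the robustness question the simulation is meant to answer.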

Table 1 Predicting a dissolution failure free process.


Visual data exploration was used to identify critical process parameters along with data mining to identify tighter control ranges of key parameters that result in more consistent product quality. These methods are capable of being deployed intelligently by a wide community of users. Multiple regression modelling and Monte Carlo simulations identified tighter control regions of key process variables that predict a near defect free process with regard to tablet dissolution.

Malcolm Moore is technical manager at SAS (UK). He is a design of experiments specialist, master black belt and six sigma analyst in the company's JMP software business unit. He works with clients to integrate statistical methods and software into R&D, quality improvement, defect reduction, cycle time reduction, and corporate six sigma consulting activities for a variety of industries, including pharmaceutical, semiconductors, finance and transactional. He has lectured in statistical methods at Newcastle and Edinburgh Universities. Previously, Malcolm has worked at ICI, BBN Software Solutions, Brooks Automation and Light Pharma. He referees papers with a statistical analysis content for the IEEE Transactions on Semiconductor Manufacturing. Malcolm received his PhD in design of non-linear experiments at London University. He is a fellow of the Royal Statistical Society in the UK.

Ian Cox is marketing manager within the JMP Division of SAS. He has worked for Digital and Motorola, was a six sigma Black Belt and worked with Motorola University. Before joining SAS 6 years ago, Ian worked for BBN Software Solutions, and he has consulted for many companies on data analysis, process control and experimental design. He is part of the SPIE organizing committee for process control, and referees related papers for the IEEE Transactions on Semiconductor Manufacturing. He has been a Visiting Fellow at Cranfield University and a Fellow of the Royal Statistical Society in the UK, and has a PhD in theoretical physics.


1. A. Liddle, "Lean and Pragmatic Statistical Tools", IQPC Six Sigma Conference, Amsterdam, The Netherlands (2007).

2. Chapter 34 of JMP Statistics and Graphics Guide (SAS Institute Inc., SAS Campus Drive, Cary, NC, USA, 2007).

3. N. Draper and H. Smith, Applied Regression Analysis, 2nd Edition (John Wiley and Sons Inc., New York, NY, USA, 1981).

4. D.C. Montgomery and E.A. Peck, Introduction to Linear Regression Analysis (John Wiley and Sons Inc., New York, NY, USA, 1982).

5. Chapter 15 of JMP Statistics and Graphics Guide (SAS Institute Inc., SAS Campus Drive, Cary, NC, USA, 2007).