The Challenge in Bioprocess Development: From Data to Knowledge

November 1, 2007
Elaine Martin, Gary Montague
Pharmaceutical Technology
Volume 2007 Supplement, Issue 6


Biopharmaceutical manufacturing generates a wealth of data from multiple sensors. Consequently, the industry faces the challenge of moving from a data-rich environment to one where the data are translated to information that leads to knowledge. If this knowledge is effectively exploited, it can improve production processes, financial status, and the environmental agenda of a company.

Photo: PT Europe

Addressing the data–information–knowledge chain is not straightforward because there is a series of compounding issues. For example, with respect to the manufacturing process, the data analysis problem is not restricted to one unit operation: the interactions between different units need to be extracted and understood, which requires scientists and engineers to apply advanced statistical methods. The task is complicated by the diverse structures of the data gathered (from batch and continuous operations), which include both qualitative and quantitative measurements as well as temporal and/or end-point measurements. From these data structures, potentially valuable knowledge can be extracted to assist in developing data-based process descriptors. This information can lead to increased process understanding, which allows those working on the design and operational functions to make more informed decisions.

Although many of the problems of translating data to knowledge in the biopharmaceutical industry have been overcome using commercially available tools, the arrival of process analytical technology (PAT) has brought a whole new set of challenges (1). For example, spectroscopic instruments significantly increase the amount of data and the complexity of the knowledge-mining problem. The opening image shows a fermentation vessel with an invasive near-infrared (NIR) probe as well as the standard probes. The complexity of this instrument means that the need for data analysis and its interpretation will continue to grow. Given the overwhelming challenge of extracting knowledge from data, a team-based strategic approach is required. The team needs to include process engineers, scientists, statisticians and, most importantly, a business champion who recognizes the potential benefits. The approach adopted, and the capability of existing technologies for knowledge extraction, depend on whether the development or manufacturing environment is being considered.


In development, at the reaction stage, understanding the predominant reactions that are occurring at various stages throughout the course of a batch is vital. Furthermore, as changes in operating policy are explored, it is necessary to determine when important reaction pathways are significant and when they are affected by reactant limitations or excessive accumulation of nutrients.

From such knowledge, operational policy changes can be made, and new avenues of operation explored. The approach typically adopted uses off-line sample analysis to identify the concentrations of nutrients that process scientists perceive to be influential. However, off-line sample analysis has its limitations, particularly its low sampling frequency, which restricts the amount of available data. It is in this situation that PAT and its related technologies can increase the frequency of available information and provide a detailed fingerprint of the chemical composition.

Figure 1

As previously mentioned, a series of recovery operations is executed following the reaction stage. In the development stage, once it has been determined which unit operations should be used, the standard operating policy for each unit must be specified and, ideally, the performance of the chain must be considered as a whole rather than as a set of isolated units. Again, extracting information from the available data is crucial to achieving informed design and operational decisions. Figure 1 shows that although improved instrumentation leads to increased process knowledge during a development process, the real benefit is not realized until appropriate statistical analyses are applied. The impact this knowledge has on process profitability is important. As Figure 2 illustrates, improved measurement can reduce development times, and if integrated statistical analyses are used, improved operating policies can lead to yield increases and higher profits.

Figure 2

Looking ahead, by embedding the data-based process descriptors within monitoring and optimization algorithms, it will be viable to move toward processes that inherently adopt a quality by design (QbD) concept. The real challenge in using statistical tools to extract process descriptors is that, in the development laboratory, data are available only from a limited set of experiments. Statistical awareness in assessing the experimental results is crucial to guide the development program.


In production, there is a need to ensure consistency of processes and provide tools that give an early warning of operational and process deviations to allow timely corrective action. The knowledge extraction philosophy for production differs from that for development. In production, the spread of data coverage is limited, and the objective is to look for occasional deviations from predominantly "consistent" behavior. In development, the spread of data is significantly larger, and the objective is to search for robust and effective design and operational areas. Consequently, the tools required for translating data into knowledge demand different capabilities, and the implementation of different fundamental algorithms is often a prerequisite. The knowledge extraction "toolbox" must, therefore, possess multiple capabilities.
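To make the production objective concrete, the following is a minimal sketch of one common way to flag a deviation from "consistent" behavior: Hotelling's T-squared distance from a reference set of in-control observations. The article does not name this statistic, so its use here is an assumption, as are the synthetic data, the three variables, the function names, and the empirical 99th-percentile alarm limit.

```python
import numpy as np

def fit_reference(X_ref):
    """Estimate the mean vector and inverse covariance from in-control data."""
    mean = X_ref.mean(axis=0)
    cov = np.cov(X_ref, rowvar=False)  # variables are in columns
    return mean, np.linalg.inv(cov)

def t2_statistic(x, mean, cov_inv):
    """Hotelling's T-squared distance of one observation from the reference mean."""
    d = x - mean
    return float(d @ cov_inv @ d)

# Synthetic stand-in for "consistent" production data: 200 observations
# of three correlated variables (values are illustrative only).
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.6, 0.2],
                   [0.0, 0.8, 0.4],
                   [0.0, 0.0, 0.5]])
X_ref = rng.normal(size=(200, 3)) @ mixing + np.array([37.0, 7.0, 30.0])

mean, cov_inv = fit_reference(X_ref)
t2_ref = np.array([t2_statistic(x, mean, cov_inv) for x in X_ref])

# Empirical 99th-percentile limit (an assumption; a distribution-based
# limit could be used instead).
limit = np.percentile(t2_ref, 99)

# A new observation whose first variable has drifted by 10 standard deviations.
x_new = mean + np.array([10.0 * X_ref.std(axis=0, ddof=1)[0], 0.0, 0.0])
t2_new = t2_statistic(x_new, mean, cov_inv)
alarm = t2_new > limit
print(f"limit={limit:.2f}, T2(new)={t2_new:.2f}, alarm={alarm}")
```

Because the statistic uses the covariance between variables, it can also flag an observation whose individual values look normal but whose combination is unusual, which is the kind of early warning the production setting calls for.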

The methods available for data analysis can address a significant proportion of industrial problems (2). A common strategy for data analysis is compression, the goal of which is twofold: first, to describe the variation in the data from a statistical data analysis perspective more effectively and, second, to enable the graphical representation of the data in the compressed data subspace, which allows the analyst to interrogate the data. The idea here is that patterns inherent within 1000 variables, for example, cannot be easily visualized individually, but using multivariate statistical projection-based techniques, the problem can be reduced to a limited number of variation indicators. This enables the visualization of the latent patterns. A multivariate statistical method that has been widely applied is principal component analysis. This technique has been extended to take account of both batch and nonlinear behavior. Other visualization strategies to support the compression methods include parallel coordinates plots (3).
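The compression idea can be sketched with principal component analysis computed through a singular value decomposition. This is an illustrative example under stated assumptions, not the authors' implementation: the data are synthetic (50 observations of 1000 variables driven by two hidden factors), the variables are autoscaled before decomposition, and the function name `pca_compress` is hypothetical.

```python
import numpy as np

def pca_compress(X, n_components):
    """Project autoscaled data onto its first principal components.

    Returns the scores (observations in the compressed subspace), the
    loadings (contribution of each original variable), and the fraction
    of total variance captured by each retained component.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Z = (X - mean) / std                      # autoscale each variable
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]

# Synthetic data: 1000 measured variables that are all driven by two
# underlying sources of variation plus a small amount of noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(50, 2))
pattern = rng.normal(size=(2, 1000))
X = latent @ pattern + 0.05 * rng.normal(size=(50, 1000))

scores, loadings, explained = pca_compress(X, n_components=2)
print("scores shape:", scores.shape)
print("variance captured by two components:", round(float(explained.sum()), 3))
```

Plotting the two columns of `scores` against each other gives the kind of low-dimensional picture described above, and the `loadings` indicate which original variables are responsible for any pattern seen there.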

There is a further challenge related to combining data from multiple sources logged at different frequencies. This raises data pretreatment issues that must be carefully addressed if original data features are not to be masked or lost. In the production environment, where consistency deviations are sought, these compression methods enable the interrogation and understanding of changes in the original process variables to address these challenges.
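One simple pretreatment for data logged at different frequencies is to summarize the fast signal over the interval leading up to each slow measurement. The sketch below assumes an on-line sensor logged every minute and an off-line assay taken every 60 minutes; the signal, the time units, the window length, and the function name `window_means` are all illustrative assumptions.

```python
import numpy as np

def window_means(t_fast, y_fast, t_slow, window):
    """Average the fast signal over (t - window, t] for each slow timestamp t.

    Returns NaN for a slow timestamp that has no fast samples in its window.
    """
    out = np.full(len(t_slow), np.nan)
    for i, t in enumerate(t_slow):
        mask = (t_fast > t - window) & (t_fast <= t)
        if mask.any():
            out[i] = y_fast[mask].mean()
    return out

# Fast on-line signal: one reading per minute over four hours
# (a linear ramp is used so the result is easy to check by hand).
t_fast = np.arange(0, 241, dtype=float)
y_fast = 0.5 * t_fast

# Slow off-line assay: one sample per hour.
t_slow = np.array([60.0, 120.0, 180.0, 240.0])

aligned = window_means(t_fast, y_fast, t_slow, window=60.0)
print(aligned)
```

Averaging is only one choice. It smooths out short-lived events within each window, which is exactly the kind of masking the paragraph above warns about, so the window length and the summary statistic need to be chosen with the process dynamics in mind. Interpolating the slow measurements onto the fast time grid is an alternative, but it creates values that were never measured.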

In the development environment, further analysis is required to understand the behavior of the process in the operating space. Data compression is the first step in constructing data-based process models, which leads to the application of methods such as partial least squares (PLS). Variants are also required to capture both the batch and nonlinear behavior. These models of process behavior can then be used in knowledge-based experimental design strategies to design the most informative experiments.
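As an illustration of partial least squares, the following is a minimal sketch of the single-response form of the method (often called PLS1) using the NIPALS iteration. It handles only the linear case, not the batch or nonlinear variants mentioned above. The synthetic data (40 experiments, 200 measured variables, one quality response) and the function names `pls1_fit` and `pls1_predict` are assumptions made for the example.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS iteration."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean           # centered working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                         # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xr @ w                            # scores for this component
        tt = t @ t
        p = Xr.T @ t / tt                     # X loadings
        q_a = (yr @ t) / tt                   # y loading
        Xr = Xr - np.outer(t, p)              # deflate X
        yr = yr - q_a * t                     # deflate y
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)    # regression vector in X space
    return x_mean, y_mean, coef

def pls1_predict(X, model):
    """Predict the response for new rows of X."""
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

# Synthetic development data: more variables than experiments, with the
# measurements and the response both driven by two hidden factors.
rng = np.random.default_rng(2)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 200)) + 0.05 * rng.normal(size=(40, 200))
y = latent @ np.array([1.0, -0.5]) + 0.05 * rng.normal(size=40)

model = pls1_fit(X, y, n_components=2)
y_hat = pls1_predict(X, model)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print("fit R^2:", round(float(r2), 3))
```

The fit quality reported here is measured on the same data used to build the model. With the small data sets typical of development, the number of components would normally be chosen by cross-validation rather than fixed in advance.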

Considering the solutions required, it might first be thought that "off-the-shelf" data analysis products can provide the answers. There are, however, many hurdles to overcome between identifying problems and implementing solutions. These range from data availability and configuration (which can typically take 33% of the time of the overall job) to tool identification, where the combined technical competencies in statistics, engineering, and biochemistry are necessary. There are also problems that require new fundamental data analysis approaches and, in particular, the area of process development is one where real benefits can be derived through their implementation. More efficient use of data and its appropriate analysis can help to reduce development times. Although this is an important business opportunity, the impact of small data sets raises further statistical challenges.

Extracting knowledge across products is one way of supplementing the limited information during early-stage development, but, again, data-analysis challenges are significant. One possible solution is combining data-based analysis with other information sources, which will result in hybrid information structures becoming important.


For most applications, the tools to deliver a solution are available, but there is a vast difference between having a set of tools and knowing how to use them effectively. The overall strategy, the selection of appropriate methods, and the decisions required to progress quickly and effectively from recognizing a process problem to attaining a solution come only with experience. Those who believe that the purchase of a data-analysis package is the panacea for all problems need to be wary.

Gary Montague* is professor of bioprocess control in the School of Chemical Engineering and Advanced Materials, Newcastle University, Newcastle upon Tyne, NE1 7RU, United Kingdom, tel. +44 191 2227265, fax +44 191 2225292. Elaine Martin is professor of industrial statistics in the School of Chemical Engineering and Advanced Materials at Newcastle University.

This article was reprinted with permission from Pharmaceutical Technology Europe, 19 (9), 71–75 (2007).

*To whom all correspondence should be addressed.


1. US Food and Drug Administration, Guidance for Industry PAT—A Framework for Innovative Pharmaceutical Development, Manufacturing, and Quality Assurance, (FDA, Rockville, MD, September 2004),

2. A.O. Kirdar et al., "Application of Multivariate Analysis toward Biotech Processes: Case Study of a Cell-Culture Unit Operation," Biotechnol. Prog. 23 (1), 61–67 (2007).

3. A. Inselberg, "Visualization and Data Mining of High-Dimensional Data," Chemomet. Intell. Lab. Syst. 60 (1), 147–159 (2002).