From data to knowledge: the challenge in bioprocess development

Published in Pharmaceutical Technology Europe

In biomanufacturing, multiple sensors provide a wealth of data that could be used to enhance process understanding and assist in performance improvement. This article looks at how to move from a data-rich environment to one where the data are translated to useful information that leads to knowledge and, ultimately, process improvements.

Biopharmaceutical manufacturing generates a wealth of data from multiple sensors. Consequently, the industry faces the challenge of moving from a data-rich environment to one where the data are translated into information that leads to knowledge. If this knowledge is effectively exploited, it can improve production processes and, with them, a company's financial performance and environmental record.

Addressing the data/information/knowledge chain is not straightforward because a series of issues compound one another. With respect to the manufacturing process, for example, the data analysis problem is not restricted to one unit operation: the interactions between different units must also be extracted and understood. This requires scientists and engineers to apply advanced statistical methods, which is not easy because the data garnered have diverse structures (from batch and continuous operations) and include qualitative and quantitative, as well as temporal and end-point, measurements. From these data structures, potentially valuable knowledge can be extracted to assist in developing data-based model process descriptors. This information can lead to increased process understanding, which allows those working on design and operational functions to make more informed decisions.

Although many of the problems of translating data to knowledge in the biopharmaceutical industry have been overcome using commercially available tools, the introduction of process analytical technology (PAT) has introduced a whole new set of challenges.1

For example, spectroscopic instruments significantly increase the amount of data and the complexity of the knowledge-mining problem.1 Figure 1 shows a fermentation vessel with an invasive near-infrared (NIR) probe alongside the standard probes. The complexity of the data such instruments generate means that the need for data analysis and interpretation will continue to grow. Faced with this challenge of extracting knowledge from data, a team-based strategic approach is required. The team needs to include process engineers, scientists, statisticians and, most importantly, a business champion who recognizes the potential benefits.

Figure 1

The approach adopted, and the capability of existing technologies for knowledge extraction, depend on whether the development or manufacturing environment is being considered.



In development, at the reaction stage, it is vital to understand the predominant reactions occurring at various stages throughout the course of a batch. Furthermore, as changes in operating policy are explored, it is necessary to determine when important reaction pathways become significant and when they are affected by reactant limitations or excessive accumulation of nutrients.

From such knowledge, operational policy changes can be made and new avenues of operation explored. The approach that has been typically adopted uses off-line sample analysis to identify the concentrations of nutrients that process scientists perceive to be influential. However, off-line sample analysis has its limitations, particularly regarding low sampling frequency affecting the amount of available data. It is in this situation where PAT, and its related technologies, can increase the frequency of available information and provide a detailed fingerprint of the chemical composition.

Figure 2

As previously mentioned, a series of recovery operations is executed following the reaction stage. In development, once it has been determined which unit operations should be used, the standard operating policy for each unit must be specified and, ideally, the performance of the chain must be considered as a whole rather than as isolated units. Again, extracting information from the available data is crucial to making informed design and operational decisions. Figure 2 shows that although improved instrumentation increases process knowledge during development, the real benefit is not realized until appropriate statistical analysis procedures are applied. This knowledge, in turn, affects process profitability: as Figure 3 illustrates, improved measurement can reduce development times and, if integrated statistical analysis procedures are used, improved operating policies can increase yields and profits.

Figure 3

Looking ahead, by embedding the data-based process descriptors within monitoring and optimization algorithms, it will be viable to move towards processes that inherently adopt a quality by design (QbD) concept. The real challenge in using statistical tools to extract process descriptors is that in the development laboratory data is only available from a limited set of experiments. Statistical awareness in assessing the experimental results is crucial to guide the development program.


In production, there is a need to ensure process consistency and to provide tools that give early warning of operational and process deviations so that timely corrective action can be taken. The knowledge extraction philosophy for production differs from that for development. In production, the spread of data coverage is limited and the objective is to look for occasional deviations from predominantly 'consistent' behaviour. In development, the spread of data is significantly larger, and the objective is to search for robust and effective design and operating regions. Consequently, the tools required for translating data into knowledge demand different capabilities, and the implementation of different fundamental algorithms is often a prerequisite. The knowledge extraction 'toolbox' must, therefore, possess multiple capabilities.
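Early warning of deviations from 'consistent' behaviour is commonly implemented as multivariate statistical process control. The following is a minimal sketch, not the authors' specific method: it assumes synthetic in-control data and uses a Hotelling-style T² statistic with an empirical control limit. All data and dimensions here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "consistent" production data: 100 in-control batches,
# 5 correlated process variables each.
A = rng.normal(size=(5, 5))
normal = rng.normal(size=(100, 5)) @ A

mean = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def hotelling_t2(x):
    """Multivariate distance of one observation from the in-control mean."""
    d = x - mean
    return float(d @ cov_inv @ d)

# Empirical control limit from the reference data (e.g. 99th percentile).
limit = np.percentile([hotelling_t2(x) for x in normal], 99)

# A deviating batch: one variable drifts far outside normal operation.
deviant = normal[0].copy()
deviant[2] += 10 * normal[:, 2].std()
print(hotelling_t2(deviant) > limit)   # the chart flags the deviation
```

Because the statistic accounts for correlation between variables, it can flag combinations of values that look acceptable on any single-variable chart.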

The methods available for data analysis can address a significant proportion of industrial problems.2 A common strategy for data analysis is compression, the goal of which is two-fold: first, to describe the variation in the data more effectively from a statistical perspective and, second, to enable graphical representation of the data in the compressed subspace, which allows the analyst to interrogate the data. The driver here is that patterns inherent within, for example, 1000 variables cannot easily be visualized individually, but using multivariate statistical projection-based techniques, the problem can be reduced to a limited number of variation indicators, enabling visualization of the latent patterns. A multivariate statistical method that has been widely applied is principal component analysis. The technique has been extended to account for both batch and nonlinear behaviour. Other visualization strategies that support the compression methods include parallel coordinates plots.3
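The compression idea can be sketched with principal component analysis on synthetic data. This is an illustrative example, not the article's own analysis: the dimensions (50 batches, 1000 variables, 3 latent factors) are invented to show how a high-dimensional data set collapses to a few variation indicators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations of 1000 process variables,
# driven by only 3 underlying latent factors plus measurement noise.
latent = rng.normal(size=(50, 3))
loadings = rng.normal(size=(3, 1000))
X = latent @ loadings + 0.1 * rng.normal(size=(50, 1000))

# Mean-centre (and, in practice, often scale) before projection.
Xc = X - X.mean(axis=0)

# PCA via singular value decomposition.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# The first few components capture almost all systematic variation,
# compressing 1000 variables to a handful of scores per observation.
scores = Xc @ Vt[:3].T          # 50 x 3 score matrix for visualization
print(f"variance explained by 3 PCs: {explained[:3].sum():.3f}")
```

The score matrix is what the analyst actually plots and interrogates; each point summarizes one batch in the compressed subspace.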

A further challenge relates to combining data from multiple sources logged at different frequencies. This raises data pretreatment issues that must be carefully addressed if original data features are not to be masked or lost. In the production environment, where deviations from consistency are sought, compression methods then enable changes in the original process variables to be interrogated and understood.
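One simple pretreatment for mixed-frequency logs is interpolation of the sparse signal onto the dense signal's time base. A hedged sketch with hypothetical signals (a frequently logged temperature and an infrequent off-line assay; the names and sampling rates are illustrative only):

```python
import numpy as np

# Hypothetical signals logged at different frequencies during one batch:
# temperature every minute, an off-line assay every 30 minutes.
t_temp = np.arange(0, 240, 1.0)              # minutes
temp = 30 + 0.02 * t_temp                    # slow drift
t_assay = np.arange(0, 240, 30.0)
assay = np.sqrt(t_assay + 1.0)

# Align onto a common time base by interpolating the sparse signal,
# rather than discarding the high-frequency information.
assay_aligned = np.interp(t_temp, t_assay, assay)

X = np.column_stack([temp, assay_aligned])   # one row per common time point
print(X.shape)
```

Linear interpolation is only one choice; whether it masks real features (e.g. step changes between assays) is exactly the pretreatment question the text raises, and it should be checked against the process dynamics.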


In the development environment, further analysis is required to understand the behaviour of the process across the operating space. Data compression is the first step in constructing data-based process models, using methods such as partial least squares. Variants are also required to capture batch and nonlinear behaviour. These models of process behaviour can then be used in knowledge-based experimental design strategies to design the most informative experiments.
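A minimal single-component partial least squares sketch on hypothetical development data follows (a NIPALS-style update; the data, dimensions and response are invented for illustration and do not come from the article):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical development data: 20 experiments, 8 process variables,
# one quality response (e.g. product yield).
X = rng.normal(size=(20, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=20)

# Centre both blocks before modelling.
Xc, yc = X - X.mean(axis=0), y - y.mean()

# One PLS component: the weight vector maximizes covariance
# between the X scores and the response.
w = Xc.T @ yc
w /= np.linalg.norm(w)
t = Xc @ w                     # scores for the 20 experiments
q = yc @ t / (t @ t)           # regression of y on the score

y_hat = t * q
r2 = 1 - np.sum((yc - y_hat)**2) / np.sum(yc**2)
print(f"R^2 with one component: {r2:.2f}")
```

In practice further components are extracted by deflating X and repeating, and batch or nonlinear variants replace this linear inner relation; the point here is only that the compressed scores, not the raw variables, carry the predictive relationship.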

Considering the solutions required, it might at first be thought that 'off-the-shelf' data analysis products can provide the answers. There are, however, many hurdles between identifying problems and implementing solutions. These range from data availability and configuration (which can typically take a third of the overall effort) to tool identification, where combined technical competencies in statistics, engineering and biochemistry are necessary. There are also problems that require new fundamental data analysis approaches; process development, in particular, is an area where their implementation can deliver real benefits. More efficient use of data, and its appropriate analysis, can help to reduce development times. Although this is an important business opportunity, the small data sets involved raise further statistical challenges.

Extracting knowledge across products is one way of supplementing the limited information during early-stage development, but, again, data analysis challenges are significant. One possible solution is combining data-based analysis with other information sources, which will result in hybrid information structures becoming important.


For most applications the tools to deliver a solution are available, but there is a vast difference between having a set of tools and knowing how to use them effectively. The overall strategy adopted, and the selection of appropriate methods and decisions required to progress from recognizing a process problem to attaining a solution quickly and effectively, only come with experience. Those who believe that purchasing a data analysis package is a panacea for all problems should be wary.

Gary Montague is professor of bioprocess control in the School of Chemical Engineering and Advanced Materials at Newcastle University (Newcastle Upon Tyne, UK). The main thrust of his research is in the area of systems engineering. Recent work is considering the interaction between evolutionary agents in the development of pharmaceutical downstream processing models.

Elaine Martin is professor of industrial statistics in the School of Chemical Engineering and Advanced Materials at Newcastle University. Her expertise lies in the areas of multivariate statistics, design of experiments, Bayesian analysis, linear and nonlinear statistical modelling, nonparametric statistics, multivariate statistical process control, and feature extraction from complex databases.


1. FDA, Guidance for Industry: PAT — A Framework for Innovative Pharmaceutical Development, Manufacturing and Quality Assurance (September 2004).

2. A.O. Kirdar et al., Biotechnol. Prog., 23(1), 61–77 (2007).

3. A. Inselberg, Chemomet. Intell. Lab. Syst., 60(1), 147–159 (2002).