An Integrated Approach to the Data Lifecycle in BioPharma

Published in:
Pharmaceutical Technology, August 2022, Volume 46, Issue 8
Pages: 44–46

Successful digital transformation in biopharma requires an integrated approach to the data lifecycle.

The BioPharma industry is in a state of flux. As the industry modernizes to deliver on the promise of a new generation of drug modalities, such as cell and gene therapies, it is also seeking to take advantage of recent advances in digital technologies. Many companies have launched digital transformation initiatives to leverage the potential of big data, cloud computing, machine learning/artificial intelligence, and the Internet of Things (IoT). The ultimate goal of these initiatives is to improve operational efficiency, reduce costs and time to market, and stay ahead of the competition. To enable this transformation, the so-called Industry 4.0 movement must be built on a solid foundation of good data governance while accommodating the industry’s stringent regulatory requirements.

In the race to all things digital, data governance, data integrity, and regulatory compliance may not be at the top of the C-Suite’s mind. Yet, digital initiatives often provide an opportunity to modernize these legacy processes, replacing labor-intensive and time-consuming manual methods with integrated systems that can reduce effort and improve overall data quality.

Recent years have seen a rise in FDA enforcement actions, such as Form 483 inspectional observations and warning letters, with data integrity violations accounting for a growing share of the notices. In 2019, almost half (47%) of all warning letters issued by FDA concerned data integrity. By the end of 2021, that number had increased to 65% (1). This trend has prompted forward-thinking organizations to re-evaluate their infrastructure and ways of working in efforts to maintain compliance and avoid future risk through digitization of their business and operational processes. In many cases, these efforts should be seen as complementary to programs that modernize data governance infrastructure to meet Industry 4.0 aspirations.

The legacy systems headache

The industry has seen good adoption of tools to help manage operational data and improve data integrity, such as electronic lab notebooks (ELN), laboratory information management systems (LIMS), and manufacturing execution systems (MES). Larger organizations are also seeing a resurgence in investments in building centralized data repositories, such as data lakes, to help drive their digitization initiatives. A core business objective of many of these data lakes is to break down data silos to create a centralized repository of data for end users that is easily accessible, coherent, and complete. However, automating the integration of such varied systems, whilst ensuring data integrity and regulatory compliance, remains a significant industry challenge (2).

Many biopharmaceutical processes lack agreed-upon standards for data representation and transmission. While standards are emerging, most have yet to be widely adopted by hardware and software vendors due to the lack of industry consensus or the immaturity of the standards. The problem is compounded as many hardware and software vendors fail to provide adequate programmatic interfaces to enable automated system-to-system integration, creating silos of critical data.

A less appreciated and more nuanced data integrity issue is data contextualization. Even if an operator can extract data from a specific system (e.g., chromatography), the data may be of little use without combining it with data stored in other systems, such as the experimental conditions under which the sample was generated, or how the sample was stored or subsequently processed. Maintaining this information context or “chain of custody” is not only necessary for interpreting experimental observations (e.g., mechanism of action), but also critical when attempting to gain the higher level of business intelligence needed to optimize a drug candidate’s attributes or a process for drug manufacture.

Human operators are routinely required to manually transcribe and combine data from multiple systems, often using intermediary tools, such as spreadsheets. This human-centric process is time-consuming and error-prone, and the potential for errors escalates with each additional transcription required in data transfer. Additionally, manual data transcription workflows require extensive additional quality checks to ensure data integrity and meet regulatory requirements, such as 21 Code of Federal Regulations (CFR) Part 11 or good laboratory/manufacturing practices (GxP) (3).

Moreover, generating and accessing high-quality contextualized data is a major bottleneck to implementing machine learning and other advanced analytic techniques. Highly educated data scientists can spend an inordinate amount of time on a project simply searching for, combining, and cleaning data to generate datasets for training and validating models.

Relieving the pain

Improving data governance practices and systems integration is a critical pre-requisite to implementation of the automation and analytics aspirations of Industry 4.0.

FDA uses the acronym ALCOA to describe its expectations of data integrity to help industry technicians stay compliant with 21 CFR Part 11. Per ALCOA, data must be attributable, legible, contemporaneous (i.e., recorded in real-time when generated with a date and time stamp), original, and accurate. The concepts were expanded to ALCOA+, which incorporates additional features and specifies that data must be complete, consistent, enduring, and available (4).

The principles of ALCOA+ and the requirements of 21 CFR Part 11 for maintaining data integrity are well established within the biopharmaceutical industry. More recently, similar concepts have been advocated to address broader data integration and system automation challenges by a movement known as the F.A.I.R. Principles for scientific data management and stewardship (5). Established in 2014 by participants of the Lorentz Workshop “Jointly Designing a Data Fairport,” F.A.I.R. holds that data should be findable, accessible, interoperable, and reusable. More a design principle than a standard, it calls for systems to exchange data and metadata in machine-readable formats so that two or more different systems can interoperate, producing data that are comprehensible, reusable, and contextualizable (6).

F.A.I.R. and ALCOA+ complement each other to help tie the data together. F.A.I.R. focuses on infrastructure, namely metadata, to increase the reliability of electronic data capture, while ALCOA+ addresses data integrity challenges to improve the trustworthiness of the data output in the process.

Emerging informatics trends

The current state of biopharmaceutical informatics is heavily dictated by larger organizations, which wield considerable purchasing power. A high proportion of these organizations have made substantial investments in a trove of disparate systems stitched together with in-house integration code, bespoke data lakes/warehouses, and a variety of analytics tools. They are understandably hesitant to undertake the cost and risk associated with wholesale change, especially if the systems in question are validated to a GxP standard. Progress is often incremental, and this can stymie innovation by incentivizing incumbent vendors to maintain the status quo and creating a barrier to entry for newer, more disruptive technologies. There are, however, several emerging technologies to help automate the capture, integration, and contextualization of data from these legacy systems that can both improve data integrity and drive broader Industry 4.0 initiatives, such as implementation of IoT, robotics, and the creation of digital twins for predictive modeling.

Systems integration remains a perennial problem. Some organizations have the necessary IT/software development skills in-house to integrate their digital topography by working directly with individual vendors to create custom solutions. This approach is not ideal for many companies, however, because it is resource-intensive, time-consuming, and creates custom software code that must be maintained in perpetuity, creating an ongoing resource overhead. Its viability also depends on support from hardware/software vendors for system integration, which varies considerably from vendor to vendor and should be a decisive factor when procuring a new system or platform.


When procuring new informatics systems or instruments intended to be integrated with your digital topography, key questions that should be asked include:

  • How good are the vendor’s APIs and documentation?
  • Do they offer services or have certified partners that support integration?
  • Can they provide references to examples of successful integration projects?

Increasingly, procurement teams are placing an emphasis on integration support, putting pressure on hardware/software vendors to adhere to F.A.I.R. principles when designing their products. For many small to mid-sized organizations, building an integrated ecosystem of hardware and software to drive automation is beyond their technical or budgetary reach. To address this need, a number of companies have developed software designed to simplify the integration of popular laboratory hardware and software platforms by creating libraries of connectors to common laboratory equipment and key informatics applications, and by providing the mechanism to automate the exchange of data between systems (7).

Cleaning, contextualizing, and aligning the data pulled from different systems presents a challenge as daunting as accessing the source data itself. Recreating the information “chain of custody” typically requires combining partial data sets from different systems, harmonizing terms and identifiers, and ensuring data are aligned correctly. This is one of the most tedious and error-prone steps, creating significant data integrity risks, particularly when performed manually by a human operator.

This process can often be automated in conjunction with a centrally managed metadata/ontology library that automatically annotates data or builds knowledge graphs mapping terms from different systems and joining the data together. The availability of ontology management, semantic enrichment, and knowledge graph tools continues to expand apace, led by open-source initiatives as well as a number of commercial vendors that specialize in this market.

Systems integration and automated data contextualization focus on the integration of hardware, their control systems, and other operational tools, such as ELN, LIMS, or MES. However, the dynamics of wet-lab work do not always allow access to software interfaces for capturing every observation. Some data are still documented in hand-written notes, which can be critical to properly qualify an experiment or analytical run. This information may or may not become part of the electronic information chain of custody; it can be lost, erroneously transcribed, or intentionally omitted from the record.

To address this issue, some companies have introduced scientifically intelligent digital voice assistants to provide a more efficient, hands-free user experience for lab workflow and data capture (8). For example, the assistant can prompt the user through a complex protocol with voice instruction, import data at critical steps directly from lab equipment, and capture key observations and ancillary notes by transcribing the operator’s voice dictation. These tools can operate both independently as well as integrate directly with other systems such as ELN.

Looking ahead

As small-molecule drug development gives way to newer drug modalities, the suitability of legacy informatics and hardware systems comes into question. Combined with the growing purchasing power of mid-market biotechnology companies, which are often greenfield sites with little or no legacy infrastructure, this shift creates an opportunity for more innovative and disruptive change to the traditional informatics landscape.

Many of these newer technologies seek to embrace a holistic, integrated approach to BioPharma Lifecycle Management by combining elements of out-of-the-box workflow execution, preconfigured system and hardware integrations, contextualized data stores built on F.A.I.R. principles, and integrated analytics to drive business intelligence through an integrated digital platform. Their adoption will depend on whether they can be implemented quickly, deliver business benefits immediately, reduce total cost of ownership, and provide a scalable foundation for data lifecycle management to accelerate Industry 4.0 initiatives.

Technology is finally maturing to address the challenge of integration and automation in the lab and manufacturing plant, whilst ensuring data integrity and regulatory compliance. Regardless of how organizations reevaluate and modernize their current processes, the solution starts with a willingness to embrace new-world approaches to these old-world problems.


1. J. Eglovitch, “Experts Say FDA Enforcement Focus Unchanged, Use Of Alternative Tools To Grow,” June 1, 2022.

2. S. Ktori, “A Digital Journey,” Sept. 10, 2021.

3. US CFR Title 21, Part 11 (Government Printing Office, Washington, DC), 1–9.

4. H. Alosert et al., Biotechnology Journal 17, 2100609 (2022).

5. IDBS, “The FAIR Principles: A Quick Introduction,” accessed July 10, 2022.

6. Labfolder, “Setting the Standard: FAIR and ALCOA+ in Research During the Pandemic,” accessed July 10, 2022.

7. D. Levy, “Machine Learning Adoption And Implementation In The Lab,” Oct. 14, 2021.

8. P. de Matos, “Lab of the Future Post-COVID-19: Bringing User Experience to the Forefront,” March 9, 2021.

About the author

Scott Weiss is vice-president of Business Development and Open Innovation at IDBS.



When referring to this article, please cite it as S. Weiss, “An Integrated Approach to the Data Lifecycle in BioPharma,” Pharmaceutical Technology 46 (8) 2022.