During the last few decades, advances in molecular biology have allowed the increasingly rapid sequencing of large portions of genomes. The plethora of information, resulting from programmes such as the Human Genome Project, has necessitated the careful storage, organization and indexing of sequence information. This, in turn, has led to the development of numerous sequence databases such as GenBank and EMBL. This article examines how an integrated approach to bioinformatics could help researchers align their work, share data and, ultimately, significantly increase productivity.
Bioinformatics is "a mechanism for the acquisition, processing, structured analysis and storage of biological data." If this statement is accepted as a satisfactory definition, then the creation of a single solution for bioinformatics, even for a modest-sized life science institution, appears to be a formidable task - particularly when the sheer volume of data and its exponential growth, the complexity of biological data sets and the multitude of different data types are considered. Slowly, we (that is, the global life sciences community along with the vendor industry) are recognizing that we cannot begin to solve the whole challenge of bioinformatics in one fell swoop. If we are to agree that the role of bioinformatics is to 'enable the integration of biological data with knowledge resources from other domains,' there is a school of thought gathering support that we must look to an integrated approach, in which we borrow, adapt and reuse existing solutions wherever possible. Furthermore, with the emergence of a new computing architecture known as web services, there is now a technological platform that could potentially deliver the integration and accessibility of biological data.
There are a number of key business and scientific drivers behind the requirement for an integrated approach to bioinformatics:
The growth of GenBank (www.ncbi.nlm.nih.gov/Genbank), in terms of DNA sequence data alone, parallels the global growth of Internet nodes during the last 20 years. At an early stage, the Internet community recognized that the development of standards for data exchange, node-naming conventions and related low-level protocols was required to enable graceful handling of an explosion in the number of nodes. From the beginning, the Internet was designed to be robust, scalable and accessible. Even so, there is now considerable consternation regarding the need to deal with the constraints imposed by the 32-bit IP (Internet protocol) addressing scheme, in which an address is written as up to 12 decimal digits.
The bottom line is that it is very difficult to plan for data volume changes of several orders of magnitude. Yet, the technology underlying the development must at least be scalable.
Each field of bioscience is producing its own data, in isolation from the others. However, a sea change is anticipated. Despite a traditional reputation for insularity, researchers recognize that no discipline exists in isolation from the others; rather, there is interaction and feedback between and among each of the levels. Bioinformatics must rise to the challenge of providing a unified view of the data behind the interactions. New tools are required that transcend primary data storage and analysis.
Table I: Bioinformatics databases.
Organism-specific researchers generate and use data locally, annotate the data as required and use it to answer very specific experimental questions. However, these data are often shared with the global community, which requires access to large data sets to address questions that may be of little interest to the original producers of the data.
The tools and data formats used at these two levels may be very different. At the lowest level, a LIMS (laboratory information management system) allows information management and data processing of interest to individual laboratories. At the global level, however, there are agreements, standards and protocols that allow data to be shared between researchers. An ideal integrated bioinformatics solution should allow these different levels to translate and exchange data efficiently and seamlessly.
The modern researcher is faced with a multiplicity of data types, relating to, for example, mass spectrometry (MS), 2-D gels, 3-D molecular structures, microarrays and DNA sequences. An integrated solution is required to allow comparison and interpretation of this data. To take a proteomics example, a researcher may have 2-D gel data that is associated with MS data and peptide sequence data for a protein - but may have no single picture of the whole pipeline of knowledge. Instead, the researcher is only able to view the data types in isolation.
Lincoln Stein of the Cold Spring Harbor Laboratory (New York, USA) delivered a keynote presentation at the O'Reilly Bioinformatics Technology Conference (Tucson, Arizona, USA) in which he compared the current status of bioinformatics with that of Italy in the Middle Ages.1
The Italian city states were a disparate group with different dialects, cultures, legal and political systems, weights and measures, taxation and currencies. Even though Italy had brilliant thinkers and scientists, its technological and industrial development lagged behind because of the difficulty in overcoming these differences. Stein argued that today's bioinformatics data providers are also suffering from too many differences, which are hindering the advancement of science. "We see a lot of fragmentation in the landscape of data providers," he said, "and each of these data sites has its own view of the world."
Figure 1: The role of LIMS in the management of bioinformatic data.
Bioinformatics databases such as NCBI, EnsEMBL, FlyBase, SGD, WormBase and UCSC all provide relevant data (Table I), but unfortunately they use a wide range of different systems and formats. Stein stressed that there is a clear need for a more integrated approach to bioinformatics.
We don't collect data for its own sake; if there is to be any value, we must derive knowledge from it. Thus, an integrated approach is required to derive knowledge from the myriad biological data sources. LIMS is increasingly seen as the tool that can offer a central repository for bioinformatic data as well as integration with laboratory robotics and instruments and management of assay information, reagents and protocols. It can also offer a route for exchanging data with, and integrating, other applications and databases. Figure 1 illustrates the pivotal role of a LIMS in the management of bioinformatic data.
Bioinformatics code of conduct. At the O'Reilly Bioinformatics Technology Conference, Lincoln Stein and others recently proposed a code of conduct for biological data providers, with the intention of allowing easy integration of bioinformatics resources. It provides a good platform from which suppliers and customers can make progress in their work, together with a means of developing relationships for the good of the bioinformatics community. The six tenets can be summarized as follows:
Technologies for integrated bioinformatics. Database federations and data warehouses have traditionally been used to integrate disparate data sources. However, the advent of web services is offering new possibilities.
A database federation can have a global (federation) schema that provides users with a uniform view of the federation and thus insulates them from the component databases, or local views that provide users with multiple views of the federation.
A data warehouse represents the materialization of a global schema: the warehouse database, defined by the global schema, is loaded periodically with data from the component databases. This approach organizes disparate databases into a data warehouse with or without a common schema. Some of the more established examples are GUS (Genomic Unified Schema), a data warehouse that attempts to predict protein function based on protein domains, and EnsEMBL, a collaborative project between EMBL, EBI (European Bioinformatics Institute) and the Sanger Center to automatically track sequenced fragments of the human genome and assemble them into longer stretches.
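The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component databases are in-memory SQLite stores, and the site names, table layout and rows are hypothetical rather than taken from any real resource. The federation answers a query by visiting every component at request time, whereas the warehouse answers from a copy that was loaded earlier.

```python
import sqlite3


def make_component(rows):
    """Create an in-memory component database (hypothetical layout)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE sequences (accession TEXT, organism TEXT, length INTEGER)"
    )
    db.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)
    db.commit()
    return db


# Two illustrative component databases; names and rows are invented.
components = {
    "site_a": make_component(
        [("A0001", "Homo sapiens", 1200), ("A0002", "Mus musculus", 800)]
    ),
    "site_b": make_component([("B0001", "Homo sapiens", 450)]),
}


def federated_query(organism):
    """Federation: fan the query out to every component at request time."""
    results = []
    for source, db in components.items():
        cur = db.execute(
            "SELECT accession, organism, length FROM sequences "
            "WHERE organism = ?",
            (organism,),
        )
        results.extend((source, *row) for row in cur.fetchall())
    return sorted(results)


def load_warehouse():
    """Warehouse: copy all component data into one materialized store."""
    wh = sqlite3.connect(":memory:")
    wh.execute(
        "CREATE TABLE sequences "
        "(source TEXT, accession TEXT, organism TEXT, length INTEGER)"
    )
    for source, db in components.items():
        rows = db.execute(
            "SELECT accession, organism, length FROM sequences"
        ).fetchall()
        wh.executemany(
            "INSERT INTO sequences VALUES (?, ?, ?, ?)",
            [(source, *row) for row in rows],
        )
    wh.commit()
    return wh


def warehouse_query(wh, organism):
    """Answer from the materialized copy only; components are not contacted."""
    cur = wh.execute(
        "SELECT source, accession, organism, length FROM sequences "
        "WHERE organism = ? ORDER BY source, accession",
        (organism,),
    )
    return cur.fetchall()
```

In this sketch, a row added to a component after `load_warehouse()` has run is visible to `federated_query` immediately but reaches `warehouse_query` only after the next load, which is the practical trade-off between the two designs.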
Web services are intended to enable the exchange of data between heterogeneous systems in the form of XML (eXtensible Markup Language) messages. The two key qualities of XML are that it is human readable and platform neutral. Web services architecture represents an attempt to allow remote access to data and application logic in a loosely coupled fashion. Previous attempts at achieving this (such as DCOM and Java/RMI) required tight integration between the client and server, using platform- and implementation-specific binary data formats.
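As a minimal sketch of what such a message might look like, the following Python fragment builds and parses an XML document with the standard library. The element names (`sequenceRecord`, `accession`, `organism`, `sequence`) are assumptions made for illustration and do not correspond to any published schema.

```python
import xml.etree.ElementTree as ET


def build_message(accession, organism, sequence):
    """Serialize one record as a plain-text XML message (hypothetical tags)."""
    root = ET.Element("sequenceRecord")
    ET.SubElement(root, "accession").text = accession
    ET.SubElement(root, "organism").text = organism
    ET.SubElement(root, "sequence").text = sequence
    return ET.tostring(root, encoding="unicode")


def parse_message(xml_text):
    """Recover the record as a dict; works for any sender that emits this XML."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


message = build_message("X00001", "Homo sapiens", "ATGC")
record = parse_message(message)
```

Because the message is ordinary text, the sender and receiver need to agree only on the element names, not on a programming language, component model or operating system.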
An advantage of web services is that they are not "owned" by any one company or organization. Programs written in any language, using any component model and running on any operating system can all access web services.
As mergers, data sharing and communal resources become more accepted in the biotech and pharmaceutical industry, data compatibility and system integration become difficult and often expensive considerations. Thus, web services can offer significant benefits. Exposing the functionality of a LIMS, electronic record keeping system or other scientific information system using web services allows scientists to share data more effectively. By using common schemas and transforming the information, data held in the different systems can be searched, queried and displayed via XML documents, using common interfaces.
A proteomics researcher could, for example, perform a keyword search for related samples, spectra, annotated sequences and 2-D gel images of a particular protein across multiple systems within an organization. This search could be performed from a page in the corporate Intranet or portal. Some portals offer organizations and users the ability to create a site that is personalized for individual interests; as such, a laboratory portal could be used to bring together sources, databases and functionality pertinent to a laboratory user. This might include sample registration and tracking, access to analytical results, spectral and chemical information, and possibly documents and procedures. Having a central access point for this type of information helps to eliminate barriers between departments and functions, improves internal collaboration and communication, and ultimately delivers operational efficiency by making better use of internal resources and knowledge.
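The idea of a common schema with a single search interface can be sketched as follows. Everything here is hypothetical: the three source systems, their native record layouts and the fields of the shared `Record` type are invented for illustration, and a real deployment would call each system's web service rather than read in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Common schema shared by every system (illustrative fields only)."""
    source: str
    record_id: str
    protein: str
    kind: str


# Hypothetical native records, each in its own system-specific layout.
LIMS_ROWS = [
    {"SampleID": "S-17", "Protein": "Albumin"},
    {"SampleID": "S-18", "Protein": "Insulin"},
]
MS_ROWS = [("spec-204", "albumin tryptic digest")]
GEL_ROWS = [{"image": "gel-9.tif", "spot_label": "Albumin, spot 3"}]


def from_lims(row):
    """Map a LIMS sample row onto the common schema."""
    return Record("lims", row["SampleID"], row["Protein"], "sample")


def from_ms(row):
    """Map a mass spectrometry tuple onto the common schema."""
    spectrum_id, description = row
    return Record("ms", spectrum_id, description, "spectrum")


def from_gel(row):
    """Map a 2-D gel image entry onto the common schema."""
    return Record("gel", row["image"], row["spot_label"], "gel_image")


def search(keyword):
    """Case-insensitive keyword match across all systems, in a fixed order."""
    needle = keyword.lower()
    records = (
        [from_lims(r) for r in LIMS_ROWS]
        + [from_ms(r) for r in MS_ROWS]
        + [from_gel(r) for r in GEL_ROWS]
    )
    return [r for r in records if needle in r.protein.lower()]


hits = search("albumin")
```

Here `hits` contains one sample, one spectrum and one gel image, each in the same shape, so a portal page could display them side by side without knowing how each source system stores its data.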
Microsoft clearly rates web services as extremely important. Approximately 80% of its 2000 research and development budget was allocated to the .NET framework and web services. Now that Visual Studio.NET has been released, web services architecture is set to explode.
There are a number of good examples of the use of web services in bioinformatics, such as EBI's Bibliographic Query Service (BQS), which provides web service access to life science publications, thereby fulfilling a similar role to PubMed but with a richer interface for querying and retrieving publications. A further example is EBI's XEMBL, which for the first time offers access to the EMBL nucleotide database as a web service.
As the industry struggles to deal with increasingly greater volumes of complex and often interrelated data, the need for integration in the management of bioinformatic data has never been as stark as it is today.
To maximize the usefulness and reusability of a data source, a code of conduct for data providers has been formulated, the principles of which are now gaining widespread support. From a technological standpoint, web services appear to offer the architecture to support the integration and accessibility of data that researchers and bioinformaticians have been waiting for. The elements now appear to be in place for truly integrated bioinformatics to become a reality.
1. L.D. Stein, "Building a Nation from a Land of City States," Cold Spring Harbor Laboratory, keynote speech at the O'Reilly Open Bioinformatics Conference (Cold Spring Harbor, New York, USA, 2002).