(C) 2011 Anne E. Thessen. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
We review technical and sociological issues facing the Life Sciences as they transform into more data-centric disciplines - the “Big New Biology”. Three major challenges are: 1) lack of comprehensive standards; 2) lack of incentives for individual scientists to share data; 3) lack of appropriate infrastructure and support. Technological advances in standards, bandwidth, distributed computing, exemplar successes, and a strong presence in the emerging world of Linked Open Data are sufficient to conclude that technical issues will be overcome in the foreseeable future. Although the community is motivated to build a shared open infrastructure and data pool, and is pressured by funding agencies to move in this direction, sociological issues will determine progress. Major sociological issues include our poor understanding of the heterogeneous data cultures within the Life Sciences; impediments to progress include a lack of incentives to build appropriate infrastructures into projects and institutions, or to encourage scientists to make data openly available.
life science, informatics, data issues, standards, incentives, escience
The urgent need to understand complex, global phenomena, the data deluge arising from new technologies, and improved data management are driving an agenda to extend the Life Sciences with more data-driven discovery dimensions (
Data-driven discovery refers to hypothesis-testing and the discovery of scientific insights through the novel management and analysis of pre-existing data. It relies on access to and reuse of data which will most likely have been generated to address other scientific problems. While still hypothesis-based, data-driven discovery contrasts with the more familiar process of scientific inquiry based on collecting new data - whether by experimentation or by making new observations. It introduces opportunities to address questions that demand a “scale” of data that cannot be acquired within a single project. It is cost-effective (
The emergence of a data-centric Big New Biology is not guaranteed. Current practices in much of the discipline are parochial, with data generated by individuals or small teams and called upon to develop insights that are communicated in a narrative style in scientific publications. These small sciences rarely have a formal data culture: data are rarely collected with reuse in mind and may be discarded, although more recently some journals and some sub-disciplines retain publication-related subsets of data (
This document reviews technical and sociological issues for biologists in the light of this futuristic vision for the Life Sciences. Many elements, such as data trust and data types, have both technological and sociological components, and in such cases we have combined them for clarity.
What is meant by data. The term “data” is not used consistently. For some it is limited to raw data, for others the term widens to include any kind of information or process that leads to insights. We prefer to limit the term to neutral, objective, raw data that are largely independent of context, analysis or observer. As data become constrained, filtered and selected, they acquire or are assigned a meaning in the context of what they apply to. This is part of the process that transforms data into information (
The context in which biological data are acquired or generated is important to understanding how data can be appropriately reused. A context may be formed if observers select or interpret their records, because of the limitations of tools or instruments used, or because data are gathered in an unnatural setting such as an experiment or “in silico”. Individuals and technologies are selective and capture a limited subset of all available data. Data are affected by choice of instrument and analytical processes. Some context can be represented through the addition of appropriate metadata to data. We categorize the following broad types of data reflecting the context of their origins.
A. Observational data relate to an object or event actually or potentially witnessed by an agent. An agent may be a person, team, project, or initiative, and may call upon tools and instruments. Scientists need to take responsibility for adding metadata to observational data, ideally identifying the agent, date, location, and contexts such as experimental conditions (if relevant) or the equipment used. Within the Life Sciences, metadata should include taxon names, the basis for identification, and/or pointers to reference (voucher) material.
1. Descriptive data are non-experimental data collected through observations of nature. Ideally, descriptive data can be reduced to values about a specified aspect of a taxon, system, or process. Each value will be unique, having been made at one place, at one time, by one agent. Observations may be confirmed but not replicated, so it is important to preserve these data. Preservation often does not occur, as data of this type are discarded after completion of the research narrative - the publication. The OBOE project offers a formal framework for descriptive data (
Descriptive data can be collected by instruments or by individuals. Data collected by individuals may not represent the world completely or accurately. Mistakes can be made, such as misidentification of taxa (
2. Experimental data are obtained when a scientist changes or constrains the conditions under which the expression of a phenomenon occurs. Experiments can be conducted across a broad range of scales - from electrophysiological investigations of sub-millisecond processes within cells (
B. Processed data are obtained through a reworking, recombination, or analysis of raw data. There are two primary types.
1. Computed data result from a reworking of data to make them more meaningful or to normalize them. In ecology, productivity or the extent of an ecosystem is rarely measured directly. Rather, they are computed from information or data from other sources to estimate the amount of carbon or mass produced per unit area per unit time. While computed data may be held in the same regard as raw data, choices or errors in formulae or algorithms may diminish or invalidate the data created. The raw data that were used and information on how computed data were derived (provenance) are important for reproducibility. The metadata should provide this information. As computed data will grow as the virtual data pool expands, it will be helpful for sub-disciplines to develop appropriate protocols and advertise best practices.
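As an illustration of how provenance might accompany a computed value, the following sketch bundles a derived productivity estimate with the identifiers of its source data and the formula used. The function and field names are hypothetical and are not drawn from any formal metadata standard.

```python
# A minimal sketch of attaching provenance to a computed value.
# Field names are illustrative, not taken from a formal metadata standard.

def compute_productivity(carbon_fixed_g, area_m2, duration_days, source_ids):
    """Return a productivity estimate bundled with simple provenance metadata."""
    value = carbon_fixed_g / (area_m2 * duration_days)  # g C per m^2 per day
    return {
        "parameter": "net primary productivity",
        "value": round(value, 4),
        "units": "g C m-2 d-1",
        "derived_from": source_ids,                    # identifiers of the raw data used
        "method": "carbon_fixed / (area * duration)",  # formula, recorded for reproducibility
        "processing_level": "computed",
    }

record = compute_productivity(1250.0, 100.0, 30.0, ["example-raw-dataset-001"])
print(record)
```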
2. Simulation data are generated by combining mathematical or computational models with raw data. Often models seek to make predictions of processes, such as the future distribution of cane toads in Australia under various climatic projections. The proximity of predictions to subsequent observations is used to test the concepts on which the model is based and to improve the model and our associated understanding of biology. Metadata for simulation data differ dramatically from those for other data types in that the date of the run, the initial conditions of the model, the resolution of the model output, the time step, etc. are important. Rerunning the model may require preservation of initial conditions, model software, and even the operating system (
As the study of human social behavior, sociology includes the study of the behavior and practices of scientists. If we are to promote a shift to a Big New Biology, we need to understand current data cultures to determine which elements favor a transformation, and which will hinder it.
1. Data cultures. The phrase “data culture” refers to the explicit and implicit data practices and expectations that determine the destiny of data. It relates to the social conventions of acquisition, curation, preservation, sharing, and reuse of data. If the goal is to make data digital, standardized and openly accessible in a reusable format, then current data cultures provide starting points to determine the changes that will be needed before that vision can be realized. While a comprehensive survey has yet to be undertaken, it is clear that there is no single data culture for the Life Sciences (
The preparation of data for reuse in a shared pool often involves a series of steps or stages that relate to the capture, digitization, structure, storage, curation, discoverability, access, and mobility of data. The situation with molecular data, achieved by the International Nucleotide Sequence Database Collaboration comprising the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and NCBI GenBank in the USA, is exemplary (http://www.insdc.org/). Molecular data tend to be born digital, and are submitted in standard formats to centralized repositories in which they are freely available for reuse in a standard form. A rich diversity of tools, services and applications has evolved to analyze and visualize the data.
Yet, set in the context of Rogers’ adoption curve (
Rogers’ adoption curve describes the acceptance of a new technology. The Life Sciences are still in the Early Adopters phase for accepting principles of data readiness.
The term “agent” refers to individuals, groups or organizations - each influencing data cultures.
Scientists. As major producers and consumers of Life Sciences data, scientists are important participants in Big New Biology. Within the US there are almost 100,000 biologists (excluding agriculture and health sciences) working outside of academia (
As personal computers and Internet access have become integral components of biological research (
Scientists, especially those associated with small science, will need to be more engaged in mobilization of data than at present (
Publishers. Publishers of scientific journals are increasingly involved in data management (
Funding agencies. Funding agencies worldwide have been called upon to finance informatics research and to promote tools and digital libraries that will underpin the shift towards a Big New Biology paradigm (
List of funding agencies and characteristics of their data policies.
Legend: A. Funding Agency, B. Country, C. Policy, D. Data Management Plan, E. Deposit, F. Standards Compliant, G. Attribution, H. Local Archive, I. Open Source, J. QA/QC, K. Confidentiality, L. PR/Licensing, M. Metadata Deposit, N. Provides Data for Free, O. Free Access to Publications, P. Notes.
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gordon and Betty Moore Foundation | US | http://moore.org/docs/GBMF_Data%20Sharing%20Philosophy%20and%20Plan.pdf | × | × | × | ||||||||||
Genome Canada | Canada | http://www.genomecanada.ca/medias/PDF/EN/DataReleaseandResourceSharingPolicy.pdf | × | × | × | × | × | × | × | Data must be made available no later than the publication date or the date the patent has been filed (whichever comes first) at the end of the project |
National Institutes of Health | US | http://grants.nih.gov/grants/policy/data_sharing/ | × | × | Applies to projects requesting > $500,000; data must be released no later than the acceptance for publication of the main findings from the final data set |
Biotechnology and Biological Sciences Research Council | UK | http://www.bbsrc.ac.uk/publications/policy/data_sharing_policy.html | × | × | × | × | × | Data release no later than publication or within 3 years of generation; researchers are expected to ensure data availability for 10 years after completion of the project |
Natural Environment Research Council | UK | http://www.nerc.ac.uk/research/sites/data/policy.asp | × | × | × | × | × | Data must be made available within 2 years from the end of data collection | |||||||
Wellcome Trust | UK | http://www.welcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm | × | × | |||||||||||
Department of Energy | US | http://genomicsgtl.energy.gov/datasharing | × | × | × | × | × | × | × | × | Requires deposit of 1) protocols 2) raw data 3) other relevant materials no later than 3 months after publication | ||||
Chinese Academy of Sciences | China | http://english.cas.cn/ | Requires deposit or no further funding | ||||||||||||
Australian Research Council | Australia | http://www.arc.gov.au/default.htm | No policy | ||||||||||||
National Science Foundation | US | × | |||||||||||||
Austrian Science Fund | Austria | http://www.fwf.ac.at/en/public_relations/oai/index.html | × | × | Data must be available no more than 2 years after end of project | ||||||||||
NASA | US | http://science.nasa.gov/earth-science/earth-science-data/data-information-policy/ | × | × | Data can be embargoed for 2 years | ||||||||||
NOAA | US | http://www.ncdc.noaa.gov/oa/about/open-access-climate-data-policy.pdf | × | × | |||||||||||
Council for Scientific and Industrial Research | India | http://rdpp.csir.res.in/csir_acsir/Home.aspx | Plan being developed in 2010 | ||||||||||||
North Pacific Research Board | US | http://www.nprb.org/projects/metadata.html | × | × | Data must be transferred to NPRB by the end of the project | ||||||||||
Japan Science and Technology Agency | Japan | http://www.jst.go.jp/EN/index.html | None | ||||||||||||
National Research Foundation | South Africa | http://www.nrf.ac.za/ | None |
Governments. The realization of a Big New Biology will require significant investment in and reorganization of technical and human infrastructure, the creation of new agencies, new policies and implementation frameworks, as well as national and transnational coordination. The scale of these developments will require governmental and intergovernmental participation. Issues that require high-level attention are illustrated by the OECD report that established GBIF (
Several countries have established governmental digital data environments inclusive of the data.gov environments (http://www.data.gov/, http://data.australia.gov.au/, data.gov.uk), or more specialist agencies such as Conabio in Mexico (http://www.conabio.gob.mx/), ABRS, ERIN and ALA in Australia (http://www.environment.gov.au/biodiversity/abrs/, http://www.environment.gov.au/erin/, http://www.ala.org.au/), ITIS in US (http://www.itis.gov/) or the European Environment Agency (http://www.eea.europa.eu/data-and-maps).
With respect to the economics at this level, the OECD, when establishing GBIF, compared the cost of the molecular informatics infrastructure (millions of dollars) against the benefits to pharmaceutical, health and agricultural businesses worth billions of dollars (
Universities. With in excess of 20,000 universities (and institutions modeled on universities) worldwide (Webometrics Ranking of World Universities; http://www.webometrics.info/methodology.html), employing an estimated 5–10 million academics and associated researchers, universities form the largest research and development initiative. Collectively, universities are a significant source of new data and, given their international communal character, will be important as consumers of the data pool. The support, infrastructure and services that universities provide will be a major determinant of the flow and fate of data. Some environments, such as the SURF foundation (http://www.surffoundation.nl/en/actueel/Pages/Researchersenhancetheirpublications.aspx), seek to unite research institutes through the application of new technologies. SURF serves the Dutch context and currently emphasizes 5 disciplines; Life Sciences are not included.
Universities may or may not regard themselves as owners (having IP rights) of data and so may regulate access to data generated in-house or as part of collaborative projects. Universities may or may not have policies that require the retention of research data for a limited period, usually in the range of 3 to 7 years. The University of Melbourne policy is based on guidelines from the National Health and Medical Research Council/Australian Vice Chancellors’ Committee and specifies that “Data must be recorded in a durable and appropriately referenced form” for a minimum of 5 years (http://www.unimelb.edu.au/records/research.html). The Chinese University of Hong Kong encourages researchers to deposit their data in the University Service Center upon completion of their research (http://www.usc.cuhk.edu.hk/Eng/SharingPolicy.aspx). US universities are bound to comply with the requirements of OMB Circular A-110 (Uniform Administrative Requirements for grants and agreements with Institutions of Higher Education, Hospitals, and Other Non-Profit Organizations – http://www.whitehouse.gov/omb/circulars_a110). This specifies that financial records, supporting documents, statistics, and all other records produced in connection with a financial award, including laboratory data and primary data, are to be retained by the institution for a specified period. OMB A-110 also states “The Federal awarding agency(ies) reserve a royalty-free, nonexclusive and irrevocable right to reproduce, publish, or otherwise use the work for Federal purposes, and to authorize others to do so.” Many universities have data policies that target administrative data and administrative agendas rather than promoting the use of data for academic purposes (e.g. “(This) University must retain research data in sufficient detail and for an adequate period of time to enable appropriate responses to questions about accuracy, authenticity, primacy and compliance with laws and regulations governing the conduct of the research” – http://ora.ra.cwru.edu/University_Policy_On_Custody_Of_Research_Data.pdf). As their policies improve, universities will need to play a significant role in educating staff and students as to the value of data. They will be the focus of reshaping the skill base on which the Big New Biology will rely (
Museums and herbaria. Museums and herbaria play special roles within the Life Sciences. Along with libraries, they have a mandate for the long-term preservation of materials. Those materials include several billion specimens of plants, animals and fossils collected by biologists over 3 centuries (
Citizen scientists. Citizen scientists are non-professionals who participate in scientific activities. The appealing richness of nature, its accessibility, and our reliance on natural resources ensure that biology attracts an especially high participation by the citizenry (
Repositories. A repository provides services for the management and dissemination of data, ideally including making data discoverable, providing access, protecting the integrity of the data, ensuring long-term preservation, and migrating data to new technologies (
Examples of repositories for Life Sciences data.
Repositories range in functionality from basic data stores to collaborative databases that incorporate analysis functions (WRAM, Wireless Remote Animal Monitoring, www-wram.slu.se). Some repositories host heterogeneous data sets (such as oceanographic databases – http://woce.nodc.noaa.gov/wdiu/, http://www.nodc.noaa.gov/, http://www.ices.dk/ocean/), but those that provide normalization, standardization, atomization and quality control services (see below) will facilitate the reuse of data and will play a stronger role in data-intensive science. That many older repositories are difficult to access or are not maintained (
The second array of challenges that need to be addressed as we move towards Big New Biology comprises the technical issues that affect the distribution, preservation, accessibility and reuse of data.
Making data accessible. The effective reuse of data requires that an array of conditions (Fig. 2) is optimized.
A Big New Biology can only emerge with a framework that optimizes reuse. Ideally, data should be in forms that can flow from source into a common pool and can flow back out to consumers, be subject to quality control, or be enhanced through analysis to rejoin the pool as processed data.
Data need to be retained. Relatively few data acquired historically have been retained in an accessible form by scientists, projects or institutions (
Data need to be digital. Digitization is a prerequisite for data mobility. Considerable amounts of relevant data are not yet in a digital format (
Data need to be structured. Digital data may be unstructured (e.g. in the form of free text or an image) or they may be structured into categories that are represented consecutively or periodically through the use of a template, spreadsheet or database. The simple structure of a spreadsheet allows records to be represented as rows. Data occur within the cells formed by the intersection of rows and columns defined by metadata (headers). A source may mix both structured and unstructured data, such as when fields include free-form text, images, or atomic data. Unstructured data, such as the legacy data to be found in an estimated 500 million pages of text, can be improved through annotation with metadata provided by curators or through techniques such as natural language processing.
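To make the distinction concrete, the sketch below shows the same observation as unstructured free text and as a structured, spreadsheet-style row in which the column headers act as metadata. The values and column names are illustrative only.

```python
# Illustrative only: one observation as free text and as a structured record.

unstructured = "Saw three American robins foraging on a lawn in Woods Hole on 12 May 2010."

# Structured form: column headers (metadata) define the meaning of each cell.
header = ["scientificName", "individualCount", "locality", "eventDate"]
row = ["Turdus migratorius", 3, "Woods Hole, Massachusetts", "2010-05-12"]

record = dict(zip(header, row))
print(record["scientificName"], record["eventDate"])
```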
Data should be normalized. Normalization brings information contained within different structures to the same format (or structure). Normalization may be as simple as consistently using one type of unit. Placing data within a template is a common first step to normalization. Normalization is a prerequisite for aggregating data. When data are structured and normalized, they can be mobilized in simple formats (tab-delimited or comma-delimited text files) or can be transformed into other structures to meet agreed-upon standards. DiGIR is an early example of a data transformation tool (http://digir.sourceforge.net/). More contemporary tools, such as TAPIR or IPT from GBIF (http://ipt.gbif.org/), can output data in an array of normalized forms.
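A minimal sketch of normalization follows, assuming two hypothetical source records that report the same measurement in different units and layouts; both are reduced to a single structure and a single unit before being written out as tab-delimited text.

```python
import csv
import sys

# Hypothetical source records reporting the same parameter in different units and layouts.
sources = [
    {"taxon": "Zea mays", "plant height": "152 cm"},
    {"species": "Zea mays", "height_m": 1.6},
]

def normalize(rec):
    """Reduce heterogeneous records to one structure, with height in metres."""
    taxon = rec.get("taxon") or rec.get("species")
    if "height_m" in rec:
        height_m = float(rec["height_m"])
    else:
        value, unit = rec["plant height"].split()
        height_m = float(value) / 100.0 if unit == "cm" else float(value)
    return {"scientificName": taxon, "height_m": height_m}

# Emit the normalized records as a simple tab-delimited file (written to stdout here).
writer = csv.DictWriter(sys.stdout, fieldnames=["scientificName", "height_m"], delimiter="\t")
writer.writeheader()
for rec in sources:
    writer.writerow(normalize(rec))
```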
Data should be standardized. Standardization indicates compliance with a widely accepted mode of normalizing. Standards provide terms that define data and relationships among categories of data. Two basic types of standards that are indispensable for management of biological data are metadata and ontologies. Organizations such as TDWG develop new standards, and catalogs of standards and ontologies are available on the web (http://otter.oerc.ox.ac.uk/biosharing/?q=standards, http://wg.sti2.org/semtech-onto/index.php/The_Ontology_Yellow_Pages).
Metadata are terms that define data in ways that may serve different purposes, such as helping people to find data of relevance (that is, they aid the discovery of data -
By articulating what metadata should be applied and how they should be formatted, standards introduce the consistency that is needed for interoperability and machine reasoning. For example, a marine bacterial RNA sequence collected from the environment ideally might be accompanied by metadata on location (latitude, longitude, depth), environmental parameters, collection metadata (collection event, date of collection, sampling device), and an identifier for the bacterium. Without such metadata, the scope of possible queries is much reduced. Examples of minimum reporting requirements have been established by the MIBBI project (
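As a sketch of the kind of minimum metadata envisaged here, an environmental sequence record might be annotated as below. The field names and values are hypothetical and are not drawn verbatim from any specific MIBBI checklist.

```python
# Illustrative metadata for an environmental sequence record; field names are
# hypothetical and not taken verbatim from any published reporting standard.
sequence_metadata = {
    "sequence_id": "example-seq-0001",
    "taxon": "uncultured marine bacterium",
    "latitude": 41.52,                     # decimal degrees
    "longitude": -70.67,
    "depth_m": 25.0,
    "temperature_c": 12.4,
    "salinity_psu": 34.8,
    "collection_event": "example cruise 07, station 3",
    "collection_date": "2010-08-14",
    "sampling_device": "Niskin bottle",
}

# A record missing core fields is far less reusable; a simple completeness check:
required = ["taxon", "latitude", "longitude", "collection_date"]
missing = [f for f in required if sequence_metadata.get(f) in (None, "")]
print("missing required metadata:", missing or "none")
```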
Examples of standards and their location.
An ontology is a formal statement of relationships among concepts represented by metadata terms. Ontologies enable discovery of and reasoning on data through those relationships. Ontologies may use formal descriptive languages to define the relationships. Ontologies are regarded as having great promise (
Ontologies are part of “Knowledge Organization Systems”. Those relating to biodiversity have been discussed by Morris (
Many ontological structures are available for use in Life Sciences (Table 3). Some, such as the observational (http://marinemetadata.org/references/oboeontology, http://www.nceas.ucsb.edu/ecoinfo, https://sonet.ecoinformatics.org/) and taxonomic ontologies (below), have broad applicability - the first within the field of ecoinformatics and the second to biodiversity informatics. Users can adopt existing structures or create their own using an ontology editor such as Protégé (http://protege.stanford.edu/) or OBO-Edit (http://oboedit.org/). The search engines Swoogle (http://swoogle.umbc.edu/) and Sindice (http://sindice.com/) search over 10,000 ontologies and can return a list of those that contain a term of interest. Services such as these help users to determine whether an existing ontology will meet their needs. Often, a user may need to use parts of existing ontologies or merge several ontologies into a single new one. Defining relationships between terms in different ontologies can be accomplished through the use of automated alignment tools such as SAMBO and KitAMO (
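A minimal sketch of working with ontology terms programmatically follows, using the Python rdflib library and made-up example terms rather than any published ontology; it declares two subclass relationships and then asks, via SPARQL, which classes are subclasses of an “Observation” concept.

```python
from rdflib import Graph, Namespace, RDFS

# Made-up namespace and terms for illustration; a real project would reuse
# published ontologies (e.g. found through BioPortal or the OBO Foundry) instead.
EX = Namespace("http://example.org/terms/")

g = Graph()
g.add((EX.TemperatureMeasurement, RDFS.subClassOf, EX.Observation))
g.add((EX.SpeciesOccurrence, RDFS.subClassOf, EX.Observation))

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?cls WHERE {
    ?cls rdfs:subClassOf <http://example.org/terms/Observation> .
}
"""
for row in g.query(query):
    print(row.cls)  # prints the two subclasses declared above
```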
The system of latinized binomial names (such as Homo sapiens) introduced for species in the mid-18th century by Linnaeus is an extensive system of potential metadata for data management in the Life Sciences. Such names have been used to annotate virtually every statement about any of our current catalog of 2.2 million living and extinct forms of life (
Data will need to be atomized. Atomization refers to the reduction of data to minimal semantic units and stands in contrast to complex data such as images or bodies of text. In atomized forms, data may exist as numerical values of variables (e.g. “length of tail: 5.3 cm”), binary statements (e.g. “chloroplasts: absent”), or as the association with metadata terms from agreed-upon vocabularies (e.g. “part of lodicules of lower floret of pedicellate spikelet of tassel”; Zea mays ontology ID ZEA:0015118, http://bioportal.bioontology.org/visualize/3294). Atomized data on the same subject can be brought together if the data are classified in a standard way. Atomization is necessary for machine-based analysis of data from one or more datasets. Many older data centers capture data as files (or packages of files) and the responsibility for extraction of data atoms falls to the user. This can be time-consuming, suggesting that, in the future, atomization needs to occur at or near the source of raw data, becoming part of the responsibilities of the author of the data, the software in which data are logged, or data centers that can provide services to transform data sets.
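The sketch below illustrates atomization with hypothetical values: a fragment of descriptive free text is reduced to minimal, machine-comparable statements, each tying a single value to a defined term.

```python
# Hypothetical example of atomization: a block of descriptive text reduced to
# minimal semantic units that machines can query and aggregate.

free_text = "Adult specimen; tail length 5.3 cm; dark dorsal stripe present."

atoms = [
    {"subject": "specimen-001", "term": "length of tail", "value": 5.3, "unit": "cm"},
    {"subject": "specimen-001", "term": "dorsal stripe", "value": "present", "unit": None},
]

# Atomized data from many sources can now be combined and filtered uniformly.
tail_lengths = [a["value"] for a in atoms if a["term"] == "length of tail"]
print(tail_lengths)
```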
Data need to be published. Projects participating in a Big New Biology will increasingly make data visible and accessible (i.e. published). Scientists may publish data by displaying them in unstructured or structured formats on local, project, or institutional web sites; or they may seek to place data in central repositories. In science generally, over three-quarters of the published data are in local repositories (
Publication of atomized data is essential for large-scale data reuse. Data must be able to move from one computer to another in an intelligent way. As illustrated by the Global Biodiversity Information Facility (http://www.gbif.org/informatics/standards-and-tools/using-data/web-services/), scientific initiatives can add RSS feeds, web services, and APIs (Application Programming Interfaces) to their web sites to broadcast new data or to respond to requests for data. An API facilitates interaction between computers in the same way that a user interface facilitates interactions between humans and computers. Without such services, data may need to be screen-scraped from the web site, a process that is usually costly (because the solution for each site will differ) and, at worst, may require manual re-entry of data. A service-oriented approach is scalable but incurs overhead. Such services are probably best provided through community repositories that can call on appropriate domain-specific knowledge.
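As an illustration of consuming such a web service, the sketch below queries GBIF’s public occurrence search API. The endpoint and parameter names reflect the public API as understood at the time of writing and should be treated as assumptions to be checked against current GBIF documentation.

```python
import requests

# Query GBIF's occurrence search web service for records of one species.
# Endpoint and parameters are assumptions based on GBIF's public API documentation.
url = "https://api.gbif.org/v1/occurrence/search"
params = {"scientificName": "Turdus migratorius", "limit": 5}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
data = response.json()

print("total records:", data.get("count"))
for rec in data.get("results", []):
    print(rec.get("scientificName"), rec.get("country"), rec.get("eventDate"))
```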
Data must be archived. It is preferable that data, once published, are persistent (
Data will ideally be free and open. Open Access, the principle of providing unconstrained access to information on the web, improves the uptake, usage, application and impact of research output (
Data can be trusted. Once data are accessed, consumers may reveal errors and/or omissions. Biological data can be very dirty, especially if they were acquired without expectation that they would be shared later. Any data cleaning procedures should be documented to aid the consumer in assessing whether the source is “suitable for their purpose” (
Data must be attributed. Scientists gain credit in part through attribution. The permanent association of identifiers with open data offers a means of linking attribution to the data and of tracking reuse (
Data can be manipulated. A value of having large amounts of appropriately annotated data available on the web is that users can explore data, in addition to searching for them. Data exploration may result from a desire to test a hypothesis. It is therefore desirable to have tools that draw data together, analyze or visualize them. Exploratory systems include: Humboldt (
Visualizations have the capacity to reveal patterns, discontinuities and exceptions that can inform us as to underlying biological processes, appropriateness of data sets, or consistency of experimental protocols. Visualizations can be used to display the results of analyses of large data sets. Through visualizations we may help address the challenge stated by
Data need to be registered and discoverable. Registries index data resources to alert potential users to their availability. Search engines, the normal indexers of web-accessible materials, are not good at revealing database contents - only about half of the open data in repositories are indexed by search engines (
The “semantic web” has many definitions, but here we think of it as a technical framework that promotes automated sharing and reuse of data across disciplines (
Berners-Lee has promoted four guidelines for linked data (
1. The use of a standard system of Uniform Resource Identifiers (URIs) as “names” for things
2. The use of HTTP URIs so that the names can be looked up on the internet and the data accessed
3. When a URI is looked up, it should return useful information using standards (RDF, SPARQL)
4. Links to other URIs so that users can discover more things.
A URI is a type of persistent identifier made up of a string of characters that unambiguously (at least in an ideal world, see
RDF is a language that defines relationships between things. Relationships in RDF are usually expressed in three parts (often called triples): subject, predicate, and object - in essence, Entity:Attribute:Value. A machine-readable form in RDF may be a statement that “American robin:has_color:red”. Each term is ideally defined stringently by controlled vocabularies and ontologies, and each part is represented within the triple as a URI. The “Value” can be a URI or a literal - the actual value. An advantage of RDF is that it allows datasets to be merged, for example TaxonConcept and Wikipedia (http://www.slideshare.net/pjdwi/biodiversity-informatics-on-the-semantic-web). A goal of the Linking Open Data project is to promote a data commons by registering sets in RDF. As of March 2011, the project had grown to 28 billion triples and 395 million RDF links (
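A minimal sketch of such a statement, built with the Python rdflib library and serialized as Turtle; the URIs are made-up placeholders rather than terms from published vocabularies.

```python
from rdflib import Graph, Literal, Namespace

# Made-up URIs standing in for terms that would normally come from published
# vocabularies; the triple encodes "American robin : has_color : red".
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.American_robin, EX.has_color, Literal("red")))

# Serialize the one-triple graph as Turtle, a common text format for RDF.
print(g.serialize(format="turtle"))
```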
Transformation of data from printed narrative or spreadsheet to semantic-web formats is a significant challenge. Based on existing ontologies, there is enough information to create 10¹⁴ triples in biomedicine alone (
Life Sciences stand to benefit greatly from the advantages of linked data (
Semanticization enables nanopublication, a form of publication that extends traditional narrative publication (
A Big New Biology holds much promise as a means to address some large proximate scientific challenges. Macroscopic tools will enable discovery of hidden features and better descriptions of relationships within the complexity of the biosphere. Yet, to date, progress towards the vision varies enormously, from the successes of high-throughput biology to virtual stasis in some small-science biology. Considerable effort is needed to catalog current practices and to define the sociological transformations that will be required to improve the likelihood of success. If the transformation is to be purposeful, then it will need general oversight, discipline-specific reviews, and a description of the actual and desirable components of the Knowledge Organization System for Biology and their relationships. Some obvious challenges relate to standards and associated ontologies, incentivizing participation, and assembling an appropriate infrastructure and skill base.
Standards and Ontologies. Data standards bring order to the virtual data pool on which a Big New Biology will rely. While complex and finely grained metadata are needed for analyses and for the world of Linked Open Data, the first challenge is to improve the discoverability of data. This process has traditionally been supported by word-of-mouth at conferences or in publications. With standards, registries can enable users to find data sets containing information about taxa, parameters, times, processes, or places of interest. If metadata are absent or incomplete, then the data sets cannot be discovered or reused and cannot contribute to Big New Biology.
Automated data discovery, aggregation and analysis require more comprehensive standards than those currently available for many of the Life Sciences. Instead of a comprehensive system of standards, there is a piecemeal system of metadata, vocabularies, thesauri, ontologies, and data transfer schemas that overlap, compete, and have gaps. Greatest progress is being made outside the Life Sciences (such as georeferencing), or in high-investment areas where data are born digital (such as in genomics,
Two organizational frameworks for Life Sciences data are as yet under-exploited. The first is the system of georeferencing that is in use in rich applications in earth sciences, cartography, and so on. Information on occurrences of species has been and is being collected in vast quantities by a myriad of citizen scientists and is compiled in central databases such as GBIF and OBIS. Its potential is well illustrated by some large-scale applications such as the impressive charting of bird migrations (
Incentives. Despite widespread calls for scientists to make data more widely available, this has yet to happen for many sub-disciplines (
In surveys, (
Infrastructure. In addition to the challenge of incentivizing scientists to share data, the infrastructure for a Big New Biology is incomplete. Funding agencies, like the National Science Foundation in the US, require projects to have plans for data management - a requirement that presumes data persistence. The infrastructure needed to guarantee persistence will require an investment well beyond the usual 3–5 year funding cycle, into multi-decadal periods, and coordination that has international dimensions. The infrastructure must include tools to capture data, policies, data standards, data identifiers, registration of discovery-level metadata, and APIs to share data (Fig. 3). There is as yet no index of data-sharing services (for some initial steps see datacatalogs.org and DataCite http://www.datacite.org/repolist), nor a framework in which such elements could be integrated. There is little assessment of which elements of data plans will lead to persistence of data or their reuse. In the absence of these elements, principal investigators are left to make their own policies, use their own systems, and finance the processes. As long as the response is piecemeal, there can be no assurances of interoperability, efficiency or persistence. At this time, research scientists need to be supported by data managers and data archivists. Institutional libraries and museums are well placed to shift their agendas to include data management and the preservation of digital artifacts, and so may fill this gap, providing institutional, regional or discipline-based services. It is hoped that the ongoing NSF DataNet projects can contribute significantly to the infrastructure.
A new technical challenge is the lack of bandwidth to distribute data from modern data-intense technologies. The problem is illustrated by high-throughput molecular biology, with terabyte- and petabyte-scale data sets (
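A rough back-of-the-envelope calculation, with assumed link speeds and ignoring protocol overhead, illustrates why network transfer becomes limiting at these scales.

```python
# Rough transfer-time estimates for large data sets; link speeds are assumed
# and protocol overhead is ignored, so real transfers would take longer.
sizes_bytes = {"1 TB": 1e12, "1 PB": 1e15}
links_bits_per_s = {"100 Mbps": 1e8, "1 Gbps": 1e9, "10 Gbps": 1e10}

for size_label, size in sizes_bytes.items():
    for link_label, rate in links_bits_per_s.items():
        seconds = size * 8 / rate          # bytes -> bits, divided by link speed
        print(f"{size_label} over {link_label}: {seconds / 86400:.1f} days")

# A petabyte over a sustained 1 Gbps link works out to roughly 93 days of continuous transfer.
```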
Technical infrastructure needed for Big New Biology to fully emerge (based on
There is growing pressure from scientists, funding agencies and governments to use new information technologies to effectively manage the increasingly vast amounts of data emerging from new technologies, to integrate these with smaller data sets, and to enhance the communal nature of science. If successful, biology will be enriched with data-intensive dimensions better suited to address large scale and trans-discipline problems. The transition requires many technical advances and cultural changes. Progress on the technical front to date clearly demonstrates that technical issues can be resolved. The process of sociological adaptation is less convincing. Some sub-disciplines (molecular domains) have embraced data-intensive dimensions, some (environmental ecology) are in transition, and others (such as taxonomy) are just beginning. A much better understanding of the existing cultures is needed before we can promote solutions that will realign the traditions of each community with the common goal of shared data use. Training environments such as Universities need to create a new cadre of scientists trained in computer sciences and biology. Other pressing challenges to data integration relate to the development of comprehensive and agreed metadata and ontologies, and to the semanticization of data so that the discipline can take advantage of the Linked Open Data cloud. The long tail of small data sets presents a special challenge - that of bringing heterogeneous data sets together. At this time, the common denominators that are likely to be effective are georeferencing, citations, and names. All require further investment. None of the elements of the transition will come quickly or cheaply, but these transformations are needed if we are to make the Life Sciences less parochial and more capable of responding to major research challenges.
The authors would like to thank Dmitry Mozzherin, David Shorthouse, Nathan Wilson, Jane Maeinschein, Peter DeVries, Holly Miller, Vince Smith, Daniel Mietchen and members of the Data Conservancy Life Sciences Advisory Group (Mark Schildhauer, Bryan Heidorn, Steve Kelling, Dawn Field, Norman Morrison and Paula Mabee) for valuable comments. This work is supported by NSF award 0830976 The Data Conservancy (A digital research and curation virtual organization).
The topics raised here were explored during a workshop held in Woods Hole, Massachusetts attended by computer, information and biological scientists, and representatives of academia, the private sector and government. A longer “white paper” produced for the National Science Foundation Data Conservancy project is available (