Thursday, December 12, 2013

Guest post: response to "Putting GenBank Data on the Map"

The following is a guest blog post by David Schindel and colleagues and is a response to the paper by Antonio Marques et al. in Science (doi:10.1126/science.341.6152.1341-a).

Marques, Maronna and Collins (1) rightly call on the biodiversity research community to include latitude/longitude data in database and published records of natural history specimens. However, they have overlooked an important signal that the community is moving in the right direction. The Consortium for the Barcode of Life (CBOL) developed a data standard for DNA barcoding (2) that was approved and implemented in 2005 by the International Nucleotide Sequence Database Collaboration (INSDC; GenBank, ENA and DDBJ) and revised in 2009. All data records that meet the requirements of the data standard include the reserved keyword 'BARCODE'. The required elements include: (a) information about the voucher specimen from which the DNA barcode sequence was derived (e.g., species name, unique identifier in a specimen repository, country/ocean of origin); (b) a sequence from an approved gene region with minimum length and quality; and (c) primer sequences and the forward and reverse trace files. Participants in the workshops that developed the data standard decided to include latitude and longitude as strongly recommended elements but not as strict requirements, for two reasons. First, many voucher specimens from which BARCODE records are generated were collected before GPS devices were available. Second, barcoding projects such as the Barcode of Wildlife Project (4) are concentrating on rare and endangered species, and publishing the GPS coordinates of their collecting localities would facilitate illegal collecting and trafficking that could contribute to biodiversity loss.
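To make the split between required and recommended elements concrete, here is a minimal Python sketch of a completeness check. It is not the INSDC or CBOL validation procedure. The `BarcodeRecord` class, its field names, the `check_record` function and the example values are all hypothetical, and the length threshold is left as a caller-supplied parameter because the text above does not state one. Sequence quality is not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BarcodeRecord:
    """Hypothetical container; field names are illustrative, not INSDC names."""
    species_name: str = ""
    voucher_id: str = ""            # unique identifier in a specimen repository
    country_or_ocean: str = ""
    gene_region: str = ""
    sequence: str = ""
    primers: List[str] = field(default_factory=list)
    has_forward_trace: bool = False
    has_reverse_trace: bool = False
    lat_lon: Optional[Tuple[float, float]] = None  # recommended, not required


def check_record(record: BarcodeRecord, approved_regions: set, min_length: int):
    """Return (missing required elements, missing recommended elements)."""
    required = {
        "species name": bool(record.species_name),
        "voucher identifier": bool(record.voucher_id),
        "country/ocean of origin": bool(record.country_or_ocean),
        "approved gene region": record.gene_region in approved_regions,
        "minimum sequence length": len(record.sequence) >= min_length,
        "primer sequences": bool(record.primers),
        "forward and reverse trace files": (
            record.has_forward_trace and record.has_reverse_trace
        ),
    }
    recommended = {"latitude/longitude": record.lat_lon is not None}
    missing_required = [name for name, ok in required.items() if not ok]
    missing_recommended = [name for name, ok in recommended.items() if not ok]
    return missing_required, missing_recommended


# Made-up example record with no coordinates.
record = BarcodeRecord(
    species_name="Example species",
    voucher_id="EXAMPLE:12345",
    country_or_ocean="Example country",
    gene_region="COI",
    sequence="ACGT" * 150,
    primers=["ACGTACGT", "TGCATGCA"],
    has_forward_trace=True,
    has_reverse_trace=True,
)
# min_length=500 is an illustrative threshold only, not taken from the text.
print(check_record(record, approved_regions={"COI"}, min_length=500))
```

In this sketch a record without coordinates still satisfies every required element and is only flagged under the recommended list, which mirrors the design choice described above.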

The BARCODE data standard is promoting precisely the trend toward georeferencing called for by Marques, Maronna and Collins. Table 1 shows that there are currently 346,994 BARCODE records in INSDC (3). Of these BARCODE records, 83% include latitude/longitude data. Although latitude/longitude is not a required element in the data standard, this level of georeferencing is much higher than for all records of the cytochrome c oxidase I gene (COI, the BARCODE region), 16S rRNA, and cytochrome b (cytb), another mitochondrial region that was used for species identification prior to the growth of barcoding. Data are also presented on the numbers and percentages of data records that include information on the voucher specimen from which the nucleotide sequence was obtained. In an increasing number of cases, these voucher specimen identifiers in INSDC are hyperlinked to the online specimen data records in museums, herbaria and other biorepositories. Table 2 provides these same data for the time interval used in the Marques et al. letter (1). These tables indicate the clear effect that the BARCODE data standard is having on the community’s willingness to provide more complete data documentation.

Table 1. Summary of metadata for GPS coordinates and voucher specimens associated with all data records.

| Categories of data records | Total number of GenBank records | With Latitude/Longitude | With Voucher or Culture Collection Specimen IDs |
|---|---|---|---|
| BARCODE | 347,349 | 286,975 (83%) | 347,077 (~100%) |
| All COI | 751,955 | 365,949 (49%) | 531,428 (71%) |
| All 16S | 4,876,284 | 461,030 (9%) | 138,921 (3%) |
| All cytb | 239,796 | 7,776 (3%) | 84,784 (35%) |
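
The percentages in Table 1 follow directly from the counts. As a small check, the sketch below recomputes them in Python. The counts are copied from Table 1; the names `TABLE_1` and `coverage` are introduced here for illustration only.

```python
# Counts copied from Table 1: (total records, with lat/lon, with voucher ID).
TABLE_1 = {
    "BARCODE": (347_349, 286_975, 347_077),
    "All COI": (751_955, 365_949, 531_428),
    "All 16S": (4_876_284, 461_030, 138_921),
    "All cytb": (239_796, 7_776, 84_784),
}


def coverage(total, with_lat_lon, with_voucher):
    """Return the two coverage figures as percentages of the total."""
    return 100 * with_lat_lon / total, 100 * with_voucher / total


for category, counts in TABLE_1.items():
    lat_lon_pct, voucher_pct = coverage(*counts)
    print(f"{category:10s} lat/lon {lat_lon_pct:5.1f}%  voucher {voucher_pct:5.1f}%")
```

Rounded to whole numbers, these values match the percentages printed in the table.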

Table 2. Summary of metadata for GPS coordinates and voucher specimens associated with data records submitted between 1 July 2011 and 15 June 2013.

| Categories of data records | Total number of GenBank records | With Latitude/Longitude | With Voucher or Culture Collection Specimen IDs |
|---|---|---|---|
| BARCODE | 160,615 | 132,192 (82%) | 160,615 (100%) |
| All COI | 302,507 | 166,967 (55%) | 231,462 (77%) |
| All 16S | 1,535,364 | 232,567 (15%) | 49,150 (3%) |
| All cytb | 74,631 | 2,920 (4%) | 24,386 (33%) |


The DNA barcoding community's data standard is demonstrating two positive trends: better documentation of specimens in natural history collections, and new connectivity between databases of species occurrences and DNA sequences. We believe that these trends will become standard practices in the coming years as more researchers, funders, publishers and reviewers acknowledge the value of, and begin to enforce compliance with, the BARCODE data standard and related minimum information standards for marker genes (5).

DAVID E. SCHINDEL¹, MICHAEL TRIZNA¹, SCOTT E. MILLER¹, ROBERT HANNER², PAUL D. N. HEBERT², SCOTT FEDERHEN³, ILENE MIZRACHI³
  1. National Museum of Natural History, Smithsonian Institution, Washington, DC 20013–7012, USA.
  2. University of Guelph, Ontario, Canada.
  3. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

References

  1. Marques, A. C., Maronna, M. M., & Collins, A. G. (2013). Putting GenBank Data on the Map. Science, 341(6152), 1341–1341. doi:10.1126/science.341.6152.1341-a
  2. Consortium for the Barcode of Life, http://www.barcodeoflife.org/sites/default/files/DWG_data_standards-Final.pdf (2009)
  3. Data in Tables 1 and 2 were drawn from GenBank (http://www.ncbi.nlm.nih.gov/genbank/) [data as of 1 October 2013]
  4. Barcode of Wildlife Project, http://www.barcodeofwildlife.org (2013)
  5. Yilmaz, P., Kottmann, R., Field, D., Knight, R., Cole, J. R., Amaral-Zettler, L., Gilbert, J. A., et al. (2011). Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology, 29(5), 415–420. doi:10.1038/nbt.1823