dbSNP

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species, developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of only one class of polymorphism, namely single nucleotide polymorphisms (SNPs), it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences. As of build 131 (available February 2010), dbSNP had amassed over 184 million submissions representing more than 64 million distinct variants for 55 organisms, including Homo sapiens, Mus musculus, Oryza sativa, and many other species. NCBI announced that it would phase out support for all non-human organisms in dbSNP and dbVar during 2017.
dbSNP is an online resource implemented to aid biology researchers. Its goal is to act as a single database that contains all identified genetic variation, which can be used to investigate a wide variety of genetically based natural phenomena. Specifically, access to the molecular variation cataloged within dbSNP aids basic research such as physical mapping, population genetics, and investigations into evolutionary relationships, and makes it possible to quickly and easily quantify the amount of variation at a given site of interest. In addition, dbSNP guides applied research in pharmacogenomics and the association of genetic variation with phenotypic traits.
According to the NCBI website, “The long-term investment in such novel and exciting research promises not only to advance human biology but to revolutionise the practice of modern medicine.”
dbSNP accepts submissions for any organism from a wide variety of sources, including individual research laboratories, collaborative polymorphism discovery efforts, large-scale genome sequencing centers, other SNP databases (e.g. The SNP Consortium and HapMap), and private businesses. Every submitted variation receives a submitted SNP ID number (“ss#”). This accession number is a stable and unique identifier for that submission. Unique submitted SNP records also receive a reference SNP ID number (“rs#”; “refSNP cluster”). However, more than one record of a variation will likely be submitted to dbSNP, especially for clinically relevant variations. To accommodate this, dbSNP routinely assembles identical submitted SNP records into a single reference SNP record, which is also a unique and stable identifier (see below).
To submit variations to dbSNP, one must first acquire a submitter handle, which identifies the laboratory responsible for the submission. Next, the submitter is required to complete a submission file containing the relevant information and data. Submitted records must contain the ten essential pieces of information listed in the following table. Other information required for submissions includes contact information, publication information (title, journal, authors, year), molecule type (genomic DNA, cDNA, mitochondrial DNA, chloroplast DNA), and organism. A sample submission sheet is available from NCBI (https://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=how_to_submit#SECTION_TYPES).
New information obtained by dbSNP becomes available to the public periodically in a series of “builds” (i.e. revisions and releases of data).
There is no schedule for releasing new builds; instead, builds are usually released when a new genome build becomes available, assuming that the genome has some cataloged variation associated with it. This occurs approximately every 1–2 months. Because genome sequences often contain errors, reference SNPs (“refSNPs”) from previous builds, as well as newly submitted SNPs, are re-mapped to the newly available genome sequence through multiple cycles of BLAST and MegaBLAST. Multiple submitted SNPs that map to the same location are clustered into one refSNP cluster and assigned a reference SNP ID number. If two refSNP cluster records are found to map to the same location (i.e. are identical), dbSNP also merges those records. In this case, the smallest refSNP ID number (i.e. the earliest record) now represents both records, and the larger refSNP ID numbers become obsolete. These obsolete refSNP ID numbers are not used again for new records. When a merger of two refSNP records occurs, the change is tracked, and the former refSNP ID numbers can still be used as search queries. This process of merging identical records reduces redundancy within dbSNP.
There are two exceptions to the above merging criteria. First, variations of different classes (e.g. a SNP and a DIP) are not merged. Second, clinically important refSNPs that have been cited in the literature are termed “precious”; a merger that would eliminate such a refSNP is never performed, since it could later cause confusion.
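The merging rules described above can be illustrated with a small Python sketch. It is not dbSNP's actual implementation: the `RefSnp` record, the location string format, and the function names `merge_refsnps` and `resolve` are hypothetical, invented here for illustration. The sketch also assumes that a "precious" refSNP that would otherwise be absorbed is simply left as its own active record, which is one possible reading of the second exception.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RefSnp:
    """Hypothetical, simplified refSNP record (not the real dbSNP schema)."""
    rs_id: int
    location: str           # mapped position, e.g. "chr1:100" (assumed format)
    var_class: str          # variation class, e.g. "SNP" or "DIP"
    precious: bool = False  # clinically important and cited in the literature


def merge_refsnps(records):
    """Merge refSNPs that map to the same location and share a class.

    Returns (active, merged_into): the records that remain active, and a
    history mapping each obsolete rs ID to the rs ID that now represents it.
    """
    # Exception 1: different classes are never merged, so the class is
    # part of the grouping key alongside the mapped location.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.location, rec.var_class)].append(rec)

    active = []
    merged_into = {}
    for group in groups.values():
        group.sort(key=lambda r: r.rs_id)
        survivor = group[0]  # the smallest (earliest) rs ID represents the group
        active.append(survivor)
        for rec in group[1:]:
            if rec.precious:
                # Exception 2 (assumed handling): skip any merger that
                # would eliminate a precious refSNP.
                active.append(rec)
            else:
                # Track the merger so the old ID still works as a query.
                merged_into[rec.rs_id] = survivor.rs_id

    active.sort(key=lambda r: r.rs_id)
    return active, merged_into


def resolve(rs_id, merged_into):
    """Look up the current rs ID for a possibly obsolete one."""
    return merged_into.get(rs_id, rs_id)
```

With records rs300 and rs120 both mapped to the same position as SNPs, `merge_refsnps` keeps rs120 and records rs300 as merged into it, so `resolve(300, merged_into)` returns 120. A DIP at that position, or a precious SNP with a larger ID, stays active under the assumptions stated above.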
