Sunday, March 18, 2012

Genome Coordinate Cheat Sheet

The following table is a quick cheat sheet identifying the genomic coordinate conventions used at different sites, file formats, etc.  I welcome corrections and additions in the comments. I'll add them to the companion sheet as time allows.

A more detailed description of genome coordinate systems can be found in my earlier post. Note that in this table, I consider zero-based, half-open coordinate conventions to be equivalent to space-counted, zero-start and so do not distinguish them in the table.

Name | Resource Type | Chromosome | 0 vs 1 | Space vs Base | Notes
--- | --- | --- | --- | --- | ---
UCSC Genome Browser | Genome Browser | chr1, chr2, .. chrX, chrY, chrM | 1 | Base | Note that when zooming in on the genome browser, the positioning of the tick marks appears to use a space-counted, 0-start convention. As discussed here, this is not the intention. To get the 1-based position, the base coordinate corresponds to the tick mark to its immediate right.
NCBI Map Viewer | Genome Browser | 1, 2, .. X, Y, MT | 1 | Base |
Ensembl Location Viewer | Genome Browser | 1, 2, .. X, Y, MT | 1 | Base |
UCSC BLAT | Web Tool | chr1, chr2, .. chrX, chrY, chrM | 1 | Base |
NCBI BLAST | Web Tool | chr1, chr2, .. chrX, chrY, chrMT | 1 | Base |
UCSC Table Browser | Web Tool | chr1, chr2, .. chrX, chrY, chrM | 0 | Space | Output formats are all 0-space formats. However, the region field used to specify a query takes 1-base coordinates.
BED Format | File Format | chr1, chr2, .. chrX, chrY, chrM | 0 | Space |
WIG Format | File Format | chr1, chr2, .. chrX, chrY, chrM | 1 | Base | UCSC's Wiggle Track Format
Galaxy Interval Format | File Format | chr1, chr2, .. chrX, chrY, chrM | 0 | Space |
GFF/GTF/GFF3 Format | File Format | Depends on context | 1 | Base | Chromosome names depend on the resource using the file format.
VCF Format | File Format | Depends on context | 1 | Base | Chromosome just needs to refer to an identifier in a reference genome or can be a contig id. Position 0 and N+1 (where N = chrom length) are used to refer to telomeres.
UCSC Annotation Files | Data File | chr1, chr2, .. chrX, chrY, chrM | 0 | Space |
Special thanks to John DiCarlo's insights and suggestions!

Genome Coordinate Conventions

Denoting a contiguous region in a reference genome seems straightforward enough. However, I’ve seen many a bioinformatician get tripped up by the different conventions and formats used on web sites, file formats and genomic tools. Here are some things to keep in mind.

Chromosome

Taking the human genome as our example, chromosome names may or may not use a prefix so chromosome 12 may be “chr12”, “ch12” or simply “12”. Also, the mitochondrial chromosome may be “MT” or simply “M” and the sex chromosomes might be denoted as “X” and “Y” or “23” and “24”.
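
One way to cope with these naming variants is to normalize everything to a single style before comparing coordinates across resources. Here is a minimal sketch of that idea. The function name `normalize_chrom` and the choice of UCSC-style output are my own assumptions for illustration, it only handles the variants mentioned above, and the "23"/"24" aliases only make sense for human.

```python
def normalize_chrom(name: str) -> str:
    """Map a human chromosome label to UCSC style (chr1 .. chr22, chrX, chrY, chrM).

    Hypothetical helper: covers only the prefix and alias variants
    described in the text, nothing more.
    """
    label = name.strip()
    lowered = label.lower()
    # Strip an optional "chr" or "ch" prefix (checked case-insensitively).
    for prefix in ("chr", "ch"):
        if lowered.startswith(prefix):
            label = label[len(prefix):]
            break
    label = label.upper()
    # Aliases sometimes used for the sex and mitochondrial chromosomes.
    aliases = {"23": "X", "24": "Y", "MT": "M"}
    return "chr" + aliases.get(label, label)


for raw in ("chr12", "ch12", "12", "MT", "24"):
    print(raw, "->", normalize_chrom(raw))
```

In practice you would pick whichever style your downstream files use; the point is only to pick one and convert at the edges.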

Strand

Typical conventions for denoting the plus vs minus strand orientation of a sequence include using “1” and “-1” or “+” and “-”.  It is worth noting that for most of the genomic resources I use, genomic positions are almost always given relative to the plus strand even if the feature is said to be on the minus strand.  For example, BLAT-ing a minus strand sequence against the human genome at UCSC will return a search result with "-" strand but the start and stop positions will be given relative to the plus strand.  Positions given relative to the minus strand, by contrast, would start counting at the opposite end of the chromosome (the tip of the q-arm).
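
To make that last point concrete, here is a small sketch of re-expressing a plus-strand interval as if it were counted from the other end of the chromosome. It assumes intervals where the start is included and the end is excluded, counting from 0. The function name `flip_strand` and the 100-base toy chromosome are made up for illustration.

```python
from typing import Tuple


def flip_strand(start: int, end: int, chrom_length: int) -> Tuple[int, int]:
    """Re-express a zero-based, end-exclusive interval relative to the other strand.

    Assumes 0 <= start <= end <= chrom_length.
    """
    if not 0 <= start <= end <= chrom_length:
        raise ValueError("interval lies outside the chromosome")
    # Counting from the far end, the old end becomes the new start.
    return chrom_length - end, chrom_length - start


# On a toy 100-base chromosome, the first ten plus-strand bases are
# the last ten when counting from the opposite end.
print(flip_strand(0, 10, 100))
```

Flipping twice returns the original coordinates, and the interval length is unchanged, which makes for a handy sanity check.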

Position

Several conventions exist which differ in (1) what number they start with when numbering bases, (2) whether they number the bases themselves or the spaces between bases and (3) whether the interval is considered "open" or "closed".  The notion of "closed" and "open" intervals is a mathematical concept which in this context implies either that the start and stop of the interval should be included (a closed interval) or they should not be included (an open interval).  Sometimes you will see square brackets used to denote closed intervals (e.g. [12,20]) and parentheses used to denote open intervals (e.g. (12,20)).
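
As a quick illustration of the closed vs open distinction, the sketch below lists the integer positions covered by [12,20] and (12,20). The helper name `positions` is hypothetical.

```python
from typing import List


def positions(start: int, stop: int, closed: bool) -> List[int]:
    """List the integer positions covered by [start, stop] or (start, stop)."""
    if closed:
        # Closed: both endpoints belong to the interval.
        return list(range(start, stop + 1))
    # Open: neither endpoint belongs to the interval.
    return list(range(start + 1, stop))


print(len(positions(12, 20, closed=True)))   # 9 positions: 12 through 20
print(len(positions(12, 20, closed=False)))  # 7 positions: 13 through 19
```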

Below are examples of three of the more common conventions. In each case I show the coordinates of an ATG subsequence (in red) and a cut site (marked by a red triangle).

Base-counted, one-start (a.k.a. one-based, fully-closed)

ATG location: 7-9 or [7,9]
Cut site: 11^12 or (11,12)
Interval length = stop - start + 1
Notes:
  • This is by far the most common convention used by the major genome browsers, tools like BLAST and BLAT, etc.
  • Using a base-counted system is problematic for describing features that occur between bases such as insertions or enzyme cut sites.  To deal with this, some conventions replace the "-" usually used to separate the start and stop position with an alternative notation such as "^".  Alternatively, one could use the parentheses notation. 
  • Base-counted systems can short-hand a one base interval with just the start (or stop) location.  This is useful for denoting the location of SNPs, for example.
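
Here is a rough sketch of working with this convention in code. The 15-base sequence is invented so that an ATG falls at positions 7-9 as in the example above, and the function names are hypothetical.

```python
# Hypothetical sequence with ATG at one-based positions 7-9.
SEQ = "GATTACATGCCGTAA"


def fetch_one_based(seq: str, start: int, stop: int) -> str:
    """Return the bases in the one-based, fully-closed interval [start, stop]."""
    if not 1 <= start <= stop <= len(seq):
        raise ValueError("interval lies outside the sequence")
    # Python slices count from 0 and exclude the end, so only the start shifts.
    return seq[start - 1:stop]


def length_one_based(start: int, stop: int) -> int:
    """Interval length = stop - start + 1."""
    return stop - start + 1


print(fetch_one_based(SEQ, 7, 9))   # ATG
print(length_one_based(7, 9))       # 3
print(fetch_one_based(SEQ, 7, 7))   # A
```

The last call is the single-base case, which this convention would usually abbreviate to just "7".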


Space-counted, zero-start

ATG location: 6-9
Cut site: 11-11
Interval length = stop - start


Notes:
  • Less common, this convention is attractive because of its natural way of denoting features between bases such as insertions, etc.  Length calculations are also simple.  
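
The sketch below shows why features between bases are so natural here: a space number can be used directly as the point at which to split a string. The sequence is invented (with ATG between spaces 6 and 9) and the function names `cut` and `insert_at` are hypothetical.

```python
from typing import Tuple

# Hypothetical sequence; the ATG lies between spaces 6 and 9.
SEQ = "GATTACATGCCGTAA"


def cut(seq: str, space: int) -> Tuple[str, str]:
    """Split seq at a space (0 is before the first base, len(seq) after the last)."""
    if not 0 <= space <= len(seq):
        raise ValueError("space lies outside the sequence")
    return seq[:space], seq[space:]


def insert_at(seq: str, space: int, insertion: str) -> str:
    """Insert bases at a space without touching the bases on either side."""
    left, right = cut(seq, space)
    return left + insertion + right


print(cut(SEQ, 11))              # ('GATTACATGCC', 'GTAA')
print(insert_at(SEQ, 11, "NN"))  # GATTACATGCCNNGTAA
print(9 - 6, 11 - 11)            # lengths: 3 for the ATG, 0 for the cut site
```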


Zero-based, half-open

ATG location: 6-9 or [6,9)
Cut site: 11-11 or [11,11)
Interval length = stop - start


Notes:
  • In this convention, the start base is included in the interval but not the stop base.
  • This convention is used in data formats (especially at UCSC) such as BED.
  • Although conceptually different, space-counted, zero-start and zero-based, half-open give the same start and stop coordinates for intervals.
  • Python programmers will find this convention familiar, as Python slices sequences in the same manner.
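
Converting between this convention and one-based, fully-closed comes down to shifting the start by one and leaving the stop alone. A minimal sketch follows, with hypothetical function names; the tab-separated chrom/start/end line is the first three columns of BED.

```python
from typing import Tuple


def one_based_to_half_open(start: int, stop: int) -> Tuple[int, int]:
    """[start, stop] counting from 1  ->  [start, end) counting from 0."""
    return start - 1, stop


def half_open_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """[start, end) counting from 0  ->  [start, stop] counting from 1."""
    if start >= end:
        # A zero-length interval such as a cut site has no closed equivalent.
        raise ValueError("empty interval cannot be written as a closed interval")
    return start + 1, end


def bed_line(chrom: str, start: int, end: int) -> str:
    """First three BED columns: chrom, chromStart, chromEnd."""
    return f"{chrom}\t{start}\t{end}"


start, end = one_based_to_half_open(7, 9)
print(start, end)                      # 6 9
print(end - start)                     # 3
print(half_open_to_one_based(6, 9))    # (7, 9)
print(bed_line("chr1", start, end))
```

Note that the cut site 11-11 deliberately raises an error on the way back, since base-counted conventions need extra notation for it.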


Conclusions

As you can see, the differences between conventions are subtle, which makes it very easy to make undetected errors. In my experience, programmers prefer to use one of the zero-start conventions since most computer languages use zero-based indexing, whereas biologists are more fond of the one-start conventions. This commonly leads to a juggling of conventions where software using a zero-start convention is switched to one-start for displaying results to the user. The UCSC Genome Browser is a good example of a site that straddles conventions between the user interface and the underlying data tables. It must cause some confusion because they've devoted a FAQ entry to the issue.
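
A sketch of what that juggling tends to look like in practice: intervals are stored zero-start internally and only converted at the boundary where a user reads or types a region. The function names and the "chrom:start-stop" string layout (with optional thousands separators) are assumptions for illustration, not any particular site's code.

```python
from typing import Tuple


def format_region(chrom: str, start: int, end: int) -> str:
    """Render a zero-start, end-exclusive interval as a one-start display string."""
    if start >= end:
        raise ValueError("cannot display an empty interval as start-stop")
    return f"{chrom}:{start + 1:,}-{end:,}"


def parse_region(region: str) -> Tuple[str, int, int]:
    """Parse a one-start 'chrom:start-stop' string into zero-start storage."""
    chrom, _, span = region.partition(":")
    first, _, last = span.replace(",", "").partition("-")
    return chrom, int(first) - 1, int(last)


print(format_region("chr1", 1000, 2000))   # chr1:1,001-2,000
print(parse_region("chr1:1,001-2,000"))    # ('chr1', 1000, 2000)
```

Keeping the conversion in exactly two functions like this at least confines the off-by-one risk to one place.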

My personal favorite convention is the space-counted, zero-start convention. All intervals require a start and stop location (even for single base features like SNPs), which makes up for in consistency what it might lack in conciseness. Denoting a location between two bases, such as for an insertion or enzyme cut site, is conceptually clear and does not require syntactic trickery like the base-counted methods. My understanding is that one of the many differences between the public and Celera human genome sequencing efforts was that the public effort used base-counted, one-start while Celera used space-counted, zero-start, although I don't have a reference to confirm this.

I haven't touched on the coordinate conventions used in the human variant world which give positions relative to cDNA. The HGVS variant nomenclature is a good example of this.  Among the interesting features of this convention are starting counting (at 1) with the A of the ATG initiation codon, using negative coordinates and skipping the 0 base altogether. Good times.
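
To show what skipping the 0 base means in code, here is a minimal sketch that numbers transcript bases relative to the A of the ATG. It covers only the feature just described and ignores the rest of the nomenclature. The function name `cdna_position` and the example offsets are hypothetical.

```python
def cdna_position(index: int, atg_index: int) -> int:
    """Number a transcript base relative to the A of the ATG, with no position 0.

    Both arguments are zero-based offsets into the transcript;
    atg_index is where the A of the initiation codon sits.
    """
    offset = index - atg_index
    # The A and everything after it count 1, 2, 3, ...
    # Bases before the A count -1, -2, -3, ... so 0 is never used.
    return offset + 1 if offset >= 0 else offset


# Suppose the A of the ATG is at zero-based offset 6 in the transcript.
print(cdna_position(6, 6))   # 1
print(cdna_position(5, 6))   # -1
print(cdna_position(8, 6))   # 3
```

One consequence worth remembering is that positions -1 and 1 are adjacent, so naive subtraction of two coordinates that straddle the ATG is off by one.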

Until our evil bioinformatics overlords descend upon us and force us to standardize to one system, I've started a cheat sheet to help me remember who's using what convention. Please have a look, correct me where I'm wrong and point out what I'm missing.

References

  • A great blog post from the Bergman lab dealing with genome coordinate conventions with an emphasis on transposable elements annotation.
  • UCSC's description of the zero-based, half-open convention.