Sunday, March 18, 2012

Genome Coordinate Cheat Sheet

The following table is a quick cheat sheet identifying the genomic coordinate conventions used at different sites, file formats, etc.  I welcome corrections and additions in the comments. I'll add them to the companion sheet as time allows.

A more detailed description of genome coordinate systems can be found in my earlier post. Note that in this table, I consider zero-based, half-open coordinate conventions to be equivalent to space-counted, zero-start and so do not distinguish them in the table.
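
As a rough illustration of how the two conventions in the table relate, here is a minimal Python sketch that converts between a 0-based, half-open interval (the "0 / Space" rows) and a 1-based, fully-closed interval (the "1 / Base" rows). The function names are made up for this example and do not come from any particular library.

```python
def zero_half_open_to_one_closed(start, end):
    """Convert a 0-based, half-open interval [start, end) to 1-based, closed."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    # Only the start shifts; the end value is numerically unchanged.
    return start + 1, end


def one_closed_to_zero_half_open(start, end):
    """Convert a 1-based, closed interval [start, end] to 0-based, half-open."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def length_zero_half_open(start, end):
    """Number of bases covered by a 0-based, half-open interval."""
    return end - start


def length_one_closed(start, end):
    """Number of bases covered by a 1-based, closed interval."""
    return end - start + 1
```

Under these assumptions, the first 100 bases of a chromosome are (0, 100) in the 0-based, half-open convention and (1, 100) in the 1-based, closed convention, and both length functions report 100 bases.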

| Name | Resource Type | Chromosome | 0 vs 1 | Space vs Base | Notes |
| --- | --- | --- | --- | --- | --- |
| UCSC Genome Browser | Genome Browser | chr1, chr2, .. chrX, chrY, chrM | 1 | Base | Note that when zooming in on the genome browser, the positioning of the tick marks appears to use a space-counted, 0-start convention. As discussed here, this is not the intention. To read the 1-based position of a base, use the tick mark to its immediate right. |
| NCBI Map Viewer | Genome Browser | 1, 2, .. X, Y, MT | 1 | Base | |
| Ensembl Location Viewer | Genome Browser | 1, 2, .. X, Y, MT | 1 | Base | |
| UCSC BLAT | Web Tool | chr1, chr2, .. chrX, chrY, chrM | 1 | Base | |
| NCBI BLAST | Web Tool | chr1, chr2, .. chrX, chrY, chrMT | 1 | Base | |
| UCSC Table Browser | Web Tool | chr1, chr2, .. chrX, chrY, chrM | 0 | Space | Output formats are all 0-space formats. However, the region field used to specify a position takes 1-base format. |
| BED Format | File Format | chr1, chr2, .. chrX, chrY, chrM | 0 | Space | |
| WIG Format | File Format | chr1, chr2, .. chrX, chrY, chrM | 1 | Base | UCSC's Wiggle Track Format |
| Galaxy Interval Format | File Format | chr1, chr2, .. chrX, chrY, chrM | 0 | Space | |
| GFF/GTF/GFF3 Format | File Format | Depends on context | 1 | Base | Chromosome names depend on the resource using the file format. |
| VCF Format | File Format | Depends on context | 1 | Base | The chromosome just needs to refer to an identifier in a reference genome, or it can be a contig id. Positions 0 and N+1 (where N = chromosome length) are used to refer to telomeres. |
| UCSC Annotation Files | Data File | chr1, chr2, .. chrX, chrY, chrM | 0 | Space | |

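
To make the BED versus region-field difference in the table concrete, here is a small Python sketch that turns a BED line (0 / Space) into a "chrom:start-end" position string (1 / Base) and back. It assumes a tab-separated BED line with at least three columns and a region string of the form chrom:start-end, optionally with commas in the numbers. The helper names are hypothetical and are not part of any UCSC tool.

```python
def bed_to_region(bed_line):
    """Turn 'chrom<TAB>start<TAB>end...' (0-based, half-open) into 'chrom:start-end' (1-based, closed)."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Shift only the start; the end is the same number in both conventions.
    return f"{chrom}:{start + 1}-{end}"


def region_to_bed(region):
    """Turn 'chrom:start-end' (1-based, closed) into a three-column BED line."""
    # rpartition splits on the last colon, in case the sequence name contains one.
    chrom, _, span = region.rpartition(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return f"{chrom}\t{start - 1}\t{end}"
```

For example, the BED line with columns chr1, 999, 2000 and the position string chr1:1000-2000 describe the same 1001 bases.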
Special thanks to John DiCarlo for his insights and suggestions!

5 comments:

caseybergman said...

It's important to clarify that UCSC uses 0-based coordinates for the underlying data in their databases (e.g. downloads, MySQL, Table Browser) as well as for data you submit via custom tracks. It is only the genome browser on the web that uses (visually) 1-based coordinates.

mr.AA said...

Hi Casey,

Thanks for the feedback! Definitely worth calling this out as UCSC juggles between the two systems for BLAT results, the browser, BED files, underlying data, etc. I've tried to mention each of these cases here and I added one for the annotation files because of your suggestion.

Regarding your last point, I'm not sure I agree. My take is that UCSC, NCBI and Ensembl are all using a 1-based visualization. For example, this zoom in at Ensembl appears to use a 1-based, base-numbered system. Indeed, the number of bases shown corresponds to the stop - start + 1 you'd expect for this system.

Also, kudos again on your great post on genome coordinate systems!

Max said...

To make things not too easy, UCSC wiggle files are 1-based:
http://genome.ucsc.edu/goldenPath/help/wiggle.html

mr.AA said...

Nice catch, Max. I've added it to the table.

Anonymous said...

Good morning

I think that your blog is very nice! The content is quite useful.
Keep up the outstanding posts.

regards,