Sunday, December 12, 2010

Excel Hacks: Calculating GC%

As a quick hack, you can calculate the GC% of a DNA sequence in Excel without breaking out the macros. Here’s how to do it, assuming your sequence is in cell A1:

= (1 - LEN(SUBSTITUTE(SUBSTITUTE(A1,"G",""),"C",""))/LEN(A1)) * 100

Although the formula is a bit daunting, the principle is straightforward. First, use the SUBSTITUTE function to get a version of your sequence with all G’s and C’s removed. Then use the LEN function to get the length of this new sequence and divide it by the length of the original sequence. This gives you the fraction of the sequence that is NOT a G or a C. To get the GC%, subtract this fraction from 1 and multiply by 100 to go from a fraction to a percent. One caveat: SUBSTITUTE is case-sensitive, so the sequence needs to be in uppercase (wrapping A1 in UPPER takes care of that).
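For comparison, here is a minimal Python sketch of the same remove-and-compare trick. The function name gc_percent is just an illustrative choice, and unlike the Excel formula this version uppercases the input first, so treat it as a rough equivalent rather than an exact translation.

```python
def gc_percent(seq: str) -> float:
    """Return the GC% of a sequence using the same trick as the Excel formula."""
    seq = seq.upper()  # assumption: lowercase input should count too
    if not seq:
        raise ValueError("sequence is empty")
    # Remove every G and C, like the nested SUBSTITUTE calls.
    non_gc = seq.replace("G", "").replace("C", "")
    # Fraction that is NOT G or C, subtracted from 1, scaled to a percent.
    return (1 - len(non_gc) / len(seq)) * 100


if __name__ == "__main__":
    print(gc_percent("ATGC"))
```

Like the formula, this counts only literal G’s and C’s; every other character is treated as non-GC.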

This will probably satisfy most folks, but it’s not perfect. For an accurate calculation, we need to take into account ambiguous IUPAC base codes. For example, an "N" (any base) should be treated as 0.5 of a G-C base, and a "D" (A, G, or T) should count as 0.33. Although it’s possible to extend the hack above to handle these cases, it might be a good time to break out Perl or Python.
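If you do go the Python route, one way to handle the ambiguity codes is a lookup table that gives each IUPAC code the fraction of its possible bases that are G or C. This is only a sketch: the names GC_WEIGHT and gc_percent_iupac are made up for illustration, and weighting each possibility equally is an assumption that matches the 0.5 and 0.33 figures above.

```python
# Fraction of the bases each IUPAC code can stand for that are G or C.
# Assumption: every possible base for a code is equally likely.
GC_WEIGHT = {
    "A": 0.0, "C": 1.0, "G": 1.0, "T": 0.0, "U": 0.0,
    "R": 0.5,    # A or G
    "Y": 0.5,    # C or T
    "S": 1.0,    # G or C
    "W": 0.0,    # A or T
    "K": 0.5,    # G or T
    "M": 0.5,    # A or C
    "B": 2 / 3,  # C, G or T
    "D": 1 / 3,  # A, G or T
    "H": 1 / 3,  # A, C or T
    "V": 2 / 3,  # A, C or G
    "N": 0.5,    # any base
}


def gc_percent_iupac(seq: str) -> float:
    """Return GC%, counting ambiguity codes as fractional G-C bases."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    try:
        total = sum(GC_WEIGHT[base] for base in seq)
    except KeyError as exc:
        # Gaps, whitespace and other symbols are rejected rather than guessed at.
        raise ValueError(f"unexpected character {exc.args[0]!r}") from None
    return total / len(seq) * 100


if __name__ == "__main__":
    print(gc_percent_iupac("GCNN"))
```

Rejecting unknown characters is a design choice; you may prefer to strip gaps or whitespace before calling it.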
