Friday, March 4, 2011

Excel Hacks: Reverse Complement a Nucleotide Sequence

Reverse complementing a nucleotide sequence in Excel turns out to be really ugly - in fact, this may be one case where it's better to do some cut and pasting into a web tool or finally go pick up that Perl book that's been sitting on your shelf. However, if you’re desperate and absolutely have to do it in Excel, here's how you can reverse complement short (~30 base) nucleotide sequences. 



Complementing


Let's break reverse complementing into its two steps - reversing the DNA sequence and then taking its complement. We'll start with complementing:

= SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( A1, "A", 1), "C", 2), "G", 3),"T", 4), 1, "T"), 2, "G"), 3, "C"), 4, "A")

As in previous posts, we assume the sequence to be complemented is in cell A1.

This rather lengthy formula is performing a two-step substitution. First, nucleotides are converted to numbers, e.g. “A” to “1”, “C” to “2”, etc. Next, the numbers are then replaced with the complementary nucleotide, so “1” goes to “T”, “2” to “G” and so on. Doing the substitution in two steps allows us to distinguish which bases have been complemented from those that have not. The formula works correctly for nucleotides “A”, “C”, “G”, “T” and “N”. Other degenerate IUPAC codes could be handled, but we’ll leave this as an exercise for the reader.

Reversing



Surprisingly, reversing the DNA sequence turns out to be even trickier. Most programming languages have a function to reverse text, but not Excel. Breaking out a macro in Excel might be worthwhile since writing a little Visual Basic to reverse text isn't too bad. Unfortunately, this is not possible for folks with Macs using Office 2008 since macro functionality was removed.

For a general method to reverse our sequence, our options are limited and ugly. Perhaps one of the less awful methods is repeated use of the MID function. Assuming that our complemented nucleotide sequence is in cell B1, we would do something like this:

= MID(B1,30,1) & MID(B1,29,1) & MID(B1,28,1) & MID(B1,27,1) & MID(B1,26,1) & MID(B1,25,1) & MID(B1,24,1) & MID(B1,23,1) & MID(B1,22,1) & MID(B1,21,1) & MID(B1,20,1) & MID(B1,19,1) & MID(B1,18,1) & MID(B1,17,1) & MID(B1,16,1) & MID(B1,15,1) & MID(B1,14,1) & MID(B1,13,1) & MID(B1,12,1) & MID(B1,11,1) & MID(B1,10,1) & MID(B1,9,1) & MID(B1,8,1) & MID(B1,7,1) & MID(B1,6,1) & MID(B1,5,1) & MID(B1,4,1) & MID(B1,3,1) & MID(B1,2,1) & MID(B1,1,1)

The formula constructs the reverse sequence base-by-base up to 30 bases. Make certain that your sequence to reverse is 30 or fewer nucleotides, otherwise your sequence will be truncated without warning. You could extend the function to handle larger sequences, but clearly this method is only practical for short sequences.