Math/Probability Problem

StarOceanHouse

Hey guys, I need help with this problem. I wanna see if I did it right. Here it goes.

1. (15 points) The overall composition of the M. tuberculosis H37Rv genome is A = T = 31.5%, C = G = 68.5%. Suppose you have a random sequence containing 4,411,000 nucleotides, but with these proportions of nucleotides. What would be the expected number of times the sequence CTAG would occur in the whole sequence?
Discrepancies between observed and expected tetranucleotide counts highlight features that may have interesting biochemical explanations, such as unusual flexibility or mismatch repair. CTAG is of interest for the latter reason, as its occurrence is rare in prokaryotic genomes, possibly causing kinks under conditions of supercoiling. In this context, it may serve a specific structural purpose, as a binding site for repressor proteins.
This is a probability problem, which can be solved by applying a method known as Whittle’s equation. There are also programs available, such as codonW, which perform correspondence analysis of codon usage. For this problem, however, we will attempt to make a valid estimate as follows. Notice that this sequence is palindromic; thus, if it occurs on one strand, it occurs on the other. So, the cumulative probability of CTAG is the product of these (0.315)

So I figured the probability of CTAG occurring in any order is 1/4!, which is 1/24. Then I multiply it by 31.5/100 * 31.5/100 * 68.5/100 * 68.5/100, which gives 0.001939952 = probability of CTAG occurring in that exact order. Multiplying that by the number of nucleotides in the whole sequence gives me 8557.
 
Well, I found a paper on Whittle's Equation, applied to nucleotide sequences:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.9725

...but apparently it involves solving some matrices, which I have no idea how to do. Interestingly, the paper contains specific reference to the CTAG sequence (on page 2). Your best bet is to download and figure out codonW, which apparently is a free program:
http://codonw.sourceforge.net

Good luck!
 
You start out by computing the probability that any one 4-nucleotide window is CTAG, which is 31.5/100 * 31.5/100 * 68.5/100 * 68.5/100 = 0.046558851. Your 0.001939952 is that product divided by 24, and you don't need the 1/24 because the order is fixed.

If you want the expected number in a long sequence, you use linearity of expectation: the expected total number of appearances equals the sum of the expected appearances at each location, and that holds even though overlapping windows aren't independent.

So you just take that per-window probability and multiply by the number of 4-nucleotide windows, which is K - 3, where K is the total length of the nucleotide sequence.
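
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the linearity-of-expectation calculation. It takes the percentages at face value as quoted in the problem (31.5% for A and T, 68.5% for C and G), which is an assumption: those four figures sum to 200%, so they may have been meant as combined A+T and C+G totals. The function names are made up for illustration.

```python
from math import prod

# Frequencies exactly as quoted in the thread, taken at face value.
# Assumption: they sum to 2.0, so swap in your own values if the
# problem meant A+T = 31.5% and C+G = 68.5%.
FREQS = {"A": 0.315, "T": 0.315, "C": 0.685, "G": 0.685}

def window_probability(motif, freqs):
    """Probability that one fixed window spells `motif`, bases independent."""
    return prod(freqs[base] for base in motif)

def expected_count(motif, freqs, length):
    """Linearity of expectation: per-window probability times number of windows."""
    windows = length - len(motif) + 1  # K - 3 for a 4-letter motif
    return windows * window_probability(motif, freqs)

p = window_probability("CTAG", FREQS)
print(f"per-window probability:   {p:.9f}")
print(f"same value divided by 4!: {p / 24:.9f}")
print(f"expected count:           {expected_count('CTAG', FREQS, 4_411_000):.1f}")
```

The second printed line is only there to show where the 1/24 factor enters. A fixed order like C, T, A, G needs no permutation factor, so the first line is the per-window probability under this model.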
 
I think the question is poorly worded. It seems the intention was to count the number of CTAGs read from either side of the DNA molecule. If so, it should have been made clear that we are talking about a sequence of base pairs. Anyway, the above is correct, or off by a factor of two.
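
The factor-of-two question comes down to CTAG being its own reverse complement. Here is a small Python sketch, using a made-up toy sequence rather than real genome data, that shows each occurrence on one strand has a matching occurrence on the other. Whether that counts as one site or two depends on how the question is read.

```python
# Watson-Crick pairing used to build the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def count_overlapping(seq, motif):
    """Count every window of `seq` that spells `motif`."""
    k = len(motif)
    return sum(seq[i:i + k] == motif for i in range(len(seq) - k + 1))

toy = "GGCTAGCCTAGA"  # hypothetical sequence, for illustration only
forward = count_overlapping(toy, "CTAG")
reverse = count_overlapping(reverse_complement(toy), "CTAG")

print(reverse_complement("CTAG"))  # the motif read from the other strand
print(forward, reverse)            # counts on each strand
```

Since the two strand counts always agree for a palindromic motif, counting per strand gives exactly twice the number of distinct sites.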
 