NO LATE ASSIGNMENTS WILL BE ACCEPTED,
TO ALLOW US TO DISTRIBUTE SOLUTIONS BEFORE THE MIDTERM EXAM.
the appropriate object.
Transcription factors (TF) are proteins that recognize and bind to DNA, and in doing so regulate the expression of a nearby gene. Each type of TF has an affinity for a particular short DNA pattern. For example, the E-‐box TF likes to bind the sequence CAGCTG, whereas the Runx TF likes to bind CCACA. However, every transcription factor has a certain flexibility in the DNA sequences it can bind to. For example, the E-‐box TF will sometimes bind CAGGTG, or CAGCAG. The Runx TF may also bind CCAGA or ACACA. Biologists can identify DNA sequences bound by a given transcription factor using experiments such as chromatin immunoprecipitation followed by sequence (ChIP-‐seq). The outcome of such an experiment is a set of DNA sequences from the genome that are bound by the TF. For example, a ChIP-‐seq experiment on transcription factor E2F1 may produce the following set of 10 sequences:
ACGATG ACAATG ACGATC ACGATC TCGATC TCGAGC TAGATC TAAATC AAAATC ACGATA
These DNA sequences are probably only a small subset of all the sites bound by E2F1 in the genome. We would like to be able to learn a model of what type of pattern E2F1 likes to bind, in order to predict, in the human genome, sites that may have been missed by the experiment. In order to do so, bioinformaticians have developed a model called Position Weight Matrix (PWM). Here, you will learn how this model works, and you will have to implement it in Python.
To build a PWM, one first needs to build a Position Frequency Matrix (PFM). A PFM is matrix (table) of 4 rows and L columns, where L is the length of the binding site sequences. In the E2F1 example, L=6. The entry in row i and column j of the PFM, which we denote PFM[i][j], is the frequency of nucleotide i at position j of the binding sites. For E2F1, based on the 10 sites listed above, the PFM is
0 1 2 3 4 5
0 (A) 6 | 3 | 3 | 10 | 0 | 1 |
PFM(E2F1) = 1 (C) 0 | 7 | 0 | 0 | 0 | 7 |
2 (G) 0 | 0 | 7 | 0 | 1 | 2 |
3 (T) 4 | 0 | 0 | 0 | 9 | 0 |
The PFM can be converted to a PWM using the following formula:
��� � � = log!” − log!” 0.25
The variable p, called the pseudocount is generally set to a small value such as 0.1. For our E2F1 example, the PWM is:
0 1 2 3 4 5
A | 0.37 | 0.07 | 0.07 | 0.59 | -1.41 | -0.37 |
C | -1.41 | 0.43 | -1.41 | -1.41 | -1.41 | 0.43 |
G | -1.41 | -1.41 | 0.43 | -1.41 | -0.37 | -0.09 |
T | 0.19 | -1.41 | -1.41 | -1.41 | 0.54 | -1.41 |
You will notice that large positive values in the PWM (e.g. 0.59) correspond to positions where the TF has a strong preference for a particular nucleotide (e.g. an A at position 3).
We can now use a PWM to calculate the score of a match between a candidate sequence of L nucleotides and the PWM. For example, the score of candidate sequence TCGATC is obtained by summing up the appropriate values from the PWM:
Score(TCGATG) = PWM[T][0] + PWM[C][1] + PWM[G][2] + PWM[A][3] + PWM[T][4] + PWM[G][5]
= 0.19 + 0.43 + 0.43 + 0.59 + 0.54 + (-‐0.09) = 2.11
whereas for sequence ACATAG, we get
Score(ACATAG) = PWM[A][0] + PWM[C][1] + PWM[A][2] + PWM[T][3] + PWM[A][4] + PWM[G][5]
= 0.37 + 0.43 + 0.07 + (-‐1.41) + (-‐1.41) + (-‐0.09) = -‐2.03
The larger the score, the higher the affinity of the transcription factor for the sequence.
Finally, we can use a PWM to scan a longer sequence to identify potential binding sites for the TF. This is done by calculating the PWM score for every possible portion of L nucleotides of the sequence. For example:
Sequence = ATCGATGAGACTGA Score(0) = Score(ATCGAT) = -‐3.87… Score(1) = Score(TCGATG) = 1.65… Score(2) = Score(CGATGA) = -‐4.16… Score(3) = Score(GATGAG) = -‐4.16… Score(4) = Score(ATGAGA) = -‐0.01… Score(5) = Score(TGAGAC) = -‐2.55…
…
Positions whose score exceeds a user-‐chosen threshold are reported as a putative binding sites for the TF. For example, if we choose a threshold of -‐0.2, then positions 1 and 4 would be reported as putative sites.