Language: C++ only. Allowed constructs: if/else statements, for and while loops, arrays, functions, and vectors.
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <cctype>
#include <algorithm>
The first file ecoli.fa is a FASTA file which contains the DNA sequence data. Here is an excerpt from the file:
>Chromosome dna_rm:chromosome chromosome:ASM584v2:Chromosome:1:4641652:1 REF
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
ACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTG
GCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAA
CGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCG
CAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATT
AGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAA
ATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGTTACTGTTATC
GATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCA
GGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGAC
TACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGAC
GTTGACGGGGTCTATACCTGCGACCCGCGTCAGGTGCCCGATGCGAGGTTGTTGAAGTCG
ATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGC
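The main function further down calls readFastaFile, which the assignment text does not define. A minimal sketch of one possible version follows. It assumes that every line starting with '>' is a header to skip and that all remaining lines are sequence data to be joined into one string. The signature is inferred from how main calls it, so treat it as a guess rather than the required interface.

```cpp
#include <fstream>
#include <string>
using namespace std;

// Hypothetical helper: reads a FASTA file and returns the sequence as one string.
// Assumption: lines beginning with '>' are headers and carry no sequence data.
string readFastaFile(string filename) {
    ifstream file(filename);
    string sequence = "";
    string line;
    while (getline(file, line)) {
        if (line.size() > 0 && line[0] != '>') {
            for (size_t i = 0; i < line.size(); i++) {
                // Drop carriage returns and spaces so only sequence characters remain.
                if (line[i] != '\r' && line[i] != ' ') {
                    sequence += line[i];
                }
            }
        }
    }
    // If the file could not be opened, the loop never runs and "" is returned.
    return sequence;
}
```

Appending character by character keeps the sketch within the allowed constructs. Unexpected bases are left in place here, since the transcription step is the one asked to ignore them.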
The second file in the project folder is a CSV file named codon_table.csv which contains the codon list. Here is an excerpt from the file:
Codon,AA.Abv,AA.Code,AA.Name
UUU,Phe,F,Phenylalanine
UUC,Phe,F,Phenylalanine
UUA,Leu,L,Leucine
UUG,Leu,L,Leucine
CUU,Leu,L,Leucine
In the above table, AA.Abv is the abbreviation of the amino acid, AA.Code is its one-letter code, and AA.Name is its full name. There are 64 codons in the file. One amino acid can be represented by multiple codons, all of which produce the same amino acid. For example, both the UUU and UUC codons are translated as phenylalanine.
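The main function further down calls readCsvFile without using a return value, so the codon table presumably ends up in some shared storage. The sketch below is one way to do that under several assumptions: the file is comma-separated, its first line is the header, the columns appear in the order Codon, AA.Abv, AA.Code, AA.Name, and the results go into two parallel global vectors. The names codon_list and code_list are made up for this illustration.

```cpp
#include <fstream>
#include <string>
#include <vector>
using namespace std;

// Hypothetical globals: codon_list[i] is translated to code_list[i].
vector<string> codon_list;
vector<string> code_list;

// Hypothetical helper: loads the Codon and AA.Code columns of the CSV file.
void readCsvFile(string filename) {
    ifstream file(filename);
    string line;
    bool is_header = true;
    while (getline(file, line)) {
        if (is_header) {
            is_header = false;  // skip the column-name row
        } else {
            // Split the line on commas (assumed delimiter).
            vector<string> fields;
            string field = "";
            for (size_t i = 0; i < line.size(); i++) {
                if (line[i] == ',') {
                    fields.push_back(field);
                    field = "";
                } else if (line[i] != '\r') {
                    field += line[i];
                }
            }
            fields.push_back(field);
            // Column 0 is the codon, column 2 is AA.Code (assumed order).
            if (fields.size() >= 3) {
                codon_list.push_back(fields[0]);
                code_list.push_back(fields[2]);
            }
        }
    }
}
```

Two parallel vectors keep the lookup to a plain loop over indices, which fits the list of allowed constructs.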
Write a function transcribe(dna_string) that creates the mRNA string from the DNA string. Each base in dna_string must be matched to its corresponding mRNA base. The DNA string might contain characters other than A, T, C, and G; these should be ignored. The only valid matchings are A -> U, T -> A, G -> C, and C -> G.
string transcribe(string dna_string){
//this function must take the DNA string and construct a new mRNA string
//then return the mRNA string
}
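One way to fill in this stub is sketched below, using only a loop and an if/else chain. Treating lowercase bases the same as uppercase ones is an assumption that the assignment text does not state. Any other character is skipped.

```cpp
#include <cctype>
#include <string>
using namespace std;

// Sketch: builds the mRNA string base by base using the A->U, T->A, G->C, C->G matchings.
string transcribe(string dna_string) {
    string mrna = "";
    for (size_t i = 0; i < dna_string.size(); i++) {
        // Assumption: lowercase bases count as valid, so normalize to uppercase.
        char base = toupper(static_cast<unsigned char>(dna_string[i]));
        if (base == 'A') {
            mrna += 'U';
        } else if (base == 'T') {
            mrna += 'A';
        } else if (base == 'G') {
            mrna += 'C';
        } else if (base == 'C') {
            mrna += 'G';
        }
        // Any other character (N, newline, digits, ...) is ignored.
    }
    return mrna;
}
```

For example, transcribe("ATGC") returns "UACG", and transcribe("AxTN") returns "UA" because x and N are dropped.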
Write a function translate that accepts the mRNA string as a parameter and creates a string vector of proteins. Each item in the vector is a string consisting of the amino acid codes of one protein. The function must return the protein vector as its result.
Each protein's amino acid sequence starts with M (Methionine), the starting amino acid, and ends with a Stop marker. So the function should:
look for mRNA sequences that start with the AUG codon;
detect the end (a UAG, UGA, or UAA codon);
in between, identify the corresponding amino acids for the codons to construct the protein;
save the protein string in the vector (use the push_back function).
vector<string> translate(string mrna_string) {
//create a protein vector and return it.
}
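The four steps listed above can be sketched as follows. Several choices here are assumptions, not requirements from the assignment: the codon table lives in two parallel global vectors (codon_list and code_list, both hypothetical names, declared here so the sketch compiles on its own), scanning resumes right after each stop codon, and a start codon with no stop codon in its reading frame produces no protein.

```cpp
#include <string>
#include <vector>
using namespace std;

// Hypothetical globals that readCsvFile is assumed to fill:
// codon_list[i] is translated to code_list[i].
vector<string> codon_list;
vector<string> code_list;

// Linear search for a codon; returns "" when the codon is not in the table.
string lookup_code(string codon) {
    for (size_t i = 0; i < codon_list.size(); i++) {
        if (codon_list[i] == codon) {
            return code_list[i];
        }
    }
    return "";
}

bool is_stop(string codon) {
    return codon == "UAA" || codon == "UAG" || codon == "UGA";
}

vector<string> translate(string mrna_string) {
    vector<string> proteins;
    size_t i = 0;
    while (i + 3 <= mrna_string.size()) {
        if (mrna_string.substr(i, 3) == "AUG") {
            // Read codons three bases at a time until a stop codon appears.
            string protein = "";
            bool finished = false;
            size_t j = i;
            while (j + 3 <= mrna_string.size() && !finished) {
                string codon = mrna_string.substr(j, 3);
                if (is_stop(codon)) {
                    protein += "Stop";
                    finished = true;
                } else {
                    protein += lookup_code(codon);
                }
                j += 3;
            }
            if (finished) {
                proteins.push_back(protein);
                i = j;  // assumption: continue scanning after the stop codon
            } else {
                i++;    // no stop codon in this frame: drop the partial protein
            }
        } else {
            i++;
        }
    }
    return proteins;
}
```

With a table that maps AUG to M, UUU to F, and GGC to G, the input "CCAUGUUUGGCUAGAAAUGUAA" yields two entries, "MFGStop" and "MStop". Whether the search should instead restart one base after each AUG is not stated in the assignment, so compare against the expected output before relying on this choice.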
Use the following print and main functions to connect the steps and print the resulting protein vector:
void print_protein_list(vector<string> list) {
for(string line : list)
{
cout << line << endl;
}
cout << list.size() << " proteins listed" << endl;
}
int main() {
string dnastring = readFastaFile("ecoli.fa");
readCsvFile("codon_table.csv");
string mrnastring = transcribe(dnastring);
vector<string> protein_list = translate(mrnastring);
print_protein_list(protein_list);
return 0;
}
The first few lines of the output should look like this:
0 -> MDGTHLILKStop
1 -> MKLVISVSRVCLFLMSHVLStop
2 -> MVVVVVMVSIATPDCACPLCLFSGVDCHARKKKLVSIAPLLVRSQLQAAMStop
3 -> MGYSRYGLHKNGLKTALSGGGSAPRATALTFESSStop
4 -> MTIARPAFStop
5 -> MELRWQLStop
6 -> MRRRHDRRTNARLTTLStop
7 -> MDAGRSPRATLQQLQLQDGPSLPRKDEAAISRSGGVVMGVAGQGLGNGLIFMAFRSSWSMRVTTVGTTSAStop
8 -> MRStop
9 -> MSStop
10 -> MDLDFLPNDLGDRHCLADRStop
11 -> MSSLRQVMASPIASQTLTAATATATRRPRStop
12 -> MHRRHNGStop
….