Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
“Glue” programming and scripting are important skills in bioinformatics, allowing the automation of complex and repetitive tasks when analysing large datasets. This assignment is worth 15% of the total mark for the course and requires you to write a script to automate a “pipeline” of bioinformatics tasks. Two options for the assignment are presented. Option 1 involves writing a bash shell script while option 2 involves a scripting language such as Perl or Python. Students who are doing or have done COMP2041 Software Construction and are familiar with Perl should be doing Option 2. If you have or are taking COMP2041 and you choose Option 1 your mark will be capped at 65%.
General considerations:
? Your script should run at the command line in a CSE account, but can assume that the binftools set-up command (source
~binftools/setup-env.sh) has already been run. You can make use of any programs installed in binftools (type ls ~binftools/bin for a list) as well as remote databases and software that you can access using their RESTful API.
Your script should not crash or hang when given incorrect user input (eg, no arguments, or an incorrect argument) and instead should quit gracefully with a suitable error message.
Your script should be properly formatted and commented. Some marks will be given for code that can be understood easily.
You should try to include an additional feature to your script in addition to what is in the spec. It could be an additional analysis step, or an improvement to the suggested user interface. Indicate what that additional feature is and how to activate it in the comments of your script. This additional feature is worth 5% of the mark for the assignment (out of 100%). For example, for option 2, an extra feature could be to have your script also output a graphical representation of your annotations.
That aside the script will be marked on its ability to perform the required task correctly, and its compliance with the specification.
Your script should be your own work, however it is OK to use, as part of it, open source code that has been published as long as its source is clearly acknowledged in the comments. This is especially for option 2.
You can make use of open source libraries such as BioPerl for some tasks as long as this use is acknowledged in the comments, and the script does not require any additional special installation steps such as installing extra libraries. Any extra code should be either already installed on the CSE computers or binftools, or included in the script itself.
Your submission should consist of only one file, either .sh or .pl or .py, and should be directly executable.
Submission will be through a link on the course website.
Submission deadline will be advised on the website.
Please ask your questions on the Assignment forum on Moodle. Email questions will be redirected to the forum.
Finally: TEST, TEST and then TEST some more. Use a variety of inputs, testing locations etc, and make sure your program has no unintended consequences. One year a student submitted a script that did what was asked but then erased everything in the current directory (except for the output file it had created). This was not a good outcome.
Option 1: Bash shell script Write a bash shell script that retrieves all the reviewed mammalian protein sequences from UniprotKB that correspond to a specific gene name (specified as argument 1 when running the script) then creates a multiple sequence alignment out of these sequences then builds a phylogenetic tree from this multiple sequence alignment.
Your submission should consist of one file named uniTree.sh
The script should be invoked using the command ./uniTree.sh GENE where GENE is a gene code in UniprotKB, found in the Gene name [GN] field. Example gene names include INS and EGF. The output of the script should be a phylogenetic tree file in Newick format (nested parentheses) suitable for input into a tree visualisation program such as Dendroscope. The file should be named GENE.tree (where GENE is the gene name specified at the command line). The leaves of the tree should be labelled with only the species name as much as the program allows you to (in some cases you may get multiple proteins from the same species in which case you may need to add a number to the name in the tree). An example output for EGF is on Moodle. However, note that because the databases are constantly updated, the example output may not be exactly as your program will generate it.
The script should perform the following suggested steps:
o Retrieve from UniprotKB all the sequences with the specified gene name in their GN field that also are of Mammalian origin (field OC=
Mammalia [40674]) and have been reviewed for inclusion into SwissProt (field Reviewed set to Yes). The sequences should be retrieved in one file in fasta format. You probably will need to make use of the Uniprot RESTful API for this, together with the UNIX command wget.
o Edit the header of the fasta sequences so that they contain only the species name. If there are multiple sequences from the same species (for example a protein with multiple isoforms) you can add a number to the species name – eg Mus musculus1 and Mus musculus2.
o Run a multiple sequence alignment program (eg clustalw) to align the sequences and create a multiple sequence alignment
o Run another program to build a phylogenetic tree out of the multiple sequence alignment. Clustalw with the –tree option is probably the easiest option for this as phylip programs are quite
tricky to script and require writing a “batch file” (but do not simply use the clustalw guide tree created during multiple sequence alignment (.dnd file) as the tree as it is not a true phylogenetic tree).