This tutorial is specific to the files placed on the server. For practice, you can upload your own genome and genes of interest and use the same protocol by changing the filenames to your reference and your gene of interest data set.
We will start with a simple question - is my gene of interest in this particular genome?
Then, progress to efficient methods for naming all the genes in the genome based on highly conserved regions.
Finally, we will talk about customizing our search databases to find genes of interest and explore them using web-based tools.
If we have time, we will discuss the annotation of noncoding regions.
http://www.exoticsguide.org/bugula_neritina https://en.wikipedia.org/wiki/Bryozoa
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4109969/
MyD88 Toll Interleukin Receptor Pathways (TIR, TLR) etc., many more
Let’s create a file to test for the presence of MyD88
1.) Navigate to http://uniprot.org
2.) Search for non-vertebrate, non-arthropod MyD88 proteins.
A.) gene:myd88 NOT taxonomy:"Vertebrata [7742]" NOT taxonomy:"Arthropoda [6656]"
B.) Select all the sequences (should be 18 remaining) and choose "align".
C.) Kick out outliers (I removed A0A2B4S3X5, A0A2B4SJ75, A0A2B4RT06, Q4W1E7_SUBDO)
D.) Return to main screen and choose "download" and format Fasta (canonical)
E.) Copy and paste the text directly into the terminal or make a file on your desktop using notepad.
Note: don’t use Microsoft word for text copy/paste for command line, it adds hidden characters which will mess up the code.
3.) Make a text file that holds the resulting fasta sequences for use in a few minutes. For this demo, call it MyD88.fasta
What is the Uniprot database?
https://www.uniprot.org/downloads
What is MyD88?
Log into the server and follow these steps.
nano MyD88.fasta
1.) Copy and paste the sequences from the text file we just created. If you are using a traditional terminal (putty or iterm2) this is a right click. If you are using the old web shell in Jetstream, right click and choose paste. If you are using the new web shell in Jetstream, use control+alt+shift and follow the directions.
2.) Control-X to save file
If you get stuck, here is a way to download it directly:
cd ~
mkdir annotation
cd ~/annotation
wget https://de.cyverse.org/dl/d/E9A541A2-907C-459F-8133-4F559CE0F3F1/MyD88.fasta
Using the key provided in the slack, you can add an authorization to your log-in to allow for a remote connection to view files. We will demonstrate if time allows.
Lisa can demonstrate if we have time and if we cannot get the data into the system quickly using one of the other methods.
A direct link between remote computers avoids the time and effort of downloading files to your personal computer first. This is a good option for genomes.
cd ~/annotation
wget https://de.cyverse.org/dl/d/A47FAD90-1837-4868-8896-61231F14F779/genome_canu_filtered.fasta
Protein to Protein searches are more efficient in Blast, so let’s convert our genome into a set of proteins instead of nucleotides.
We will use Transdecoder to do this.
What are some reasons why converting our raw scaffold into proteins will help us find our proteins?
1.) Load the program Transdecoder into our instance
conda
conda install transdecoder
After installation, check to see if you get instructions for the program by running:
TransDecoder.LongOrfs
If not, retry the install.
2.) Run the program on our reference genome.
TransDecoder.LongOrfs -t ~/annotation/genome_canu_filtered.fasta
Note: you can replace the demonstration reference with your genome for practice later on. Using a variable called “reference” allows us to write code once and reuse it by just re-setting what we mean by “reference”. Using variables in your script can help you be more efficient as you can try multiple searches using the same script just by changing the reference, or feeding a list of reference genomes into your bash script.
This can take up to an hour, so we need to stop the program prematurely. It won’t hurt anything, so just hit Control-C after a few minutes.
Now we need to make the peptide list a searchable blast database.
First, install the blast tools set from NCBI:
sudo apt install ncbi-blast+
Note: using sudo is usually reserved for machine owners. In this course, we can install programs because we make our own machines. This will not work if you are on a privately managed server, such as a university HPC. Please consult with your system administrators on best practices for installing programs.
Check the install.
makeblastdb
1.) Copy the protein version of the genome into the annotation folder.
cp ~/annotation/$REFERENCE.transdecoder_dir.__checkpoints_longorfs/longest_orfs.pep ~/annotation
Note: your directory name may be slightly different, use ls to see whusing at the directory name is before the copy
If you could not get Transdecoder to run, here is a version longest_orfs.pep you can upload into your annotation directory.
cd ~/annotation
wget https://de.cyverse.org/dl/d/24F8DAC4-0DA7-4A45-945C-E5E36DEFAEA7/Bugula.pep
2.) Make a blast database of the genome for searching.
makeblastdb -in longest_orfs.pep -dbtype prot -title Bugula.pep -out Bugula.pep
3.) Let’s blast our sequences to see if we get a result.
blastp -query ~/annotation/MyD88.fasta -db Bugula.pep -outfmt 6 -evalue 1e-5 -out blastp.MyD88.Bugula.pep.outfmt6
Note: outfmt6 is a very versatile tool that can be used to output your gene hits as a table, read more about it here:
https://www.ncbi.nlm.nih.gov/books/NBK279682/
This should be a rather short blast operation, because we only have a few sequences.
Did you get any results at all?
conda install hmmer
Check the install of hmmer by typing the following and seeing if the instructions are provided.
hmmscan
conda install muscle
Check the install by typing the following and checking for the instructions.
muscle
First we have to align the sequences (as we did when we were selecting the perfect set for our analysis). It is important to eyeball them at some point to make sure that they are not composed of non-overlapping, too long or too short sequences. We did our original scan on uniprot. This can also be done on NCBI and through the online version of muscle. It is better to be more conservative in the number of sequences rather than trying to include everything. Some databases contain mislabeled sequences that can confound your search for proteins.
muscle -in MyD88.fasta -out MyD88.msa
Take a look inside the file to see if the alignments make sense. Kick out any sequences that are making the sequences less cohesive.
less -S MyD88.msa
Then, we are going to use hmmbuild to analyze the occurrence of each peptide in relation to the position in the aligned proteins.
hmmbuild MyD88.hmm MyD88.msa
The final step to create a mathematical matrix that shows the probability of each transition based on the data provided.
hmmpress MyD88.hmm
ls
The MyD88.htm is now a searchable hmm profile that we can use with our genome to find proteins which are probably related.
hmmscan --domtblout Bugula.MyD88.domtblout MyD88.hmm Bugula.pep
Look at the results in the Bugula.MyD88.domtblout
Are there any potential matches to our hmm model?
It can take a lot of time to do multiple sequence alignments for individual genes, so we are going to switch to using a reference hmm profile to annotate all the orf’s in our genome.
We are going to stick with proteins, so please be aware that many items in our genome will not be annotated.
This involves downloading some freely available databases curated by the scientific community.
Why is a curated database a good first pass for searching for gene names?
We are going to download two databases and put them in our computers for searching.
Pfam is described here: https://pfam.xfam.org/ Swissprot is described here: https://www.uniprot.org/help/uniprotkb_sections
cd ~
mkdir databases
cd ~/databases
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
Now we are ready to search all of our genome peptides using hmmer agains the Pfam database that has already been pressed into a hmmer searchable table.
cd ~/annotation
hmmscan --domtblout Bugula.All.domtblout ~/databases/Pfam-A.hmm Bugula.pep
This can be interrupted and the file can be read without hurting anything. So let the program run for a little while then hit Control-C.
We can do the same using blast against the swissprot fasta. Basically, any fasta can be searched against another fasta.
blastp -query Bugula.pep -db ~/database/uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 1e-5 > Bugula.swissprot.blastp.outfmt6
Also let this run for a few minutes and terminate early with Control-C
Let’s take a look at our data and discuss a few things.
1.) How much does the reference database influence our ability to annotate the Bugula genome?
2.) Which approach was more sensitive - local alignments with BLAST or hmm profile search with HMMER?
3.) What strategies might you use with these two tools to improve your annotations?
Thank you!