
Bacillus megaterium Genome Sequencing Workshop

Hands-on Computer Exercise


Many of the links on this page require a password. To get one, e-mail Rick Johns (rjohns@niu.edu).

Link to my PowerPoint presentation from this morning.


Step 0. Saving Your Results

A basic principle in all scientific investigations is that you need to write down what you have done and what the results were. The easiest way to do this is to open a Word document and record what you are doing as you go through the steps on this page. This is going to be your lab notebook for the day. To do this, go to the "Novell-delivered Application" at the bottom of the desktop, and scroll down until you find "WordXP". Start a new file and be sure to save it from time to time.

 

Step 1. Finding Long Open Reading Frames

Go to the ORF page using the button below. Scan all 6 reading frames for long regions with no stop codons. Be sure to scan the reverse frames from right to left! You are looking for the longest ORFs that don't overlap more than 50 bp, so don't pick smaller ones that are inside larger ones. The minimum allowable size is 100 bp, but most will be much longer than that.

As a hint, this image contains 4 long ORFs, each at least 1000 bp long.

How many ORFs did you find? Which strand are they on? How long are they?

Get the coordinates of the downstream stop codon and the farthest upstream start codon for each long ORF. Also, write down the base sequence of both the start codon and the stop codon.
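If you want to check your by-eye scan afterwards, the search can be sketched in a few lines of Python. This is only an illustration, not the program behind the ORF page. The function names are made up for this example, the start codons (ATG, GTG, TTG) are an assumption based on the common bacterial starts, and coordinates are reported as 1-based positions on the forward strand. The sketch does not apply the 50 bp overlap rule, so you still have to discard small ORFs that sit inside larger ones.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_on_strand(seq, min_len=100):
    """Return (start, end) pairs, 0-based half-open, for one strand.

    start is the farthest upstream start codon in a stop-free stretch,
    end is the position just past the stop codon that closes it.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None and i + 3 - start >= min_len:
                    found.append((start, i + 3))
                start = None  # the next ORF must begin after this stop
            elif start is None and codon in STARTS:
                start = i  # keep only the first (farthest upstream) start
    return found


def find_orfs(seq, min_len=100):
    """Scan all 6 frames; return (strand, left, right), longest first.

    left and right are 1-based forward-strand coordinates.
    """
    seq = seq.upper()
    n = len(seq)
    results = []
    for s, e in orfs_on_strand(seq, min_len):
        results.append(("+", s + 1, e))
    for s, e in orfs_on_strand(reverse_complement(seq), min_len):
        # Map reverse-strand positions back onto the forward strand.
        results.append(("-", n - e + 1, n - s))
    return sorted(results, key=lambda r: r[2] - r[1], reverse=True)


if __name__ == "__main__":
    toy = "ATG" + "AAA" * 40 + "TAA"  # a 126 bp toy gene
    for strand, left, right in find_orfs(toy):
        print(strand, left, right, right - left + 1)
```

For a reverse-strand ORF, the right-hand coordinate is where the start codon begins and the left-hand coordinate is the last base of the stop codon, which is why you read those frames from right to left.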

To the Open Reading Frame page!

 

Step 2. Retrieving the Gene Sequence

You now want to get the DNA sequence for each of the potential genes you found in the previous step. You want the sequence from the first base of the start codon to the last base of the stop codon. Enter the lower (left-hand) coordinate in the first box and the higher (right-hand) coordinate in the second box. If the gene is on the reverse strand, you will have to reverse-complement it (Step 2A).
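The retrieval itself is just a substring operation. Here is a minimal Python sketch, assuming the coordinates are 1-based and inclusive at both ends (the usual convention for sequence coordinates, but an assumption about this page's form). The name extract_gene is hypothetical.

```python
def extract_gene(genome, left, right):
    """Return bases left..right of genome (1-based, both ends included)."""
    if not 1 <= left <= right <= len(genome):
        raise ValueError("coordinates must satisfy 1 <= left <= right <= length")
    # Python slices are 0-based and exclude the end, hence left - 1.
    return genome[left - 1:right]


if __name__ == "__main__":
    print(extract_gene("AACCGGTT", 3, 6))
```

A quick sanity check on any retrieved gene: its length (right - left + 1) should be a multiple of 3.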

Paste your sequences into your lab notebook file!

Left hand coordinate:        Right hand coordinate

 

Step 2A. Reverse-Complement Sequence

If your gene was on the reverse strand, you need to turn it around so that the gene's beginning (5' end) is at the beginning of your sequence. Also, you need to complement the bases: convert A to T, G to C, etc. Paste your sequence in the window below to do this.
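For reference, the same operation takes only a few lines of Python. This is a sketch for plain A, C, G, T sequences, not the code behind the window on this page. Any other character (such as N) is passed through unchanged.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base (A<->T, G<->C), then reverse the order."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("TTAGGCCAT"))
```

After this step a correctly chosen reverse-strand gene should begin with its start codon and end with its stop codon, just like a forward-strand gene.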

 

Step 3. Translate into Amino Acids

Each codon (group of 3 bases) needs to be translated into a single amino acid, using the genetic code. The code is degenerate: several different codons code for the same amino acid (in most cases). The amino acids are given in the one-letter code system (which is shown in the genetic code table).
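As an illustration of what the translation does, here is a minimal Python sketch using the standard genetic code. It is not the program on this page. The table is built from the conventional TCAG ordering of the code table; unrecognized codons are shown as X, which is a choice made for this example. Note that the table is applied literally to the first codon, whereas in bacteria an initiating GTG or TTG is actually read as methionine; this sketch does not handle that case.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate codon by codon from the first base, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAATAA"))
```

The degeneracy mentioned above is visible in the table: for example, GCT, GCC, GCA and GCG all map to A (alanine).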

This program requires that:

Nucleotide sequence to translate:

 

Step 4. BLAST Search

Below is a link to Uniprot, which contains an up-to-date set of all known protein sequences. This service is used by a lot of people, so please don't abuse it. Click on the BLAST tab, then paste your sequence into the box and hit the BLAST button. It takes a bit (usually less than a minute) to get the results, so be patient.

The results appear with the best hits on top. We want to pay attention to the best hit, and also to the top 5 hits. Questions to answer for each of these hits:

Uniprot link

Scoring

  1. What is the e-value? Is it better than (less than) 1e-20?
  2. What is the length of the hit? Is it within 15% of the query protein's length?
  3. What is the Identity percentage? Is it at least 35%?
  4. Click on the "Local alignment" graphic and examine the alignment, looking for gaps. Are all the gaps small (less than 20 amino acids)? Note that very good hits may not have any gaps.

If the answers to all of the above questions are yes for the best hit, we can be confident that our gene is homologous to it. If most of the answers are no, this gene is probably not a homologue. In between is a gray area, which means that we would put the word "putative" in front of the gene name.
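The four scoring questions can be written down as a small decision rule. The Python sketch below is one reading of the criteria, not an official scoring scheme: the function and argument names are hypothetical, and "most of the answers are no" is interpreted here as at most one yes out of four. You enter the numbers by hand from the BLAST results page.

```python
def score_hit(evalue, hit_length, query_length, identity_pct, largest_gap):
    """Apply the four scoring questions to one BLAST hit.

    largest_gap is the longest gap in the alignment, in amino acids
    (0 if the alignment has no gaps).
    """
    answers = [
        evalue < 1e-20,                                         # question 1
        abs(hit_length - query_length) <= 0.15 * query_length,  # question 2
        identity_pct >= 35.0,                                   # question 3
        largest_gap < 20,                                       # question 4
    ]
    yes = sum(answers)
    if yes == len(answers):
        return "homologous"
    if yes <= 1:  # assumption: "most answers are no" means 0 or 1 yes
        return "probably not homologous"
    return "putative"


if __name__ == "__main__":
    print(score_hit(1e-50, 300, 310, 62.0, 3))
    print(score_hit(1e-30, 300, 310, 30.0, 25))
```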

Gene Names

  1. Do the names of the top hits all match? Note that slight name variations are common.
  2. Do the names contain "weasel words" like putative or hypothetical?

If the top hits are all high scoring and their names match, we can confidently assign this name to the gene.

Closest Species

  1. What organism does each of the top hits come from?
  2. Are all of these species from the Bacillus genus, or from a closely related genus that has "bacillus" as part of its name?
  3. If the answer to the above question is "no", click on the Accession number and look at the Taxonomic lineage. What is it?
  4. B. megaterium's lineage is: Bacteria > Firmicutes > Bacillales > Bacillaceae > Bacillus. How far back do the lineages of the top hits match this?

Bacteria is the domain, Firmicutes is the phylum, Bacillales is the order, Bacillaceae is the family, and Bacillus is the genus. Any gene whose top hits aren't at least from the Firmicutes is almost certainly an example of horizontal gene transfer. Any gene whose top hits are from the Bacillaceae family is almost certainly an example of vertical gene transfer. In between is a gray area.
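Comparing lineages amounts to counting how many ranks match from the left. The Python sketch below assumes you have typed in each hit's lineage at the same five ranks listed above (domain, phylum, order, family, genus). Lineages copied straight from a database entry may include extra ranks and would need trimming first. The function names and the returned labels are made up for this example.

```python
BMEG_LINEAGE = ["Bacteria", "Firmicutes", "Bacillales", "Bacillaceae", "Bacillus"]


def shared_depth(hit_lineage, reference=BMEG_LINEAGE):
    """Count how many leading ranks the hit shares with the reference."""
    depth = 0
    for hit_rank, ref_rank in zip(hit_lineage, reference):
        if hit_rank != ref_rank:
            break
        depth += 1
    return depth


def classify(hit_lineage):
    """Rough call for one hit, following the rule of thumb above."""
    depth = shared_depth(hit_lineage)
    if depth < 2:   # not even in the Firmicutes
        return "likely horizontal transfer"
    if depth >= 4:  # within the Bacillaceae
        return "likely vertical transfer"
    return "gray area"


if __name__ == "__main__":
    hit = ["Bacteria", "Firmicutes", "Bacillales", "Paenibacillaceae", "Paenibacillus"]
    print(shared_depth(hit), classify(hit))
```

Run this on each of the top hits, not just the best one; the rule of thumb applies to the top hits as a group.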

Start site (if you have time)

  1. Look again at the "Local alignment" for each of the top hits. Where does the matching region start in the query sequence?
  2. Are the starting positions of the hits consistent with the start site you chose back in Step 1? Or would a start site further downstream better explain the data?

Usually, the protein sequence is much better conserved across species lines than the sequence just upstream. However, the beginning of a protein is often the least well conserved area, so choosing start sites on the basis of sequence conservation works best with very similar sequences.