Spring 2012 Vertebrate Genome Biology Lab

= BLAST =
Basic Local Alignment Search Tool

Sequence File Formats

 * Most common for nucleotides: FASTA / Multi-FASTA
 * Title line: ">" followed by free text; the entire line is read as the sequence title
 * Line break, then the continuous 5’-3’ nucleotide sequence, or a protein sequence in 1-letter codes (the sequence may wrap across several lines)

Example:
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATGCGATGACCTGATCACAAATGCGATGAC
CTGATCACAAATGCGATGTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATCTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATTAA
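To make the format rules concrete, here is a minimal Python sketch of a FASTA / Multi-FASTA reader. The function name `parse_fasta` and the two records in `example` are hypothetical, made up for illustration; a maintained library parser is the better choice for real data.

```python
def parse_fasta(text):
    """Parse FASTA / Multi-FASTA text into a list of (title, sequence) tuples."""
    records = []
    title = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Sequence lines are joined, so wrapped sequences become one string.
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Hypothetical two-record Multi-FASTA input, for illustration only.
example = """>seq1 first hypothetical record
ATGGACCTGATC
ACAAATGCGATT
>seq2 second hypothetical record
TTGACC
"""

for title, seq in parse_fasta(example):
    print(title, len(seq))
```

Each ">" line starts a new record, and every line up to the next ">" is treated as part of the same sequence.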

NCBI BLAST

 * Run by the National Center for Biotechnology Information
 * BLAST uses a heuristic that approximates the Smith-Waterman local alignment algorithm, trading some sensitivity for speed
 * The algorithm searches the database for short words ("seeds") taken from the query (default word size 11 for nucleotide searches); when it detects a match, it compares the nucleotides on each side of the seed to extend the match
 * Gaps are then taken into account, and the matches are presented in order of statistical significance
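The seed-and-extend idea can be sketched in a few lines of Python. This is a simplified illustration and not NCBI's implementation: it extends only while bases match exactly, whereas real BLAST scores mismatches and gaps and stops when the score drops too far. The function names and the two toy sequences are assumptions made for the example.

```python
def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a word of word_size matches exactly."""
    # Index every word of the subject by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=11):
    """Extend a seed in both directions while bases are identical (no gaps, no mismatches)."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + word_size, j + word_size
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, end_q, start_s, end_s


# Toy sequences sharing the segment GATTACAGGCTCA; a small word size keeps the example short.
query = "TTTGATTACAGGCTCATT"
subject = "CCCGATTACAGGCTCACC"

seeds = find_seeds(query, subject, word_size=5)
q_start, q_end, s_start, s_end = extend_seed(query, subject, *seeds[0], word_size=5)
print(query[q_start:q_end])  # prints GATTACAGGCTCA
```

Indexing the subject words first is what makes the seed search fast: each query word is looked up directly instead of being compared against every position in the database.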

http://www.ncbi.nlm.nih.gov/BLAST/

Nucleotide-nucleotide BLAST (BLASTN)

 * Basic nucleotide sequence searches

Protein-protein BLAST (BLASTP)

 * Similar technology used to search amino acid sequences

Position-Specific Iterative BLAST (PSI-BLAST)

 * A more sensitive protein BLAST that builds a position-specific scoring matrix from the matches found in each round; useful for analyzing relationships between divergently evolved proteins

MegaBLAST

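 * Optimized for comparing highly similar nucleotide sequences; it uses a larger word size than BLASTN, so it runs faster
 * Handles batches of several query sequences at once, which cuts down on processing load and server reporting time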

Max/Total Score
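Calculated from the matches, mismatches, and gaps in the alignment. Max score is the score of the best single aligned segment; total score is the sum over all aligned segments from the same database entry. Higher relative to your query length is better.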

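E Value: E = K*m*n*e^(-lambda*S)
Here m is the query length, n is the total length of the database, S is the alignment score, and K and lambda are constants that depend on the scoring system.
Translation: the E value is the number of matches scoring at least S that you would expect to find by random chance in a database of this size.

e.g., E = 1e-6 (that is, 10^-6) means that about 0.000001 chance matches of this quality are expected per search, or roughly one in every 1,000,000 such searches.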

Smaller E values are better

As a rule of thumb, matches with E values larger than 1e-5 are more likely to be due to chance and should be treated with caution
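As a worked illustration, the E value can be computed directly from the Karlin-Altschul form E = K*m*n*e^(-lambda*S). The sketch below is a minimal Python version; the function name `expected_hits` is hypothetical, and the values passed for K and lambda are placeholders chosen for the example, since the real constants depend on the scoring system in use.

```python
import math


def expected_hits(score, query_len, db_len, k, lam):
    """Expected number of chance matches scoring at least `score` (E = K*m*n*e^(-lambda*S))."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder constants for illustration only; they are not taken from any specific BLAST run.
K = 0.7
LAMBDA = 1.3

for s in (20, 30, 40):
    e = expected_hits(s, query_len=100, db_len=1_000_000, k=K, lam=LAMBDA)
    print(f"S={s}: E={e:.3g}")
```

Two properties of the formula are visible here: E falls exponentially as the score rises, and it grows in proportion to the query length and the database size, so the same alignment is less significant in a larger database.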

Query Coverage
The percent of the query sequence matched by the database entry

Max Ident
The percent identity, i.e., the percent of aligned positions at which the query and the database entry are identical, measured over the length of the alignment (mismatches and gaps from deletions or insertions reduce this value).
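The two percentages can be illustrated with a small Python sketch that works on one gapped alignment. It is a simplification under stated assumptions: gaps are written as "-", the function name `alignment_stats` and the toy alignment are hypothetical, and only a single aligned segment is considered, whereas the BLAST report may combine several segments for one database entry.

```python
def alignment_stats(aligned_query, aligned_subject, query_len):
    """Return (percent_identity, query_coverage) for one gapped pairwise alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    # Identical columns: same character in both rows, and not a gap.
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    percent_identity = 100.0 * identical / len(aligned_query)
    # Query residues that take part in the alignment (gap columns excluded).
    query_bases = sum(1 for q in aligned_query if q != "-")
    query_coverage = 100.0 * query_bases / query_len
    return percent_identity, query_coverage


# Toy alignment: one gap in the query row and one mismatch; the full query is 12 bases long.
identity, coverage = alignment_stats("ACGT-ACGTA", "ACGTTACCTA", query_len=12)
print(identity, coverage)  # prints 80.0 75.0
```

In this example 8 of the 10 alignment columns are identical, and 9 of the 12 query bases are included in the alignment.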

Word Size
BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments. For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied so one normally uses just the word-sizes 2 and 3 for these searches.
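The trade-off between word size and sensitivity can be seen by counting exact word matches between two short sequences at several word sizes. This is only a Python sketch of the nucleotide (exact-match) case; the function name `count_word_hits` and the two toy sequences, which differ at a single position, are assumptions made for the example.

```python
def count_word_hits(query, subject, word_size):
    """Count exact word matches ("hot-spots") between query and subject."""
    # Count how often each word occurs in the subject.
    subject_words = {}
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        subject_words[word] = subject_words.get(word, 0) + 1
    # Sum the subject occurrences of every query word.
    return sum(
        subject_words.get(query[i:i + word_size], 0)
        for i in range(len(query) - word_size + 1)
    )


# Toy sequences that differ only at the fifth base.
query = "ACGTTGCAAC"
subject = "ACGTAGCAAC"

for w in (3, 4, 6):
    print(w, count_word_hits(query, subject, w))
```

With these sequences the counts are 5, 3, and 0 for word sizes 3, 4, and 6: the largest word size finds no hot-spot at all, so no extension would be started, while the smaller sizes find more seeds at the cost of more work.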

More statistics
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Some BLAST Help

 * http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml
 * http://www.ncbi.nlm.nih.gov/books/NBK1734/

= Sequence Aligning Software =
 * ClustalX - Software
 * ClustalW - Web
 * DNAStar

These are functionally similar but differ in interface, tools, and algorithm speed.

= UCSC Genome Browser and BLAT =
http://genome.ucsc.edu/

= DNAStar =

Video Tutorials
http://www.dnastar.com/t-support-videos.aspx