DNA Sequence Alignment Project

Description of Project

For this project, we will be studying and comparing techniques implemented to align protein sequences and perform similarity comparisons. For this study we will be concentrating on DNA sequences, since they present the same issues as generic protein sequences, but have data sets which are more readily available. We plan to focus our studies on database applications, though a large portion of the study will be focused on the alignment of any two DNA sequences independent of a database. As part of our studies we hope to analyze the following:

Global alignment vs local alignment
Gapped alignment vs ungapped alignment
Techniques for improving comparison time by storing precomputations
Metrics used:

The ideas behind some of the more common metrics
The effect a metric may have on the search results

A general survey of the DNA sequence alignment databases currently available on the Web for researchers to use

This is a very ambitious effort for which whole graduate courses are devoted to. It is our intent to narrow the focus as we continue to study the topic. Otherwise we will not be able to finish the project before the end of the quarter, and the material we gather will take too long to present to the class.

Description of Datasets

Currently there are several databases actively on-line which provide DNA string comparisons against their datasets. To the best extent possible we will be using the datasets that these data bases have. In addition there have been a few attempts at making small benchmarking datasets which we may also use. Where neither of these datasets will achieve our goal, we will create our own datasets which may include a combination of specifically tailored DNA sequences and randomly generated sequences.

Measure of Success of the Project

A successful project will be indicated by a paper which presents the problem we studied and details the analysis of at least 3 of the areas above (or new areas that have been discovered as we continue to study the topic). In addition to the paper a successful project should include a presentation to the class.

An excellent project would demonstrate that we have studied the topic and developed an understanding of the material. Our write-up and presentation may be used to demonstrate this and should include calculated and/or empirical analysis of areas studied.

A satisfactory project would still include the study of three areas, though it would include a cursory and purely qualitative comparison.

Timetable

At this point depending on what we find in the readings, we have two possible timetables.

Week 1 Cursory review of the reading materials
Week 2 Narrow focus of the project
Week 3 Detailed analysis of first topic
Week 4 Detailed analysis of second topic
Week 5 Detailed analysis of third topic
Week 6 Write-up
Week 7 Presentation

or possibly

Week 1 Cursory review of the reading materials
Week 2 Narrow focus of the project and distribute responsibilities
Week 3 Individual detailed analysis of assigned topic
Week 4 Continued individual analysis of topics
Week 5 Integration of topics studied
Week 6 Write-up
Week 7 Presentation

The following is a list of papers and other material that we are currently reviewing:

Torbjorn Rognes and Erling Seeberg
SALSA: Improved Protein Database Searching by a New Algorithm for Assembly of Sequence Fragments into Gapped Alignments Bioinformatics, Vol. 14 no. 10, Pages 839-845, 1998

Jean-Stephen Varre, Jean-Paul Delahaye, and Eric Rivals
Transformation Distances: A Family of Dissimilarity Measures Based on Movements of Segments
Bioinformatics, Vol. 15 no. 3, Pages 194-202, 1999

Burkhard Morgenstern, Kornelie Frech, Andreas Dress, and Thomas Werner
DIALIGN: Finding Local Similarities by Multiple Sequence Alignment
Bioinformatics, Vol. 14 no. 3, Pages 290-294, 1998

C. Miller, J. Gurd, and A. Brass
A RAPID Algorithm for Sequence Database Comparisons: Application to the Identification of Vector Contamination in EMBL Databases
Bioinformatics, Vol. 15 no. 2, Pages 111-121, 1999

Udo Tonges, Soren W. Perrey, Jens Stoye, and Andreas W.M. Dress
A General Method for Fast Multiple Sequence Alignment
Gene 172, 33-44, 1996

Paul Hardy and Michael S. Waterman
The Sequence Alignment Software Library at USC
Version 2.0, March 1997
http://www-hto.usc.edu/seqaln

K. Quandt
Software for the Analysis of DNA Sequence Elements of Transcription
January 1997
http://www.gsf.de/biodv/review/

Thomas Werner
The Biology and Bioinformatics of Regulatory Regions in Genomes
ftp://ariane.gsf.de/pub/ISMB99/tutorial_handout.pdf

Dr. David Mount and Dr. Samuel Ward
MCB 416/516 BIOINFORMATICS AND GENOMIC ANALYSIS
Class Notes, Spring 1998
http://www.blc.arizona.edu/courses/wardbioinf/

L. Allison, D.Powell & T.I.Dix,
Compression and Approximate Matching,
The Computer Journal, Volume 42, Issue 1, Pages 1-10, 1999

L. Allison,C.S. Wallace and C.N.Yee
Inductive Inference over Macro-Molecules
TechnicalReport 90/148, Department of Computer Science, Monash University, Austrailia 3168

L. Allison
Information-Theoretic Sequence Alignment
Technical Report 98/14, School of Computer Science and Software Engineering, Monash University,Australia 3168

Dennis A. Benson, Mark S. Boguski, David J. Lipman, James Ostell and B.F. Francis Ouellette
GenBank
Nucleic Acids Research, Vol 26, Issue 1.

C.Harger, M. Skupski, J.Bingham, A. Farmer, S. Hoisie, P. Hraber, D. Kiphart, L. Krakowski, M. McLeod, J. Schwertfeger, G. Seluja, A. Siepel, G. Singh, D. Stamper, P.Steadman, N. Thayer, R. Thompson, P. Wargo, M. Waugh, J.J. Zhuang and P.A. Schad
The Genome Sequence DataBase (GSDB): Improving Data Quality and Data Access
Nucleic Acids Research, Vol 26, Issue 1.

Gene Myers
A Fast Bit-Vector Algoithm for Approximate String Matching Based on Dynamic Programming
Ninth Combinatorial Pattern Matching Conference
accepted for publication

Eugene W. Myers, Paulo Oliva and Katia Guimaraes
Reporting Exact and Approximate Regular Expression Matches
Ninth Combinatorial Pattern Matching Conference
accepted for publication.

Eric L. Anson and Eugene W. Myers
ReAligner: A Program for Refinging DNA Sequence Multi-Alignments
Journal of Computational Biology 4, 3 (1997), 369-383.

Gerhard Mehldau and Gene Myers
A System for Pattern Matching Applications on Biosequences
CABIOS 9, 3 (1993), 299-314

David Loewenstern, Haym Hirsh, Peter Yianilos and Michiel Noordewier
DNA Sequence Classification Using Compression-Based Induction
DIMACS Technical Report 95-04, Department of Computer Science,
Rutgers University, New Brunswick, NJ 08903, USA

Burkhard Morgenstern, Angreas Dress and Thomas Werner
Multiple DNA and Protein Sequence Alignment based on Segment-to-Segment Comparison
Proc. Natl. Acad.Sci. USA, Vol. 93, pp. 12098-12103, October 1996
Applied Mathematics

Felix Wu and Guy Blelloch
Algorithms in the Real World
Lecture #20 (Pattern Matching 2)

Steven E. Brenner, Cyrus Chothia and Tim J. Hubbard
Assessing Sequence Comparison Methods with Reliable Structurally Identified Distant Evolutionary Relationships
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 6073-6078, May 1998
Biochemistry

Tatiana A. Tatusova and Thomas L. Madden
BLAST 2 Sequences, a new Tool for Comparing Protein and Nucleotide Sequences
FEMS Microbiology Letters 174 (1999) 247-250

R.Irie,S. Hiraoka,N. Kasahara and K. Nagai
DNA Sequence Comparison Considering Both Amino Acid and Nucleotide Insertions/Deletions Because of Evolution and Experimental Error
Journal of Biotechnology 69 (1999) 19-26

John C. Wooton
Evaluating the Effectiveness of Sequence Analysis Algorithms Using Measure of Relevant Information
Computers Chem. Vol. 21, No. 4, pp. 191-202, 1997

Thomas D. Schneider and David N. Mastronarde
Fast Multiple Alignment of Ungapped DNA Sequences using Information Theory and a Relaxation Method
Discrete Applied Mathematics 71 (1996) 259-268

Ron Shamir
Pairwise Alignment
Algorithms for Molecular Biology, Lecture 2, 1998

Ron Shamir
Introduction - Sequence Alignment Heuristics
Algorithms for Molecular Biology, Lecture 3, 1998

Ron Shamir
Approximatin Algorithms for Multiple Sequence Alignment
Algorithms for Molecular Biology, Lecture 5, 1999