Combinatorial algorithms for DNA sequence assembly

John D Kececioglu, E. W. Myers

Research output: Contribution to journal > Article

128 Citations (Scopus)

Abstract

The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates, and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates of as high as 10%.
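To make the underlying formulation concrete, the following is a minimal sketch of the classic greedy heuristic for the plain shortest common superstring problem that the abstract takes as its starting point. It is not the authors' four-phase method: it assumes error-free fragments and exact overlaps, and it does not choose fragment orientations. The function names (`reverse_complement`, `overlap`, `greedy_superstring`) and the toy fragments are hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Translation table for complementing bases (assumes the alphabet A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Repeatedly merge the ordered pair with the largest exact overlap."""
    unique = set(fragments)
    # Drop fragments that are contained in another fragment.
    frags = sorted(f for f in unique
                   if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(frags)), 2):
            k = overlap(frags[i], frags[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        # Merge the best pair, writing the shared overlap only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GGATTC"]  # toy fragments, not real data
    print(greedy_superstring(reads))
    print(reverse_complement("AAGC"))
```

In the setting the abstract describes, each fragment may come from either strand, so an assembler also has to decide whether to use a fragment or its `reverse_complement`, and overlaps must be scored approximately because of sequencing errors. The sketch leaves out both complications.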

Original language: English (US)
Pages (from-to): 7-51
Number of pages: 45
Journal: Algorithmica
Volume: 13
Issue number: 1-2
DOIs: 10.1007/BF01188580
State: Published - Feb 1995
Externally published: Yes

Fingerprint

Combinatorial Algorithms
DNA sequences
DNA Sequence
Sequencing
Superstring
Error Rate
Fragment
DNA Sequencing
Multiple Sequence Alignment
Alternate
Reverse
Genome
Complement
NP-complete problem
Heuristics
Computational complexity
DNA
Series
Formulation
Genes

Keywords

  • Approximation algorithms
  • Branch-and-bound algorithms
  • Computational biology
  • Fragment assembly
  • Sequence reconstruction

ASJC Scopus subject areas

  • Applied Mathematics
  • Safety, Risk, Reliability and Quality
  • Software
  • Computer Graphics and Computer-Aided Design
  • Computer Science Applications
  • Computer Science (all)

Cite this

Combinatorial algorithms for DNA sequence assembly. / Kececioglu, John D; Myers, E. W.

In: Algorithmica, Vol. 13, No. 1-2, 02.1995, p. 7-51.

@article{f953028eb3474d94a8239e3ef9857a35,
title = "Combinatorial algorithms for DNA sequence assembly",
abstract = "The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates, and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates of as high as 10{\%}.",
keywords = "Approximation algorithms, Branch-and-bound algorithms, Computational biology, Fragment assembly, Sequence reconstruction",
author = "Kececioglu, {John D} and Myers, {E. W.}",
year = "1995",
month = "2",
doi = "10.1007/BF01188580",
language = "English (US)",
volume = "13",
pages = "7--51",
journal = "Algorithmica",
issn = "0178-4617",
publisher = "Springer New York",
number = "1-2",
}

TY - JOUR
T1 - Combinatorial algorithms for DNA sequence assembly
AU - Kececioglu, John D
AU - Myers, E. W.
PY - 1995/2
Y1 - 1995/2

N2 - The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates, and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates of as high as 10%.

KW - Approximation algorithms
KW - Branch-and-bound algorithms
KW - Computational biology
KW - Fragment assembly
KW - Sequence reconstruction
UR - http://www.scopus.com/inward/record.url?scp=0029197192&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0029197192&partnerID=8YFLogxK
U2 - 10.1007/BF01188580
DO - 10.1007/BF01188580
M3 - Article
VL - 13
SP - 7
EP - 51
JO - Algorithmica
JF - Algorithmica
SN - 0178-4617
IS - 1-2
ER -