Separating repeats in DNA sequence assembly

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression, by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear programming. Experiments with an implementation show we can separate fragments that are overlaid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge, this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
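
To make the core subproblem concrete, here is a minimal sketch of the k-median objective applied to overlaid fragments. It is not the authors' method: the abstract says they solve k-median with integer linear programming, whereas this sketch uses exhaustive search, which is only practical for tiny inputs. The Hamming distance, the function names, and the toy fragments are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_median(points, k, dist):
    """Exhaustive k-median: pick k of the points as medians so that the
    total distance from every point to its nearest median is minimal.
    (Assumption: brute force stands in for the paper's ILP solver.)"""
    best_cost, best_medians = None, None
    for medians in combinations(range(len(points)), k):
        cost = sum(min(dist(p, points[m]) for m in medians) for p in points)
        if best_cost is None or cost < best_cost:
            best_cost, best_medians = cost, medians
    # Assign each point to its nearest chosen median.
    clusters = {m: [] for m in best_medians}
    for i, p in enumerate(points):
        nearest = min(best_medians, key=lambda m: dist(p, points[m]))
        clusters[nearest].append(i)
    return best_cost, clusters


# Hypothetical toy data: six aligned fragments sampled from two repeat
# copies that differ at a few columns, plus a couple of sequencing errors.
fragments = [
    "ACGTACGT",
    "ACGTACGA",
    "ACGTACGT",
    "ACCTACTT",
    "ACCTACTT",
    "ACCTAGTT",
]

cost, clusters = k_median(fragments, 2, hamming)
print(cost, clusters)
```

Each cluster groups the indices of fragments assigned to the same median, which plays the role of one repeat copy. In this sketch the number of copies k is given, while the abstract states that the general setting does not assume it is known in advance.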

Original language: English (US)
Title of host publication: Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB
Editors: T. Lengauer, D. Sankoff, S. Istrail, P. Pevzner, M. Waterman
Pages: 176-183
Number of pages: 8
State: Published - 2001
Event: 5th Annual International Conference on Computational Biology - Montreal, Que., Canada
Duration: May 22 2001 to May 26 2001

Fingerprint

Linear Programming
DNA sequences
Consensus Sequence
Combinatorial optimization
Experiments

Keywords

  • Computational biology
  • Disambiguating repeats
  • k-median problem
  • Shotgun sequencing

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Computer Science(all)

Cite this

Kececioglu, J. D., & Yu, J. (2001). Separating repeats in DNA sequence assembly. In T. Lengauer, D. Sankoff, S. Istrail, P. Pevzner, & M. Waterman (Eds.), Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB (pp. 176-183)

Separating repeats in DNA sequence assembly. / Kececioglu, John D; Yu, J.

Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB. ed. / T. Lengauer; D. Sankoff; S. Istrail; P. Pevzner; M. Waterman. 2001. p. 176-183.

Kececioglu, JD & Yu, J 2001, Separating repeats in DNA sequence assembly. in T Lengauer, D Sankoff, S Istrail, P Pevzner & M Waterman (eds), Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB. pp. 176-183, 5th Annual International Conference on Computational Biology, Montreal, Que., Canada, 5/22/01.
Kececioglu JD, Yu J. Separating repeats in DNA sequence assembly. In Lengauer T, Sankoff D, Istrail S, Pevzner P, Waterman M, editors, Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB. 2001. p. 176-183
Kececioglu, John D ; Yu, J. / Separating repeats in DNA sequence assembly. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB. editor / T. Lengauer ; D. Sankoff ; S. Istrail ; P. Pevzner ; M. Waterman. 2001. pp. 176-183
@inproceedings{0d0f3981d7fc40479259ea885f38cfae,
title = "Separating repeats in DNA sequence assembly",
abstract = "One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression, by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear programming. Experiments with an implementation show we can separate fragments that are overlaid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge, this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.",
keywords = "Computational biology, Disambiguating repeats, k-median problem, Shotgun sequencing",
author = "Kececioglu, John D. and Yu, J.",
year = "2001",
language = "English (US)",
pages = "176--183",
editor = "T. Lengauer and D. Sankoff and S. Istrail and P. Pevzner and M. Waterman",
booktitle = "Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB",

}

TY - GEN

T1 - Separating repeats in DNA sequence assembly

AU - Kececioglu, John D

AU - Yu, J.

PY - 2001

Y1 - 2001

AB - One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression, by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear programming. Experiments with an implementation show we can separate fragments that are overlaid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge, this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.

KW - Computational biology

KW - Disambiguating repeats

KW - k-median problem

KW - Shotgun sequencing

UR - http://www.scopus.com/inward/record.url?scp=0034819969&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0034819969&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:0034819969

SP - 176

EP - 183

BT - Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB

A2 - Lengauer, T.

A2 - Sankoff, D.

A2 - Istrail, S.

A2 - Pevzner, P.

A2 - Waterman, M.

ER -