Aligning protein sequences with predicted secondary structure

John D Kececioglu, Eagu Kim, Travis Wheeler

Research output: Contribution to journal, Article

9 Citations (Scopus)

Abstract

Accurately aligning distant protein sequences is notoriously difficult. Since the amino acid sequence alone often does not provide enough information to obtain accurate alignments under the standard alignment scoring functions, a recent approach to improving alignment accuracy is to use additional information such as secondary structure. We make several advances in alignment of protein sequences annotated with predicted secondary structure: (1) more accurate models for scoring alignments, (2) efficient algorithms for optimal alignment under these models, and (3) improved learning criteria for setting model parameters through inverse alignment, as well as (4) in-depth experiments evaluating model variants on benchmark alignments. More specifically, the new models use secondary structure predictions and their confidences to modify the scoring of both substitutions and gaps. All models have efficient algorithms for optimal pairwise alignment that run in near-quadratic time. These models have many parameters, which are rigorously learned using inverse alignment under a new criterion that carefully balances score error and recovery error. We then evaluate these models by studying how accurately an optimal alignment under the model recovers benchmark reference alignments that are based on the known three-dimensional structures of the proteins. The experiments show that these new models provide a significant boost in accuracy over the standard model for distant sequences. The improvement for pairwise alignment is as much as 15% for sequences with less than 25% identity, while for multiple alignment the improvement is more than 20% for difficult benchmarks whose accuracy under standard tools is at most 40%.
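As a rough illustration of the kind of scoring the abstract describes, the following is a minimal sketch of global pairwise alignment with affine gap penalties (the classic Gotoh recurrences), in which the substitution score gets a bonus when the predicted secondary-structure states of the two residues agree, scaled by the lower of the two prediction confidences. It is not the authors' model: the function name `align_score`, the toy match/mismatch scores, the `ss_bonus` weighting, and the gap costs are all assumptions made for this example, and only substitutions (not gaps) are modified by structure here. The sketch runs in plain quadratic time.

```python
from math import inf


def align_score(a, b, ss_a, ss_b, conf_a, conf_b,
                match=2.0, mismatch=-1.0, ss_bonus=1.0,
                gap_open=3.0, gap_extend=1.0):
    """Optimal global alignment score under affine gap penalties.

    a, b           : amino acid sequences (strings)
    ss_a, ss_b     : predicted secondary-structure state per residue (strings)
    conf_a, conf_b : prediction confidence per residue, each in [0, 1]
    A gap of length k costs gap_open + k * gap_extend.
    All scoring parameters are hypothetical placeholders.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[-inf] * (m + 1) for _ in range(n + 1)]
    X = [[-inf] * (m + 1) for _ in range(n + 1)]
    Y = [[-inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Sequence-based substitution score (toy values).
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Assumed structure term: reward agreeing predicted states,
            # weighted by the less confident of the two predictions.
            if ss_a[i - 1] == ss_b[j - 1]:
                s += ss_bonus * min(conf_a[i - 1], conf_b[j - 1])
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap or extend the current one.
            start = gap_open + gap_extend
            X[i][j] = max(M[i - 1][j] - start,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - start)
            Y[i][j] = max(M[i][j - 1] - start,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - start)
    return max(M[n][m], X[n][m], Y[n][m])
```

Calling `align_score("ACGT", "AT", "CCCC", "CC", [1.0] * 4, [1.0] * 2, ss_bonus=0.0)` scores the best alignment of the two toy sequences with the structure term switched off; raising `ss_bonus` shifts the optimum toward alignments that pair residues with matching predicted states.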

Original language: English (US)
Pages (from-to): 561-580
Number of pages: 20
Journal: Journal of Computational Biology
Volume: 17
Issue number: 3
DOI: 10.1089/cmb.2009.0222
State: Published - Mar 1 2010

Keywords

  • Affine gap penalties
  • Inverse parametric alignment
  • Protein secondary structure
  • Sequence alignment
  • Substitution score matrices

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Modeling and Simulation
  • Computational Theory and Mathematics

Cite this

Aligning protein sequences with predicted secondary structure. / Kececioglu, John D; Kim, Eagu; Wheeler, Travis.

In: Journal of Computational Biology, Vol. 17, No. 3, 01.03.2010, p. 561-580.

Kececioglu, John D ; Kim, Eagu ; Wheeler, Travis. / Aligning protein sequences with predicted secondary structure. In: Journal of Computational Biology. 2010 ; Vol. 17, No. 3. pp. 561-580.
@article{604ebef5fc6b4d3581a708b03c60949e,
title = "Aligning protein sequences with predicted secondary structure",
abstract = "Accurately aligning distant protein sequences is notoriously difficult. Since the amino acid sequence alone often does not provide enough information to obtain accurate alignments under the standard alignment scoring functions, a recent approach to improving alignment accuracy is to use additional information such as secondary structure. We make several advances in alignment of protein sequences annotated with predicted secondary structure: (1) more accurate models for scoring alignments, (2) efficient algorithms for optimal alignment under these models, and (3) improved learning criteria for setting model parameters through inverse alignment, as well as (4) in-depth experiments evaluating model variants on benchmark alignments. More specifically, the new models use secondary structure predictions and their confidences to modify the scoring of both substitutions and gaps. All models have efficient algorithms for optimal pairwise alignment that run in near-quadratic time. These models have many parameters, which are rigorously learned using inverse alignment under a new criterion that carefully balances score error and recovery error. We then evaluate these models by studying how accurately an optimal alignment under the model recovers benchmark reference alignments that are based on the known three-dimensional structures of the proteins. The experiments show that these new models provide a significant boost in accuracy over the standard model for distant sequences. The improvement for pairwise alignment is as much as 15{\%} for sequences with less than 25{\%} identity, while for multiple alignment the improvement is more than 20{\%} for difficult benchmarks whose accuracy under standard tools is at most 40{\%}.",
keywords = "Affine gap penalties, Inverse parametric alignment, Protein secondary structure, Sequence alignment, Substitution score matrices",
author = "Kececioglu, John D. and Kim, Eagu and Wheeler, Travis",
year = "2010",
month = "3",
day = "1",
doi = "10.1089/cmb.2009.0222",
language = "English (US)",
volume = "17",
pages = "561--580",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "3",

}

TY - JOUR

T1 - Aligning protein sequences with predicted secondary structure

AU - Kececioglu, John D

AU - Kim, Eagu

AU - Wheeler, Travis

PY - 2010/3/1

Y1 - 2010/3/1

AB - Accurately aligning distant protein sequences is notoriously difficult. Since the amino acid sequence alone often does not provide enough information to obtain accurate alignments under the standard alignment scoring functions, a recent approach to improving alignment accuracy is to use additional information such as secondary structure. We make several advances in alignment of protein sequences annotated with predicted secondary structure: (1) more accurate models for scoring alignments, (2) efficient algorithms for optimal alignment under these models, and (3) improved learning criteria for setting model parameters through inverse alignment, as well as (4) in-depth experiments evaluating model variants on benchmark alignments. More specifically, the new models use secondary structure predictions and their confidences to modify the scoring of both substitutions and gaps. All models have efficient algorithms for optimal pairwise alignment that run in near-quadratic time. These models have many parameters, which are rigorously learned using inverse alignment under a new criterion that carefully balances score error and recovery error. We then evaluate these models by studying how accurately an optimal alignment under the model recovers benchmark reference alignments that are based on the known three-dimensional structures of the proteins. The experiments show that these new models provide a significant boost in accuracy over the standard model for distant sequences. The improvement for pairwise alignment is as much as 15% for sequences with less than 25% identity, while for multiple alignment the improvement is more than 20% for difficult benchmarks whose accuracy under standard tools is at most 40%.

KW - Affine gap penalties

KW - Inverse parametric alignment

KW - Protein secondary structure

KW - Sequence alignment

KW - Substitution score matrices

UR - http://www.scopus.com/inward/record.url?scp=77950839478&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77950839478&partnerID=8YFLogxK

U2 - 10.1089/cmb.2009.0222

DO - 10.1089/cmb.2009.0222

M3 - Article

C2 - 20377464

AN - SCOPUS:77950839478

VL - 17

SP - 561

EP - 580

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 3

ER -