Core column prediction for protein multiple sequence alignments

Dan DeBlasio, John D Kececioglu

Research output: Contribution to journalArticle

Abstract

Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted. Results: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.

Original languageEnglish (US)
Article number11
JournalAlgorithms for Molecular Biology
Volume12
Issue number1
DOIs
StatePublished - Apr 19 2017

Fingerprint

Multiple Sequence Alignment
Sequence Alignment
Alignment
Proteins
Protein
Prediction
Benchmarking
Predictors
Linear Programming
Nearest Neighbor
Gold
Large-scale Linear Programming
Benchmark
Regression Function
Distance Function
Scoring
Superposition
Confidence
Substitution
Machine Learning

Keywords

  • Accuracy estimation
  • Alignment accuracy
  • Core blocks
  • Machine learning
  • Multiple sequence alignment
  • Parameter advising
  • Regression

ASJC Scopus subject areas

  • Structural Biology
  • Molecular Biology
  • Computational Theory and Mathematics
  • Applied Mathematics

Cite this

Core column prediction for protein multiple sequence alignments. / DeBlasio, Dan; Kececioglu, John D.

In: Algorithms for Molecular Biology, Vol. 12, No. 1, 11, 19.04.2017.

Research output: Contribution to journalArticle

@article{102fa8002eca4939afb5b936fb46b83e,
title = "Core column prediction for protein multiple sequence alignments",
abstract = "Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted. Results: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.",
keywords = "Accuracy estimation, Alignment accuracy, Core blocks, Machine learning, Multiple sequence alignment, Parameter advising, Regression",
author = "Dan DeBlasio and Kececioglu, {John D}",
year = "2017",
month = "4",
day = "19",
doi = "10.1186/s13015-017-0102-3",
language = "English (US)",
volume = "12",
journal = "Algorithms for Molecular Biology",
issn = "1748-7188",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Core column prediction for protein multiple sequence alignments

AU - DeBlasio, Dan

AU - Kececioglu, John D

PY - 2017/4/19

Y1 - 2017/4/19

N2 - Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted. Results: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.

AB - Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted. Results: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.

KW - Accuracy estimation

KW - Alignment accuracy

KW - Core blocks

KW - Machine learning

KW - Multiple sequence alignment

KW - Parameter advising

KW - Regression

UR - http://www.scopus.com/inward/record.url?scp=85018500015&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85018500015&partnerID=8YFLogxK

U2 - 10.1186/s13015-017-0102-3

DO - 10.1186/s13015-017-0102-3

M3 - Article

AN - SCOPUS:85018500015

VL - 12

JO - Algorithms for Molecular Biology

JF - Algorithms for Molecular Biology

SN - 1748-7188

IS - 1

M1 - 11

ER -