Predicting core columns of protein multiple sequence alignments for improved parameter advising

Dan Deblasio, John D Kececioglu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the folded three-dimensional structures of the proteins. When computing a protein multiple sequence alignment in practice, a reference alignment is not known, so its coreness can only be predicted. We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
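The definition of coreness in the abstract can be made concrete with a small sketch. It is an illustration only, not the authors' code: it assumes that a "substitution" in a column means a pair of residues from two different sequences aligned in that column, that alignments are given as equal-length gapped strings, and that the reference alignment comes with a set of column indices flagged as core. The function names and the choice to report 0.0 for columns with fewer than two residues are hypothetical.

```python
from itertools import combinations

GAP = "-"


def residue_columns(row):
    """For one aligned row, list the column index of each residue in order."""
    return [col for col, ch in enumerate(row) if ch != GAP]


def column_coreness(computed, reference, core_columns):
    """Return the coreness of every column of `computed`.

    Assumption: a substitution is a pair of residues aligned in a column,
    and it counts as core when the reference alignment places the same
    two residues together in a column flagged as core.
    """
    ref_col = [residue_columns(row) for row in reference]
    core = set(core_columns)
    next_residue = [0] * len(computed)  # next unread residue per sequence
    coreness = []
    for col in range(len(computed[0])):
        present = []
        for seq, row in enumerate(computed):
            if row[col] != GAP:
                present.append((seq, next_residue[seq]))
                next_residue[seq] += 1
        pairs = list(combinations(present, 2))
        if not pairs:
            # Assumed convention: a column with no residue pair has coreness 0.
            coreness.append(0.0)
            continue
        hits = sum(
            1
            for (s, i), (t, j) in pairs
            if ref_col[s][i] == ref_col[t][j] and ref_col[s][i] in core
        )
        coreness.append(hits / len(pairs))
    return coreness


if __name__ == "__main__":
    reference = ["AC-GT", "ACAGT", "A-AGT"]
    computed = ["ACG-T", "ACAGT", "A-AGT"]
    print(column_coreness(computed, reference, core_columns={0, 2, 3, 4}))
```

In practice the reference alignment is unavailable, which is why the paper predicts this quantity rather than computing it; the sketch only shows what value the predictor is trying to approximate.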

Original language: English (US)
Title of host publication: Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings
Publisher: Springer Verlag
Pages: 77-89
Number of pages: 13
Volume: 9838
ISBN (Print): 9783319436807
DOI: https://doi.org/10.1007/978-3-319-43681-4_7
State: Published - 2016
Event: 16th International Workshop on Algorithms in Bioinformatics, WABI 2016 - Aarhus, Denmark
Duration: Aug 22 2016 to Aug 24 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9838
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Fingerprint

Multiple Sequence Alignment
Alignment
Proteins
Protein
Predictors
Nearest Neighbor
Large-scale Linear Programming
Regression Function
Distance Function
Scoring
Gold
Superposition
Substitution
Machine Learning
Transform
Benchmark
Estimator
Linear programming
Predict
Three-dimensional

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Deblasio, D., & Kececioglu, J. D. (2016). Predicting core columns of protein multiple sequence alignments for improved parameter advising. In Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings (Vol. 9838, pp. 77-89). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9838). Springer Verlag. https://doi.org/10.1007/978-3-319-43681-4_7

Predicting core columns of protein multiple sequence alignments for improved parameter advising. / Deblasio, Dan; Kececioglu, John D.

Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings. Vol. 9838 Springer Verlag, 2016. p. 77-89 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9838).

Deblasio, D & Kececioglu, JD 2016, Predicting core columns of protein multiple sequence alignments for improved parameter advising. in Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings. vol. 9838, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9838, Springer Verlag, pp. 77-89, 16th International Workshop on Algorithms in Bioinformatics, WABI 2016, Aarhus, Denmark, 8/22/16. https://doi.org/10.1007/978-3-319-43681-4_7
Deblasio D, Kececioglu JD. Predicting core columns of protein multiple sequence alignments for improved parameter advising. In Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings. Vol. 9838. Springer Verlag. 2016. p. 77-89. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-319-43681-4_7
Deblasio, Dan ; Kececioglu, John D. / Predicting core columns of protein multiple sequence alignments for improved parameter advising. Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings. Vol. 9838 Springer Verlag, 2016. pp. 77-89 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{8615dd7929ac4246be202f256ab92f95,
title = "Predicting core columns of protein multiple sequence alignments for improved parameter advising",
abstract = "In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the folded three-dimensional structures of the proteins. When computing a protein multiple sequence alignment in practice, a reference alignment is not known, so its coreness can only be predicted. We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.",
author = "Deblasio, Dan and Kececioglu, {John D}",
year = "2016",
doi = "10.1007/978-3-319-43681-4_7",
language = "English (US)",
isbn = "9783319436807",
volume = "9838",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "77--89",
booktitle = "Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings",
address = "Germany",
}

TY  - GEN
T1  - Predicting core columns of protein multiple sequence alignments for improved parameter advising
AU  - Deblasio, Dan
AU  - Kececioglu, John D
PY  - 2016
Y1  - 2016
AB  - In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the folded three-dimensional structures of the proteins. When computing a protein multiple sequence alignment in practice, a reference alignment is not known, so its coreness can only be predicted. We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
UR  - http://www.scopus.com/inward/record.url?scp=84984999015&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84984999015&partnerID=8YFLogxK
U2  - 10.1007/978-3-319-43681-4_7
DO  - 10.1007/978-3-319-43681-4_7
M3  - Conference contribution
AN  - SCOPUS:84984999015
SN  - 9783319436807
VL  - 9838
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 77
EP  - 89
BT  - Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings
PB  - Springer Verlag
ER  -