Allele identification in assembled genomic sequence datasets

Katrina M Dlugosch, Aurélie Bonin

Research output: Chapter in Book/Report/Conference proceeding › Chapter

1 Citation (Scopus)

Abstract

Allelic variation within species provides fundamental insights into the evolution and ecology of organisms, and information about this variation is becoming increasingly available in sequence datasets of multiple and/or outbred individuals. Unfortunately, identifying true allelic variants poses a number of challenges, given the presence of both sequencing errors and alleles from other closely related loci. We outline the key considerations involved in this process, including assessing the accuracy of allele resolution in sequence assembly, clustering of alleles within and among individuals, and identifying clusters that are most likely to correspond to true allelic variants of a single locus. Our focus is particularly on the case where alleles must be identified without a fully resolved reference genome, and where sequence depth information cannot be used to infer the putative number of loci sharing a sequence, such as in transcriptome or post-assembly datasets. Throughout, we provide information about publicly available tools to aid allele identification in such cases.
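
As a rough illustration of the clustering step mentioned in the abstract, the sketch below groups sequences by single-linkage clustering on pairwise identity. It is not the chapter's method or AllelePipe's implementation. It assumes the sequences are already aligned to equal length, and the names `identity`, `single_linkage_clusters`, the 0.9 threshold, and the toy sequences are all hypothetical choices made for this example.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def single_linkage_clusters(seqs, min_identity):
    """Cluster sequences so that any pair at or above min_identity is joined.

    Single linkage means membership is transitive: A and C end up together
    if both are close enough to B, even when A and C are not close to each other.
    """
    parent = {name: name for name in seqs}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= min_identity:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


# Toy data (hypothetical): three similar sequences and one divergent sequence.
example_seqs = {
    "ind1_a": "ACGTACGTAC",
    "ind1_b": "ACGTACGTAT",
    "ind2_a": "ACGTACGAAT",
    "ind2_p": "TTGTCCGAGG",
}
example_clusters = single_linkage_clusters(example_seqs, min_identity=0.9)
```

In this toy input, `ind1_a` and `ind2_a` are only 80% identical but still share a cluster because each is 90% identical to `ind1_b`. This chaining is the defining property of single linkage, and it is also why the threshold affects whether closely related loci are merged with true alleles of a single locus.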

Original language: English (US)
Title of host publication: Methods in Molecular Biology
Publisher: Humana Press Inc.
Pages: 197-211
Number of pages: 15
Volume: 888
ISBN (Print): 9781617798696
DOI: https://doi.org/10.1007/978-1-61779-870-2_12
State: Published - 2012

Publication series

Name: Methods in Molecular Biology
Volume: 888
ISSN (Print): 1064-3745

Fingerprint

Alleles
Ecology
Transcriptome
Cluster Analysis
Datasets
Genome

Keywords

  • AllelePipe
  • Allelic variation
  • Gene duplication
  • Granularity
  • Maximum likelihood clustering
  • Next-generation sequencing
  • Paralogs
  • Single-linkage clustering
  • Transcriptome data

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics
  • Medicine (all)

Cite this

Dlugosch, K. M., & Bonin, A. (2012). Allele identification in assembled genomic sequence datasets. In Methods in Molecular Biology (Vol. 888, pp. 197-211). Humana Press Inc. https://doi.org/10.1007/978-1-61779-870-2_12

@inbook{c2b2f10a3c844dd1b59389394788bc8c,
title = "Allele identification in assembled genomic sequence datasets",
keywords = "AllelePipe, Allelic variation, Gene duplication, Granularity, Maximum likelihood clustering, Next-generation sequencing, Paralogs, Single-linkage clustering, Transcriptome data",
author = "Dlugosch, Katrina M. and Bonin, Aur{\'e}lie",
year = "2012",
doi = "10.1007/978-1-61779-870-2_12",
language = "English (US)",
isbn = "9781617798696",
volume = "888",
series = "Methods in Molecular Biology",
publisher = "Humana Press Inc.",
pages = "197--211",
booktitle = "Methods in Molecular Biology",

}

TY - CHAP

T1 - Allele identification in assembled genomic sequence datasets

AU - Dlugosch, Katrina M

AU - Bonin, Aurélie

PY - 2012

Y1 - 2012

KW - AllelePipe

KW - Allelic variation

KW - Gene duplication

KW - Granularity

KW - Maximum likelihood clustering

KW - Next-generation sequencing

KW - Paralogs

KW - Single-linkage clustering

KW - Transcriptome data

UR - http://www.scopus.com/inward/record.url?scp=84866749735&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84866749735&partnerID=8YFLogxK

U2 - 10.1007/978-1-61779-870-2_12

DO - 10.1007/978-1-61779-870-2_12

M3 - Chapter

C2 - 22665283

AN - SCOPUS:84866749735

SN - 9781617798696

VL - 888

T3 - Methods in Molecular Biology

SP - 197

EP - 211

BT - Methods in Molecular Biology

PB - Humana Press Inc.

ER -