Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Bayesian semi-supervised classification of bacterial samples using MLST databases

Cheng, Lu ORCID: https://orcid.org/0000-0002-6391-2360, Connor, Thomas Richard ORCID: https://orcid.org/0000-0003-2394-6504, Aanensen, David M., Spratt, Brian G. and Corander, Jukka 2011. Bayesian semi-supervised classification of bacterial samples using MLST databases. BMC Bioinformatics 12 (1), 302. doi:10.1186/1471-2105-12-302

Abstract

Background: Worldwide effort on sampling and characterization of molecular variation within a large number of human and animal pathogens has led to the emergence of multi-locus sequence typing (MLST) databases as an important tool for studying the epidemiology and evolution of pathogens. Many of these databases currently hold several thousand multi-locus DNA sequence types (STs), enriched with metadata on traits of the isolates such as serotype, antibiotic resistance and host organism. Curators of the databases can therefore divide the pathogen populations into subsets representing different evolutionary lineages, geographically associated groups, or other subpopulations, which are defined in terms of molecular similarities and dissimilarities residing within a database. When combined with the existing metadata, such subsets may provide invaluable information for assessing the position of a new set of isolates in relation to the whole pathogen population.

Results: To enable users of MLST schemes to query the databases with sets of new bacterial isolates and to automatically analyze their relation to existing curated sequences, we introduce here a Bayesian model-based method for semi-supervised classification of MLST data. Our method can use an MLST database as a training set and simultaneously assign any set of query sequences to the previously discovered lineages/populations, while also allowing some or all of these sequences to form previously undiscovered, genetically distinct groups. The tool provides probabilistic quantification of the classification uncertainty and is computationally highly efficient, thus enabling rapid analyses of large databases and sets of query sequences. This efficiency is a necessary prerequisite for automated access through the MLST web interface. We demonstrate the versatility of our approach by analyzing both real and synthesized data from MLST databases. The introduced method for semi-supervised classification of sets of query STs is freely available for Windows, Mac OS X and Linux operating systems in the BAPS 5.4 software, which is downloadable at http://web.abo.fi/fak/mnf/mate/jc/software/baps.html. The query functionality is also directly available for the Staphylococcus aureus database at http://www.mlst.net and will shortly be available for other species databases hosted at this web portal.

Conclusions: We have introduced a model-based tool for automated semi-supervised classification of new pathogen samples that can be integrated into the web interface of the MLST databases. In particular, when combined with the existing metadata, the semi-supervised labeling may provide invaluable information for assessing the position of a new set of query strains in relation to the particular pathogen population represented by the curated database. Such information will be useful for both clinical and basic research purposes.

Item Type: Article
Date Type: Publication
Status: Published
Schools: Biosciences
Systems Immunity Research Institute (SIURI)
Subjects: Q Science > QH Natural history > QH301 Biology
Q Science > QR Microbiology > QR180 Immunology
Q Science > QR Microbiology > QR355 Virology
Additional Information: Pdf uploaded in accordance with publisher's policy at http://www.sherpa.ac.uk/romeo/issn/1471-2105/ (accessed 25/02/2014)
Publisher: BioMed Central
ISSN: 1471-2105
Date of First Compliant Deposit: 30 March 2016
Last Modified: 06 May 2023 21:56
URI: https://orca.cardiff.ac.uk/id/eprint/41541

Citation Data

Cited 19 times in Scopus.