Henneges, Carsten

Feature Selection and Data Mining for Proteomics and Metabolomics

Ph.D. thesis, University of Tuebingen, Verlag Dr. Hut, Sternstraße 18, München, Tübingen, Germany, 2011

Abstract

The past decades saw rapid improvement in the techniques of biological experiments. While the early days were marked by long and complex experiments carried out by laboratory staff, automatic high-throughput methods have since emerged. Proteomics in particular profited from advances in mass spectrometry to identify fragments from digested proteins. Mass spectrometry then entered the novel field of metabolomics, which investigates small molecules from metabolism for their role as disease markers. The goal here is to develop monitoring and screening techniques based on easily obtained body fluids. However, further methods, such as IR spectroscopy, have yet to be adopted in metabolomics research. Connected to each area of research is the topic of data mining. Essentially, data mining can be subdivided into the tasks of feature selection and feature construction. Feature selection aims to select relevant features out of a larger pool. A selected combination may aid visualisation and thus understanding of a dataset, as well as improve the prediction performance of learning algorithms. To this end, three general approaches exist: wrapper, filter and embedded methods. While wrappers employ an arbitrary learning algorithm to assess the value of a feature combination, filters rely on statistical criteria. Most recently, embedded methods, in which feature selection is integrated into a learning algorithm, have attracted research interest. Feature construction algorithms, on the other hand, reconstruct the subsignals of an additive superposition. The key approach here is matrix factorisation under constraints. Constraints frequently used for this purpose are statistical independence, non-negativity and sparsity, leading to problem-specific algorithms. This book supports life science researchers with adapted data mining methods from both feature selection and feature construction for proteomics and metabolomics.
We describe biomarker identification for breast cancer prediction using an SVM-based wrapper and develop faster wrapper algorithms using surrogate-based optimisation. Applying filters for ranking-specific feature selection, we also design a cost-efficient prediction system for proteotypic peptides. As an application, embedded methods are used to infer energetic interaction patterns in protein 3D structures. Finally, we develop a novel factorisation method for feature construction to decompose IR spectra within a metabolomics context.

Downloads and Links

[pdf]

BibTeX

@phdthesis{Henneges2011,
  author = {Henneges, Carsten},
  title = {Feature Selection and Data Mining for Proteomics and Metabolomics},
  school = {University of Tuebingen},
  year = {2011},
  address = {T\"ubingen, Germany},
  month = mar,
  abstract = {The past decades saw rapid improvement in the techniques of biological
	experiments. While the early days were marked by long and complex
	experiments carried out by laboratory staff, automatic high-throughput
	methods have since emerged. Proteomics in particular profited from
	advances in mass spectrometry to identify fragments from digested
	proteins. Mass spectrometry then entered the novel field of metabolomics,
	which investigates small molecules from metabolism for their role
	as disease markers. The goal here is to develop monitoring and screening
	techniques based on easily obtained body fluids. However, further
	methods, such as IR spectroscopy, have yet to be adopted in metabolomics
	research. Connected to each area of research is the topic of data
	mining. Essentially, data mining can be subdivided into the tasks
	of feature selection and feature construction. Feature selection
	aims to select relevant features out of a larger pool. A selected
	combination may aid visualisation and thus understanding of a dataset,
	as well as improve the prediction performance of learning algorithms.
	To this end, three general approaches exist: wrapper, filter and
	embedded methods. While wrappers employ an arbitrary learning algorithm
	to assess the value of a feature combination, filters rely on statistical
	criteria. Most recently, embedded methods, in which feature selection
	is integrated into a learning algorithm, have attracted research
	interest. Feature construction algorithms, on the other hand, reconstruct
	the subsignals of an additive superposition. The key approach here
	is matrix factorisation under constraints. Constraints frequently
	used for this purpose are statistical independence, non-negativity
	and sparsity, leading to problem-specific algorithms. This book supports
	life science researchers with adapted data mining methods from both
	feature selection and feature construction for proteomics and metabolomics.
	We describe biomarker identification for breast cancer prediction
	using an SVM-based wrapper and develop faster wrapper algorithms
	using surrogate-based optimisation. Applying filters for ranking-specific
	feature selection, we also design a cost-efficient prediction system
	for proteotypic peptides. As an application, embedded methods are
	used to infer energetic interaction patterns in protein 3D structures.
	Finally, we develop a novel factorisation method for feature construction
	to decompose IR spectra within a metabolomics context.},
  isbn = {978-3-8439-0122-2},
  publisher = {Verlag Dr.~Hut, Sternstra{\ss}e 18, M\"unchen},
  url = {http://www.dr.hut-verlag.de/978-3-8439-0122-2.html}
}