Wegner, Jörg K. and Fröhlich, Holger and Zell, Andreas

Feature selection for Descriptor based Classification Models. 2. Human intestinal absorption (HIA)

Journal of Chemical Information and Computer Sciences (JCICS), vol. 44 (2004), no. 3, pp. 931-939


Abstract

We show that the topological polar surface area (TPSA) descriptor and the radial distribution function (RDF) applied to electronic and steric atom properties, like the conjugated electrotopological state (CETS), are the most relevant features/descriptors for predicting the human intestinal absorption (HIA) out of a large set of 2934 features/descriptors. A HIA data set with 196 molecules with measured HIA values and 2934 features/descriptors were calculated using JOELib and MOE. We used an adaptive boosting algorithm to solve the binary classification problem (AdaBoost.M1) and Genetic Algorithms based on Shannon Entropy Cliques (GA-SEC) variants as hybrid feature selection algorithms. The selection of relevant features was applied with respect to the generalization ability of the classification model, avoiding a high variance for unseen molecules (overfitting).


Downloads and Links

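[doi] http://dx.doi.org/10.1021/ci034233w
[pdf] http://www.cogsys.uni-tuebingen.de/publikationen/2004/Wegner2004b.pdf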


BibTeX

@article{2004_65,
  author = {Wegner, J\"org K. and Fr\"ohlich, Holger and Zell, Andreas},
  title = {{Feature selection for Descriptor based Classification Models. 2.
	Human intestinal absorption (HIA)}},
  journal = {Journal of Chemical Information and Computer Sciences (JCICS)},
  year = {2004},
  volume = {44},
  pages = {931--939},
  number = {3},
  month = feb,
  abstract = {We show that the topological polar surface area (TPSA) descriptor
	and the radial distribution function (RDF) applied to electronic
	and steric atom properties, like the conjugated electrotopological
	state (CETS), are the most relevant features/descriptors for predicting
	the human intestinal absorption (HIA) out of a large set of 2934
	features/descriptors. A HIA data set with 196 molecules with measured
	HIA values and 2934 features/descriptors were calculated using JOELib
	and MOE. We used an adaptive boosting algorithm to solve the binary
	classification problem (AdaBoost.M1) and Genetic Algorithms based
	on Shannon Entropy Cliques (GA-SEC) variants as hybrid feature selection
	algorithms. The selection of relevant features was applied with respect
	to the generalization ability of the classification model, avoiding
	a high variance for unseen molecules (overfitting).},
  doi = {10.1021/ci034233w},
  pdf = {http://www.cogsys.uni-tuebingen.de/publikationen/2004/Wegner2004b.pdf},
  url = {http://dx.doi.org/10.1021/ci034233w}
}