Graph Mining and Machine Learning for Chemical Compounds

A Library for the Decomposition of Chemical Compounds

The goal of this work is the development of a library for decomposing the topology and geometry of small organic compounds. Such decompositions are useful for purposes such as molecular fingerprints and similarity searches. Although many commercial tools exist for this purpose, they are closed source and carry high license costs, which makes it nearly impossible to compare different encodings. A publication and the release of the library are scheduled for summer 2010. At the moment the decomposition library features:

  • 16 topological and geometrical fingerprints
  • 4 different atom typing schemes and a pharmacophore typing scheme
  • Free parameterizations such as geometrical distance-cutoff and topological search depth
  • Efficient data structures for a fast comparison of feature maps
  • Exporters for the LIBSVM format (sparse and matrix variants), the WEKA ARFF format, and comma-separated values
  • A liberal license (LGPL): the library depends only on the open source Chemistry Development Kit (CDK)
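To make the feature map comparison and the sparse LIBSVM export from the list above more concrete, here is a minimal Python sketch. It is not the library's API: the function names (`feature_map`, `minmax_similarity`, `to_libsvm_line`), the fragment strings, and the choice of the MinMax similarity are assumptions made for illustration only.

```python
from collections import Counter

def feature_map(fragments):
    """Count how often each fragment string occurs (a sparse feature map)."""
    return Counter(fragments)

def minmax_similarity(a, b):
    """MinMax (generalized Tanimoto) similarity of two count-based feature maps."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0

def to_libsvm_line(label, fmap, index):
    """Format one sample as '<label> <idx>:<value> ...' with ascending 1-based indices.

    `index` maps fragment strings to column numbers and is extended on the fly,
    so the same dictionary has to be reused for every sample of a data set.
    """
    for frag in sorted(fmap):
        index.setdefault(frag, len(index) + 1)
    pairs = sorted((index[f], v) for f, v in fmap.items())
    return f"{label} " + " ".join(f"{i}:{v}" for i, v in pairs)

# Hypothetical fragment lists for two molecules.
mol_a = feature_map(["C-C", "C-O", "C-O", "C=O"])
mol_b = feature_map(["C-C", "C-O", "C-N"])

index = {}
line_a = to_libsvm_line(1, mol_a, index)
line_b = to_libsvm_line(-1, mol_b, index)
similarity = minmax_similarity(mol_a, mol_b)
```

Only fragments that actually occur are stored, so both the comparison and the export scale with the number of fragments per molecule rather than with the size of the full fragment vocabulary.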

Parts of the library were already used in several studies [1,3,5].

Figure 1: Lossless all-shortest path fingerprint in a trie data structure (encoded molecule: aspirin, search depth = 2)
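The idea behind Figure 1 can be sketched in a few lines of Python. This is a simplified illustration, not the library's implementation: it assumes a molecule given as a list of element symbols plus an adjacency list, ignores bond types, counts each path once per direction, and uses nested dictionaries with a hypothetical `"#"` key as the trie.

```python
from collections import deque

def shortest_paths_from(adj, start, depth):
    """Return every shortest path (tuple of atom indices) from `start` to
    atoms at most `depth` bonds away, including the trivial path (start,)."""
    # Breadth-first search gives the topological distance of every atom.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Extend paths layer by layer; a step is kept only if it moves one bond
    # further away from `start`, so every generated path is a shortest path.
    paths = [(start,)]
    frontier = [(start,)]
    for d in range(depth):
        nxt = [p + (v,) for p in frontier for v in adj[p[-1]] if dist[v] == d + 1]
        paths.extend(nxt)
        frontier = nxt
    return paths

def insert(trie, labels):
    """Insert a label sequence and count how many paths end at its node."""
    node = trie
    for lab in labels:
        node = node.setdefault(lab, {"#": 0})
    node["#"] += 1

def count(trie, labels):
    """Number of stored paths whose label sequence equals `labels`."""
    node = trie
    for lab in labels:
        if lab not in node:
            return 0
        node = node[lab]
    return node.get("#", 0)

def build_trie(atoms, adj, depth):
    trie = {}
    for start in range(len(atoms)):
        for path in shortest_paths_from(adj, start, depth):
            insert(trie, [atoms[i] for i in path])
    return trie

# Heavy atoms of ethanol (C-C-O) as a small stand-in molecule, search depth 2.
atoms = ["C", "C", "O"]
adj = {0: [1], 1: [0, 2], 2: [1]}
trie = build_trie(atoms, adj, depth=2)
```

Paths that share a prefix share the corresponding trie nodes, which keeps the structure compact and makes prefix-wise comparison of two fingerprints straightforward.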

Molecule Kernels

Graph kernels make it possible to compare chemical compounds without predefining nominal or numerical features, and they have shown excellent prediction performance [6]. Recently, we published a new 2D/3D convolution kernel based on atom pair environments, which is suitable for predicting molecular properties and for similarity searches [3,7]. This algorithm was later extended to integrate conformational information [4].

Figure 2.1: Optimal assignment of geometrical local atom pair environments

Figure 2.2: Optimal assignment of topological atom pair environments
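The optimal assignment shown in Figures 2.1 and 2.2 can be illustrated with a toy Python sketch. It is not the published kernel from [3,7]: the environment encoding (an element plus its neighbour elements), the similarity function `env_similarity`, and the brute-force search are all simplifying assumptions. A practical implementation would replace the brute-force search with the Hungarian algorithm.

```python
from collections import Counter
from itertools import permutations

def env_similarity(a, b):
    """Toy similarity in [0, 1] of two local environments.

    Each environment is a pair (element, tuple of neighbour elements).
    """
    (elem_a, nbrs_a), (elem_b, nbrs_b) = a, b
    ca, cb = Counter(nbrs_a), Counter(nbrs_b)
    union = sum((ca | cb).values())
    overlap = sum((ca & cb).values()) / union if union else 1.0
    return 0.5 * (elem_a == elem_b) + 0.5 * overlap

def optimal_assignment(envs_a, envs_b, sim=env_similarity):
    """Maximum total similarity over all one-to-one assignments of the
    smaller environment set to the larger one (brute force, tiny inputs only)."""
    small, large = sorted((envs_a, envs_b), key=len)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(sim(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, score)
    return best

def normalized_oa(envs_a, envs_b):
    """Normalise so that a molecule compared with itself scores 1.

    Assumes both molecules have at least one environment.
    """
    k_ab = optimal_assignment(envs_a, envs_b)
    k_aa = optimal_assignment(envs_a, envs_a)
    k_bb = optimal_assignment(envs_b, envs_b)
    return k_ab / (k_aa * k_bb) ** 0.5

# Hypothetical heavy-atom environments of ethanol and methanol.
ethanol = [("C", ("C",)), ("C", ("C", "O")), ("O", ("C",))]
methanol = [("C", ("O",)), ("O", ("C",))]

k = optimal_assignment(ethanol, methanol)
k_norm = normalized_oa(ethanol, methanol)
```

No fixed feature vector is needed here: the two molecules are compared directly through the best pairing of their local environments.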


Large Scale Learning on Molecular Data

The application of large scale learning methods in cheminformatics is becoming increasingly important because data sets of measured compounds keep growing. Our experiments focus on linear large scale SVMs, which can handle several hundred thousand samples when the encoding is sparse. Decomposing molecules into fragments produces such a sparse encoding and therefore seems a promising way to accomplish this task.
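As a rough illustration of why sparse fragment encodings suit linear SVMs, the following Python sketch trains a linear SVM with a Pegasos-style subgradient update on dictionary-based sparse vectors. It is an assumption-laden toy, not the setup used in our experiments: the fragment names, the labels, the regularization value, and the deterministic cycling over samples (instead of random sampling) are chosen for illustration only.

```python
def dot(w, x):
    """Sparse dot product: only the non-zero entries of x are visited."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())

def train_pegasos(samples, lam=0.1, epochs=20):
    """Pegasos-style training of a linear SVM without bias term.

    `samples` is a list of (sparse feature dict, label in {-1, +1}) pairs.
    """
    w = {}
    t = 0
    for _ in range(epochs):
        for x, y in samples:
            t += 1
            eta = 1.0 / (lam * t)              # decreasing step size
            violated = y * dot(w, x) < 1.0     # hinge loss is active
            scale = 1.0 - 1.0 / t              # equals 1 - eta * lam
            for key in w:
                w[key] *= scale                # shrink weights (regularization)
            if violated:
                for key, value in x.items():
                    w[key] = w.get(key, 0.0) + eta * y * value
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0.0 else -1

# Hypothetical fragment counts with made-up activity labels.
data = [
    ({"C-C": 1.0, "C=O": 1.0}, 1),
    ({"C-C": 1.0, "C-N": 1.0}, -1),
    ({"C=O": 2.0}, 1),
    ({"C-N": 1.0, "C-O": 1.0}, -1),
]
weights = train_pegasos(data)
```

The weight vector only ever contains fragments that occurred in some training sample, and each margin check touches just the non-zero entries of one molecule, which is what keeps the approach feasible for very large compound collections.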


[1] Nikolas Fechner, Georg Hinselmann, Andreas Jahn, Lars Rosenbaum, and Andreas Zell. A Free-Wilson-like approach to analyze QSAR models based on graph decomposition kernels. Molecular Informatics, in press, 2010.

[2] Nikolas Fechner, Georg Hinselmann, and Jörg Kurt Wegner. Handbook of Chemoinformatics Algorithms, chapter Molecular Descriptors. Chapman & Hall/CRC Mathematical & Computational Biology, 2010.

[3] Georg Hinselmann, Nikolas Fechner, Andreas Jahn, Matthias Eckert, and Andreas Zell. Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing, in press, 2010.

[4] Andreas Jahn, Georg Hinselmann, Nikolas Fechner, Carsten Henneges, and Andreas Zell. Probabilistic modeling of conformational space for 3D machine learning approaches. Molecular Informatics, 29(5):441-455, 2010.

[5] Georg Hinselmann, Andreas Jahn, Nikolas Fechner, and Andreas Zell. Chronic rat toxicity prediction of chemical compounds using kernel machines. In Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics: 7th European Conference (EvoBio 2009), volume 5483, pages 25-36, Tübingen, Germany, April 2009. Springer.

[6] Nikolas Fechner, Andreas Jahn, Georg Hinselmann, and Andreas Zell. Atomic local neighborhood flexibility incorporation into a structured similarity measure for QSAR. Journal of Chemical Information and Modeling, 49(3):549-560, March 2009.

[7] Andreas Jahn, Georg Hinselmann, Nikolas Fechner, and Andreas Zell. Optimal assignment methods for ligand-based virtual screening. Journal of Cheminformatics, 1(14), 2009.


Lars Rosenbaum, Tel.: (07071) 29-77174, lars.rosenbaum (at)