COFEA: High-speed searching methods using the Compressed Feature Matrix
The Compressed Feature Matrix (CFM) is a feature-based molecular
descriptor that enables fast, adaptive similarity search, pharmacophore
development and substructure search. Within the CFM descriptor, a
feature vector contains the biochemical or physicochemical features
that occur in the described molecule. The assignment of structural
patterns to feature types may be determined by the user. The second
part of the descriptor is a distance matrix that relates the
contained features to one another. Depending on the particular purpose, the matrix
may be generated from either topological or Euclidean molecular
data, permitting both two- and three-dimensional encoding of
the molecule.
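
The two parts of the descriptor can be pictured with a small Python
sketch. It is only an illustration under simplifying assumptions, not
COFEA's implementation: each feature is anchored on a single atom, the
distance is the number of bonds on the shortest path (the topological
case), and the names CFM, bond_distances and build_topological_cfm as
well as the feature labels are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CFM:
    features: list   # feature vector: one feature-type label per feature
    distances: list  # distances[i][j]: distance between features i and j


def bond_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def build_topological_cfm(adjacency, feature_atoms):
    """feature_atoms: list of (feature_type, anchor_atom_index) pairs.

    The feature assignment is supplied by the caller, mirroring the
    user-defined assignment of patterns to feature types.
    """
    features = [ftype for ftype, _ in feature_atoms]
    distances = []
    for _, anchor in feature_atoms:
        dist = bond_distances(adjacency, anchor)
        distances.append([dist[other] for _, other in feature_atoms])
    return CFM(features, distances)


# Toy molecule (heavy atoms of 4-aminobutanoic acid): N0-C1-C2-C3-C4(=O5)-O6
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4],
             4: [3, 5, 6], 5: [4], 6: [4]}
# Illustrative, simplified feature assignment chosen for this example only.
feature_atoms = [("donor", 0), ("acceptor", 5), ("acceptor", 6)]

cfm = build_topological_cfm(adjacency, feature_atoms)
print(cfm.features)
print(cfm.distances)
```

The matrix has one row and column per feature rather than per atom,
so seven heavy atoms are reduced to a 3 x 3 matrix here. For the
Euclidean case, the bond counts would be replaced by distances between
3D coordinates.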
In contrast to a conventional distance matrix, the CFM is based on
features instead of atoms. Each feature type may be weighted
separately, depending on its (estimated) contribution to the biological
effect of the molecule. The CFM therefore allows similarity
evaluation to be adapted to particular ligand sets as well as to
classification requirements. As a result, the CFM makes it possible to
focus on small, characteristic parts of molecules - which are independent
of the molecular scaffold - as the basis for calculating similarity.
Hence, the CFM is suitable not only for common similarity evaluation
but also for techniques such as lead or scaffold hopping.
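
How per-feature weights can shift a similarity score away from the
scaffold may be sketched as follows. The scoring function is an
assumption made for illustration (a weighted Tanimoto coefficient over
feature-pair/distance triples); the page does not state which measure
COFEA actually uses, and the names pair_profile and weighted_similarity
are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_profile(features, distances):
    """Count (type_a, type_b, distance) triples over all feature pairs."""
    profile = Counter()
    for i, j in combinations(range(len(features)), 2):
        a, b = sorted((features[i], features[j]))
        profile[(a, b, distances[i][j])] += 1
    return profile


def weighted_similarity(cfm_a, cfm_b, weights):
    """Weighted Tanimoto coefficient of two pair profiles.

    A triple is weighted by the product of the weights of its two
    feature types; types missing from `weights` default to 1.0.
    """
    pa, pb = pair_profile(*cfm_a), pair_profile(*cfm_b)
    shared = total = 0.0
    for key in pa.keys() | pb.keys():
        w = weights.get(key[0], 1.0) * weights.get(key[1], 1.0)
        shared += w * min(pa[key], pb[key])
        total += w * max(pa[key], pb[key])
    return shared / total if total else 0.0


# Two toy descriptors (feature vector, distance matrix) that differ only
# in where the hydrophobic feature sits relative to the donor.
mol_a = (["donor", "acceptor", "hydrophobe"],
         [[0, 5, 3], [5, 0, 2], [3, 2, 0]])
mol_b = (["donor", "acceptor", "hydrophobe"],
         [[0, 5, 4], [5, 0, 2], [4, 2, 0]])

print(weighted_similarity(mol_a, mol_b, {}))                   # 0.5
print(weighted_similarity(mol_a, mol_b, {"hydrophobe": 0.0}))  # 1.0
```

With uniform weights the differing hydrophobe position lowers the
score; with the hydrophobe weight set to zero only the donor/acceptor
arrangement counts, which is the kind of scaffold-independent focus
described above.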
Similarity search characteristics
- The CFM-based similarity search may be performed via common
molecule vs. molecule comparisons or by using a pharmacophore model
as the target structure.
- COFEA provides two different ways of calculating pharmacophores,
each capable of both 2D and 3D evaluation.
- The average search speed is around 1,200 compounds/second, i.e.
1,000,000 compounds in less than 15 minutes.
- The CFM-based similarity search is suitable for interactive
use even with large databases.
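
The throughput figure can be checked with a few lines of arithmetic.
The helper name screening_time_seconds is hypothetical and the
calculation simply assumes a constant average rate.

```python
def screening_time_seconds(n_compounds, compounds_per_second):
    """Time needed to process a library at a constant average rate."""
    return n_compounds / compounds_per_second


seconds = screening_time_seconds(1_000_000, 1200)
minutes = seconds / 60
print(f"{minutes:.1f} minutes")  # 13.9 minutes, i.e. under 15 minutes
```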
While common substructure descriptors merely allow screening
for predefined patterns, the CFM permits a true substructure/subgraph
search, provided that all desired elements of the query substructure
are described by the selected feature set. Compared to graph-based
searching methods, the CFM-based matrix algorithm turned out to
be up to several hundred times faster. When the CFM is used as the
basis for a basic substructure screening, the search is accelerated
by as much as three orders of magnitude. Thus, the CFM-based
substructure search meets the requirements of interactive use, even
for the evaluation of several hundred thousand compounds.
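
A feature-level substructure search can be sketched as a small
backtracking match over two descriptors. This is a generic
illustration, not the CFM matrix algorithm itself, whose details are
not given here. It assumes descriptors are (feature vector, distance
matrix) pairs and that the distances between query features are the
same in the query and in the target molecule; the function name
find_feature_embedding is hypothetical.

```python
def find_feature_embedding(query, target):
    """Map every query feature to a distinct target feature of the same
    type so that all pairwise distances agree.

    Returns a list m with m[i] = target index of query feature i,
    or None if no such mapping exists.
    """
    q_feats, q_dist = query
    t_feats, t_dist = target

    def extend(mapping):
        i = len(mapping)              # next query feature to place
        if i == len(q_feats):
            return list(mapping)      # all query features placed
        for j, t_type in enumerate(t_feats):
            if j in mapping or t_type != q_feats[i]:
                continue
            # Distances to all previously placed features must match.
            if all(q_dist[i][k] == t_dist[j][mapping[k]] for k in range(i)):
                mapping.append(j)
                found = extend(mapping)
                if found is not None:
                    return found
                mapping.pop()         # backtrack
        return None

    return extend([])


# Toy target: one donor and two acceptors (topological distances).
target = (["donor", "acceptor", "acceptor"],
          [[0, 5, 5], [5, 0, 2], [5, 2, 0]])

hit = find_feature_embedding((["donor", "acceptor"], [[0, 5], [5, 0]]), target)
miss = find_feature_embedding((["acceptor", "acceptor"], [[0, 3], [3, 0]]), target)
print(hit)   # [0, 1]
print(miss)  # None
```

Because the search runs over a handful of features instead of all
atoms and bonds, the search space is much smaller than in an
atom-level graph match.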
Substructure search characteristics
- Since the feature set may be determined by the user, the search
results may be adapted to particular requirements, e.g. concerning
a certain biological effect.
- Using a feature set that is suitable for common pharmaceutical
problems, the average search speed is between 30,000 and 100,000
compounds/second, i.e. 1,000,000 compounds in between 10 and 30
seconds.
- COFEA permits the preclusion of compounds with an unsuitable feature
composition.
- As a result, the search speed may even be increased to as much as 250,000
compounds/second, i.e. 1,000,000 compounds in about 4 seconds.
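
Precluding compounds by feature composition can be illustrated with a
simple count comparison. This is a minimal sketch under the assumption
that a compound can only contain the query if it has at least as many
features of each type; the names may_contain and library and the toy
data are hypothetical.

```python
from collections import Counter


def may_contain(query_features, target_features):
    """Cheap necessary condition: the target has at least as many
    features of every type as the query requires."""
    need = Counter(query_features)
    have = Counter(target_features)
    return all(have[ftype] >= count for ftype, count in need.items())


# Toy library: compound name -> feature vector.
library = {
    "cpd_1": ["donor", "acceptor", "acceptor"],
    "cpd_2": ["donor", "hydrophobe"],
    "cpd_3": ["acceptor", "acceptor", "hydrophobe"],
}
query = ["acceptor", "acceptor"]

# Only the surviving candidates need the more expensive matrix comparison.
candidates = [name for name, feats in library.items()
              if may_contain(query, feats)]
print(candidates)  # ['cpd_1', 'cpd_3']
```

Such a test touches only the feature vector, so compounds that fail it
never reach the distance-matrix comparison.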
Badreddin Abolmaali, Tel.: +49 7071 29-78979,
abolmaali at informatik.uni-tuebingen.de
Last changes: 19.03.2018, 18:46 CET.
© 2001-2005 University of Tübingen