Pattern Recognition for Strings

Traditionally, pattern recognition has been concerned mostly with numerical data, i.e. vectors of real-valued features. Less often, syntactical representation of data has been used. A special category of data, the symbol strings, have been neglected for a long time, partially because of a perceived lack of urgency and partially because of the high computational cost involved. Only recently, motivated by research in such diverse fields like speech recognition and computational molecular biology, symbol strings acquired more interest in the pattern recognition community.

With respect to the underlying mathematical properties, current pattern recognition algorithms can be classified into three main groups: those based on a scalar product, on a distance, and on a kernel. Multi-layer perceptrons with sigmoid activation functions are the most popular example of the first kind. Nearest-neighbor classifiers are typical representatives of the second kind, whereas support vector machines rely on kernels.

No scalar product has been defined on string data. However, it is possible to define both a distance measure on them, as well as a kernel function. This implies that the two latter methods can be applied for strings, too. By further defining averages on them, further distance-based algorithms, like K-means and LVQ, become applicable. This project investigates the performance of such algorithms on symbol string data.

StringSOM of proteins
A StringSOM of seven protein families (chroma-coded)


Contact

Igor Fischer, Phone: +49 7071 29-77176, fischer@informatik.uni-tuebingen.de