AURITUS: Automatic Meta-Data Extraction from Audio Documents

While text search engines have become common tools even for the casual Internet user, searching and retrieving audio files by content is still a subject of research. Research topics range from extracting meta-data by statistically analyzing the audio signal, segmenting at texture changes, and generating "audio fingerprints" of music pieces to developing interfaces like query-by-humming and visualizing audio content. Three major current trends in the field of audio databases promote the need for new search and navigation techniques, especially content-based retrieval algorithms:

  1. The number of digitally available audio documents is increasing very rapidly. At the same time, digital storage capacity has become affordable enough to allow online access to vast audio collections.
  2. Helpful meta-data is often non-existent or flawed, because many of these 'databases' are open collections (e.g. in peer-to-peer file sharing communities) that lack centralized administration and indexing with established dictionaries.
  3. The nature of audio documents itself suggests novel query interfaces besides mere searching in textual meta-data, e.g. query-by-humming or query-by-example.

Almost every audio application employs some form of temporal segmentation of the audio signal, i.e. it seeks continuous regions in time where a specific feature is present (or absent). We focus on segmentation by chorus in music documents. Chorus detection has many obvious applications, such as playing chorus thumbnails in sound file browsers or providing "magnetic" region selectors in audio editors.
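As a minimal illustration of temporal segmentation in the sense used above, the following sketch turns a per-frame "feature present" flag into continuous time regions. The function name `segments_from_flags` and the `frame_seconds` parameter are hypothetical and chosen for illustration only; they are not part of AURITUS.

```python
def segments_from_flags(flags, frame_seconds=1.0):
    """Group consecutive frames where a feature is present into
    (start_time, end_time) regions. Illustrative sketch only."""
    segments = []
    start = None  # index of the first frame of the currently open region
    for i, present in enumerate(flags):
        if present and start is None:
            start = i  # a new region begins here
        elif not present and start is not None:
            # the region ended just before frame i
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # the feature was still present at the end of the signal
        segments.append((start * frame_seconds, len(flags) * frame_seconds))
    return segments
```

For chorus segmentation, the flag for a frame would simply state whether that frame is considered part of a chorus.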

Top: Self-similarity matrix. Chorus segments are clearly visible as white lines in the upper right of the repetition matrix. Middle: Repetition density function. Bottom: Human segmentations (light blue) and automated segmentation (red).

Our chorus segmentation algorithm is based on self-similarity in the frequency domain. The basic assumption is that a chorus is a repeated region within a music document: the more often a region is repeated throughout the document, the more likely it belongs to a chorus segment. In four processing steps we generate a self-similarity matrix and extract repeated regions.
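The idea of counting repetitions in a self-similarity matrix can be sketched as follows. This is not the AURITUS implementation, and it does not reproduce its four processing steps, which are not detailed here. It assumes that each frame is already given as a spectral feature vector, uses cosine similarity as the similarity measure, and introduces the hypothetical parameters `threshold` and `min_run`; all function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def self_similarity(frames):
    """Square matrix comparing every frame with every other frame."""
    n = len(frames)
    return [[cosine(frames[i], frames[j]) for j in range(n)] for i in range(n)]

def repetition_density(sim, threshold=0.9, min_run=2):
    """For each frame, count how many repeated runs it takes part in.

    A repeated run is a stretch of at least `min_run` consecutive similar
    frame pairs along one off-diagonal (one fixed time lag) of the matrix.
    """
    n = len(sim)
    density = [0] * n
    for lag in range(1, n):
        run_start = None
        # one extra iteration acts as a sentinel that closes an open run
        for i in range(n - lag + 1):
            similar = i < n - lag and sim[i][i + lag] >= threshold
            if similar and run_start is None:
                run_start = i
            elif not similar and run_start is not None:
                if i - run_start >= min_run:
                    for k in range(run_start, i):
                        density[k] += 1        # the earlier occurrence
                        density[k + lag] += 1  # its later repetition
                run_start = None
    return density

def chorus_frames(density):
    """Indices of the most frequently repeated frames (empty if nothing repeats)."""
    peak = max(density, default=0)
    return [i for i, d in enumerate(density) if peak > 0 and d == peak]
```

In this sketch a repeated region appears as a run along a line parallel to the main diagonal, and frames that take part in the most runs are reported as chorus candidates. A real signal would need a more robust similarity measure and some smoothing of the density function.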

Using a real-world benchmark set comprising the German Top 100 Single Charts as published by Media Control GmbH (the leading German market research institute in this field), AURITUS, our JavaBeans-based implementation, produces results comparable to those of human listeners and paves the way for a variety of useful applications, as described above.

Examples from the Top 100 benchmark set.

Above you can see some self-similarity matrices from our benchmark set. In these visualizations, similar segments are colored dark. Note the dark lines running in parallel from top left to bottom right and their correspondence with the human chorus segmentations (n=3 shown here), depicted as vividly colored bars below.


Simon G. Wiest, Tel.: (07071) 29-77176, Email: wiest at