SABINE has moved to github.com/draeger-lab/SABINE.

Documentation

Introduction

The stand-alone application SABINE predicts the binding specificity of eukaryotic transcription factors (TFs) based on various features extracted from their annotated protein sequences. The predicted DNA motif is reported as a position frequency matrix (PFM), a commonly used format for representing transcription factor binding specificities. In a nutshell, SABINE uses support vector regression (SVR) models to identify TFs with annotated PFMs whose binding specificities are highly similar to that of your factor of interest. The PFMs of these functionally similar factors are filtered for outliers and then merged into a consensus PFM, which is in turn transferred to the factor under study.

How to get started

SABINE is available as a stand-alone version and as an online version. The online version requires no installation and provides a quick and simple way to predict DNA-binding specificities for the TFs of your choice. If you prefer to install your own local copy of SABINE, you can get the latest stand-alone version from our download section. After installing the tool on your system, you can test whether it works properly by invoking SABINE with a sample command. Then choose the user interface that best suits your purpose. SABINE is equipped with a user-friendly graphical interface, which allows biological scientists to predict the binding specificities of individual transcription factors. For time-consuming applications, such as organism-wide large-scale analyses of TF binding specificities, a more comprehensive command-line interface is provided.

Installation

To extract the tool from the packed archive, which can be obtained from our download section, use the command:

tar -xzf sabine.tar.gz


For convenience, a shell script was implemented to simplify the installation of SABINE. Change the current working directory to the extracted SABINE directory and start the installation script with the following command:

sh install.sh


The script will install SABINE and all required third-party software packages and libraries on your system.
You can test if all software packages were installed successfully and thus ensure that SABINE works properly by running the installation validator:

sh sabine.sh --check-install


Warning: The tcsh shell has to be installed, as it is required by the tool PSIPRED, which SABINE employs to predict secondary structures.

Requirements

  • Linux system
  • Java (JDK 1.6 or later)

SABINE runs exclusively on Linux because it integrates diverse bioinformatics tools (see integrated software) that require a Linux platform. The analysis framework of SABINE is written in Java, so a Java installation (JDK version 1.6 or newer) is required on your system. We recommend using Sun Java 1.6, which has proven to perform well.
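
Before running the installer, it can be handy to confirm that the prerequisites named in this documentation (tcsh and Java) are on your PATH. The following is a minimal sketch and not part of the SABINE distribution; the function names `have_tool` and `check_prerequisites` are hypothetical. It only checks that the programs exist, not that the Java version is 1.6 or newer.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not shipped with SABINE.

# have_tool NAME: succeeds if NAME is found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# check_prerequisites NAME...: print one status line per tool and
# return non-zero if at least one of them is missing.
check_prerequisites() {
    missing=0
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call:
#   check_prerequisites tcsh java
```

The installation validator (sh sabine.sh --check-install) remains the authoritative test; this sketch merely catches a missing shell or Java runtime early.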

Starting the program

SABINE provides a Graphical User Interface (GUI) as well as a command-line interface. You can use the GUI for predicting the binding specificities of individual TFs. For the purpose of applying SABINE to large datasets, we recommend using the command-line interface.

The graphical user interface

You can start the GUI by executing the main script in your installation directory with the option "--gui".

sh sabine.sh --gui


Figure: The graphical user interface of SABINE


The demo mode
Use the Demo button to generate a showcase input for SABINE. Click the Run button to predict the binding specificity of the example transcription factor using the default parameters.

User-defined queries
Follow the steps listed below if you want to use SABINE to predict the binding specificity of an individual transcription factor:

1. Choose the organism and the superclass of your transcription factor of interest. The corresponding superclass can be looked up at TRANSFAC.
2. Paste or type the entire amino acid sequence (in single-letter code) into the corresponding text box.
3. Specify the start and end positions of the DNA-binding domain(s) of the factor and press the Add button.
4. Click the Run button to predict the binding specificity of the entered transcription factor.

The results view
After SABINE has completely processed your input, the results will be displayed in a new frame.
If a PFM transfer was possible, the best matches, i.e., the TFs for which a PFM similarity greater than the best match threshold was predicted, are listed in a table. SABINE will display the corresponding consensus PFM in a table along with a sequence logo to provide a graphical representation of the transferred PFM. An option to save the reported results is available.
If SABINE was not able to transfer a PFM, you can try a less stringent best match threshold. If your query factor exhibits sufficient domain similarity to the training factors of SABINE, this will increase the chance that a PFM transfer is possible. Note that lowering the best match threshold might negatively affect the quality of the transferred PFM, as it reduces the significance of the best matches.

Figure: The results view of SABINE

The command-line interface

To run SABINE in batch mode, execute the shell script sabine.sh contained in your installation directory:

sh sabine.sh <input_filename> [<OPTIONS>]

To display the usage of the script and an overview of the command line options, use the command:

sh sabine.sh --help


Short tutorial
This tutorial describes how SABINE can be applied to predict the binding specificity of your transcription factor of interest.

First, you have to generate a properly formatted input file (see format specification or example input file).
The input file should contain the following information about the transcription factor under study:

  • Organism
  • Superclass
  • Protein sequence
  • DNA-binding domain(s)

In the next step, you can apply the tool to predict the PFM similarity of the chosen "query factor" to all "training factors" of the same TRANSFAC superclass whose PFMs are known and which are part of the SABINE training set. The PFM predicted for your query factor results from merging the annotated PFMs of the best matches, i.e., the factors in the training set for which a high PFM similarity to the query factor was predicted.
You can set a threshold for the PFM similarity, which determines whether a training factor is counted as a best match and thereby constitutes a candidate for a PFM transfer to the query factor. Additionally, you can set a limit for the maximum number of best match matrices that shall be merged in order to generate the predicted PFM.

To run SABINE on the example input file, using a PFM similarity threshold of 0.9 and merging at most 3 best match PFMs, you can use the command:

sh sabine.sh input/test.tf -s 0.9 -m 3


SABINE will return an output file (see format specification or example output file), which contains the best matches associated with a PFM similarity score and the transferred PFM, provided that a best match was found in the training set.
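
If your factors are spread over several input files, the same call can be repeated in a loop. The sketch below uses only the documented options -s, -m and -f. The directory name input/, the .tf and .out file extensions, and the DRY_RUN switch are assumptions made for illustration. With DRY_RUN=1 (the default here) the commands are only printed, so you can inspect them before anything is executed.

```shell
#!/bin/sh
# Sketch: run SABINE on several input files with identical settings.
# Assumed for illustration: input/ directory, .tf/.out naming, DRY_RUN.

THRESHOLD=0.9            # -s: best match threshold
MAX_MATCHES=3            # -m: max. number of best matches
DRY_RUN=${DRY_RUN:-1}    # 1 = print the commands instead of running them

# run_sabine FILE: process FILE and write the result next to it.
run_sabine() {
    input=$1
    output=${input%.tf}.out
    if [ "$DRY_RUN" = 1 ]; then
        echo "sh sabine.sh $input -s $THRESHOLD -m $MAX_MATCHES -f $output"
    else
        sh sabine.sh "$input" -s "$THRESHOLD" -m "$MAX_MATCHES" -f "$output"
    fi
}

for f in input/*.tf; do
    [ -e "$f" ] || continue    # skip if the pattern matched nothing
    run_sabine "$f"
done
```

Run the script from the SABINE installation directory, and set DRY_RUN=0 once the printed commands look right. Alternatively, several TFs can be packed into a single input file, as described in the file format section.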


Command-line options
You can set the parameters, customize the output, and specify path and file names using the command-line options of SABINE listed below.

-s <best match threshold>
Lower bound for the predicted PFM similarity of a best match (see parameters for details).
 
-m <max. number of best matches>
Limit for the number of PFMs that are merged to generate a prediction (see parameters for details).
 
-o <outlier filter threshold>
Maximal tolerated deviation of a best match PFM (see parameters for details).
 
-b <base directory>
Directory to save temporary files. If the option is omitted, the files are saved to an automatically generated base directory.
 
-f <output filename>
Output file which contains predicted best matches and transferred PFM.
 
-v <verbose option>
Enables/disables the standard output of SABINE.



Parameters

The quality of the predicted PFMs heavily depends on the parameter settings of SABINE. Thus, we recommend retaining the stringent default values, which have proven to produce highly accurate results.

Best Match Threshold

This parameter defines a lower bound for the predicted PFM similarity of a best match to the query factor. By default, this parameter is chosen dynamically at runtime depending on the quality of the best matches. However, it can also be set to a fixed value if desired. The more stringent the threshold, the higher the prediction accuracy, i.e., the expected correspondence between the transferred and the true PFM of the query factor. Choosing a less stringent cutoff increases the chance that a PFM transfer is possible, but also reduces the expected quality of the prediction.

Maximum Number of Best Matches

This parameter can be used to specify a limit for the number of PFMs which shall be merged to generate a prediction for the query factor. By default the PFMs of at most 5 best matches are merged to generate the predicted consensus PFM. If you want to avoid merging PFMs, you can set this parameter to 1. This setting will cause SABINE to directly transfer the PFM of the best match which has been predicted with highest confidence.
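
How the best match threshold and the maximum number of best matches interact can be illustrated with a small filter. This is only a sketch of the selection rule as described in this documentation, not SABINE's implementation. The function name `select_best_matches` and the input layout (one "ID score" pair per line) are assumptions, and the dynamically chosen default threshold is not modelled.

```shell
#!/bin/sh
# Sketch of the selection rule: keep candidates whose predicted PFM
# similarity reaches the threshold, then keep the top MAX of them.
# Hypothetical helper; input is "ID score" lines on standard input.

# select_best_matches THRESHOLD MAX
select_best_matches() {
    awk -v t="$1" '($2 + 0) >= (t + 0)' |   # drop scores below threshold
        LC_ALL=C sort -k2,2nr |             # highest score first
        head -n "$2"                        # cap the number of matches
}

# Example call:
#   select_best_matches 0.9 5 < candidates.txt
```

With a maximum of 1, only the single highest-scoring candidate survives, which corresponds to transferring the PFM of the top best match without merging.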

Outlier Filter Threshold

This parameter adjusts the stringency of the outlier filter, which prevents dissimilar PFMs from being merged (default: 0.5). The lower the cutoff, the more stringent the criterion for exclusion and the higher the homogeneity of the PFMs that are merged to generate a prediction.


SABINE file format specification

If you want to use the command-line interface of SABINE, you have to comply with the following file format guidelines. If you use the graphical interface instead, a properly formatted input file will be generated automatically.

To predict a PFM for a TF, SABINE needs information about the organism, superclass, protein sequence and DNA-binding domains of the respective factor. This information has to be formatted as specified in the SABINE input file format description.

The output of SABINE is a text file containing the predicted best matches, their predicted PFM similarity to the query factor, and the consensus PFM that results from merging the annotated PFMs of the best matches (see SABINE output file format specification).

Note that SABINE can be applied exclusively to eukaryotic transcription factors (see list of supported organisms).

The superclass of the TF specified in the input file has to be denoted as a decimal classification number (see TF Classification in TRANSFAC). The following table itemizes the decimal classification numbers for the five possible superclasses:

Superclass          Decimal class. no.
Basic domain        1.0.0.0.0
Zinc finger         2.0.0.0.0
Helix-turn-helix    3.0.0.0.0
Beta scaffold       4.0.0.0.0
Other               0.0.0.0.0
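
When input files are generated by a script, the table above can be encoded as a small lookup. The function name `superclass_to_decimal` is hypothetical; the names and numbers are taken directly from the table.

```shell
#!/bin/sh
# superclass_to_decimal NAME: print the decimal classification number
# for one of the five superclasses; fail for any other name.
superclass_to_decimal() {
    case $1 in
        "Basic domain")     echo "1.0.0.0.0" ;;
        "Zinc finger")      echo "2.0.0.0.0" ;;
        "Helix-turn-helix") echo "3.0.0.0.0" ;;
        "Beta scaffold")    echo "4.0.0.0.0" ;;
        "Other")            echo "0.0.0.0.0" ;;
        *)  echo "unknown superclass: $1" >&2
            return 1 ;;
    esac
}

# Example call:
#   superclass_to_decimal "Zinc finger"
```

Rejecting unknown names with a non-zero exit status keeps a misspelled superclass from ending up in an input file unnoticed.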

The input file format description specifies the input data for an individual TF. You can pack multiple TFs in one input file to sequentially process larger datasets with SABINE. In addition to the general description of the file formats, example input and output files for SABINE are provided.

SABINE input file

NA  Identifier
XX
SP  Organism (list of supported organisms)
XX
RF  Reference to UniProt (optional)
XX
CL  Classification (class acc. no. or decimal classification no. as in TRANSFAC)
XX
S1  Amino acid sequence
XX
S2  Alternative amino acid sequence (optional)
XX
FT  DNA-binding domain (domain ID   start position   end position)
XX
//
XX

View Example
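
A record in this layout can also be written by a script. The sketch below emits only the mandatory lines and omits the optional RF and S2 lines. The function name `write_sabine_record` and all values in the example call are placeholders. How long sequences are wrapped and what a domain ID looks like are not specified here, so compare the result with the example input file before use.

```shell
#!/bin/sh
# Sketch: print one SABINE input record to standard output.
# Arguments: identifier, organism, classification, sequence,
#            domain ID, domain start position, domain end position.
write_sabine_record() {
    cat <<EOF
NA  $1
XX
SP  $2
XX
CL  $3
XX
S1  $4
XX
FT  $5   $6   $7
XX
//
XX
EOF
}

# Example call (placeholder values, not a real transcription factor):
#   write_sabine_record MyFactor "Homo sapiens" 2.0.0.0.0 \
#       MAAAAAAAAAAAAAAAAAAA DOMAIN_ID 5 15 > my_factor.tf
```

Calling the function several times and appending the output to the same file gives a multi-factor input file. The organism has to be one of the supported organisms.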

SABINE output file

NA  Identifier
XX
BM  Best match (transcription factor ID   PFM similarity score)
XX
MA  A  C  G  T   rows: positions within the aligned sequences
MA               first column: position index
MA               columns: relative frequencies of A, C, G, T residues
MA               last column: consensus sequence in IUPAC code
XX
//
XX

View Example
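
For downstream scripts, the best matches can be pulled out of an output file with awk. This is a sketch under the assumption that the fields on NA and BM lines are separated by whitespace and that identifiers contain no spaces. The function name `best_matches` is hypothetical, and the optional second argument is a minimum score applied after the fact, independent of the threshold SABINE itself used.

```shell
#!/bin/sh
# best_matches FILE [MIN_SCORE]
# Print "query<TAB>best match ID<TAB>score" for every BM line in FILE,
# optionally keeping only scores of at least MIN_SCORE.
best_matches() {
    awk -v min="${2:-0}" '
        $1 == "NA" { query = $2 }              # remember the current query
        $1 == "BM" && ($3 + 0) >= (min + 0) {  # best match line
            print query "\t" $2 "\t" $3
        }
    ' "$1"
}

# Example call:
#   best_matches result.out 0.9
```

Because the current NA identifier is tracked, the function also works on output files that contain results for several query factors.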




Johannes Eichner
http://www.ra.cs.uni-tuebingen.de/software/SABINE/intro.htm
© 2008 University of Tübingen, Germany