GWAS Pathway Identifier is available as a stand-alone application. It combines GWAS and pathway data with known and predicted protein-interaction data to identify disease-specific SNP sets.
The application provides several analysis methods that differ in how the SNP sets are composed. The sets are evaluated with a variation of Fisher's combined statistic described by De la Cruz et al. (2009). In addition to an html report, which presents the outcomes in neatly arranged tables, GWAS Pathway Identifier creates a csv result file for each selected analysis method and can generate pathway graphs representing the investigated pathway and SNP set. It is also possible to change the number of permutations used for the evaluation of the SNP sets to improve the accuracy of the results.
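
For orientation, the classical form of Fisher's combined statistic for a set of k SNPs with individual association p-values p_1, ..., p_k is shown below. This is the textbook formula only; the variation of De la Cruz et al. used by the application may differ in detail and is not reproduced here.

```latex
X^2 = -2 \sum_{i=1}^{k} \ln p_i
```

Under the null hypothesis and with independent tests, X^2 follows a chi-squared distribution with 2k degrees of freedom. SNPs within one set are often correlated, which is a common reason to assess such a statistic by permutation instead of relying on that reference distribution.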

The analysis possibilities

GWAS Pathway Identifier offers six analysis methods for creating SNP sets:
  • The pathway analysis method creates sets containing the SNPs in all genes of a pathway.
  • The characteristic pathway analysis method creates sets containing the SNPs in genes that occur exclusively in the pathway.
  • The pathway interaction analysis method creates sets containing the SNPs in those genes of the pathway that have a specific interaction class.
  • The characteristic pathway interaction analysis method creates sets containing the SNPs in genes that have a specific interaction class and occur exclusively in the pathway.
  • The single set method creates one single set containing all entered SNPs. It is recommended to run this analysis with a limited number of SNPs, for example by entering a specific SNP or gene file.
  • The gene set method creates one set for each gene that contains one or more SNPs.

How to run the application

File preparation

To start GWAS Pathway Identifier, at least four files are needed:
  1. a SNP annotation file
  2. a bed file
  3. a bim file
  4. a fam file

The last three files must share the same base file name and can be generated with the PLINK toolset. If the GWAS data is saved as ped, map and assoc files, the bed, bim and fam files can be generated with the following command:
    
      plink --file mydata --make-bed
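
Because the bed, bim and fam files must share one base name, it can help to check that all three exist before starting the analysis. The function below is a minimal sketch; its name, check_plink_set, is hypothetical and not part of PLINK or GWAS Pathway Identifier.

```shell
# Hypothetical helper: succeeds only if <base>.bed, <base>.bim and <base>.fam all exist.
check_plink_set() {
    base=$1
    for ext in bed bim fam; do
        if [ ! -f "$base.$ext" ]; then
            echo "missing: $base.$ext" >&2
            return 1
        fi
    done
    return 0
}
```

A possible use is `check_plink_set mydata && echo "input set complete"`, where mydata is the base name of the three files.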

Basic pathway analysis

After preparing the input files, call GWAS Pathway Identifier with the following command:

      java -jar GWASPathwayIdentifier.jar -fan <annotation file> -ff <fam file> 
                                          -fb <bed file> -fi <bim file> 
                                          -o <output folder> -s <species>

With this call, the following analysis methods are executed: pathway, characteristic pathway, pathway interaction and characteristic pathway interaction analysis. The results are provided in an html file named "result.html", which contains neatly arranged tables with the outcomes. Additionally, a csv file is saved in the given output folder for each analysis method.

Program call with specific SNPs or genes

To restrict the SNPs or genes that are used, a SNP or gene file can be entered that contains those SNPs or genes which should be investigated.

Program call with a SNP file:

      java -jar GWASPathwayIdentifier.jar -fan <annotation file> -ff <fam file> 
                                          -fb <bed file> -fi <bim file> 
                                          -o <output folder> -s <species>
                                          -fs <snp file>

Program call with a gene file:

      java -jar GWASPathwayIdentifier.jar -fan <annotation file> -ff <fam file> 
                                          -fb <bed file> -fi <bim file> 
                                          -o <output folder> -s <species>
                                          -fg <gene file>

Disease run

During a disease run, the analysis is performed three times. The first run is a normal program run using all entered SNPs or genes. The second run uses exclusively the genes that are defined in the entered disease file, and the last run is performed without the entered disease genes. Finally, the best results of all three runs are compared and saved in a "best comparison" table (see result files).

      java -jar GWASPathwayIdentifier.jar -fan <annotation file> -ff <fam file> 
                                          -fb <bed file> -fi <bim file> 
                                          -o <output folder> -s <species>
                                          -fd <disease gene file>

If only the run with the disease genes or the run without the disease genes should be processed, the "-d true" option can be added. Note, however, that the final comparison cannot be performed in this case, because the reference results of the normal run are missing.

      java -jar GWASPathwayIdentifier.jar -fan <annotation file> -ff <fam file> 
                                          -fb <bed file> -fi <bim file> 
                                          -o <output folder> -s <species>
                                          -fd <disease gene file> -d true

Further program settings

Four further parameters are available to configure the application:
  • "-g true": generates pathway graphs for the results of the pathway analysis methods.
  • "-t <double value>": defines the threshold that is used to create the best lists for the pathway analysis methods. For more details, see the results section.
  • "-p <int value>": changes the number of permutations used for the evaluation of the SNP sets. The default value is 1000 permutations; increasing the number is recommended to improve the accuracy of the evaluation. Note, however, that a higher number of permutations makes the application run longer.
  • "-mt <int value>": defines how many processor units are used for the analysis. The more processors are selected, the faster the application runs.
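
These settings can be combined in one call. The following shell sketch checks the values against the ranges listed in the command-line options section before assembling the command. All file names and values are placeholders, and the command is only printed, not executed.

```shell
# Placeholder values; adjust them to your own data.
perms=10000      # -p: documented range is 1000 to 10000
threshold=0.01   # -t: documented range is 0.0 to 1.0
threads=4        # -mt: number of processors to use

# Validate the ranges before starting a long run.
if [ "$perms" -lt 1000 ] || [ "$perms" -gt 10000 ]; then
    echo "permutations out of range: $perms" >&2
    exit 1
fi
if ! awk -v t="$threshold" 'BEGIN { exit !(t >= 0 && t <= 1) }'; then
    echo "threshold out of range: $threshold" >&2
    exit 1
fi

# Assemble the call (the input file names are hypothetical).
cmd="java -jar GWASPathwayIdentifier.jar -fan annotation.txt -ff mydata.fam -fb mydata.bed -fi mydata.bim -o results -s hsa -g true -t $threshold -p $perms -mt $threads"

# Dry run: print the command. To execute it, use: eval "$cmd"
echo "$cmd"
```

Printing the command first makes it easy to review the parameter values before committing to a run with many permutations.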

The result files

All results of an application run are summarized in an html file named "results.html", which consists of several linked html tables that can be sorted individually. Additionally, a separate csv file is generated for each analysis method.
During a disease run, three basic pathway analyses are performed using different SNPs (see disease run). The files of the first run are named as described above. In the second run, the usual file names are extended with the suffix "_diseaseGenes", and in the third run with the suffix "_without_diseaseGenes". Furthermore, an additional file called "bestListComparison.csv" is created. This file provides a comparison between the three best files of the disease run.

The command-line options in detail

Required options

Input files (required)

Define the files that should be entered.

-fan <File>, --annotation-input-file[ |=]<File>
The annotation file is a tab-delimited .txt file and must contain three columns: one column with the SNP identifier, one column with the gene identifier, and one column with the SNP location. The columns need a header: "SNP" for the SNP identifier column and "Location" for the SNP location column. The header of the gene column must specify the gene identifier that is used. Possible gene identifiers and column headers are "GeneSymbol", "EntrezGeneID" or "RefSeqID". Possible SNP location values are "-1", "3UTR", "5UTR", "UTR", "coding", "flanking_3UTR", "flanking_5UTR", or "intron". It is possible to use the annotation file provided by the gene chip manufacturer or a custom-made file. Accepts text files (*.txt).
Default value: user's home directory
-fb <File>, --bed-input-file[ |=]<File>
The bed file; for format information, see Purcell et al., 2007. Accepts bed files.
Default value: user's home directory
-fi <File>, --bim-input-file[ |=]<File>
The bim file; for format information, see Purcell et al., 2007. Accepts bim files.
Default value: user's home directory
-ff <File>, --fam-input-file[ |=]<File>
The fam file; for format information, see Purcell et al., 2007. Accepts fam files.
Default value: user's home directory
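
To illustrate the annotation file format described for -fan, the shell sketch below writes a tiny tab-delimited file and checks it against the documented rules (header names, three columns, allowed location values). The SNP and gene identifiers are made-up placeholders, and the check is only an illustration, not the validation the application itself performs.

```shell
# Write a tiny tab-delimited annotation file; all identifiers are made-up placeholders.
printf 'SNP\tGeneSymbol\tLocation\n'  >  annotation.txt
printf 'rs0000001\tGENE_A\tintron\n'  >> annotation.txt
printf 'rs0000002\tGENE_A\tcoding\n'  >> annotation.txt
printf 'rs0000003\tGENE_B\t3UTR\n'    >> annotation.txt

# Check the header names, the column count and the location values.
if awk -F '\t' '
    BEGIN {
        bad = 0
        n = split("-1 3UTR 5UTR UTR coding flanking_3UTR flanking_5UTR intron", v, " ")
        for (i = 1; i <= n; i++) ok[v[i]] = 1
    }
    NR == 1 {
        # Remember the position of each header so the column order does not matter.
        for (i = 1; i <= NF; i++) col[$i] = i
        gene = ("GeneSymbol" in col) || ("EntrezGeneID" in col) || ("RefSeqID" in col)
        if (NF != 3 || !("SNP" in col) || !("Location" in col) || !gene) { bad = 1; exit }
        next
    }
    NF != 3 || !($(col["Location"]) in ok) { bad = 1 }
    END { exit bad }
' annotation.txt; then
    echo "annotation file looks valid"
else
    echo "annotation file has problems" >&2
fi
```

The same awk check can be pointed at a manufacturer-provided file, provided that file has already been reduced to the three required columns.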

Species

Define the species that is investigated.

-s <String>, --species[ |=]<String>
The target species for the analysis is defined by its KEGG organism abbreviation (for example, "hsa" for human). A list of all available organisms is provided by KEGG.
Default value: "hsa"

Output folder

Define the default output folder.

-o <File>, --output-folder[ |=]<File>
Path where the result files are saved.
Default value: user's home directory

Optional application options

Input files (optional)

-fs <File>, --snp-input-file[ |=]<File>
If not all SNPs of the GWA study should be investigated, a SNP file can be passed to the application. It contains one SNP identifier per line and no header. Accepts text files (*.txt).
Default value: user's home directory
-fg <File>, --gene-input-file[ |=]<File>
If not all genes of the GWA study should be investigated, a gene file can be passed to the application. The first line of the gene file is a header that specifies the gene identifier used; the following lines list the genes that should be investigated (one gene per line). Valid headers are "GeneSymbol", "EntrezGeneID" or "RefSeqID". Accepts text files (*.txt).
-fd <File>, --disease-gene-input-file[ |=]<File>
The disease gene file has the same format as the gene file. It is used for a special disease analysis. For more details, see the disease run section.
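
As a small illustration of the two optional file formats, the sketch below creates a SNP file and a gene file from the shell. All identifiers are made-up placeholders; real files would list identifiers that occur in the annotation file.

```shell
# SNP file for -fs: one SNP identifier per line, no header.
# The rs numbers are made-up placeholders.
printf '%s\n' rs0000001 rs0000002 rs0000003 > snps.txt

# Gene file for -fg (the disease gene file for -fd has the same format):
# the first line names the identifier type, then one gene per line.
# GENE_A and GENE_B are placeholders.
printf '%s\n' GeneSymbol GENE_A GENE_B > genes.txt

# Quick check that the gene file header is one of the valid values.
case "$(head -n 1 genes.txt)" in
    GeneSymbol|EntrezGeneID|RefSeqID) echo "gene file header ok" ;;
    *) echo "unexpected gene file header" >&2 ;;
esac
```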

Analysis options

Defines the possible analysis options.

-p <Integer>, --permutations[ |=]<Integer>
Number of permutations to run.
Arguments must fit into the range {[1000,10000]}.
Default value: 1000
-mt <Integer>, --maxthreads[ |=]<Integer>
Number of processors to use during the application run.
Default value: "available processors-1"
-t <Double>, --threshold[ |=]<Double>
Threshold that defines the cutoff for the best list creation (for details, see the results section).
Arguments must fit into the range {[0.0,1.0]}.
Default value: 0.05
-d <Boolean>, --just-do-disease-run[ |=]<Boolean>
If true, just a disease run is performed with the entered disease file. For more details see the disease run section.
All possible values for type <Boolean> are: true and false.
Default value: false

Analysis methods

Defines the possible analysis methods that should be performed.

-as <Boolean>, --analyse-one-input-set[ |=]<Boolean>
If true, one set is built and analysed based on the entered SNPs.
All possible values for type <Boolean> are: true and false.
Default value: false
-ag <Boolean>, --analyse-gene-units[ |=]<Boolean>
If true, all genes are analysed separately based on the entered SNPs.
All possible values for type <Boolean> are: true and false.
Default value: false
-ap <Boolean>, --analyse-pathway-units[ |=]<Boolean>
If true, pathways are analysed depending on all SNPs occurring in the pathway.
All possible values for type <Boolean> are: true and false.
Default value: true
-ac <Boolean>, --analyse-characteristic-pathway-units[ |=]<Boolean>
If true, pathways are analysed depending on all SNPs occurring exclusively in the pathway.
All possible values for type <Boolean> are: true and false.
Default value: true
-ai <Boolean>, --analyse-interaction-pathway-units[ |=]<Boolean>
If true, pathways are analysed depending on all SNPs occurring in the pathway and the interaction information of the gene of the SNP.
All possible values for type <Boolean> are: true and false.
Default value: true
-aci <Boolean>, --analyse-characteristic-interaction-pathway-units[ |=]<Boolean>
If true, pathways are analysed depending on all SNPs occurring exclusively in the pathway and the interaction information of the gene of the SNP.
All possible values for type <Boolean> are: true and false.
Default value: true
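
As a sketch of how these switches combine, the call below would enable only the single set and gene set methods (both off by default) and restrict the input with a SNP file, as recommended for the single set method. The file names are placeholders, and the command is printed rather than run.

```shell
# Switch off the four pathway-based methods (default: true) ...
pathway_off="-ap false -ac false -ai false -aci false"
# ... and switch on the single set and gene set methods (default: false).
set_on="-as true -ag true"

# Hypothetical input file names; snps.txt limits the analysed SNPs.
cmd="java -jar GWASPathwayIdentifier.jar -fan annotation.txt -ff mydata.fam -fb mydata.bed -fi mydata.bim -o results -s hsa -fs snps.txt $set_on $pathway_off"

# Dry run: print the command. To execute it, use: eval "$cmd"
echo "$cmd"
```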

Visualization options

Define if additional graphs should be created.

-g <Boolean>, --create-graphs[ |=]<Boolean>
If true, for each pathway unit a graph is created which is linked to the result file.
All possible values for type <Boolean> are: true and false.
Default value: true