Dräger, Andreas

Automatische und vergleichende Analyse bakterieller Genome mit Schwerpunkt auf Ralstonia/Cupriavidus-Arten sowie verwandten Proteobakterien

Diplomarbeit, Martin-Luther-Universität Halle-Wittenberg, von-Seckendorff-Platz 1, 06120 Halle (Saale), 2005


Abstract

Knowledge generation by comparative genomics is becoming more and more meaningful due to the increased availability of genomic sequence data. This thesis compares the genomes of β-proteobacteria, a taxonomic group of bacteria showing a high degree of diversity. Sequenced species of this group inhabit various ecological niches. They therefore need special features to survive toxic heavy metal concentrations, to act as pathogens of plants or animals, or to degrade organic substances that are normally difficult to break down. These characteristic features are mediated by special proteins, which raises the question of whether characteristic proteins can be detected automatically by comparing the sequences of related proteins from different species. By comparing highly conserved proteins with essential life functions in a taxonomic context, a measure is to be developed that normalizes the later comparison of these specialized proteins. The taxonomic context of different bacterial species can be derived from comparisons of the highly conserved genes for the 16S rRNA, which is involved in bacterial protein synthesis. Several algorithms have been published for performing such comparisons. For global alignments, the Needleman-Wunsch algorithm or a special hidden Markov model can be used. Local alignments can be computed with the Smith-Waterman algorithm or its heuristic approximation, BLAST. Global alignments compare whole sequences, whereas local alignments find the best-conserved regions shared by two sequences. Proteins contain functional regions (domains) surrounded by less important regions, so they can be compared by local alignments. The complete 16S rRNA sequence is essential for its function, so taxonomic analysis requires global alignments. To evaluate the quality of an alignment, substitution matrices such as BLOSUM, PAM, or NUC provide a score for substituting one symbol with another. The sum of these scores is the score of the alignment.
Several online database servers such as NCBI, RDP, EMBL, JGI, SwissProt, TIGR, and many others provide genomic and proteomic sequence data for different species. These databases grow at an exponential rate because new high-throughput techniques for sequencing proteins and genetic elements are in use and because data can be uploaded by individual authors, laboratories, or other scientific institutions. However, using these data to perform comparative analyses locally requires an efficient way of storing them. Storage problems arise from the differing naming conventions for species, genes, proteins, and other biological data. To maintain the data consistently and without redundancy, a MySQL database server was installed, with BioSQL as the relational schema. Integrating downloaded data into the local database required several pre-processing steps (data cleaning). A client program for comparing the locally stored data was implemented in Java using the open-source library BioJava. The resulting database application also provides interactive visualizations of the stored data, such as the taxonomic tree and genetic annotations. Its graphical user interface allows users to interact with BioSQL without knowledge of databases. In addition, different file formats for biological sequence data can be converted into one another, and a dedicated dialog allows global and local sequence alignments to be performed interactively. The BioJava library had to be extended to provide the full functionality of the program. The application was applied to a selection of essential proteins (involved in DNA transcription and translation) and non-essential proteins (involved in heavy metal resistance) of 15 β-proteobacteria and 7 γ-proteobacteria, as well as to the 16S rRNA genes. All comparisons are relative to the well-investigated species Escherichia coli K12.
Plotting protein similarity against taxonomic neighborhood shows that, for essential proteins, similarity increases almost linearly with taxonomic proximity. The other proteins show higher variability. This leads to the conclusion that proteins with special features are less conserved than highly essential proteins, so that they can be detected from the taxonomic context by normalization with essential proteins.
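
The difference between global and local alignment scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, which was written in Java with BioJava. The function names and the simple match/mismatch/gap values are assumptions chosen for illustration; they stand in for a real substitution matrix such as BLOSUM or PAM.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score for aligning all of a with all of b.

    The match/mismatch/gap values are illustrative assumptions that stand
    in for a substitution matrix and a linear gap penalty.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # substitute or match
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at 0 lets an alignment restart anywhere.
            cell = max(0,
                       prev[j - 1] + sub,
                       prev[j] + gap,
                       curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Two sequences that share only the short conserved region "CGT".
    s1, s2 = "AAACGTAAA", "CCCCGTCCC"
    print("global:", global_score(s1, s2))
    print("local: ", local_score(s1, s2))
```

In this toy case the local score rewards the shared region and ignores the flanks, while the global score is pulled down by the mismatched flanks. This mirrors the abstract's reasoning for using local alignments on proteins with conserved domains and global alignments on the 16S rRNA.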


Downloads and Links

[pdf]


BibTeX

@mastersthesis{Draeger2005,
  author = {Dr\"ager, Andreas},
  title = {{Automatische und vergleichende Analyse bakterieller Genome mit Schwerpunkt
	auf \emph{Ralstonia}/\emph{Cupriavidus}-Arten sowie verwandten Proteobakterien}},
  school = {Martin-Luther-Universit\"at Halle-Wittenberg},
  year = {2005},
  type = {Diplomarbeit},
  address = {von-Seckendorff-Platz 1, 06120 Halle (Saale)},
  month = dec,
  abstract = {Knowledge generation by comparative genomics is becoming more and
	more meaningful due to the increased availability of genomic sequence
	data. This thesis compares the genomes of $\beta$-proteobacteria,
	a taxonomic group of bacteria showing a high degree of diversity.
	Sequenced species of this group inhabit various ecological niches.
	They therefore need special features to survive toxic heavy metal
	concentrations, to act as pathogens of plants or animals, or to degrade
	organic substances that are normally difficult to break down. These
	characteristic features are mediated by special proteins, which raises
	the question of whether characteristic proteins can be detected
	automatically by comparing the sequences of related proteins from
	different species. By comparing highly conserved proteins with essential
	life functions in a taxonomic context, a measure is to be developed
	that normalizes the later comparison of these specialized proteins.
	The taxonomic context of different bacterial species can be derived
	from comparisons of the highly conserved genes for the 16S rRNA,
	which is involved in bacterial protein synthesis. Several algorithms
	have been published for performing such comparisons. For global
	alignments, the Needleman-Wunsch algorithm or a special hidden Markov
	model can be used. Local alignments can be computed with the
	Smith-Waterman algorithm or its heuristic approximation, BLAST. Global
	alignments compare whole sequences, whereas local alignments find
	the best-conserved regions shared by two sequences. Proteins contain
	functional regions (domains) surrounded by less important regions,
	so they can be compared by local alignments. The complete 16S rRNA
	sequence is essential for its function, so taxonomic analysis requires
	global alignments. To evaluate the quality of an alignment, substitution
	matrices such as BLOSUM, PAM, or NUC provide a score for substituting
	one symbol with another. The sum of these scores is the score of
	the alignment. Several online database servers such as NCBI, RDP,
	EMBL, JGI, SwissProt, TIGR, and many others provide genomic and
	proteomic sequence data for different species. These databases grow
	at an exponential rate because new high-throughput techniques for
	sequencing proteins and genetic elements are in use and because data
	can be uploaded by individual authors, laboratories, or other scientific
	institutions. However, using these data to perform comparative analyses
	locally requires an efficient way of storing them. Storage problems
	arise from the differing naming conventions for species, genes,
	proteins, and other biological data. To maintain the data consistently
	and without redundancy, a MySQL database server was installed, with
	BioSQL as the relational schema. Integrating downloaded data into
	the local database required several pre-processing steps (data cleaning).
	A client program for comparing the locally stored data was implemented
	in Java using the open-source library BioJava. The resulting database
	application also provides interactive visualizations of the stored
	data, such as the taxonomic tree and genetic annotations. Its graphical
	user interface allows users to interact with BioSQL without knowledge
	of databases. In addition, different file formats for biological
	sequence data can be converted into one another, and a dedicated
	dialog allows global and local sequence alignments to be performed
	interactively. The BioJava library had to be extended to provide
	the full functionality of the program. The application was applied
	to a selection of essential proteins (involved in DNA transcription
	and translation) and non-essential proteins (involved in heavy metal
	resistance) of 15 $\beta$-proteobacteria and 7 $\gamma$-proteobacteria,
	as well as to the 16S rRNA genes. All comparisons are relative to
	the well-investigated species \emph{Escherichia coli} K12. Plotting
	protein similarity against taxonomic neighborhood shows that, for
	essential proteins, similarity increases almost linearly with taxonomic
	proximity. The other proteins show higher variability. This leads
	to the conclusion that proteins with special features are less conserved
	than highly essential proteins, so that they can be detected from
	the taxonomic context by normalization with essential proteins.},
  pdf = {https://nirvana.informatik.uni-halle.de/~molitor/Diplomarbeiten/2005/draeger_andreas.pdf},
}