Gene splice site prediction using Artificial Neural Networks
- Master's theses (TN-IDE) 
Gene prediction is the process of finding the location of genes and other meaningful subsequences in DNA sequences. This process is time-consuming and expensive when done with biochemical and genetic methods. Another approach is ab initio gene prediction, in which genes are predicted by analysing the sequence of nucleotides in the DNA with a statistical method. The process can then be carried out on a computer, which is faster and less expensive. A computational tool that predicts genes in DNA sequences is therefore of great value to biologists.
The DNA molecule contains subsequences that code for protein chains, and these proteins carry out the functions of the organism. The subsequences that code for these proteins are called genes. In eukaryotic cells, gene sequences consist of exons and introns. Exons code for proteins, while introns are removed in the splicing process. The transition between an exon and an intron is a splice site. The system proposed in this thesis tries to predict the splice sites in genes. The gene sequences used in this thesis are from the model plant Arabidopsis thaliana.
An artificial neural network is a mathematical method known from artificial intelligence and pattern recognition. It can be used as a general function approximator. In a training phase, the neural network is presented with input patterns and the corresponding desired outputs, and these are used to adjust the weights inside the network. Afterwards, the network can approximate an output for an arbitrary input pattern. Artificial neural networks have proven useful in many applications, including gene prediction. The proposed system connects an artificial neural network to a gene sequence and tries to predict the splice sites in that sequence.
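The training phase described above can be illustrated with a minimal sketch. The abstract does not specify the network architecture or the input encoding, so everything here is an assumption made for illustration: nucleotides are one-hot encoded, a single sigmoid neuron stands in for the thesis's network, and the four-letter toy windows and their labels are invented.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def one_hot(window):
    """Encode a nucleotide string as a flat list of 0/1 values, four per base.

    One-hot encoding is an assumption; the abstract does not name the encoding.
    """
    encoded = []
    for base in window:
        encoded.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return encoded


def predict(weights, bias, inputs):
    """Output of a single sigmoid neuron (a stand-in for a full network)."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


def train(patterns, targets, epochs=200, rate=0.5, seed=0):
    """Adjust the weights so the outputs move toward the desired targets."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in patterns[0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(patterns, targets):
            # Difference between the desired and the actual output.
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Invented toy data: windows ending in "GT" are labelled 1, the others 0.
windows = ["AGGT", "CGGT", "AGCA", "TTCA"]
labels = [1.0, 1.0, 0.0, 0.0]
weights, bias = train([one_hot(w) for w in windows], labels)
```

After training, `predict(weights, bias, one_hot("AGGT"))` should be close to 1 and `predict(weights, bias, one_hot("TTCA"))` close to 0. A real splice site predictor would use longer windows and, most likely, hidden layers.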
A window of 60 nucleotides slides over the gene, and the neural network evaluates the pattern in this window to predict whether it contains any splice sites. The output of this evaluation is a numeric value, and the values are accumulated for each nucleotide as the window slides over the gene. The accumulated score is used as an indicator of where the splice sites are located. To find the exact location of a splice site, a second-order polynomial is fitted through successive data points in the splice site indicator. The vertex (maximum) of this parabola is used as the predicted location of the splice site.
The system has been developed and a number of experiments have been performed. It was trained on a data set of 15551 genes, and the neural network was benchmarked on a separate set of 5000 genes. The best neural network achieves a sensitivity of 0.851, a specificity of 0.844, and a correlation coefficient of 0.568. These are reasonable results considering that no prior knowledge about special splice site signals was given to the system.
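The sliding-window accumulation and the parabola fit can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: it assumes each window yields one score that is added to every nucleotide the window covers, and that the parabola is fitted through exactly three neighbouring points around a local maximum. The function names and the `score_window` callback are hypothetical.

```python
def accumulate_scores(sequence, score_window, window_size=60):
    """Slide a window over the sequence and accumulate a score per nucleotide.

    Assumption: the score of a window is added to every position it covers.
    `score_window` is a hypothetical callback standing in for the network.
    """
    totals = [0.0] * len(sequence)
    for start in range(len(sequence) - window_size + 1):
        score = score_window(sequence[start:start + window_size])
        for pos in range(start, start + window_size):
            totals[pos] += score
    return totals


def parabola_peak(values, index):
    """Return the vertex position of the parabola through three points.

    The points are (index - 1, index, index + 1) with their values.
    Assumption: three points are used; the abstract only says "successive
    data points". Falls back to `index` if the points do not form a peak.
    """
    left, mid, right = values[index - 1], values[index], values[index + 1]
    curvature = left - 2.0 * mid + right
    if curvature >= 0.0:
        return float(index)  # flat or upward-opening: no maximum to refine
    return index + 0.5 * (left - right) / curvature


# Toy indicator sampled from a parabola whose true maximum lies at 2.3.
indicator = [-(x - 2.3) ** 2 for x in range(5)]
best = max(range(1, len(indicator) - 1), key=lambda i: indicator[i])
peak = parabola_peak(indicator, best)
```

Fitting a parabola lets the predicted location fall between two sample positions, so the estimate is not limited to the position of the highest accumulated score.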
Master's thesis in Information technology