The use of principal component analysis for predicting genomic breeding values
- Master's theses (IHA) 
In recent years, the idea of predicting quantitative traits and diseases from genotypic information has attracted major interest in animal and plant breeding as well as in human genetics. However, important questions and problems remain to be addressed, and some of them are statistical. The statistical problems mainly concern multicollinearity arising from the huge amount of available data. In addition, the number of effects to be estimated (p) is much larger than the number of observations (n), which rules out ordinary least squares. Principal component analysis (PCA) is a multivariate statistical method often used to deal with these problems. The objective of this study was to investigate the use of PCA for predicting genomic breeding values. Data on 1,609 first-lactation Holstein heifers were analysed, including test-day milk, fat and protein yields. The animals originated from four countries (Ireland, the United Kingdom, the Netherlands and Sweden) and were genotyped within the RobustMilk project with the Illumina BovineSNP50 BeadChip. After editing, 37,069 SNPs remained. Two models were compared for genomic prediction: i) principal component regression (PCR), used to estimate genomic breeding values directly, with principal components (PCs) selected either by their eigenvalues or by their contribution to the regression sum of squares (SS); and ii) a best linear unbiased prediction model with a genomic relationship matrix (GBLUP), developed to compare its accuracies with those obtained by the PCR models. In a third case, PCs extracted from the G-matrix were added to the GBLUP model as fixed effects to investigate the impact of population structure on the prediction of genomic breeding values. For validation, the dataset was split four ways into training (reference population) and testing subsets, with each testing subset containing all animals from a single country.
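As a rough illustration of PCR with eigenvalue-based selection of PCs, the following NumPy sketch regresses phenotypes on the leading principal components of a genotype matrix. The function name `pcr_fit_predict`, its arguments, and the use of a singular value decomposition are assumptions made for this example and are not taken from the thesis. Keeping the PCs with the largest singular values is equivalent to keeping those with the largest eigenvalues of the genotype covariance matrix.

```python
import numpy as np


def pcr_fit_predict(X_train, y_train, X_test, n_components):
    """Principal component regression on the leading PCs (hypothetical helper).

    X_train, X_test: (animals x SNPs) genotype matrices, e.g. coded 0/1/2.
    y_train: phenotypes of the training animals.
    n_components: number of leading PCs kept (largest eigenvalues first).
    """
    # Centre with training-set means only, so no test information leaks in.
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean

    # Thin SVD: rows of Vt are the PC loadings, sorted by decreasing
    # singular value (equivalently, decreasing eigenvalue).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T      # shape (p, k)
    scores = Xc @ loadings              # shape (n, k)

    # The k score columns are mutually orthogonal, so least squares is
    # well-posed even when p is much larger than n.
    gamma, *_ = np.linalg.lstsq(scores, y_train - y_mean, rcond=None)

    # Project test animals onto the same PCs and apply the regression.
    return y_mean + (X_test - x_mean) @ (loadings @ gamma)
```

Selecting PCs by their regression sum of squares instead would mean ranking the columns of `scores` by how much of the phenotypic variation each one explains, rather than taking them in eigenvalue order.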
Predictive ability was calculated as the Pearson correlation between the predicted genomic values and the phenotypes. PCR with PCs selected by their eigenvalues gave high accuracies and outperformed both the PCR (SS) and GBLUP models. Accuracies varied between populations and traits. Interestingly, the highest accuracies were obtained for GBR, the only population in the dataset that PCA identified as genetically distinct, using only the first PC for protein yield and only the first two PCs for milk yield. In the GBLUP models, accuracies increased in all cases (~40% on average) when PCs were added to the model. The main benefits of PCR are its simplicity, fast computation, reduction of the data dimension (>96%), and the ability both to predict breeding values and to identify groups in the data. These properties, together with predictions at least as accurate as those of GBLUP on real data, mark PCR as an attractive tool for animal breeding. However, the variation in the number of PCs needed to achieve the highest accuracies could be a drawback of the method. Because the highest accuracies were obtained for the only group of animals genetically separated from the rest, we hypothesize that PCR could be tested for across-breed genomic prediction.
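The validation scheme, in which each country is held out in turn and predictive ability is the Pearson correlation between predictions and phenotypes, can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the helper names and the `fit_predict` callback are hypothetical, and any prediction model (PCR, GBLUP or otherwise) could be plugged in through that callback.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def leave_one_group_out(groups):
    """Yield (held_out_label, train_indices, test_indices) per group label."""
    for label in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == label]
        train = [i for i, g in enumerate(groups) if g != label]
        yield label, train, test


def predictive_ability(phenotypes, groups, fit_predict):
    """Correlation between predictions and phenotypes for each held-out group.

    fit_predict(train_indices, test_indices) is a hypothetical callback that
    trains on the training animals and returns one prediction per test animal.
    """
    result = {}
    for label, train, test in leave_one_group_out(groups):
        predicted = fit_predict(train, test)
        observed = [phenotypes[i] for i in test]
        result[label] = pearson(predicted, observed)
    return result
```

With four country labels in `groups`, this produces four training/testing splits, one per country, matching the design described in the abstract.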