DNA Markers and Computer Science Methodology Can be Used to Trace Individual Ancestry
1 Oct, 2007 12:57 pm
Genetic markers (variations in the DNA among individuals) can be used to infer population structure, a task that remains a central challenge in many areas of genetics, such as the study of population history and the search for susceptibility genes for common disorders. In such settings, it is often desirable to reduce the number of markers needed for structure identification. Based on the properties of a powerful dimensionality reduction technique (Principal Components Analysis), an international team of collaborators has developed a novel algorithm that does not depend on any prior assumptions and can be used to identify a small set of structure-informative markers.
As the cost of technology for SNP determination drops, it is now feasible to study data from thousands of individuals for hundreds of thousands of SNPs. This tremendous volume of data makes powerful computational methods imperative. The paper by Dr. Paschou and colleagues, appearing in the September issue of PLoS Genetics, describes a fast and novel algorithm for selecting genetic markers that carry information about individual ancestry. The method can be applied to data from diverse population samples, without knowing in advance the origin of each individual. The selected genetic markers (i.e. SNPs) can then be used to predict the membership of an individual in a particular population with almost 100% accuracy.
The described algorithm is based on Principal Components Analysis (PCA), which is why the selected SNPs are called PCA-correlated SNPs. PCA is a classical dimensionality reduction tool, meaning that it can be used to extract the fundamental structure (meaningful dimensions) of very large datasets. It was first used in population genetics by Cavalli-Sforza almost 30 years ago, and it has recently started regaining favor for the analysis of human population structure (see for example the references below). Until now, however, it had not been used to select SNPs that capture structure in a population and differentiate individuals from different backgrounds. All other methods in the literature for identifying ancestry-informative markers either rely on a specific model or are frequency based and demand prior knowledge of the origin of individuals (examples of such measures are Fst, informativeness for assignment, and δ).
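To make the idea concrete, here is a minimal Python sketch of one way to rank SNPs with PCA. The article does not spell out the scoring rule, so the rule used here (summing each SNP's squared loadings on the top principal components) is an assumption, as are the function names `pca_correlated_snps` and `simulate_two_populations`, the 0/1/2 genotype coding, and the simulated allele frequencies. It should be read as an illustration of the approach, not as the authors' implementation.

```python
import numpy as np


def pca_correlated_snps(genotypes, n_components, n_select):
    """Rank SNPs by their weight in the top principal components.

    genotypes: (individuals x SNPs) array, assumed coded 0/1/2.
    Returns the indices of the n_select highest-scoring SNPs and
    the score of every SNP. No population labels are needed.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # centre each SNP column
    # Rows of vt are the principal axes expressed in SNP space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    top = vt[:n_components]
    # Assumed scoring rule: sum of squared loadings per SNP.
    scores = (top ** 2).sum(axis=0)
    order = np.argsort(scores)[::-1]
    return order[:n_select], scores


def simulate_two_populations(rng, n_per_pop=100, n_informative=5, n_snps=50):
    """Hypothetical toy data: only the first n_informative SNPs
    differ in allele frequency between the two populations."""
    freq_a = np.full(n_snps, 0.5)
    freq_b = np.full(n_snps, 0.5)
    freq_a[:n_informative] = 0.9
    freq_b[:n_informative] = 0.1
    pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    return np.vstack([pop_a, pop_b])


rng = np.random.default_rng(0)
genotypes = simulate_two_populations(rng)
selected, scores = pca_correlated_snps(genotypes, n_components=1, n_select=5)
print(sorted(selected.tolist()))
```

In this toy setting the selection never sees which population an individual came from, which mirrors the article's point that no prior knowledge of origin is required.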
The proposed method is evaluated on a previously described dataset (more than 10,000 SNPs for 11 populations, made available for analysis by Dr. Mark Shriver), and it is demonstrated that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. Indeed, for most geographic regions, fewer than 50 SNPs suffice to differentiate among populations in the studied dataset. The methods are also validated on the HapMap populations (Yoruba from Africa, Europeans, Chinese and Japanese; see http://www.hapmap.org for a description of the HapMap project) and achieve perfect differentiation among three continents using only 14 PCA-correlated SNPs (selected out of about 10,000 SNPs). The Chinese and Japanese populations can be easily differentiated with fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. The authors also managed to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations.
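The assignment step can be sketched in a few lines of Python. The article says only that "a simple clustering algorithm" is used, so the choice of k-means here is an assumption, as are the function name `kmeans`, the deterministic farthest-point initialisation, and the tiny hand-made genotype table (six individuals typed at three already-selected SNPs, coded 0/1/2).

```python
import numpy as np


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    X = np.asarray(points, dtype=float)
    # Start from the first row, then repeatedly add the point that is
    # farthest from all centres chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Squared distance from every individual to every centre.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical genotypes at three selected SNPs (rows are individuals).
selected_genotypes = [
    [2, 2, 0], [2, 1, 0], [1, 2, 0],   # individuals from one population
    [0, 0, 2], [0, 1, 2], [1, 0, 2],   # individuals from another
]
labels, centers = kmeans(selected_genotypes, k=2)
print(labels.tolist())  # → [0, 0, 0, 1, 1, 1]
```

The cluster labels are arbitrary identifiers; mapping them to named populations or continents would require reference individuals of known origin.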
The study of admixed populations (populations of complex ancestry, resulting from multiple ancestral populations) requires special attention, because the ancestral populations may be unknown or may no longer exist. In this study the authors demonstrate that the new algorithm can also be used effectively to analyze admixed populations without having to trace the origin of individuals. In an analysis of a Puerto Rican dataset (192 individuals and more than 7,000 SNPs), PCA-correlated SNPs are shown to successfully predict structure and ancestry proportions. These SNPs are subsequently validated for structure identification in an independent Puerto Rican dataset.
The algorithm runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations. An international team of collaborators was involved in this study, including Petros Drineas at the Rensselaer Polytechnic Institute; Elad Ziv, Esteban G. Burchard, and Shweta Choudhry at the University of California, San Francisco; William Rodriguez-Cintron at the University of Puerto Rico School of Medicine in San Juan; Michael W. Mahoney from Yahoo! Research in California; and Peristera Paschou at Democritus University of Thrace.
1. Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, Mahoney MW, Drineas P (2007). PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 3:e160.
2. Liu N, Zhao H (2006). A non-parametric approach to population structure inference using multilocus genotypes. Hum Genomics 2:353–364.
3. Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909.
4. Patterson N, Price A, Reich D (2006). Population Structure and Eigenanalysis. PLoS Genet 2:e190.
5. Shriver M, Mei R, Parra E, Sonpar V, Halder I, et al. (2005). Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum Genomics 2:81–89.