DNA Markers and Computer Science Methodology Can be Used to Trace Individual Ancestry

Humans are almost 99% identical at the DNA level. However, variations do occur, making each individual unique and contributing to the vast diversity of the human species. For instance, Single Nucleotide Polymorphisms (SNPs) are an abundant source of variation, and it is estimated that approximately 10,000,000 such polymorphic sites exist in the human genome. Such variations can be used to infer individual ancestry and population structure. This inference is a central challenge in a variety of research scenarios and applications, such as the search for susceptibility genes for common disorders, forensics, conservation studies, and population genetics. In medical genetics in particular, it has been shown that unrecognized population substructure is a serious problem that may lead to false positive or false negative results, hampering ongoing efforts to identify causal genes for disorders such as diabetes and heart disease. In such settings it is also often desirable to identify a minimal set of SNPs that can be used to assign individuals to populations and uncover hidden population substructure.

As the cost of technology for SNP determination drops, it is now feasible to study data from thousands of individuals for hundreds of thousands of SNPs. The tremendous amount of data generated makes powerful computational methods imperative. The paper by Dr. Paschou and colleagues, appearing in the September issue of the journal PLoS Genetics, describes a fast and novel algorithm for selecting genetic markers that carry information about individual ancestry. The method can be applied to data from diverse population samples without knowing in advance the origin of each individual. The selected genetic markers (i.e. SNPs) can then be used to predict the membership of an individual in a particular population with almost 100% accuracy.

The described algorithm is based on Principal Components Analysis (PCA), which is why the selected SNPs are called PCA-correlated SNPs. PCA is a classical dimensionality reduction tool, meaning that it can be used to extract the fundamental structure (meaningful dimensions) of very large datasets. It was first used in population genetics by Cavalli-Sforza almost 30 years ago, and it has recently started regaining favor for the analysis of human population structure (see for example the references below). However, until now it had not been used to select SNPs that can capture structure in a population and differentiate individuals from different backgrounds. All other methods in the literature that are used to identify ancestry informative markers either rely on a specific model or are frequency based and demand prior knowledge of the origin of individuals (examples of such measures are Fst, informativeness for assignment, and δ).
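To make the idea concrete, here is a minimal Python sketch of how SNPs might be ranked by their contribution to the top principal components. It is an illustration under stated assumptions, not the authors' implementation: the scoring rule (squared loadings summed over the top components), the function name `pca_correlated_snps`, and the simulated genotypes are all introduced here for the example. Note that population labels are never passed to the function.

```python
import numpy as np

def pca_correlated_snps(genotypes, n_components, n_select):
    """Rank SNPs by their weight in the top principal components.

    genotypes: (individuals x SNPs) array of allele counts (0, 1 or 2).
    Returns the column indices of the n_select highest-scoring SNPs.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # center each SNP column
    # Rows of vt are the principal axes expressed in SNP space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    # Assumed score: squared loadings summed over the top components.
    scores = (vt[:n_components] ** 2).sum(axis=0)
    return np.argsort(scores)[::-1][:n_select]

# Simulated toy data (hypothetical): two groups of 50 individuals, 30 SNPs.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 30
freq_a = np.full(n_snps, 0.5)
freq_b = np.full(n_snps, 0.5)
informative = [3, 11, 27]   # SNPs planted with very different frequencies
freq_a[informative] = 0.05
freq_b[informative] = 0.95
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([pop_a, pop_b])

# The origin of each individual is not used anywhere in the selection.
selected = pca_correlated_snps(genotypes, n_components=1, n_select=3)
print(sorted(selected.tolist()))
```

On this toy data, the three planted SNPs should rank highest, because they are the only columns that vary together with the hidden group structure captured by the first principal component.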

The proposed method is evaluated on a previously described dataset (more than 10,000 SNPs for 11 populations, made available for analysis by Dr. Mark Shriver), and it is demonstrated that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. Indeed, for most geographic regions, fewer than 50 SNPs suffice to differentiate among populations in the studied dataset. The methods are also validated on the HapMap populations (Yoruba from Africa, Europeans, Chinese and Japanese – see http://www.hapmap.org for a description of the HapMap project) and perfectly assign individuals to three different continents using only 14 PCA-correlated SNPs (selected out of about 10,000 SNPs). The Chinese and Japanese populations can be easily differentiated with fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. The authors also managed to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations.
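The assignment step can be pictured with a short Python sketch. The article does not say which clustering algorithm was used, so plain k-means is assumed here purely for illustration; the function `kmeans`, the five-SNP panel, and the simulated allele frequencies are all hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Minimal Lloyd's k-means with farthest-point initialization."""
    points = np.asarray(points, dtype=float)
    centers = [points[0]]
    for _ in range(1, k):
        # Next center: the point farthest from all centers chosen so far.
        dists = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Squared distance from every individual to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Simulated genotypes (hypothetical) at a small panel of 5 selected SNPs.
rng = np.random.default_rng(1)
n_per_pop, n_panel = 50, 5
pop_a = rng.binomial(2, 0.1, size=(n_per_pop, n_panel))
pop_b = rng.binomial(2, 0.9, size=(n_per_pop, n_panel))
panel_genotypes = np.vstack([pop_a, pop_b])

labels = kmeans(panel_genotypes, k=2)

# Compare with the true origin, allowing for swapped cluster labels.
truth = np.repeat([0, 1], n_per_pop)
agreement = (labels == truth).mean()
accuracy = max(agreement, 1 - agreement)
print(f"fraction assigned to the correct population: {accuracy:.2f}")
```

With allele frequencies this far apart, a handful of SNPs is enough for the two clusters to match the two simulated populations almost exactly, which mirrors the article's point that a small, well-chosen panel can stand in for thousands of markers.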

The study of admixed populations (populations of complex ancestry, resulting from multiple ancestral populations) is a case that requires special attention. Ancestral populations may be unknown, or may no longer exist. In this study the authors demonstrate that this new algorithm can also be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals and more than 7,000 SNPs), it is shown that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. These SNPs are subsequently validated for structure identification in an independent Puerto Rican dataset.

The algorithm runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations. An international team of collaborators was involved in this study, including Petros Drineas at the Rensselaer Polytechnic Institute; Elad Ziv, Esteban G. Burchard, and Shweta Choudhry at the University of California, San Francisco; William Rodriguez-Cintron at the University of Puerto Rico School of Medicine in San Juan; Michael W. Mahoney from Yahoo! Research in California; and Peristera Paschou at Democritus University of Thrace.

References
1. Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, Mahoney MW, Drineas P (2007). PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 3:e160.

2. Liu N, Zhao H (2006). A non-parametric approach to population structure inference using multilocus genotypes. Hum Genomics 2:353–364.

3. Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909.

4. Patterson N, Price A, Reich D (2006). Population Structure and Eigenanalysis. PLoS Genet 2:e190.

5. Shriver M, Mei R, Parra E, Sonpar V, Halder I, et al. (2005). Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum Genomics 2:81–89.


By Peristera Paschou
