Evaluation of the SNP tagging approach in an independent population sample—array-based SNP discovery in Sami
Abstract
Significant efforts have been made to determine the correlation structure of common SNPs in the human genome. One method has been to identify the sets of tagSNPs that capture most of the genetic variation. Here, we evaluate the transferability of tagSNPs between populations using a population sample of Sami, the indigenous people of Scandinavia. Array-based SNP discovery in a 4.4 Mb region of 28 phased copies of chromosome 21 uncovered 5,132 segregating sites, 3,188 of which had a minimum minor allele frequency (mMAF) of 0.1. Due to the population structure and consequently high LD, the number of tagSNPs needed to capture all SNP variation in Sami is much lower than that for the HapMap populations. TagSNPs identified from the HapMap data perform only slightly better in the Sami than choosing tagSNPs at random from the same set of common SNPs. Surprisingly, tagSNPs defined from the HapMap data did not perform better than selecting the same number of SNPs at random from all SNPs discovered in Sami. Nearly half (46%) of the Sami SNPs with a mMAF of 0.1 are not present in the HapMap dataset. Among sites overlapping between Sami and HapMap populations, 18% are not tagged by the European American (CEU) HapMap tagSNPs, while 43% of the SNPs that are unique to Sami are not tagged by the CEU tagSNPs. These results point to serious limitations in the transferability of common tagSNPs to capture random sequence variation, even between closely related populations, such as CEU and Sami.
Introduction
Our ability to identify the genetic determinants of human traits depends on the complexity of the trait as well as factors such as the extent of linkage disequilibrium (LD) and haplotype structure of the population. The LD pattern and haplotype block architecture vary substantially across the genome due to recombination, mutation rate and selection, and between populations due to random genetic drift and demographic history. Studies have indicated that the human genome can be partitioned into haplotype blocks characterized by lower diversity (Altshuler et al. 2005; Hinds et al. 2005). The use of linked markers (tagSNPs) or SNPs that capture the haplotype information (htSNPs) has been suggested in order to reduce the need for genotyping (Dawson et al. 2002; Johnson et al. 2001), and sets of SNPs capturing the majority of the genetic variation in the human genome have been identified (Altshuler et al. 2005; Hinds et al. 2005). It has also been argued that the LD pattern is similar between human populations, or only differs by a scaling factor (De La Vega et al. 2005; Gibson et al. 2005), and therefore that a single set of tagSNPs can be used irrespective of the population under study. Some authors have reasoned that tagSNPs defined from large diverse human populations will also be useful for capturing the genetic variation of small populations from the same region (Bonnen et al. 2006; Gonzalez-Neira et al. 2006; Montpetit et al. 2006). These conclusions have been challenged (Evans and Cardon 2005) and it has been indicated that tagSNPs are only partly transferable between different European populations (Mueller et al. 2005).
In order to define the structure of genetic variation in the human genome, a dataset comprising SNP variation among 270 individuals from four populations has been used to identify the optimal tagSNP set for each population (Altshuler et al. 2005). These publicly available data are based on population samples of Yoruba (Nigeria, YRI, n = 90, including 30 parent–offspring trios), European Americans (Utah residents with ancestry from northern and western Europe, CEU, n = 90, including 30 parent–offspring trios from the Centre d’Etude du Polymorphisme Humain), Han Chinese (CHB, n = 45, Beijing, China) and Japanese (JPT, n = 44, Tokyo, Japan) (http://www.hapmap.org/). These data include approximately 3 million of the estimated 11 million common SNPs (Kruglyak and Nickerson 2001); rare sequence variants, which constitute the majority of SNPs in the human genome, are severely underrepresented in this dataset. The ability of tagSNPs to capture genetic variation has been mainly evaluated by re-typing SNPs in other populations and by simulation (Bonnen et al. 2006; Gonzalez-Neira et al. 2006; Montpetit et al. 2006; Mueller et al. 2005; Tenesa and Dunlop 2006). By resequencing the HapMap population samples, de Bakker et al. (2005) concluded that tagSNPs chosen from an incomplete source generally perform well, but that the power of detecting rare alleles is reduced. Other studies have also found a high transferability of tagSNPs but have noted that transferability is dependent on the relationship between populations (Conrad et al. 2006; de Bakker et al. 2006) as well as the level of LD in the population (Conrad et al. 2006). Similarly, Tantoso et al. (2006) found a lower detection rate when applying the HapMap tagSNPs to the data obtained from 199 resequenced genes from a population with average levels of LD.
Sami are the indigenous people of the northern parts of Sweden, Finland, Norway and the Kola Peninsula of Russia. Sami were originally hunters (mainly of reindeer and moose) but over time they domesticated the reindeer and became reindeer herders. Today, the Sami population is estimated to be less than 100,000 individuals (Hassler et al. 2004). They speak a language belonging to the Finno-Ugric branch of the Uralic language family, with Finns, Karelians, and Estonians as their closest linguistic neighbors. Sami are the extreme genetic outliers in Europe (Cavalli-Sforza and Piazza 1993) and their origin has for a long time been the focus of speculation. Archaeological findings on the west coast of Sweden have been dated to more than 10,200 years before present (Nordqvist 2000). However, whether or not those findings are from a pre-Sami culture is not known. By the 1950s, the first studies of genetic markers in the Sami population (Allison et al. 1952, 1956; Beckman et al. 1959) were performed in order to investigate their origin. During the following decades, a large number of genetic markers, such as blood groups, enzyme groups, and serum proteins (Beckman 1996), were investigated. Some anthropometric studies suggest that there has been a Mongolian influence on the Sami population. However, this relationship was not generally supported by genetic studies, even though a few of the markers did show a similarity in allele frequencies between Sami and Asian populations (Beckman et al. 1988; 1993; Evseeva et al. 2002; Fan et al. 1993; Larsen et al. 2001; Sikstrom and Nylander 1990). More extensive studies on mitochondrial DNA have noted that most Sami mtDNA types originated from Eastern European populations (Tambets et al. 2004), but also with some limited input from Asian populations (Ingman and Gyllensten 2007; Tambets et al. 2004).
To date, only a limited amount of sequence data is available for evaluating the performance of the tagSNP approach when applied to populations of a different genetic origin. Therefore, using SNP discovery in a 4.4 Mb region on chromosome 21, we tested whether tagSNPs that have been defined in large populations can be transferred to the Sami population. Previous studies have indicated increased levels of LD in Sami both between microsatellite markers and between SNPs on the X chromosome (Johansson et al. 2005; Kaessmann et al. 2002; Laan and Paabo 1997). Here, we also investigate the haplotype structure and LD pattern between SNPs in the region on chromosome 21.
Materials and methods
Population data
We studied SNP variation in individuals from two Sami villages in Västerbotten County, Sweden. The individuals participating in this study were already part of a health-based study, which has been approved by the ethics committee of Umeå University, Sweden. From the two villages, only 28 individuals were considered unrelated for at least three generations, of which 22 were willing to participate also in this study. Publicly available genotype data derived from samples of Yoruba, European Americans, Han Chinese and Japanese, which are part of the HapMap phase II data (public release #21, 2006-07-20; http://www.hapmap.org/), were used for comparison.
Array-based SNP discovery of a 4.4 Mb region of chromosome 21
Independent copies of chromosome 21 were isolated from peripheral blood lymphocytes (PBLs) of 22 unrelated Sami individuals. The two copies of chromosome 21 from each individual were physically separated using a rodent-human somatic cell hybrid technique (Douglas et al. 2001) and 560 cell lines were made. To identify cell lines containing a single copy of chromosome 21, each cell line was typed for 7 microsatellite markers (D21S1437, D21S1435, D21S1446, D21S1432, D21S1270, D21S156 and D21S144). A total of 105 cell lines were positive for all 7 markers (contained at least one copy of chromosome 21) and screening for cell lines containing a single copy of chromosome 21 (cell line homozygous for all 7 microsatellite markers) revealed 28 different chromosomes. A 4.4 Mb region was analyzed in each of the 28 chromosomes using an array-based technique for SNP discovery, which has already been comprehensively described by Patil et al. (2001). After masking the sequence for repeats, the sequence was assayed for variation with high-density oligonucleotide arrays. Oligonucleotides were synthesized on wafers and were used to scan the chromosomes for DNA sequence variation. Each unique copy of chromosome 21 was amplified from the cell lines by using long-range polymerase chain reaction (LR-PCR). Minimally overlapping LR-PCR products were produced, spanning the region of interest on chromosome 21. SNPs were detected as altered hybridization by using a pattern recognition algorithm. In order to achieve a low rate of false-positive SNPs, a stringent threshold for SNP detection on wafers was used.
Approximately 65% (Patil et al. 2001) of all bases present
Correlation structure, haplotype blocks, and tagSNP selection
The data used for comparison with the Sami genotype data included all available (about 7,000) SNPs from each of the four populations in the HapMap phase II data (public release #21, 2006-07-20). To study the linkage disequilibrium (LD) and evaluate the detection rate of tagSNPs in the Sami data, we calculated the squared correlation coefficient (r2) between pairs of SNPs in Sami (SNP set A; Table 1b). For HapMap populations, LD data (SNP set A; Table 1b) downloaded from the HapMap home page were used. The fraction of marker pairs with r2 > 0.8 was calculated for pairs of SNPs with a distance between each other falling into the same distance bin (0–1,000 bp, 1,000–2,000 bp, 2,000–3,000 bp, etc.). The significance of the increased LD in Sami was calculated by comparing the fraction of markers in LD within each distance bin in Sami to the fraction of markers in LD when combining the Sami and CEU observations. P-values were assessed using chi-square statistics. For the Sami data, LD was also calculated separately for frequency-matched SNPs using the same distance bins. Frequency-matched SNPs were identified as pairs of SNPs where the difference in MAF did not exceed 0.1. Haplotype blocks were defined as a solid spine of LD and identified using Haploview software (Barrett et al. 2005) for the same number of chromosomes (n = 28) and using SNPs with mMAF of 0.1 in each population (SNP set B; Table 1c). Multiple-SNP haplotype sharing was estimated for adjacent markers in the Sami chromosomes. For each observed haplotype consisting of 1–100 adjacent markers, the number of chromosomes sharing the same haplotype was counted. TagSNPs were identified using the “Tagger” tagSNP selection algorithm (de Bakker et al. 2005) implemented in Haploview software (Barrett et al. 2005) for SNPs with mMAF of 0.1 in each population (SNP set A; Table 1b). The transferability to Sami of the different sets of tagSNPs identified in the HapMap populations was also tested using the “Tagger” algorithm in Haploview. Randomly ascertained SNPs, for either the Sami SNPs or the HapMap SNPs, were chosen (without replacement) in 10,000 replicates and tested using MATLAB scripts for the ability to capture SNP variation in Sami.
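The pairwise r2 statistic, the distance-binned LD fraction, and the detection rate of a tag set can be illustrated with a minimal Python sketch. It is not the analysis code of this study (which used Haploview's Tagger and MATLAB scripts): the function names are hypothetical, alleles are assumed to be coded 0/1 on phased chromosomes, and the greedy selection is only a simplified stand-in for pairwise tagging.

```python
import random
from itertools import combinations


def r2(a, b):
    """Squared correlation between two biallelic SNPs coded 0/1 per phased chromosome."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # assumption: a monomorphic site is treated as uncorrelated
    d = pab - pa * pb  # disequilibrium coefficient D
    return d * d / denom


def r2_table(snps):
    """Symmetric table of pairwise r2; the diagonal is 1 (a SNP captures itself)."""
    n = len(snps)
    table = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        table[i][j] = table[j][i] = r2(snps[i], snps[j])
    return table


def fraction_in_ld_by_bin(table, positions, bin_size=1000, threshold=0.8):
    """Fraction of SNP pairs with r2 > threshold, keyed by distance-bin index."""
    hits, totals = {}, {}
    for i, j in combinations(range(len(positions)), 2):
        b = abs(positions[j] - positions[i]) // bin_size
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + (table[i][j] > threshold)
    return {b: hits[b] / totals[b] for b in totals}


def captured_by(table, tag, threshold):
    """Indices of SNPs captured by one tag (the tag itself included)."""
    return {j for j in range(len(table)) if j == tag or table[tag][j] >= threshold}


def greedy_tags(table, threshold=0.8):
    """Repeatedly pick the SNP that captures the most still-untagged SNPs."""
    untagged = set(range(len(table)))
    tags = []
    while untagged:
        best = max(range(len(table)),
                   key=lambda i: len(captured_by(table, i, threshold) & untagged))
        tags.append(best)
        untagged -= captured_by(table, best, threshold)
    return tags


def detection_rate(table, tags, threshold=0.8):
    """Fraction of all SNPs captured by at least one tag."""
    captured = set()
    for t in tags:
        captured |= captured_by(table, t, threshold)
    return len(captured) / len(table)


def random_detection_rate(table, n_tags, replicates=1000, threshold=0.8, seed=0):
    """Mean detection rate of n_tags SNPs drawn at random without replacement."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        picks = rng.sample(range(len(table)), n_tags)
        total += detection_rate(table, picks, threshold)
    return total / replicates


# Toy data: three SNPs typed on four chromosomes (one list per SNP).
snps = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
positions = [100, 600, 5000]
table = r2_table(snps)
tags = greedy_tags(table)
```

To mimic a transferability test, `table` would be computed in the target sample while `tags` would come from a reference sample; SNPs absent from the target would simply contribute nothing to the detection rate.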
Results
Array-based SNP discovery in a 4.4 Mb region of chromosome 21
To study the usefulness of tagSNPs for genetic studies of isolated populations, we analyzed the SNP distribution in Swedish Sami from Västerbotten County in Northern Sweden. We isolated individual copies of chromosome 21 by creating human-hamster cell hybrids and identified clones containing a single human chromosome 21 copy (Douglas et al. 2001). A region of 4.4 Mb was analyzed using array-based SNP discovery technology, which detects about 65% of the SNPs (Patil et al. 2001), in each of the 28 chromosomes. A total of 5,132 variable sites were found (on average 1 per 857 bp), of which 3,188 sites (1 per 1,380 bp) had a minimum minor allele frequency (mMAF) of 0.1 (Table 1a) and the remaining 1,944 sites had a MAF below 0.1. Of the 5,132 SNPs, 27% (1,390) were not found in any database and of those, 62% had a MAF below 0.1 (Table 1a). Genotyping completeness of at least 75% was reached for 4,496 of the 5,132 variable sites and, due to the small sample size, we have only used SNPs with a mMAF of 0.1 in the following analyses.
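The site densities quoted above follow directly from the region length and the site counts. A short Python check, with variable names chosen here purely for illustration, reproduces them:

```python
REGION_BP = 4_400_000   # 4.4 Mb region on chromosome 21
ALL_SITES = 5_132       # all variable sites found in Sami
COMMON_SITES = 3_188    # sites with mMAF of 0.1
NOVEL_SITES = 1_390     # sites not found in any database

bp_per_site = REGION_BP / ALL_SITES            # about 857 bp per site
bp_per_common_site = REGION_BP / COMMON_SITES  # about 1,380 bp per site
rare_sites = ALL_SITES - COMMON_SITES          # sites with MAF below 0.1
novel_fraction = NOVEL_SITES / ALL_SITES       # about 27%

print(round(bp_per_site), round(bp_per_common_site), rare_sites,
      round(100 * novel_fraction))
```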
Shared haplotypes and haplotype block structure
By knowing the linkage phase of the Sami haplotypes, the degree of multiple-SNP haplotype sharing can be measured. Haplotype sharing, using the Sami SNP Set A (Table 1b), decreased when including more SNPs (Fig. 1a) as well as when using longer haplotypes, independently of the number of markers included (Fig. 1b). Even for haplotypes including about 10 adjacent SNPs, the average degree of sharing is still 0.29 for each haplotype. Similarly, the average degree of sharing is 0.29 for haplotypes between 10 and 20 kb in length. To study the haplotype block structure we used SNP Set B (Table 1c), which includes 2,492, 2,527, 2,428, 2,424 and 2,454 SNPs in Sami, African (YRI), European American (CEU), Japanese (JPT) and Han Chinese (CHB) population samples, respectively. The haplotype blocks are considerably longer in the Sami when compared to a sample of the same size (28 randomly drawn chromosomes) from each of the four populations included in the HapMap data (Fig. 2a). Some blocks even extend over 450 kb in Sami. The number of haplotype blocks in the Sami (n = 81) is about half that in a similar-sized sample from the HapMap populations, with 165, 192, 160 and 147 blocks in CEU, YRI, JPT and CHB, respectively. The smaller number of blocks in Sami is likely to reflect the more extended LD and homogeneous genetic architecture of Sami.
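One way to make the haplotype-sharing measure concrete is the Python sketch below. It assumes that the degree of sharing of a haplotype is the fraction of chromosomes carrying it, averaged over all observed haplotypes in all windows of k adjacent markers; the exact definition used for Fig. 1 may differ, and the function name is hypothetical.

```python
from collections import Counter


def mean_sharing(chromosomes, k):
    """Average fraction of chromosomes carrying each observed k-marker haplotype.

    `chromosomes` holds one allele sequence per phased chromosome (same SNP order).
    """
    n_chrom = len(chromosomes)
    n_snps = len(chromosomes[0])
    fractions = []
    for start in range(n_snps - k + 1):
        # Count how many chromosomes carry each haplotype in this window.
        counts = Counter(tuple(c[start:start + k]) for c in chromosomes)
        fractions.extend(count / n_chrom for count in counts.values())
    return sum(fractions) / len(fractions)


# Toy example: four phased chromosomes typed at four adjacent SNPs.
chroms = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
sharing_k1 = mean_sharing(chroms, 1)
sharing_k4 = mean_sharing(chroms, 4)
```

In this toy example the sharing drops as the window grows from one to four markers, the same qualitative trend as described for the Sami chromosomes.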
LD structure of SNPs in Sami and HapMap populations
The LD decay in both the Sami and the HapMap data was studied using the squared correlation coefficient (r2) between pairs of SNPs (SNP Set A, Table 1b) at different distances from each other (distance bins). The LD in Sami, expressed as the fraction of markers with r2 > 0.8, is clearly higher than in the HapMap populations, and a further increase in LD is seen for frequency-matched SNPs (Fig. 2b). The LD between marker pairs in Sami is significantly increased (P < 0.01) for all SNPs separated by less than 81 Mb, compared to a mixture of the CEU and Sami LD values. The relationship between the detection rate and r2 threshold was studied in the Sami and the HapMap populations. For all populations the detection rate depends on the correlation coefficient used as the detection threshold (Fig. 2c). The detection rate in the Sami data only decreases from 0.98 to 0.92 when r2 increases from 0.5 to 1.0. In contrast, for the African data (YRI) the detection rate decreases from 0.75 to 0.38 when r2 increases from 0.5 to 1.0. This result is consistent with the higher LD in Sami.
Detection rate and transferability of tagSNPs
SNP Set A (Table 1b) was used to study the power of the tagSNP approach. TagSNPs were identified using the “Tagger” tagSNP selection algorithm (de Bakker et al. 2005) implemented in the Haploview software (Barrett et al. 2005), either by searching for the minimal number of tagSNPs needed to capture all SNP variation, or by finding the different sets of tagSNPs (n = 100, 200, 300, ...) that capture most of the variation in each population. The number of tagSNPs needed to capture all variation within populations is 433, 673, 724 and 1,495 for Sami, CEU, CHB + JPT and YRI, respectively (Fig. 3a). The smaller number of tagSNPs needed in Sami is consistent with the higher LD. We then studied the detection rate in the Sami when using tagSNPs defined from different populations. 
Not surprisingly, choosing tagSNPs from all sites in Sami gives the highest power to detect SNP variation. To capture 90% of the variation in Sami we need about 205 tagSNPs, as compared to about 1,100 randomly chosen SNPs (Fig. 3b). This demonstrates the importance of knowing the correlation between SNPs in a population in order to reduce the amount of genotyping while still being able to capture most of the genetic variation. To study the transferability of tagSNPs between populations, we tested the detection rate in the Sami of tagSNPs defined in the different HapMap populations. The ability of tagSNPs identified from any of the HapMap populations to detect the variation in Sami is lower than that of SNPs chosen at random from the Sami data, and is similar to that of a random selection of sites from the same population data (Fig. 3b). A total of 2,492 SNPs overlap between the Sami data and those typed in CEU, representing about half the total number of SNPs found in Sami. The overlap for SNPs used to study the tagSNP performance (i.e. with a mMAF of 0.1 and genotyping completeness of at least 75%) is 1,482 SNPs. Consequently, when choosing tagSNPs among the 3,351 SNPs reported in the European American population with a mMAF of 0.1, approximately 44.2% (1,482/3,351) of SNPs overlap with SNPs in Sami while the remaining 55.8% are not informative in the Sami dataset. This shows that information on the actual SNP distribution in a population may dramatically increase the ability to identify informative tagSNPs. The transferability of tagSNPs defined from common SNPs appears to be related to the genetic similarity between the populations. The mean pairwise FST (Wright 1950) for all overlapping sites shows that Sami are more similar to CEU (average FST = 0.0451) than to either YRI (average FST = 0.186) or CHB + JPT (average FST = 0.1435) population samples. 
Also, the CEU tagSNPs show a somewhat better detection rate in the Sami than those defined from the other populations (Fig. 3b). Nevertheless, tagSNPs optimized from the European American HapMap population (CEU) do not show a much better performance in the Sami than a random selection of sites, as might have been expected from the high genetic similarity between the two populations. The low transferability of tagSNPs is not affected by also including SNPs with lower MAF than 0.1 (Supplementary material; Figure 1).
Features of “untagged” SNPs in Sami
The high number of Sami SNPs that are not tagged by the HapMap tagSNPs is surprising. Although the HapMap SNPs represent common rather than random variants, there is a correlation between SNP density in the Sami and SNPs typed in the HapMap CEU population (Fig. 4). Having a correlation between the SNP densities in the populations indicates that gaps in the SNP distribution are not responsible for the poor transferability of tagSNPs. In contrast, no correlation is seen between the density of untagged and tagged SNPs in Sami (P = 0.15), although a correlation is found between the density of tagSNPs and the density of SNPs typed in CEU (r2 = 0.36, P = 10^-32). Among the 2,732 Sami SNPs (mMAF = 0.1), 1,942 (71%) were tagged by a set of 600 tagSNPs from the CEU population, while 790 (29%) were untagged (Fig. 5). The untagged SNPs have a somewhat lower average heterozygosity (H = 0.33 ± 0.11) than the tagged SNPs (H = 0.38 ± 0.11) (P < 10^-25), and untagged sites also have a lower average MAF (MAF = 0.23 ± 0.12) as compared to tagged sites (MAF = 0.29 ± 0.13) (P < 10^-30). Among the sites used to study the transferability between populations, 1,482 were present and had a mMAF of 0.1 in both Sami and CEU. Of these 1,482 sites, 82% were tagged in Sami by the 600 CEU tagSNPs. In contrast, only 57% of the Sami sites that do not overlap with the HapMap dataset were tagged by the 600 CEU tagSNPs. 
Allowing for rare variants in both the Sami and CEU datasets gave a similar result. By choosing 1,067 CEU tagSNPs from all SNPs, independent of MAF, 343 (15%) of the overlapping Sami SNPs were not tagged compared to 702 (33%) of the non-overlapping sites. Of these untagged sites, 44% had a MAF below 0.1. The differentiation between populations is slightly higher for untagged sites (FST = 0.021 ± 0.026) compared to tagged sites (FST = 0.017 ± 0.23) (P = 0.02), supporting the benefits of using a closely related population for tagSNP selection.
Discussion
We have studied the tagSNPs defined in the HapMap populations for their performance when applied to the SNP pattern detected by SNP discovery in an independent population sample. While most other empirical studies have used either data from the same population (sometimes even the same individuals) or the same set of common SNPs to evaluate the ability of tagSNPs to capture the genetic variation, we have studied a different population using an unbiased method to assess SNP variation. We analyzed a region of 4.4 Mb on chromosome 21 using isolated individual chromosomes. Analysis of individual chromosomes allows for unambiguous phase determination of sequence variants. The chromosome region studied was randomly ascertained and appears to be representative of the genome as a whole with regard to GC content as well as SNP and gene density (data not shown). In addition, previous studies of the HapMap data have not indicated that this region exhibits any deviating genomic characteristics. The SNP density in the region is somewhat lower than that expected for a large and expanding population, consistent with information indicating that Sami have had a rather limited and constant population size. LD between both microsatellite markers and X-chromosomal SNPs has been shown to be unusually high in Sami (Johansson et al. 2005; Kaessmann et al. 2002; Laan and Paabo 1997). 
The present study of chromosome 21 shows that the SNP haplotype blocks are considerably longer and the extent of LD in Sami is higher as compared to European American, African and two Asian population samples (Fig. 2a,b). The difference in LD between European Americans and Sami is of the same magnitude as that between the African and European American samples. Interestingly, the relative difference in microsatellite LD between Sami and the general Swedish population (Johansson et al. 2005) is of the same magnitude as the difference in LD between SNP markers between these populations (Fig. 2b). Thus, while the physical distance over which significant LD occurs differs between microsatellite and SNP markers by a factor of about 80, both markers show an approximately 2-fold higher LD in Sami relative to the general Swedish population. The higher LD in Sami results in less than half the number of haplotype blocks compared to a sample of the same number of chromosomes from the European American population, and a need to study fewer tagSNPs in order to achieve the same level of coverage. The performance of tagSNPs depends heavily on the set of SNPs used. We have used 0.1 as the MAF cutoff due to the relatively small sample sizes in both our and the HapMap dataset. According to previous calculations (Kruglyak and Nickerson 2001), the detection rate of SNPs in a sample size of 24 chromosomes is estimated to be 0.76, 0.95 and 0.99 for a MAF of 0.01, 0.05 and 0.1, respectively. Consequently, the loss in SNP information due to small sample size is more pronounced for rare alleles. By using a MAF cutoff of 0.1 in our analysis, the small sample size will not have a large effect on the SNP detectability. However, including alleles with a MAF below 0.1 has no dramatic effect on the transferability of tagSNPs, compared to simply using allele frequencies with mMAF of 0.1 (Supplementary Figure 1). 
The detection rate is clearly dependent on the mMAF and somewhat on differentiation between populations. TagSNPs chosen from the European American population data did not perform well in the Sami, and even choosing SNPs at random from the Sami SNP data resulted in a higher detection rate. The tagSNPs selected in the European American population performed slightly better than the tagSNPs selected in the African and Asian populations. This is not surprising since our Sami samples were found to be genetically more similar to European populations than to either the African or Asian populations. The markers in Sami that are not tagged by the tagSNPs from the European American population have a lower heterozygosity as compared to tagged sites. This may reflect the bias towards SNPs with a high MAF in the HapMap genotyping panel, making tagSNPs chosen from this population more efficient at capturing SNPs with a high MAF. LD is known to be frequency-dependent (Eberle et al. 2006) and clearly increases for SNPs that are frequency-matched, as is also seen for our Sami data (Fig. 2b). The reduced transferability can therefore be partly explained by the bias in favor of common variants in the HapMap data as compared to our Sami data (Supplementary material; Figure 2). The frequency dependence is also demonstrated by using subsets of SNPs with different mMAF for selecting tagSNPs and then applying them to other subsets of SNPs with different allele frequencies. The detection rate clearly decreases when tagSNPs are chosen from SNPs with a high mMAF in order to detect variation among SNPs with a lower mMAF (Supplementary material; Figure 3). The similar performance in Sami of randomly chosen tagSNPs and tagSNPs optimized for the European American dataset indicates a difference in the basic correlation structure of SNPs in the Sami and European American datasets. 
Earlier studies have indicated a relatively high transferability of tagSNPs between populations and the poor transferability seen in our study might be due to both differentiation between the Sami and HapMap populations, and the fact that we evaluated the tagSNPs in a set of randomly ascertained SNPs rather than using the same set of sites from which they were defined. The first explanation is less likely considering the highly correlated allele frequencies between the Sami and CEU populations (Supplementary material; Figure 4). However, the latter is clearly supported by the increased performance of the CEU tagSNPs within the subset of Sami SNPs that overlaps with the HapMap panel compared to the Sami SNPs that do not overlap (lower average MAF). This suggests that the detection rate will be overestimated when the evaluation is based on a set of predefined sites that are being retyped rather than using techniques for de novo detection of SNPs, such as resequencing or array-based SNP discovery. Despite the reduced transferability of the HapMap tagSNPs, there are distinct advantages in studying isolated populations like the Sami. Considering that the 4.4 Mb region studied on chromosome 21 is about 0.15% of the human genome, a 500k SNP chip would comprise 750 SNPs from this region. Extrapolating from our data (Fig. 3b), 750 randomly ascertained SNPs from the set of SNPs with a mMAF of 0.1 in the HapMap European American population would still capture over 70% of all common variation in Sami. It is remarkable that the same number of SNPs chosen at random from all sites with a mMAF of 0.1 in Sami performs better than the tagSNPs selected in the HapMap populations. This result is most likely due to the small overlap between our SNPs and the HapMap SNPs. Among all SNPs with a mMAF of 0.1 found in Sami, only 54% are typed and have a mMAF of 0.1 in the HapMap European American population. Similarly, only 44% of the SNPs in the HapMap data are found in Sami. 
Choosing SNPs randomly from the HapMap SNPs will consequently result in only 44% being informative markers in Sami. Considering that the SNP discovery technique used in Sami has an estimated discovery rate of 65% (Patil et al. 2001), the fraction of CEU SNPs expected to be found in Sami could be approximated to be 68% (43.8%/0.65). The higher overlap implies that the performance in Sami of tagSNPs chosen from the CEU dataset is somewhat underestimated and that high-throughput SNP genotyping, either of randomly ascertained SNPs or selected tagSNPs, will provide sufficient power in Sami for mapping genetic traits. In addition to the higher LD, isolated populations offer several other advantages for mapping complex traits. The demographic history of many isolated populations is likely to result in the segregation of fewer phenotype-associated alleles and thus a reduced complexity of the genetic factors affecting a multifactorial trait. Isolated populations may also show less variability in environmental factors and provide a possibility to compare diet and other lifestyle factors between different groups with the same genetic background. For instance, Sami leading a traditional subsistence lifestyle are exposed to a relatively homogeneous diet and lifestyle, quite different from the westernized lifestyle of Sami with other occupations. Such lifestyle differences may be associated with differences in disease prevalence (Hassler et al. 2001; Wiklund et al. 1990). Studies of well-chosen isolated populations thus represent important assets in the quest to decipher the genetic factors underlying complex human traits. Even in the light of the progress made by the HapMap project, much of the genetic variation in the human genome remains undetected. 
For our chromosome 21 region, about 27% of all segregating sites (or 17% of sites with a mMAF of 0.1) identified in our survey have not been reported in any database, even though chromosome 21 has already been analyzed by array-based SNP discovery in at least 21 individuals from three different populations (Patil et al. 2001). The lack of complete information for all SNPs in a population leads to a reduced power to detect genetic variation of functional relevance. This can be most clearly seen in that the tagSNPs selected on the basis of the HapMap data have a detection rate in the Sami that is lower than choosing SNPs at random from the Sami data. This emphasizes the need for genomic resequencing at a larger scale and techniques are now becoming available that will make it economically feasible to perform complete sequencing of mammalian genomes at an unprecedented rate (Bennett et al. 2005; Margulies et al. 2005).
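The chip-coverage and overlap figures used in the Discussion can be retraced from counts reported in the Results. The Python sketch below does so under the assumption that the overlap fractions are based on the 1,482 shared sites, the 2,732 Sami sites and the 3,351 CEU sites with a mMAF of 0.1; the variable names are illustrative only.

```python
CHIP_SNPS = 500_000      # SNPs on a 500k genotyping chip
REGION_SHARE = 0.0015    # the 4.4 Mb region is about 0.15% of the genome
OVERLAP = 1_482          # sites with mMAF of 0.1 in both Sami and CEU
SAMI_COMMON = 2_732      # Sami sites with mMAF of 0.1
CEU_COMMON = 3_351       # CEU sites with mMAF of 0.1
DISCOVERY_RATE = 0.65    # estimated discovery rate of the array technique

snps_in_region = CHIP_SNPS * REGION_SHARE   # expected chip SNPs in the region
sami_in_ceu = OVERLAP / SAMI_COMMON         # share of Sami sites typed in CEU
ceu_in_sami = OVERLAP / CEU_COMMON          # share of CEU sites found in Sami
# Correcting for sites that the array-based discovery is expected to miss:
ceu_in_sami_adjusted = ceu_in_sami / DISCOVERY_RATE

print(round(snps_in_region), round(100 * sami_in_ceu),
      round(100 * ceu_in_sami), round(100 * ceu_in_sami_adjusted))
```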