

Monday, October 6, 2014

The power of imputation

The latest version of the Affymetrix Human Origins genotyping dataset, published last month along with Lazaridis et al. 2014, is an awesome resource for population genetics (see here). However, it lacks Polish samples, which is a major drawback as far as this blogger is concerned.

Hopefully this oversight is corrected soon. In the meantime, I decided to include 15 Poles from the Eurogenes Project dataset in my copy of the Human Origins. But in order to do that I first had to impute around 460K genotypes for each of these people.

Imputing so many markers might sound pretty crazy, but it's actually very doable, especially for genetically homogeneous groups with relatively low haplotype diversity, like the Polish population. I used BEAGLE 3.3.2 for the job, mostly because I'm familiar with it, but also because it's quick and accurate.

My reference panel included 1090 individuals, most of them shared by Eurogenes and Human Origins, and just over 1 million markers. Only around 130K of the markers were shared by the two datasets, but well over 50% of the 1 million genotypes were observed in each of the Poles. This meant that I was imputing sporadically missing data, which is certainly a more sensible strategy than attempting to fill in long stretches of empty calls.
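To make the "sporadically missing" point concrete, here's a minimal Python sketch of the kind of check that could be run on a sample before imputation. The function names and the toy genotype vector are made up for illustration and aren't part of BEAGLE or of either dataset; the sketch assumes genotypes are coded as 0/1/2 with None for a missing call.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 copies of an allele; None = no call


def observed_fraction(calls: Sequence[Genotype]) -> float:
    """Share of markers with an observed (non-missing) call for one sample."""
    if not calls:
        raise ValueError("no markers supplied")
    return sum(call is not None for call in calls) / len(calls)


def longest_missing_run(calls: Sequence[Genotype]) -> int:
    """Length of the longest stretch of consecutive missing calls."""
    longest = current = 0
    for call in calls:
        current = current + 1 if call is None else 0
        longest = max(longest, current)
    return longest


# Toy sample (made-up data): gaps are scattered rather than clustered.
sample = [0, 1, None, 2, 2, None, 1, 0, 0, None]
print(observed_fraction(sample))    # 0.7
print(longest_missing_run(sample))  # 1
```

A high observed fraction together with short missing runs is the situation described above. A low fraction or long runs would mean filling in long stretches of empty calls.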

Everything seems to have worked out just fine, and the proof is in the pudding. Below are two Principal Component Analyses (PCA) featuring the Poles alongside 50 samples from the HGDP. The first PCA is based on observed genotypes, while the second is based on markers that were imputed into the Polish genomes. PCA is very sensitive to artifacts like genotyping errors, but as you can see, there's very little difference between these results. Also, keep in mind that the SNPs used in the Human Origins were specifically chosen for population genetics, while those in the Eurogenes dataset come from chips mostly designed for commercial ancestry and medical work.
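For anyone who wants to try the same sort of sanity check, here's a rough pure-Python sketch of a first principal component computed by power iteration on a samples-by-markers genotype matrix. It's only an illustration under simple assumptions (no missing calls, no per-marker scaling), not the software used for the plots here, and the toy genotypes are made up.

```python
from math import sqrt


def first_pc(genotypes, iterations=200):
    """Relative position of each sample along the first principal component.

    Uses power iteration on the sample-by-sample Gram matrix of the
    mean-centered genotypes. The returned vector has unit length and an
    arbitrary sign.
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    vec = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return [0.0] * n  # no variation in the data
        vec = [x / norm for x in nxt]
    return vec


# Made-up genotypes: three samples from each of two distinct groups.
group_a = [[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
group_b = [[2, 2, 1, 2], [2, 1, 1, 2], [2, 2, 1, 1]]
scores = first_pc(group_a + group_b)
print([round(s, 2) for s in scores])
```

Running this once on the observed markers and once on the imputed ones, then comparing how the samples line up, is the basic idea behind the comparison above.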

Also, here's a PCA based on more than 300K SNPs, both observed and imputed in the Poles, featuring all of the West Eurasian samples from the filtered version of Human Origins, as well as the 15 Polish individuals. Note that the Poles cluster more or less between the Czechs and groups from the East Baltic region, and overlap most strongly with Belarusians, which makes sense.


Brian L. Browning, Sharon R. Browning, A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals, AJHG, Volume 84, Issue 2, p210–223, 13 February 2009, DOI:

Lazaridis et al., Ancient human genomes suggest three ancestral populations for present-day Europeans, Nature, 513, 409–413 (18 September 2014), doi:10.1038/nature13673

Sunday, September 21, 2014

Corded Ware people: more versatile and healthier than Neolithic farmers

Over at West Hunter Greg Cochran argues that late Neolithic farmers in Northern Europe experienced nothing short of genocide at the hands of Corded Ware Culture (CWC) pastoralists, who pushed deep into the continent from somewhere east of present-day Germany around 4,800 years ago.

I think he's exaggerating. My view is that farming populations throughout much of Neolithic Europe began to crash well ahead of any invasions, perhaps as a result of climate change, overpopulation, environmental degradation and bad health. This, I'd say, created a vacuum that attracted groups from the peripheries of the Neolithic world, like the CWC nomads.

If so, it's likely that many of the surviving farmers were killed or marginalized in the process, although, as often happens in such cases, their women might have been incorporated on a large scale into the new post-Neolithic societies. This is perhaps why the most common Neolithic Y-chromosome haplogroup, G2a, is now so scarce in Europe, while a wide variety of mitochondrial (mtDNA) lineages frequently found among Neolithic skeletons are still carried by many Europeans today.

Nevertheless, I'm not aware of any evidence of a wholesale slaughter, or even any wars, going on in Europe during the early CWC period.

This new paper at Anthropologie seems to back up my case. The Corded Ware people were simply more versatile and healthier than the Neolithic farmers. No wonder then, that they eventually came out on top.

This study focuses on the changes in the human skeleton that are associated with the transition to agricultural subsistence. Two populations from the territory of contemporary Poland that differ in terms of their subsistence strategies are compared. An agricultural subsistence strategy is represented by a Lengyel Culture population from Oslonki (5690-4950 BP), whilst the Corded Ware populations from Zerniki Gorne and Zlota (c. 4160-3900 BP) represent mixed, agricultural-breeding-pastoral economies supplemented with hunting and gathering. The Corded Ware sample consisted of 62 individuals in total, and the Lengyel sample comprised 68 individuals. Health status was examined through skeletal stress indicators, cribra orbitalia, enamel hypoplasia and Harris lines. The analysis of enamel hypoplasia showed the effect of different adaptive strategies on buffering adverse nutritional factors and diseases. The prevalence and severity of the condition proved significantly higher in the Lengyel sample than in the Corded Ware population (64.7% vs. 43.5%, respectively). It is suggested that agricultural subsistence, associated with a less diversified diet, sedentism, exposure to pathogens, spread of infections and increased population density, caused more frequent and severe stress episodes than the mixed economy of the Corded Ware people. The inverse relationship between enamel hypoplasia and the mean age at death found in the agricultural population clearly shows an effect of adverse living conditions on the biological development of the individuals studied.


Krenz-Niedbala M, A biocultural perspective on the transition to agriculture in Central Europe, Anthropologie, 2014/Volume 52/Issue 2/pp. 115-132, ISSN 0323-1119

See also...

Best of 2008: Corded Ware DNA from Germany

Corded Ware Culture linked to the spread of ANE across Europe

Thursday, September 4, 2014

Ancient North Eurasian (ANE) admixture across Europe & Asia

This is an update of a supervised ADMIXTURE analysis that I ran earlier this year looking at ANE levels throughout Asia, the results of which I posted at my other blog (see here). Anyone wanna make a map?

ANE admixture across Europe & Asia spreadsheet

My claim is that these estimates are more accurate than those we've seen recently in scientific literature. Obviously I'm referring here to Lazaridis et al. 2013/14 (see here). That's not to say that people like Iosif Lazaridis, Nick Patterson and David Reich don't know what they're doing. Clearly they do, but at the fine-scale there's usually room for improvement no matter who you are.

For instance, in their paper in table S14.9 they list the Basques (in fact, French Basques) as 11.4% ANE, which sounds reasonable, although perhaps a little too high considering they admit that this population can be modeled as 0% ANE. On the other hand, they estimate the "North Spanish" to be 16.3% ANE.

Now, this reference set is actually from the 1000 Genomes project, where it's listed as Spaniards from Pais Vasco (i.e. Basque Country). Essentially, what this means is that these are Basques from Spain. So why would Basques from France carry only 11.4% ANE, and Basques from Spain a whopping 16.3%? Not only that, but according to Lazaridis et al., these "North Spanish" can also be modeled as 0% ANE.

Obviously, something's not quite right there. Indeed, in my spreadsheet, the very same French Basques are listed as 7.4% ANE, while the Pais Vasco Spaniards as just over 8%. Call me crazy, and many do, but I think these results actually make good sense.

By the way, I made ten synthetic samples from the ANE allele frequencies from this test, and remarkably, in all of the analyses I've run so far they behaved very much like MA-1 or Mal'ta boy, the main ANE proxy. Below, for example, is a Principal Component Analysis (PCA) of West Eurasia featuring these individuals. The result is very similar to those I obtained with Mal'ta boy (see here and here).
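One simple way to build synthetic individuals from a set of allele frequencies is to draw two alleles per SNP at the component's frequency. The Python sketch below shows that idea only; it assumes independent markers and random mating, the function names and frequencies are made up, and it isn't necessarily how the samples offered here were generated.

```python
import random


def synthetic_genotype(freqs, rng):
    """One diploid sample: two independent allele draws per SNP."""
    return [int(rng.random() < p) + int(rng.random() < p) for p in freqs]


def make_samples(freqs, n_samples=10, seed=1):
    """Generate n_samples synthetic individuals from allele frequencies."""
    rng = random.Random(seed)  # fixed seed so the output is repeatable
    return [synthetic_genotype(freqs, rng) for _ in range(n_samples)]


# Made-up allele frequencies for five SNPs in a hypothetical component.
component_freqs = [0.0, 0.15, 0.5, 0.85, 1.0]
samples = make_samples(component_freqs)
for genotype in samples[:3]:
    print(genotype)
```

Each genotype is coded as 0, 1 or 2 copies of the counted allele, so the output can be fed into the same tests as real samples.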

The synthetic ANE samples are available here. Feel free to play around with them, and if you do, please let me know what you discover.

As some regular visitors already know, I'm currently designing a new test for GEDmatch that will include various ancient components like ANE. Unfortunately, it might be a while before it's ready, simply because I want it to be as accurate as possible.

See also...

Eurogenes ANE K7

Corded Ware Culture linked to the spread of ANE across Europe

Wednesday, August 13, 2014

Male height in Europe

I'd say this open access paper at Science Direct is the most detailed work on European stature ever. The conclusion is that male height in Europe is mostly determined by nutrition and genetics, which isn't really earth shattering. But the authors also point out that Y-chromosome haplogroup I-M170 shows a strong correlation with the highest average stature on the continent, and speculate that the link between the two might be Upper Paleolithic hunter-gatherer ancestry:

The average height of 45 national samples used in our study was 178.3 cm (median 178.5 cm). The average of 42 European countries was 178.3 cm (median 178.4 cm). When weighted by population size, the average height of a young European male can be estimated at 177.6 cm. The geographical comparison of European samples (Fig. 1) shows that above average stature (178+ cm) is typical for Northern/Central Europe and the Western Balkans (the area of the Dinaric Alps). This agrees with observations of 20th century anthropologists (Coon, 1939; Lundman 1977). At present, the tallest nation in Europe (and also in the world) are the Dutch (average male height 183.8 cm), followed by Montenegrins (183.2 cm) and possibly Bosnians (182.5 cm) (Table 1). In contrast with these high values, the shortest men in Europe can be found in Turkey (173.6 cm), Portugal (173.9 cm), Cyprus (174.6 cm) and in economically underdeveloped nations of the Balkans and former Soviet Union (mainly Albania, Moldova, and the Caucasian republics).


The trend of increasing height has already stopped in Norway, Denmark, the Netherlands, Slovakia and Germany. In Norway, military statistics date its cessation to late 1980s.


In contrast, the fastest pace of the height increase (≥1 cm/decade) can be observed in Ireland, Portugal, Spain, Latvia, Belarus, Poland, Bosnia and Herzegovina, Croatia, Greece, Turkey and at least in the southern parts of Italy.


Although the documented differences in male stature in European nations can largely be explained by nutrition and other exogenous factors, it is remarkable that the picture in Fig. 1 strikingly resembles the distribution of Y haplogroup I-M170 (Fig. 10a). Apart from a regional anomaly in Sardinia (sub-branch I2a1a-M26), this male genetic lineage has two frequency peaks, from which one is located in Scandinavia and northern Germany (I1-M253 and I2a2-M436), and the second one in the Dinaric Alps in Bosnia and Herzegovina (I2a1b-M423). In other words, these are exactly the regions that are characterized by unusual tallness. The correlation between the frequency of I-M170 and male height in 43 European countries (including USA) is indeed highly statistically significant (r = 0.65; p < 0.001) (Fig. 11a, Table 4). Furthermore, frequencies of Paleolithic Y haplogroups in Northeastern Europe are improbably low, being distorted by the genetic drift of N1c-M46, a paternal marker of Ugrofinian hunter-gatherers. After the exclusion of N1c-M46 from the genetic profile of the Baltic states and Finland, the r-value would further slightly rise to 0.67 (p < 0.001). These relationships strongly suggest that extraordinary predispositions for tallness were already present in the Upper Paleolithic groups that had once brought this lineage from the Near East to Europe.


Grasgruber et al., The role of nutrition and genetics as key determinants of the positive height trend, Economics & Human Biology, available online 7 August 2014, DOI: 10.1016/j.ehb.2014.07.002
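As a quick check on the correlation quoted above, a Pearson r can be converted into a t statistic with n - 2 degrees of freedom. The Python sketch below does this for the reported values, assuming a plain Pearson correlation across the 43 countries; the function name is mine, not the paper's.

```python
from math import sqrt


def correlation_t(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation r from n pairs."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * sqrt(n - 2) / sqrt(1 - r * r)


print(round(correlation_t(0.65, 43), 2))  # 5.48
print(round(correlation_t(0.67, 43), 2))  # 5.78
```

With roughly 40 degrees of freedom the two-tailed critical value for p = 0.001 is about 3.55, so both figures are consistent with the p < 0.001 reported in the paper.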

Tuesday, July 29, 2014

Analysis of Upper Paleolithic Siberian forager Afontova Gora-2

Apparently, this 15,000 year-old genome from Central Siberia is heavily contaminated with modern DNA (see section SI 5.2.3. in Raghavan et al. 2013). However, apart from MA-1, it's the only Ancient North Eurasian (ANE) sample available right now, so I thought I'd take a closer look at it.

The shared drift statistics using f3(Mbuti;AG-2,Test) do suggest contamination from a present-day Eastern European source, with, for instance, Ukrainians from Lviv showing an unexpectedly strong signal (third on the list below just behind Pima Indians). This makes sense since AG-2 was probably mainly handled by Slavic-speaking Soviet archaeologists and museum staff.
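For readers unfamiliar with the statistic, f3(Outgroup;A,B) averages the product (o - a)(o - b) of allele frequency differences across SNPs, so higher values mean more drift shared by A and B since their split from the outgroup. Below is a bare-bones Python sketch of that idea. It leaves out the sample-size correction and standard errors that proper software computes, and the frequencies and population labels are made up.

```python
def f3_outgroup(outgroup, pop_a, pop_b):
    """Naive f3(Outgroup; A, B): mean of (o - a) * (o - b) over SNPs."""
    if not outgroup or not (len(outgroup) == len(pop_a) == len(pop_b)):
        raise ValueError("need equal-length, non-empty frequency lists")
    products = [(o - a) * (o - b) for o, a, b in zip(outgroup, pop_a, pop_b)]
    return sum(products) / len(products)


# Made-up allele frequencies at four SNPs.
mbuti = [0.1, 0.8, 0.5, 0.3]
ag2 = [0.6, 0.2, 0.5, 0.9]
tests = {
    "Hypothetical_close": [0.5, 0.3, 0.4, 0.8],
    "Hypothetical_distant": [0.2, 0.7, 0.5, 0.4],
}

# Rank test populations by shared drift with AG-2, highest first.
ranking = sorted(tests, key=lambda name: f3_outgroup(mbuti, ag2, tests[name]),
                 reverse=True)
print(ranking)
```

Sorting the test populations by this value gives the kind of ranked list shown in the spreadsheet.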

Shared drift with AG-2 (spreadsheet)

Indeed, in the Eurogenes K15 test, the Baltic component is the most important for AG-2, and this component is modal among Balto-Slavic populations. However, AG-2 fails to register any Mediterranean-specific admixture. At the very least, this is interesting, because all present-day Europeans show this influence. In fact, out of the four K15 components typical of the Near East, only the West Asian component appears for AG-2. This component actually peaks in the Caucasus, where today ANE reaches its highest levels in West Eurasia.

Eurogenes K15 results for AG-2

North_Sea 11.3
Atlantic 0.01
Baltic 22.83
Eastern_Euro 20.53
West_Med 0
West_Asian 4.63
East_Med 0
Red_Sea 0
South_Asian 13.9
Southeast_Asian 0
Siberian 5.97
Amerindian 16.07
Oceanian 4.77
Northeast_African 0
Sub-Saharan 0

4 Ancestors Oracle results based on the K15 ancestry proportions suggest that AG-2 might simply be a more westerly ANE sample than MA-1, perhaps with some European forager ancestry. Below are a few examples of the best population approximations; note the strong showing by StoraFörvar11, a Mesolithic genome from near Gotland, Sweden. The full list can be seen here.

1 Brahmin_UP+North_Amerindian+StoraFörvar11+StoraFörvar11 @ 8.364493
2 Burusho+North_Amerindian+StoraFörvar11+StoraFörvar11 @ 8.411899
3 MA-1+MA-1+StoraFörvar11+Tatar @ 8.427561
4 Kshatriya+North_Amerindian+StoraFörvar11+StoraFörvar11 @ 8.437549
5 Gujarati+North_Amerindian+StoraFörvar11+StoraFörvar11 @ 8.45127
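To give a rough idea of what a list like this represents, the Python sketch below tries every set of four reference populations (repeats allowed), averages their component percentages, and ranks the sets by Euclidean distance to the target. That's my assumption about the general approach rather than the Oracle's actual code, and the reference populations and three-component profiles are invented for the example.

```python
from itertools import combinations_with_replacement
from math import dist


def four_ancestor_fits(target, references, top=3):
    """Rank sets of four reference populations by distance to the target."""
    fits = []
    for combo in combinations_with_replacement(sorted(references), 4):
        # Equal 25% contribution from each of the four "ancestors".
        mix = [sum(references[name][i] for name in combo) / 4
               for i in range(len(target))]
        fits.append((dist(target, mix), combo))
    fits.sort()
    return fits[:top]


# Invented three-component profiles (percentages) for illustration only.
references = {
    "PopA": [100.0, 0.0, 0.0],
    "PopB": [0.0, 100.0, 0.0],
    "PopC": [0.0, 0.0, 100.0],
}
target = [50.0, 25.0, 25.0]

for distance, combo in four_ancestor_fits(target, references):
    print("+".join(combo), "@", round(distance, 6))
```

A population appearing twice in a set, like StoraFörvar11 in the list above, simply counts for half of the modeled ancestry.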

However, I was only able to use around 13K SNPs that overlapped with my dataset for all of the tests here. So perhaps these markers were much less affected by contamination than the rest? In any case, here are three Principal Component Analyses (PCA) to finish things off. Again, AG-2 basically looks like the genome of a late ANE survivor with a solid contribution from indigenous European foragers. Hopefully this can be confirmed or debunked in the near future with a much higher quality sequence of its genome.

Update 20/08/2014: In the above analysis I used variants from the 1stextraction AG-2 bam file. To try and get more markers I have now also processed the apparently lower quality supernatant bam. Merging the two files has given me just over 30K SNPs to play with, and I think the extra markers have made a positive difference. Below are the updated results, which I'd say appear more accurate because they're much more similar to those of MA-1 (see here and here).

Revised Eurogenes K15 results for AG-2

North_Sea 12.63
Atlantic 0
Baltic 12.77
Eastern_Euro 30.26
West_Med 0
West_Asian 1.13
East_Med 0
Red_Sea 0
South_Asian 18.44
Southeast_Asian 0
Siberian 3.84
Amerindian 17.34
Oceanian 3.6
Northeast_African 0
Sub-Saharan 0

Revised 4 Ancestors Oracle results for AG-2
Revised shared drift with AG-2 (spreadsheet)

PCAs based on the new set of markers look almost identical to those above, so I won't bother posting them. By the way, I updated the Eurogenes ancient genomes datasheet with the revised AG-2 K15 results (see here).

See also...

Analysis of Mesolithic Swedish forager StoraFörvar11

Wednesday, March 26, 2014

The story of R1a: the academics flounder on

There's been a lot of horseshit published over the years about Y-chromosome haplogroup R1a, which just happens to be my haplogroup. That includes academic papers in journals like PLoS ONE and Nature. My advice is, take all of that stuff with a very large pinch of salt and just look here for updates.

Indeed, a new paper on the phylogeography of R1a appeared at the Nature website today: Underhill et al. 2014. It's actually a much better effort than anything else on the topic at academic level thus far, but certainly not without issues.

For instance, the authors failed to include two well known and very important R1a subclades in their analysis: the Northwest European-specific R1a-CTS4385 and the East and Central European-specific R1a-Z280. As a result, the former is lumped with R1a-M417* and the latter with R1a-Z282*. In fact, Z280 is shown to be above Z282 in the topology of R1a-M420 (see Figure 1 here), which is plain wrong. These are major oversights and mean that this study is not a very useful resource as far as the phylogeography of European R1a is concerned.

But the paper does show a couple of interesting things. For instance, the maps below offer the best illustration to date of the dichotomy between the European-specific R1a-Z282 and Asian-specific R1a-Z93.

However, these are very closely related subclades, sharing the Z645 mutation (unfortunately not mentioned in the paper), and both reaching high frequencies among Indo-European speakers. It's therefore plausible that groups carrying these markers expanded to the west and east from a zone between their current hotspots, possibly the Volga-Ural region, rather recently.

Indeed, these migrations had to have happened after 4800-6800 YBP, which is the age of R1a-M417 reported by Underhill et al., and backed up by estimates from genetic genealogists using, among other things, complete R1a sequences (see here). In other words, the rapid expansions of R1a-Z282 and R1a-Z93 appear to have taken place from more or less the same region during the generally accepted early Indo-European timeframe, making them excellent candidates for paternal markers of the early Indo-European dispersals.

At the same time, the paucity of R1a-Z93 and derived lineages in Europe, including Eastern Europe, suggests that historic migrations originating in East and Central Asia, like those of the early Turks, had a negligible effect on the paternal ancestry of modern Europeans. This shows very clearly on the PCA in Figure 4 (see here).


Underhill et al., The phylogenetic and geographic structure of Y-chromosome haplogroup R1a, European Journal of Human Genetics, advance online publication, 26 March 2014; doi:10.1038/ejhg.2014.50

See also...

R1a-Z93 from Bronze Age Mongolia

Afghan Hindu Kush: a genetic sink

Saturday, March 15, 2014

PCA of ancient European mtDNA

The recent Wilde et al. paper on the ancient DNA of Eastern European steppe nomads included mitochondrial DNA (mtDNA) data for just over 60 of the studied individuals. Below is a Principal Component Analysis (PCA) featuring these samples, marked collectively as KGU, alongside the dataset from last year's Brandt et al. study on the genetic origins of Central Europeans.

Note that KGU falls closest to the Bernburg (BEC) and Unetice (UC) samples from Neolithic and Bronze Age eastern Germany, respectively. This is probably because all of these groups have similar levels of mtDNA haplogroups U5a and H. Moreover, UC is thought to be an Indo-European archaeological culture with origins in Eastern Europe. On the other hand, Brandt et al. hypothesized that BEC might have been of Scandinavian origin.

The Central European metapopulation (CEM) is composed of present-day individuals from Austria, Germany, Poland and the Czech Republic. Its position on the PCA plot suggests to me that modern Central Europeans are largely derived from Kurgan nomads, Bell Beakers from Iberia (BBC), and remnants of Neolithic farmers from the Near East, at least in terms of maternal ancestry.

In other words, I'd say the result correlates well with the findings of Brandt et al., who posited that long-range migrations from eastern and western Europe into the heart of the continent, particularly during the late Neolithic, played an important role in the formation of the modern Central European mtDNA gene pool.

Citations and credits...

Thanks to Eurogenes Project member PL16 for the PCA

Wilde et al., Direct evidence for positive selection of skin, hair, and eye pigmentation in Europeans during the last 5,000 y, PNAS, Published online before print on March 10, 2014, DOI: 10.1073/pnas.1316513111

Guido Brandt, Wolfgang Haak et al., Ancient DNA Reveals Key Stages in the Formation of Central European Mitochondrial Genetic Diversity, Science 11 October 2013: Vol. 342 no. 6155 pp. 257-261 DOI: 10.1126/science.1241844

See also...

Extreme positive selection for light skin, hair and eyes on the Pontic-Caspian steppe...or not