Genomic transformation and social organization during the Copper Age–Bronze Age transition in southern Iberia-3
Library preparation
We generated double-indexed double-stranded (79) UDG (“ds UDG-half”) (39) libraries using 25 μl of DNA extract and following established protocols (80). Libraries produced per individual are reported in table S1.1.
Shotgun screening and in-solution enrichment (1240k, MT-capture, and Y-capture)
Libraries were sequenced in-house on an Illumina HiSeq 4000 platform to an average depth of 5 million reads and after demultiplexing processed through EAGER (43). After an initial quality filter based on the presence of aDNA damage and endogenous DNA higher than 0.1%, libraries were subsequently enriched using in-solution capture probes synthetized by Agilent Technologies for ~1240k SNPs along the nuclear genome and independently for the complete mitogenome (40) following (41) and mappable regions of the Y chromosome (42), both in-house protocols. The captured libraries were sequenced for 20 to 40 million reads on average using either a single end (1 × 75-bp reads) or paired end configuration (2 × 50-bp reads). Results are reported in tables S1.2 and S1.3.
aDNA data processing
Read processing and DNA damage
After demultiplexing based on a specific pair of indexes, raw sequencing data were processed with EAGER (1.92.59) (43). First, the adaptor sequences were clipped with Adapter removal (v2.3.1) (81) and then mapped against the Human Reference Genome hs37d5 with BWA (v0.7.12) (82) aln and samse commands (-l 16500, -n0.01, -q30). We removed duplicates (reads with same start and end and same orientation) with DeDup (v0.12.2) (43). Last, we determined the deamination rate pattern in our UDG-half libraries using mapDamage (v.2.0.6) (83). All our libraries showed expected deamination patterns for UDG-half treatment, and according to this, we trimmed 2 bp on each terminal side before genotyping. A summary of quality statistics is given in table S1.2.
Genetic sex determination
We determined genetic sex at BAM file level by calculating the coverage on the X and Y chromosomes and the autosomes using a bed file of the regions captured by the 1240k SNP array. We normalized the X and Y reads by the autosomal coverage resulting in the X ratio and likewise for the Y ratio (table S1.1 and fig. S1). For uncontaminated libraries, we expect an X ratio of 1 and a Y ratio of 0 for females and an X and Y ratio of 0.5 for males (46). Individuals that fall in an intermediate position indicate the presence of DNA contamination from the opposite sex.
Contamination estimation
As first instance, we inspected the scatterplot of X/autosomes ratio versus Y/autosomes ratio from the sex determination to look for outliers that would indicate contamination of the opposite sex. We then estimated exogenous human contamination at nuclear and mitochondrial level. We used method 2 in ANGSD, which estimates heterozygosity at polymorphic sites on the X chromosome in males (44), applying a contamination threshold of 3% in individuals with at least 100 X-SNP positions covered twice. In cases where the number of SNPs was lower and for all other libraries from female individuals, we quantified the heterozygosity on the individual mitochondrial reads using ContamMix 1.0.10 (45) by comparing the consensus mitogenome sequence reconstructed from the MT-capture data with a comparative mitochondrial dataset of 311 global mitogenomes. Only one individual had a rate of contamination of 7.8%, while all others were below 1% (table S1.1). We visually inspected the reads with Geneious (v9.1.8) (84), looking specifically for background contamination at all variable sites but found no consistent pattern of contamination that would affect the haplotype calls.
Genotyping
We used our trimmed BAM files to call genotypes with the tool pilleupCaller (https://github.com/stschiff/sequenceTools), which randomly chooses one allele at every SNP position generating pseudo-haploid genotypes. We generated genotype files using the respective SNP lists of the 1240k panel (22) and the Affymetrix HO panel with the intersecting SNPs (~600,000 SNPs) (23, 54). Both of them were restricted to only autosomal SNPs for population genetics downstream analysis. Numbers of SNPs covered at least once in both HO and 1240k are given in table S1.1.
Mitochondrial haplogroup assignment
Following targeted enrichment, we could reconstruct complete mitochondrial genomes of 144 individuals, with an average coverage of 1.3 to 964.6 X (tables S1.3 and S1.4). To process mitochondrial DNA data, we extracted reads from mito-capture data and complemented with reads from 1240k capture and shotgun sequencing when necessary using SAMtools v1.3.1 (85) and mapping to the mitochondrial reference, exclusively. Then, we remapped reads to the revised Cambridge Reference Sequence (rCRS) using Geneious (v9.1.8) (84). To determine haplogroups from the consensus sequences reliably, we applied a threshold of at least 3000 reads per individual (table S1.4). Upon visual inspection of the assembly, we called and exported the consensus sequences in FASTA format to be used in HaploGrep2 v2.1.1 [available via https://haplogrep.uibk.ac.at/ for an automated mitochondrial haplogroup assignment based on PhyloTree (mtDNA tree build 17, available via www.phylotree.org/)] (86). Using Geneious (v9.1.8), we also visually double-checked whether specific private SNPs were covered and shared in pairs of individuals with an attested degree of relatedness (table S1.4).
Y-haplogroup assignment
By using an in-house protocol for Y capture (42), we were able to assign Y-chromosome haplogroups in male individuals. We genotyped the Y-chromosome reads using a Y-SNP list from the ISOGG (International Society of Genetic Genealogy) dataset included in the 1240k and Y-capture probes and by using an in-house script (A.B.R.) (table S1.1). This procedure allowed us to check in a semi-automatic way which positions were covered to assign an ancestral or derived haplogroup or to make corrections to calls in cases where, for instance, a more derived haplogroup was called because of residual ancient damage (C to T or G to A mismatches) in terminal read positions at diagnostic SNPs.
Kinship estimation
We calculated the PWMR (48, 87) in all pairs of individuals from our pseudo-haploid dataset to double-check for potential duplicate individuals and to determine first- and second-degree relatives. We split our individuals into two groups according to their genetic ancestry profiles (CA and BA individuals), assuming that the median PWMR of the two groups provides a background for unrelated individuals (77). We considered one-half of the unrelated value as baseline for identical samples/twins and the midpoint between the two for first-degree relatives (77). After running PWMR, we found no duplicates and removed one of each pair of first-degree relatives for downstream population genetic analyses (in all cases, the individuals with lower SNP coverage) (table S1.5 and text S5).
For the kinship study of La Almoloya, we compared and combined kinship estimates from several methods. We used READ (49) to determine first- and second-degree relatedness among individuals based on the proportion of nonmatching alleles (P0) in nonoverlapping windows of 1 mega base pairs (Mbps) and calculated standard errors. The normalized value of P0 is expected to be between 0 and 0.625 for identical twins, between 0.625 and 0.8125 for first-degree related individuals, between 0.8125 and 0.90625 for second-degree related individuals, and greater than 0.90625 for all other degrees of relatedness (which are reported as “unrelated”) (table S1.6 and text S5).
We also used the method LcMLkin (47), which uses genotype likelihoods to estimate the three k-coefficients (k0, k1, or k2) defined by (88), which define the probability that two individuals have zero, one, or two alleles identical by descendent (IBD) at a random site in the genome. As the method has to deal with missing data, it uses fractions of sites (haplotypes) that are identical by state (IBS). The length of these analyzed fractions can be modified with the thinning parameter --thin. We performed LcMLkin, exploring different thinning parameters (100,000, 50,000, 10,000, and 5000) to observe the behavior of first-degree estimates and the potential to distinguish between possible parent-offspring or sibling relationships (table S1.7 and text S5). To inform this approach contextually, we combined the genetic data with the archaeological information following this order of relevance: (i) We defined the anthropological age at death to determine sibling relatedness among subadults (before reaching reproductive age) or to inform on the directionality of parent-offspring relatives, as infants cannot be parents. (ii) When two unrelated adult male and female both are first degree related to (an)other individual(s), we identified these as parent-offspring relationship and used these trios to identify potential cases of half siblings. (iii) We used the stratigraphic information to confirm the directionality (Who died first?) to establish the type of first-degree relatedness, e.g., assuming that mothers cannot have passed away long before neonates. (iv) We used uniparentally inherited haplogroups to corroborate the established pedigrees.
Phenotypic and functional variants
We explored a list of 56 SNPs associated with known phenotypic traits. Forty-one of these are linked to physical appearance (skin, eye, or hair color), and we used the HIrisPlex method to calculate the probabilities of phenotype prediction (fig. S8 and table S2.19) (89). The other 15 SNPs are associated with the capacity to digest lactose in adulthood (rs4988235 in LCT), alcohol dehydrogenase (rs1229984 in ADH1B), adaptation to metabolize fatty acids (rs174546 in FADS1), predisposition to celiac disease (rs272872 in SLC22A4 and rs653178 in ATXN2/SH2B3), or resistance against infectious diseases [rs4833103 in the gene cluster TLR1-TLR6-TLR10 and rs2269424 in the major histocompatibility complex (MHC)]. We calculated the genotype likelihood to be homozygous or heterozygous for the ancestral (no effect) or derived (effect) allele based on the number of reads from BAM files after removing duplicates. We restricted the analysis to individuals with more than 400,000 SNPs covered in the 1240k panel, and results are summarized in fig. S9 and table S2.20.
Population genetic analysis
Dataset
We merged our final dataset with previously published datasets of ancient and modern individuals reported by the Reich Lab (https://reich.hms.harvard.edu/datasets; please see table S2.1 for a detailed list of references) as well as genotypes from recently published studies (table S2.1) (50–53).
Grouping and labeling individuals
A detailed description of new labels used in population genetics analysis is described in text S6 and table S2.1. We only included new individuals with more than 40,000 of the 1240k SNPs covered in the downstream population genetics analysis and lowered this threshold to >30,000 SNPs only for individuals forming part of a consistent cluster of individuals. In addition, we excluded groups that have consistently fewer than 60,000 SNPs in qpAdm modeling (table S2.1).
Principal components analysis
We computed PCA using smartpca software from the EIGENSOFT package (v6.0.1) (90) with the lsqproject and SHRINKMODE option YES and an extended list of modern populations from Eurasia and North Africa (23) and the Caucasus (91). The newly typed ancient individuals were projected onto PC1 and PC2.
F-statistics
F-statistics were computed with ADMIXTOOLS (https://github.com/DReichLab). For F4-statistics, we used the qpDstat and with the activated f4-mode. F3-statistics were calculated using qp3Pop. We used the 1240k panel in all our f-statistics to increase the number of SNPs covered by the ancient individuals. Standard errors were calculated with the default block jacknife.
Admixture modeling
We used ADMIXTURE (60) to define the main genetic cluster profiles of CA and BA Iberian individuals. We pruned our data for linkage disequilibrium in PLINK (92) with parameters --indep-pairwise 200 25 0.4 and –maf 0.01, which retained 243,848 autosomal SNPs for the HO dataset. ADMIXTURE was run with default fivefold cross-validation (--cv = 5), varying the number of ancestral populations between K = 2 and K = 18 with different random seeds. Each K was replicated five times.
We used the qpWave and qpAdm programs from the ADMIXTOOLS v3.0 package (https://github.com/DReichLab), with the “allsnps: YES” option to minimally reduce the number of SNPs used and subsequently increase the power to reject models, to model the ancestry in our new reported individuals. With qpWave, we estimated the minimum number of independent contributing ancestral sources of gene flow needed to explain a target population. With qpAdm, we quantified the proportion of genetic ancestry contributed by each source. The ancestry proportions in the target population are inferred on the basis of how the target population is differentially related to a set of reference/outgroups via the source populations.
For all the models applied here, we have used a set of 12 outgroups (Mbuti.DG, Ethiopia_4500BP.SG, Ust_Ishim.DG, Russia_MA1_HG.SG, Italy_Villabruna, Belgium_GoyetQ116_1, Han.DG, Onge.DG, Papuan.DG, AHG, CHG, and Morocco_Iberomaurusian). When we modeled ancestries with proximal sources, we added additional populations or rotated them to the outgroup set. The corresponding rationale is explained in each section of the Supplementary Materials.
Admixture dating
We used DATES v.753 to estimate the date of admixture among different previously fitted ancestral sources (https://github.com/priyamoorjani/DATES). The method works on modeling genotypes of the admixed test population as a linear combination of two population sources calculating the ancestry covariance coefficient. The results are given in generations (28 years, one generation; table S2.9) (93).
Sex bias
To test for potential sex bias during periods of admixture, we ran the distal (model 4; described in text S8) and proximal qpAdm models (models 2 and 4b; described in text S8) for the autosomes and in the X chromosome separately using the parameters allsnps: YES and chrom23: YES. If there was a sex bias related to any of the modeled sources of ancestry, we expect to observe increasing (female bias) or decreasing (male bias) amounts of this ancestry on the X chromosome compared to the autosomes, as the paternal contribution on X chromosome is about one-third when compared to the autosomes (the father only contributes one X chromosome to the female offspring). If the ancestry proportion is significantly higher in the autosomes, it implies that the contribution was male driven. Significance was calculated with z scores following previous studies (76, 77) (table S2.22).
Runs of homozygosity
We used hapROH to calculate the portion of the genome under runs of homozygosity (73), which can be applied to low-coverage genotype data. We included all CA and BA individuals from Iberia, which had more than 400,000 SNPs covered in the 1240k panel (table S2.21).
Testing for skewness in f3-statistics
To test for a significant change in the skewness of the distributions of f3-statistics of the form f3(female, female; Mbuti), f3(male, female; Mbuti), and f3(male, male; Mbuti), we measured the sample skewness for each distribution, denoted bff, bmf, and bmm, respectively. We then looked at the values for the pairwise skewness differences of the form δi, j = (bi − bj), where i, j ∈ (mm, mf, ff). We estimated the standard deviation of the δi, j, denoted σˆi,j by taking 1000 bootstrap samples. We then tested the null hypothesis that the δi, j = 0 by calculating a z score of the form Zi,j=δi,j/σˆi,j and then calculated a P value using a standard normal distribution. Our P values were ad hoc adjusted via the Bonferroni method to account for multiple hypothesis testing (table S2.23).
Testing for patrilocality via mean PWMR
To test the overall mean relatedness for each individual, we considered the PWMR for each pair of individuals, denoted pij (for individuals i and j). We then calculated the mean PWMR for each individual such that p¯i=∑i≠jpij. For the p¯i being attributed to groups A and B (male and female), we tested for a significant difference in sample location using a Wilcoxon rank sum test as implemented in the stats library for the R statistical software package (table S2.24).
Acknowledgments
We thank all members of the Archaeogenetics Department of the Max Planck Institute for the Science of Human History, especially the Population Genetics and PALEORIDER groups, for continued support and discussions. We thank G. Brandt, A. Wissgott, S. Clayton, and K. Prüfer for sequencing and handling of the raw data; A. Ghalichi for help with ADMIXTURE; and S. O’Reilly for the design of Fig. 1. We are indebted to the Museo de Lorca, Museo de Alcoy, Museo de Villena, Museo de Sevilla, and Museo de Alicante and all archaeologists and students involved in the excavations.
Funding: This work was supported by the Max Planck Society (V.V.-M. and W.H.); European Research Council (ERC) grant 771234—PALEoRIDER (W.H.); Spanish Ministry of Economy, Industry and Competitiveness project HAR2017-85962-P (C.O., C.R.-H., M.I.F., E.C.B., C.V.-F., V.L., R.M., and R.R.); AGAUR 2017SGR1044 (C.O., C.R.-H., M.I.F., E.C.B., C.V.-F., V.L., R.M., and R.R.); ICREA Academia program (R.R.); John Templeton Foundation grant 61220 (D.R.); and Paul Allen Family Foundation (D.R.). D.R. is an Investigator of the Howard Hughes Medical Institute.
Author contributions: V.V.-M., R.R., and W.H. conceived the study. C.O., C.R.-H., R.M., M.I.F., E.C.B., C.V.-F., K.W.A., D.C.S.-G., G.G.A., M.S.H.P., M.P.d.M.I., A.R., V.B., J.P., A.M., J.L., J.S., A.P.M., J.K., L.G.S., V.L., and R.R. provided archaeological material and advised on the archaeological background and interpretation. F.A., M.H., W.H., C.F., V.V.-M., and C.O. performed the laboratory work. V.V.-M., W.H., A.B.R., and A.C. performed data analysis, with J.K. providing feedback in the analysis and W.H. providing guidance and methodologies. V.V.-M., R.R., W.H., and A.B.R. wrote the manuscript with input from all coauthors.
Competing interests: The authors declare that they have no competing interests.
Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Genomic data (BAM format) are available at the European Nucleotide Archive under accession number PRJEB46907.
Supplementary Materials
This PDF file includes:
Supplementary Text
Figs. S1 to S9
Legends for tables S1 and S2
References
DOWNLOAD (1.85 MB)
Other Supplementary Material for this manuscript includes the following:
Tables S1 and S2
DOWNLOAD (1.26 MB)
View/request a protocol for this paper from Bio-protocol.
REFERENCES AND NOTES
https://www.science.org/doi/10.1126/sciadv.abi7038