globalCassava_NoFlab.vcf.gz: Cassava HapMapII excluding M. flabellifolia (NoFlab) samples. 30X WGS data for millions of SNPs were imputed with Beagle. Data are originally from Punna et al. 2017. Nature Genetics.

globalCassava_NoFlab_prune50by10pt3 PLINK Binary (.bed, .bim, .fam): The dataset (globalCassava_NoFlab.vcf.gz) was LD pruned using plink1.9 --indep-pairwise, with a window size of 50 SNPs, a step size of 10 and an LD r-squared threshold of 0.3.

globalCassava_hapCutCorrectedPhased.vcf.gz: HAPCUT phased HapMapII dataset.

MeanGlazDist_1000SNPwindows_globalCassava_NoFlab_prune50by10pt3.rds: R dataset containing a data.frame, derived from the LD pruned HapMapII dataset (globalCassava_NoFlab_prune50by10pt3). One row per CLONE per 1000 SNP window. MeanGlazDist = mean Hamming distance for a 1000 SNP window between the *M. glaziovii* reference panel and each cultivated-cassava sample.

MglazioviiReferencePanels_Top10perWindow_MeanGlazDist.rds: Derived from MeanGlazDist_1000SNPwindows_globalCassava_NoFlab_prune50by10pt3.rds. 10 rows per 1000 SNP window, indicating the 10 CLONEs with the greatest MeanGlazDist. These clones were considered non-introgressed cassava; they were used on a window-specific basis and compared to 8 M. glaziovii samples in order to discover introgression diagnostic markers.

IDMdosages.rds: R dataset (rds) containing a nested data.frame. Each row gives the population name (Group), a vector of sample names (list-type column headed 'SampleList') and a dosage table (list-type column headed 'Dosage'). Each element of 'Dosage' is a data.frame with allelic dosages at introgression diagnostic markers (IDMs). The first 6 columns are sample meta-data (IID = clone name). Each remaining column contains integers 0, 1 or 2 counting the number of M. glaziovii diagnostic alleles at each IDM.

LDwithIDMs.rds: R dataset (rds) containing a nested data.frame. For each SNP, gives the MaxLD and the TotalLD with IDM statistics, as described in the manuscript.
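The window-wise MeanGlazDist and the selection of the 10 most distant clones per window, as described for the two MeanGlazDist files above, can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in this README: genotypes are coded as 0/1/2 dosages with None for missing calls, the Hamming distance is taken as the count of sites where two genotypes differ, and the function names (hamming, mean_glaz_dist, top_clones_per_window) are hypothetical. It is not the authors' original code.

```python
from statistics import mean


def hamming(a, b):
    """Count sites where both genotypes are called and differ (assumed definition)."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def mean_glaz_dist(sample, glaz_panel, window_size=1000):
    """Mean distance between one sample and every panel member, per SNP window."""
    out = []
    for start in range(0, len(sample), window_size):
        stop = start + window_size
        dists = [hamming(sample[start:stop], g[start:stop]) for g in glaz_panel]
        out.append(mean(dists))
    return out


def top_clones_per_window(dist_by_clone, n_top=10):
    """For each window, list the n_top clones with the greatest mean distance."""
    n_windows = len(next(iter(dist_by_clone.values())))
    result = []
    for w in range(n_windows):
        ranked = sorted(dist_by_clone,
                        key=lambda c: dist_by_clone[c][w],
                        reverse=True)
        result.append(ranked[:n_top])
    return result


if __name__ == "__main__":
    # Toy data: 4 SNPs, windows of 2 SNPs, 2 hypothetical M. glaziovii samples.
    panel = [[2, 2, 2, 2], [2, 2, 0, 2]]
    clones = {"A": [0, 0, 0, 0], "B": [2, 2, 2, 2], "C": [0, 2, 2, 2]}
    dists = {c: mean_glaz_dist(g, panel, window_size=2) for c, g in clones.items()}
    print(dists)
    print(top_clones_per_window(dists, n_top=1))
```

With the real data, window_size would be 1000 and n_top would be 10, matching the file descriptions; the toy values are only there to keep the example readable.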
Dosages_tagIDMplusNonIDM_RefPanel.raw: Dosage table in the same format as those found for IDMs in IDMdosages.rds. N individuals (rows) x P SNPs (columns). Tag-IDM and non-IDM SNPs are included. The entire imputation reference panel ('RefPanel'), as described in the Methods, is included in this file. The first 6 columns are sample meta-data (IID = clone name). Each remaining column contains integers 0, 1 or 2 counting the major allele, which is indicated in the column heading.

PCAs_AllPhenotypedSamples_ALLvsIDMvsNonIDM.rds: R dataset containing a data.frame. PCAs were conducted using prcomp() in R. Columns are:
- Category = indicates whether all (ALL), IDM-only (IDM) or nonIDM-only (nonIDM) SNPs were used in the PCA.
- SNPset = list-column containing vectors of SNP IDs used for the corresponding PCA.
- Loadings = list-column containing data.frames of loadings (eigenvector coefficients) for the first 50 PCs.
- Scores = list-column containing data.frames of PC scores for each sample on the first 50 PCs.
- PVE = list-column containing data.frames with the percent variance explained (PVE) by the first 50 PCs.

meanDoseGlaz_250Kwindows_AllPops.rds: R dataset containing a data.frame. The mean M. glaziovii allele dosage at IDMs in 250Kb windows across the genome was calculated. Columns are:
- Group = population (e.g. Genetic Gain = GG).
- Chr = chromosome.
- doseGlazInWindows = list-column of data.frames with mean M. glaziovii dosages for each clone in each window. Also gives the number of IDM SNPs in each window (Nidm) and the beginning (Start) and ending (Stop) base-pair.

meanDoseGlaz_250Kwindows_HapMapII_HAPMIX.rds: Same as 'meanDoseGlaz_250Kwindows_AllPops.rds' but for the HapMapII population only and based on HAPMIX.

meanDoseGlaz_250Kwindows_HapMapII_IDMs.rds: Same as 'meanDoseGlaz_250Kwindows_AllPops.rds' but for the HapMapII population only and based on IDMs.

IntrogressionMap_[Group]_[dpi]dpi.png: Introgression plots with sufficient detail to distinguish individual clones.
For each plot, the mean M. glaziovii allele dosage at IDMs in 250Kb windows across the genome is depicted. Physical position on each chromosome is in megabases (Mb) along the x-axis. Colors range from orange (0 M.g. alleles), to green (1 M.g. allele), to dark blue (2 M.g. alleles). Each row (y-axis) is an individual cassava clone with its name on the left. The file names indicate the population [Group] and resolution [dpi] of the image.

GRMs_AllNextGenSamples.rds: Kinship matrices for 2742 clones, made with various partitions of SNPs and stored in a data.frame. Columns are:
- Category = indicates whether ALL, IDM, nonIDM or one of 5 random partitions of SNPs (Nidm vs. Nnonidm) was used.
- Chr = list-type column with lists of SNPs used to construct the kinship.
- GRM = list-column of square-symmetric kinship or genomic relationship matrices (GRMs).

FieldTrialData.rds: Each row of this data.frame contains data about a specific trait in a specific trial. Information on institute, trait, location, year-of-harvest and a trial-name are given. In addition, the numbers of observations (Nobs), clones (Nclone) and replications (Nrep), plus the ratio Nobs/Nclone, are given. The column 'data' is list-type, each element containing the actual plot-level data for each trait-trial. Columns of 'data' include:
- CLONE, VCFCLONE, DOSAGECLONE = variants of sample IDs.
- LOC.YEAR.TRIAL.REP = gives a unique ID for replications where relevant.
- TRIAL.NAME = ...
- PLOT_ID and PLOT_NAME = unique IDs for the plot, where available.
- NOHAV = number of plants harvested, if any.
- Value = the phenotypic record.

CuratedFieldTrialData.rds: Each row of this data.frame contains a field-plot record for a specific trait. This dataset was originally derived from FieldTrialData.rds. The remaining records come from trials that passed the filters described in the manuscript.

BLUPsForGWASandGP.rds: Each row of this data.frame contains meta-information, model outputs and BLUPs from one of 19 trait-institute datasets.
Linear mixed models were fit in R using the function lme4::lmer(). Columns include:
- data = list-type column of the raw data for a trait-institute dataset.
- H2 = Broad-sense heritability (Vg / (Vg + Ve)) based on variance components estimated by lmer().
- Vg = Genetic variance component estimate.
- BLUPs = list-column containing BLUPs for CLONE, PEVs, reliabilities (REL), deregressed BLUPs (drgBLUP) and weights (WT) as described in the text.
- Model = The lme4-style formula for the linear mixed model that was fit.

SNPwiseGWAS_GCTAresults.rds: GWAS results from running GCTA. Columns include:
- All outputs from GCTA.
- Trait, Institute.
- Bonferroni threshold (SNPbonf), significant at SNPbonf level or not (Sig).
- Status as an introgression diagnostic marker or not (IDM).
- ExpectedP for QQ-plotting.

GWASon250KbWindowMeans.rds: GWAS results from running sommer on the mean M. glaziovii dosage for each 250Kb window. Columns include:
- Trait, Institute.
- Segment, Chr, StartPos (in bp), EndPos (also bp), MidPoint to identify the position of the segment.
- Bonferroni threshold (SEGbonf), significant at SNPbonf level or not (Sig).
- p-value (p) and effect-estimate (Effect_reml).
- ExpectedP for QQ-plotting.

CrossValidationResults.rds: Raw cross-validation results. Includes lists of training-test samples for each fold/rep, the seed used for set.seed() when sampling train-test partitions (for reproducibility), raw sommer() output and accuracy estimates.

info_delMut_22495.txt: List of deleterious mutations from a previous publication (Punna et al. 2017. Nature Genetics). See text for the full citation for this table.

c[1-18]_GGplusC1_DeleteriousSites.raw: Dosage table in the same format as those found for IDMs in IDMdosages.rds. N individuals (rows) x P SNPs (columns). Prefixes, e.g. c1_, indicate the chromosome. The first 6 columns are sample meta-data (IID = clone name). Each remaining column gives the number of deleterious alleles (0, 1 or 2) at each deleterious mutation.
Only ~9.7K mutations were genotypeable based on the available IMPUTE2 dataset, which is described in the text. The LG, GG and C1 populations were included in this dataset.

GeneticLoadPerIndividual.rds: For each group, a summary of the per-individual genetic load. The column 'GeneticLoad' is list-type, containing a data.frame for each group. Each 'GeneticLoad' summary includes the following:
- FID = sample ID.
- TotalLoad = number of deleterious alleles.
- nNotNA = number of loci that were genotyped (not missing).
- RelativeLoad = TotalLoad/(2*nNotNA).
- NHom and Nhet = number of homozygous and heterozygous deleterious genotypes.
- RelHom = nHom/nNotNA.
- RelHet = nHet/nNotNA.
The suffix 1 or 4 is added to the column header to indicate that the summary is over the introgression regions (chr. 1 from 25Mb+, chr. 4 from 5-25Mb).

Reference_panel_and_progenies folder: Includes the GBS VCF file and relevant dosage data for both the reference panel and the genomic selection progenies.
- Reference_panel folder: Includes all cassava accessions used as the reference population for imputation of genomic selection progenies.
- Reference_panel_imputed folder: Reference panel after imputation using Beagle 4.0 (see Material and Method section).
- IITA_GSprogenies_imputed folder: IITA cassava progenies GBS data imputed with Beagle 4.0 using the reference panel (see Material and Method section).

Corresponding population names and url in cassavabase.org:
Reference_panel_June16:
IITA_GSprogenies_imputed_June16:
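
The per-individual summaries listed for GeneticLoadPerIndividual.rds follow directly from a row of deleterious-allele dosages, such as one row of a c[1-18]_GGplusC1_DeleteriousSites.raw table. The sketch below is a minimal Python illustration of those formulas. It assumes dosages are given as 0, 1 or 2 with None for missing genotypes; the function name genetic_load and the dictionary keys are hypothetical and chosen to mirror the column descriptions, not taken from the authors' code.

```python
def genetic_load(dosages):
    """Summarize genetic load for one individual.

    dosages: iterable of 0, 1, 2 (deleterious allele counts) or None (missing).
    """
    called = [d for d in dosages if d is not None]
    n_not_na = len(called)                        # genotyped loci
    total_load = sum(called)                      # deleterious alleles carried
    n_hom = sum(1 for d in called if d == 2)      # homozygous deleterious
    n_het = sum(1 for d in called if d == 1)      # heterozygous deleterious

    def ratio(num, den):
        # Avoid division by zero when no locus was genotyped.
        return num / den if den else None

    return {
        "TotalLoad": total_load,
        "nNotNA": n_not_na,
        "RelativeLoad": ratio(total_load, 2 * n_not_na),
        "nHom": n_hom,
        "nHet": n_het,
        "RelHom": ratio(n_hom, n_not_na),
        "RelHet": ratio(n_het, n_not_na),
    }


if __name__ == "__main__":
    # Toy individual: six loci, one of them missing.
    print(genetic_load([0, 1, 2, None, 2, 1]))
```

Restricting the input to loci inside an introgression region (for example chr. 1 from 25Mb onward) would give the region-specific summaries that carry the suffix 1 or 4.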