--- title: "Impute NaCRRI DCas20_5360" site: workflowr::wflow_site date: "2020-November-11" output: workflowr::wflow_html editor_options: chunk_output_type: inline --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval = FALSE) ``` DArTseqLD (DCas20-5419). Contains pre-breeding materials of combined African and L. American descent. 1. 2019 East Africa Reference panel: 56250 SNP, 19136 clones included W. Africa landraces 2. Latin America Ref (4101 cclones, 65886 SNP) Suggest comparing the results using PCA, prediction, correlation of kinship matrices, etc. # Copy data Copy the imputation reference panel from 2019 to the `data/` folder. ```{bash,eval = FALSE} # mkdir /workdir/mw489/ cp -r /home/jj332_cas/marnin/IITA_EA_2020 /workdir/mw489/ cp -r /home/jj332_cas/CassavaGenotypeData/CassavaGeneticMap /workdir/mw489/IITA_EA_2020/data/ cp /home/jj332_cas/CassavaGenotypeData/nextgenImputation2019/ImputationEastAfrica_StageII_90919/chr*_ImputationReferencePanel_StageVI_91119.vcf.gz /workdir/mw489/IITA_EA_2020/data/ ``` # Impute Impute with [Beagle V5.0](https://faculty.washington.edu/browning/beagle/b5_0.html). Use the "imputation reference panel" dataset from 2019, e.g. `chr1_ImputationReferencePanel_StageVI_91119.vcf.gz` as reference. Used 1 large memory Cornell CBSU machine (e.g. [cbsulm16; 112 cores, 512 GB RAM](https://biohpc.cornell.edu/lab/hardware.aspx)), running 1 chromosome at a time. R functions are stored in the `code/` sub-directory. Functions sourced from e.g. **imputationFunctions.R** are wrappers around e.g. Beagle, and other command line programs. ```{bash, eval=F} cd /workdir/mw489/IITA_EA_2020/ ``` ```{r,eval = FALSE} targetVCFpath<-here::here("data/OrderAppendix_1_DCas20-5261/") # location of the targetVCF refVCFpath<-here::here("data/") mapPath<-here::here("data/CassavaGeneticMap/") outPath<-here::here("output/") outSuffix<-"DCas20_5261" ``` ```{r,eval = FALSE} source(here::here("code","imputationFunctions.R")) purrr::map(1:18,~runBeagle5(targetVCF=paste0(targetVCFpath,"chr",.,"_DCas20_5261.vcf.gz"), refVCF=paste0(refVCFpath,"chr",.,"_ImputationReferencePanel_StageVI_91119.vcf.gz"), mapFile=paste0(mapPath,"chr",.,"_cassava_cM_pred.v6_91019.map"), outName=paste0(outPath,"chr",.,"_DCas20_5261_EA_REFimputed"), nthreads=112)) ``` Clean up Beagle log files after run. Move to sub-directory `output/BeagleLogs/`. ```{bash,eval = FALSE} cd /workdir/mw489/IITA_EA_2020/output/; mkdir BeagleLogs; cp *_DCas20_5261_EA_REFimputed.log BeagleLogs/ cp -r BeagleLogs /home/jj332_cas/marnin/IITA_EA_2020/output/ cp *_DCas20_5261_EA_REFimputed* /home/jj332_cas/marnin/IITA_EA_2020/output/ ``` # Post-impute filter For now, the function will just do a fixed filter: AR2>0.75 (DR2>0.75 as of Beagle5.0), P_HWE>1e-20, MAF>0.005 [0.5%]. It can easily be modified in the future to include parameters to vary the filter specifications. Input parameters ```{r,eval = FALSE} #' @inPath path to input VCF-to-be-filtered, can be left null if path included in @inName . Must end in "/" #' @inName name of input VCF file EXCLUDING file extension. Assumes .vcf.gz #' @outPath path where filtered VCF and related are to be stored.Can be left null if path included in @outName . Must end in "/". #' @outName name desired for output EXCLUDING extension. Output will be .vcf.gz ``` Loop to filter all 18 VCF files in parallel ```{r,eval = FALSE} inPath<-here::here("output/") outPath<-here::here("output/") source(here::here("code","imputationFunctions.R")) require(furrr); options(mc.cores=ncores); plan(multiprocess) future_map(1:18,~postImputeFilter(inPath=inPath, inName=paste0("chr",.,"_DCas20_5261_EA_REFimputed"), outPath=outPath, outName=paste0("chr",.,"_DCas20_5261_EA_REFimputedAndFiltered"))) ``` Check what's left ```{r,eval = FALSE} purrr::map(1:18,~system(paste0("zcat ",here::here("output/"),"chr",.,"_DCas20_5261_EA_REFimputedAndFiltered.vcf.gz | wc -l"))) # 5506 # 2181 # 2402 # 2352 # 2325 # 2134 # 1023 # 2094 # 2147 # 1587 # 1743 # 2118 # 1441 # 3211 # 2409 # 1709 # 1622 # 1687 ``` ```{bash, eval=F} cd /workdir/mw489/IITA_EA_2020/output/; cp -r *_DCas20_5261_EA_REFimputed* /home/jj332_cas/marnin/IITA_EA_2020/output/ ``` # Formats for downstream analysis The function below will (1) convert the input VCF to plink1.9 binary format and (2) convert the plink binary to a dosage (0,1,2) matrix with special attention to which allele gets counted in the file. **NOTICE:** I was worried about `plink1.9` changing allele codes between files. There is some risk the counted allele could switch between e.g. the reference panel and the progeny files because of allele freq. (see plink documentation). To avoid this, went to extra trouble: write a file suffixed `*.alleleToCount` listing SNP ID (column 1) and the ALT allele from the VCF (column 2). Pass the file to `plink1.9` using the `--recode-allele` flag to ensure all output dosages count the ALT allele consistent with the VCFs. The reason to use `plink1.9` is that `Beagle5` imputed files don't have a **DS** (dosage) field that can be directly extracted. Instead, phased genotypes e.g. `0|1` need to be converted to dosages (e.g. `0|1 --> 1`, `1|1 --> 2`). An alternative might be to extract the haplotypes using `vcftools` and manually (in R) computed the dosages; that would give most control but is slow. ```{bash, eval=F} cd /home/jj332_cas/marnin/IITA_EA_2020/; ``` ```{r, eval=F} library(tidyverse); library(magrittr); source(here::here("code","imputationFunctions.R")) require(furrr); options(mc.cores=18); plan(multiprocess) pathOut<-here::here("output/") # Imputation reference panel future_map(1:18,~convertVCFtoDosage(pathIn="/home/jj332_cas/CassavaGenotypeData/nextgenImputation2019/ImputationEastAfrica_StageII_90919/", pathOut=pathOut, vcfName = paste0("chr",.,"_ImputationReferencePanel_StageVI_91119"))) # DCas20_5261 future_map(1:18,~convertVCFtoDosage(pathIn=here::here("output/"),pathOut=pathOut, vcfName = paste0("chr",.,"_DCas20_5261_EA_REFimputedAndFiltered"))) # Genome-wide dosage (for use in R) for each dataset # Imputation reference panels createGenomewideDosage(pathIn = here::here("output/"), chroms=1:18, "_ImputationReferencePanel_StageVI_91119") # DCas20_5261 createGenomewideDosage(pathIn = here::here("output/"), chroms=1:18, "_DCas20_5261_EA_REFimputedAndFiltered") ```