Cassava (Manihot esculenta Crantz) is grown throughout tropical Africa, Asia and the Americas for its starchy storage roots, and feeds an estimated 750 million people each day. Farmers choose it for its high productivity and its ability to withstand a variety of environmental conditions (including significant water stress) in which other crops fail. However, it has low protein content and is susceptible to a range of biotic stresses. Despite these problems, cassava's production potential is enormous, and its capacity to grow in a variety of environmental conditions makes it a plant of the future for emerging tropical nations. Cassava is also an excellent energy source: its roots contain 20-40% starch that costs 15-30% less to produce per hectare than starch from corn, making it an attractive and strategic source of renewable energy.

The cassava genome project has built upon a pilot initiated through the DOE-JGI Community Sequencing Program (CSP) by a 14-member consortium led by Claude Fauquet, Joe Tohme and Pablo Rabinowicz. This pilot project produced a little under 1x coverage from over 700,000 Sanger shotgun reads using plasmid and fosmid libraries; it provided insights into the overall characteristics of the cassava genome and a valuable source of Sanger paired-end sequences for later use. Next, a draft 454-based assembly (v4.1) was generated in a project led by Steve Rounsley, Dan Rokhsar, Chinnappa Kodira, and Tim Harkins. This began in Spring 2009, when 454 Life Sciences, a Roche company, partnered with DOE-JGI to provide the resources for whole-genome shotgun sequencing of cassava using the 454 GS FLX Titanium platform.

The current assembly (v6.0) is a brand new Illumina-based assembly from the same genotype, AM560-2. The scaffolds have been anchored onto chromosomes using a high resolution genetic map. This assembly was supported by a grant to Steve Rounsley and Dan Rokhsar from the Bill & Melinda Gates Foundation (OPPGD1493).

Genome

The main genome assembly is approximately 582.25 Mb arranged on 18 chromosomes plus 2,001 scaffolds that have not yet been anchored on chromosomes.

Approximately 495.48 Mb arranged in 40,044 contigs (~14.9% gap)
Scaffold N50 (L50) = 10 (28.12 Mb)
Contig N50 (L50) = 5,394 (26.69 kb)
317 scaffolds are > 50 kb in size, representing approximately 96.2% of the genome
The assembly is 35.9% GC
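The contiguity figures above follow the convention used on this page, in which "N50" is the number of sequences and "L50" is the length of the shortest sequence in the smallest set that covers half of the assembly (some other sources swap the two names). The following is a minimal sketch of how such statistics can be computed; the function names and the toy lengths are illustrative assumptions, not part of the assembly pipeline.

```python
def n50_stats(lengths):
    """Return (count, length) for the smallest set of longest sequences
    whose summed length reaches half of the total.

    On this page the count is reported as "N50" and the length as "L50".
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return count, length


def gap_percent(assembly_mb, contig_mb):
    """Percentage of the assembly span that is not covered by contigs."""
    return round((assembly_mb - contig_mb) / assembly_mb * 100, 1)


# Toy lengths (hypothetical): total 30, so half is 15; 10 + 8 = 18 >= 15.
print(n50_stats([10, 8, 6, 4, 2]))   # (2, 8)

# Figures quoted above: 582.25 Mb assembly, 495.48 Mb in contigs.
print(gap_percent(582.25, 495.48))   # 14.9
```

Run on the real scaffold lengths, the same calculation would be expected to reproduce the reported scaffold values of 10 and 28.12 Mb.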

Loci

33,033 total loci containing protein-coding transcripts

Alternative Transcripts

8,348 total alternatively spliced transcripts

Sequence data were generated from a partially inbred (third generation self, or S3, of MCOL1505) line called AM560-2 which was generated at CIAT (International Center for Tropical Agriculture) in Cali, Colombia. The previous assembly (v4.1) was based on the same accession.

Short fragment libraries were generated by Jessica Lyons at UC Berkeley and sequenced to 125X depth; 6 kbp insert mate-pair libraries were constructed and sequenced by Jane Grimwood and Jeremy Schmutz at the HudsonAlpha Institute. For fosmid end sequencing and long-range (Hi-C) sequencing, Becky Bart at the Donald Danforth Plant Science Center grew and collected etiolated leaves, from which DNA was extracted into low-melting agarose plugs by Julia Vrebalov at the Boyce Thompson Institute. Fosmid libraries with 20 kb inserts were constructed and sequenced by Lucigen Corp., and long-range sequence information was generated by Dovetail Genomics Inc. (Putnam et al., 2015 arXiv.org).

Sequence information from a wide range of insert sizes was quality controlled and assembled de novo with Platanus (Kajitani and Itoh, 2014) by Jessen Bredeson at UC Berkeley and long-range information was integrated by Dovetail Genomics Inc. The resulting scaffolds were ordered and oriented by mapping the markers onto a genetic map containing 22,403 markers that had been generated previously (International Cassava Genetic Map Consortium, 2014). The vast majority of the assembly (518.51 Mbp of 582.28 Mbp (89.0%)) could be anchored to 18 chromosomes.

Although cassava has an estimated genome size of ~772 Mbp (Awoleye & Novak, 1994), this assembly spans only 582.28 Mbp. Despite the discrepancy, we believe that the assembly represents nearly all of the genic regions of the genome (see below), and that the missing portion is repetitive sequence that could not yet be assembled. Ongoing genome sequencing on the PacBio platform is attempting to assemble more of the repetitive sequence in the genome to increase the completeness of the assembly.
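As a quick check of the proportions quoted in the two paragraphs above, the arithmetic can be reproduced directly from the stated figures. This is only a sketch using the numbers given in the text; the variable names are illustrative.

```python
# Figures quoted in the text (all in Mbp).
assembly_span = 582.28
anchored = 518.51
estimated_genome = 772.0  # approximate estimate (Awoleye & Novak, 1994)

anchored_pct = round(anchored / assembly_span * 100, 1)
assembled_pct = round(assembly_span / estimated_genome * 100, 1)
unassembled = round(estimated_genome - assembly_span, 2)

print(anchored_pct)    # 89.0  (share of the assembly anchored to chromosomes)
print(assembled_pct)   # 75.4  (share of the estimated genome that is assembled)
print(unassembled)     # 189.72 (Mbp not represented, presumed mostly repetitive)
```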

We were able to map 91.8% of the 85,665 EST sequences available in the NCBI nucleotide database, showing near-complete coverage of protein-coding genes in the assembly.

To find and mask repeats in the genome sequence, RepeatModeler v1.0.8 (Smit, 2008-2015) was run on the v6.0 assembly and generated 1,275 repetitive sequences (median length 576 bp, totaling 1,331,368 bp). Most of these sequences were unknown, but ~28% could be classified as LTR, ~6% as DNA and ~3% as LINE/SINE elements. However, the software does not take into account that plants contain large families of genes with similar sequence, and these are often included in the putative repeat library. To remove these gene-like sequences from the repeat library, PFAM and PANTHER annotations were predicted on the repeat sequences, and those with annotations known to correspond to genes were removed. A further 155 sequences were manually inspected because they were annotated with predicted protein domains that are often associated with protein-coding genes. Four of these sequences were removed from the repeat library, leaving 151. A further 1,101 sequences were similar to repeat sequences and were added to make 1,252 sequences. Lastly, 176 Manihot repeats from Repbase (downloaded Jan 15th, 2015) were added to make the final Manihot repeat library. These 1,428 sequences were used to mask the genome with RepeatMasker, which masked 49.3% of the genome.
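The bookkeeping for the repeat library can be followed step by step using only the counts stated above. This is a sketch of the arithmetic, not of the curation itself; the variable names are illustrative.

```python
# Counts taken from the paragraph above.
manually_inspected = 155
removed_after_inspection = 4
similar_to_repeats = 1_101
repbase_manihot = 176

retained_after_inspection = manually_inspected - removed_after_inspection
curated_library = retained_after_inspection + similar_to_repeats
final_library = curated_library + repbase_manihot

print(retained_after_inspection)  # 151
print(curated_library)            # 1252
print(final_library)              # 1428

# Mean length of the initial RepeatModeler output (1,331,368 bp in 1,275
# sequences); it is well above the 576 bp median quoted in the text.
print(round(1_331_368 / 1_275))   # 1044
```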

To produce the current "Cassava V6.1" gene set, we used the homology-based gene prediction programs FgenesH and GenomeScan, along with the PASA program to integrate expression information from cassava ESTs and RNA-Seq. The gene set shown on the browser was produced by Simon Prochnik at JGI.

Transcript data from three sources were integrated. First, RNA-Seq reads from root and shoot tissues of the Albert and Namikonga varieties, with and without challenge by CBSV, in 1x50 (1,055,722,008 initial reads; 895,271,180 reads after quality trimming) and 2x100 (340,899,946 initial reads; 282,586,400 reads after quality trimming) formats were aligned to the genome and assembled with the in-house software Pertran (Shu et al., manuscript in preparation). This yielded 51,588 and 62,488 transcript assemblies from PE and SE reads respectively. These were aligned to the genome with PASA (90% identity and 60% coverage cutoffs) to make 69,624 aligned assemblies. In addition, ESTs from previous 454 sequencing (Prochnik et al. 2011) (1,187,328 from root and 299,509 from leaf) were assembled with Pertran, added to 80,459 ESTs from GenBank, and aligned to the genome with PASA (95% identity, 60% coverage) to generate 27,470 aligned assemblies.
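The identity and coverage cutoffs above act as a simple two-threshold filter on each alignment. The sketch below illustrates that idea and the read retention after trimming. It is not PASA's implementation: the function name, the source labels, and the choice to treat the cutoffs as inclusive are assumptions made for illustration.

```python
# (minimum identity, minimum coverage) per data source, as quoted in the text.
CUTOFFS = {
    "rnaseq": (0.90, 0.60),  # Pertran assemblies from RNA-Seq reads
    "est": (0.95, 0.60),     # 454 and GenBank ESTs
}


def passes_cutoffs(identity, coverage, source):
    """True if an alignment meets both thresholds for its source.

    Assumption: thresholds are inclusive (>=); the text does not say.
    """
    min_identity, min_coverage = CUTOFFS[source]
    return identity >= min_identity and coverage >= min_coverage


print(passes_cutoffs(0.92, 0.75, "rnaseq"))  # True
print(passes_cutoffs(0.92, 0.75, "est"))     # False: identity below 95%

# Share of reads kept by quality trimming, from the counts in the text.
print(round(895_271_180 / 1_055_722_008 * 100, 1))  # 84.8 (1x50)
print(round(282_586_400 / 340_899_946 * 100, 1))    # 82.9 (2x100)
```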

Loci were determined from the coordinates of transcript assembly alignments and/or EXONERATE alignments of proteins from Ricinus communis (TIGR release 0.1), Arabidopsis thaliana Columbia (TAIR10) and Populus trichocarpa (v3) to the cassava v6.0 genome, which had been soft-masked for repeats using RepeatMasker (Smit, 1996-2012). Each locus was extended by up to 2 kb in both directions unless this extended into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001).
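The locus extension rule described above (up to 2 kb on each side, stopping at a neighbouring locus on the same strand) can be sketched as follows. This is an illustrative reading of the rule, not the pipeline's code: the function name, the 1-based inclusive coordinates, and the assumption that neighbours do not overlap the locus are all hypothetical.

```python
def extend_locus(start, end, same_strand_neighbors, max_extension=2000):
    """Extend a locus by up to max_extension bp on each side.

    Coordinates are assumed to be 1-based and inclusive. The extension is
    clipped so that it never runs into another locus on the same strand,
    and never goes below position 1.
    """
    new_start = max(1, start - max_extension)
    new_end = end + max_extension
    for n_start, n_end in same_strand_neighbors:
        if n_end < start:                      # neighbour lies upstream
            new_start = max(new_start, n_end + 1)
        elif n_start > end:                    # neighbour lies downstream
            new_end = min(new_end, n_start - 1)
    return new_start, new_end


# No neighbours: the full 2 kb is added on both sides.
print(extend_locus(5000, 6000, []))                            # (3000, 8000)
# Neighbours within 2 kb on both sides clip the extension.
print(extend_locus(5000, 6000, [(3500, 4000), (6500, 7000)]))  # (4001, 6499)
```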

The highest-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a C-score and protein coverage. The C-score is the ratio of a protein's BLASTP score to the mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best homolog. PASA-improved transcripts were selected based on C-score, protein coverage, EST coverage, and the extent to which their CDS overlaps repeats. A transcript whose CDS overlaps repeats by less than 20% was selected if its C-score and protein coverage were both at least 0.5, or if it had EST coverage. A gene model whose CDS overlaps repeats by more than 20% was selected only if its C-score was at least 0.9 and its homology coverage at least 70%. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% covered by Pfam TE domains were removed. Genes with apparently truncated ORFs may be prediction errors or pseudogenes.
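The selection thresholds in the paragraph above amount to a small decision rule, sketched below under stated assumptions. The class and function names are hypothetical, all fractions are expressed on a 0 to 1 scale, and a CDS repeat overlap of exactly 20% (which the text leaves unspecified) is treated here under the stricter rule. This is not the annotation pipeline's actual code.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    c_score: float             # BLASTP score / mutual-best-hit BLASTP score
    protein_coverage: float    # fraction of protein aligned to best homolog
    has_est_coverage: bool
    cds_repeat_overlap: float  # fraction of CDS overlapping repeats
    pfam_te_fraction: float = 0.0  # fraction of protein in Pfam TE domains


def c_score(blastp_score, mbh_blastp_score):
    """C-score as defined in the text: a ratio of two BLASTP scores."""
    return blastp_score / mbh_blastp_score


def is_selected(t):
    """Apply the homology/EST rule, then the Pfam TE filter."""
    if t.cds_repeat_overlap < 0.20:
        keep = (t.c_score >= 0.5 and t.protein_coverage >= 0.5) or t.has_est_coverage
    else:
        # Assumption: exactly 20% overlap falls under the stricter rule.
        keep = t.c_score >= 0.9 and t.protein_coverage >= 0.70
    return keep and t.pfam_te_fraction <= 0.30


# Low repeat overlap, weak homology, but EST-supported: kept.
print(is_selected(Transcript(0.1, 0.1, True, 0.05)))    # True
# High repeat overlap: EST support alone is not enough.
print(is_selected(Transcript(0.6, 0.6, True, 0.50)))    # False
# High repeat overlap with strong homology: kept.
print(is_selected(Transcript(0.95, 0.80, False, 0.50))) # True
```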

Principal Collaborator: Steve Rounsley (Dow Agrosciences) (email: rounsley AT email DOT arizona DOT edu)
JGI Contact: Simon Prochnik (email: seprochnik AT lbl DOT gov) 22523606