Posada, D., Crandall, K. A. Membrebe, J. V., Suchard, M. A., Rambaut, A., Baele, G. & Lemey, P. Bayesian inference of evolutionary histories under time-dependent substitution rates. Our results indicate the presence of a single lineage circulating in bats with properties that allowed it to infect human cells, as previously described for bat sarbecoviruses related to the first SARS-CoV lineage (refs. 29–31). Emergence of SARS-CoV-2 through recombination and strong purifying selection. Accurate estimation of ages for deeper nodes would require adequate accommodation of time-dependent rate variation. This long divergence period suggests there are unsampled virus lineages circulating in horseshoe bats that have zoonotic potential due to the ancestral position of the human-adapted contact residues in the SARS-CoV-2 RBD. Epidemiology, genetic recombination, and pathogenesis of coronaviruses. Menachery, V. D. et al. SARS-like WIV1-CoV poised for human emergence. Individual sequences such as RpShaanxi2011, Guangxi GX2013 and two sequences from Zhejiang Province (CoVZXC21/CoVZC45), as previously shown (refs. 22, 25), have strong phylogenetic recombination signals because they fall on different evolutionary lineages (with bootstrap support >80%) depending on which region of the genome is being examined. a–c, Root-to-tip (RtT) divergence as a function of sampling time for the three coronavirus evolutionary histories unfolding over different timescales: HCoV-OC43 (n = 37; a), MERS (n = 35; b) and SARS (n = 69; c). Temporal signal was tested using a recently developed marginal likelihood estimation procedure (ref. 41; Supplementary Table 1). Lam, H. M., Ratmann, O. 88, 7070–7082 (2014). This provides compelling support for the SARS-CoV-2 lineage being the consequence of a direct or nearly direct zoonotic jump from bats, because the key ACE2-binding residues were present in viruses circulating in bats.
A., Filip, I., AlQuraishi, M. & Rabadan, R. Recombination and lineage-specific mutations led to the emergence of SARS-CoV-2. 2 Lack of root-to-tip temporal signal in SARS-CoV-2. If the latter still identified non-negligible recombination signal, we removed additional genomes that were identified as major contributors to the remaining signal. Here, we analyzed the diversity of SARS-CoV-2 viral genomes in India to understand the evolutionary patterns of viruses in the country through their Pangolin lineage and GISAID clade. Lin, X. et al. Genetics 172, 2665–2681 (2006). All authors contributed to analyses and interpretations. Now, the two researchers used genomic sequencing to compare the genome of the new coronavirus in humans with that in animals and found a 99% match with pangolins. Its genome is closest to that of severe acute respiratory syndrome-related coronaviruses from horseshoe bats, and its receptor-binding domain is closest to that of pangolin viruses. Wu, F. et al. Pangolin-CoV is 91.02% and 90.55% identical to SARS-CoV-2 and BatCoV RaTG13, respectively, at the whole-genome level. Lancet 383, 541–548 (2013). Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. Wang, H., Pipes, L. & Nielsen, R. Synonymous mutations and the molecular evolution of SARS-Cov-2 origins. One study suggests that over a century ago, one lineage of coronavirus circulating in bats gave rise to SARS-CoV-2, RaTG13 and a pangolin coronavirus known as Pangolin-2019 (Live Science). 11,12,13,22,28), a signal that suggests recombination, the divergence patterns in the S protein do not show evidence of recombination between the lineage leading to SARS-CoV-2 and known sarbecoviruses. Root-to-tip divergence as a function of sampling time for non-recombinant regions NRR1 and NRR2 and recombination-masked alignment set NRA3. BEAST inferences made use of the BEAGLE v.3 library (ref. 68) for efficient likelihood computations.
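The root-to-tip analysis mentioned above regresses each tip's genetic divergence from the root against its sampling date; a positive slope is a rough estimate of the evolutionary rate, and the x-intercept is a rough estimate of the root date. The following is a minimal sketch of that regression only, not the procedure used in the source (which cites TempEst and a marginal likelihood test). The function name `root_to_tip_regression` and the toy numbers are hypothetical and chosen purely for illustration.

```python
def root_to_tip_regression(dates, divergences):
    """Ordinary least-squares fit of root-to-tip divergence on sampling date.

    Returns (slope, root_date, r_squared). The slope approximates the rate in
    substitutions per site per year; root_date is the x-intercept and is only
    reported when the slope is positive (otherwise None).
    """
    n = len(dates)
    if n < 2 or n != len(divergences):
        raise ValueError("need at least two (date, divergence) pairs")
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    syy = sum((d - mean_d) ** 2 for d in divergences)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    if sxx == 0:
        raise ValueError("all sampling dates are identical; no temporal spread")
    slope = sxy / sxx
    # R^2 measures how clock-like the data look; a low value means weak temporal signal.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    root_date = mean_t - mean_d / slope if slope > 0 else None
    return slope, root_date, r_squared


# Toy, made-up data: three tips sampled five years apart.
slope, root_date, r2 = root_to_tip_regression(
    [2000.0, 2005.0, 2010.0], [0.010, 0.015, 0.020]
)
```

A flat or negative slope, or a low R², is the kind of result described in the source as a lack of root-to-tip temporal signal.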
The ongoing pandemic spread of a new human coronavirus, SARS-CoV-2, which is associated with severe pneumonia/disease (COVID-19), has resulted in the generation of tens of thousands of virus . These datasets were subjected to the same recombination masking approach as NRA3 and were characterized by a strong temporal signal (Fig. 2a. Indeed, the rates reported by these studies are in line with the short-term SARS rates that we estimate (Fig. Scientists trying to trace the ancestry of SARS-CoV-2, the virus responsible for COVID-19, have found that the pangolin is unlikely to be the source of the virus responsible for the current pandemic. Viruses 11, 979 (2019). However, formal testing using marginal likelihood estimation (ref. 41) does provide some evidence of a temporal signal, albeit with limited log Bayes factor support of 3 (NRR1), 10 (NRR2) and 3 (NRA3); see Supplementary Table 1. 4, vey016 (2018). Because the SARS-CoV-2 S protein has been implicated in past recombination events or possibly convergent evolution (ref. 12), we specifically investigated several subregions of the S protein: the N-terminal domain of S1, the C-terminal domain of S1, the variable-loop region of the C-terminal domain, and S2. Patino-Galindo, J. We showed that severe acute respiratory syndrome coronavirus 2 is probably a novel recombinant virus. A counting renaissance: combining stochastic mapping and empirical Bayes to quickly detect amino acid sites under positive selection. Cell 181, 223–227 (2020). If stopping an outbreak in its early stages is not possible, as was the case for the COVID-19 epidemic in Hubei, identification of origins and point sources is nevertheless important for containment purposes in other provinces and prevention of future outbreaks. A., Lytras, S., Singer, J. This new approach classifies the newly sequenced genome against all the diverse lineages present instead of a representative selection of sequences. A phylogenetic tree using RAxML v8.2.8 (ref.
Nature 538, 193–200 (2016). When the first genome sequence of SARS-CoV-2, Wuhan-Hu-1, was released on 10 January 2020 (GMT) on Virological.org by a consortium led by Zhang (ref. 6), it enabled immediate analyses of its ancestry. Nature 583, 282–285 (2020). SARS-CoV-2 and RaTG13 are the most closely related (their most recent common ancestor nodes denoted by green circles), except in the 222-nt variable-loop region of the C-terminal domain (bar graphs at bottom). volume 5, pages 1408–1417 (2020). DRAGEN COVID Lineage App: this app aligns reads to a SARS-CoV-2 reference genome and reports coverage of targeted regions. Unlike other viruses that have emerged in the past two decades, coronaviruses are highly recombinogenic (refs. 14–16). Ayres, D. L. et al. The estimated divergence times for the pangolin virus most closely related to the SARS-CoV-2/RaTG13 lineage range from 1851 (1730–1958) to 1877 (1746–1986), indicating that these pangolin lineages were acquired from bat viruses divergent to those that gave rise to SARS-CoV-2. GitHub - cov-lineages/pangolin: Software package for assigning SARS-CoV-2 genome sequences to global lineages. Schierup, M. H. & Hein, J. Recombination and the molecular clock. This underscores the need for a global network of real-time human disease surveillance systems, such as that which identified the unusual cluster of pneumonia in Wuhan in December 2019, with the capacity to rapidly deploy genomic tools and functional studies for pathogen identification and characterization. Li, Q. et al. Despite the high frequency of recombination among bat viruses, the block-like nature of the recombination patterns across the genome permits retrieval of a clean subalignment for phylogenetic analysis. Our third approach involved identifying breakpoints and masking minor recombinant regions (with gaps, which are treated as unobserved characters in probabilistic phylogenetic approaches).
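The masking step described above, replacing putatively recombinant stretches with gap characters so that likelihood-based methods treat them as unobserved, can be illustrated with a short sketch. This is an assumption-laden illustration, not the authors' pipeline: the function name `mask_regions`, the 1-based inclusive coordinate convention and the example sequence are all hypothetical.

```python
def mask_regions(sequence, regions, gap="-"):
    """Return a copy of an aligned sequence with the given regions masked.

    `regions` is an iterable of (start, end) pairs, 1-based and inclusive
    (an assumed convention). Masked sites become `gap`, so the alignment
    length, and therefore every column coordinate, is left unchanged.
    """
    chars = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(chars) or start > end:
            raise ValueError(f"invalid region ({start}, {end}) for length {len(chars)}")
        for i in range(start - 1, end):
            chars[i] = gap
    return "".join(chars)


# Toy example: mask alignment columns 3-5 of a made-up 10-nt sequence.
masked = mask_regions("ACGTACGTAC", [(3, 5)])
```

Keeping the sequence length fixed is the point of masking rather than deleting: the remaining columns stay aligned with every other genome in the set.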
G066215N, G0D5117N and G0B9317N)) and by the European Union's Horizon 2020 project MOOD (no. This is evidence for numerous recombination events occurring in the evolutionary history of the sarbecoviruses (refs. 22, 33); specifying all past events in their correct temporal order (ref. 34) is challenging and not shown here. Region B is 5,525 nt long. The divergence time estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent among the three approaches we use to eliminate the effects of recombination in the alignment. SARS-CoV-2 is an appropriate name for the new coronavirus. Grey tips correspond to bat viruses, green to pangolin, blue to SARS-CoV and red to SARS-CoV-2. However, the coronavirus isolated from pangolin is 99% similar in a specific region of the S protein, which corresponds to the 74 amino acids involved in the ACE (Angiotensin Converting Enzyme . A new coronavirus associated with human respiratory disease in China. Because there is no single accepted method of inferring breakpoints and identifying clean subregions with high certainty, we implemented several approaches to identifying three classic statistical signals of recombination: mosaicism, phylogenetic incongruence and excessive homoplasy (ref. 51). PLoS ONE 5, e10434 (2010). As informative rate priors for the analysis of the sarbecovirus datasets, we used two different normal prior distributions: one with a mean of 0.00078 and s.d. 4), that region and shorter BFRs were not included in combined putative non-recombinant regions. It performs k-mer based detection; map/align and variant calling; consensus sequence generation; and lineage/clade analysis using Pangolin and NextClade. The DRAGEN COVID Lineage App is available on BaseSpace Sequence Hub. The shaded region corresponds to the S protein. Sequencing from Malayan pangolins collected during anti-smuggling operations in southern China detected coronavirus lineages related to SARS-CoV-2. 94, e0012720 (2020).
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lam, T. T. et al. Lancet 395, 565–574 (2020). M.F.B. & Li, X. Cross-species transmission of the newly identified coronavirus 2019-nCoV. SARS-CoV-2 and RaTG13 are also exceptions because they were sampled from Hubei and Yunnan, respectively. The time-calibrated phylogeny represents a maximum clade credibility tree inferred for NRR1. A hypothesis of snakes as intermediate hosts of SARS-CoV-2 was posited during the early epidemic phase (ref. 54), but we found no evidence of this (refs. 55, 56); see Extended Data Fig. Early detection via genomics was not possible during Southeast Asia's initial outbreaks of avian influenza H5N1 (1997 and 2003–2004) or the first SARS outbreak (2002–2003). A distinct name is needed for the new coronavirus. Nucleotide positions for phylogenetic inference are 147–695, 962–1,686 (first tree), 3,625–9,150 (second tree, also BFR B), 9,261–11,795 (third tree, also BFR C), 12,443–19,638 (fourth tree) and 23,631–24,633, 24,795–25,847, 27,702–28,843 and 29,574–30,650 (fifth tree). Hu, B. et al. Divergence dates between SARS-CoV-2 and the bat sarbecovirus reservoir were estimated as 1948 (95% highest posterior density (HPD): 1879–1999), 1969 (95% HPD: 1930–2000) and 1982 (95% HPD: 1948–2009), indicating that the lineage giving rise to SARS-CoV-2 has been circulating unnoticed in bats for decades. In addition, sequences NC_014470 (Bulgaria 2008), CoVZXC21, CoVZC45 and DQ412042 (Hubei-Yichang) needed to be removed to maintain a clean non-recombinant signal in A. Rambaut, A., Lam, T. T., Carvalho, L. M. & Pybus, O. G. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). The authors declare no competing interests. Zhang, Y.-Z.
Using the most conservative approach (NRR1), the divergence time estimate for SARS-CoV-2 and RaTG13 is 1969 (95% HPD: 1930–2000), while that between SARS-CoV and its most closely related bat sequence is 1962 (95% HPD: 1932–1988); see Fig. The canine viral genome was excluded from the Bayesian phylogenetic analyses because temporal signal analyses (see below) indicated that it was an outlier. Webster, R. G., Bean, W. J., Gorman, O. T., Chambers, T. M. & Kawaoka, Y. Evolution and ecology of influenza A viruses. 382, 1199–1207 (2020). 21, 1508–1514 (2015). To avoid artefacts due to recombination, we focused on NRR1 and NRR2 and the recombination-masked alignment NRA3 to infer time-measured evolutionary histories. The histogram allows for the identification of non-recombining regions (NRRs) by revealing regions with no breakpoints. Discovery and genetic analysis of novel coronaviruses in least horseshoe bats in southwestern China. We thank all authors who have kindly deposited and shared genome data on GISAID. Graham, R. L. & Baric, R. S. Recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission. Posterior distributions were approximated through Markov chain Monte Carlo sampling, with chains run sufficiently long to ensure effective sample sizes >100. Lemey, P., Minin, V. N., Bielejec, F., Pond, S. L. K. & Suchard, M. A. S. China corresponds to Guangxi, Yunnan, Guizhou and Guangdong provinces. Note that six of these sequences fall under the terms of use of the GISAID platform. 1a–c), has the third-highest number of confirmed COVID-19 cases in the state of So. Two other bat viruses (CoVZXC21 and CoVZC45) from Zhejiang Province fall on this lineage as recombinants of the RaTG13/SARS-CoV-2 lineage and the clade of Hong Kong bat viruses sampled between 2005 and 2007 (Fig. 6, 83–91 (2015).
from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement no. Without better sampling, however, it is impossible to estimate whether or how many of these additional lineages exist. We find that the sarbecoviruses, the viral subgenus containing SARS-CoV and SARS-CoV-2, undergo frequent recombination and exhibit spatially structured genetic diversity on a regional scale in China. Identification of diverse alphacoronaviruses and genomic characterization of a novel severe acute respiratory syndrome-like coronavirus from bats in China. Bioinformatics 30, 1312–1313 (2014). Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Conservatively, we combined the three BFRs >2 kb identified above into non-recombining region 1 (NRR1). To estimate non-synonymous over synonymous rate ratios for the concatenated coding genes, we used the empirical Bayes 'Renaissance counting' procedure (ref. 67). performed S recombination analysis. Genetic lineages of SARS-CoV-2 have been emerging and circulating around the world since the beginning of the COVID-19 pandemic. While it is possible that pangolins, or another hitherto undiscovered species, may have acted as an intermediate host facilitating transmission to humans, current evidence is consistent with the virus having evolved in bats, resulting in bat sarbecoviruses that can replicate in the upper respiratory tract of both humans and pangolins (refs. 25, 32). The variable-loop region in SARS-CoV-2 shows closer identity to the 2019 pangolin coronavirus sequence than to the RaTG13 bat virus, supported by phylogenetic inference (Fig.
We focused on these three non-recombining regions/alignments for divergence time estimation; this avoids inappropriate modelling of evolutionary processes with recombination on strictly bifurcating trees, which can result in artefacts such as homoplasies that inflate branch lengths and lead to apparently longer evolutionary divergence times. & Holmes, E. C. Recombination in evolutionary genomics. Even before the COVID-19 pandemic, pangolins had been making headlines. Of the nine breakpoints defining these ten BFRs, four showed phylogenetic incongruence (PI) signals with bootstrap support >80%, adopting previously published criteria on using a combination of mosaic and PI signals to show evidence of past recombination events (ref. 19). Sorting these breakpoint-free regions (BFRs) by length results in two segments >5 kb: an ORF1a subregion spanning nucleotides (nt) 3,625–9,150 and the first half of ORF1b spanning nt 13,291–19,628 (sequence numbering given in Source Data, https://github.com/plemey/SARSCoV2origins). MERS-CoV data were subsampled to match sample sizes with SARS-CoV and HCoV-OC43. Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. wrote the first draft of the manuscript, and all authors contributed to manuscript editing. Next, we (1) collected all breakpoints into a single set, (2) complemented this set to generate a set of non-breakpoints, (3) grouped non-breakpoints into contiguous BFRs and (4) sorted these regions by length. Region C showed no PI signals within it. The fact that these estimates lie between the rates for MERS-CoV and HCoV-OC43 is consistent with the intermediate sampling time range of about 18 years (Fig. is funded by The National Natural Science Foundation of China Excellent Young Scientists Fund (Hong Kong and Macau; no.
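The four-step procedure for deriving breakpoint-free regions (collect breakpoints, complement the set, group contiguous non-breakpoint positions, sort by length) can be sketched in a few lines. This is an illustrative sketch only: the function name `breakpoint_free_regions`, the treatment of each breakpoint as a single excluded alignment position, and the toy coordinates are assumptions rather than details given in the source.

```python
def breakpoint_free_regions(breakpoints, alignment_length):
    """Return breakpoint-free regions as (start, end, length), longest first.

    Coordinates are 1-based and inclusive. Each breakpoint is assumed to
    occupy one alignment position, which is excluded from every region.
    """
    # Step 1: pool all breakpoints into a single de-duplicated, sorted set.
    cuts = sorted({bp for bp in breakpoints if 1 <= bp <= alignment_length})

    # Steps 2-3: walk the complement of the breakpoint set and group
    # contiguous non-breakpoint positions into regions.
    regions = []
    start = 1
    for bp in cuts:
        if bp > start:
            regions.append((start, bp - 1))
        start = bp + 1
    if start <= alignment_length:
        regions.append((start, alignment_length))

    # Step 4: sort regions by length, longest first.
    with_length = [(s, e, e - s + 1) for s, e in regions]
    return sorted(with_length, key=lambda r: r[2], reverse=True)


# Toy example with made-up breakpoints on a 1,500-column alignment.
bfrs = breakpoint_free_regions([300, 1000, 300], 1500)
```

Because breakpoints shared between sequences collapse to one entry in the pooled set, the regions returned are free of breakpoints in every sequence considered, which is what makes the longest ones candidates for non-recombinant analysis.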
Biazzo et al. obtained the genome sequences of 10 SARS-CoV-2 virus strains through nanopore sequencing of nasopharyngeal swabs in Malta and analyzed the assembled genomes with the pangolin software; the strains were assigned to the B.1 lineage, indicating that SARS-CoV-2 was widely spread in Europe (Biazzo et al., 2021). We considered (1) the possibility that BFRs could be combined into larger non-recombinant regions and (2) the possibility of further recombination within each BFR. The presence in pangolins of an RBD very similar to that of SARS-CoV-2 means that we can infer this was also probably in the virus that jumped to humans. All three approaches to removal of recombinant genomic segments point to a single ancestral lineage for SARS-CoV-2 and RaTG13. When the genomic data included both coding and non-coding regions, we used a single GTR+Γ substitution model; for concatenated coding genes we partitioned the alignment by codon position and specified an independent GTR+Γ model for each partition, with a separate gamma model to accommodate inter-site rate variation. 31922087). This statement informs us of the possibility that a virus has spilled over from a very rare and shy reptile-looking mammal . We demonstrate that the sarbecoviruses circulating in horseshoe bats have complex recombination histories, as reported by others (refs. 15, 20–26). 68, 1052–1061 (2019). Preprint at https://doi.org/10.1101/2020.05.28.122366 (2020). Su, S. et al. the development of viral diversity.
Center for Infectious Disease Dynamics, Department of Biology, Pennsylvania State University, University Park, PA, USA; Department of Microbiology, Immunology and Transplantation, KU Leuven, Rega Institute, Leuven, Belgium; Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China; State Key Laboratory of Emerging Infectious Diseases, School of Public Health, The University of Hong Kong, Hong Kong SAR, China; Department of Biology, University of Texas Arlington, Arlington, TX, USA; Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK; MRC-University of Glasgow Centre for Virus Research, Glasgow, UK. It is available as a command line tool and a web application. The boxplots show divergence time estimates (posterior medians) for SARS-CoV-2 (red) and the 2002–2003 SARS-CoV virus (blue) from their most closely related bat virus. Note that breakpoints can be shared between sequences if they are descendants of the same recombination events. Among the 68 sequences in the aligned sarbecovirus sequence set, 67 show evidence of mosaicism (all Dunn–Sidak-corrected P < 4 × 10⁻⁴; 3SEQ, ref. 14), indicating involvement in homologous recombination either directly with identifiable parentals or in their deeper shared evolutionary history, that is, due to shared ancestral recombination events. Bayesian evolutionary rate and divergence date estimates were shown to be consistent for these three approaches and for two different prior specifications of evolutionary rates based on HCoV-OC43 and MERS-CoV. A second breakpoint-conservative approach was conservative with respect to breakpoint identification, but this means that it is accepting of false-negative outcomes in breakpoint inference, resulting in less certainty that a putative NRR truly contains no breakpoints. Holmes, E. C., Dudas, G., Rambaut, A. All four of these breakpoints were also identified with the tree-based recombination detection method GARD (ref. 35).
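The Dunn–Sidak correction cited for the mosaicism tests adjusts each raw P value for the number of tests performed, using p_adj = 1 − (1 − p)^m for m tests. The sketch below shows that general formula only; it is not taken from 3SEQ, and the function name `dunn_sidak` and the example values are hypothetical.

```python
import math


def dunn_sidak(p_values):
    """Apply the Dunn-Sidak correction to a list of raw P values.

    Each adjusted value is 1 - (1 - p)**m, where m is the number of tests.
    log1p/expm1 keep precision when p is very small, as is typical for
    strong recombination signals.
    """
    m = len(p_values)
    adjusted = []
    for p in p_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"P value out of range: {p}")
        if p >= 1.0:
            adjusted.append(1.0)  # log1p(-1) is undefined; the limit is 1
        else:
            adjusted.append(-math.expm1(m * math.log1p(-p)))
    return adjusted


# Toy example with two made-up raw P values.
adjusted = dunn_sidak([0.01, 0.5])
```

With m tests, a sequence is flagged only if its adjusted value stays below the chosen threshold, which guards against false positives when every sequence in an alignment is tested.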
We thank originating laboratories at South China Agricultural University (Y. Shen, L. Xiao and W. Chen; no. 36) gives a putative recombination-free alignment that we call non-recombinant alignment 3 (NRA3) (see Methods). Preprint at https://doi.org/10.1101/2020.02.10.942748 (2020). The research leading to these results received funding (to A.R. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent for the current coronavirus disease (COVID-19) pandemic that has affected more than 35 million people and caused . Sarbecovirus, HCoV-OC43 and SARS-CoV data were assembled from GenBank to be as complete as possible, with sampling year as an inclusion criterion. We used TreeAnnotator to summarize posterior tree distributions and annotated the estimated values to a maximum clade credibility tree, which was visualized using FigTree. 87, 6270–6282 (2013). Several of the recombinant sequences in these trees show that recombination events do occur across geographically divergent clades. To evaluate the performance of the procedure, we confirmed that the recombination masking resulted in (1) a markedly different outcome of the PHI test (ref. 64), (2) removal of well-supported (bootstrap value >95%) incompatible splits in Neighbor-Net (ref. 65) and (3) a near-complete reduction of mosaic signal as identified by 3SEQ. 5 (NRR1) are conservative in the sense that NRR1 is more likely to be non-recombinant than NRR2 or NRA3. (Yes, Pango is a tongue-in-cheek reference to pangolins, which were briefly suspected to have had a role in the coronavirus's origins; several of the team's computational tools are named after.
The red and blue boxplots represent the divergence time estimates for SARS-CoV-2 (red) and the 2002–2003 SARS-CoV (blue) from their most closely related bat virus, with the light- and dark-colored versions based on the HCoV-OC43 and MERS-CoV centered priors, respectively. Humans' selfish, speciesist treatment of these animals could be the very reason why the novel coronavirus exists. & Boni, M. F. Improved algorithmic complexity for the 3SEQ recombination detection algorithm. Press, H.) 39–64 (Springer, 2009). In our analyses of the sarbecovirus datasets, we incorporated the uncertainty of the sampling dates when exact dates were not available. 21, 255–265 (2004). In December 2019, a cluster of pneumonia cases epidemiologically linked to an open-air live animal market in the city of Wuhan (Hubei Province), China (refs. 1, 2) led local health officials to issue an epidemiological alert to the Chinese Center for Disease Control and Prevention and the World Health Organization's (WHO) China Country Office.