The debate on the origin of SARS-CoV-2, the causal agent of the COVID-19 pandemic, is a contentious one. The quality of this debate suffers from the paucity of hard data and a series of unexplained irregularities associated with a number of reports on SARS-CoV-2 and related viruses. As I previously discussed, a Nature paper by Peng Zhou, Zheng-Li Shi and colleagues on RaTG13, a betacoronavirus related to SARS-CoV-2, stood out for its lack of metadata on the precise origin of this virus. RaTG13, we were expediently told, was isolated from a bat in Yunnan. What the authors failed to report is that RaTG13 was isolated from the Mojiang mine, the site of a deadly outbreak of pneumonia-like disease that resulted in three deaths. What they also failed to report is that RaTG13 is identical to a virus they had studied and published on years before their February 2020 paper.
And it’s not just the metadata. The data, the genome sequence data itself, has often fallen short of the reporting standards the community has upheld ever since the launch of GenBank and similar genome sequence data repositories. This incredibly poor reporting on the deadliest pandemic in a century is what led Alina Chan and Shing Hei Zhan to state the obvious by emphasizing “the importance of requiring authors to publish their complete genome assembly pipeline and all contributing raw sequence data, particularly those supporting epidemiological investigations.” At the very least, authors should be expected to provide such data and methods once problems and errors are identified. Otherwise, as Cyril Zipfel and I proposed about 5 years ago, failure by authors to correct their mistakes should be viewed as scientific misconduct.
But more on this issue later. Let’s first revisit the Zhou et al. RaTG13 paper and another anomaly brought forward by the energetic Yuri Deigin. One of the defining features of SARS-CoV-2 is the presence of a protease cleavage site in its Spike protein. This polybasic viral sequence (PRRAR) occurs at the S1/S2 junction within the Spike protein and can be cleaved by the human protease furin, a molecular event that was shown to enhance SARS-CoV-2 transmission. The importance of the furin cleavage site in disease epidemiology is further illustrated by the SARS-CoV-2 Delta variant, in which PRRAR mutated to RRRAR, resulting in enhanced transmissibility of this virus variant. The PRRAR sequence has been the subject of much debate given that it is absent in related betacoronaviruses, including RaTG13, and how SARS-CoV-2 gained the furin cleavage site is unknown. The PRRAR insertion may have arisen through genetic recombination in a wild animal reservoir, a natural evolutionary process, or through laboratory manipulation.
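For readers who want to see what "presence or absence of a polybasic site" looks like in practice, here is a minimal Python sketch. It scans a protein string for the minimal furin consensus R-X-X-R, which the RRAR portion of PRRAR satisfies. The function name `find_polybasic_sites` and the two short fragments are my own assumptions for illustration; the fragments are not data from any of the papers discussed and should be checked against the deposited sequence records before being relied on.

```python
import re

# Minimal furin consensus R-X-X-R (X = any residue). The lookahead lets
# the scan report overlapping hits instead of skipping past them.
FURIN_MINIMAL = re.compile(r"(?=(R..R))")


def find_polybasic_sites(protein: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for each R-X-X-R hit."""
    return [
        (match.start() + 1, match.group(1))
        for match in FURIN_MINIMAL.finditer(protein.upper())
    ]


# Hypothetical fragments around an S1/S2-like junction, assumed for this sketch:
# one carrying the PRRAR sequence, one in which the PRRA residues are absent.
with_insert = "QTQTNSPRRARSVASQ"
without_insert = "QTQTNSRSVASQ"

print(find_polybasic_sites(with_insert))
print(find_polybasic_sites(without_insert))
```

A scan like this only flags candidate motifs; it says nothing about whether a site is actually cleaved, nor about how the sequence arose.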
Remarkably, even though the PRRAR sequence was the most conspicuous feature of the SARS-CoV-2 genome, the Zhou et al. Nature paper failed to mention the presence of this protease cleavage site in what was at the time a new virus. Worse, they provided an alignment between SARS-CoV-2 and Spike proteins from other coronaviruses that oddly terminates just before the S1/S2 junction and the PRRAR insertion. What Yuri Deigin pointed out recently is that the alignment they published “sloppily” ends at a glutamine (Q) 6 residues before the PRRAR sequence, and appears “whited out” according to Matt Ridley, author of a recent book on the origin of COVID-19. It’s also unclear how Zhou et al. generated the Spike protein alignment and which of the classic sequence alignment programs they used.
Make of this what you will, but in the digital era, it’s odd that Zhou et al. did not provide the full protein alignment. Nature asserts that it enforces transparent and stringent reporting requirements. An oddly truncated alignment of what is currently the most important protein to humankind does not fit that bill. Combine this with the fact that some leading virologists erroneously aligned the coronavirus Spike proteins in a way that masks the PRRAR insertion, and you develop a sense of sympathy for those who claim that there is a conspiracy to hide the origin of the pandemic coronavirus.
Now back to the issue that got Alina Chan and Shing Hei Zhan to call for better reporting of genomic sequence data and methods. This story has its own saga of bad reporting practices, to say the least. In the early months of the pandemic, multiple papers independently reported that a coronavirus isolated from Malayan pangolins confiscated from wildlife smugglers in Guangdong province in March 2019 shares ~97% amino acid identity in the Spike receptor binding domain (RBD) with SARS-CoV-2. This is obviously relevant to the coronavirus origin discussion, particularly because these papers were presented as independent studies of multiple pangolin samples. However, what Chan and Zhan reported in their October 2020 preprint is that no fewer than four (yes, FOUR) papers used the same raw sequence data, obtained from a study published in 2019 by Liu et al. in the specialist journal Viruses. These are the four papers listed in chronological order:
Zhang, T., Wu, Q. & Zhang, Z. Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak. Curr. Biol. doi:10.1016/j.cub.2020.03.022. March 2020.
Lam, T. T.-Y. et al. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature doi:10.1038/s41586-020-2169-0. March 2020.
Xiao, K. et al. Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins. Nature doi:10.1038/s41586-020-2313-x. May 2020.
Liu, P. et al. Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)? PLoS Pathog. 16, e1008421. May 2020.
It is worth noting that the Liu et al. data that fed the other studies consists of a metagenomic survey of virus sequences from the smuggled Malayan pangolins, and thus the sequence data is from a single occurrence of the coronavirus rather than a systematic outbreak in this animal. Therefore, this finding doesn’t tell us much about the origin of SARS-CoV-2 itself. Several authors have questioned whether the pangolin is a natural wildlife reservoir of SARS-CoV-2 or whether it picked up the virus in captivity from another source. Once again, metadata matters when interpreting epidemiological data. And journals should know better than to publish such papers without extensive metadata.
To be fair to the journals, they did publish corrections for three of the above papers, the latest being a meek Erratum to the Xiao et al. Malayan pangolin paper posted by Nature just a few days ago. Even though retractions may have been more appropriate, we cannot accuse the Editors of playing possum (playing dead), a behavior that seems to be common among scientific journals that are called out for publishing manipulated or fraudulent papers. I’m happy to settle for Elisabeth Bik’s more considerate #BadEditorialDecision hashtag. But as Elisabeth highlighted, the community is waiting for dozens of corrections of egregious errors in the peer-reviewed literature. Again, I stand by the view that failure to fix the record is scientific misconduct.
It’s also fair to wonder whether Nature, Current Biology and others would have published the corrections without the generous work of post-publication peer reviewers like Alina Chan, Yuri Deigin, Elisabeth Bik and others. Would the journals have bothered to publish the corrections without the preprint and social media publicity? Would the scientific record have been corrected? In environmental law, we have the “polluter pays principle”, essentially the concept that the party responsible for producing pollution should be responsible for paying for the damage. Perhaps the same should apply in scientific publishing. Springer Nature, Elsevier and company should be liable for the serious damage they are causing to the scientific enterprise.
Read Part One of this series:
The triumphs of post-publication peer-review
Journals must enforce transparent reporting of metadata associated with biological samples or face the wrath of post-publication peer-reviewers.