Science thrives on precision and transparency. To be valuable and impactful, scientific studies need to be reproducible. They must provide enough detail and accurate information to allow other scientists to replicate the experiments and build upon the published work. Anything short of that and the study — no matter where it is published — will end up buried within the pile of irrelevant scientific literature.
Show me the metadata
What about work that describes biological diversity? The unabating effort to document the diversity of life also requires precision. Papers in this field include what is known as metadata, or in some scientists’ jargon “passport data”: the precise descriptions of the location, date, habitat and other details associated with the biological samples under study. Papers that describe a population of samples are expected to provide as much metadata as possible to enable a precise interpretation of the data. Journal peer reviewers and editors are expected to request this metadata when it is missing or lacking in detail. This is the best practice that we are all expected to adhere to as professional scientists.
There are organized efforts to curate the metadata that has accumulated in the scientific literature, notably when the samples are linked to taxonomic descriptions or genetic sequences. In one effort, the Genomic Observatories MetaDatabase (GEOME) aims to archive the who, what, where and when of biological samples associated with genome sequences. There are also efforts to establish metadata standards. A blog post in microBEnet makes the case for why it is so “critical to collect detailed information about each sample”. It lists examples of poorly described samples from the microbiology literature: “Soil Sample #3, July 7th”… “Sears Tower A/C unit water sample”… “Cancer patient lower colon sample”.
In some fields, the bar for precision and transparency is even higher. And it should be. The study of biological diversity in the context of infectious diseases has important implications for identifying outbreaks, managing a response plan and developing mitigation strategies. This includes not just the infectious agents and their relatives, but also any host or intermediary species.
I know this first-hand because my lab has been directly involved in the rapid response to emerging plant disease outbreaks. Back in 2016, we organized a community effort to trace the origin of the wheat blast pathogen that caused a devastating outbreak in Bangladesh. The project was logistically challenging, as we relied on samples collected, sequenced and analyzed by over 30 scientists from seven countries. Nonetheless, we did our best to collect and compile the metadata, as reported in Table S1 of the original research paper. We diligently shared a spreadsheet among the collaborators to collate information such as sampling region and country, date of collection, and host species. Despite our best efforts, some samples still lacked information such as the precise sampling location. Still, a few years later, I received a terse email from GEOME curators complaining about the absence of precise geographic coordinates for the samples we published.
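To make this concrete, here is a minimal sketch in Python of the kind of per-sample record we collated. The field names and the example row are illustrative only, loosely inspired by GEOME’s who/what/where/when; they are not the actual Table S1 columns or the GEOME schema.

```python
import csv

# Illustrative metadata fields for a biological sample (a hypothetical
# schema, not the actual Table S1 or GEOME format).
FIELDS = [
    "sample_id",        # what: unique identifier for the isolate
    "host_species",     # what: species the sample was collected from
    "country",          # where
    "region",           # where: sub-national region or locality
    "latitude",         # where: decimal degrees; the level of detail
    "longitude",        #        that curators such as GEOME ask for
    "collection_date",  # when: ISO 8601 date
    "collector",        # who collected the sample
]

# A hypothetical record; blank fields are an honest "unknown",
# which is far more useful to curators than a guess.
samples = [{
    "sample_id": "WB-001",
    "host_species": "Triticum aestivum",
    "country": "Bangladesh",
    "region": "",
    "latitude": "",
    "longitude": "",
    "collection_date": "2016-03-15",
    "collector": "",
}]

with open("sample_metadata.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(samples)
```

Even a toy sheet like this makes the gaps visible at a glance, which is exactly the point of metadata standards.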
A probable bat origin
All this is to say that matters of metadata standards are taken very seriously by the scientific community, even for the oft-overlooked plant pathogens that threaten global food security. So what about the COVID-19 pandemic, which has killed millions of people across the globe, wreaked economic havoc and disrupted pretty much everyone’s life in the last 20 months? We can be forgiven for expecting the highest standards of metadata reporting in publications on SARS-CoV-2, the causal agent of COVID-19, and related SARS-type coronaviruses. We can also be forgiven for expecting these standards to be even more stringent for articles published in Nature, the flagship journal of the academic publisher Springer Nature, which bills its authors the hefty sum of €9,500 in article processing charges (APCs).
Regrettably, the opposite has happened. In early 2020, the search for the closest coronavirus relative of SARS-CoV-2 was one of the most pressing questions of the nascent COVID-19 pandemic. One of the closest known relatives of SARS-CoV-2 is another betacoronavirus, known as RaTG13, first reported by Peng Zhou, Zheng-Li Shi and colleagues at the Wuhan Institute of Virology in the February 2020 Nature paper “A pneumonia outbreak associated with a new coronavirus of probable bat origin.” RaTG13 shares ~96% overall genome sequence identity with SARS-CoV-2 and is therefore critical to understanding the origin of the pandemic virus. Given that RaTG13 was identified in the bat species Rhinolophus affinis, Zhou and colleagues concluded, as stated in the title of their paper, that SARS-CoV-2 probably originated from bat coronaviruses.
So far, so good. Peng Zhou and colleagues should be commended for rapidly putting together and publishing their paper, as well as for sharing their genome sequences of SARS-CoV-2 and related coronaviruses. However, the metadata associated with RaTG13 fell far short of the expected standards. The paper simply stated that “RaTG13 was obtained from R. affinis, found in Yunnan province.” As the narrator of a recent Channel 4 documentary that revisited the virus origin debate put it, Yunnan is “roughly the size of Germany.” What happened? How come the Editors of a famous (and prohibitively expensive) scientific journal didn’t ensure detailed disclosure of the metadata of a single sample that is so important for understanding the origin of SARS-CoV-2?
The lack of uproar among most virologists about the minimalistic metadata of such a crucial biological sample was also notable. Scientists are insatiably curious and inquisitive. I’m certainly not an expert in zoonotic diseases or coronaviruses, but Nature claims loud and clear to be a broad-audience journal, so I suppose papers like Zhou et al. are addressed to biologists like myself. When I first read about RaTG13, I was left wanting. I couldn’t help but wonder when and where the infected bat was collected. For one thing, the precise location would call for more surveillance and sampling in that area to help identify even closer relatives of SARS-CoV-2 and prevent further outbreaks. The year of collection is also important for molecular clock dating analyses that could help pinpoint when the virus started infecting humans. Finally, the group of Zheng-Li Shi had previously published the collection and analysis of several coronaviruses from Yunnan. How does RaTG13 relate to these earlier virus samples? The lack of metadata in such a high-profile paper was puzzling even to an outside observer like myself.
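To see why the collection year matters, consider a strict-clock back-of-envelope calculation. The numbers below are purely illustrative, and the strict-clock and no-recombination assumptions are both questionable for coronaviruses:

```latex
% Time t to the most recent common ancestor of two sequences, from their
% pairwise distance d (substitutions per site) and an assumed substitution
% rate r (substitutions per site per year):
t \approx \frac{d}{2r}
% e.g. d \approx 0.04 (SARS-CoV-2 vs RaTG13, ~96% identity) and an
% illustrative rate r \approx 10^{-3} give
% t \approx 0.04 / (2 \times 10^{-3}) = 20 \text{ years}.
```

Published tip-dated analyses, which calibrate the rate using the collection dates of the samples, push the SARS-CoV-2/RaTG13 split back several decades; without the collection date of RaTG13, that calibration loses one of its anchor points.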
All they needed was an internet connection and a curious mind
The story could have easily ended right there, as just another example of a prestigious scientific journal condoning questionable data reporting practices, the very opposite of what one would expect. But that’s without counting on a group of modern-day Sherlock Holmeses who, in the months following the Zhou et al. Nature publication, figured out that RaTG13 is identical to a coronavirus described in 2016 as BtCoV4991. This virus was first described in a 2016 publication by Zheng-Li Shi’s team as one of only two betacoronaviruses collected from bat feces and swabs in the Tongguan mineshaft in Mojiang, Yunnan. Shockingly, the reason they studied this mine in the first place was a deadly 2012 outbreak of a pneumonia-like illness that affected six mine workers, three of whom died. It remains unclear why the precise identity of RaTG13 and related information was omitted from the Nature paper. Why the journal failed to ensure the inclusion of the RaTG13 metadata in the paper is another mystery that, as far as I can tell, the Nature Editors have yet to publicly address.
I’m not qualified to narrate the full history of how the link between RaTG13 and BtCoV4991 was revealed. My understanding is that several scientists scattered across the world converged on the same conclusion at around the same time. Their story, related in part in the Channel 4 documentary, should inspire biologists all over the globe: they were equipped with nothing more than an internet connection and a curious mind. They illustrate the democratization of science brought about by the internet and genomics revolutions that shook the life sciences. I very much recommend watching the documentary segment (starting at 10’54) featuring Rossana Segreto from the University of Innsbruck, Austria, and Monali Rahalkar from the MACS Agharkar Research Institute in Pune, India, who were key among the scientists who showed that RaTG13 is BtCoV4991 and, therefore, ultimately connected to the deadly Mojiang mineshaft.
I was particularly taken by Monali (Mona) Rahalkar’s testimony in the documentary. She and her husband Rahul Bahulikar queried public genome sequence databases with the RaTG13 genome to determine whether a similar sequence had been reported before. As they described in their May 2020 preprint, the method they used, the openly available Basic Local Alignment Search Tool (BLAST), is a basic bioinformatic tool that virtually every biologist uses on a routine basis. Using this straightforward approach, they showed that RaTG13 carries a sequence 100% identical to a 370 bp fragment of BtCoV4991 deposited in the genome sequence database GenBank in March 2016 under accession KP876546. Why this link wasn’t reported in the Zhou et al. paper remains puzzling. In a November 2020 Addendum to the Nature paper, Zheng-Li Shi and colleagues finally acknowledged the true origin of RaTG13 and at last cited their own Ge et al. 2016 paper.
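For readers who want to try this kind of search themselves, here is a minimal sketch using Biopython’s interface to NCBI’s remote BLAST service. As a shortcut it queries the nt database with the BtCoV4991 accession directly (Mona and Rahul searched with the RaTG13 genome as the query); remote qblast jobs are slow and rate-limited, so treat this as illustrative:

```python
from Bio.Blast import NCBIWWW, NCBIXML

# Remote blastn against the NCBI nucleotide collection (nt).
# NCBI accepts a GenBank accession as the query, so we can use
# KP876546 (the 370 bp BtCoV4991 fragment) without downloading it.
result_handle = NCBIWWW.qblast("blastn", "nt", "KP876546")

# Parse the XML output and print the top hits with their identities.
record = NCBIXML.read(result_handle)
for alignment in record.alignments[:10]:
    hsp = alignment.hsps[0]  # best-scoring local alignment for this hit
    identity = 100.0 * hsp.identities / hsp.align_length
    print(f"{identity:5.1f}% over {hsp.align_length} nt  {alignment.title[:60]}")
```

Nothing in this snippet goes beyond an undergraduate practical class, which is precisely the point: the link was sitting in a public database, one routine query away.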
It beggars belief that Zheng-Li Shi’s team somehow didn’t connect RaTG13 to one of the few betacoronaviruses they had previously studied and published on. As Broad Institute biologist Alina Chan eloquently put it, what’s the point of virus hunting and initiatives like the Global Virome Project if the databases and metadata evaporate once a pandemic happens? But it’s not just the authors: what about the responsibility of the journal in this affair? Journals like Nature justify their business model and excessive APCs on the basis of the quality of the peer-review they impose on the articles they publish. Yet in this case, they failed to require the authors to comply with the most basic level of data reporting. What went wrong?
What went wrong is non-transparent pre-publication peer-review
We often hear that peer-review can only be performed by experts in the specific field of study, and the argument was made that some of the scientific sleuths mentioned above aren’t virologists. It is important to point out that the BLAST search that helped connect RaTG13 to GenBank accession KP876546 could have been expertly performed by any biology undergraduate student. This particular aspect of Zhou et al. therefore doesn’t require expert virologists to critique. The review of the paper benefited from wider community scrutiny rather than reliance on journal editors and their handful of selected reviewers. The reality is that peer-review is an inherently conflicted system, and the closed-door deliberations of an editor and a couple of referees aren’t exactly an example of transparency that conveys confidence in the system. Don’t get me wrong: as I have argued before, nobody is against peer-review per se. As a practicing scientist, I perform peer-review all the time. I’m peer-reviewing when I query a student at a poster session, ask a colleague a question after a talk, raise a point on Twitter and so on. Peer-review is central to our profession. What I and others take issue with is non-transparent pre-publication peer-review as orchestrated by journals to evaluate and filter the literature. The scientific community and the public at large deserve more transparency about what happened with the review and editorial evaluation of the Zhou et al. paper.
The triumphs of post-publication peer-review
What Rossana, Mona and others have done is what we now call post-publication peer-review: reviewing papers after they are published and shared with the community. Post-publication peer-review can bring much needed transparency to the closed journal system that many view as gamed and flawed. It is particularly effective when combined with preprints and the publication of reviewer reports. Anyone in the scientific community can offer their evaluation of the work. The more scrutiny, the better. Most importantly, the life of a paper can now extend beyond its publication date. This is the way forward. We have to stop relying on science publishers that carry a laundry list of commercial conflicts and cannot deliver the most elementary level of data reporting, even in the midst of a deadly pandemic.
We have a #sciencecrisis and the journals are complicit. Look no further than the relentless post-publication peer-review work of science integrity consultant Elisabeth Bik, who has uncovered hundreds of data anomalies in the peer-reviewed literature. Her blog Science Integrity Digest is a must-read for biologists of all career stages, if only to grasp the scale of the data manipulation problem we face. The failure of pre-publication peer-review is right there in that blog for everyone to witness. The triumphs of post-publication peer-review are in the work of Elisabeth, Rossana, Mona and all the other scientists who uphold the ideal that good science should stand up to close scrutiny. That is the only impact we should be factoring in.
There is hope for reform. With the rise of bioRxiv, preprints have at last become widely accepted in biology, following the lead of physics and mathematics. Peer-review practices are also evolving, and there is much experimentation with new models of preprint peer-review. Discussions are ongoing among the bioRxiv team and affiliates about posting public reviews as the next step of open science. I was pleasantly surprised when, a few months ago, the journal eLife posted the reviews of our paper while it was still under evaluation. Indeed, eLife has been a leader in reforming the publishing model. As Editor-in-Chief Michael Eisen and colleagues recently stated, the journal is embracing a “publish, then review” model. With initiatives like this, we can be optimistic that the tide has turned and that major reform of our fundamentally flawed traditional publishing model will take place. The early-career scientists I work with on a daily basis are ready and have wholeheartedly embraced preprints and open review.
Wither pre-publication peer review and mandate preprints
It’s time to abandon pre-publication peer-review once and for all. Science funders should do their part and embrace Plan U, mandating preprints for their grantees and enabling universal free access to the scientific literature, as bioRxiv co-founder Richard Sever and coauthors have proposed. The simple idea of funding bodies mandating preprint deposition would most certainly speed up the process of science and truly take us into the open science era. As Richard tweeted, the next phase of systematic reviewing of preprints would emerge organically based on community needs. It’s already happening. As discussed above, the community has already started experimenting with various models of downstream peer-review. The future is here.
There is another hugely valuable advantage of preprints. Their cost is negligible, and they are for all practical purposes free for authors to post on respected platforms like bioRxiv and Zenodo. Meanwhile, commercial publishers such as Springer Nature and Elsevier fail badly at justifying the outrageous APCs they keep charging their authors: the very scientists who produce the data, write the articles and review their colleagues’ papers for them for free. In many countries, the APC for a single article amounts to a significant fraction of a PhD studentship. In an era of limited research funds, imagine how many fellowships could be funded with all the publishing and access money currently siphoned off to the commercial publishers. The publishing charge for a single (yes, a single) Nature article would be enough to sequence the genomes of over 100 bacterial strains: at €9,500, that works out to less than €95 per genome, roughly the going rate for short-read sequencing of a bacterial strain. Imagine the impact such a dataset would have on microbiology students all over the world who are just starting their Ph.D. projects on bacteria. Once again, there is a solution: funders must endorse Plan U as a first step to stop this nonsense. If you’re inspired by this post, please lobby your funding agency to endorse preprint mandates and to stop paying article processing charges, at least to commercial publishers.
Meanwhile, editorial offices like Nature’s should embrace transparent reporting systems, enforce basic standards for metadata and become less opaque about the peer-review process and conflicts of interest. At €9,500 an article, the scientific community deserves at least that courtesy.
Watch the Channel 4 documentary “Did Covid Leak from a Lab in China?” on the All4 channel in the UK and on YouTube elsewhere in the world.