Big data in small places — 10 years later
Back in 2012, Dan MacLean and I wrote about how a “small-sized” research institute like The Sainsbury Laboratory can deal with the data deluge that biology was facing. Ten years later, the article remains current.
Adapted from MacLean, D., and Kamoun, S. 2012. Big data in small places. Nature Biotechnology, 30:33–34.
My motivation for reposting this article is a question that arose just this last Friday at the workshop “How to do Science”. A student queried whether bioinformatics should also be performed according to the GOHREP framework or whether this guide is only appropriate to wet lab research. The answer is straightforward and is further discussed in this re-posted article. Don’t approach bioinformatics as black box tricks and forget the basics of rigorous experimentation. As Dan and I wrote, we very much urge biologists to gain enough computational biology expertise to be able to test their own hypotheses and approach bioinformatics in the same way they would approach wet lab experiments. Provision of bioinformatics support at small-sized research institutions should keep this in mind and avoid top-down approaches.
As biology becomes increasingly dependent on very large datasets, biologists must learn the tools and techniques to interpret this “big data”. Bioinformatics support must adapt and evolve by providing training and systems in which big data biology can thrive.
The recent meteoric rise in the rates and ease at which biological data, particularly DNA sequences, can be collected has created opportunities for new, exciting but heavily data-bound projects that would not have been conceivable just a few years ago. This proliferation has brought new challenges. Scientists now deal with volumes of data so large that high performance computer software and hardware have become integral to modern biological research. Bioinformatics capability is now more than ever a vital component of the research. Here, we share our views and experience on how small institutions can best manage bioinformatics support in an era of big data biology.
A top-down model of bioinformatics provision is unsuitable for 21st Century biology
A common model of bioinformatics provision is a top-down one: a biologist gives the data to a core informatician to work with and to return in cleaned, filtered and usable form for the bench biologist to verify and evaluate. A top-down model like this has several limitations. First, the model is not well suited to up-scaling, since it is difficult and prohibitively expensive to increase team sizes in step with the demands of data-heavy projects. Second, it is time-consuming given the variety of bioinformatics methods and approaches that get applied. Third, the model can lead to confusion and misunderstandings between the bioinformatician and the biologist. A top-down approach requires that the bioinformatician explains the results in context to the bench biologist, and inevitably something will get lost in translation. Perhaps worse, each party may feel that they do fully understand the results but miss out on a technical or biological subtlety, and a deeper conclusion from the experiment may never be reached. In our view, core bioinformaticians tend to find it hard to fully understand all of the relevant biological background of projects that they only spend a fraction of their time on, and cannot interpret data in light of biological information they do not fully appreciate. Similarly, bench biologists don’t immediately realize the limits on interpretation that bioinformatics experiments impose, or the controls and checks that must be done.
At our organization, The Sainsbury Laboratory (TSL), we wrestled with these issues as we began work with a high-throughput DNA sequencer and with various high-throughput screening systems. In response we took the bold move of developing a bottom-up, biologist-enabling approach that effectively increases bioinformatics manpower without extra specialist staff and most crucially brings the bioinformatics techniques into the same brains in which the biological knowledge is already stored. We have used a targeted program of training and have developed easy to use analysis pipelines and tools to create an environment in which biologists can do their own bioinformatics. We find that our approach speeds up the analysis cycle and allows us to handle increasing workloads in a timely, productive manner with a modest core support team.
TSL is a privately funded laboratory with a remit to do daring research in the field of plant-pathogen interactions. Our five research groups and support teams consist of ~80 individuals with expertise in genetics, genomics, molecular and cellular biology and biochemistry. We provide bioinformatics support with just two core bioinformaticians and one systems-administrator. One bioinformatician, the first author, has a PhD in plant molecular biology and post-doctoral experience in genomics and transcriptomics, and the second has a biology background and a computational biology PhD. Neither of these two scientists has an extensive academic background in computer science and they both consider themselves biologists in the first instance. The team background has given them the ideal perspective into how best to marry biology and bioinformatics successfully.
A bottom-up model to manage, understand, and analyze big data effectively
We developed a bottom-up model based on the premise that dealing with big data can be abstracted into three main tasks: we must be able to manage, understand and analyze. “Managing” is to carry out the somewhat dry, computer-science based transfer and storage of data. “Understanding” implies a clear knowledge of the biological context and caveats of the data as well as the functioning and limitations of the methods. Finally, “analyzing” refers to the application of the various bioinformatics methods to specific biological questions and data. We therefore developed a bioinformatics support platform that distributes labor between bioinformaticians and bench scientists to optimize the delivery of the three tasks.
Manage: Bioinformaticians provide the environments for insight
Much as biologists are typically eager to carry out their own bioinformatic experiments, they rarely have much enthusiasm for carrying out proper data storage, organization and transfer. Potentially this is a massive problem, since big data are driving a critical need for original solutions to managing and sharing data. Thankfully, the informatician can do much to remove the burden of worrying about the mechanics of dealing with the data and allow the biologist to get on with their science. In a laboratory with diverse and wide-ranging activities this isn’t the organizational nightmare it might seem at first. At TSL we have found a number of useful tools and tricks to facilitate data management. For instance, we promoted the use of SSHFS (SSH Filesystem), which makes it possible for large data storage devices to be mounted onto user computers with a single click and terabytes of data to be accessed as readily as if they were on a USB stick. This reduced the hassle and interruption of using data transfer systems that require remembering command-lines, user-names, passwords and server addresses, tasks that tend to slow down many biologists. Another example is our implementation of simple techniques such as providing local rules about data descriptions (in particular genome assemblies and annotations) and providing associated tools to make sure files will validate against these rules. We have also adopted Galaxy (ref. 1) as our favored workflow-engineering environment. Galaxy’s intuitive environment facilitates the tasks of both bioinformaticians and biologists. It enables biologists to create and share large and complex analysis pipelines without worrying about tedious details of implementation. In parallel, bioinformaticians are able to develop tools for immediate deployment in a familiar and flexible framework.
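To make the idea of validating files against local rules concrete, here is a minimal Python sketch. It is not TSL's actual tooling: the header naming rule, the allowed sequence alphabet and the function name validate_fasta are all hypothetical, chosen only to show how a small checker can report every violation with its line number.

```python
import re

# Hypothetical local rule (an assumption, not TSL's real convention):
# a FASTA header is ">" + organism code + "_v" + assembly version + "_" + name,
# for example ">Pinf_v1_contig0001".
HEADER_RULE = re.compile(r"^>[A-Za-z]+_v\d+_\w+$")


def validate_fasta(lines):
    """Return a list of (line_number, message) for every rule violation."""
    problems = []
    seen_headers = set()
    have_header = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # blank lines are tolerated in this sketch
        if line.startswith(">"):
            have_header = True
            if not HEADER_RULE.match(line):
                problems.append((number, "header does not match naming rule"))
            if line in seen_headers:
                problems.append((number, "duplicate header"))
            seen_headers.add(line)
        elif not have_header:
            problems.append((number, "sequence before first header"))
        elif re.search(r"[^ACGTNacgtn]", line):
            problems.append((number, "unexpected character in sequence"))
    return problems


if __name__ == "__main__":
    example = [">Pinf_v1_contig0001\n", "ACGTNNACGT\n", ">contig 2\n", "ACGX\n"]
    for number, message in validate_fasta(example):
        print(f"line {number}: {message}")
```

Returning a list of problems rather than stopping at the first one lets a biologist fix a file in a single pass, and the same function could be wrapped as a Galaxy tool or a command-line check.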
Understand and Analyze: Biologists take on bioinformatics
A false distinction is often made between biology and bioinformatics and between biologists and bioinformaticians. To effectively understand and analyze data, it is most useful to give biologists tools and skills to handle bioinformatics as independently as possible. Our philosophy is that for the majority of research projects bioinformatics is just a sub-discipline of molecular biology, albeit one in which the experiments rely on a computer not a thermal cycler or other wet-lab equipment. Once this principle is accepted, biologists must consider learning bioinformatics methods to be as critical as learning basic wet-lab methods, such as performing a polymerase chain reaction (PCR) experiment. Given proper training and demystification of what bioinformatics actually is, we have noted that the majority of biologists are perfectly capable of performing and developing complex bioinformatics methods. A critical aspect of this model is that bioinformatics and biological concepts are now being thought of by the same individual. This significantly accelerates project turnover, and reduces the likelihood of missed insights and misunderstandings.
Bioinformatics is another tool in the biologist’s toolshed, not a confusing black box of tricks
In our experience, many biologists initially approach bioinformatics methods as a set of black box tricks in which the basic rules of rigorous experimentation can be somehow readily ignored. This may be due to the background of numericality caused by all the E-Scores and P-values that gives an unwarranted sense of absolute accuracy to the results of bioinformatics analyses. It has always been puzzling to us how careful bench biologists, who would absolutely insist on control treatments and independent replications of their wet-lab experiments, turn into naïve experimentalists once they sit at the computer. The last author has enjoyed running the perhaps cruel exercise of handing students computer-generated random DNA sequences to annotate and analyze. Frequently, they would report back with findings of sequence motifs and other spurious annotations. This exercise serves to illustrate the importance of negative controls in even the simplest bioinformatics analyses and shatters the myth of the flawless black box.
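The random-sequence exercise is easy to reproduce as a negative control. The following is a minimal Python sketch under stated assumptions, not the exercise as it was actually run: the function names, the 6-base motif, the sequence length and the assumption of equal base frequencies are all illustrative. It counts how often a motif turns up in random DNA and compares that with the number of hits expected by chance alone.

```python
import random


def random_dna(length, seed=None):
    """Return a random DNA string with equal base frequencies (an assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def count_motif(sequence, motif):
    """Count overlapping occurrences of motif in sequence."""
    width = len(motif)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == motif
    )


def expected_hits(length, motif):
    """Expected number of chance matches in uniform random sequence."""
    return (length - len(motif) + 1) * 0.25 ** len(motif)


if __name__ == "__main__":
    sequence = random_dna(100_000, seed=1)
    motif = "TATAAA"  # hypothetical motif of interest
    print("observed in random DNA:", count_motif(sequence, motif))
    print("expected by chance:", round(expected_hits(len(sequence), motif), 1))
```

Any motif count reported for real sequence can then be judged against what the same search finds in random sequence of the same length.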
Bioinformaticians can have the strongest effect on the acceptance and best use of bioinformatic practices by helping in the design and execution of experiments and controls. A particularly relevant example stems from our experience with the detection of single nucleotide polymorphisms (SNPs) from next-generation DNA sequence data. We have completed several projects that aimed to catalogue genetic variation in microbes and plants, and our analysis pipelines include next-generation sequencing alignment and SNP calling algorithms. None of the SNP identification programs gives perfect results, so the amount of error must be quantified. However, this serious limitation is not initially obvious to a biologist whose main focus is the end goal of generating lists of SNPs. In-depth explanations of the methods may not necessarily help, as they can mire the discussion in statistical or technical details that may not be fully appreciated by the biologist. We approach the problem by encouraging the use of controls to demonstrate and estimate the error rates. This can be done, for example, by computationally introducing SNPs into a reference sequence and showing the extent to which recall of these SNPs is accurate (see Figure S2 and methods in Raffaele et al., ref. 2). Such an exercise usually has a profound impact on biologists because they appreciate the value of controlled experimentation and informed criticism of data. It unequivocally demonstrates that bioinformatics methods have error. It frees the biologists to see the approach as just another way of estimating something and to approach bioinformatics as a set of methods that can be dissected with the familiar knife of experimentation. It also emphasizes the need to run method optimizations. Once these limitations of bioinformatics methods are acknowledged, the biologist will usually adopt standard controls to routinely estimate error rates and maximize sensitivity and specificity.
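The spike-in control described above can be sketched in a few lines of Python. This is an illustrative outline, not the method used in Raffaele et al.: the function names, the toy reference and the stand-in set of called SNPs are assumptions. In a real control, reads would be simulated from the mutated sequence, aligned to the original reference and passed through the SNP caller, and the caller's output would take the place of the `called` dictionary.

```python
import random


def introduce_snps(reference, n_snps, seed=None):
    """Return (mutated sequence, {position: new base}) with n_snps substitutions."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(reference)), n_snps)
    mutated = list(reference)
    truth = {}
    for pos in positions:
        # Always pick a base that differs from the reference at this position.
        alternatives = [base for base in "ACGT" if base != reference[pos]]
        new_base = rng.choice(alternatives)
        mutated[pos] = new_base
        truth[pos] = new_base
    return "".join(mutated), truth


def recall_and_precision(truth, called):
    """Compare called SNPs {position: base} with the introduced ones."""
    true_positives = sum(
        1 for pos, base in called.items() if truth.get(pos) == base
    )
    recall = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(called) if called else 0.0
    return recall, precision


if __name__ == "__main__":
    reference = "ACGT" * 250  # toy reference; a real control would use the genome
    mutated, truth = introduce_snps(reference, n_snps=20, seed=42)
    # Stand-in for SNP caller output: 15 of the 20 introduced SNPs recovered.
    called = dict(list(truth.items())[:15])
    recall, precision = recall_and_precision(truth, called)
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Repeating the comparison across caller settings or coverage depths gives the biologist a direct, experimental estimate of sensitivity and specificity rather than a statistical argument to take on trust.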
In our experience, the general concept of approaching bioinformatics with the same mindset as bench experimentation, i.e. with a critical mind and a set of robust controls, propagates rapidly within a research group. Once the concept is adopted in lab meetings and research discussions and the issues are explained by biologists to other biologists, we reach a virtuous cycle whereby the use of criticism and controls becomes reinforcing in a laboratory. Our ultimate aim is that the biologists use bioinformatics in a mature and critically aware way, not just a mechanical one.
Making the model work requires an open dialogue between bioinformatician and biologist
A key to success in the bottom-up model is to begin and sustain a productive working dialogue between the two parties by giving the biologist the vocabulary needed to work in the field and discuss issues as a peer of the bioinformatician. It is trivial to teach a biologist the mechanics of command-line program execution, but a more wide-ranging effort is required to create the sort of sea change we aim for. At TSL we decided to first focus on training and to get our bioinformaticians to discuss their tricks and toys in a relaxed yet formal fashion. We began by implementing a wide range of courses aimed at the novice but covering enough ground to introduce all the vital aspects of each topic. Specifically, we began by teaching biologists about driving the computer through the command-line, introduced scripting languages through Perl and Ruby courses and tied these together by showing how ad hoc but flexible pipelines could be created combining the two and how this saves time and effort. Illustrating these advantages is powerful in convincing biologists to invest in learning the basics of bioinformatics. We also implemented statistics training through an introduction to R, and taught the use and utility of databases through MySQL and advanced Excel.
After formal training sessions, follow-up is vital. Help and resources should be available on demand and the trainer needs to operate an open-door policy for questions. Answers to questions and discussions on request will help to prevent the learner’s enthusiasm from stalling early on. One key objective is to eliminate biologists’ frequent ‘command-line aversion’. Courses on command-line use and introductory programming with a scripting language encourage biologists to start employing high-throughput approaches when working with their big data. High-level courses on algorithms and specialized method workshops help to bring the scope of the field into better focus. To reinforce our message we provide a small library of books from technical publishers like O’Reilly and Apress, as well as our own printed and online resources. We also provide web access to recordings of training courses and screencasts on quick how-to’s.
Wider lab-culture changes can be sustained and extended through a range of familiar exercises and resources. Journal Club meetings specifically designed to tackle discrete bioinformatics topics like ‘What is FASTQ?’ or ‘What are the tools available for Next-Generation sequence alignment?’ help enormously to reinforce awareness of what is being done in the field and what the details of execution are. Lab meetings in which the biologist presents their bioinformatics work to an interested audience provide a vital opportunity to develop critical appraisal of informatics methods.
Our model increases productivity (A). We collated system statistics on the number of jobs being run concurrently per hour and per day ‘before’ and ‘after’ a period in which we implemented our new systems for data management, set up our Galaxy instance and carried out training. We see a general rise in the jobs being run per hour after the changes, and the box-plots show a significant increase in the jobs run per day (t-test, p = 0.027).
We have many biologists using many bioinformatics approaches (B). A high proportion of our staff are carrying out bioinformatics projects and using a wide range of methods. Each bar represents a method and each colour represents a user (projects using that method are indicated by the size of the block in each bar). Most biologists work on 2–4 informatics projects, applying 2–3 methods. The most widely used methods are resequencing and de novo assembly, which contribute to our genomics projects.
The bottom-up model brings greater productivity and scalability
Our approach has borne fruit. We have found that when biologists are able to handle part or all of the bioinformatics load on their projects, our productivity increases (Figure 1). The turnover of jobs run on our computer cluster increased significantly after opening it up to trained biologists. The number of concurrent bioinformatics projects we are now handling is high too. In total, 25% of our researchers (20 individuals) are now actively involved in running their own bioinformatics projects, well above the 2.5% (2 researchers) that the old model permitted. The number of projects that we can handle is limited much more by the number of biologists than by the capacity of the core support team. As the analysis moves away from being the primary responsibility of core bioinformaticians, extra analysis capacity can now be brought in at the project level when hiring new biologists and is not throttled by the size of the core support team. Thus the available expertise can scale with the number of projects that require bioinformatics methods.
Compute power can also be a limitation to bioinformatics, but this problem is not as acute in smaller institutions as it is in larger sequencing centers. A massive infrastructure investment is not necessary and it is possible to provide compute infrastructure that can be expanded as needed. We found that when the small core team was responsible for the majority of work with our hardware there was often a lot of spare processing capacity and analyses did not run flat-out at every moment. With job-scheduling software, modest compute clusters can be made to support the activities of many researchers by distributing resources evenly through time. It is possible to acquire, for moderate cost, a few high-powered servers and a storage device that can be built into a compute cluster easily. Well-designed clusters can be expanded by adding new servers and extra disks to storage appliances whenever projects require it.
Conclusions: A win-win proposition
It may seem that the scheme we have developed makes gains for biologists at the expense of bioinformaticians, in that the bioinformaticians’ work is moved away from data analysis into training and systems-administration, a role that may not suit those with a keen interest in research. In fact, we have found that the main advantage of implementing the bottom-up model is that it pays back in saved time for the bioinformaticians, thereby creating a new set of opportunities. Bioinformaticians working within this model will free up time to follow their own projects, such as research into new methods. With a newly qualified, captive beta-testing audience and more time to exploit the growing data, it becomes possible to simultaneously push forward the institution’s research projects as well as the bioinformatician’s reputation.
Our experience shows that bottom-up models of informatics provision are flexible and scalable where top-down models are not. As well as paying off in the current environment, such flexibility also helps to pave the way for future changes. Coming generations of biology students, taught in a data-rich environment, will be more knowledgeable about techniques for handling big data and will be primed to begin such projects without needing much central assistance. As biology becomes a more data-rich science it will attract more computer science trained students and researchers. Institutes and laboratories need to provide a working environment that supports a wide range of skill levels in biology and computer science to take full advantage of their main asset, people. Our approach grants this flexibility and means that we are not reliant on any single entity in our organization to make progress. Another key gain has been time and efficiency. We now get through the bioinformatics components of our projects faster. We have quicker paths to insight because bioinformatics concepts are being processed by the same brains that think about the biology. Once a critical and informed approach to bioinformatics takes hold it rapidly spreads from member to member in a virtuous cycle. An important part of service provision becomes creating environments for insights, an ecosystem of colleagues with deep experience of methods and a sustaining self-reproducing community of peers.
1 Goecks, J., Nekrutenko, A., & Taylor, J. Genome Biol. 11:R86 (2010).
2 Raffaele, S., Farrer, R.A., Cano, L.M. et al., Science 330:1540–1543 (2010).