Chapter 2 General Principles

New ideas about what makes a good data analysis are emerging. With data so readily available in vast quantities, analyzing it with outdated, unscripted tools such as Microsoft Excel quickly becomes overwhelming, irreproducible, error-prone, and difficult to assess for reliability.

Much of the progress in developing analyses has been made in the field of biostatistics, owing to the high volume of genomic data that researchers there deal with. One of the most troubling examples of what can go wrong with an analysis comes from genomics and cancer treatment. In the ‘Duke Scandal’, researchers made mistakes in their data analysis that were extremely difficult to track down and that resulted in patients receiving the wrong cancer treatment. This is an extreme example that affected people’s lives directly. I would argue that the work we do at Hakai, analyzing ecological data, has far-reaching implications of its own and should be treated with the same degree of care.

Some important concepts in defining a good data analysis are:

  1. Reproducible Research
  2. Open Science Collaboration

Together, these two concepts make up a very modern way of conducting science that is beneficial in the following ways:

  • An additional peer review process becomes possible in the development of your analyses—peers can review how you conducted your analysis
  • You can reduce your work in the long run by being able to reproduce your own analyses after you’ve long forgotten the details of how they were conducted
  • If you weave narrative text into your R code, you’ll be able to understand what you were thinking when you revisit the analysis later
  • You can meaningfully collaborate
  • You can show your peers that you have nothing to hide in your analytic methods, immediately increasing the reliability of your work
  • You can share your analyses in hopes that others will improve the quality of your analysis by offering their insight

2.1 Reproducible Research

A big concern with modern scientific studies is the inability to reproduce study findings. As experimental and analytic methods become increasingly complex, published methods often lack the detail required to reproduce the study. As data analysts, it is our responsibility to annotate our analyses thoroughly enough that others can understand exactly what was done.

If your study finds something very interesting, people are going to want to know how you came to your conclusion. The gold standard for verifying a study is to replicate it independently. However, before investing the huge amount of resources required for independent replication, a better place to start is to reproduce the study’s findings using the original data and analysis.

For your study and your analysis to be reproducible, and to be viewed as trustworthy, you need to provide the data and the scripted code you used to clean, summarize, model, and visualize that data; a reviewer must then be able to run that same code and see the same results. This level of transparency allows a reviewer to look very closely at how you conducted your analysis, adding a step to the peer review process that was not possible with unscripted analyses.
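As a minimal sketch of what such a scripted analysis can look like in R (the file name, column names, and quality flag here are all hypothetical), every step from raw data to figure lives in code that a reviewer can re-run:

    library(tidyverse)

    # Read the raw data; the raw file itself is never edited
    chl <- read_csv("data/chlorophyll.csv")  # hypothetical file

    # Clean: keep quality-controlled surface samples
    chl_clean <- chl %>%
      filter(qc_flag == "AV", depth_m <= 5)

    # Summarize: mean chlorophyll by site and year
    chl_summary <- chl_clean %>%
      group_by(site, year) %>%
      summarize(mean_chl = mean(chl_ug_l, na.rm = TRUE), .groups = "drop")

    # Visualize: one line per site
    ggplot(chl_summary, aes(year, mean_chl, colour = site)) +
      geom_line() +
      labs(x = "Year", y = "Mean chlorophyll (ug/L)")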

A simple example of how easy it is to break the reproducibility chain comes from working with your data in Excel. By deleting some values that look like outliers, without recording anywhere that you did so or why, you have effectively broken the reproducibility chain. Another person could not take the raw data and come to the same conclusions as you did; their results would differ because of the missing values.
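Contrast that with removing the same values in a script: the decision and the reasoning are recorded, the raw file is untouched, and anyone re-running the code removes exactly the same rows. A sketch, assuming a hypothetical temperature dataset and threshold:

    library(readr)
    library(dplyr)

    temps_raw <- read_csv("data/temperature.csv")  # hypothetical file

    # Remove suspected sensor errors: values above 40 C are physically
    # implausible at this site (hypothetical threshold). The decision
    # lives in code, not in an untracked spreadsheet edit.
    temps_clean <- temps_raw %>%
      filter(temperature_c <= 40)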

Some academic journals, such as Biostatistics, are taking proactive steps by adopting a reproducibility policy for article submissions. Nature has also taken a stance on reproducibility and has published several articles in a special collection called Challenges in Irreproducible Research. If you search for ‘reproducible research’ you will find many initiatives attempting to solve this important problem.

Most often, the peer reviewer or collaborator you will work with is ‘future you’. Scripting reproducible analyses with embedded narrative allows future you to understand what past you was thinking. This saves you a lot of time and lets you build on your previous work. Embedding your own narrative, or adding comments to your code, introduces the idea of literate programming: weaving together human-readable narrative text that explains what your computer code is doing and why you decided to do it. This greatly increases the quality of your work.
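In R, the most common tool for literate programming is R Markdown, which interleaves narrative text with executable code chunks in a single document. A minimal sketch (the file, column, and chunk names are hypothetical):

    ---
    title: "Monthly temperature summary"
    output: html_document
    ---

    Values above 40 C were removed as implausible sensor readings;
    see the cleaning script for details.

    ```{r monthly-means}
    library(readr)
    library(dplyr)
    temps_clean <- read_csv("data/temperature_clean.csv")
    monthly <- temps_clean %>%
      group_by(month) %>%
      summarize(mean_temp = mean(temperature_c))
    monthly
    ```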

For some comic relief on the follies of working with someone who doesn’t use reproducible research methods, watch this YouTube video.

2.2 Open Science Collaboration

2.2.1 Peer Review

Open science collaboration establishes an additional peer-review step in the scientific process. By making your analysis public on a platform like GitHub, built on the distributed version control system Git, you open up the possibility of ongoing peer review of your analysis, as well as the opportunity for not only your immediate colleagues but also experts in your field to contribute in a meaningful and formalized way.
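If you work in R, one convenient route onto GitHub is the usethis package; this is just one option, sketched here assuming your analysis lives in an RStudio project and you have a GitHub account with a personal access token configured:

    library(usethis)

    # Put the current project under Git version control
    use_git()

    # Create a GitHub repository for the project and push to it
    # (run usethis::gh_token_help() first if no token is set up)
    use_github()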

Using consistent formatting when writing your code makes it easier for everyone to interpret. I recommend the tidyverse style guide. Have a read through the guide and try to implement it in your next script.
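As a small illustration of the difference consistent style makes, here is the same filtering step written twice, on a hypothetical data frame df:

    library(dplyr)

    # Hard to read: no spacing, everything crammed onto one line
    surface<-df[df$depth_m<=5&df$qc_flag=="AV",]

    # tidyverse style: spaces around operators, one step per line,
    # and a descriptive name for the result
    surface_samples <- df %>%
      filter(depth_m <= 5, qc_flag == "AV")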