Chapter 6 Version Control and Collaboration

Version control is an additional level of saving your files. The old school method of version control is to have multiple versions of the same file on your computer with different dates or initials to identify the version you want to work on (eg. fishy_analysis_V9_2017_05_11_BJ_edits.R, etc…) — this is what version control using Git and GitHub is aims to simplify.

There are two methods that are recommended for version controlling your files: Google Drive and Git/GitHUB. Each has their own advantages and disadvantages and can be used under different circumstances.

6.1 Google Drive

6.1.1 Version Control

Googlesheets, Google Docs, Google Slides, has for me, replaced Excel, Word, and Powerpoint. The ability to collaborate in real time on the same file as someone else has great advantages and solved many of the issues with version controlling edits to a source file. The version controlling is taken care of for you and you can look into the version history to see exactly what and when changes were made and by whom. Essentially everyone works off the master copy (or using git jargon — commits to the master branch). The version history of any Google file (Sheet, Doc, Slides) can found on the browser version of Google Drive by opening the document or sheet and selecting file > version history. If you’d like to make a specific version more memorable before or after some significant change, you can ‘name the version’. The actual name of the file does not change, however the version will receive a name in the version history. This allows you to better keep track of important changes to a file.

Keeping data files and R scripts in Google Drive for version controlling isn’t quite as slick as it is for google products (sheets etc.). It does, however, certainly have some advantages compared to emailing colleuages different versions of files back and forth, and renaming them each time.

Everytime you save a file that is located in a Google Drive folder on your computer, it is backed up in the cloud and the version of that file is saved. Managing the version history of files other than Google Sheets, Docs, Slides etc, is a little different. Only the most recent versions of your file will be saved unless you right-click the file in the web version of Google Drive and check the box that says ‘Keep Forever’.

Google Drive is a simplified and intuitive version control system that can work well under some scenarios. Google Drive falls short, however, when you and a colleauge are working in a collaborartive development of an .R file. Maybe you are both working on the same .R file to build a complicated analysis, and your part depends on their part so you have to do it in the same script. If there is a chance you could be working on the same script at the same time, then Google Drive will fail you. If you both opened the .R file from Google Drive at the same time and one person finished making their edits after the other, the first persons edits would be erased as the second person uploads their version. This is one of the major problems that Git and GitHUB solves.

6.2 Git and GitHUB

While peer-review permits the scientific merit of your analyses to be assesed, code-review permits a colleague to assess your analyses for coding errors. Developing analyses is difficult and error prone. Using the distributed version control system GitHUB, you are able to track changes to your code over time and enable others to suggest edits for your review, leaving you with a history of exactly what was added when and by whom — this is version control. The distributed aspect of ‘distributed version control’ means that multiple people can access the version control through your repository of all the files of your analysis (and their previous versions). You can make your repository (repo) private or public. By acessing your repo, which is hosted on GitHub, others can meaningfully collaborate, conduct peer-review, and code-review. Git is the local version control system running in the background on your computer while GitHub is the remote user interface for saving, tracking, and sharing updated versions of your files that are hosted on a remote server. Git and GitHub are integral for tracking the evolution of a set of files, and the development of your analysis.

The additional ‘save’ that comes from version control using git is known as a commit. You save your files like you normally would, but every once in a while you commit your files as an official version to be remembered. A commit can be thought of as a bullet point in the to do list of your analysis, and each commit you make must be accompanied by a message. For example; ‘read in data and tidy it up’, or ‘remove observations from non-standard sampling event, and re-fit GLM’. Git tracks the commits you make in R-Studio locally on your own computer. When you are ready for a series of commits to be made public, you push your commits to your remote repository at GitHub.

If you’re interested in learning why version control is important, I encourage you to skim Chapter 1; ‘Why Git? Why GitHub’ from Jenny Bryan’s Book. I recommend referring back to this book whenever you have a question about using git and GitHub with R-Studio. You can also watch this video for an introduction to Git and GitHub.

6.2.1 Step-by-step Install Git/GitHub

  1. Setup a GitHub Account: First, you’re going to want to sign up for a GitHub account

  2. Install and set-up Git: Installing Git locally and getting Git to communicate to your remote GitHub website account, and then getting them to talk with R Studio takes a number of steps to complete…This next part can be painful to work through, but it is 100 % necessary and well worth it in the long-run. Install Git and GitHub and get them talking to R-Studio using Jenny Bryan’s Guide, follwow along wth chapter 7, 8, 10, and 13.

  3. Put this all in context: To learn how to use Git, GitHub and R-Studio I recommend watching the R-Studio Essentials tutorial; GitHub and R-Studio.

If you are working directly for Hakai, you should contact the Ecological Information Management department: eims@hakai.org, to request that you be brought on to the Hakai GitHub team so that your work is stored in the Hakai repositories rather than on your personal account.

6.2.2 Start an R-Studio project under version control

6.2.2.1 Set up a repo on GitHUB

If you are working on a specific program, there’s a good chance your program already has a central GitHUB repo set up for all your programs analyses. If you’ve been asked to conduct some data summaries or analyses you should talk to your PI about storing the code you write in your program’s repo. If your program doesn’t have a repo, you’re going to want to set one up. Make sure you talk to your PI and someone from Hakai EIMS about this.

Whether you’re setting up a repo on your own GitHub account or the Hakai one, your going to want to first create the repo, and then create your R-Studio project by following these instructions. If the repo you want to use already exists, then just go straight to the next section: ‘Start a new R-Studio project.’

  1. Go to https://github.com and login.

  2. Create a new repository on your own account, or on the Hakai account depending on the scenario.

  3. Call it something-descriptive-but-concise.

  4. Make it private to start (talk to your PI or IT about making it Public) and click yes to initialize with a README.

  5. Click the big green button “Create repository.”

  6. Copy the HTTPS clone URL to your clipboard via the green “Clone or Download” button. You’ll need this to link the repo with R-Studio.

6.2.2.2 Start a new R-Studio project

To link the gitHUB repo to your Google drive folder and a new project in R-Studio:

  1. Open R-Studio and go to File > New Project > Version Control > Git. In the “repository URL” paste the URL of the Git Hub repository you just made. Under project directory name, type in the name of whatever this analysis is, for example “soft sediment spp diversity” — this will be the name of the analysis’ directory. Under ‘Create project as a sub-directory of’ navigate to your programs Google Drive folder for analyses. Click “Create Project”.

  2. Go to Tools in the menu bar, and navigate to Project Options. Click Packrat, and then check the box ‘Use Packrat with this project’. This may take a few minutes to initialize, but is important to make your analyses reproducible in the long term.

This should download the README.md file that we created on GitHub in the previous step. Look in R Studio’s file browser pane for the README.md file."

6.2.3 GitKraken

When you first start, using the git feature directly in R-Studio is more than adequate. If however, you’d like more features, then I recommend the following software and workflows.

GitKraken is a graphical user interface that allows you to visualize your repositories branches and history. The Hakai IT team uses this software when developing tools, and it works quite well for backing up and monitoring the development of an analysis as well. Check out GitKraken to see if it’s something you’d like to use.

6.2.3.1 Git Flow

Git Flow is a workflow that can be implemented in GitKraken by going to GitKraken > Preferences > GitFlow and then initialize gitflow. The workflow is generally good whether working individually or collaboratively. Some will argue that there’s no such thing as working individually, especially when you consider that you’re constantly collaborating with past-you, and future-you.

Some terms related to version control and Git Flow:

  • Branching - This is a core concept in Git Hub and version control in general – The master branch must always remain stable and working. When you initialize git flow it creates a development branch. Essentially, you are creating an environment where you can try out new ideas without directly manipulating the stable ‘master’ version of your analysis. When you are happy with the changes you push the development branch to the master branch.

  • Commits - Commits represent significant changes to your branch. I think of them as mini-milestones that add up to your complete analysis. Perhaps you’re mini-milestone is to create a linear model of something. You can work locally and make save files on your computer while working on the linear model like you normally would. When you finish your linear model, this is a good time to make a commit to your branch. You must name each commit, and this should be descriptive. In this case ‘fitted linear model’ would be ideal.

To learn more about git flow check out this article

6.2.4 Fork and Pull

If someone from outside the Hakai Institute GitHUB account wants to review your analysis and suggest changes, make additions, or generally collaborate with you, you can use a different workflow known as the fork and pull model.

  • Fork - This is the process of copying someones repository and creating a fork in the development branch where you can work the forked copy all you want without affecting the original code. When you’re happy with the changes you made to the repo, you can make a pull request.

  • Pull Requests - I like to think of pull requests as an opportunity to review and assess external changes to the repo before you may decide to merge the fork back into the main branch. Something that helped clarify what a pull request is for me, was thinking about the difference between a push (write) request, and a pull (read) request. In the context of a shared repository, a pull request gives everyone that the repo is shared with a change to read your request to consider the sum of your mini-milestones. Pull requests should amount to the completion of a component of your project. See here for more info.

  • Merge - Once a Pull Request has been sufficiently reviewed, discussed, and tested (deployed), the new code can be merged with the master branch, typically by the team owner, or maintainer.

  • Issues - You can create an issue to raise a bug report, suggest some functionality, or discuss the objectives of a certain project.

6.3 Best Practices

6.3.1 Citing analyses

  • Citing analyses can be done as text files that get included with data packages that can be distributed with manuscript.

  • Hakai has the ability to zip scripts and data files from a gitHUB repo and assign a DOI to it with a metadata entry. This isn’t mutually exclusive to maintaining a GitHUB repo and pointing collaborators to archived versions of a repo.

6.3.2 Private v. public repos

  • We don’t really want anonymous use of our analyses, so it’s best to use private repos and add external collaborators if we want them to review or contribute to our analysis.

6.3.3 Granularity of repos

  • Best practice is to not make repos too small.

  • In cases where you want to cite an analysis, need finer control of access permissions, or when a tool requires a separate repo (eg. creating an R package) you’ll have to create a separate repo.

  • There are ways to take folders out of one repo to create an independent repo which retains commit history. Therefore, if you are uncertain about whether you need to split into a different repo, it’s best to lump folders and split later if need be.

  • Managing repos with gitHUB ‘teams’ is a good workflow.

6.3.4 Code Licensing

  • It’s a good idea to license with MIT license to permit commercial use.

  • This license should be included in every script or data package, but we will still control who has access to code and data by requiring them to request access to data packages or scripts.