You’ll recall from my blog Part 1: Data, Doctor Who and Utilities, that the Sensus and Xylem Data Science team invented a methodology for making sure that our data science isn’t just correct at any given moment, but that we can also recapitulate (ideally) the entire history of any given research project quantitatively.
This is achieved through the proper application of containerization, version control and dependency management. In essence we’re adapting the tools of software engineering to the discipline of data science. In this blog, I explain how this is done, using poetry instead of code.
Git: The bridge between software engineering and data science
Git, a distributed version control system created by Linus Torvalds in 2005 to replace BitKeeper, is one of the most important software tools in the entire world. The canonical history is that the Linux developers were cut off from their free access to the commercial BitKeeper because Andrew Tridgell attempted to “reverse engineer” its network protocol, in violation of its license. Before BitKeeper, Linus had managed all of the distributed development of Linux (an open source operating system) using just two standard command line utilities: patch and diff. People would email patches and he would apply them, or break them up, or deny them. This simple workflow is still the basis of git itself, which you can think of as a blockchain of sorts built up out of patches.
Many of us on the data science team came out of software engineering, or at least have taken a bit of a segue through it, on our way to our current work. In software engineering, the disciplined usage of git is practically an entry-level job requirement. This isn’t so much the case with research-oriented professionals, where adoption is patchy. There certainly are some wizards out there, but it’s not uncommon to see researchers manage their code bases with little more than a directory with folders named “Analysis V1”, “Analysis V2”, etc.
This is partially understandable: despite superficial similarities, research isn’t software engineering and researchers aren’t really software engineers. But in the end, it’s a bit of a tragedy, because git offers amazing capabilities to the researcher interested in a repeatable and traceable process (if at a somewhat steep initial cost).
For the uninitiated: patch and diff
Despite literal decades of agitation, almost all code is still, in the end, and fundamentally, text. While other strategies for denoting processes constantly appear (and disappear), code as text has such longevity precisely because of the tractability of the representation. This is admirably demonstrated by the simple little command line utilities patch and diff.
Suppose someone sends you a copy of the poem “Water” by Robert Lowell:
It was a Maine lobster town—
each morning boatloads of hands
pushed off for granite
quarries on the islands,
and left dozens of bleak
white frame houses stuck
like oyster shells
on a hill of rock,
and below us, the sea lapped
the raw little match-stick
mazes of a weir,
where the fish for bait were trapped.
Remember? We sat on a slab of rock.
From this distance in time
it seems the color
of iris, rotting and turning purpler,
but it was only
the usual gray rock
turning the usual gray
when drenched by the sea.
The sea drenched the rock
at our feet all day,
and kept tearing away
flake after flake.
One night you dreamed
you were a mermaid clinging to a wharf-pile,
and trying to pull
off the barnacles with your hands.
We wished our two souls
might return like gulls
to the rock. In the end,
the water was too cold for us.
However, you suspect that it has been tampered with. The poem is somewhat longish, and you don’t want to compare the version you’ve received line by line with a good copy you’ve got. The tool diff can do the comparison for us:
capitulation vincenttoups ~/work/xylem/blog-posts $ diff -u water-received.txt water-good.txt
--- water-received.txt 2019-03-14 14:00:21.000000000 -0400
+++ water-good.txt 2019-03-14 14:00:37.000000000 -0400
@@ -20,7 +20,7 @@
but it was only
the usual gray rock
-turning the usual gray
+turning the usual green
when drenched by the sea.
The sea drenched the rock
What we’ve got here is called a patch or (colloquially) a diff. It is a description of the difference between our two files. If that were the end of it, it would only be slightly useful, but the brilliance of diff is that the output also constitutes a set of instructions for changing one file into another.
The above patch says: “To make water-received.txt into water-good.txt, remove the line turning the usual gray from the former and add the line turning the usual green in its place (in the hunk that starts at line 20).”
Indeed, the utility patch can be made to perform such a change given the output of diff.
diff -u water-received.txt water-good.txt > patch
patch water-received.txt < patch
This will modify water-received.txt so that it matches water-good.txt.
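If you don’t have the poem files handy, the round trip can be tried on a three-line excerpt. This is only a sketch: the scratch directory and the file name water.patch are choices made for the example, and it assumes the standard diff, patch and cmp utilities are installed.

```shell
# Work in a scratch directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Two tiny stand-ins for the received copy and the good copy.
printf '%s\n' 'the usual gray rock' 'turning the usual gray' \
    'when drenched by the sea.' > water-received.txt
printf '%s\n' 'the usual gray rock' 'turning the usual green' \
    'when drenched by the sea.' > water-good.txt

# diff exits with status 1 when the files differ, so it is not
# chained to the next command with &&.
diff -u water-received.txt water-good.txt > water.patch

# Apply the recorded changes to the received copy.
patch water-received.txt < water.patch

# cmp -s is silent and succeeds only if the two files are now identical.
cmp -s water-received.txt water-good.txt && echo "files match"
```

Running patch with the -R flag on the same patch file would undo the change, which is a large part of what makes patches useful as a record of history.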
Here is the critical insight
A software project or a data analysis pipeline is a piece of text, sure, but it can equally be thought of as a series of patches applied to a project starting with nothing at all.
Imagine, once an hour, copying your project to another folder, working for an hour, and then calculating the patch which takes all your files from their previous state to the current state. Collect those patches in order. Starting from nothing, you could reconstruct the current state of affairs by simply applying each of those patches in turn. Even better, you can now see how your project evolved. You can see false starts. You can see when bugs crept in and when and how you squashed them. If you forget, for some reason, why a line of code looks the way it does, you can go back in time to the moment you wrote it.
This is, essentially, what a git repository is: a tree of patches which tells the entire story of your development process.
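The thought experiment above can be acted out with nothing but diff and patch. The following is a toy sketch, not how git works internally: the directory names snap0, snap1, snap2 and rebuilt, and the contents of analysis.txt, are all made up for the illustration.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Start from nothing at all: an empty snapshot.
mkdir snap0

# "Hour one": the project gains its first file.
cp -R snap0 snap1
printf 'load data\n' > snap1/analysis.txt

# "Hour two": one line changes and another is added.
cp -R snap1 snap2
printf 'load clean data\nfit model\n' > snap2/analysis.txt

# Record the patch between each pair of consecutive snapshots.
# -r recurses, -u gives unified format, -N treats missing files as empty.
# diff exits non-zero when it finds differences; that is expected here.
diff -ruN snap0 snap1 > 0001.patch
diff -ruN snap1 snap2 > 0002.patch

# Replay the history into an empty directory, oldest patch first.
# -p1 strips the leading snapshot directory from the paths in each patch.
mkdir rebuilt
cd rebuilt || exit 1
for p in ../0001.patch ../0002.patch; do
    patch -p1 < "$p"
done
cd ..

# The rebuilt tree should be identical to the latest snapshot.
diff -r snap2 rebuilt && echo "history replayed"
```

Stopping the loop after the first patch would give you the project exactly as it stood at the end of hour one.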
Terms and conditions apply
Like any tool, the real value of git, for a data scientist or a software engineer, depends on proper usage. Naive git usage accomplishes little more than organizing occasional backups of your repository, and can, in some cases, cause more headaches than even this utility justifies. Here are the rules to which we adhere, as data scientists:
- Don’t use git for large files
- Don’t check in results
- Make small commits with good commit messages
Don’t use git for large files
Git has to calculate diffs and something called hashes. These can be slow for large files. Data scientists often deal with large files, though, so what is the solution?
Use a service like Amazon S3 to store large files in the cloud. When someone checks out the repository, they run a script which fetches the appropriate files into the local workspace.
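A fetch script of this kind might look something like the sketch below. The bucket URL, the data directory and the DRY_RUN switch are all hypothetical, invented for this example. The real branch assumes the AWS command line tool is installed and has credentials configured.

```shell
# Hypothetical bucket and local directory; replace with your own.
BUCKET_URL="s3://example-bucket/project-data"
DATA_DIR="data"

fetch_data() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command that would be run.
        echo "aws s3 sync $BUCKET_URL $DATA_DIR"
    else
        # Copy anything missing or changed from the bucket into DATA_DIR.
        mkdir -p "$DATA_DIR"
        aws s3 sync "$BUCKET_URL" "$DATA_DIR"
    fi
}

# Show what would happen without touching the network.
DRY_RUN=1 fetch_data
```

The script itself is small and textual, so it belongs in the repository. The files it downloads do not.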
You can make git ignore certain files by listing them in a special file, .gitignore. We can ignore an entire directory by putting its name, followed by a slash, on a line of its own in that file.
Our scripts can automatically grab data and place it in that directory, and it will never be captured by git.
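Here is a small sketch of that in a throwaway repository. The directory name data/ and the file big-file.csv are made up for the example.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Ignore everything under data/ (a hypothetical directory name).
printf 'data/\n' > .gitignore

# Simulate a file that our fetch script has downloaded.
mkdir data
printf 'a,b\n1,2\n' > data/big-file.csv

# git status lists .gitignore as untracked but says nothing about data/.
git status --porcelain

# git check-ignore exits with status 0 when the path is ignored.
git check-ignore -q data/big-file.csv && echo "data/big-file.csv is ignored"
```

The .gitignore file itself should be committed, so that everyone who clones the repository gets the same rules.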
Don’t check in results
In software engineering, this rule translates to “don’t check in compiled artifacts” – git is meant to manage source code, not executables. The repository should contain scripts to build the results of the project. By not checking results in, we make sure that the results someone looks at always reflect the analysis as it appears in the source code.
If we checked results in, someone might look at our repository, see results, and assume they reflect the current state, although nothing ensures this is the case.
Make them run the analysis locally, so they can trust the results.
Again, use .gitignore to tag appropriate files as being artifacta-non-grata.
Make small commits with good commit messages
This is one of the hardest ones for new git users. It is also the single most important habit, and the one which generates the most long term value.
Each commit is a little nugget of history. Often, we’re interested in understanding the exact reason a line of code has changed. The best source for that information is the commit message (a little note we associate with each commit). If we routinely make big commits, the utility of that message tends to be proportionally small, since it refers to a large volume of changes.
We might also want to pinpoint exactly when, for instance, results went from positive to negative, in the history of the project. If we want to know exactly what caused the change, large commits are our enemy. If each commit just touches 3-5 lines of code, then we’ll be able to (automatically, often) narrow down our search to just a few lines.
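The automatic narrowing can be done with git bisect, which performs a binary search through the history. The sketch below is a toy: the repository, the result.txt file and the check.sh script are invented for the illustration, and a real check would run your analysis rather than grep a file.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# Identity for this throwaway repository only.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Five small commits; the made-up result flips sign in the fourth.
for i in 1 2 3 4 5; do
    if [ "$i" -lt 4 ]; then
        echo "result=1" > result.txt
    else
        echo "result=-1" > result.txt
    fi
    echo "step $i" >> notes.txt
    git add result.txt notes.txt
    git commit -q -m "Step $i"
done

# The check exits 0 (good) while the result is positive, 1 (bad) otherwise.
cat > check.sh <<'EOF'
grep -q '^result=1$' result.txt
EOF

# Tell bisect that HEAD is bad and the very first commit is good,
# then let it run the check on the commits in between.
first=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$first" >/dev/null
git bisect run sh ./check.sh >/dev/null

# refs/bisect/bad now points at the first commit where the check failed.
culprit=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $culprit"

# Return the working copy to where we started.
git bisect reset >/dev/null 2>&1
```

Because each commit here changes only a line or two, knowing the culprit commit is nearly the same as knowing the culprit line.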
So how do we make small commits? Novices often imagine that they are expected to stop every 20 minutes or so to create a commit that covers their most recent changes and that, to accomplish this, they must work in an unnaturally sequential and single-minded way.
But that isn’t how pros do it.
We can use a shotgun to create our commits, staging every change in the repository at once by saying
git add -A :/
But git also gives us a highly granular tool to take a large number of changes and to split out just a few of them for a small commit.
Git has this notion called the “staging area.” This is a set of patches which are going to be the next commit. When we first learn git we learn to add whole directories or files to the staging area, but we can actually look through all the changes in our working copy and just select a small set of them using interactive staging.
git add -i
This gives us an interactive menu which helps us visit each change we’ve made and decide whether to include it in the current commit. Many experienced git users work for a day or more without making any commits, work on as many issues as they want, and then use interactive staging to create a coherent set of small changes. Mastering this technique is the single most important piece of git facility you can develop.
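To get a feel for it, here is a sketch in a throwaway repository. It uses git add -p, which skips the menu and goes straight to the patch-by-patch prompts. Normally you would answer those prompts by hand; here the answers are piped in so the example runs unattended. The file numbers.txt and its contents are made up.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# Identity for this throwaway repository only.
git config user.name "Demo User"
git config user.email "demo@example.com"

# A 20-line file leaves room for two well-separated changes.
seq 1 20 > numbers.txt
git add numbers.txt
git commit -q -m "Add numbers"

# Two unrelated edits made in the same working session.
sed -e '2s/.*/two/' -e '19s/.*/nineteen/' numbers.txt > numbers.tmp
mv numbers.tmp numbers.txt

# Answer "y" to the first hunk (line 2) and "n" to the second (line 19).
printf 'y\nn\n' | git add -p numbers.txt >/dev/null

# Commit only what was staged.
git commit -q -m "Spell out line 2"

# The second edit is still sitting, uncommitted, in the working copy.
git diff --stat
```

The edit to line 19 can then go into its own commit, with its own message, whenever we are ready.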
At Sensus we apply all of these pieces of advice, in conjunction with other technologies, to make sure that we always know how our data science works and to know how it worked in the past.
Those interested in learning git ought to check out the git book.