What’s the connection between data, Doctor Who and utilities?
Reliable, reproducible and traceable data science. Intrigued? Read on…
Data science can be hard in different ways.
We often dwell upon the academic ones. We might pull our hair out wondering why a Bayesian model of infrastructure failure doesn’t converge. We might fight with data sets with many special cases or errors. We might even struggle to communicate to stakeholders why a particular data set offers no conclusive results on account of size or just bad assumptions.
But beneath all those struggles slithers a different beast: history. In general, we may all be trapped in history’s mighty jaws, but at least in the world of data science, there is hope of taming it.
Xylem Data Science (XDS)
At Xylem Data Science (XDS) we do a lot of work which is pure research, usually oriented towards answering a specific question in principle. For instance: we might want to reconstruct the mapping from transformer to meter using only hourly information about local voltage. Such a project involves a few phases, the first of which is fundamentally a research project: is this proposal even possible and what kind of performance can we expect on an idealized and then a representative data set?
It's tempting to think of the result of such an inquiry as an atemporal monolith: data goes in, passes through some modeling, a result comes out and is then characterized in various ways. Figures are generated and results are reported.
A parallel universe vs. real time
If data scientists could, like Doctor Who, retreat into a parallel universe where they had infinite time and resources to spend before delivering a result, this might be a reasonable conceptual model. But unfortunately, we work in real time, concomitantly with colleagues in marketing and product development. It’s often the case that the target of an R&D project is itself moving: initial analysis might, for instance, reveal that (in general) transformer mapping is difficult with real data, but that fixing small errors in a known transformer map is tractable.
Furthermore, data isn’t static either. You can certainly snapshot a dataset at a moment in time, but time moves on. New data might reveal problems in a line of research that no one anticipated. Changes in a software stack might reveal that old data wasn’t as clean as you thought it was.
One week a result might show significance and the next it might not. Or the results might be similar week to week but reflect a fundamentally different analysis. The way most people do research, it might be totally impossible to answer the dread question:
“Why did this graph look different 3 months ago, when we last got an update?” As data scientists, our responsibility is to the truth, before anything else. If we can’t trace our research to answer such questions, we can’t really be said to be doing our jobs.
At XDS we want to do our jobs, so we've developed a methodology that ensures our data science isn't just correct at a given moment in time, but that we can (ideally) recapitulate the entire history of any given research project quantitatively.
Software engineering + data science
How does it work? It's just the proper application of containerization, version control and dependency management. In essence we're adapting the tools of software engineering to the discipline of data science. A detailed breakdown will follow in Part 2. Stay tuned!
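In the meantime, here is a rough illustration of the idea. It is a minimal sketch, not the XDS methodology itself: it assumes the analysis lives in a git repository and that its inputs are ordinary files. The function names (sha256_of_file, current_git_commit, provenance_record) are hypothetical and chosen only for this example. The point is that every result can carry a small record of the code version, the exact data and the dependency versions that produced it.

```python
import hashlib
import platform
import subprocess
from importlib import metadata


def sha256_of_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_commit():
    """Return the current commit hash, or None if git is unavailable
    or the working directory is not inside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def provenance_record(data_paths, packages):
    """Describe the code, data and dependencies behind one result.

    Hypothetical helper: `data_paths` are the input files used and
    `packages` are the names of installed distributions to record.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "git_commit": current_git_commit(),
        "python_version": platform.python_version(),
        "data_sha256": {str(p): sha256_of_file(p) for p in data_paths},
        "package_versions": versions,
    }
```

Saving such a record (for example as JSON) next to each figure is one way to make the question "why did this graph look different 3 months ago?" answerable: compare the two records and see whether the code, the data or the software stack changed. Containerization would go further by pinning the whole environment rather than just listing versions.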
About the Author
Vincent Toups is a PhD Neuro/Physicist turned Data Scientist/Engineer interested in traceable, reliable data science, the foundations and history of physics, programming languages and game design. He is also a dad and an assistant beekeeper.