At Sensus and Xylem, our Data Science Team is responsible for engineering data stores to drive novel solutions and quality improvements for our products and services. The data science research results are ultimately used to better serve our customers and advance our industry.
In previous blogs, you’ve learned about how we approach data science and you’ve read about real-world examples of analyzing utility data. But what we haven’t really touched on yet is the data itself. Where do we get all this utility data to analyze, and how do we manage it?
Welcome to the Xylem data lake.
The purpose of this blog is to:
- Provide an explanation of a data lake.
- Describe the scale of the Xylem data lake, what it consists of and how it is used.
- Describe the technology used for our data lake.
These three topics will provide a background against which to understand future blog posts relating to the Xylem Data Lake.
Note: The term RNI is used throughout this blog post. RNI stands for Regional Network Interface. It is Sensus’ head-end enterprise software system used to relay messages to and from meters in the field.
What is a data lake?
If you google this question, you will find about 1,040,000,000 results. The definition we use comes from one of our partners, AWS:
“A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics – from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.” (Source: Data Lakes and Analytics on AWS)
Note: If you are wondering about the difference between a data lake and a data warehouse, please refer to the AWS link above, which does a good job of explaining the difference between the two.
Why would a company need a data lake anyway?
Well, for example, let’s say you want to analyze customer data – demographics, consumption amounts and times, billing rates, payment types, historical account information, weather data, etc. Chances are this data lives in a multitude of distributed systems. Some may even be in an Excel spreadsheet managed by Karen that lives on her hard drive, or on a legacy billing system that only Frank has login credentials for. Getting access to and pulling data from all these places can be difficult and painful, not to mention that it puts your analyses at risk of error.
The beauty of a data lake is that it is a central repository that pulls from all your systems, databases and files, yet still maintains the integrity of the original raw data. This way, you always know where your data came from. And as you learned in a previous blog, this is the secret to performing data science that is reproducible, traceable and reliable.
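To make the idea of "knowing where your data came from" concrete, here is a minimal sketch in Python. It is an illustration only, not how the Xylem data lake is implemented. The names `RawRecord`, `ingest` and `is_intact`, the source labels and the sample payloads are all hypothetical. The sketch stores each payload exactly as received, tags it with its source, and records a checksum so that later changes to the raw bytes can be detected.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RawRecord:
    source: str     # hypothetical label for the originating system
    payload: bytes  # the raw bytes, stored exactly as received
    sha256: str     # checksum taken at ingest time


def ingest(store: list, source: str, payload: bytes) -> RawRecord:
    """Append the payload as-is and record where it came from."""
    record = RawRecord(source, payload, hashlib.sha256(payload).hexdigest())
    store.append(record)
    return record


def is_intact(record: RawRecord) -> bool:
    """Re-hash the stored bytes and compare with the ingest-time checksum."""
    return hashlib.sha256(record.payload).hexdigest() == record.sha256


# Illustrative data only: two made-up sources feeding one central store.
lake = []
ingest(lake, "billing_system", b"account,amount\n1001,42.50\n")
ingest(lake, "spreadsheet", b"meter_id,reading\nA17,1532\n")

sources = [r.source for r in lake]
all_intact = all(is_intact(r) for r in lake)
```

Because every record carries its source and a checksum, an analysis built on it can be traced back to the exact raw input it started from.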
Back to the definition… Let’s break down what’s in a data lake.
The concept can be compared to a body of water, where water flows in, filling up a reservoir and then flows out.
- Both raw structured data and raw unstructured data are fed into the data lake via interfaces.
Structured data:
Information in rows and columns
Easily ordered and processed with data mining tools
Relational database extracts, CSV files, etc.
Unstructured data:
Raw, unorganized data
Images, video and audio
Social media tools
- The reservoir of water is a dataset where you run analytics on all the data.
The raw data is stored in its native format, within the data lake reservoir, in perpetuity. Additionally, we transform frequently accessed data into an actionable form. Going forward, we will refer to frequently accessed/transformed data as “hot data” and infrequently accessed data as “cold data.”
- The outflow of the data lake (because it’s not stagnant water!) is the analyzed data used by:
A) Applications, Data Scientists, Analysts and others accessing the hot data present in the data lake.
B) Big Data Engineers transforming cold data on an ad hoc basis for Data Scientists to utilize.
- Through the use of the data lake, the applications, Data Scientists, Analysts, etc. quickly gain key business insights by:
A) Sifting through a large amount of raw data in one centralized location.
B) Getting access to a democratized set of data.
C) Gaining information through matching up disparate IT systems and external data sources.
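The split between cold and hot data described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of our actual pipeline. The file names, the access counts, the `HOT_THRESHOLD` cutoff and the `to_rows` helper are all hypothetical. Raw files stay untouched in the cold tier, and only frequently accessed structured files are transformed into an analysis-ready form in the hot tier.

```python
import csv
import io

# Cold tier: raw files kept in their native format (made-up examples).
cold = {
    "meter_reads.csv": "meter_id,reading\nA17,1532\nB42,987\n",
    "field_notes.txt": "Meter A17 inspected; no faults found.\n",
}

# Hypothetical access counts; a real system would track these itself.
access_counts = {"meter_reads.csv": 120, "field_notes.txt": 2}
HOT_THRESHOLD = 50  # assumed cutoff for "frequently accessed"


def to_rows(raw_csv: str) -> list:
    """Transform raw CSV text into typed rows that are ready to analyze."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [
        {"meter_id": row["meter_id"], "reading": int(row["reading"])}
        for row in reader
    ]


# Hot tier: transformed copies of frequently accessed structured data.
hot = {
    name: to_rows(raw)
    for name, raw in cold.items()
    if name.endswith(".csv") and access_counts.get(name, 0) >= HOT_THRESHOLD
}

# Consumers query the hot tier; the raw originals remain in the cold tier.
total_reading = sum(row["reading"] for row in hot["meter_reads.csv"])
```

The point of the design is that the transformation never overwrites the raw file, so a hot data set can always be rebuilt from the cold tier or transformed differently later.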
Cool. So what’s in the Xylem data lake?
Sensus-hosted RNIs are the “tributaries” which feed into the meter deployment “river” that feeds into our data lake. Using the same visual, here’s a peek into what the Xylem data lake receives, stores and shares.
- We receive:
Structured data, such as:
Uplink and downlink message files exchanged between meters and our hosted systems
RNI database extracts
Sensus manufacturing data
Unstructured data, such as:
Spreadsheets containing field test information
Field data relating to malfunctioning meters
Text files from field applications and devices
- We store:
Cold data: About 27TB of compressed raw data, representing more than 8 million files.
Hot data: More than 120TB of active data sets from meter data.
- We share:
A web application for employees to look up meter data from the last four years.
Data sets for our Data Scientists and Analysts to “crunch.”
Complex ad hoc analyses our Big Data Engineers perform for colleagues who do not have a PhD in Statistics.
How do you keep this Great Lake running?
In 2014, the Xylem data lake was first engineered and hosted on-premises using a combination of popular technologies including Linux, Windows, and the Hortonworks Data Platform. As our business (and amount of data) grew, we decided to re-architect our data lake and migrate it to the Amazon Web Services (AWS) platform. We are very excited about the possibilities this gives us in terms of new functionality, tools, scalability and cost management.
We are also very excited to start contributing our lessons learned, experience and know-how on this blog. Subscribe and continue to check back to see new blogs on data science from Sensus.