The ability to understand and communicate about data is an increasingly important skill for the 21st-century citizen, for three reasons. First, data science and AI are affecting many industries globally, from healthcare and government to agriculture and finance. Second, much of the news is reported through the lenses of data and predictive models. And third, so much of our personal data is being used to define how we interact with the world.
When so much data informs decisions across so many industries, you need a basic understanding of the data ecosystem to be part of the conversation. On top of this, the industry you work in will more likely than not feel the impact of data analytics. Even if you don’t work directly with data yourself, this form of literacy will allow you to ask the right questions and contribute to discussions at work.
To take just one striking example, imagine if there had been a discussion around how to interpret probabilistic models in the run-up to the 2016 U.S. presidential election. FiveThirtyEight, the data journalism publication, gave Clinton a 71.4% chance of winning and Trump a 28.6% chance. As Allen Downey, Professor of Computer Science at Olin College, points out, fewer people would have been shocked by the result had they been reminded that a Trump win, according to FiveThirtyEight’s model, was a bit more likely than flipping two coins and getting two heads, which is hardly impossible to imagine.
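The coin comparison is simple arithmetic, and it can be checked in a few lines. The sketch below assumes fair, independent coin flips; the variable names are illustrative, and the 28.6% figure is the forecast quoted above.

```python
from fractions import Fraction

# Probability of two heads in two independent flips of a fair coin (assumption).
p_two_heads = Fraction(1, 2) ** 2

# FiveThirtyEight's forecast probability for Trump, as quoted in the text.
p_trump = Fraction(286, 1000)

print(f"Two heads in two flips: {float(p_two_heads):.1%}")
print(f"Forecast for Trump:     {float(p_trump):.1%}")
print("Forecast outcome more likely than two heads:", p_trump > p_two_heads)
```

Two heads come up a quarter of the time, so an event with a 28.6% chance should surprise us slightly less than that.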
What we talk about when we talk about data
The data-related concepts non-technical people need to understand fall into five buckets: (i) data generation, collection and storage, (ii) what data looks and feels like to data scientists and analysts, (iii) statistics intuition and common statistical pitfalls, (iv) model building, machine learning and AI, and (v) the ethics of data, big and small.
The first four buckets roughly correspond to key steps in the data science hierarchy of needs, as recently proposed by Monica Rogati. Although it has not yet been formally incorporated into data science workflows, I have added data ethics as the fifth key concept because ethics needs to be part of any conversation about data. So many people’s lives, after all, are increasingly affected by the data they produce and the algorithms that use that data. This article will focus on the first two; I’ll leave the other three for a future article.
How data is generated, collected and stored
Every time you engage with the Internet, whether via web browser or mobile app, your activity is detected and most often stored. To get a feel for some of what your basic web browser can detect, check out Clickclickclick.click, a project that opens a window into the extent of passive data collection online. If you are more adventurous, you can install Data Selfie, which “collect[s] the same information you provide to Facebook, while still respecting your privacy.”
Data collection isn’t limited to laptop, smartphone and tablet interactions. It extends to the far wider Internet of Things (IoT), a catch-all for traditionally dumb objects, such as radios and lights, that can be made smart by connecting them to the Internet, along with other data-collecting devices, such as fitness trackers, Amazon Echo and self-driving cars.
All the collected data is stored in what we colloquially refer to as “the cloud,” and it’s important to clarify what’s meant by this term. Data in cloud storage exists in physical space, just like data on a computer or an external hard drive. The difference for the user is that the space it exists in is elsewhere, generally on server farms and data centers owned and operated by multinationals, and you usually access it over the Internet. Cloud storage comes in two types, public and private. Public cloud providers such as Amazon, Microsoft and Google are responsible for data management and maintenance, whereas the responsibility for data in a private cloud remains with the company that owns it. Facebook, for example, has its own private cloud.
It is essential to recognize that cloud services store data in physical space, and that the data may be subject to the laws of the country where it is located. This year’s General Data Protection Regulation (GDPR) in the EU sets rules for user data privacy and for consent around the use of personal data. Another pressing question is security, and we need to have a more public and comprehensible conversation around data security in the cloud.
The feel of data
Data scientists mostly encounter data in one of three forms: (i) tabular data (that is, data in a table, like a spreadsheet), (ii) image data or (iii) unstructured data, such as natural language text or HTML code, which makes up the majority of the world’s data.
Tabular data. The most common type for a data scientist to use is tabular data, which is analogous to a spreadsheet. In Robert Chang’s article on “Using Machine Learning to Predict Value of Homes On Airbnb,” he shows a sample of the data, which appears in a table in which each row is a particular property and each column a particular feature of properties, such as host city, average nightly price and 1-year revenue. (Note that data rarely arrives from users already in tabular form; data engineering is an essential step to make data ready for such an analysis.)
Such data is used to train, or teach, machine learning models to predict the lifetime value (LTV) of properties, that is, how much revenue they will bring in over the course of the relationship.
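To make the row-and-column idea concrete, here is a minimal sketch of a table held in plain Python. The column names echo the features mentioned above, but the cities and numbers are invented for illustration and are not taken from the Airbnb data; the `column` helper is likewise hypothetical.

```python
# Illustrative values only; these rows are NOT from the Airbnb data set.
# Each dict is one row (a property); each key is one column (a feature).
listings = [
    {"host_city": "Lisbon", "avg_nightly_price": 80.0, "one_year_revenue": 14000.0},
    {"host_city": "Austin", "avg_nightly_price": 120.0, "one_year_revenue": 21000.0},
    {"host_city": "Kyoto", "avg_nightly_price": 100.0, "one_year_revenue": 18000.0},
]

# The column names are shared by every row.
columns = list(listings[0].keys())


def column(rows, name):
    """Return all values of one feature, i.e. one column of the table."""
    return [row[name] for row in rows]


# A simple column-wise summary: the mean nightly price across properties.
prices = column(listings, "avg_nightly_price")
mean_price = sum(prices) / len(prices)

print(columns)
print(mean_price)
```

In practice a data scientist would more likely load such a table with a dedicated library, but the structure is the same: rows are observations and columns are features.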
Image data. Image data is data that consists of, well, images. Many of the successes of deep learning have occurred in the realm of image classification. The ability to diagnose disease from imaging data, such as diagnosing cancerous tissue from combined PET and CT scans, and the ability of self-driving cars to detect and classify objects in their field of vision are two of many use cases of image data. To work with image data, a data scientist will convert an image into a grid (or matrix) of red-green-blue (RGB) pixel values and use these matrices of numbers as inputs to their predictive models.
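As a rough sketch of what "a grid of pixel values" means, the snippet below builds a made-up 2-by-2 image by hand. It assumes the common convention of three color channels per pixel, each an integer from 0 to 255; scaling to the range 0 to 1 and flattening into one list is just one possible way to prepare model inputs.

```python
# A tiny, hand-made 2x2 "image" (illustrative, not real data).
# Each pixel is a (red, green, blue) triple with values from 0 to 255.
image = [
    [(255, 0, 0), (0, 255, 0)],      # top row: a red pixel, a green pixel
    [(0, 0, 255), (255, 255, 255)],  # bottom row: a blue pixel, a white pixel
]

height = len(image)
width = len(image[0])

# Flatten the grid into one list of numbers scaled to the range 0..1.
features = [
    channel / 255
    for row in image
    for pixel in row
    for channel in pixel
]

print(height, width, len(features))
```

A real photograph works the same way, only with millions of pixels instead of four.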
Unstructured data. Unstructured data is, as one might guess, data that isn’t organized in either of the above manners. Part of the data scientist’s job is to structure such unstructured data so it may be analyzed. Natural language, or text, provides the clearest example. One common method of turning textual data into structured data is to represent it as word counts, so that “the cat chased the mouse” becomes “(cat,1),(chased,1),(mouse,1),(the,2)”. This is called a bag-of-words model, and allows us to compare texts, to compute distances between them, and to combine them into clusters. Bag-of-words performs surprisingly well for many practical applications, especially considering that it doesn’t distinguish “build bridges not walls” from “build walls not bridges.” Part of the game here is to turn textual data into numbers that we can feed into predictive models, and the principle is very similar between bag-of-words and more sophisticated methods. Such methods allow for sentiment analysis (“is a text positive, negative or neutral?”) and text classification (“is a given article news, entertainment or sport?”), among many others. For a recent example of text classification, check out Cloudera Fast Forward Labs’ prototype Newsie.
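The bag-of-words idea fits in a few lines of Python. This is a minimal sketch that assumes words are separated by spaces and that case does not matter; the `bag_of_words` and `distance` helpers are illustrative names, and the distance shown (the total difference in word counts) is only one of many possible choices.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring case and word order."""
    return Counter(text.lower().split())


def distance(bag_a, bag_b):
    """Sum of absolute differences in word counts between two texts."""
    vocabulary = set(bag_a) | set(bag_b)
    return sum(abs(bag_a[word] - bag_b[word]) for word in vocabulary)


cat_text = bag_of_words("the cat chased the mouse")
print(sorted(cat_text.items()))

# Word order is lost, so these two sentences get identical representations.
bridges = bag_of_words("build bridges not walls")
walls = bag_of_words("build walls not bridges")
print(bridges == walls)

# Texts that share most of their words end up close together.
dog_text = bag_of_words("the dog chased the cat")
print(distance(cat_text, dog_text))
```

The second comparison shows the limitation noted above: because only counts are kept, the two "bridges and walls" sentences are indistinguishable.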
These are just two of the five steps to working with data, but they’re essential starting points for data literacy. When you’re dealing with data, think about how the data was collected and what kind of data it is. That will help you understand its meaning, how much to trust it, and how much work needs to be done to convert it into a useful form.