Five building blocks of a data-driven culture

Carl Anderson Contributor

Carl Anderson is the author of Creating a Data-Driven Organization. He previously headed up data, analytics and data science at Warby Parker and WeWork, and is currently a member of WeWork’s Product Research Department.

Michael Li Contributor

Tianhui Michael Li is the founder of The Data Incubator, an eight-week fellowship to help Ph.D.s and postdocs transition from academia into industry. It was acquired by Pragmatic Institute. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw.

Single source of truth

A single source of truth is a central, controlled and “blessed” source of data from which the whole company can draw. It is the master data. When you don’t have such data and staff can pull down seemingly the same metrics from different systems, inevitably those systems will produce different numbers. Then the arguments ensue. You get into a he-said-she-said scenario, each player drawing and defending their position with their version of the “truth.” Or (and more pernicious), some teams may unknowingly use stale, low-quality or otherwise incorrect data or metrics and make bad decisions, when they could have used a better source.

When you have a single source of truth, you provide superior value to the end user: the analysts and other decision makers. They’ll spend less time hunting for data across the organization and more time using it. Additionally, the data sources are more likely to be organized, documented and joined. Thus, by providing a richer context about the entities of interest, the users are better positioned to leverage the data and find actionable insights.

From the data administrator’s side, a single source of truth is preferable, as well. It is easier to document, prevent name collisions across tables, run data quality checks and ensure that the underlying IDs are consistent across the tables. It also is easier to provide flattened, easier-to-work-with views of the key relations and entities that, under the hood, may have come from different sources.

For instance, at WeWork, a global provider of co-working spaces, we provide our analytics users with a core table called the “activity stream,” a single narrow table that provides web page views, office reservations, tour bookings, payments, Zendesk tickets, key card swipes and more. The table is easy for users to work with, such as slicing and dicing different segments of our members or locations, even though the underlying data comes from many heterogeneous systems. Moreover, having this centralized, relatively holistic view of the business means that we also can build more automated tools on top of those data to look for patterns in large numbers of different segments.
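To make the idea of a single narrow table concrete, here is a minimal sketch in Python. It is not WeWork's actual schema: the `Activity` fields, the two source row shapes and the adapter functions are all hypothetical, chosen only to show how differently shaped records can be mapped into one common structure that is easy to slice.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Activity:
    """One row of a hypothetical narrow activity table."""
    occurred_at: datetime
    member_id: str
    location: str
    activity_type: str   # e.g. "page_view", "payment"
    source_system: str


# Rows as they might arrive from two unrelated systems (invented shapes).
web_rows = [
    {"ts": "2016-05-02T09:15:00", "user": "m1", "site": "NYC-Chelsea"},
]
payment_rows = [
    {"paid_on": "2016-05-03", "member": "m1", "building": "NYC-Chelsea"},
    {"paid_on": "2016-05-04", "member": "m2", "building": "SF-Mission"},
]


def from_web(row):
    """Map a web analytics row onto the common Activity shape."""
    return Activity(datetime.fromisoformat(row["ts"]), row["user"],
                    row["site"], "page_view", "web")


def from_payment(row):
    """Map a billing row onto the common Activity shape."""
    return Activity(datetime.fromisoformat(row["paid_on"]), row["member"],
                    row["building"], "payment", "billing")


# Combine both sources into one chronologically ordered stream.
activity_stream = sorted(
    [from_web(r) for r in web_rows] + [from_payment(r) for r in payment_rows],
    key=lambda a: a.occurred_at,
)

# Slicing by any shared column no longer depends on the source system.
activity_by_location = Counter(a.location for a in activity_stream)
print(activity_by_location)
```

Once every source is mapped to the same few columns, segmenting by member, location or activity type is a one-line operation, whichever system the row came from.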

In large organizations, there are often historical reasons why data are siloed. For example, large organizations are more likely to acquire data systems through company acquisitions, thereby resulting in additional independent systems. Thus, a single source of truth can represent a large and complex investment. However, in the interim, the central data team or office can still make a big difference by providing official guideposts: listing what’s available, where it is and, where there are multiple sources, the best place to get it. Everyone needs to know: “if you need customer orders, use system X or database table Y” and nowhere else.

Data dictionary

Knowing where to get the data, and providing quality data, is only one ingredient. Users need to know what the data fields and metrics mean. You need a data dictionary. This is an aspect that trips up many organizations. When you don’t have a clear list of metrics and their definitions, people make assumptions, and those assumptions may differ from their colleagues’. Then the arguments ensue.

A business needs to generate a glossary with clear, unambiguous and agreed-upon definitions. This requires discussion with all the key stakeholders and business domain experts. First, you need buy-in to those official definitions; you don’t want teams going rogue with their secret version of a metric. Second, it is often not the core definition where people’s understandings differ but how to handle the edge cases. Thus, while everyone might have a common understanding of what an “orders placed” metric means, they may differ in how they want or expect to handle cancellations, split orders or fraud.

Those scenarios need to be laid out, discussed and resolved. A goal here is to collapse multiple similar metrics into a single common metric, or flesh out situations where you genuinely need to split one metric into two or more separate metrics to capture different perspectives.

For instance, at WeWork, prospective members check out our facilities by signing up for a tour. Importantly, some people may tour different locations, or come back for a second tour to show other members of the organization before signing off on their new space. While our various dashboards had a metric called “tours,” they didn’t align across teams. The process of creating a data dictionary fleshed out two different metrics:

“Tours completed-Volume” captures the absolute number of tours taken, which our Community team, who staff such tours, monitor.
“Tours completed-People” captures the unique number of people who signed up for a tour. This can then feed into a lead conversion metric, which our sales and marketing teams track.
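The difference between the two metrics can be sketched in a few lines of Python. The tour records below are invented for illustration, and the sketch assumes that “Tours completed-People” counts each person with at least one completed tour once; the exact rule is something the data dictionary itself would need to pin down.

```python
# Hypothetical tour records: (person_id, location, completed).
# The field layout and values are invented for illustration only.
tours = [
    ("p1", "NYC-Chelsea", True),
    ("p1", "NYC-SoHo", True),      # same person touring a second location
    ("p2", "NYC-Chelsea", True),
    ("p2", "NYC-Chelsea", True),   # return visit with colleagues
    ("p3", "SF-Mission", False),   # booked but not completed
]

completed = [t for t in tours if t[2]]

# "Tours completed-Volume": every completed tour counts.
tours_completed_volume = len(completed)

# "Tours completed-People": each person counts once, however many tours.
tours_completed_people = len({person_id for person_id, _, _ in completed})

print(tours_completed_volume)  # 4
print(tours_completed_people)  # 2
```

The same raw data yields 4 or 2 depending on the definition, which is exactly the kind of gap that leaves dashboards disagreeing when they share the label “tours.”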

Specificity in well-chosen names, and unambiguous definitions with examples, are key here. It is better to err toward longer but descriptive names, such as “non_cancelled_orders,” or “Tours Created To Tours Completed Conversion %” than shorter names that users think they understand.

Broad data access

Having clean, high-quality data, from a central source, and with clear metadata, is ineffective if staff can’t access it. Data-driven organizations tend to be very inclusive and provide access wherever the data can help. This doesn’t mean handing over the keys to all the data to all the staff — the CIO would never sign off on that! Instead, it means assessing the needs of individuals, not just the analysts and key decision makers, but across the whole organization, out to the front-line of operations.

For instance, at Warby Parker, a retailer of prescription glasses and sunglasses, associates on the retail shop floor have access to a dashboard that provides details on their performance, as well as that of the store as a whole. At Sprig, a food-delivery company from San Francisco, even the chef has access to an analytics platform that they use to analyze the meals that have been ordered and understand which ingredients and flavors are popular or have not fared well, and so tailor the menu.

A large Fortune 100 financial conglomerate that hires data scientists from The Data Incubator’s fellowship is able to maintain a competitive edge in hiring compared to “sexy” Silicon Valley companies like Google, Facebook and Uber, partially through granting broad access to data for their data science team. And the access doesn’t just stop at data scientists: one of the products our alumni have worked on is a set of summary dashboards that automatically give customer service reps a visualization of the interaction history of the customer on the phone.

It is those front-line staff — the customer service agent dealing with an angry customer, or a warehouse worker facing a pallet of damaged product — who can leverage data immediately to determine best next steps. If suitably empowered, they are often also in the best position to resolve a situation, determine changes to workflow or handle a customer complaint.

Data-driven organizations need to foster a culture whereby individuals know what data are available (a good data dictionary, and generally seeing data being used in day-to-day decision making, helps) and, further, feel comfortable requesting access if they have a genuine use case. Red tape should be cut so that, while there is still an appropriate approval process and oversight, and systems in place such that access can easily be revoked if necessary, the staff get access without too many hoops to jump through and without too many delays.

Finally, with broader access, and more users of analytical tools, the organization will need to commit to providing training and support. At WeWork, while our data team are available through Slack, email and service desk tickets, we also provide weekly office hours to help users with our business intelligence tools, SQL queries and any other aspects about the data.

Data literacy

In a data-driven organization with broad data access, staff will frequently encounter reports, dashboards and analyses, and they may have a chance to analyze data themselves. To do so effectively, they must be sufficiently data literate.

Data literacy is often a multi-pronged effort. (For an excellent and accessible overview of this topic see this article by Brent Dykes.) At The Data Incubator, we engage clients with employees at a range of different skill levels that require a tailored approach.

One of the most exciting areas is data science training. This covers an introduction to the more advanced and computational data mining and machine learning approaches to extract insights from data, as well as create data products such as recommendation engines and other predictive models. This tends to be focused at the top of the skills pyramid for more advanced users to up their game a notch. One of the quickest data wins for many of our clients simply comes from taking people who are halfway to becoming data scientists and training them in the half they lack.

For example, pharmaceutical and finance clients tend to have legacy statisticians who are well-versed in the statistical aspects of data science but are weaker on the computational front. Many technology companies have an abundance of programming talent that lacks statistical rigor. Training statisticians on programming and programmers on statistics is a great “quick win” that can be extended more broadly.

For those who don’t have such skills, there are plenty of opportunities to increase data literacy across the board. Enterprises have begun to view data literacy training as necessary for everyone, and we’ve seen the demand for “introductory data science for managers” courses double in the last 12 months. The lowest and simplest level is to enhance basic skills in descriptive statistics. These are the basic ways of summarizing data (mean, percentiles, range, standard deviation, etc.), along with an understanding of when they are or are not appropriate given the shape of the underlying data.

For instance, when data are highly skewed, as in house prices or income, the median is the appropriate metric with which to summarize the data, not the mean. Just training people to make fewer assumptions, to plot and examine the data and to use appropriate summary metrics would be a big win.
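A small worked example shows how far apart the two summaries can be. The incomes below are made up for illustration; only Python's standard `statistics` module is used.

```python
from statistics import mean, median

# Made-up annual incomes; one very high earner skews the distribution.
incomes = [32_000, 38_000, 41_000, 45_000, 52_000, 58_000, 1_200_000]

income_mean = mean(incomes)
income_median = median(incomes)

# How many people earn less than the "average" income?
below_mean = sum(1 for x in incomes if x < income_mean)

print(round(income_mean))   # 209429
print(income_median)        # 45000
print(below_mean)           # 6
```

Six of the seven people earn less than the mean, so the mean describes almost nobody in this group, while the median sits in the middle of what people actually earn.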

Another win can come from data visualization skills. Too often, charts are full of chart junk, that is, unnecessary clutter and annotations that detract from the key point. Or, inappropriate chart types are used — such as multiple pie charts each with a large number of segments — or, a color scheme is chosen that makes it near impossible to interpret.

It is a tragedy to spend a huge amount of effort on data collection and analysis, only to fail, and lessen the data’s impact, at the finish line. Just a small amount of data visualization training goes a long way, and can greatly enhance people’s presentation skills and make insights clearer, more digestible and ultimately likely to be used.

At the next level of complexity is inferential statistics. These are the standard, objective statistical tests used to detect, for instance, whether a trend or difference in website traffic between weeks is likely real or whether it is just random variation. The purpose here is not to enable a manager or customer service agent to perform these tests but, instead, to make them aware of how statistics can be of use, to help them understand correlation versus causation and to appreciate that forecasts always come with uncertainty. For the decision makers and managers, this also can arm them with the skills to push back on shoddy work or on conclusions that the data don’t support.
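For readers curious what such a test looks like, here is a minimal sketch of one option, a permutation test, applied to invented daily visit counts. It is only one of many possible tests, and it assumes the days are interchangeable, which real traffic with weekday and weekend patterns would violate; the function name and data are hypothetical.

```python
import random


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign days to the two "weeks" at random.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Invented daily visit counts for three weeks.
week_1 = [1020, 980, 1005, 995, 1010, 990, 1000]
week_2 = [1008, 992, 1015, 985, 1003, 999, 1012]    # about the same
week_3 = [1100, 1085, 1110, 1095, 1090, 1105, 1120]  # clearly higher

p_similar = permutation_p_value(week_1, week_2)
p_shifted = permutation_p_value(week_1, week_3)

print(p_similar)  # large: consistent with random variation
print(p_shifted)  # small: unlikely to be chance alone
```

The p-value estimates how often a difference at least this large would appear if the week labels were assigned at random. A large value says the difference could easily be noise; a small one says chance alone is an unlikely explanation.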

Decision making

Data can only make an impact if it is actually incorporated in the decision-making process. An organization can have quality, timely and relevant data and skilled analysts who generate masterful reports with carefully crafted and presented insights and recommendations. However, if that report sits unopened on a desk, or unread in an inbox, or the decision maker has already made up their mind about what action to take, regardless of what the data shows, then it is all for naught.

HiPPO, “highest paid person’s opinion,” a term coined by Avinash Kaushik, is the antithesis of data-drivenness. You all know them. They’re the expert with decades of experience. They don’t care what the data says, especially when it disagrees with their preconceived notions, and they are going to stick to their plan because they know best. And, besides, they’re the boss. As the Financial Times explains:

HiPPOs can be deadly for businesses, because they base their decisions on ill-understood metrics at best, or on pure guesswork. With no intelligent tools to derive meaning from the full spectrum of customer interactions and evaluate the how, when, where and why behind actions, the HiPPO approach can be crippling for businesses.

Too often organizations have a prevailing culture where intuition is valued or there is a lack of accountability. In one survey, just 19 percent of respondents said that decision makers are held accountable for their decisions in their organization. It is in such habitats where HiPPOs thrive.

One way to counteract the HiPPOs is to cultivate a culture of objective experimentation, such as A/B testing. In those scenarios, whether it be a change to website design or marketing messaging, you control for as much as possible, determine the success metrics and required sample sizes, change that one thing and let the experiment run. The key here is to have a clear analysis plan and set out the success metric and any predictions before the experiments run. In other words, the plan prevents HiPPOs from cherry-picking results after the fact. The same is true of any pilot program.
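As an illustration of fixing the sample size before the experiment runs, the sketch below uses a common textbook approximation for comparing two conversion rates (normal approximation, unpooled variance, equal group sizes). The function name, the baseline rates and the defaults of 5 percent significance and 80 percent power are assumptions for the example, not a prescription.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect the given lift."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical plan: baseline conversion of 10 percent.
n_large_lift = sample_size_per_variant(0.10, 0.12)  # hoping for 12 percent
n_small_lift = sample_size_per_variant(0.10, 0.11)  # hoping for 11 percent

print(n_large_lift)
print(n_small_lift)
```

Under these assumptions, detecting a lift from 10 to 12 percent needs a little under 4,000 visitors per variant, and halving the expected lift raises the requirement several times over. Writing that number into the analysis plan up front removes the temptation to stop the test the moment the results look favorable.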

Part of the value of broad data literacy training is to allay fears from the perceived threat of big data. Data is not there to bolster (or undermine) existing decisions, but to help inform future ones. It does not threaten the manager’s job — but ignoring it might. By demystifying how data works, data science trainings can increase manager confidence in data and increase data-driven decision making in an enterprise.

Conclusion

Through work at both employers and clients, we’ve come to learn that data-driven culture does not come overnight, but is part of a multi-step process. The first requirement is a clean, single source of data from which analyses can flow. Second, data analysts and data scientists then need to agree on the data dictionary and what the data means. Next, not just data scientists but the entire organization needs to be given broad access to this data to enable the application of collective business expertise in analyzing data. With access to data must come good training to help reinforce data literacy. Finally, all those amazing data analyses must be put into the hands of data-trusting managers to affect decision making.