Three Marks Of Real Data Science

Michael Howard Contributor

Editor’s note: Michael Howard is CEO of C9, a provider of predictive sales and marketing applications that enable companies to increase revenue, generate more precise forecasts, and mitigate pipeline risk.

For a venture investor looking to back data-driven companies, telling real data science apart from pseudo science is difficult, but it’s critically important. Investing in a data science application company won’t deliver long-term returns if the company uses quackery.

Real data science involves using complex algorithms to collect masses of data, analyze all of it, and convert it into real answers. There’s a lot of pseudo data science being peddled by software companies that claim they can turn data into gold. While these alchemists dangle buzzwords like “big data” and “machine learning,” they really aren’t doing any data science; they’re just querying subsets of data to deliver limited findings.

So what is real data science and how can you spot software companies that are really using it to deliver meaningful business insights? Here are three ways to tell if the software company you’re considering funding uses real data science.

Look for Algorithms, Not Queries

The first level of distinction between real and pseudo data science involves the difference between a query and an algorithm. Data science uses algorithms to collect and analyze thousands, millions, or even billions of rows and columns, automatically discovering new relationships in the data. These algorithms then learn and adapt, becoming more accurate over time at spotting current and future trends, a process otherwise known as “machine learning.”

Using real data science, a predictive analytics solution might discover, on its own, the top reasons why sales deals go south; a BI tool might take the information the algorithm generates and create a rendered report or workbook showing action-oriented tasks. Algorithms continually adapt and change as they process new data and glean new learnings.

On the other hand, queries are simply one-time questions, and they never learn from their own results. A database query, however complex, is not data science. A query asks a single question such as “sum sales by territory” but doesn’t provide any actionable insights.
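To make the contrast concrete, here is a minimal Python sketch, not a description of any vendor’s product. The deal records, the `OnlineWinModel` class, and the choice of a one-feature logistic model are all assumptions made for illustration. The query returns the same kind of answer every time it runs; the model shifts its estimate as it sees more closed deals.

```python
import math
from collections import defaultdict

# Hypothetical closed deals: (territory, amount, days_open, won)
deals = [
    ("east", 100.0, 10, 1),
    ("east", 250.0, 95, 0),
    ("west", 80.0, 20, 1),
    ("west", 300.0, 120, 0),
]


def sum_sales_by_territory(rows):
    """A query: one fixed question, answered the same way on every run."""
    totals = defaultdict(float)
    for territory, amount, _days_open, won in rows:
        if won:
            totals[territory] += amount
    return dict(totals)


class OnlineWinModel:
    """A toy logistic model that is updated one deal at a time."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, days_open):
        """Estimated probability that a deal open this many days is won."""
        x = days_open / 100.0  # crude scaling so the updates stay small
        return 1.0 / (1.0 + math.exp(-(self.weight * x + self.bias)))

    def update(self, days_open, won):
        """One stochastic gradient step on the log loss for a single deal."""
        x = days_open / 100.0
        error = self.predict(days_open) - won
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


totals = sum_sales_by_territory(deals)

model = OnlineWinModel()
before = model.predict(120)  # an untrained model has no opinion yet
for _ in range(200):
    for _territory, _amount, days_open, won in deals:
        model.update(days_open, won)
after_fresh = model.predict(10)
after_stale = model.predict(120)
```

Running the query again tomorrow on the same rows gives the same totals. Feeding the model more deals changes what it predicts, which is the behavior the word “learning” is meant to describe.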

Look for a Rich Data Model

The second level of distinction between real and pseudo data science concerns the notion of “model richness” used to create and understand predictive models. To understand this distinction, let’s review what a “predictive model” is. In the case of predicting whether a sales deal will close or not, for example, a predictive model needs data from which to construct a complex model and then make a prediction.

This data can come from CRM applications like Salesforce or Microsoft Dynamics, which were designed, up to a point, to capture the dynamic and complex nature of a sales process. Imagine the tortuous steps it takes to sell an American-made plane to an airline in a foreign country. It’s tough, complicated, and takes a long time. But here’s the catch. Salesforce, for example, doesn’t keep a history of a deal for more than 90 days. So in order to understand everything that went into a deal, you have to use data science to gather various data from myriad sources to create a “rich model.”

Taking the sales data scenario a step further, the fake data scientists don’t have rich data sets, so they provide only a slim view of a deal – basically, whether a deal was “won” or “lost.” In contrast, real data science would use a rich data model based on a comprehensive, relevant data set to provide accurate, actionable and highly valuable insights. Data science uses “temporal” technology to analyze every aspect of the sales process, and in so doing, can characterize a winning versus losing sales cycle.

Rich data sets are hard to come by. Only a few predictive analytics companies have access to data sets large enough for scientists to score them and provide not just answers, but accurate predictions about potential future outcomes.

Don’t Fall for “Signal Spin”

Many software companies try to get around having rich data sets by claiming they use more signals than anybody else. A signal is a single data point, for example from a government database on education. Some companies engage in “signal spin,” counting every column of every data source they could potentially use (but usually don’t) to inflate their signal count. To distinguish the heavyweights from the lightweights, you need a data scientist to dig deeper and start reviewing confusion matrices and F1 scores.
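For readers who haven’t met these terms, a confusion matrix counts how often a model’s predictions match reality, and the F1 score is the harmonic mean of precision and recall. The sketch below computes both for a made-up set of eight deals; the labels and the function names are illustrative assumptions, not data from any real company.

```python
def confusion_counts(actual, predicted):
    """Return (true_pos, false_pos, false_neg, true_neg) for binary labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1  # predicted a win, deal was won
        elif p and not a:
            fp += 1  # predicted a win, deal was lost
        elif a and not p:
            fn += 1  # predicted a loss, deal was won
        else:
            tn += 1  # predicted a loss, deal was lost
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical outcomes: 1 = deal won, 0 = deal lost
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted)
score = f1_score(tp, fp, fn)
```

Note that neither number depends on how many signals the vendor claims to use. Both depend only on whether the predictions turned out to be right.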

A simpler way to tell when a company is engaging in signal spin is by asking a single question: What are the top 10 attributes considered in your algorithm? Lightweights can’t typically get past No. 3, let alone 4,000. The bottom line: Don’t let the quantity of signals mask the quality required to develop accurate and rich models. In most cases, the signals cited aren’t even used.
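A company that really uses its signals should be able to produce that list on demand. As a rough illustration, assume a model that assigns one numeric weight per attribute, as a linear model does. The attribute names and weights below are invented for the example. Ranking by magnitude and dropping zero weights shows how few of the advertised signals may be doing any work.

```python
def top_attributes(weights, n=10):
    """Return up to n (name, weight) pairs with non-zero weight,
    ordered from largest to smallest absolute weight."""
    used = [(name, w) for name, w in weights.items() if w != 0.0]
    used.sort(key=lambda item: abs(item[1]), reverse=True)
    return used[:n]


# Hypothetical model weights; a weight of 0.0 means the signal is ignored
weights = {
    "days_since_last_activity": -1.8,
    "stage_changes": 0.9,
    "deal_amount": 0.4,
    "zip_code_median_income": 0.0,
    "school_district_rank": 0.0,
}

ranked = top_attributes(weights)
advertised = len(weights)  # what a "signal spin" pitch would count
in_use = len(ranked)       # what the model actually relies on
```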

Telling real data science apart from pseudo science isn’t easy. But if you’re considering investing in a software company that claims to use real data science, just ask them these three crucial questions to find out if they’re real scientists or quacks. Pseudo science, like alchemy, may look good on the surface, but dig a bit deeper, and its fake claims are simply too good to be true. Don’t get stuck with lead and alchemy when the only thing that will deliver long-term financial gains is real data science.