Stop creating self-fulfilling prophecies: How to apply AI to small data problems

Over the past decade or so, the digital revolution has given us a surplus of data. This is exciting for a number of reasons, but mostly because of how AI can use that data to further revolutionize the enterprise.

However, in the world of B2B — the industry I’m deeply involved in — we are still experiencing a shortage of data, largely because the number of transactions is vastly lower than in B2C. So, for AI to deliver on its promise of revolutionizing the enterprise, it must be able to solve these small data problems as well. Thankfully, it can.

The problem is that many data scientists fall back on bad practices and create self-fulfilling prophecies, which reduce the effectiveness of AI in small data scenarios — and ultimately hinder AI’s influence in advancing the enterprise.

The trick to applying AI correctly to small data problems is in following correct data science practices and avoiding bad ones.

The term “self-fulfilling prophecy” is used in psychology, investing and elsewhere, but in the world of data science, it can simply be described as “predicting the obvious.” We see this when companies find a model that predicts what already works for them, sometimes even “by design,” and apply it to different scenarios.

For instance, a retail company determines that people who filled their cart online are more likely to purchase than people who didn’t, so they heavily market to that group. They are predicting the obvious!

Instead, they should apply models that help optimize what does not work well — converting first-time buyers who don’t already have items in their cart. By solving for the latter — or predicting the non-obvious — this retail company will be much more likely to impact sales and acquire new customers instead of just keeping the same ones.

To avoid the trap of creating self-fulfilling prophecies, here’s the process you should follow for applying AI to small data problems:

  1. Enrich your data: When you don’t have much existing data to work from, the first step is to enrich the data you already have by tapping into external data and applying look-alike modeling. We see this more than ever thanks to the rise of recommendation systems at Amazon, Netflix, Spotify and more. Even if you’ve made only one or two purchases on Amazon, it has so much information on products and the people who buy them that it can make fairly accurate predictions about your next purchase. If you’re a B2B company that uses a single dimension to categorize your deals (e.g., “large companies”), follow Pandora’s example and dissect each customer along the most detailed dimensions (in Pandora’s case: song title, artist, singer gender, melody construction, beat, etc.). The more you know about your data, the richer it gets, and you can go from low-dimensional data with trivial predictions to high-dimensional knowledge that powers strong prediction and recommendation models (see the look-alike sketch after this list).
  2. Model the future instead of the past: Here’s where we go back to basics a little. There are two ways to do data science: We use the empirical method when we have no idea what we are looking for and let the data tell us the story; we use the classic scientific method when we state a hypothesis and then build a test to prove or disprove it. The issue is that companies often rely on one or the other, when in reality you need a combination. If we rely on the empirical method alone and let the data tell us the story, we fall into the trap of creating self-fulfilling prophecies. We see this when working with a new product offering that has no historical data to guide us: You won’t be able to validate the offering for a new customer base until you design a test to do so. So, if you can state a hypothesis and combine it with a test that uses existing data or gathers more, you can look to the future far more accurately.
  3. Add meaning (semantics) to your data: Once you have your hypothesis in place, teach the system which data relates to what. When your sample size is small but many variables describe it, you can run into the issue of “slicing your data too thin.” Imagine analyzing an online shopper who bought diapers, bottles and nursery decor: Zoom in too closely on the individual items and you miss the pattern that this person probably has a baby. External knowledge and human expertise can help businesses achieve better results by applying semantic modeling or context around these variables, accelerating machine learning — especially when modeling a “small data” problem. The trick to getting this right is building a strong taxonomy (see the taxonomy sketch after this list). We work with one of the largest medical device companies out there, and with millions of SKUs in its catalog, it’s imperative that human experts develop the taxonomy that characterizes families of products in order to understand customer patterns and improve predictive modeling.
  4. Think “fast” and think “control”: Nearing the final steps, we go back to the data, because it’s ready to support the hypothesis and you’re ready to run your test. If possible, create your own lab environment where you can introduce variables and outcomes that haven’t been used in the past and quickly run multiple tests (A/B testing) to learn from. This approach works well in marketing campaigns, where you don’t need to wait until the end of a long sales cycle for feedback on lead conversion. And especially when past data is limited and you need to model a potential future outcome, designing a “control” is a critical step to finding the whole truth (see the control-group sketch after this list). Take the COVID-19 vaccine as an example: If we zoom in on the fact that some vaccinated people still get sick, the data tells us the solution is failing. But if we compare against a control group of unvaccinated people, zoom out and look at the difference in outcomes between the two groups, we see that the vaccine is working.
  5. Model to hit business metrics, not just past results: If you keep using what worked in the past to predict the future, the past is all you are going to get. Marketing may tell you that your model is producing amazing results generating new leads, but if you aren’t closing new deals, the model is still ineffective. With your richer data, hypothesis, semantics, control and trials in place, everything should now be measured against business results — that is, revenue (see the revenue-scoring sketch after this list). I work with some of the biggest B2B companies out there, one of which experienced tremendous growth during the pandemic. It was one of those “right place, right time” situations that took the company from a small startup to a household name. As it moves into a post-pandemic world, it can’t model a self-fulfilling prophecy, because its future is entirely different from its past. One thing it has done really well is stay hyperfocused on the bottom line rather than getting distracted by local, misleading metrics such as conversion rates or intermediate optimizations. The world is changing so much and so quickly that impact on the bottom line is the only metric you can really trust.
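To make step 1’s look-alike idea concrete, here’s a minimal Python sketch. The feature matrix, the attribute names and the scoring rule are all hypothetical; the point is simply that enriched, higher-dimensional profiles let you score a new prospect by its similarity to customers who already converted.

```python
# Minimal look-alike modeling sketch. All data below is invented for
# illustration: each customer row is enriched with external attributes
# (e.g., company-size bucket, industry signals, product-usage signals).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `a` and the vector `b`."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b)
    return a_norm @ b_norm

# Rows: known customers; columns: enriched attributes (hypothetical).
customers = np.array([
    [1.0, 0.0, 1.0, 0.2],
    [0.9, 0.1, 1.0, 0.3],
    [0.1, 1.0, 0.0, 0.9],
])
converted = np.array([1, 1, 0])  # which customers actually closed

# A new prospect described with the same enriched attributes.
prospect = np.array([0.95, 0.05, 1.0, 0.25])

scores = cosine_similarity(customers, prospect)
# Score the prospect by its average similarity to converted customers.
lookalike_score = scores[converted == 1].mean()
print(f"look-alike score vs. converted customers: {lookalike_score:.2f}")
```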
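Step 3’s taxonomy idea fits in a few lines as well. The SKUs and family names below are made up; the takeaway is that rolling detailed SKUs up to expert-defined product families turns three seemingly unrelated purchases into one strong signal.

```python
# Toy taxonomy sketch: roll SKUs up to product families so a small
# sample isn't sliced too thin to reveal a pattern. All values invented.
from collections import Counter

# Expert-built taxonomy: SKU -> product family.
TAXONOMY = {
    "SKU-1001": "baby-and-nursery",      # diapers
    "SKU-1002": "baby-and-nursery",      # bottles
    "SKU-1003": "baby-and-nursery",      # nursery decor
    "SKU-2001": "consumer-electronics",
}

purchases = ["SKU-1001", "SKU-1002", "SKU-1003"]

# At the SKU level every item appears once, so no pattern is visible.
print(Counter(purchases))  # each SKU counted once

# At the family level the same basket collapses into one clear signal.
print(Counter(TAXONOMY.get(sku, "unknown") for sku in purchases))
# Counter({'baby-and-nursery': 3}) -> this shopper probably has a baby
```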
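For step 4, a control group turns an anecdote (“some treated people still fail”) into a comparison. Here’s a minimal sketch of a standard two-proportion z-test with invented counts; in a real campaign you would also size the sample up front.

```python
# Two-proportion z-test sketch for comparing a test group against a
# control group. The conversion counts below are hypothetical.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """One-sided z-test: does group A convert better than control B?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, norm.sf(z)  # survival function = one-sided p-value

# Hypothetical campaign: new messaging (A) vs. untouched control (B).
z, p = two_proportion_ztest(conv_a=58, n_a=400, conv_b=39, n_b=400)
print(f"z = {z:.2f}, one-sided p = {p:.3f}")
# Only the control group lets you say the lift is real, not noise.
```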
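And for step 5, here’s a toy sketch of scoring a lead model against revenue rather than conversion alone. The leads, scores and deal sizes are invented; the pattern to notice is that a model can post a healthy conversion rate among its top-ranked leads while still leaving the biggest deal below the cutoff.

```python
# Evaluate a lead-scoring model on revenue, not just conversion.
# All scores, outcomes and deal sizes below are hypothetical.
leads = [
    # (model_score, converted, deal_revenue)
    (0.91, True,  120_000),
    (0.85, False,       0),
    (0.72, True,   15_000),
    (0.40, True,   90_000),
    (0.35, False,       0),
]

top_k = 3  # capacity constraint: sales can only work the top 3 leads
ranked = sorted(leads, key=lambda lead: lead[0], reverse=True)
worked = ranked[:top_k]

conversion_rate = sum(1 for _, conv, _ in worked if conv) / top_k
revenue = sum(rev for _, _, rev in worked)

print(f"conversion among worked leads: {conversion_rate:.0%}")  # 67%
print(f"revenue from worked leads:    ${revenue:,}")            # $135,000
# The $90k deal sits below the cutoff -- revenue, not the conversion
# rate, is the metric that exposes what the model is leaving behind.
```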

It’s easy to make the excuse that without enough data, AI will never be an option. But as discussed above, the trick to applying AI correctly to small data problems is in following correct data science practices and avoiding bad ones — like creating self-fulfilling prophecies.

So whether your data is limited because you are a B2B company or you are launching a brand new product, AI can still be a valuable asset.

When AI starts to correctly solve both the small and large data problems, that’s when it will deliver us into the next generation of science and technology.