Where did IBM go wrong with Watson Health?

Simpler times.
Image: Reuters/Stephane de Sakutin/Pool

IBM spent more than a decade trying to make a go of Watson Health, its moonshot to apply artificial intelligence in healthcare. Watson was supposed to revolutionize everything from diagnosing patients and recommending treatment options to finding candidates for clinical trials.

Now, after billions invested and a series of high-profile setbacks, the company is effectively selling Watson for parts to private equity firm Francisco Partners.

Where did Watson Health miss the mark?

Watson is not alone in its struggles. Big companies like Google, along with a slew of startups, have also stumbled as they have attempted to transform healthcare using AI.

As a machine learning practitioner and health technology entrepreneur, I have watched these challenges closely. Here are my five takeaways on how to build successful AI products in healthcare and beyond.

Understand the problem and the domain you’re targeting

Watson Health started as a hammer looking for a nail. Riding on Watson’s success answering trivia questions on the TV game show Jeopardy! in 2011, IBM looked to throw AI at everything from medical imaging to clinical trial recruitment. But it’s far more effective to define and understand a problem first and build your AI from there, following the battle-tested startup advice to start small and iterate quickly. An AI-based product is still a product; at the end of the day, it needs to create real value for users.

Building any kind of product for healthcare, let alone an AI-powered one, can be tricky. There are many different stakeholders—patients, providers, hospitals, insurers—and their incentives don’t always align. Every hospital system seems to have its own way of digitizing and storing health records, and workflows vary from institution to institution and doctor to doctor. It is critical to keep all of this in mind from the get-go and ensure that the initial feedback you gather on the problem and user needs is representative of the broader market.

Use high-quality, representative data

A machine learning tool is only as good as the data that goes into it. One of Watson Health’s biggest setbacks was the revelation that its cancer diagnostics tool was not trained on real patient data, but on hypothetical cases provided by a small group of doctors at a single hospital. Hand-crafted or synthetic data isn’t necessarily bad, but Watson’s developers didn’t seem to account for the fact that this data reflected those doctors’ own biases and blind spots and wouldn’t necessarily generalize to all patient cases. As a result, the tool was accused of making inaccurate and unsafe recommendations, leading high-profile hospital partners to cancel their collaborations with Watson.

To avoid the problem of “garbage in, garbage out” and reduce the bias in your algorithms, make sure your data is high quality and reflects the real-world distributions of the groups it will eventually be applied to. Pay attention to the human annotations that will guide the system’s learning: make sure annotation guidelines are clear and that different annotators label the same cases consistently.
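One concrete check on annotator consistency is an agreement statistic such as Cohen’s kappa, computed over cases that two annotators both labeled. Here is a minimal sketch using scikit-learn; the labels are invented for illustration:

```python
# A minimal sketch of checking inter-annotator agreement with Cohen's kappa.
# The labels below are hypothetical; in practice they would come from two
# annotators independently labeling the same set of patient cases.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["malignant", "benign", "benign", "malignant", "benign"]
annotator_b = ["malignant", "benign", "malignant", "malignant", "benign"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~0 is chance-level agreement, 1 is perfect

# A low kappa is a signal to tighten the annotation guidelines and retrain
# annotators before trusting the labels for model training. The 0.6 cutoff
# here is a rough rule of thumb, not a universal standard.
if kappa < 0.6:
    print("Agreement is weak; revisit the annotation guidelines.")
```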

Anticipate that problems will arise when your algorithm meets the real world

If you think your model’s success with test data will directly translate to the real world, you’re in for a surprise. When Google deployed a machine learning system for detecting diabetic retinopathy in Thailand, socio-environmental factors limited how well it worked in practice. For instance, nurses in under-resourced clinics were often unable to take the high-quality photos of the eye the system needed to make its assessments.

Don’t assume that a tool that succeeds in one setting will work in others. Get ahead of potential problems by debugging your models, performing rigorous error analysis, and evaluating and investing in the stability and robustness of your models. After launch, continue to measure performance at regular intervals using effective testing and monitoring practices.
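A simple starting point for that kind of error analysis is to break performance out by the conditions under which the data was collected, rather than reporting one global number. A minimal sketch, where the `site` field and the records are hypothetical stand-ins for whatever deployment metadata you actually have:

```python
# A minimal sketch of slice-based error analysis: instead of one global
# accuracy number, measure performance per deployment site (or device,
# demographic group, image-quality bucket, etc.). All field names and
# records here are hypothetical.
from collections import defaultdict

predictions = [
    {"site": "clinic_a", "label": 1, "pred": 1},
    {"site": "clinic_a", "label": 0, "pred": 0},
    {"site": "clinic_b", "label": 1, "pred": 0},  # errors cluster at clinic_b
    {"site": "clinic_b", "label": 1, "pred": 0},
]

correct = defaultdict(int)
total = defaultdict(int)
for record in predictions:
    total[record["site"]] += 1
    correct[record["site"]] += int(record["label"] == record["pred"])

for site in total:
    print(f"{site}: accuracy {correct[site] / total[site]:.0%} (n={total[site]})")

# A slice that lags far behind the average is a cue to investigate that
# setting (e.g., image quality at under-resourced clinics) before scaling.
```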

Manage expectations and learn from errors

During my brief stint at the innovation arm of the University of Pittsburgh Medical Center, it was not uncommon to see companies pitching AI-powered solutions claiming to provide 99.9% accuracy. In reality, when tested on the internal hospital dataset, they almost always fell short by a large margin. It is extremely important for companies to provide accurate performance statistics and, where possible, to share details of the test dataset on which these metrics were calculated.
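One way to keep headline numbers honest is to report uncertainty alongside them, for example a bootstrap confidence interval over the test set instead of a single point estimate. A minimal sketch on synthetic labels:

```python
# A minimal sketch of reporting accuracy with a bootstrap confidence
# interval. The labels here are synthetic; in practice y_true and y_pred
# would come from a held-out test set the vendor can describe in detail.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                          # hypothetical test labels
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)  # ~85% accurate model

accuracies = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
    accuracies.append(np.mean(y_true[idx] == y_pred[idx]))

lo, hi = np.percentile(accuracies, [2.5, 97.5])
print(f"Accuracy: {np.mean(y_true == y_pred):.1%} (95% CI {lo:.1%}-{hi:.1%})")
```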

It’s also important for AI systems to offer users clear information about how reliable their predictions are—and to fail gracefully when they are uncertain (which often means abstaining from making predictions). For example, automated vehicles are designed to hand off control to drivers in unexpected situations where their models have less confidence. Similarly, a chest X-ray classifier can signal to a radiologist when it has low confidence about a diagnosis and greater scrutiny from a human is warranted.
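In code, failing gracefully can be as simple as a confidence threshold below which the system declines to predict and routes the case to a human. A minimal sketch; the threshold value and the probabilities are illustrative, not tuned:

```python
# A minimal sketch of an abstaining classifier: return a prediction only
# when the model's confidence clears a threshold, otherwise defer to a
# human reviewer. The threshold here is illustrative, not tuned.
CONFIDENCE_THRESHOLD = 0.90

def predict_or_defer(probabilities: dict[str, float]) -> str:
    """probabilities maps each class label to the model's predicted probability."""
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"prediction: {label} ({confidence:.0%} confidence)"
    return "deferred: low confidence, flagging for human review"

print(predict_or_defer({"pneumonia": 0.97, "normal": 0.03}))  # confident -> predict
print(predict_or_defer({"pneumonia": 0.55, "normal": 0.45}))  # uncertain -> defer
```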

To err is machine, so it is critical to continually log inaccurate predictions and outputs of AI systems. Even better, look for ways to engage users in providing feedback in order to continually improve your models and increase user trust.
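A lightweight version of that feedback loop logs every prediction and every user correction, so disagreements can be reviewed and folded back into training. A minimal sketch; the file path, field names, and helper functions are all hypothetical:

```python
# A minimal sketch of a prediction/feedback log: append each model output,
# then record the user's correction when one arrives. The file path, field
# names, and review workflow are hypothetical.
import json
import time

LOG_PATH = "predictions.jsonl"

def log_prediction(case_id: str, prediction: str, confidence: float) -> None:
    entry = {"ts": time.time(), "case_id": case_id,
             "prediction": prediction, "confidence": confidence}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

def log_feedback(case_id: str, corrected_label: str) -> None:
    # A real system would likely update a database keyed on case_id;
    # appending a feedback event keeps the sketch simple.
    event = {"ts": time.time(), "case_id": case_id,
             "user_feedback": corrected_label}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

log_prediction("case-001", "pneumonia", 0.62)
log_feedback("case-001", "normal")  # the clinician overrode the model
# Disagreements like this become review candidates and future training data.
```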

Don’t let marketing hype outpace accountability

One of the biggest criticisms of Watson Health is that IBM poured money into marketing it without having the results to live up to the hype it was creating. Truly disruptive innovations are always a gamble, but you can go too far out on a limb and get ahead of what the technology can actually do.

With all the fervor around AI, it’s easy for companies to adopt a publicity-first approach that inevitably falls short when they’re forced to deliver. At my company, Abridge, sharing our work in academic journals and conferences has helped us gather feedback from peers to improve our products and stay accountable for having the technology to back up the marketing.

The challenge of commercializing new technology can be daunting, especially when we see a giant with the resources of IBM fail in its quest to strike AI gold. I still believe machine learning has the potential to revolutionize everything from healthcare to transportation to how we live and work. But in a crowded field of competitors clamoring to launch the next big thing, it’s essential to clear away the hype and focus on building great products.

Sandeep Konam is a machine learning expert who trained in robotics at Carnegie Mellon University, has worked on numerous projects applying the technology to health, and was featured in Forbes’ 2021 “30 Under 30” list for Healthcare. He is the co-founder and chief technology officer of Abridge, a company that uses AI to summarize medical conversations for healthcare professionals and their patients.