Healthcare data holds great potential to improve medicine, but mining it is not easy. To get to the gold, Truveta built a large AI-powered model to crunch through medical texts from more than 20,000 clinics and 700 hospitals.
Truveta’s model is designed to extract patient diagnoses, medications, lab results and other data from sources like physicians’ notes and insurance claims, which consist of messy, unstructured text filled with abbreviations, jargon and misspellings. The model accomplishes these tasks with greater than 90% accuracy, the company says.
The Seattle-area healthcare technology startup introduced the Truveta Language Model in a recent preprint publication, and gave more background this week in a white paper and blog post.
The model is trained on large quantities of medical texts from the company’s 28 health system partners, representing 16% of patient care in the U.S. The company also updates its datasets daily.
“The amount of the data that we process every day and make available for researchers in a timely fashion makes it a very complex and really big data problem,” said Jay Nanduri, Truveta chief technology officer, in an interview with GeekWire.
Truveta’s healthcare and life sciences customers study events like adverse reactions to medicines or patient seizure frequency. Cancer researchers might use the platform to flag disease progression and the need for a shift in treatment.
The model “normalizes” the messy data, so that texts like “Acute COVID-19” and “COVID19 _ acute infection” are treated as the same thing. And it can accomplish that at scale: Truveta has access to 3.1 billion patient encounters and 2.4 billion medication orders due to its relationships with major health systems.
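To make the idea of normalization concrete, here is a toy, rule-based sketch in Python. It is not Truveta’s method, which relies on a trained language model; the lookup table, the stopword list and the `normalize` function are all hypothetical and exist only to show how two differently written strings can resolve to one canonical label.

```python
import re
from typing import Optional

# Hypothetical lookup table: token-set signature -> canonical label.
# Illustrative only; not Truveta's vocabulary or approach.
CANONICAL = {
    frozenset({"acute", "covid19"}): "COVID-19 (acute)",
}

# Filler words ignored when comparing strings (an assumption for this sketch).
STOPWORDS = {"infection"}


def normalize(text: str) -> Optional[str]:
    """Map a free-text diagnosis string to a canonical label, if known."""
    lowered = text.lower()
    # Join "covid-19", "covid 19" and "covid_19" into the single token "covid19".
    lowered = re.sub(r"covid[\s\-_]*19", "covid19", lowered)
    # Split on anything that is not a lowercase letter or digit; drop empties.
    tokens = {t for t in re.split(r"[^a-z0-9]+", lowered) if t}
    # Word order and filler words no longer matter after this step.
    return CANONICAL.get(frozenset(tokens - STOPWORDS))


if __name__ == "__main__":
    for raw in ["Acute COVID-19", "COVID19 _ acute infection", "Chronic asthma"]:
        print(repr(raw), "->", normalize(raw))
```

Hand-written rules like these break down quickly against real clinical text, with its endless abbreviations and misspellings, which is the gap a trained model is meant to fill.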
Truveta’s model is distinct from GPT-4, the “generative” large language model from Microsoft-backed OpenAI, which instantly produces content based on prompts. GPT-4’s proposed health uses include supporting diagnoses, summarizing doctor-patient conversations, and suggesting bedside language for doctors.
Truveta’s specialized training on medical datasets goes beyond GPT-4, which was trained on a broad range of open information on the internet, said Truveta CEO Terry Myerson. GPT-4 is also known to “hallucinate” false responses to queries, he noted.
GPT-4 can seem like “a doctor on acid,” said Myerson. “The inaccuracies of GPT-4 are a real issue.”
But GPT-4 is about to get smarter. Microsoft subsidiary Nuance is already incorporating GPT-4 into a medical note-taking system trained on medical data, and will preview the application this summer.
Microsoft is also a Truveta investor and partners with the startup on efforts that include introducing new customers to the platform.
Startups are beginning to bolt GPT-4 onto their offerings. Nanduri sees companies feeding GPT-4 their own datasets for customized uses. Truveta, in contrast, markets its platform as a source of data.
Truveta partners with other companies that build applications on top of its system. Users can build generative or extractive tools tapping into Truveta’s data, as well as “discriminative” tools such as models for predicting cancer. “We are enabling all three types of applications,” said Nanduri.
Truveta’s collaborators include Pfizer, which uses the platform to monitor the safety of COVID-19 vaccines and therapies, and Seattle company Alpine Immune Sciences, which tapped Truveta to match patients to a clinical trial. Last fall, Truveta also unveiled Truveta Studio, an interface into real-time patient data.
The Truveta Language Model was built and trained over more than two years, beginning with an open-source option, a common starting point. The model works in sync with two other technology efforts at the company: ensuring that information is private and anonymized, and standardizing the data, which is fragmented across multiple health systems.
Getting those health systems together under one roof has been a major vision for Truveta since its founding in 2020, with Providence and three other medical systems on board. The company raised $95 million in 2021 and continues to add new health systems to its network.
Myerson, a former Microsoft executive, sees parallels between the Truveta Language Model and BloombergGPT, a large language model built from scratch by the financial services company, announced in March. Bloomberg trained the model on copious amounts of financial information, similar to how Truveta’s model is trained on reams of medical data.
“The world of health needs an accurate model, and to get an accurate model you need the right data to train against,” Myerson said.