AI and datasets
Artificial intelligence can recognize patterns in large data collections and draw connections between multiple features in the data.
When successful, the results are often dubbed “superhuman” and receive broad media coverage. However, such AI systems can also be easily fooled (for a great overview, see 1).
So, the quality of datasets is of paramount importance, especially in medical applications, where human lives depend on accurate analysis. Securing high-quality datasets for medical applications is not a trivial task: it requires specific standards and infrastructure, as has been discussed at length (2-5).
Cervical photographs and long-term vision in 1993
However, the importance of long-term vision is often neglected.
For instance, almost 30 years ago, the NCI (National Cancer Institute) started a longitudinal study, Proyecto Epidemiologico Guanacaste, of human papillomavirus infection and the risk of cervical precancer/cancer. From 1993 to 2001, more than 9,000 women were screened and, among other data, 60,000 cervical photographs were collected. The original 35mm slide films were later digitized as TIFF files, accompanied by all necessary metadata.
Now they are available at no charge to scientists.
Thanks to that effort, a recently published study (6) trained a deep neural network AI system for automated detection of precancerous lesions.
With 91% accuracy, the system is substantially better than any existing approach (visual inspection: 69%; Pap smear: 71%).
Collection of this data started long before the current AI boom, and the images were not even intended for AI training. Nonetheless, the data collection protocols used were rock-solid. The vision was there, as well as the infrastructure to enable correct data storage, metadata association, and the use of proper metadata terminology.
This leads us to Betsy Humphreys.
She was not involved in the cervical cancer study, but she clearly understood the importance of data quality for any medical research.
In 1986, together with then NLM (National Library of Medicine) director Donald Lindberg, she defined the Unified Medical Language System (UMLS) and served as the UMLS’s first project officer. UMLS today is widely used to precisely map terminology between patients, clinicians, pharmacies, and insurance companies.
In the early UMLS days, she aggressively pushed grants for UMLS-related projects within the NIH (which includes both NCI and NLM) and at various other institutions to test the use cases, and rapidly updated UMLS based on the feedback obtained. In other words, she applied a truly entrepreneurial approach to speed up the wide adoption of new technology in a sector as large as public health. Before that, she led a project that became the source of materials for today’s MEDLINE/PubMed systems. Later, she developed the basis for the federal strategy on terminology standards for electronic health records. In her own words:
“I first outlined what I thought we needed to do to get there in 1991 or 1992. I would say […] we essentially got there in 2010, 2011.” (7)
MEDLINE/PubMed became the largest free full-text archive of biomedical and life sciences journal literature. Standardized electronic health records are the backbone of the precision medicine of the future.
So, forty years ago, before the high-throughput data infrastructure we have today, she understood that data quality, availability, and interoperability are key. She was certainly not the first to understand that, but she is one of the rare people who managed to materialize her vision on such a large scale and for the real benefit of so many people.
Her lasting legacy rests not only on her strong, long-term vision. That vision also encompassed care for the benefit of the wider community. And that benefit is materializing.