If cancer is spotted early enough, the odds of survival drastically improve. Unfortunately, traditional screening methods, like endoscopy or biopsies of suspect tissue, identify cancers that are already well on their way to becoming problematic. These methods are also invasive and complicated to perform.
Techniques that screen body fluids using lasers, such as surface-enhanced Raman spectroscopy (SERS), show promise because they are quicker and less invasive. With them, researchers can accurately detect minuscule amounts of biological molecules, for example those associated with early-stage cancer, in an easy-to-get blood sample.
While researchers are optimistic about these techniques’ ability to provide earlier detection, problems remain, especially when dealing with rare forms of cancer. These rare cases don’t produce the same results with techniques like SERS and are often missed by models and algorithms.
In a paper published in Advanced Intelligent Systems, Professor Duo Lin of the Fujian Provincial Key Laboratory for Photonics Technology at Fujian Normal University and his colleagues describe a creative solution for spotting rare forms of the disease.
Turning light into cancer diagnosis
The challenge with SERS is that, unlike other medical tests where a physical specimen can be examined, SERS reveals a change in the energy of photons that have passed through a sample. As photons from a laser contact the molecules in the sample, they scatter.
Translating this pattern of scattering into a diagnosis of cancer, and furthermore into what type of cancer it could be, is next to impossible for humans. It requires that the scatter profile most likely associated with cancer be fine-tuned by screening thousands of samples, each with subtle differences. Fortunately, this is the type of problem a well-trained algorithm excels at.
To bring the algorithm up to speed, researchers have been using large numbers of known samples, both positive and negative for cancer, as a training dataset that lets the algorithm learn which characteristics distinguish positive from negative.
After cataloguing and categorizing the subtle differences, a model to diagnose cancer emerges. The model works well for common cancers because there is plenty of training material, but by their very nature rare cancers are underrepresented in the training data and difficult for the algorithm to learn and detect.
Learning from real world cancer prevalence
To solve this data imbalance, Lin and his team could choose among several sampling strategies that artificially boost the representation of rare data points, in this case rare cancers.
They chose a strategy called the Synthetic Minority Over-Sampling Technique (SMOTE). “Generally, SMOTE is a minority oversampling technique, which increases the number of minority class samples by synthesizing new samples among the nearest neighbors, thus alleviating the data imbalance problem,” explained Lin.
To increase the sample size of rare cancers, Lin used SMOTE, which takes a rare-cancer sample and randomly chooses one of its nearest neighbors (in other words, a sample that is similar). SMOTE then artificially creates a new sample in between the two.
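The interpolation step described above can be sketched in a few lines of Python. This is a minimal illustration of the general SMOTE idea using NumPy, not the code used in the paper; the function name `smote_sample`, its parameters, and the toy two-dimensional data are all assumptions made for the example.

```python
import numpy as np

def smote_sample(minority, k=5, n_new=1, rng=None):
    """Create n_new synthetic points, each lying on the line segment between
    a randomly chosen minority sample and one of its k nearest neighbors.
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # pick a minority sample
        nb = neighbors[base, rng.integers(k)]   # pick one of its neighbors
        gap = rng.random()                      # position along the segment
        synthetic[i] = minority[base] + gap * (minority[nb] - minority[base])
    return synthetic

# Toy "rare class" with five samples and two features (made-up numbers).
rare = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
print(smote_sample(rare, k=2, n_new=3, rng=0))
```

Because every synthetic point is a blend of two existing rare samples, it stays inside the region the rare class already occupies, which is why the technique adds volume without adding genuinely new information.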
But SMOTE alone wasn’t solving the problem. “We found that when the number of minority classes was as many as the majority classes, the model suffered from data redundancy, leading to poor classification performance,” said Lin.
It wasn’t until Lin and his colleague at Fujian Normal University, Shangyuan Feng, made a key observation about the distribution of rare cancer in the population that the solution became clear. They saw that the prevalence of cancers in the population roughly follows what is known as a power law distribution.
Simply put, this is a statistical relationship between two quantities where a relative change in one results in a proportional relative change in the other. With this knowledge, they could tweak the amount of resampling that SMOTE was doing on the rare cancers to fit this real-world relationship and create a balanced dataset.
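One way to picture this tweak is as a rule that decides how many samples each cancer type should end up with after oversampling. The article does not give the team's actual formula, so the sketch below is purely an assumption: it ranks classes by size and sets the target for the class of rank r to n_max * r^(-alpha), where n_max is the size of the largest class. The function name `power_law_targets`, the exponent `alpha`, and the class labels and counts are all hypothetical.

```python
def power_law_targets(class_counts, alpha=1.0):
    """Return a target sample count per class that follows a power law
    in the class's size rank (1 = largest class).
    Assumed rule for illustration; not the formula from the paper."""
    ranked = sorted(class_counts.items(), key=lambda kv: kv[1], reverse=True)
    n_max = ranked[0][1]
    targets = {}
    for rank, (label, count) in enumerate(ranked, start=1):
        target = round(n_max * rank ** (-alpha))
        # Oversampling only adds samples, so never go below the original count.
        targets[label] = max(count, target)
    return targets

# Made-up class sizes: two common cancer types and two rare ones.
counts = {"common_a": 400, "common_b": 150, "rare_c": 30, "rare_d": 12}
print(power_law_targets(counts, alpha=1.0))
# {'common_a': 400, 'common_b': 200, 'rare_c': 133, 'rare_d': 100}
```

Under this assumed rule, the rare classes are boosted substantially but stop well short of matching the largest class, in contrast to fully equalizing every class. A dictionary like this could then be handed to an oversampler; for example, the `SMOTE` class in the imbalanced-learn library accepts a per-class target dictionary through its `sampling_strategy` argument.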
According to Lin, “experiments show that the power law-SMOTE method can effectively alleviate the data imbalance problem and improve the performance of the model.”
The power of statistics
Having overcome this hurdle, the team is scaling up the number of samples and cancer types in their training datasets and is hoping that the model holds up in the face of more and more data. If it does, a powerful new diagnostic technique could improve the prognosis of patients afflicted by all forms of cancer.
Interestingly, power law distributions are found in many datasets, and Lin believes their method could be applied to those too. “In fact, power-law or long-tail distributions are encountered in many scenarios, such as telecommunication fraud, anomaly detection, network intrusion detection, disaster prediction, etc.,” he explained.
Reference: Changbin Pan, et al., Power-Law based SMOTE on imbalanced serum surface enhanced Raman spectroscopy data for cancer screening, Advanced Intelligent Systems (2023). DOI: aisy.202300006
Feature image credit: Ivan Evans on Unsplash