Powerful diagnostic approach uses light to detect virtually all forms of cancer

Turning light into cancer diagnosis

The challenge with SERS is that, unlike other medical tests where a physical specimen can be examined, SERS reveals a change in the energy of photons that have passed through a sample. As photons from a laser strike the molecules in the sample, they scatter.

Translating this pattern of scattering into a diagnosis of cancer, and furthermore into what type of cancer it could be, is next to impossible for humans, because the scatter profile most likely associated with cancer has to be fine-tuned by screening thousands of samples, each with subtle differences. Fortunately, this is the type of problem a well-trained algorithm excels at.

To bring the algorithm up to speed, researchers have been using large numbers of known samples, both positive and negative for cancer, as a training dataset that lets the algorithm learn which characteristics distinguish positive from negative.

Once the algorithm has catalogued and categorized the subtle differences, a model to diagnose cancer emerges. The model works well for common cancers, as there is plenty of training material, but by their very nature rare cancers are underrepresented in the training database and difficult for the algorithm to learn and detect.

Learning from real world cancer prevalence

To solve this data imbalance, Lin and his team had several sampling strategies available that artificially boost the representation of rare data points, in this case rare cancers.

They chose a strategy called the Synthetic Minority Over-Sampling Technique (SMOTE). “Generally, SMOTE is a minority oversampling technique, which increases the number of minority class samples by synthesizing new samples among the nearest neighbors, thus alleviating the data imbalance problem,” explained Lin.

To increase the sample size of rare cancers, Lin used SMOTE to take a rare-cancer sample and randomly choose one of its nearest neighbors (in other words, a sample that is similar to it). SMOTE then artificially creates a new sample in between the two.
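For readers who want to see the idea in code, here is a minimal sketch of that interpolation step in plain Python. It is not the team's implementation: the function name `smote`, the parameter names, and the toy two-feature "spectra" are all made up for illustration, and a real SERS spectrum would have far more features.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic samples from a list of minority-class samples.

    Illustrative sketch only: each new sample lies on the straight line
    between a randomly chosen minority sample and one of its k nearest
    minority-class neighbors.
    """
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        # Pick a minority sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]

        # Find its k nearest neighbors among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])

        # Place the new sample somewhere between the two.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy data (hypothetical): four rare-class samples with two features each.
rare = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
new_points = smote(rare, n_new=6, k=2, rng=random.Random(0))
```

Because every synthetic point sits between two real ones, the new samples stay inside the region already occupied by the rare class rather than being exact copies.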

But SMOTE alone wasn’t solving the problem. “We found that when the number of minority classes was as many as the majority classes, the model suffered from data redundancy, leading to poor classification performance,” said Lin.

It wasn’t until Lin and his colleague at Fujian Normal University, Shangyuan Feng, made a key observation about the distribution of rare cancer in the population that the solution became clear. They saw that the prevalence of cancers in the population roughly follows what is known as a power law distribution.

Simply put, this is a statistical relationship between two quantities where a relative change in one results in a proportional relative change in the other. With this knowledge, they could tweak the amount of resampling that SMOTE was doing on the rare cancers to fit this real-world relationship and create a balanced dataset.
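The article does not spell out how the resampling amounts were set, so the following is only one plausible sketch of the idea, not the team's published procedure. It assumes a rank-based power law, in which the class at rank r (1 being the most common) is grown to roughly n_max * r^(-alpha) samples, instead of growing every class to the size of the largest one. The function name `power_law_targets`, the exponent `alpha`, and the class counts are all hypothetical.

```python
def power_law_targets(counts, alpha=1.0):
    """Return a target sample count per class that follows a power law.

    Assumption for illustration: the largest class keeps its size, and
    the class at rank r is raised to about n_max * r**(-alpha). A class
    is never shrunk below its real count.
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_max = counts[ranked[0]]
    targets = {}
    for rank, label in enumerate(ranked, start=1):
        ideal = round(n_max * rank ** (-alpha))
        targets[label] = max(counts[label], ideal)
    return targets


# Made-up class sizes, not data from the study.
counts = {"common_1": 1000, "common_2": 600, "mid_1": 120,
          "rare_1": 15, "rare_2": 8}
targets = power_law_targets(counts, alpha=1.0)

# Number of synthetic samples an oversampler such as SMOTE would add.
to_synthesize = {label: targets[label] - counts[label] for label in counts}
```

With these made-up numbers the rarest class is grown from 8 to 200 samples rather than to 1,000, which is the kind of restraint that would avoid the redundancy Lin describes when every minority class is inflated to match the majority.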

According to Lin, “experiments show that the power law-SMOTE method can effectively alleviate the data imbalance problem and improve the performance of the model.”

The power of statistics

Having overcome this hurdle, the team is scaling up the number of samples and cancer types in their training datasets and is hoping that the model holds up in the face of more and more data. If it does, a powerful new diagnostic technique could improve the prognosis of patients afflicted by all forms of cancer.

Interestingly, power law distributions are found in many datasets, and Lin believes their method could be applied to those too. “In fact, power-law or long-tail distributions are encountered in many scenarios, such as telecommunication fraud, anomaly detection, network intrusion detection, disaster prediction, etc.,” he explained.

Feature image credit: Ivan Evans on Unsplash