BioHack 2025: Detecting Lung Cancer from Audio Input with a CNN Model
#Before We Dive In – BioHack 2025: Detecting Lung Cancer
This is a write-up of our BioHack 2025 project. Before graduating, I gathered a group of friends I respect to collaborate on it. The experience was incredibly meaningful, not just as a project, but as my first time participating in machine learning.
Machine learning had always been a field I wanted to pursue, but I initially lacked the necessary skills and understanding. Even so, my dream was to integrate biology, medicine, and machine learning.
#What We Wanted to Solve
Our goal was to work on challenges in the medical field, specifically cancer detection. However, rather than working on well-funded research areas like breast cancer or leukemia, we wanted to explore a less commonly studied space.
Additionally, we aimed to create a web-compatible, user-friendly platform to improve accessibility. For example, MRI scans and blood tests are often difficult to access, requiring specialized equipment and hospital visits.
This is why we developed a lung cancer detection platform, where users answer a few questions in a chatbot while their voice is analyzed in real time. The system then generates a report based on both textual responses and audio inputs.
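To make the idea concrete, here is a minimal, hypothetical sketch of how questionnaire answers and the audio model's output could be merged into a single report. The function name, field names, and equal weighting are illustrative assumptions, not our actual API.

```python
# Hypothetical sketch: combine chatbot answers with the audio model's
# predicted probability. Names and weights are illustrative only.

def combine_report(answers: dict, audio_cancer_prob: float) -> dict:
    """Merge questionnaire responses with the audio model's output."""
    # Simple symptom tally from yes/no chatbot answers (illustrative).
    symptom_score = sum(1 for v in answers.values() if v == "yes") / max(len(answers), 1)
    return {
        "symptom_score": round(symptom_score, 2),
        "audio_risk": round(audio_cancer_prob, 2),
        # Equal weighting is an assumption; real weights would need tuning.
        "combined_risk": round(0.5 * symptom_score + 0.5 * audio_cancer_prob, 2),
    }

report = combine_report({"cough": "yes", "smoker": "no"}, 0.40)
print(report)  # {'symptom_score': 0.5, 'audio_risk': 0.4, 'combined_risk': 0.45}
```

In a real deployment the weighting between the questionnaire and the audio model would itself be learned or validated, not fixed at 50/50.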
#Methodology
#Data Collection
We acquired a dataset containing audio samples from cancer patients. Although the data was at least six years old, it contained:
- 920 recordings from individuals aged 10 to 90 years
- 126 patients contributing a total of five hours of recorded audio
#Data Processing & Augmentation
Upon analyzing the dataset through visualization, I observed that the class distribution was uneven. To address this, my first step was data augmentation to improve balance and make the most of the limited data.
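The augmentation step can be sketched at the waveform level, assuming each recording is a 1-D NumPy array of samples. The exact transforms and parameters we used may have differed; this shows two common ones, noise overlay and time shifting.

```python
import numpy as np

# Minimal sketch of waveform-level augmentation. Parameters are illustrative.

def add_noise(audio: np.ndarray, noise_level: float = 0.005) -> np.ndarray:
    """Overlay low-amplitude Gaussian noise on the waveform."""
    return audio + noise_level * np.random.randn(len(audio))

def time_shift(audio: np.ndarray, max_shift: int = 1600) -> np.ndarray:
    """Circularly shift the waveform by a random sample offset."""
    return np.roll(audio, np.random.randint(-max_shift, max_shift))

clip = np.sin(np.linspace(0, 100, 16000))  # stand-in for a 1-second clip
augmented = [add_noise(clip), time_shift(clip)]
print(len(augmented), augmented[0].shape)  # 2 (16000,)
```

Each augmented copy keeps the original label, so minority classes can be oversampled without collecting new recordings.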
#Machine Learning Model Development
Once we ensured that the dataset was sufficiently representative, we moved on to training machine learning models.
This step posed several challenges:
- Model Selection & Parameter Tuning – Given my limited understanding of the field, it was difficult to determine whether low accuracy stemmed from dataset quality or from the model itself.
- Choosing the Best Model – After experimenting with various models, I settled on an ensemble learning approach, combining multiple models instead of relying on a single one. However, fine-tuning the weight assigned to each model within the ensemble was a complex challenge.
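The weighted-ensemble idea can be sketched with scikit-learn's soft-voting classifier. The models, the synthetic data, and the weights below are placeholders, not what we actually used; the point is where the per-model weights enter.

```python
# Sketch of weighted soft voting with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# `weights` scales each model's predicted probabilities before averaging;
# tuning these values is the hard part.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
    weights=[1.0, 2.0],
)
ensemble.fit(X_tr, y_tr)
print(f"held-out accuracy: {ensemble.score(X_te, y_te):.2f}")
```

With `voting="soft"`, the ensemble averages the weighted class probabilities rather than taking a majority vote, which tends to work better when the base models output calibrated probabilities.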
#Results & Challenges
Our model achieved a promising 90% accuracy rate, but we suspect this may be due to overfitting. Given our dataset's size and complexity, further validation is needed to confirm real-world effectiveness.
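One standard way to probe whether a figure like 90% is real or an artifact of a lucky split is k-fold cross-validation. This is a sketch on synthetic data, not our pipeline; with real patient audio, folds should also be grouped by patient so the same speaker never appears in both train and test.

```python
# Cross-validation sketch: compare fold-level accuracies instead of
# trusting a single train/test split (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=cv)
# A large spread across folds (or a big gap from the single-split figure)
# signals that the reported accuracy is optimistic.
print(f"fold accuracies: {scores.round(2)}, mean: {scores.mean():.2f}")
```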
#Thoughts and future development
Throughout this project, several challenges arose that we are still exploring. Here are the key questions from our team:
- Dataset Augmentation – Is data augmentation necessary to improve model generalization, or could it introduce biases?
- Test Coverage – Could our model's performance be affected by insufficient test data coverage?
- Model Selection – Are there more suitable machine learning techniques for this type of dataset?
- Overfitting – What methods can we use to reduce overfitting in audio-based medical classification?
Here is the result of testing our prediction model on a healthy .wav sample. Although this person does not have lung cancer, our model predicted a 40% probability of lung cancer, which was not what I expected. This is likely due to overfitting.
- Sound Processing Optimization – Are there better techniques to preprocess and extract features from voice data?
- Data Splitting Strategy – Given that multiple voice frequencies may correspond to the same or different diseases, what is the best way to split the dataset to ensure balanced learning?
- Data Cleaning – A couple of research papers have explored data cleaning, such as reducing noise before data augmentation. This is an area for future development.
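On the sound-processing question above, one common starting point is feeding the model a log-spectrogram rather than the raw waveform. This is a hedged sketch using SciPy on a stand-in tone; our pipeline may have used different features (e.g. MFCCs) and parameters.

```python
# Sketch: log-spectrogram feature extraction as CNN input.
import numpy as np
from scipy.signal import spectrogram

fs = 16000
audio = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # stand-in 1-second tone

# 512-sample windows with 50% overlap; these values are illustrative.
f, t, Sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=256)
log_spec = np.log(Sxx + 1e-10)  # log scale compresses the dynamic range
print(log_spec.shape)  # (frequency bins, time frames)
```

The resulting 2-D array is what a CNN would consume as an "image" of the recording, which is why spectrogram quality and preprocessing choices matter so much for this kind of model.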