BioHack 2025: Detecting Lung Cancer from Audio Input with a CNN Model
#Before We Dive In – BioHack 2025: Detecting Lung Cancer
This is a write-up of our BioHack 2025 project. Before graduating, I gathered a group of friends I respect to collaborate on it. The experience was incredibly meaningful, not just as a project, but as my first time participating in machine learning.
Machine learning had always been a field I wanted to pursue, but I initially lacked the necessary skills and understanding. Even so, my dream was to integrate biology, medicine, and machine learning.
#What We Wanted to Solve
Our goal was to work on challenges in the medical field, specifically cancer detection. However, rather than working on well-funded research areas like breast cancer or leukemia, we wanted to explore a less commonly studied space.
Additionally, we aimed to create a web-compatible, user-friendly platform to improve accessibility. For example, MRI scans and blood tests are often difficult to access, requiring specialized equipment and hospital visits.
This is why we developed a lung cancer detection platform, where users answer a few questions in a chatbot while their voice is analyzed in real time. The system then generates a report based on both textual responses and audio inputs.
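To make the idea concrete, here is a minimal, hypothetical sketch of how questionnaire answers and the audio model's output could be merged into a single report. The function name, field names, and equal weighting are illustrative assumptions, not our actual API.

```python
# Hypothetical sketch: combine chatbot answers with the audio model's
# predicted probability. Names and weights are illustrative only.

def combine_report(answers: dict, audio_cancer_prob: float) -> dict:
    """Merge questionnaire responses with the audio model's output."""
    # Simple symptom tally from yes/no chatbot answers (illustrative).
    symptom_score = sum(1 for v in answers.values() if v == "yes") / max(len(answers), 1)
    return {
        "symptom_score": round(symptom_score, 2),
        "audio_risk": round(audio_cancer_prob, 2),
        # Equal weighting is an assumption; real weights would need tuning.
        "combined_risk": round(0.5 * symptom_score + 0.5 * audio_cancer_prob, 2),
    }

report = combine_report({"cough": "yes", "smoker": "no"}, 0.40)
print(report)  # {'symptom_score': 0.5, 'audio_risk': 0.4, 'combined_risk': 0.45}
```

In a real deployment the weighting between the questionnaire and the audio model would itself be learned or validated, not fixed at 50/50.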
#Methodology
#Data Collection
We acquired a dataset containing audio samples from cancer patients. Although the data was at least six years old, it contained:
- 920 recordings from individuals aged 10 to 90 years
- 126 patients contributing a total of five hours of recorded audio
#Data Processing & Augmentation
Upon analyzing the dataset through visualization, I observed that the class distribution was uneven. To address this, my first step was data augmentation to improve balance and make the most of the limited data.
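The augmentation step can be sketched at the waveform level, assuming each recording is a 1-D NumPy array of samples. The exact transforms and parameters we used may have differed; this shows two common ones, noise overlay and time shifting.

```python
import numpy as np

# Minimal sketch of waveform-level augmentation. Parameters are illustrative.

def add_noise(audio: np.ndarray, noise_level: float = 0.005) -> np.ndarray:
    """Overlay low-amplitude Gaussian noise on the waveform."""
    return audio + noise_level * np.random.randn(len(audio))

def time_shift(audio: np.ndarray, max_shift: int = 1600) -> np.ndarray:
    """Circularly shift the waveform by a random sample offset."""
    return np.roll(audio, np.random.randint(-max_shift, max_shift))

clip = np.sin(np.linspace(0, 100, 16000))  # stand-in for a 1-second clip
augmented = [add_noise(clip), time_shift(clip)]
print(len(augmented), augmented[0].shape)  # 2 (16000,)
```

Each augmented copy keeps the original label, so minority classes can be oversampled without collecting new recordings.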
#Machine Learning Model Development
Once we ensured that the dataset was sufficiently representative, we moved on to training machine learning models.
This step posed several challenges:
- Model Selection & Parameter Tuning – Given my limited understanding of the field, it was difficult to determine whether low accuracy stemmed from dataset quality or from the model itself.
- Choosing the Best Model – After experimenting with various models, I settled on an ensemble learning approach, combining multiple models instead of relying on a single one. However, fine-tuning the weight assigned to each model within the ensemble was a complex challenge.
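The weighted-ensemble idea can be sketched with scikit-learn's soft-voting classifier. The models, the synthetic data, and the weights below are placeholders, not what we actually used; the point is where the per-model weights enter.

```python
# Sketch of weighted soft voting with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# `weights` scales each model's predicted probabilities before averaging;
# tuning these values is the hard part.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
    weights=[1.0, 2.0],
)
ensemble.fit(X_tr, y_tr)
print(f"held-out accuracy: {ensemble.score(X_te, y_te):.2f}")
```

With `voting="soft"`, the ensemble averages the weighted class probabilities rather than taking a majority vote, which tends to work better when the base models output calibrated probabilities.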
#Results & Challenges
Our model achieved a promising 90% accuracy rate, but we suspect this may be due to overfitting. Given our dataset's size and complexity, further validation is needed to confirm real-world effectiveness.
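One standard way to probe whether a figure like 90% is real or an artifact of a lucky split is k-fold cross-validation. This is a sketch on synthetic data, not our pipeline; with real patient audio, folds should also be grouped by patient so the same speaker never appears in both train and test.

```python
# Cross-validation sketch: compare fold-level accuracies instead of
# trusting a single train/test split (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=cv)
# A large spread across folds (or a big gap from the single-split figure)
# signals that the reported accuracy is optimistic.
print(f"fold accuracies: {scores.round(2)}, mean: {scores.mean():.2f}")
```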
#Thoughts and future development
Throughout this project, several challenges arose that we are still exploring. Here are the key questions from our team:
- Dataset Augmentation – Is data augmentation necessary to improve model generalization, or could it introduce biases?
- Test Coverage – Could our model's performance be affected by insufficient test data coverage?
- Model Selection – Are there more suitable machine learning techniques for this type of dataset?
- Overfitting – What methods can we use to reduce overfitting in audio-based medical classification?
Here is the result of testing our prediction model on a healthy .wav sample. Although this person does not have lung cancer, our model predicted a 40% probability of lung cancer, which was not what I expected. This is likely due to overfitting.
- Sound Processing Optimization – Are there better techniques to preprocess and extract features from voice data?
- Data Splitting Strategy – Given that multiple voice frequencies may correspond to the same or different diseases, what is the best way to split the dataset to ensure balanced learning?
- Data Cleaning – A couple of research papers have explored data cleaning, such as reducing noise before data augmentation. This is an area for future development.
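On the sound-processing question above, one common starting point is feeding the model a log-spectrogram rather than the raw waveform. This is a hedged sketch using SciPy on a stand-in tone; our pipeline may have used different features (e.g. MFCCs) and parameters.

```python
# Sketch: log-spectrogram feature extraction as CNN input.
import numpy as np
from scipy.signal import spectrogram

fs = 16000
audio = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # stand-in 1-second tone

# 512-sample windows with 50% overlap; these values are illustrative.
f, t, Sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=256)
log_spec = np.log(Sxx + 1e-10)  # log scale compresses the dynamic range
print(log_spec.shape)  # (frequency bins, time frames)
```

The resulting 2-D array is what a CNN would consume as an "image" of the recording, which is why spectrogram quality and preprocessing choices matter so much for this kind of model.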