BioHack 2025: Detecting Lung Cancer from Audio Input with a CNN Model
Before We Dive In
This is a write-up of our BioHack 2025 project. I gathered a group of friends I respect to collaborate on this project before graduating. The experience was incredibly meaningful, not just as a project, but as my first time participating in machine learning work.
Machine learning had always been a field I wanted to pursue, but I initially lacked the necessary skills and understanding. Even so, my dream was to integrate biology, medicine, and machine learning.
What We Wanted to Solve
Our goal was to work on challenges in the medical field, specifically cancer detection. However, rather than working on well-funded research areas like breast cancer or leukemia, we wanted to explore a less commonly studied space.
Additionally, we aimed to create a web-compatible, user-friendly platform to improve accessibility. Tests like MRI scans and blood work are often difficult to access, requiring specialized equipment and hospital visits.
This is why we developed a lung cancer detection platform, where users answer a few questions in a chatbot while their voice is analyzed in real time. The system then generates a report based on both textual responses and audio inputs.
Methodology
Data Collection
We acquired a dataset containing audio samples from cancer patients. Although the data was at least six years old, it includes:
- 920 recordings from individuals aged 10 to 90 years
- 126 patients contributing to a total of five hours of recorded audio
Data Processing & Augmentation
Upon analyzing the dataset with visualization techniques, I observed that the class distribution was uneven. To address this, my first step was data augmentation to improve balance and make the most of the limited data.
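For context, common waveform-level augmentations for audio include adding noise at a chosen signal-to-noise ratio, shifting in time, and changing playback speed. Here is a minimal NumPy sketch of those three transforms; the function names and parameter values are illustrative, not our exact pipeline:

```python
import numpy as np

def add_noise(wav: np.ndarray, snr_db: float = 20.0, rng=None) -> np.ndarray:
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wav.shape)
    return wav + noise

def time_shift(wav: np.ndarray, max_frac: float = 0.1, rng=None) -> np.ndarray:
    """Circularly shift the waveform by up to max_frac of its length."""
    rng = rng or np.random.default_rng(0)
    limit = int(len(wav) * max_frac)
    return np.roll(wav, rng.integers(-limit, limit))

def speed_change(wav: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Resample by linear interpolation; rate > 1 shortens the clip."""
    idx = np.arange(0, len(wav), rate)
    return np.interp(idx, np.arange(len(wav)), wav)

# demo on a synthetic 1-second 440 Hz tone at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
augmented = [add_noise(tone), time_shift(tone), speed_change(tone, 1.1)]
```

Each transform yields a new clip that can be labeled the same as its source, which is how augmentation rebalances an uneven class distribution.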
Machine Learning Model Development
Once we ensured that the dataset was sufficiently representative, we moved on to training machine learning models.
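Since the model at the heart of the project is a CNN on audio, here is a minimal sketch of the kind of architecture involved, assuming log-mel spectrogram inputs of 64 mel bins by 128 frames and a binary cancer/no-cancer output. The layer sizes and the PyTorch framing are illustrative, not our exact configuration:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Small CNN over log-mel spectrogram patches (1 x 64 bins x 128 frames)."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool -> (B, 64, 1, 1)
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = SpectrogramCNN()
logits = model(torch.randn(4, 1, 64, 128))  # a batch of 4 spectrograms
```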
This step posed several challenges:
- Model Selection & Parameter Tuning – Given my limited experience in this field, it was difficult to determine whether low accuracy stemmed from dataset quality or from the model itself.
- Choosing the Best Model – After experimenting with various models, I settled on an ensemble learning approach, combining multiple models instead of relying on a single one. However, fine-tuning the weight assigned to each model within the ensemble proved to be a complex challenge.
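The weight-tuning problem above can be sketched as a brute-force grid search over ensemble weights, scored by validation accuracy on weighted-average class probabilities. This is a minimal illustration with synthetic probabilities, not our actual tuning procedure; the function names are ours:

```python
import numpy as np
from itertools import product

def ensemble_proba(probas, weights):
    """Weighted average of per-model class-probability matrices."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.asarray(probas), axes=1)

def grid_search_weights(probas, y_true, steps=5):
    """Try every weight combination on a coarse grid; keep the best accuracy."""
    best_w, best_acc = None, -1.0
    grid = np.linspace(0, 1, steps)
    for w in product(grid, repeat=len(probas)):
        if sum(w) == 0:
            continue
        preds = ensemble_proba(probas, w).argmax(axis=1)
        acc = float((preds == y_true).mean())
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# toy validation set: one perfect model and one always-wrong model
y_true = np.array([0, 1, 1, 0])
p_good = np.eye(2)[y_true]      # probabilities that match the labels
p_bad = 1.0 - p_good            # probabilities that invert the labels
best_w, best_acc = grid_search_weights([p_good, p_bad], y_true)
```

The search correctly pushes all weight onto the good model; with more models, the grid grows exponentially, which is one reason this tuning felt complex.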
Results & Challenges
Our model achieved a promising 90% accuracy, but we suspect this may be due to overfitting. Given our dataset's size and complexity, further validation is needed to confirm real-world effectiveness.
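One validation step that helps distinguish real performance from overfitting with a dataset like ours (many recordings per patient) is splitting by patient rather than by recording, so clips from the same speaker never appear on both sides of the split. A minimal sketch, with illustrative patient IDs and split fraction:

```python
import numpy as np

def group_train_test_split(groups, test_frac=0.2, seed=0):
    """Split recording indices so all clips from one patient land on one side."""
    rng = np.random.default_rng(seed)
    ids = np.unique(groups)
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_frac))
    test_ids = set(ids[:n_test].tolist())
    mask = np.array([g in test_ids for g in groups])
    return np.where(~mask)[0], np.where(mask)[0]

# toy example: 10 recordings from 4 patients
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
train_idx, test_idx = group_train_test_split(groups)
```

If accuracy drops sharply under a patient-level split compared to a random recording-level split, the model is likely memorizing speakers rather than learning disease markers.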
Thoughts and Future Development
Throughout this project, several challenges arose that we are still exploring. Here are the key questions from our team:
- Dataset Augmentation – Is data augmentation necessary to improve model generalization, or could it introduce biases?
- Test Coverage – Could our model's performance be affected by insufficient test data coverage?
- Model Selection – Are there more suitable machine learning techniques for this type of dataset?
- Overfitting – What methods can we use to reduce overfitting in audio-based medical classification?
We also tested our prediction model on a healthy .wav sample. Although this speaker should have been classified as cancer-free, the model reported a 40% probability of lung cancer, which I did not expect. This is probably due to overfitting.
- Sound Processing Optimization – Are there better techniques to preprocess and extract features from voice data?
- Data Splitting Strategy – Given that multiple voice frequencies may correspond to the same or different diseases, what is the best way to split the dataset to ensure balanced learning?
- Data Cleaning – A couple of research papers have addressed data cleaning, such as reducing noise before data augmentation; this is an area for future development.
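As a rough illustration of the kind of noise reduction such papers describe, here is a minimal spectral-subtraction sketch in NumPy: it estimates a noise floor from the quietest frames and subtracts it from every frame's magnitude spectrum. The frame sizes and over-subtraction factor are arbitrary choices for the sketch, not values from any specific paper:

```python
import numpy as np

def spectral_gate(wav, frame=512, hop=256, over_sub=1.5):
    """Simple spectral-subtraction denoiser (illustrative, not production-grade)."""
    win = np.hanning(frame)
    n = 1 + (len(wav) - frame) // hop
    # frame the signal with a Hann window and move to the frequency domain
    frames = np.stack([wav[i*hop:i*hop+frame] * win for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # noise floor: mean magnitude of the 10% lowest-energy frames
    energy = mag.sum(axis=1)
    k = max(1, n // 10)
    noise = mag[np.argsort(energy)[:k]].mean(axis=0)
    clean_mag = np.maximum(mag - over_sub * noise, 0.0)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1)
    # overlap-add the denoised frames back into one waveform
    out = np.zeros(len(wav))
    for i in range(n):
        out[i*hop:i*hop+frame] += clean[i]
    return out

# demo: a 220 Hz tone buried in white noise
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 220 * t) + 0.1 * rng.normal(size=sr)
denoised = spectral_gate(noisy)
```

Applying a step like this before augmentation would keep the augmented copies from multiplying recording noise along with the signal.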