CST 499 Week 2

Progress Update: Onco-Logic

This week, our team completed our individual contributions to the cancer-subtype classification project and officially moved into the final phase of Onco-Logic: building models to extract structured insights from free-text pathology reports. We’ve begun collecting the data, exploring the corpus structure, and running early-stage exploratory data analysis.

My best-performing approach on the cancer-subtype task came from keeping only the 350 genes with the highest variance across samples. Using this reduced feature set, I trained a LinearSVC model that achieved near-perfect accuracy, precision, and recall across all five cancer types. Beyond raw performance, the model was also easy to interpret, especially when paired with SHAP to visualize which genes drove each prediction. With the modeling work complete, I also helped prepare materials for dashboard integration and documentation.
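For readers who want to see the shape of this approach, here is a minimal sketch of variance-based feature selection followed by a LinearSVC. It is not our project code: the data is a synthetic stand-in generated with NumPy, the helper name top_variance_indices is made up for illustration, and the scaling step and max_iter value are assumptions rather than the settings I actually used.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def top_variance_indices(X, k):
    """Return the column indices of the k highest-variance features (hypothetical helper)."""
    variances = X.var(axis=0)
    return np.argsort(variances)[::-1][:k]


# Synthetic stand-in for an expression matrix (samples x genes). NOT project data.
rng = np.random.default_rng(0)
n_samples, n_genes, n_classes = 200, 2000, 5
y = rng.integers(0, n_classes, size=n_samples)
X = rng.normal(size=(n_samples, n_genes))

# Give each class its own block of 10 "marker" genes so some signal exists.
for c in range(n_classes):
    X[y == c, c * 10:(c + 1) * 10] += 4.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Choose the genes using the training split only, then apply to both splits.
selected = top_variance_indices(X_train, k=350)

model = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
model.fit(X_train[:, selected], y_train)

accuracy = accuracy_score(y_test, model.predict(X_test[:, selected]))
print(f"Held-out accuracy on synthetic data: {accuracy:.3f}")
```

Selecting genes on the training split alone keeps the held-out score honest, even though variance filtering never looks at the labels.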

Looking ahead, our team will focus on developing the NLP preprocessing pipeline and experimenting with baseline classification models. We'll likely begin with TF-IDF features and then explore embedding strategies like Sentence-BERT (SBERT), which offers a lightweight and resource-efficient starting point. One area of concern as we ramp up modeling is compute availability. While SBERT is relatively small, some of the TensorFlow and PyTorch tools we hope to experiment with may require GPU access, which could be limited. We may need to adjust our modeling strategy based on available resources or find a more coordinated solution to our team's fragmented compute access.
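To make the TF-IDF starting point concrete, here is a rough sketch of what a first baseline could look like. We have not built this yet, so everything in it is an assumption: the report snippets and labels are toy sentences I invented for illustration, and the choice of logistic regression, bigrams, and sublinear term frequency is just one reasonable default. It runs comfortably on a CPU, which matters given the compute concerns above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy snippets and labels, for illustration only. NOT real pathology data.
reports = [
    "invasive ductal carcinoma identified in left breast biopsy",
    "breast tissue shows ductal carcinoma in situ",
    "lung biopsy reveals adenocarcinoma with acinar pattern",
    "squamous cell carcinoma present in right lung mass",
    "colon resection shows moderately differentiated adenocarcinoma",
    "sigmoid colon polyp with invasive adenocarcinoma",
]
labels = ["breast", "breast", "lung", "lung", "colon", "colon"]

# TF-IDF over unigrams and bigrams feeding a linear classifier.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
baseline.fit(reports, labels)

# Inspect the feature matrix the vectorizer produces for the training text.
tfidf_matrix = baseline.named_steps["tfidfvectorizer"].transform(reports)
print("TF-IDF matrix shape:", tfidf_matrix.shape)

predictions = baseline.predict(["core biopsy of the lung shows carcinoma"])
print("Predicted label:", predictions[0])
```

A baseline like this gives us a reference score to beat before we spend limited GPU time on embedding models.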
