CST 499 Week 1
Onco-Logic - Intelligent Machine Oncology
As part of my CST 499 capstone, I’ve been working alongside Nicholas M., Federico M., and Harris P. on a series of oncology-focused machine learning challenges hosted by the Cedars-Sinai National AI Campus. The first, titled "Breast Cancer Prognosis: Leveraging Cancer Registry Data for Survival Prediction," involves developing survival models using clinical data from the SEER registry. The second, "Modeling Complex Binary-Class Associations in Simulated Genomic Data," focuses on identifying hidden patterns in synthetic high-dimensional SNP datasets. These two efforts form the foundation of a broader initiative we have named Onco-Logic, which will eventually include a third project on extracting structured information from pathology report text.
At the outset, our mentor encouraged us to work independently on the projects rather than collaborating from the start. This approach felt counterintuitive at first. However, as the week progressed, we began to understand its benefits. Each team member approached the tasks differently, applying unique strategies in data preprocessing, modeling, feature selection, and evaluation. This diversity of methods led to a wide range of outputs that we can now compare, benchmark, and synthesize. It also allowed each person to explore the full scope of the workflow and build a deeper understanding of the specific challenges presented by each dataset.
In the breast cancer prognosis project, we worked with SEER clinical data to build models that predict patient survival outcomes. One of the key challenges we encountered was class imbalance, as the dataset contained a disproportionately high number of survivors compared to non-survivors. To address this, we applied oversampling techniques such as Random Oversampling and SMOTE to ensure that our models could learn effectively from the minority class. We built both binary classifiers to predict vital status and multiclass models to estimate survival time brackets, using algorithms like Random Forest, XGBoost, and Support Vector Machines. These models were evaluated using metrics such as ROC-AUC, precision, recall, and F1-score. We also began using SHAP to interpret model predictions and found that tumor size and lymph node involvement were among the most influential features.
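To make the oversampling step concrete, here is a minimal sketch of random oversampling in plain Python. It is an illustration rather than the code used in the project: the function name `random_oversample`, the toy rows (tumor size in mm, positive lymph nodes), and the labels are all hypothetical. The idea is simply to duplicate minority-class rows at random until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is padded up to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical training split: [tumor size in mm, positive lymph nodes].
# Label 1 = survived, 0 = did not survive (made-up values, not SEER data).
train_rows = [[12, 0], [30, 4], [8, 0], [15, 1], [22, 0], [10, 0], [45, 9], [60, 12]]
train_labels = [1, 1, 1, 1, 1, 1, 0, 0]

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(Counter(balanced_labels))
```

One design point worth keeping in mind: resampling like this is normally applied to the training split only, so that duplicated rows never leak into the data used for ROC-AUC, precision, recall, or F1.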
Simultaneously, we explored simulated SNP datasets that contain over 20,000 features. These were designed to mimic real-world genomic challenges involving additive, epistatic, and heterogeneous interactions. Our goal was to develop machine learning pipelines capable of identifying meaningful patterns despite the noise and complexity. We experimented with manual feature selection techniques and multiple classification strategies to compare performance across conditions. This project has been especially useful in refining our skills in high-dimensional modeling and understanding the trade-offs between interpretability, performance, and computational cost.
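As a rough illustration of what feature selection on SNP data can look like, the sketch below ranks SNP columns by how much their average genotype differs between the two classes and keeps the top few. This is an assumed, deliberately simple univariate filter, not necessarily one of the techniques we tried; the helper name `rank_snps`, the 0/1/2 genotype coding, and the toy matrix are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def rank_snps(genotypes, labels, top_k):
    """Return indices of the top_k SNPs by |mean(class 1) - mean(class 0)|."""
    n_features = len(genotypes[0])
    cases = [row for row, y in zip(genotypes, labels) if y == 1]
    controls = [row for row, y in zip(genotypes, labels) if y == 0]

    scores = []
    for j in range(n_features):
        case_mean = mean([row[j] for row in cases])
        control_mean = mean([row[j] for row in controls])
        scores.append((abs(case_mean - control_mean), j))

    # Highest score first; ties are broken by the lower column index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:top_k]]


# Hypothetical toy data: 6 samples x 4 SNPs, coded as minor-allele counts (0, 1, 2).
genotypes = [
    [2, 0, 1, 1],
    [2, 1, 0, 1],
    [1, 2, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 2, 2, 1],
]
labels = [1, 1, 1, 0, 0, 0]

print(rank_snps(genotypes, labels, top_k=2))
```

A filter like this scores each SNP on its own, so it suits additive effects but can miss purely epistatic signals, where SNPs only matter in combination. That is part of the interpretability versus performance trade-off mentioned above.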
In addition to modeling, we began developing a prototype of a Streamlit-based front-end interface to make our models interactive. Rather than relying solely on documentation or static plots, we want users to be able to input sample data, view predictions, and explore model explanations through an intuitive interface. This practical presentation of our work is intended to highlight the applied potential of our models in clinical or research settings.
Looking ahead, we plan to begin work on the third project, "Information Extraction from Pathology Reports," which will involve using natural language processing to extract TNM staging and cancer types from unstructured medical text. This addition will allow us to address three key data modalities: structured clinical data, high-dimensional genomic data, and free-text clinical narratives. Tackling this range of data types will make Onco-Logic a more comprehensive and realistic demonstration of AI applications in oncology.
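Since this work has not started yet, the following is only a sketch of the simplest possible baseline: a regular expression that pulls compact TNM strings such as "pT2 N1a M0" out of free text. The pattern, the function name `extract_tnm`, and the sample sentences are assumptions made for illustration and are not taken from the challenge data.

```python
import re

# Assumed pattern for compact TNM strings, e.g. "pT2 N1a M0" or "ypT1cN0M0".
# Prefix letters: c (clinical), p (pathological), y (post-therapy), r (recurrence).
TNM_PATTERN = re.compile(
    r"\b(?P<prefix>[cpyr]{0,2})"
    r"T(?P<t>is|[0-4Xx][a-d]?)\s*"
    r"[cp]?N(?P<n>[0-3Xx][a-c]?)\s*"
    r"[cp]?M(?P<m>[01Xx])\b"
)


def extract_tnm(text):
    """Return one dict per TNM string found in the text."""
    return [
        {
            "prefix": match.group("prefix"),
            "T": match.group("t"),
            "N": match.group("n"),
            "M": match.group("m"),
        }
        for match in TNM_PATTERN.finditer(text)
    ]


# Hypothetical report snippets written for this example.
reports = [
    "Final diagnosis: invasive ductal carcinoma, pT2 N1a M0.",
    "Staging summary: ypT1cN0M0 after neoadjuvant therapy.",
    "No staging information recorded.",
]

for report in reports:
    print(extract_tnm(report))
```

A rule like this only catches staging written in this compact form with all three components present. Real reports may omit M or describe staging in prose, which is where a proper NLP approach would be needed.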
Overall, the first week has been productive and rewarding. The decision to develop in parallel rather than in a single collaborative stream has brought unexpected benefits. We have generated a range of methods and solutions that strengthen our understanding and improve the overall quality of our models. We are now in a strong position to integrate the best parts of each effort into a unified system. As we move into the next phase, we will focus on refining the Streamlit interface, completing the modeling work for the second project, and kicking off the third. We are excited by the progress so far and look forward to sharing more in the weeks ahead as Onco-Logic takes shape.