<Description of the project>
Identifying microRNA (miRNA) biomarkers in rare cancers remains a major challenge due to limited patient cohorts and high tumor heterogeneity. To overcome these limitations, we developed IsoFMiR — an unsupervised anomaly detection framework designed to discover cancer-specific miRNA signatures without relying on large, labeled datasets.
Built upon the Isolation Forest algorithm, IsoFMiR leverages the concept that samples from rare cancers such as sarcomas exhibit distinct molecular patterns when compared to more common malignancies. By training the model on miRNA expression profiles from abundant cancer datasets and subsequently applying it to sarcoma samples, IsoFMiR isolates anomalous expression signatures that may represent potential biomarkers.
This approach addresses key limitations of conventional supervised learning methods, which often fail under data-scarce conditions. Through this unsupervised design, IsoFMiR facilitates robust biomarker discovery, even in cases where annotated samples are minimal. The framework was implemented using Python 3.11.9 and R 4.4.2, with the scikit-learn library used to build the Isolation Forest model. Sequencing data were sourced from publicly available repositories, namely The Cancer Genome Atlas (TCGA) and the Xena Browser, which supports transparency and reproducibility.
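The train-on-common, score-on-rare idea can be illustrated with a minimal sketch. It is not the project's actual script: the data are synthetic, and the feature count, cohort sizes, expression shift, and `n_estimators` value are all assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n_mirnas = 10  # hypothetical number of miRNA features

# Synthetic stand-ins for log-scaled expression (rows = samples, columns = miRNAs).
common_train = rng.normal(5.0, 1.0, size=(300, n_mirnas))
common_holdout = rng.normal(5.0, 1.0, size=(60, n_mirnas))
rare = rng.normal(5.0, 1.0, size=(40, n_mirnas))
rare[:, :5] += 6.0  # assumption: five miRNAs are shifted in the rare cohort

# Fit only on the abundant cohort; no labels are used.
model = IsolationForest(n_estimators=200, random_state=0)
model.fit(common_train)

# score_samples: lower values mean more anomalous.
holdout_scores = model.score_samples(common_holdout)
rare_scores = model.score_samples(rare)

# predict: -1 marks a sample flagged as an outlier, 1 an inlier.
rare_flags = model.predict(rare)

print("mean score, held-out common:", holdout_scores.mean())
print("mean score, rare cohort:", rare_scores.mean())
print("rare samples flagged:", int((rare_flags == -1).sum()), "of", len(rare_flags))
```

Comparing the rare cohort against held-out common samples, rather than against the training samples themselves, keeps the comparison fair, since a model tends to score its own training data as slightly more normal.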
Overall, IsoFMiR represents a scalable and data-efficient framework that enables the identification of clinically relevant miRNA biomarkers in rare cancers. By combining unsupervised learning with accessible genomic resources, it paves the way for improved understanding of sarcoma biology and the development of novel diagnostic and prognostic strategies.
<Data access information>
GDC TCGA: https://portal.gdc.cancer.gov/
Xena Browser: https://xenabrowser.net/
<Description of the code>
This set of scripts runs the IsoFMiR algorithm. Each script is described below; run them in the order listed.
Creating_Training_Data_Set → Reads multiple CSV files containing microRNA data, extracts patient information, and generates a training dataset with up to n patients.
Patients_Not_Used_For_Training → Creates a dataset of patients excluded from training. Provide the paths to the training data and patient information files generated by the Creating_Training_Data_Set script.
Training → Trains an Isolation Forest model using microRNA data from non-sarcoma cancer samples.
Testing → Evaluates the model’s ability to distinguish sarcoma samples from other cancer types.
Consensus_and_SHAP_Analysis → Performs consensus and SHAP analyses. The script randomly selects 50 test samples by default; this number can be adjusted as needed.
Overall_Survival_Code → Calculates overall survival for the selected microRNAs in sarcoma patients.
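The first two steps of the list above (building a capped training set, then collecting the patients left out of it) might look roughly like the sketch below. It is an illustration only: the function names, the `patient_id` and `cancer_type` columns, the one-row-per-patient layout, and the reading of "up to n patients" as a cap on the pooled total are all assumptions, and small in-memory tables stand in for the CSV files.

```python
import pandas as pd


def build_training_set(cohorts, n, seed=0):
    """Pool the cohort tables and randomly keep at most n patients."""
    pooled = pd.concat(cohorts, ignore_index=True)
    kept = pooled.sample(n=min(n, len(pooled)), random_state=seed)
    return kept.reset_index(drop=True)


def patients_not_used(cohorts, training):
    """Return every pooled patient whose ID is absent from the training set."""
    pooled = pd.concat(cohorts, ignore_index=True)
    unused = ~pooled["patient_id"].isin(training["patient_id"])
    return pooled[unused].reset_index(drop=True)


# Hypothetical cohorts with one row per patient and one miRNA column.
brca = pd.DataFrame(
    {"patient_id": ["B1", "B2", "B3"], "cancer_type": "BRCA", "miR_a": [5.1, 4.8, 5.3]}
)
luad = pd.DataFrame(
    {"patient_id": ["L1", "L2"], "cancer_type": "LUAD", "miR_a": [6.0, 5.7]}
)

training = build_training_set([brca, luad], n=3)
held_out = patients_not_used([brca, luad], training)

print(len(training), "patients for training;", len(held_out), "held out")
```

Keeping the two sets disjoint by patient ID means the held-out patients can later serve as unseen non-sarcoma samples when the model is tested.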
(2025)