Big Data to High Dimensional Feature Selection and Survival Prediction with Maximization of ROC Utility
Abstract: Statistical learning models are widely used in business analytics, such as Google marketing, and in biomedical research. I will present several of our recent developments in predictive analysis for identifying features, or patterns of features, that are capable of predicting outcomes such as whether a Google advertisement banner results in a sale, disease status, and patient survival. This talk will focus on theory and methods with biomedical applications, for example, molecular signatures predictive of patient survival. The first development is a regularized statistical learning model. In supervised learning and disease classification, most standard methods are designed to maximize overall accuracy and cannot explicitly assign different costs to different classes. Integrating advances in machine learning, optimization, and statistics, we propose a novel nonparametric method (a regularized model) that selects variables by explicitly maximizing a relevant function of the receiver operating characteristic (ROC) curve, e.g., a weighted combination of specificity and sensitivity, or the global area under the curve (AUC) under an L1 penalty. Experimental results with chemotherapy and large genomics data demonstrate that the proposed procedures can be used to identify important genes and pathways related to cancer survival and to build a parsimonious model for predicting the survival of future patients.
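To make the idea of L1-penalized AUC maximization concrete, here is a minimal sketch on synthetic data. It is not the speaker's method: it assumes a linear score, replaces the non-smooth AUC with a pairwise logistic surrogate, and fits it by proximal gradient descent with soft-thresholding. The function names (`fit_l1_auc`, `empirical_auc`, `soft_threshold`), the penalty weight `lam`, the step size, and the simulated data are all illustrative assumptions.

```python
import numpy as np


def empirical_auc(scores, y):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1 (shrinks each coefficient toward 0)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_l1_auc(X, y, lam=0.05, step=0.1, n_iter=500):
    """Hypothetical helper: minimize a pairwise logistic AUC surrogate + lam * ||w||_1."""
    pos, neg = X[y == 1], X[y == 0]
    # All pairwise feature differences x_pos - x_neg, shape (n_pos * n_neg, p).
    D = (pos[:, None, :] - neg[None, :, :]).reshape(-1, X.shape[1])
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = np.clip(D @ w, -50.0, 50.0)
        # Gradient of mean(log(1 + exp(-margin))) with respect to w.
        grad = -(D * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        # Gradient step on the smooth part, then the L1 proximal step.
        w = soft_threshold(w - step * grad, step * lam)
    return w


# Synthetic data (an assumption): only the first two of 20 features carry signal.
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

w = fit_l1_auc(X, y)
auc = empirical_auc(X @ w, y)
print("nonzero coefficients:", np.flatnonzero(w))
print("training AUC:", round(auc, 3))
```

Larger values of `lam` shrink more coefficients exactly to zero, which is how a penalty of this kind can perform variable selection while the objective still targets ranking quality rather than overall accuracy.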