Mathematical Statistics Seminar

Samuel Müller, University of Sydney

The categorical-continuous pliable Lasso to identify brain regions affecting motor impairment in Huntington disease

In many clinical studies, prediction models are essential for forecasting and monitoring the progression of a disease. Developing such models is challenging with high-dimensional data: we do not know in advance which variables are related to the response of interest, and that relationship may itself depend on other continuous or categorical modifying variables. We formalize this problem as variable selection in a varying-coefficient model and propose a novel variable selection method, c2pLasso, that accounts for both continuous and categorical modifying variables.
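As a rough illustration of what a varying-coefficient model means here, the sketch below assumes the linear form used by the pliable Lasso, in which each predictor's coefficient is a main effect plus a linear function of the modifying variables. All function names, variable names and numbers are hypothetical. The sketch covers only the model form, not the c2pLasso penalty or its fitting algorithm.

```python
from typing import List, Sequence


def one_hot(level: int, n_levels: int) -> List[float]:
    """Dummy-code a categorical modifier; its columns form one natural group."""
    return [1.0 if k == level else 0.0 for k in range(n_levels)]


def predict(
    x: Sequence[float],             # predictors, e.g. brain-region volumes
    z: Sequence[float],             # modifiers: dummy columns plus continuous ones
    beta0: float,                   # intercept
    theta0: Sequence[float],        # direct effects of the modifiers
    beta: Sequence[float],          # main effect of each predictor
    theta: Sequence[Sequence[float]],  # theta[j][k]: how z[k] shifts the effect of x[j]
) -> float:
    """Prediction for one subject under the assumed varying-coefficient form."""
    y = beta0 + sum(t * zk for t, zk in zip(theta0, z))
    for j, xj in enumerate(x):
        # The coefficient of x[j] varies with the modifiers z.
        coef_j = beta[j] + sum(t * zk for t, zk in zip(theta[j], z))
        y += xj * coef_j
    return y


# Hypothetical setup: 2 predictors, a 3-level severity factor and 1 continuous modifier.
x = [1.0, 2.0]
beta0, theta0 = 0.5, [0.0, 0.0, 0.0, 0.0]
beta = [1.0, 0.0]
theta = [
    [0.0, 0.0, 0.0, 1.0],  # effect of x[0] grows with the continuous modifier
    [0.0, 0.0, 2.0, 0.0],  # x[1] matters only at the highest severity level
]

z_mild = one_hot(0, 3) + [0.5]
z_severe = one_hot(2, 3) + [0.5]
y_mild = predict(x, z_mild, beta0, theta0, beta, theta)
y_severe = predict(x, z_severe, beta0, theta0, beta, theta)
print(y_mild, y_severe)
```

In this toy setup the second predictor has no main effect and contributes only for subjects at the highest severity level, which is the kind of severity-dependent relationship the abstract refers to.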

Our contributions are three-fold:

    The c2pLasso method is shown to screen out irrelevant variables better than the existing method, which ignores the group structure of categorical modifying variables, and to lead to a prediction model with higher accuracy and easier interpretation.

    Our method adequately considers the pre-specified group structure among modifying variables in addition to unstructured modifying variables.

    The c2pLasso is empirically shown to perform better than existing methods such as the Lasso and pLasso, even when there is no categorical modifying variable and no pre-specified group structure among the modifying variables.

Using simulation studies, we show that our method selects fewer irrelevant variables than existing methods while still choosing the relevant variables correctly. This yields a prediction model with higher specificity, a lower false discovery rate and a lower mean squared error. The proposed methodology is motivated by and illustrated using data from a Huntington disease study; the analysis identifies brain regions associated with motor impairment while accounting for relationships that differ by disease severity. To the best of our knowledge, our study is the first to identify the interaction effect between disease severity and the volume of brain regions in a varying-coefficient model framework.

This is joint work with Rakheon Kim and Tanya Garcia, both in the Department of Statistics at Texas A&M.

