Theme: Workshop on Semi- and Nonparametric Statistical Learning
Organizers: Center of Statistical Research, School of Statistics, and the Office of Scientific Research
Topic 1: Treatment Allocations Based on Multi-Armed Bandit Strategies
Yuhong Yang received his Ph.D. in statistics from Yale in 1996. He then joined the Department of Statistics at Iowa State University and moved to the University of Minnesota in 2004, where he has been a full professor since 2007. His research interests include model selection, multi-armed bandit problems, forecasting, high-dimensional data analysis, and machine learning. He is a fellow of the Institute of Mathematical Statistics. For more details, see his personal homepage: http://users.stat.umn.edu/~yangx374/
In the practice of medicine, multiple treatments are often available to treat individual patients. Identifying the best treatment for a specific patient is very challenging due to patient inhomogeneity. The multi-armed bandit with covariates provides a framework for designing effective treatment allocation rules that integrate learning from experimentation with maximizing the benefits to the patients along the way.
In this talk, we present new strategies to achieve asymptotically efficient or minimax optimal treatment allocations. Since many nonparametric and parametric methods in supervised learning may be applied to estimating the mean treatment outcome functions (in terms of the covariates), but guidance on how to choose among them is generally unavailable, we propose a model-combining allocation strategy for adaptive performance and show its strong consistency. When the mean treatment outcome functions are smooth, rates of convergence can be studied to quantify the effectiveness of a treatment allocation rule in terms of the overall benefits the patients have received. A multi-stage randomized allocation algorithm with arm elimination is proposed to combine flexibility in treatment outcome function modeling with a theoretical guarantee on the overall treatment benefits. Numerical results are given to demonstrate the performance of the new strategies.
The talk is based on joint work with Wei Qian.
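As a rough illustration of the covariate-dependent allocation problem (not the talk's actual strategies), here is a minimal epsilon-greedy sketch with two hypothetical treatments whose mean outcomes are linear in a single covariate; the arm definitions, noise level, and fixed exploration rate are all illustrative assumptions:

```python
import random

random.seed(0)

# Two hypothetical treatments whose mean outcomes depend on a covariate x in [0, 1].
# Arm 0 is better for small x, arm 1 for large x (an illustrative choice).
def true_mean(arm, x):
    return 1.0 - x if arm == 0 else x

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (simple closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return my - b * mx, b

history = {0: ([], []), 1: ([], [])}  # per-arm (covariates, outcomes)
regret = 0.0
eps = 0.1  # fixed forced-exploration rate; the talk's strategies adapt this
for t in range(2000):
    x = random.random()
    if random.random() < eps or min(len(history[a][0]) for a in (0, 1)) < 5:
        arm = random.randrange(2)  # explore
    else:
        # Exploit: allocate to the arm with the higher estimated mean at x.
        preds = []
        for a in (0, 1):
            a0, b0 = fit_line(*history[a])
            preds.append(a0 + b0 * x)
        arm = 0 if preds[0] >= preds[1] else 1
    y = true_mean(arm, x) + random.gauss(0, 0.1)
    history[arm][0].append(x)
    history[arm][1].append(y)
    regret += max(true_mean(0, x), true_mean(1, x)) - true_mean(arm, x)

print(round(regret / 2000, 3))  # per-round regret; small under near-greedy allocation
```

The per-round regret quantifies the overall benefit lost by the patients relative to an oracle that knows both outcome functions, which is the performance measure the rates of convergence above describe.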
Topic 2: Testing of Significance for High-Dimensional Longitudinal Data
Runze Li is an endowed chair professor in the Department of Statistics at Pennsylvania State University. His research interests include variable selection and feature screening for high-dimensional data, as well as modeling and statistical inference for nonparametric and semiparametric models. He currently serves as an associate editor of JASA. He is a fellow of the IMS, the ASA, and the AAAS. For more details, see his personal homepage: http://stat.psu.edu/people/ril4
This paper concerns statistical inference for longitudinal data with ultrahigh dimensional covariates. We first study the problem of constructing confidence intervals and hypothesis tests for a low dimensional parameter of interest. The major challenge is how to construct an optimal test statistic in the presence of high dimensional nuisance parameters and the sophisticated dependence among measurements. To deal with the challenge, we propose a novel quadratic decorrelated inference function approach, which simultaneously removes the impact of nuisance parameters and incorporates the correlation to enhance the efficiency of the estimation procedure. We prove that the proposed estimator is asymptotically normal and attains the semiparametric information bound, based on which we can construct an optimal test statistic for the parameter of interest. We then study how to control the false discovery rate (FDR) when a vector of high-dimensional regression parameters is of interest. We prove that applying Storey's (2002) procedure to the proposed test statistics for each regression parameter controls the FDR asymptotically in longitudinal data. We conduct simulation studies to assess the finite sample performance of the proposed procedures. Our simulation results imply that the newly proposed procedure controls both the Type I error for testing a low dimensional parameter of interest and the FDR in the multiple testing problem. Finally, we apply the proposed procedure to a real data example.
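For the multiple-testing step, a toy sketch of a Storey-type FDR procedure may help fix ideas; the p-values below are simulated stand-ins for the per-coordinate test statistics described in the talk, and the tuning constant `lam` is an assumption:

```python
import random

random.seed(1)

def storey_fdr(pvals, alpha=0.1, lam=0.5):
    """Storey (2002)-style procedure: estimate the null proportion pi0, then
    take the largest threshold t with estimated FDR pi0*m*t / #{p <= t} <= alpha."""
    m = len(pvals)
    pi0 = min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * m))
    rejected_t = 0.0
    for t in sorted(pvals):
        r = sum(p <= t for p in pvals)
        if r > 0 and pi0 * m * t / r <= alpha:
            rejected_t = t
    return [i for i, p in enumerate(pvals) if p <= rejected_t]

# Toy p-values: 50 nulls (uniform) followed by 10 strong signals.
pvals = [random.random() for _ in range(50)] + [random.random() * 1e-4 for _ in range(10)]
hits = storey_fdr(pvals, alpha=0.1)
print(len(hits))  # expect roughly the 10 signals, possibly a few extras
```

The same thresholding logic applies whatever valid per-parameter p-values are fed in; the paper's contribution is showing it remains valid for the decorrelated statistics under longitudinal dependence.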
Topic 3: Joint Analysis of Interval-Censored Failure Time Data and Panel Count Data
Jianguo Sun is a professor in the Department of Statistics at the University of Missouri. He received his Ph.D. from the University of Waterloo in 1992. His research interests include biostatistics, survival analysis, longitudinal data analysis, and chemometrics. He is a fellow of the Institute of Mathematical Statistics and the ASA. For more details, see his personal homepage: https://www.stat.missouri.edu/people/sunj
Interval-censored failure time data and panel count data are two types of incomplete data that commonly occur in event history studies, and many methods have been developed for analyzing each of them separately (Sun, 2006; Sun and Zhao, 2013). Sometimes one may be interested in, or may need to conduct, a joint analysis of the two, as in clinical trials with composite endpoints, for which no established approach seems to exist in the literature. This talk will discuss this problem and present a sieve maximum likelihood approach. Some simulation results and an application will also be provided.
Topic 4: Modeling Hybrid Dependent Responses
Dr. Heping Zhang is the Susan Dwight Bliss Professor of Biostatistics, Professor of Statistics and Data Science, and Professor in the Child Study Center at Yale University. He founded and directs Yale's Collaborative Center for Statistics in Science. He is also an honorary professor at the University of Hong Kong, a scholar of China's Thousand Talents Program, a Changjiang Chair Professor, and president-elect of the International Chinese Statistical Association. He is the founding editor-in-chief of the journal Statistics and Its Interface, and currently serves on editorial boards for the Journal of the American Statistical Association (JASA), genetic epidemiology, and special-topic research on reproduction and infertility. His research interests include nonparametric methods, longitudinal data, statistical genetics and bioinformatics, clinical trials, statistical modeling of epidemiological data, brain imaging analysis, statistical computing, and statistical methods for the behavioral sciences. He has published more than 280 research articles in high-impact journals in statistics, genetics, epidemiology, and psychiatry. For more details, see his personal homepage: https://publichealth.yale.edu/people/heping_zhang-2.profile
I will present a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well developed. Recent rank-based methods have been utilized for conducting association analyses of hybrid traits but do not inform the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well developed nor extensively used in practice due to their reliance on complicated likelihood functions with high computational complexity. Many existing parametric frameworks instead rely on pseudo-likelihoods to reduce the computational burden. Here, we develop a model fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood can control the type-I error rate, gains power, and improves effect size estimation compared with several existing methods for hybrid models. These advantages remain even if the distribution of the latent variables is misspecified. After analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits. Although the analysis of the SAGE data motivated us to develop the model, it can be broadly applied to analyze any hybrid responses.
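As a minimal illustration of why hybrid traits call for joint modeling, here is a made-up latent-liability example (not the paper's model): a continuous trait and a binary trait driven by a shared latent variable are dependent, so analyzing them separately discards information:

```python
import random
import statistics

random.seed(2)

# Illustrative data-generating sketch: a shared latent liability Z drives
# both a continuous trait (e.g. a consumption score) and a binary trait
# (e.g. a diagnosis), making the observed "hybrid" traits dependent.
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]
y_cont = [zi + random.gauss(0, 0.5) for zi in z]                   # continuous trait
y_bin = [1 if zi + random.gauss(0, 0.5) > 0 else 0 for zi in z]    # binary trait

# Dependence shows up as a higher mean continuous trait among "cases".
mean_cases = statistics.mean(y for y, d in zip(y_cont, y_bin) if d == 1)
mean_ctrls = statistics.mean(y for y, d in zip(y_cont, y_bin) if d == 0)
print(round(mean_cases - mean_ctrls, 2))  # clearly positive here
```

A full-likelihood approach, as in the talk, models the joint distribution of both traits through the latent variable rather than working with each margin or a pseudo-likelihood.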
Topic 5: An Empirical Comparison of Deep Learning and Other Methods for Prediction of Protein Subcellular Localization with Microscopy Images
We compare the performance of deep-learning methods and more traditional machine-learning methods for predicting protein subcellular localization from a large dataset of single-cell microscopy images. Specifically, we show that various VGG-type convolutional neural networks (CNNs) and residual CNNs (ResNets) outperform random forests and gradient boosting. We also demonstrate the use of CNNs for transfer learning and feature extraction.
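To make the feature-extraction idea concrete, here is a from-scratch toy of the convolution, ReLU, and pooling pipeline a CNN applies to an image; the kernel and "image" are hand-set for illustration, whereas the study uses trained VGG/ResNet models with learned filters:

```python
# Toy sketch: one convolutional layer with ReLU, then global average pooling.
def conv2d_relu(img, kernel):
    """Valid 2-D convolution of an image (list of rows) with a small kernel,
    followed by a ReLU nonlinearity."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            s = sum(kernel[a][b] * img[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out

# A vertical-edge kernel and a tiny image whose right half is bright.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
img = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
fmap = conv2d_relu(img, kernel)
# Global average pooling reduces the feature map to a single scalar feature.
feature = sum(sum(r) for r in fmap) / (len(fmap) * len(fmap[0]))
print(feature)  # positive: the filter responds to the left-to-right edge
```

In the transfer-learning setting, such pooled activations from a network pre-trained on other images serve as input features for a downstream classifier, which is one of the uses the abstract mentions.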
Topic 6: Noise Injection Regularization in Large Models with Applications to Neural Networks and Graphical Models
Prof. Fang Liu is currently an Associate Professor and the Director of Graduate Studies in the Department of Applied and Computational Mathematics and Statistics at the University of Notre Dame. Prof. Liu's research interests include the development of statistical methods for protecting data privacy, missing data analysis, Bayesian methods and modelling, statistical learning and regularization of complex models, and applications of statistics to biological and social science data. For more details, see her personal homepage: http://acms.nd.edu/people/faculty/fang-liu/
The noise injection regularization technique (NIRT) is an approach to mitigate over-fitting in large models. In this talk, I will demonstrate the applications of the NIRT in two scenarios of learning large models: neural networks (NNs) and graphical models (GMs). For NNs, we develop a NIRT called whiteout that injects adaptive Gaussian noise during the training of NNs. We show that the optimization objective function associated with whiteout in generalized linear models has a closed-form penalty term that has connections with a wide range of regularizations and includes the bridge, lasso, ridge, and elastic net penalization as special cases; it can also be extended to offer regularizations similar to the adaptive lasso and group lasso. For GMs, we develop an AdaPtive Noisy Data Augmentation regularization (PANDA) approach to promote sparsity in estimating individual graphical models and similarity among multiple graphs through training of generalized linear models. On the algorithmic level, PANDA can be implemented in a straightforward manner by iteratively solving for MLEs without constrained optimization. For both the whiteout and PANDA approaches, we use simulated and real-life data to demonstrate their applications and show their superiority over, or comparability with, existing methods.
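The noise-to-penalty connection can be checked numerically in the simplest case (a sketch under illustrative weights and noise level, not whiteout's adaptive scheme): for a linear model, injecting N(0, sigma^2) input noise inflates the expected squared-error loss by a ridge-type term sigma^2 * ||w||^2:

```python
import random

random.seed(3)

# Illustrative fixed weights, input, and target (all made up).
w = [0.8, -1.5, 2.0]
x = [1.0, 2.0, -0.5]
y = 0.3
sigma = 0.4  # standard deviation of the injected input noise

def sq_loss(weights, inputs, target):
    pred = sum(wi * xi for wi, xi in zip(weights, inputs))
    return (target - pred) ** 2

clean = sq_loss(w, x, y)

# Monte-Carlo average of the loss with Gaussian noise injected into inputs.
draws = 200000
noisy = 0.0
for _ in range(draws):
    xn = [xi + random.gauss(0, sigma) for xi in x]
    noisy += sq_loss(w, xn, y)
noisy /= draws

# The expected inflation is a ridge penalty: sigma^2 * sum(w_i^2).
penalty = sigma ** 2 * sum(wi ** 2 for wi in w)
print(round(noisy - clean, 2), round(penalty, 2))  # the two should be close
```

Whiteout generalizes this by letting the injected noise variance adapt to the current weights, which is how the bridge, adaptive-lasso, and group-lasso-type penalties mentioned above arise.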
Topic 7: A New Joint Screening Method for Right-Censored Time-to-Event Data with Ultrahigh-Dimensional Covariates
Gang Li is Professor of Biostatistics and Biomathematics at the University of California, Los Angeles (UCLA) and Director of the Biostatistics Shared Resource of UCLA's Jonsson Comprehensive Cancer Center. He has published extensively, with over 110 peer-reviewed articles in statistical research and applied work in the areas of survival analysis, longitudinal data analysis, high-dimensional data analysis, clinical trials, and evaluation of biomarkers. Dr. Li is an elected fellow of the Institute of Mathematical Statistics, an elected fellow of the American Statistical Association, an elected member of the International Statistical Institute, and an elected fellow of the Royal Statistical Society. For more details, see his personal homepage: https://faculty.biostat.ucla.edu/gangli/
In an ultrahigh dimensional setting with a huge number of covariates, variable screening is useful for dimension reduction before a more refined variable selection and parameter estimation method is applied. This paper proposes a new sure joint screening procedure for right-censored time-to-event data based on a sparsity-restricted semiparametric accelerated failure time model. Our method, referred to as Buckley-James assisted sure screening (BJASS), consists of an initial screening step using a sparsity-restricted least-squares estimate based on a synthetic time variable and a refinement screening step using a sparsity-restricted least-squares estimate with the Buckley-James imputed event times. The refinement step may be repeated several times to obtain more stable results. We show that with any fixed number of refinement steps, the BJASS procedure retains all important variables with probability tending to 1. Simulation results are presented to illustrate its performance in comparison with some marginal screening methods. A real data example is provided using diffuse large-B-cell lymphoma (DLBCL) data. We have implemented the BJASS method in Matlab and R; the code is available to readers upon request.
(This talk is based on joint work with Yi Liu and Xiaolin Chen)
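The sparsity-restricted least-squares idea at the core of such joint screening can be sketched on fully observed data as iterative hard thresholding; this toy omits the synthetic-variable and Buckley-James imputation steps that handle censoring, and the dimensions, signal strengths, and step size are all illustrative assumptions:

```python
import random

random.seed(4)

# Toy joint screening via iterative hard thresholding: at each step, take a
# gradient step on the least-squares loss over ALL covariates jointly, then
# keep only the k coefficients largest in magnitude.
n, p, k = 200, 50, 8
beta = [3.0] * 5 + [0.0] * (p - 5)  # the first five covariates matter
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(b * xij for b, xij in zip(beta, xi)) + random.gauss(0, 1) for xi in X]

est = [0.0] * p
step = 0.4 / n  # small enough for the gradient iteration to be stable here
for _ in range(100):
    # Gradient of 0.5 * sum_i (y_i - x_i' est)^2.
    resid = [yi - sum(e * xij for e, xij in zip(est, xi)) for yi, xi in zip(y, X)]
    grad = [sum(r * xi[j] for r, xi in zip(resid, X)) for j in range(p)]
    est = [e + step * g for e, g in zip(est, grad)]
    # Hard-threshold: retain the k largest coefficients, zero out the rest.
    keep = set(sorted(range(p), key=lambda j: -abs(est[j]))[:k])
    est = [est[j] if j in keep else 0.0 for j in range(p)]

screened = sorted(j for j in range(p) if est[j] != 0.0)
print(screened)  # a small superset that should contain the important covariates
```

Keeping k somewhat larger than the number of true signals mirrors the sure-screening philosophy: retain all important variables with high probability and leave the final pruning to a refined selection method.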