🎓 I am a Ph.D. candidate in Economics at the University of Southern California, where I have the privilege of being advised by Professor Roger Moon (Chair), Professor Cheng Hsiao, and Professor Geert Ridder. Before coming to LA, I received an M.Phil. in Economics ('18) and a B.Sc. in Mathematics ('16) from The Chinese University of Hong Kong, where I was fortunate to be advised by Professor Zhentao Shi.
📝 I will be on the academic job market for AY 2024-25 and available for interviews.
📖 My primary research field is econometrics, with a focus on machine learning methods for high-dimensional time series and panel data. My research aims to push the boundaries of econometric techniques for modern big data, capturing rich heterogeneity and dynamics to enable valid inference and powerful prediction. Additionally, I strive to bridge the gap between cutting-edge theory and empirical practice. My research interests also encompass biostatistics and health economics, financial econometrics, computational methods, labor and personnel economics, and the economics of digital platforms.
💬 zhangao [at] usc [dot] edu | gaozhan [dot] cuhk [at] gmail [dot] com
Research
Peer-reviewed Publications
- (2023) “Identification and Estimation of Categorical Random Coefficient Models”, (with M. Hashem Pesaran), Empirical Economics, 64(6), 2543–2588.
supplement | R package ccrm | arXiv: 2303.14380 | poster | slides
Abstract
This paper proposes a linear categorical random coefficient model, in which the random coefficients follow parametric categorical distributions. The distributional parameters are identified based on a linear recurrence structure of moments of the random coefficients. A Generalized Method of Moments estimator is proposed and its finite sample properties are examined using Monte Carlo simulations. The utility of the proposed method is illustrated by estimating the distribution of returns to education in the U.S. by gender and educational levels. We find mean return to education to be lower and less heterogeneous for people with high school or less, as compared to those with postsecondary education.
- (2023) “Copula Graphic Estimation of Survival Function with Dependent Censoring and its Application to an Analysis of Pancreatic Cancer Clinical Trial”, (with Jung Hyun Jo, Inkyung Jung, Hyungsik Roger Moon, Geert Ridder and Si Young Song), Statistical Methods in Medical Research, 32(5), 944-962.
R package CopulaGraphic
Abstract
In this article, we consider a survival function estimation method that may be suitable for analyses of clinical trials of cancer treatments whose prognosis is known to be poor such as pancreatic cancer treatment. Typically, these kinds of trials are not double-blind, and patients in the control group may drop out in more significant numbers than in the treatment group if their disease progresses (DP). If disease progression is associated with a higher risk of death, then censoring becomes dependent. To estimate the survival function with dependent censoring, we use copula-graphic estimation, where a parametric copula function is used to model the dependence in the joint survival function of the event and censoring time. In this article, we propose a novel method that one can use in choosing the copula parameter. As an application example, we estimate the survival function of the overall survival time of the KG4/2015 study, the phase 3 clinical trial of the efficacy of GV1001 as a treatment for pancreatic cancer. We provide both statistical and clinical pieces of evidence that support the violation of independent censoring. Applying the estimation method with dependent censoring, we obtain that the estimates of the median survival times are 339 days in the treatment group and 225.5 days in the control group. We also find that the estimated difference of the medians is 113.5 days, and the difference is statistically significant at the one-sided level with size 2.5%.
- (2022) “On LASSO for Predictive Regression”, (with Ji Hyung Lee and Zhentao Shi), Journal of Econometrics, 229(2), 322-349.
supplement | replication code | arXiv: 1810.03140 | slides
Abstract
Explanatory variables in a predictive regression typically exhibit low signal strength and various degrees of persistence. Variable selection in such a context is of great importance. In this paper, we explore the pitfalls and possibilities of the LASSO methods in this predictive regression framework. In the presence of stationary, local unit root, and cointegrated predictors, we show that the adaptive LASSO cannot asymptotically eliminate all cointegrating variables with zero regression coefficients. This new finding motivates a novel post-selection adaptive LASSO, which we call the twin adaptive LASSO (TAlasso), to restore variable selection consistency. Accommodating the system of heterogeneous regressors, TAlasso achieves the well-known oracle property. In contrast, conventional LASSO fails to attain coefficient estimation consistency and variable screening in all components simultaneously. We apply these LASSO methods to evaluate the short- and long-horizon predictability of S&P 500 excess returns.
- (2021) “Implementing Convex Optimization in R: Two Econometric Examples”, (with Zhentao Shi), Computational Economics, 58, 1127-1135.
supplement | replication code | arXiv: 1806.10423
Abstract
Economists specify high-dimensional models to address heterogeneity in empirical studies with complex big data. Estimation of these models calls for optimization techniques to handle a large number of parameters. Convex problems can be effectively executed in modern programming languages. We complement Koenker and Mizera (J Stat Softw 60(5):1–23, 2014)’s work on numerical implementation of convex optimization, with focus on high-dimensional econometric estimators. Combining R and the convex solver MOSEK achieves speed gain and accuracy, demonstrated by examples from Su et al. (Econometrica 84(6):2215–2264, 2016) and Shi (J Econom 195(1):104–119, 2016). Robust performance of convex optimization is witnessed across platforms. The convenience and reliability of convex optimization in R make it easy to turn new ideas into executable estimators.
Working Papers
- “Generalized Method of Moments with Grouped Heterogeneous Validity in Panel Data Models”, Job Market Paper.
Abstract
This paper provides a unified framework for the selection of valid moment conditions and detection of latent group structures based on the moment condition validity in general nonlinear generalized method of moments (GMM) panel data models. It accommodates a diverging number of moment conditions and group-specific heterogeneous validity of moment conditions across agents. The proposed method integrates the pairwise adaptive fused Lasso and the adaptive Lasso regularization into the GMM estimation. The estimator is shown to be consistent and achieves classification and moment selection consistency simultaneously. The asymptotic distribution of a post-regularization estimator is derived, and its oracle properties are established. The finite-sample performance of the proposed method is evaluated through a Monte Carlo simulation experiment. The method is applied to empirically investigate the impact of agricultural productivity shocks on rural-to-urban migration in China.
- “Robust Estimation of Regression Models with Potentially Endogenous Outliers via a Modern Optimization Lens”, (with Hyungsik Roger Moon), revision requested by Econometric Reviews.
Abstract
This paper addresses the robust estimation of linear regression models in the presence of potentially endogenous outliers. Through Monte Carlo simulations, we demonstrate that existing methods using $L_1$-regularization on case-specific parameters, including the Huber estimator and the least absolute deviation (LAD) estimator, exhibit significant bias when outliers are endogenous. Motivated by this finding, we investigate $L_0$-regularized estimation methods. We propose systematic heuristic algorithms, notably a local combinatorial search refinement based on the iterative hard-thresholding solution, to solve the combinatorial optimization problem of the $L_0$-regularized estimation efficiently. Our Monte Carlo simulations yield two key results: (i) the local combinatorial search algorithm substantially improves solution quality compared to the initial projection-based hard-thresholding algorithm while offering greater computational efficiency than directly solving the original optimization problem; (ii) the $L_0$-regularized estimator demonstrates superior performance in terms of bias reduction, estimation accuracy, and out-of-sample prediction errors compared to $L_1$-regularized alternatives. In the stock return forecasting application, our method identifies the crisis periods across rolling windows and improves the prediction accuracy over baseline methods. An accompanying R package is provided for practitioners.
- “Econometric Inference for High Dimensional Predictive Regressions”, (with Ji Hyung Lee, Ziwei Mei and Zhentao Shi). Submitted.
Abstract
LASSO introduces shrinkage bias into estimated coefficients, which can adversely affect the desirable asymptotic normality and invalidate the standard inferential procedure based on the t-statistic. The desparsified LASSO has emerged as a well-known remedy for this issue. In the context of high dimensional predictive regression, the desparsified LASSO faces an additional challenge: the Stambaugh bias arising from nonstationary regressors. To restore the standard inferential procedure, we propose a novel estimator called IVX-desparsified LASSO (XDlasso). XDlasso eliminates the shrinkage bias and the Stambaugh bias simultaneously and does not require prior knowledge about the identities of nonstationary and stationary regressors. We establish the asymptotic properties of XDlasso for hypothesis testing, and our theoretical findings are supported by Monte Carlo simulations. Applying our method to real-world applications from the FRED-MD database — which includes a rich set of control variables — we investigate two important empirical questions: (i) the predictability of the U.S. stock returns based on the earnings-price ratio, and (ii) the predictability of the U.S. inflation using the unemployment rate.
- “From Isolation to Compassion: A Natural Experiment of How Stay-at-home Orders Unleashed a Wave of Virtual Altruism”, (with Mingxuan Liu, Alex Bisberg and Dimitri Williams).
Abstract
The COVID-19 pandemic and subsequent stay-at-home orders pushed many people to find social connection and support in the virtual world. As suggested by the Altruism Born of Suffering Model, this shift likely led to an increase in acts of kindness. One notable platform where this virtual altruism could be observed is Sky: Children of the Light, a popular social game that fosters player interaction and support. Bridging the Altruism Born of Suffering Model and the Self-Categorization Theory, this study adopts a difference-in-differences approach to make causal inferences on whether the initial implementation of stay-at-home orders in California, the first state in the U.S. to do so, triggered more prosocial behavior (i.e., gifting) among California players. Results from 306,504 unique daily observations of 13,932 players showed that the stay-at-home order had a positive effect on players’ prosocial behavior toward strangers, but a negative effect on players' willingness to do the same towards individuals seen as part of different groups. Theoretical and practical implications are discussed.
- “Heterogeneous Return to Education and Wage Inequality in China”, (with Xiongfei Li).
Abstract
The mounting income inequality in China has raised widespread concern; however, the trajectory of the Gini index over the past decade has convinced some researchers that China has already passed the Kuznets turning point. In this paper, we use a new categorical random coefficient model and survey data from 1995, 2002, 2007, 2013, and 2018 to estimate the distribution of returns to education in China, both overall and by group. We show that the overall return to education in China dropped by more than 54% in the recent decade and that, at the same time, the dispersion of returns is also higher across the rural/urban, postsecondary-or-below, and gender divides, as well as within groups. Our estimates should draw attention to the potential future divergence of wage income and of total income.
Software
- R package classo: Classified-Lasso proposed in Su, Shi and Phillips (2016): “Identifying latent structures in panel data”, Econometrica, 84(6), 2215-2264. Refer to Gao and Shi (2021) for details of the numerical optimization implementation.
- R package LasForecast: A versatile toolbox for predictive regressions with high-dimensional mixed-roots covariates.
- An illustration of implementing bubble testing based on Phillips, Shi and Yu (2015) and the MultipleBubbles package.
- Contribution to open-source software: bHP, fsPDA, etc. Full list: GitHub profile.
- Supplementary packages for papers are listed alongside the corresponding publication/working paper items.
Teaching
Fall 2024
Date | Topics and materials
---|---
Sep. 12 | R (basics), PDF
Sep. 24 | R (advanced), Git and GitHub, PDF
Oct. 1 | Stationarity, PDF, ARIMA
Oct. 17 | Unit roots, Bubble Testing
Nov. 12 | Volatility, PDF, VAR, Forecasting
Previous semesters:
- ECON 609 Econometric Methods | Notes - Optimization
- ECON 570 Big Data Econometrics | 01-Intro-R | 02-Git | 03-optimization
- ECON 611 Probability and Statistics
- ECON 613 Economic and Financial Time Series
- ECON 419 Advanced Econometrics