I have too many control variables… Which variables should I include in my regression model? – Healthcare Economist


Suppose you have data on the health care expenditures of different individuals, and you want to know which patient characteristics increase health care expenditures. Although this seems like something any health economist can do, measuring this relationship requires knowing (i) which independent variables to include in your data analysis, and (ii) their functional form. Item (i) can be determined from previous research and clinical experts, but even this is not perfect. Item (ii) is difficult to discern. Is there a data-driven way to do this?

A paper by Belloni, Chernozhukov and Hansen (2014) recommends using Post Double Selection (PDS) to identify relevant controls and their functional form. Consider a situation where we want to model the following:

$$
y_i = g(w_i) + \zeta_i
$$

where

$$
E(\zeta_i \mid g(w_i)) = 0
$$

The Belloni paper treats g(w) as a high-dimensional, approximately linear model, where:

$$
g(w_i) = \sum_{j=1}^{p} \beta_j x_{i,j} + r_{p,i}
$$

Note that in the Belloni framework, the number of control variables (p) can be larger than the number of observations (n). How can you have more regressors than observations? Basically because Belloni et al. require the causal relationship to be approximately sparse, meaning that out of the p control variables, only s have coefficients different from 0, where s is much smaller than n.

Belloni et al. propose identifying these s important variables using the least absolute shrinkage and selection operator (LASSO) model from Frank and Friedman (1993).
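Writing y_i for the outcome, x_{i,j} for the j-th control of individual i, λ for the penalty level and γ_j for the penalty loadings, the LASSO estimator can be written roughly as below. This is a reconstruction from the verbal description in this post, not a formula copied from the paper, so the exact normalization of the penalty term is an assumption:

$$
\hat{\beta} = \arg\min_{b} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{i,j} b_j \Big)^2 + \lambda \sum_{j=1}^{p} |b_j| \, \gamma_j
$$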

Under LASSO, the coefficients are chosen to minimize the sum of squared residuals plus a penalty term that penalizes the size of the model through the sum of the absolute values of the coefficients. The term λ is the penalty level, which sets the degree of penalty for the number of variables with non-zero (or very small) coefficients. Papers such as Belloni et al. (2012) and Belloni et al. (2016) provide some reasonable estimates for the value of λ. The γ coefficients are the "penalty loadings", which aim to make the coefficient estimates equivariant to rescaling of x. For example, if one variable is years of schooling on a scale from 1 to 16, and another variable is income in dollars, then a one-year increase in schooling is a much larger change than a one-dollar increase in annual income. The penalty loadings aim to correct for this discrepancy. The authors note:
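To see why the scale of a regressor matters, here is a minimal Python sketch with a single simulated regressor. With one regressor and the objective normalized as (1/2n) times the residual sum of squares plus λγ|b| (a normalization chosen here for convenience), the LASSO coefficient is non-zero exactly when |x'y/n| exceeds λγ. The function names, the simulated data and the root-mean-square loading are all illustrative assumptions; the paper's data-driven loadings are more refined than this.

```python
import numpy as np

def is_selected(x, y, lam, gamma):
    """Single-regressor LASSO on (1/2n)*RSS + lam*gamma*|b|:
    the coefficient is non-zero exactly when |x'y/n| > lam*gamma."""
    return bool(abs(x @ y / len(y)) > lam * gamma)

def rms(x):
    """Root mean square of a regressor, used here as an illustrative loading."""
    return float(np.sqrt(np.mean(x ** 2)))

# Simulated data (hypothetical): demeaned income in dollars and a noisy outcome.
rng = np.random.default_rng(1)
n = 500
income_dollars = rng.normal(0.0, 15_000.0, n)
income_dollars = income_dollars - income_dollars.mean()
y = 0.00002 * income_dollars + rng.standard_normal(n)
y = y - y.mean()

# The same variable measured in millions of dollars.
income_millions = income_dollars / 1e6

lam = 0.1  # penalty level picked by hand for this illustration

# Same penalty on every unit of measurement (gamma = 1).
same_penalty = (
    is_selected(income_dollars, y, lam, 1.0),
    is_selected(income_millions, y, lam, 1.0),
)

# Loading proportional to the regressor's scale.
with_loadings = (
    is_selected(income_dollars, y, lam, rms(income_dollars)),
    is_selected(income_millions, y, lam, rms(income_millions)),
)

print("gamma = 1:      ", same_penalty)
print("gamma = rms(x): ", with_loadings)
```

With a loading of 1, whether income survives depends on the unit it is measured in; with a loading proportional to the regressor's scale, the selection decision is the same in dollars and in millions of dollars.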

The penalty function in the LASSO is special in that it has a kink at 0, which results in a sparse estimator with many coefficients set exactly to zero.

One problem with the LASSO approach, however, is that the coefficients obtained are biased towards zero. The solution Belloni et al. propose is a Post-LASSO estimate, which follows a two-step approach:

First, LASSO is applied to determine which variables can be dropped from the standpoint of prediction. Then, the coefficients on the remaining variables are estimated by ordinary least squares regression, using only the variables with non-zero first-step estimated coefficients. The Post-LASSO estimator is convenient to implement and… works as well as, and often better than, LASSO in terms of rates of convergence and bias.
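The two steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LASSO step is a plain coordinate-descent solver written for this example, the regressors are standardized so that all penalty loadings can be set to 1, the penalty level is picked by hand rather than by the rules in Belloni et al., and the data are simulated with more controls than observations. The names `soft_threshold`, `lasso_cd` and `post_lasso` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards 0 by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - Xb||^2 + lam*sum|b_j| by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * b[j]          # remove variable j from the fit
            rho = X[:, j] @ resid / n        # its correlation with the partial residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * b[j]          # put the updated variable j back
    return b

def post_lasso(X, y, lam):
    """Step 1: LASSO selects variables. Step 2: OLS on the selected ones."""
    b_lasso = lasso_cd(X, y, lam)
    selected = np.flatnonzero(b_lasso != 0)
    b_post = np.zeros(X.shape[1])
    if selected.size:
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        b_post[selected] = coef
    return b_lasso, b_post, selected

# Simulated sparse design (hypothetical): p = 200 controls, n = 100 observations,
# only the first three controls matter.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize, so loadings of 1 are reasonable
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.standard_normal(n)
y = y - y.mean()

b_lasso, b_post, selected = post_lasso(X, y, lam=0.5)
print("selected controls:", selected)
print("LASSO coefficients:     ", np.round(b_lasso[selected], 2))
print("Post-LASSO coefficients:", np.round(b_post[selected], 2))
```

Comparing the two printed coefficient vectors shows the shrinkage that the second step is meant to undo: the LASSO coefficients on the retained variables are pulled towards zero, while the OLS refit on the selected variables is not penalized.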

There is much more detail in the paper, as well as a variety of empirical examples. Do read the whole thing.

In addition, a recent paper by Kugler et al. (2021), published last month, uses the Belloni approach to examine the impact of salary expectations on the decision to become a nurse.


