What is the endogenous switching model and how does it work?

I have been playing around with endogenous switching models since I have been becoming more interested in selection bias corrections recently, and these models represent a more general formulation of that problem (i.e. switching between two possible regimes rather than into one possible sample). Since much of the material on these models is sparsely distributed, I have collected it here for my own reference as much as anything.

So, first of all, what is an endogenous switching model? It is a model in which there are two or more regimes and agents may choose which regime to enter based on both their observable and unobservable characteristics. To make this concrete, suppose we have data on patients who are being treated for alcoholism and we want to estimate the effect of participation in group therapy on the number of drinks consumed per day. If we just run OLS, we will be comparing individuals who have chosen to participate in group therapy to those who chose not to. If, for example, very self-motivated and independent people are choosing not to participate and they are also more likely than the participants to limit their alcoholism in the absence of group therapy, then our OLS estimates are going to be biased and inconsistent (in this example, we will underestimate the true effect), which I show below.

The earliest precursors to switching models can actually be found in nineteenth century biology through the works of the famous statistician Karl Pearson on the distribution of organ sizes in the population. He hypothesised that oddly-shaped distributions may arise due to there being many sub-populations, each following a normal distribution with their own mean and variance. Exogenous switching models are later introduced from the 1950’s onward through the work of Quandt, Fair and Jaffee. It isn’t until 1975 that the first endogenous switching model is introduced by Maddala and Nelson, allowing agents to individually choose their switching point. Let’s describe the set-up of this classical model and then demonstrate why OLS is inconsistent, $Y_{1i} = X_i^{\prime} \beta + U_{1i} \\ Y_{0i} = X_i^{\prime} \alpha + U_{0i} \\ D_i = 1(Z_i^{\prime} \gamma + \epsilon_i > 0)$
Note that $Y_{1i}$ and $Y_{0i}$ are the potential outcomes in this model, meaning that for each individual $i$, they have a potential outcome $Y_{1i}$ under regime 1 and a potential outcome $Y_{0i}$ under regime 0. The outcome that is actually realised in the data depends on which regime the individual chooses, which is determined by $D_i$. Putting this together, we have that the observed outcome for a given individual is, $Y_i = D_i X_i^{\prime} \beta + (1 - D_i) X_i^{\prime} \alpha + D_i U_{1i} + (1 - D_i) U_{0i}$.
We write $Y$ in vector form as $Y = (Y_1, Y_2, ..., Y_n)^{\prime}$. We also write our independent variables as an $n \times 2k$ matrix $W = [ D_i X_i ~ (1 - D_i) X_i ]_i$, where $i = 1,2,...,n$, and our coefficients as $\lambda = (\beta, \alpha)^{\prime}$. Lastly we can write our error term for an individual $i$ as $V_i = D_i U_{1i} + (1 - D_i) U_{0i}$ and in vector form as $V = (V_1, V_2, ..., V_n)^{\prime}$. Putting all of this together we can write out our switching regression in the more compact form, $Y = W \lambda + V$.
The OLS estimator of $\lambda$ is $\hat{\lambda} = (W^{\prime} W)^{-1} W^{\prime} Y = \lambda + (W^{\prime} W)^{-1} W^{\prime} V$, the second equality by substitution of $Y = W \lambda + V$. Let’s assume that $\text{plim}_{n \to \infty} W^{\prime} W/n \to \Sigma$ by the LLN, where $\Sigma$ is symmetric positive definite. Then by the LLN and CMT, $\hat{\lambda} - \lambda \overset{p}{\to} \Sigma^{-1} \text{E} \Big( \begin{bmatrix} D_i X_i (D_i U_{1i} + (1 - D_i) U_{0i}) \\ (1 - D_i) X_i (D_i U_{1i} + (1 - D_i) U_{0i}) \end{bmatrix}_i \Big)$
Naturally for $\hat{\lambda}$ to be consistent we need the above expression to be 0. Let’s check this out (using the fact that expectation is a linear operator), $\begin{array}{l@{}l} \text{E}(D_i X_i (D_i U_{1i} + (1 - D_i) U_{0i})) &{}= \text{E}_{D_i} (\text{E} (D_i X_i (D_i U_{1i} + (1 - D_i) U_{0i})| D_i )) \\ &{}= \text{P}(D_i = 1) X_i \text{E}(U_{1i} | D_i = 1) \\ &{}= \text{P}(D_i = 1) X_i \text{E}(U_{1i} | \epsilon \geq -Z_i^{\prime} \gamma) \neq 0 \end{array}$
We can similarly show $\text{E}((1 - D_i) X_i (D_i U_{1i} + (1 - D_i) U_{0i})) = \text{P}(D_i = 0) X_i \text{E}(U_{0i} | \epsilon < -Z_i^{\prime} \gamma) \neq 0$.
This is the root of the inconsistency of OLS under an endogenous switching model. For each regime we are viewing only those individuals whose observable characteristics $Z_i$ and unobservable characteristics $\epsilon_i$ induced them to select into that regime. If there is correlation between the unobservable characteristics which determine selection and the unobservable characteristics which determine either of the potential outcomes, then there is nothing guaranteeing that the expressions above are zero. Note that, as a corollary, OLS is consistent if $U_1$ and $U_0$ are independent of $\epsilon$, which would be the case if there is only selection on observables.

So how does the endogenous switching regression work? The concept behind it is that, if we could estimate the bias terms $\text{E}(U_{1i} | \epsilon \geq -Z_i^{\prime} \gamma)$ and $\text{E}(U_{0i} | \epsilon < -Z_i^{\prime} \gamma)$, then we would once again be in a selection on observables model and so OLS would be consistent. While the variables $U_{1i}, U_{0i}$ and $\epsilon_i$ are unobservable for each individual, we need only make assumptions about their distribution in order to estimate the bias terms. This is the key insight. In the classical endogenous switching model, we assume that $(U_{1i}, U_{0i}, \epsilon_i) \sim \mathcal{N}(0, \Omega)$ where $\Omega = \begin{bmatrix} \sigma_{\epsilon}^2 & \sigma_{1\epsilon} & \sigma_{0\epsilon} \\ \sigma_{1\epsilon} & \sigma_1^2 & . \\ \sigma_{0\epsilon} & . & \sigma_0^2 \end{bmatrix}$ $\sigma_{1\epsilon}$ is the covariance between $U_1$ and $\epsilon$ and $\sigma_{0\epsilon}$ is the covariance between $U_0$ and $\epsilon$. Note that $\sigma_{10}$, the covariance between $U_1$ and $U_0$, is undefined in this model. It is not estimable given that the bias terms do not depend on it. Note also that it is common to assume that $\sigma_{\epsilon}^2 = 1$ since $\gamma$ is estimable only up to a scale factor. In the absence of an exclusion restriction, identification of the parameters in this model comes solely through non-linearity of the normal distribution (as you will see below). It is this extremely strong assumption which has led to the relative unpopularity of this approach in recent years. If you are lucky enough to be in possession of an exclusion restriction (i.e. something which determines selection but not does determine potential outcomes) then it is possible to identify off of that rather than the normality assumption.

Note that the likelihood function of this model can be written as, $\text{L} = \prod\limits_{i=1}^n \big[ f(Y_{1i} | D_i = 1) P(D_i = 1) \big]^{D_i} \big[ f(Y_{0i} | D_i = 0) P(D_i = 0) \big]^{1-D_i}$
Here is where our normality assumption comes into play, $\begin{array}{l@{}l} f(Y_{1i} | D_i = 1) P(D_i = 1) &{}= f(Y_{1i} | \epsilon_i \geq -Z_i^{\prime} \gamma) P(\epsilon_i \geq -Z_i^{\prime} \gamma) \\ &{}= f(Y_{1i}, \epsilon_i \geq -Z_i^{\prime} \gamma | \epsilon_i \geq -Z_i^{\prime} \gamma) \\ &{}= \int_{-Z_i^{\prime}\gamma}^{\infty} f(U_{1i},\epsilon_i) d \epsilon_i \\ &{}= f(U_{1i}) \int_{-Z_i^{\prime}\gamma}^{\infty} f(\epsilon_i | U_{1i}) d \epsilon_i \end{array}$
Note that $f(U_{1i}) = \frac{1}{\sigma_1}\phi(\frac{U_{1i}}{\sigma_1})$ and $f(\epsilon_i | U_{1i}) = \frac{1}{\sqrt{1-\rho_1^2}} \phi(\frac{\epsilon_i + \rho_1 U_{1i}/ \sigma_1}{\sqrt{1-\rho_1^2}})$ where $\rho_1$ is the correlation coefficient between $U_{1i}$ and $\epsilon_i$ and $\phi(\cdot)$ is the standard normal density. Performing a similar operation for regime 0 (which I omit) we can rewrite the likelihood function, $\text{L} = \prod\limits_{i=1}^n \big[ \frac{1}{\sigma_1}\phi(\frac{U_{1i}}{\sigma_1}) \Phi(\frac{Z_i^{\prime} \gamma + \rho_1 U_{1i}/ \sigma_1}{\sqrt{1-\rho_1^2}}) \big]^{D_i} \big[ \frac{1}{\sigma_0}\phi(\frac{U_{0i}}{\sigma_0}) (1-\Phi(\frac{Z_i^{\prime} \gamma + \rho_0 U_{0i}/ \sigma_0}{\sqrt{1-\rho_0^2}})) \big]^{1-D_i}$
The log likelihood function is therefore, $\begin{array}{l@{}l} \text{lnL} =&{} \sum\limits_{i=1}^n \lbrace D_i \big[ \text{ln}(\Phi(\frac{Z_i^{\prime} \gamma + \rho_1 U_{1i}/ \sigma_1}{\sqrt{1-\rho_1^2}})) + \text{ln}(\phi(\frac{U_{1i}}{\sigma_1})) - \text{ln}(\sigma_1) \big] \\ &{}+ (1 - D_i) \big[ \text{ln}(1-\Phi(\frac{Z_i^{\prime} \gamma + \rho_0 U_{0i}/ \sigma_0}{\sqrt{1-\rho_0^2}})) + \text{ln}(\phi(\frac{U_{0i}}{\sigma_0})) - \text{ln}(\sigma_0) \big] \rbrace \end{array}$
We therefore end up estimating $\beta$, $\alpha$, $\gamma$, $\rho_1$, $\rho_0$, $\sigma_1$ and $\sigma_0$. From here it is very straight-forward to estimate conditional and unconditional outcomes under both regimes.

Something which I have been interested in recently is, in the absence of an exclusion restriction, how sensitive is the endogenous switching regression to violations of the joint normality assumption? I have been running some simulations in which the errors $U_1$, $U_0$ and $\epsilon$ are drawn from a skew normal distribution. In another blog post I hope to share some of my results.

Many thanks to Dutoit (2007) for a brief history of switching regressions and an overview of how to derive the likelihood function.