Sunday, January 15, 2017

Multilevel regression with poststratification (Gelman's MrP) in R - What is this all about?

Multilevel regression with poststratification (MrP) is a useful technique to predict a parameter of interest within small domains through modeling the mean of the variable of interest conditional on poststratification counts. This method (or methods) was first proposed by Gelman and Little (1997) and is widely used in political science where the voting intention is modeling conditional on the interaction of classification variables.

The aim fo this methodology is to provide reliable estimates on strata based on census counts. For those who have some background on survey sampling, this method should look very similar to the Raking method, where sampling weights are adjusted due to known census cell counts. However, a significant difference with Raking is that MrP is a model-based approach, rather than a design-based method. This way, even in the presence of a (maybe complex) survey design, MrP does not take it into account for inference. In other words, sampling design will be considered as ignorable. So, the probability measure that governs the whole inference is based on modeling the voting intention (variable of interest) to demographic categories (auxiliary variables).

Is this a major technical issue to ignore the complex survey design? Yes, because in any case we are considering a probability measure to draw the sample. However, when it comes to voting intention (a major area where this technique is used), we rarely find a sophisticated complex design. Moreover, this kind of studies is barely based on probabilistic polls. So, if the survey lacks a proper sampling plan, it is always better to model the response variable.

Therefore, the ultimate goal of this technique it to estimate a parameter of interest (totals, means, proportions, etc.) for all of the strata (domains, categories or subgroups) in a finite population. From now on, let's assume that:

  1. a population is divided into $H$ strata of interest (for example states),
  2. the parameters of interest are the means (same rules apply for proportions) in each strata $\theta_h$ ($h=1, \ldots, H$), 
  3. every stratum is cross-classified by some demographics $j \in H$ (from now on defined as post-srata), besides every population count $N_j$ is known, and
  4. all of the population means $\mu_j$ can be estimated by using some statistical technique, such as multilevel regression.

This way, the mean in stratum $h$ ($\theta_h$), is defined as a function of means in post-strata $j$ ($\mu_j$), and post-strata counts ($N_j$):

$$\theta_h = \frac{\sum_{j \in h} N_j \mu_j }{\sum_{j \in h} N_j}$$

The first part of MrP is defined by a multilevel regression (MR). This kind of models is a particular case of mixed effect models. The core components of this phase are

  • the variable of interest (for example, voting intention), 
  • the auxiliary variables (classification on census demographic cells) and, 
  • the random effects (strata of interest, that usually are states or counties).

The second part of MrP is cell poststratification (P), and the predicted response variable is aggregated into states and adjusted by corresponding poststratification weights.

R workshop

Let’s consider the Lucy database of the TeachingSampling package. This population contains some economic variables of 2396 industrial companies in a particular year. Assume that we want to estimate the mean income of industries by each of the five existing zones (strata of interest) on that database. This way, our parameters of interest are $\theta_1, \ldots, \theta_5$.

Now, we also know that the population is divided into three levels (small, medium and big industries) and we have access to the total number of industries within each cross group. That is, we know exactly how many small industries are on each of the five zones, and how many medium industries are on each of the five zones, and so on. 

The following code shows how to load the database and obtain the cell counts. 

> rm(list = ls())
> set.seed(123)
>
> library(TeachingSampling)
> library(dplyr)
> library(lme4)
>
> data("Lucy")
> # Number of industries per level
> table(Lucy$Level)

   Big Medium  Small
    83    737   1576
> # Number of industries per zone
> table(Lucy$Zone)

  A   B   C   D   E
307 727 974 223 165
> # Size of post-strata
> (Np <- table(Lucy$Level, Lucy$Zone))
        
           A   B   C   D   E
  Big     30  13   1  16  23
  Medium 180 121 111 187 138
  Small   97 593 862  20   4

Of course, this technique works over a selected sample. That's why we are going to select a random sample of size $n = 1000$. We can also create some tables showing the counts on the sample.

> # A sample is selected
> SLucy <- sample_n(Lucy, size = 1000)
> table(SLucy$Level)

   Big Medium  Small
    33    280    687
> table(SLucy$Zone)

  A   B   C   D   E
130 295 426  86  63

The first step of MRP is Multilevel Regression in order to estimate post-strata means. The following code shows how to estimate them by using the lmer function. The object Mupred contains the corresponding $\mu_j$ ($j$ is defined for each level) regarding each stratum (zones).

> # Step 1: <<MR>> - Multilevel regression
> M1 <- lmer(Income ~ Level + (1 | Zone), data = SLucy)
> coef(M1)
$Zone
  (Intercept) LevelMedium LevelSmall
A    1265.596   -579.1851  -893.8958
B    1138.337   -579.1851  -893.8958
C    1189.285   -579.1851  -893.8958
D    1248.658   -579.1851  -893.8958
E    1284.322   -579.1851  -893.8958

attr(,"class")
[1] "coef.mer"
> SLucy$Pred <- predict(M1)
>
> # Summary
> grouped <- group_by(SLucy, Zone, Level)
> sum <- summarise(grouped, mean2 = mean(Pred))
> (Mupred <- matrix(sum$mean2, ncol = 5, nrow = 3))
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1265.5959 1138.3370 1189.2852 1248.6575 1284.3224
[2,]  686.4107  559.1518  610.1000  669.4724  705.1373
[3,]  371.7001  244.4412  295.3894  354.7618  390.4267

Now we have estimated the post-strata means, it’s time to weight every strata by their corresponding counts in order to obtain an estimate of the mean income by zone. As we know each post-strata size, we simply use the function aggregate to obtain the MRP estimator of the parameters of interest.

> # Step 2: <<P>> - Post-stratification
> # Mean income estimation per zone
> colSums(Np * Mupred) / table(Lucy$Zone)

       A        B        C        D        E
643.5724 312.8052 332.1726 682.8031 778.2428

How accurate was the estimation? It was good compared to the true parameters on the finite population. 

> # True parameters
> aggregate(Lucy$Income, by = list(Lucy$Zone), FUN = mean)
  Group.1        x
1       A 652.2834
2       B 320.7469
3       C 331.0195
4       D 684.9821
5       E 767.3879

 

6 comments:

  1. What about the uncertainty of the estimates?

    ReplyDelete
    Replies
    1. Seba, Henrik ... thanks for sharing your questions. The parameter of interest is defined to be:

      $\theta_h = \frac{\sum_{j \in h} N_j \mu_j }{\sum_{j \in h} N_j}$

      So, when the modeling follows a normal bayesian setup, the posterior distribution of $\theta_h$ is conjugate and then to compute its variance is straightforward. If the modeling is frequentist, it is just a matter of algebra as $\theta_h$ is a linear combination of $\mu_j$ for which a variance is associated. Then, by assuming independence among $\mu_j$'s, we have that:

      $Var(\theta_h) = \frac{\sum_{j \in h} N_j^2 Var(\mu_j) }{(\sum_{j \in h} N_j)^2}$

      Finally, $Var(\mu_j)$ can be found in M1 object

      Delete
  2. Very interesting and informative blog post. Two things.

    1. As the previous person asks, what about CIs around the final estimate?

    2. Obtaining the predictions per random effect and fixed effect combination via predict seems unnecessarily complicated. What about:
    Mupred <- t(coef(M1)$Zone)
    Mupred[2,] <- Mupred[1,] + Mupred[2,]
    Mupred[3,] <- Mupred[1,] + Mupred[3,]
    Mupred

    ReplyDelete
    Replies
    1. Seba, Henrik ... thanks for sharing your questions. The parameter of interest is defined to be:

      $\theta_h = \frac{\sum_{j \in h} N_j \mu_j }{\sum_{j \in h} N_j}$

      So, when the modeling follows a normal bayesian setup, the posterior distribution of $\theta_h$ is conjugate and then to compute its variance is straightforward. If the modeling is frequentist, it is just a matter of algebra as $\theta_h$ is a linear combination of $\mu_j$ for which a variance is associated. Then, by assuming independence among $\mu_j$'s, we have that:

      $Var(\theta_h) = \frac{\sum_{j \in h} N_j^2 Var(\mu_j) }{(\sum_{j \in h} N_j)^2}$

      Finally, $Var(\mu_j)$ can be found in M1 object

      Delete
  3. Also, you can perform an EM algorithm as is stated in pages 134 and 135 of the seminal paper http://www.statcan.gc.ca/pub/12-001-x/1997002/article/3616-eng.pdf

    ReplyDelete
  4. Concepts are very clear.tq for sharing such a valuable content.we are very happy to recieve such a nice one tq

    Data science training
    Data science online training
    Data science training in hyderabad

    ReplyDelete