Sunday, April 16, 2017

Small Area Estimation 101

Small area estimation (SAE) has become a widely used technique in official statistics since the last decade of past century. When the sample size is not enough to provide reliable estimates at a very particular level, the power of models and auxiliary information must be applied with no hesitation. In a nutshell, SAE tries to exploits similarity and borrows strength from available information.

I will write some posts to present, step by step, the fundamentals of SAE and how it can be implemented in the R software. The first post (this one you are reading now) is about basic concepts such as sampling and databases, the second and third posts will deal with direct and indirect estimates, respectively; the fourth post will introduce model-assisted estimation; and finally, the fifth post will deal with the Fay-Harriot method.

Let's begin. First of all, The Australian Bureau of Statistics declares that small area estimation refers to methods of producing sufficiently reliable estimates for geographic areas that are too fine to obtain with precision, using direct survey estimation methods. By direct estimation, we mean classical design-based survey estimation methods that utilize only the sample units contained in each small area. Small area estimation methods are used to overcome the problem of small samples sizes to produce small area estimates that improve the quality of direct survey estimates obtained from the sample in each small area. The more sophisticated of these methods work by taking advantage of various relationships in the data, and involve, either implicitly or explicitly, a statistical model to describe these relationships.

Now, I want to reproduce a clarifying explanation from Dr. Little (paraphrasing Groves) about SAE, and its use in survey sampling. They claim that regression estimates provide relatively precise predictions for small areas from a survey that account for the differences between areas of characteristics included as predictors in the survey but do not account for differences in characteristics not included in the study. Direct estimates for each area are unique to the area and hence take into account both observed and unobserved relevant characteristics; however, they have low precision in areas where the sample size is small. The SAE model combines the regression estimate and direct estimate for each area in a sensible way, balancing bias and precision.

For this technique to succeeds, Longford 2005 claims that areas that are known to be similar to one another should be receiving similar estimates, rather than estimates independent of one another. The degree to which similarity can or should be imposed can be chosen from statistical grounds by minimizing the overall discrepancy (mean squared error).

Rahman 2008 emphasizes that SAE uses data from similar domains to estimate the statistics in a particular small area of interest, and this ‘borrowing of strength’ is justified by assuming a model which relates the small area statistics. SAE is the process of using statistical models to link survey outcome or response variables to a set of predictor variables known for small areas to predict small area-level estimates. Traditional area-specific estimates may not provide enough statistical precision because of small sample observations in small geographical regions. In such situation, it may be worth checking whether it is possible to use indirect estimation approaches based on the linking models.

R workshop

This code will not produce any small area estimation, but it will help to introduce basic concepts of sampling. We will use the BigLucy database from the TeachingSampling package to illustrate how to obtain a probabilistic sample from a finite population. This database is about deals with some economic variables for a population of 85296 companies spread out into 100 counties (areas) in a particular year of some fake country. The aim of the exercise is 1) to select a stratified sample according to the size of the companies and 2) obtain accurate estimates of the total income within each of the 100 counties. Note that the parameters of interest are given by:

$$t_{y,d} = \sum_{k \in U_d} y_k$$

Where $t_{y,d}$ denotes the total income of the $d$-th county, $y_k$ is the income of the $k$-th company belonging the $d$-th county. The whole population of companies into the county is noted by $U_d$. Thus, this code computes the total income for each county along with the number of companies belonging each county. Finally, this parameters are saved in a new database named Results

##### Setting things up #####

setwd(“/wherever your prefer location is")

rm(list = ls())


Total <- BigLucy %>%
  group_by(Zone) %>%
  summarise(Income. = sum(Income)) %>%

N <- BigLucy %>%
  group_by(Zone) %>%
  summarise(N.county = n()) %>%

Results <- data.frame(N, Total$Income.)

#Checking the population total
(Total <- sum(Results$Total.Income.))

Now, once the parameters of the finite populations are computed, the time has come for us to select a sample. The sampling design we will use is an stratified sampling by taking advantage of the classification of each company into the database according to its level: small, medium and large. The selected sample will be stored into an object named data.sample and the sampling weights will be saved in another object called FEX.

##### Drawing a stratified sample #####

# Level is the stratifying variable
# Defines the size of each stratum
N1 <- summary(Level)[[1]]
N2 <- summary(Level)[[2]]
N3 <- summary(Level)[[3]]
Nh <- c(N1,N2,N3)
# Defines the sample size at each stratum

n1 <- round(N1 * 0.05)
n2 <- round(N2 * 0.05)
n3 <- round(N3 * 0.05)
# Draws a stratified sample
sam <- S.STSI(Level, Nh, nh)

data.sam <- BigLucy[sam,]

data.sam$FEX <- NULL
data.sam$FEX[data.sam$Level == "Big"] <- Nh[1] / nh[1]
data.sam$FEX[data.sam$Level == "Medium"] <- Nh[2] / nh[2]
data.sam$FEX[data.sam$Level == "Small"] <- Nh[3] / nh[3]

save(data.sam, file = "data.sam.RData")

In summary, we have drawn a sample of 4265 companies: 145 big companies, 1290 medium companies and, 2830 small companies. Each of these companies is spread out through the 100 counties in the country. In the Results database will be stored the information about those counties (how many companies are in each county and, how many companies are in the sample for each county.) along with different estimations for the total income of the counties. 

##### In summary #####

n <- data.sam %>%
  group_by(Zone) %>%
  summarise(n.county = n()) %>%

Results$n.county <- n$n.county
Results <- Results[c("Zone", "N.county", "n.county", "Total.Income.")]

save(Results, file = "Results.RData")

In the following post, we will use this sample to estimate the total income for each of the hundred zones in BigLucy’s population. 


Saturday, January 28, 2017

Regression to the mean (or at the end, people are not as smart as you could expect)

Francis Galton very cleverly coined the term "regression to (or towards) the mean" meaning that if a variable is shown extreme in a first measurement, then the following observed values of that very variable will tend to get closer to the average of its distribution. The classical example is height: a tall child will have (on average) parents less tall than himself. Moreover, extremely small parents tend to have children who are smaller than average, but in both cases, the children tend to be closer to the mean than were their parents (Senn, 2011).

Ok, this story should be widely known by the readers of this blog. However, I want to put forward another point of view. This is from Rolf Tarrach (former president of Luxemburg University) who has written a book on logical reasoning entitled <<The Pleasure of Deciding>>. He claims that the regression to the mean is a phenomenon that occurs not only in body measuring but also in cognitive measuring. That is: smart parents will tend to have children who are not as smart as expected.

So if you consider yourself as an intelligent person and you have decided to share your life with a smart mate, it is very likely that your children won't be smarter than you two. So, do not expect your children to be geniuses. That fact goes against the common idea that insists in requesting to children of smart parents to be even more intelligent. Of course, there are some exemptions such as the Bach family or the Bernoulli family. But, those are isolated deviations from the normality of real life.

I want to finish with this story about mathematician Bernard Shaw and dancer Isadora Duncan. She told him: “Would it not be wonderful if we could have a child who had your brains and my beauty?” He replied: “Yes, but suppose the child had your brains and my beauty!”

PS: About geniuses, Koenker (1998) claims that Galton not only managed to invent Regression in one plot but also a bivariate kernel density estimation.

Sunday, January 15, 2017

Multilevel regression with poststratification (Gelman's MrP) in R - What is this all about?

Multilevel regression with poststratification (MrP) is a useful technique to predict a parameter of interest within small domains through modeling the mean of the variable of interest conditional on poststratification counts. This method (or methods) was first proposed by Gelman and Little (1997) and is widely used in political science where the voting intention is modeling conditional on the interaction of classification variables.

The aim fo this methodology is to provide reliable estimates on strata based on census counts. For those who have some background on survey sampling, this method should look very similar to the Raking method, where sampling weights are adjusted due to known census cell counts. However, a significant difference with Raking is that MrP is a model-based approach, rather than a design-based method. This way, even in the presence of a (maybe complex) survey design, MrP does not take it into account for inference. In other words, sampling design will be considered as ignorable. So, the probability measure that governs the whole inference is based on modeling the voting intention (variable of interest) to demographic categories (auxiliary variables).

Is this a major technical issue to ignore the complex survey design? Yes, because in any case we are considering a probability measure to draw the sample. However, when it comes to voting intention (a major area where this technique is used), we rarely find a sophisticated complex design. Moreover, this kind of studies is barely based on probabilistic polls. So, if the survey lacks a proper sampling plan, it is always better to model the response variable.

Therefore, the ultimate goal of this technique it to estimate a parameter of interest (totals, means, proportions, etc.) for all of the strata (domains, categories or subgroups) in a finite population. From now on, let's assume that:

  1. a population is divided into $H$ strata of interest (for example states),
  2. the parameters of interest are the means (same rules apply for proportions) in each strata $\theta_h$ ($h=1, \ldots, H$), 
  3. every stratum is cross-classified by some demographics $j \in H$ (from now on defined as post-srata), besides every population count $N_j$ is known, and
  4. all of the population means $\mu_j$ can be estimated by using some statistical technique, such as multilevel regression.

This way, the mean in stratum $h$ ($\theta_h$), is defined as a function of means in post-strata $j$ ($\mu_j$), and post-strata counts ($N_j$):

$$\theta_h = \frac{\sum_{j \in h} N_j \mu_j }{\sum_{j \in h} N_j}$$

The first part of MrP is defined by a multilevel regression (MR). This kind of models is a particular case of mixed effect models. The core components of this phase are

  • the variable of interest (for example, voting intention), 
  • the auxiliary variables (classification on census demographic cells) and, 
  • the random effects (strata of interest, that usually are states or counties).

The second part of MrP is cell poststratification (P), and the predicted response variable is aggregated into states and adjusted by corresponding poststratification weights.

R workshop

Let’s consider the Lucy database of the TeachingSampling package. This population contains some economic variables of 2396 industrial companies in a particular year. Assume that we want to estimate the mean income of industries by each of the five existing zones (strata of interest) on that database. This way, our parameters of interest are $\theta_1, \ldots, \theta_5$.

Now, we also know that the population is divided into three levels (small, medium and big industries) and we have access to the total number of industries within each cross group. That is, we know exactly how many small industries are on each of the five zones, and how many medium industries are on each of the five zones, and so on. 

The following code shows how to load the database and obtain the cell counts. 

> rm(list = ls())
> set.seed(123)
> library(TeachingSampling)
> library(dplyr)
> library(lme4)
> data("Lucy")
> # Number of industries per level
> table(Lucy$Level)

   Big Medium  Small
    83    737   1576
> # Number of industries per zone
> table(Lucy$Zone)

  A   B   C   D   E
307 727 974 223 165
> # Size of post-strata
> (Np <- table(Lucy$Level, Lucy$Zone))
           A   B   C   D   E
  Big     30  13   1  16  23
  Medium 180 121 111 187 138
  Small   97 593 862  20   4

Of course, this technique works over a selected sample. That's why we are going to select a random sample of size $n = 1000$. We can also create some tables showing the counts on the sample.

> # A sample is selected
> SLucy <- sample_n(Lucy, size = 1000)
> table(SLucy$Level)

   Big Medium  Small
    33    280    687
> table(SLucy$Zone)

  A   B   C   D   E
130 295 426  86  63

The first step of MRP is Multilevel Regression in order to estimate post-strata means. The following code shows how to estimate them by using the lmer function. The object Mupred contains the corresponding $\mu_j$ ($j$ is defined for each level) regarding each stratum (zones).

> # Step 1: <<MR>> - Multilevel regression
> M1 <- lmer(Income ~ Level + (1 | Zone), data = SLucy)
> coef(M1)
  (Intercept) LevelMedium LevelSmall
A    1265.596   -579.1851  -893.8958
B    1138.337   -579.1851  -893.8958
C    1189.285   -579.1851  -893.8958
D    1248.658   -579.1851  -893.8958
E    1284.322   -579.1851  -893.8958

[1] "coef.mer"
> SLucy$Pred <- predict(M1)
> # Summary
> grouped <- group_by(SLucy, Zone, Level)
> sum <- summarise(grouped, mean2 = mean(Pred))
> (Mupred <- matrix(sum$mean2, ncol = 5, nrow = 3))
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1265.5959 1138.3370 1189.2852 1248.6575 1284.3224
[2,]  686.4107  559.1518  610.1000  669.4724  705.1373
[3,]  371.7001  244.4412  295.3894  354.7618  390.4267

Now we have estimated the post-strata means, it’s time to weight every strata by their corresponding counts in order to obtain an estimate of the mean income by zone. As we know each post-strata size, we simply use the function aggregate to obtain the MRP estimator of the parameters of interest.

> # Step 2: <<P>> - Post-stratification
> # Mean income estimation per zone
> colSums(Np * Mupred) / table(Lucy$Zone)

       A        B        C        D        E
643.5724 312.8052 332.1726 682.8031 778.2428

How accurate was the estimation? It was good compared to the true parameters on the finite population. 

> # True parameters
> aggregate(Lucy$Income, by = list(Lucy$Zone), FUN = mean)
  Group.1        x
1       A 652.2834
2       B 320.7469
3       C 331.0195
4       D 684.9821
5       E 767.3879


Sunday, January 1, 2017

3PL models viewed through the lens of total probability theorem (updated)

As I currently am the NPM for PISA in Colombia, I must assist to several meetings dealing with the proper implementation of this assessment in my country. Few of them are devoted to the analysis of this kind of data (coming from IRT models). As usual, OECD has hired organizations with high technical standards. The institute that handles all this data and take part in the analysis is ETS (Educational Testing Services).

PISA and ETS are changing from Rasch models to 2PL models. That involves a significant technical effort to maintain comparability along time. I had the opportunity to talk with some experts from ETS last year, and I formulated them the following question: ¿why to consider 2PL models instead of 3PL models? Well, the answer was not easy, and I am not pretending to explain it here in detail, in part because I am still convinced about the advantages of 3PL models that exceed those of 2PL models. However, they yielded me to a recent paper entitled Is There Need for the 3PL Model? Guess What?

The article was written by Mathias von Davier from ETS. I liked the way von Davier showed 3PL models, as coming from a total probability setup involving 2PL models. Consider the following hierarchical structure: first, the test-taker decides whether he/she is answering that item by guessing or not; then, he/she uses his/her ability to found the correct response. So, the stochastic process behind this structure can be easily shown in a tree diagram:

Screen Shot 2017 01 01 at 12 46 51 PM


Remember that 3PL models can be written as:

$$P_{3PL}(x=1) = P(Guess) + P(NoGuess) \times P_{2PL}(x=1|NoGuess)$$

Note that, per the model, once the student has chosen to answer by guessing, a correct answer is always found (kind of weird, isn’t it?). So, a major criticism against 3PL models is related to this last point. In R, you can estimate 3PL models by using the mirt package. So, for example, when using the LSAT7 data on the second item, we can estimate this guessing parameter.

data <- expand.table(LSAT7)
md2  <-  mirt(data, 1, itemtype = '3PL', IRTpars = TRUE)
coef(md2, IRTpars = TRUE)$Item.2

We found that the guessing parameter is estimated as 0.295. This way, the model is specified as:

$$P_{3PL}(x=1) = 0.295 + 0.705 \times P_{2PL}(x=1|NoGuess)$$. 

PD: Alexander Robitzsch pointed me out to this paper (Aitkin, 2006) where an alternative 3PL has been proposed which aims to address the critique.

Saturday, December 24, 2016

Computing Sample Size for Variance Estimation

The R package samplesize4surveys contains functions that allow to calculate sample sizes for estimating proportions, means, difference of proportions and even difference of two means. It also permits the calculation of sample error and power level for a fixed sample size.

Here four functions are introduced for the estimation of a population variance and for conducting statistical hypothesis testing on this parameter of interest. Right away is the description of these functions:

  1. Function ss4S2 allows calculating the sample size for estimating $s^2_{y_U}$ subject to a particular value of the coefficient of variation or the relative margin of error. Additionally, it offers to the user the option of mapping the coefficient of variation and the margin of error as a function of the sample size, to make easier the decision about $n$.
  2. Function ss4S2H allows calculating the sample size for estimating $s^2_{y_U}$ subject to a particular power level to detect a population variance greater than the value set in the null hypothesis. It also offers to the user the option of mapping the power level in function of the sample size.
  3. Function e4S2 allows calculating the coefficient of variation and the margin of error for a particular sample size. It also allows obtaining a mapping similar to the one of ss4S2.
  4. Function b4S2 allow calculating the power level for a fixed sample size. It also allows obtaining a mapping similar to the one of ss4S2H

In order to use the above functions it is necessary to install and call the package that contains them in the Comprehensive R Archive Network (CRAN). That for, it is required to type the following code lines from the console:


For example, the following code line gives the necessary sample size to estimate the variance of a characteristic of interest in a finite population (with a coefficient of kurtosis of one) to reach an estimated coefficient of variation of maximum 5% and a relative margin of error of 3%

ss4S2(N = 10000, K = 1, CV = 0.05, me = 0.03, DEFF = 2, plot = TRUE)

Screen Shot 2016 12 24 at 6 45 56 PM

On the other hand, as the package is in constant update, the authors have arranged a repository in which users can use the newest features and interact with the academic community to correct possible errors in computer codes and improve the efficiency of functions, among others. In order to access to this version control, it is necessary to type the following lines from R.


In this paper you can find the mathematical background behind those R functions. 

Saturday, December 3, 2016

Highlighting R code for the web

When blogging about statistics and R, it is very useful to differentiate the body text to R code. I used to manage this issue by highlighting the code and pretty-R was a valuable instrument from Revolutions Analytics to accomplish this. However, as you may know, Microsoft acquired that company, and now this feature (dressing R code for the web) is not available anymore.

After some searching, I found this online syntax highlighter and it seems to work pretty well. Besides, it allows you to select from different styles, and you can even choose among a lot of computational languages.

How important is that variable?

When modeling any phenomena by including explanatory variables that highly relates the variable of interest, one question arises: which of the auxiliary variables have a higher influence on the response? I am not writing about significance testing or something like this. I am just thinking like a researcher who wants to know the ranking of variables that influence the response and their related weight.

There are a variety of methods that try to answer that question. The one inducing this thread is very simple: isolate units from variables. Assume a linear model with the following structure (for the sake of simplicity, assume only two explanatory variables):

$$y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

If you assume this model as true and $\beta_i > 0$, then the influence of variable $x_i$, over response $y$ could be found when isolating measure units from variables. Then, one could fit a model over the standardized variables (explanatory and response) and then directly comparing the regression coefficients. Another way to do this is by means of the following expression:

$$I(i) = \frac{\beta_i}{sd(\beta_i)} = \beta_i\frac{ sd(x_i)}{sd(y)}$$

For example, let's consider the following model $y = -500 x_1 + 50 x_2 + \varepsilon$, then the relative importance of the first and second variable is around 500/(500+50) = 0.9, and 50/(500+50) = 0.1, respectively. The following code shows how to perform this simple analysis in R.

n <- 10000

x1 <- runif(n)
x2 <- runif(n)
y <- -500 * x1 + 50 * x2 + rnorm(n)

model <- lm(y ~ 0 + x1 + x2)

# 1a. Standardized betas
sd.betas <- summary(model)$coe[,2]
betas <- model$coefficients
imp <- abs(betas)/sd.betas
imp <- imp/sum(imp)

# 1b. Standardized betas
imp1 <- abs(model$coefficients[1] * sd(x1)/sd(y))
imp2 <- abs(model$coefficients[2] * sd(x2)/sd(y))

imp1 / (imp1 + imp2)
imp2 / (imp1 + imp2)

# 2. Standardized variables
model2 <- lm(I(scale(y)) ~ 0 + I(scale(x1)) + I(scale(x2)))

Sunday, November 20, 2016

Lord's Paradox in R

In an article called A Paradox in the Interpretation of Group Comparisons published in Psychological Bulletin, Lord (1967) made famous the following controversial story:

A university is interested in investigating the effects of the nutritional diet its students consume in the campus restaurant. Various types of data were collected including the weight of each student in the month of January and their weight in the month of June of the same year. The objective of the University is to know if the diet has greater effects on men than on women. This information is analyzed by two statisticians.

The first statistician observes that at the end of the semester (June), the average weight of the men is identical to their average weight at the beginning of the semester (January). This situation also occurs for women. The only difference is that women started the year with a lower average weight (which is obvious from their background). On average, neither men nor women gained or lost weight during the course of the semester. The first statistician concludes that there is no evidence of any significant effect of diet (or any other factor) on student weight. In particular, there is no evidence of any differential effect on both sexes, since no group shows systematic differences.

The second statistician examines the data more carefully. Note that there is a group of men and women who started the semester with the same weight. This group consisted of thin men and overweight women. He notes that those men gained weight from the average and these women lost weight with respect to the average. The second statistician concludes that by controlling for the initial weight, the university diet has a positive differential effect on men relative to women. It is evident that for men and women with the same initial weight, on average they differ since men gained more weight, and women lost more weight.

The following chart shows the reasoning of both statisticians in dealing with the problem. Note that the black line describes a 45 degrees line, the green points are the data coming from the men and the red ones from the women

Screen Shot 2016 11 20 at 6 04 27 PM

The reasoning of the first statistician focuses on the expectations of both distributions. Specifically in the coordinates (x = 60, y = 60), for females, and (x = 70, y = 70) for males, where black, red and green lines appear to coincide. The reasoning of the second statistic is limited to the continuum induced by the overlap of red and green dots. Specifically to the space induced by x = (60, 70), y = (60, 70). Suppose we have access to this dataset as shown in the following illustration, where the first column denotes the initial weight of the students, the second column indicates the final weight, the third column describes the difference between pesos and the last one defines the Sex of the student.

Screen Shot 2015 12 30 at 11 13 09 PM

The findings of the first statistician are obtained through a simple regression analysis that, taking as a response variable the difference between weights, induces a coefficient of regression equal to zero for the variable sex, which indicates that there are no significant differences in the weight difference between men and women.

Screen Shot 2016 11 20 at 6 04 49 PM

The findings of the second statistic are obtained through a covariance analysis, taking as response variable the final weight and covariates are sex and the initial weight of the individual. This method induces a coefficient of regression equal to 5.98 which implies that there is significant difference between the final weight of the people, according to sex.

Screen Shot 2016 11 20 at 6 05 28 PM

For Imbens and Rubin (2015), both are right when it comes to describing the data, although both lack a sound reasoning in establishing some kind of causality between the diet of the university and the loss or gain of weight in the students. Regardless of this I still find more interesting the analysis that arises from the comparison between men and women who started with the same weight (ie all data restricted to x = (60, 70) y = (60, 70). 

R workshop

Lord's paradox summarizes the analysis of two statisticians who analyze the average weight of some students within a particular university. At the end of the semester (June), the average weight of the men is identical to their average weight at the beginning of that six months (January). This situation also occurs for women. The only difference is that women started the year with a lower average weight (which is evident from their natural contexture). On average, neither men nor women gained or lost weight during the semester.

To perform the simulation, we assumed that both the final weight of the men and the women follow a linear relationship with the original weight. Thus, it is assumed that $y_{2i}^M = \beta_0^M + \beta_1 y_{1i}^M + \varepsilon_i$ for the weight of women; and $y_{2i}^H = \beta_0^H + \beta_1 y_{1i}^H + \varepsilon_i$, for the weight of men. Where $y_{1i}^M$ denotes the weight of the $i$-th female at the beginning of the semester, and $y_{2i}^M$ denotes the weight of the $i$-th female at the end of the semester. The notation for men (H) maintains this logic.

Now, note that from their natural contexture, men must have greater weight than women. Suppose that on average the weight of men is equal to that of women plus a constant $c$. In addition, the mean weight in both groups is identical in both times. Then, we have $\bar{y}^M = \beta_0^M + \beta_1 \bar{y}^M$ and that $\bar{y}^H = \beta_0^H + \beta_1 \bar{y}^H = \beta_0^H + \beta_1 (\bar{y}^M + c).$ Hence, after some algebra, we have that $\beta_0^M = (1 - \beta_1) \bar{y}^M$ and $\beta_0^H = \bar{y}^H - \beta_1 (\bar{y}^M + c)$.

The following code replicates a set of data that follows the relationship proposed by Lord.

N <- 1000
b <- 10
l <- 50
u <- 70
Mujer1 <- runif(N, l, u)
Hombre1 <- Mujer1 + b
beta1 <- 0.4
Mujerb0 <- (1 - beta1) * mean(Mujer1)
Hombreb0 <- mean(Hombre1) - beta1 * (mean(Mujer1) + b)
sds <- 1
Mujer2 <- Mujerb0 + beta1 * Mujer1 + rnorm(N, sd=sds)
Hombre2 <- Hombreb0 + beta1 * Hombre1 + rnorm(N, sd=sds)

The graph can be done with the following piece of code:

datos <- data.frame(inicio = c(Mujer1, Hombre1), final = c(Mujer2, Hombre2))
datos$dif <- datos$final - datos$inicio
datos$sexo = c(rep(0, N), rep(1, N))
ggplot(data = datos, aes(inicio, final, color = factor(sexo))) +
  geom_point() + stat_smooth(method = "lm")  +
  geom_abline(intercept = 0, slope = 1) +
  ggtitle("Paradoja de Lord") + theme_bw()

Monday, November 14, 2016

Intercept or not? That's the question!

My current passion is statistical modeling. While each model requires the researcher to make a proper contextualization of the problem he/she is addressing, which means that no model is equal to another, there is a common question that the researcher should answer before estimating model parameters.

Do I fit the model with an intercept or not?

While seeking for the goodness of fit, the researcher is tempted many times to run automated variable selection procedures (i.e. stepwise, forward, backward). If luckily, these methods will provide you "the best model" for you to choose (based on the highest coefficient of determination, or lower AIC, BIC, or DIC). Call me old fashioned, and retrogressive, but I have always been a little reluctant to the practice of throwing the data into the software waiting for the best model to come automatically.

Returning to the subject of this entry I will highlight the importance of inclusion/omission of the intercept in a model. For this, I will consider the following cases

1. If the response variable Y is continuous:

When the explanatory variable X is also continuous. This is the classic case of a linear regression model, where the inclusion of the intercept assumes that when X = 0, the mean value of Y = 0, and corresponds to the estimate of the intercept. However, when excluding the intercept, we are demanding that the average value of Y = 0 when X = 0. Thus the inclusion or exclusion of the intercept, in many cases, depends on the nature and interpr etation of the variables.

When the explanatory variable X is categorical. Without loss of generality, let's assume it as dichotomous (two levels); in this case, when fitting a regression line including the intercept, one can define a dummy variable representing the first level of the variable X, and the model is set as

$Y_i = \beta_0 + \beta_1D_{1i} + E_i$

Where D1 = 1 for units belonging to the first level of X and, D1 = 0 for units belonging to the second level of X. In this case, the interpr etation of this model is as follows: For individuals belonging to level 1, the mean value of Y is given by $ \beta_0 + eta1$. For units belonging to level 2, the average value of Y is given by $ \beta_0$. Coefficient $ \beta_1$ is defined to be the difference between these two levels. If the estimate is significant, it implies that the variable X does have considerable influence on Y. That is, the mean value of Y at each level of X varies in a significant way.

On the other hand, if the regression is fitted without an intercept, two dummies variables must be created, each one representing the as many levels as X has. The model is formulated as

$Y_i = \beta_0D_{1i} + \beta_1D_{2i} + E_i$

For units at the first level (D1 = 1), the mean value of Y is given by $ \beta_0$ and, for units at the second level (D2 = 1), the average value of Y is given by $\beta_1$. Thus, even if the estimate of either $\beta_0$ or $\beta_1$ is significant, that does not imply that X has any influence over Y. All we can claim is that in this model is that the two parameters are significantly different from zero. So, if you really want to establish whether X influences Y, then omitting the intercept would not be a good choice.

 2. If the response variable Y is discrete:

When the explanatory variable X is continuous. In this case, the fitted is a logistic regression, modeling the probability of success (Y = 1) in terms of $p_i = Pr(Y=1)$:

$logit (p_i) = \beta_0 + \beta_1X_i$

If the model includes an intercept, $ \beta_0$ estimate can be used to estimate the probability of success when X = 0, since $p_i = \frac{\exp{ \beta_0}}{ 1 + exp{ \beta_0}}$. On the other hand, if the estimate of $\beta_1$ is not significant, that implies that the values of X do not influence the chances of success or failure over Y. If the estimate of $ \beta_1$ is significant with a positive (or negative) value, it indicates that an increase in the variable X implies an increase (or decrease) on the probability of success of Y. Note that this interpr etation is the same when the regression is adjusted without an intercept.

When the explanatory variable is categorical (let's assume it as a dichotomous variable). In this case, by fitting a regression line including the intercept a dummy variable representing the first level of the variable is created and the model is defined as

$logit (p_i) = \beta_0 + \beta_1D_{1i}$

The interpretation of this model is as follows: for units in the first level of X, $logit(p_i) = \beta_0 + \beta_1$. For units in the second level of X, $logit(p_i) = \beta_0$. Thus, if $ \beta_1$ is significant, it indicates that $logit(p_i)$ is different between levels of variable X, and we can conclude that X does have an important influence on Y.

On the other hand, if the intercept is not taken into account, two dummies are produced (representing X levels) and the model is formulated as

$logit (p_i) = \beta_0D_{1i} + \beta_1D_{2i}$

For this pattern, estimates of $\beta_0$ and $\beta_1$ represent the values of $logit(p_i)$ in the two levels of X. Thus, the significance of X over Y cannot be estimated via $\beta_1$ or $\beta_2$. Those coefficients give no information on the influence of X on Y.

In summary, we can conclude that when the explanatory variable is continuous, the interpr etation of $\beta_1$ does not change if the intercept is included (or excluded). Although when the explanatory variable is discrete, we must consider whether the model includes or not the intercept, since the interpr etation of $\beta_1$ changes. Also, if what you want is to know the influence of X on Y, it is necessary to include the intercept. That can only be achieved if the model considers the intercept, and putting aside (just for a moment) automated procedures.

Sunday, October 30, 2016

Data Literacy

Once again I have decided to pimp my blog. This time is a significant change: the name. This blog began a long time ago. It was 2006; I was 22-year-old, and I was enrolled in a Master of Science in Statistics. I had plenty of doubts about statistics, data science, and uncertainty (fortunately, some of those questions remain) and I decided to solve them by blogging.

Then it was born this blog with the name "Apuntes de Estadística." Ten years later, this blog has become an important space for researchers, teachers, and students who, like myself, want to solve questions about variability and statistics.

Now, after hundreds of posts, a hundred thousand visitors per year, and being migrated from Spanish to English, from WordPress to Blogger, it is time for me to revisit the very own name of this blog. Lately, posts are not notes anymore; I am not only answering basic questions but discussing modeling, forecasting, data and even statistics epistemology.

As a consequence of what I currently do in my job, now I am strongly convinced (more than ever in my life) that the time predicted by H.G. Wells has not arrived yet. Statistical thinking is not part of our culture. It should be encrusted in every one of us, but it is not. I will put in just my two cents to build an effective citizenship through data literacy: the ability to communicate information from data.

Monday, October 17, 2016

Sublime Text 3: an alternative to RStudio

It was a Saturday morning; I was lecturing my students of my Item Response Theory class when I decided to run some R scripts to introduce my students with the JAGS syntax and the estimation of parameters in a Bayesian logistic regression setup.

As it was usual, I opened RStudio because it was my favorite R interface. So when I tried to run a Bayesian sampler with five chains in JAGS, it appeared that damn message: <<R Session Aborted - R encountered a fatal error - The session was terminated>>.

Rstudio bomb

So I was there, the lecture was completely stopped because of RStudio. However, knowing about those annoying disadvantages of RStudio I asked some colleagues about some other interfaces of R. The answer was convincing: Sublime Text is an editor designed for people that are serious when it comes to programming, not only in R but any other language.

So, I am just spreading the word about this editor, but I am not trying to convince you to abandon RStudio. If you want to give a try, just follow this simple steps:

  • Of course, you install R (
  • Install Sublime Text 3 (
  • Cmd + Shift + P, then type Package Control and install it.
  • Cmd + Shift + P, then type "Install Package", look for SublimeREPL and install it.
  • Cmd + Shift + P, then type "Install Package", look for R-Box and install it.
  • Cmd + Shift + P, then type "Install Package", look for SendTextPlus and install it.
  • Cmd + Shift + P, then type "Install Package" look for R-Snippets and install it.
  • Cmd + Shift + P, then type "Set Syntax: R Extended”.
  • Cmd + Shift + P, then type "SendTextPlus: Choose Program" and select R.
  • Enjoy! Write your code and Cmd + Enter to send your lines to R.

You can also send your code to R within Sublime Text by using REPL. Cmd + Shift + P, then type "SublimeREPL: R" and that's it. Please, comment your experience. You can be sure that trying this editor is worthwhile.

Tuesday, October 11, 2016

Multilevel Modeling of Educational Data using R (Part 1)

Linear models fail to recognize the effect of clustering due to intraclass correlation accurately. However, under some scenarios force you to take into account that units are clustered into subgroups that at the same time are nested within larger groups. The typical example is the analysis of standardized tests, where students are grouped in schools, and schools are grouped into districts, while districts are part of a whole system (state or country).

When data is coming from a hierarchical structure, the proper way to analyze it is via multilevel modeling (Goldstein, 1995). Between other advantages, multilevel modeling allows you to correctly estimate the relative variation in the test score due to the effect of clustering. In other words, you can decompose the variance into two parts: the one coming from students and the one coming from schools, by fitting a regression model per group. That allows you to make proper inferences when it comes to the source of variation of the outcome measure.

For example, the following graph shows a scatter plot between the test score and the socio-economic status of the students. If you consider the left side of the chart below, you realize a linear relationship indeed; however, when you take a closer look, it is evident that the same model does not hold within schools (different colors identify different schools).

Screen Shot 2016 10 11 at 9 42 37 PM

Let's assume that we want to fit a two-level model: the first level is related to schools, and the second level describes students within schools. The following code allows simulating a hierarchical structure over a population of five schools and a hundred students per school. Notice that the test score is somehow related to the socio-economic status (SES) of the student.

rm(list = ls())
N <- 100 #Number of students per school
sigma <- 200
x1 <- runif(N, 10, 40)
x2 <- runif(N, 25, 55)
x3 <- runif(N, 40, 70)
x4 <- runif(N, 55, 85)
x5 <- runif(N, 70, 100)
y1 <- 20 + 0 * x1 + rnorm(N, 0, sigma)
y2 <- 40 + 10 * x2 + rnorm(N, 0, sigma)
y3 <- 60 + 20 * x3 + rnorm(N, 0, sigma)
y4 <- 80 + 30 * x4 + rnorm(N, 0, sigma)
y5 <- 100 + 40 * x5 + rnorm(N, 0, sigma)
ID <- rep(LETTERS[1:5], each = N)
test <- data.frame(SES = c(x1, x2, x3, x4, x5), 
Score = c(y1, y2, y3, y4, y5), ID = ID)

When analyzing data with such a type of hierarchical structures, the researcher always should fit a null model to decompose the variation due to schools. The following expression gives the functional form of this model:

$y_{ij} = \alpha_{j} + \varepsilon_{ij}$
$\alpha_{j} = \alpha_0 + u_{j}$

This model allows decomposing the source of total variation into two parts: the one due to students (within schools) and the one due to schools (test score between all of the schools). The R code to fit this null multilevel model to data is given next:

HLM0 <- lmer(Score ~ (1 | ID), data = test)
# 96% - Between-schools variance
# 4% - Within-schools variance
100 * 87346 / (87346 + 1931757)

By examining the estimates of the random effects' variance we note that the percentage of variation accounted for the school (intraclass correlation) is nearly 96%; while the proportion of variance accounted for students is almost 4%. The null model claims that high achievers belong to particular schools and low achievers are not part of those specific schools. In other words, the school determines the test outcome.

The previous unconditional model fails to include relevant variables, such as the socio-economic status (SES) associated to every student. The following expressions give a more elaborated model involving random intercepts and random slopes for each of the schools:

$y_{ij} = \alpha_{j} + \beta_{j} * SES_{ij} + \varepsilon_{ij}$
$\alpha_{j} = \alpha_0 + u_{j}$
$\beta{j} = \beta_0 + v_{j}$

The R code that allows fitting this model is shown below.

HLM1 <- lmer(Score ~ SES + (SES | ID), data = test)
# 1% - BS variance
# 99% - WS variance
100 * 40400.24 / (40400.24 + 257.09 + 1.65)
# Percentage of variation explained by SES between schools
1 - ((257.09 + 1.65) / 1931757)
# Percentage of variation explained by SES within schools
1 - (40400.24 / 87346)

On the one hand, we note that the SES explains 99% of the between-schools variance; on the other hand, the SES explains 53% of the within-schools variance. What does it mean? School segregation. That is, wealthy students belong to rich schools, and poor students belong to poor schools. Moreover, affluent students far exceed the performance of poor students.

Tuesday, October 4, 2016

#PredictiveCOL - Forecasting Colombia's peace plebiscite (final update)

For sure, this is the more exciting forecast I have ever done. On one hand, I am Colombian guy, and I really want to live in a peaceful country, and I do want a better place for raising my children. On the other hand, I am very serious when it comes to forecasting.

Maybe you have read this blog before, and you have realized that I love predictions cuz, as a data scientist, you can use your own set of statistical methodologies that relates to voting intention and turnout. If you think genuinely, everything counts while forecasting this kind of processes. For example, you may weekly track people's security perception, how much the president is disliked, how many minutes newscast spend about the peace process, how many articles or opinion columns are written, how many tweets are sent per week about FARC, everything, everything counts and matter. Let us name these variables as contextual variables

Of course, polls and pollsters are also important. They are the only proxy of voting intention. However, polls get old, and they should be updated with new data; also some pollsters are not as confident as others (I really do not believe everything some pollsters claim), and sample size also matters cuz it improves the sampling error. 

Remember that this plebiscite has two options for Colombians to choose. The first option is Yes, that goes for Yes, I approve the agreement between Farc and government. The second option is No, meaning No, I don't support that agreement. With those topics in mind, let me introduce you my (bayesian) predictions for the Colombian plebiscite. Firstly, let me exhibit the trend that the two options have shown over time:
As you can see, we have deflated the data by undecided voters. Note that we extracted the signal (bold lines) from the noise generated by polls. The following graph shows the less robust prediction based only on what polls have estimated.

Now, a more reasonable forecast based on prior information (2014 presidential elections and 2014 legislative elections) that tries to explain the outcomes of the polls by modeling the extracted signal (see trends in previous graphic) with contextual variables. After the model is fitted, we use a Bayesian setup that relates the prior information with the estimated response from this model. The following chart shows this forecast.
Ok, it is clear that Colombian people will support this peace process. However, this election is legitimate if and only if Yes voter turnout is greater than 4.4 million. By using a similar Bayesian methodology, next graph shows the predicted turnout.
Also, we used a small area modeling to forecast the response of every Colombian department (equivalent to a state in the US). The following map shows that the majority of departments are supporting the agreement between FARC and government. However, there are some of them that will vote No. Dark areas are not supporting the deal, while light-gray areas will do support the agreement.
Finally, the posterior probability that Yes defeats No is 98.8%.

Thursday, August 18, 2016

My talk in Bogotá - Pvalues: use and abuse

Did you know that American Statistical Association (ASA) made a disclaimer about the proper use of p-values? Moreover, ASA (the oldest scientific association in the USA) claimed that:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

With such a disclaimer, from my statistical perspective I gave a talk at ICFES about the use (misuse and abuse) of p-values and how to face this new reality. In the end, I consider that p-values and hypothesis testing have several disadvantages in this information era. With Petabytes of data generated every day, sample sizes influence directly on p-values, and decisions taken from this perspective may be misleading.

Hope you enjoy these slides.

Tuesday, August 9, 2016

My talk in Sincelejo - Small area estimation and multiple imputation

My second talk in Sincelejo deals with fascinating topics: Statistical methods in education in conjunction with sample surveys and small area estimation. I am leading this research since last year, and the results are very consistent.

Maybe the future of the assessment of education will rest on this kind of predictive methods. In fact, the use of plausible values trough multiple imputations is (sort of) a  predictive method. That technique assumes that items not answered (by design) by students could be imputed with the help of auxiliary variables. We use small area methods to predict the performance of provinces in mathematic tests. 

My talk in Sincelejo - Sample size for estimating population variances

Colombian Statistical Symposium has been held this August in Sincelejo. I will be presenting this talk about sample size and population variances. The discussion rest on a recent paper we wrote for Revista Comunicaciones en Estadística.

Besides, we not only approached the theory but also wrote some R functions to compute proper sample sizes in various scenarios. These codes are part of the samplesize4surveys R package.

Monday, July 18, 2016

Voting intention and calibration estimators - My article in CJS

During the last few years, I've been very interested in electoral studies. If you have been a reader of this blog, maybe you could remind that I predicted, some years ago, that Santos was going to win the presidential elections in Colombia. From that very election (Zuluaga won in the first round, while Santos won in the soon round), I learned that estimating voting intention is not a matter of sample size. Moreover, is not an issue of sampling. Elections are a very dynamic process that takes into account so many factors. However, in this paper, we strived to relate a sampling technique called "sampling calibration" with "electoral studies." So, if you are a social researcher on political issues, and if you are interested in predicting voting intention, this paper published in the Colombian Journal of Statistics is just for you.

Here is the abstract: In this paper, the calibration approach is revisited to allow new calibration weights that are subject to the restriction of multiple calibration equations on a vector of ratios, means, and proportions. The classical approach is extended in such a way that the calibration equations are not based on a vector of totals, but on a vector of other nonlinear parameters. We stated some properties of the resulting estimators and carried out some empirical simulations to assess the performance of this approach. We found that this methodology is suitable for some practical situations like vote intention estimation, estimation of the labor force, and retrospective studies. The method is applied in the context of the Presidential elections held in Colombia in 2014 for which we estimated the vote intention in the second round using information from an election poll, taking the results from the first round as auxiliary information.

Here is the link to the entire document.

Friday, July 8, 2016

Isolating confounding effects - Rankings and residuals

In a previous entry, we talked about the meaning and importance of isolating confounding variables. This entry is dedicated to the residuals and its relation to the variable of interest when controlling for some confounding factors.

Let's think about education. This example is always a good illustration to understand this issue. Assume that the performance of students on a standardized test is only related to Socio-Economic Status (SES) and the Ability of students. This way, consider the following model as a description of this matter:

$y = \beta_0 + \beta_1 \times SES + \beta_2 \times Ability + \varepsilon$

The previous model claims that a wealthy student tends to perform better in the test, besides a student with high ability will perform well. Note that Ability is a latent variable that is impossible to observe directly. However, if we control for SES, the residual term of the regression will represent the Ability of the student. This way, a model that we can fit in real life is given by:

$y = \beta_0 + \beta_1 \times SES + \epsilon = \hat{y} + \epsilon$

Note that the term $\hat{y}$ renders the performance of the student in the test explained by SES, while the term $\epsilon$ represents the performance of the student explained by unobservable variables such as Ability.

For instance, let's assume that we collect data for 20 students including the performance in the test along with the SES. As the Ability is an unobservable variable, we only can adjust a model relating performance and SES. The following chart shows the ranking of students based on the test-score.

Plot zoom png

This way, we can conclude that, from a test-based point of view, the student who performed better was Student K, and the one who performed worse were Student O. On the other hand, Student K is the wealthier one, and Students O is quite poor. This relation is very common in education: rich people perform better than poor people. However, when we explore the residuals' behavior, we note that when plotting them against the Score, there is some linear trend that remains.

Plot zoom png 1

That fact tells us two things: 1) the predicted values of the Score lacks the variable Ability and 2) The residuals do capture the behavior of that very variable. This way, if one would want to make a ranking (of the performance on the test) not based on SES but Ability, you must look to residuals.

Plot zoom png 2

When deflacting by SES, things change. Now, the best one is not Student K (wealthier), but Students A (middle class). Besides that, the worst is not student O (poor), but Student T. However, we can claim that students K and O are not so far from the initial positions.

Here is the R code I applied to obtain the plots of this post.  

rm(list = ls())
N <- 20
SES <- runif(N, 20, 80)
Ability <- runif(N, 0, 100)
Score <- 100 + 1 * Ability + 2 * SES + rnorm(N)
Schools <- data.frame(ID = ID, Score = Score,
                      SES = SES, Ability = Ability)
ggplot(Schools, aes(SES, Score, label = ID)) + geom_point() +
  geom_text(vjust = 0, nudge_y = 3)
### Modelo Observable ###
fit <- lm(Score ~ SES, data = Schools)
Schools$Resid <- residuals(fit)
# Los residuales están capturando el patrón de Ability
ggplot(Schools, aes(Score, Resid, label = ID)) + geom_point() +
  geom_text(vjust = 0, nudge_y = 3)
ggplot(Schools, aes(Ability, Resid, label = ID)) + geom_point() +
  geom_text(vjust = 0, nudge_y = 3)
# Correlación bastante alta
cor(Schools$Ability, Schools$Resid)

Saturday, June 11, 2016

Isolating confounding effects - What does it mean?

Have you heard (or read) to those geeks (econometricians and statisticians) talking about controlling for some variable in a study? The origin of this practice lies on the design of experiments: when you plan an experimental study, you try to randomise units according to categories that you know that are important.

For example, let’s assume that you want to assess the effect of some treatment over some population of humans. You know that the age of the individual is important when it comes to prove the effectivity of the treatment. Then, your experiment must be carried out by means of randomisation over age categories; that is you define, say, 2 categories, then you fit your model and everything is alright. Isn’t it? Let’s take a look at some scenarios. First, if you stratified by age and then you drew a sample for every stratum, then your model must look like this:

$$y_i = \beta_1D_{1i} + \beta_2D_{2i} + \varepsilon_i$$

Where $D_{1i}=1-D_{2i}$ and $D_{1i}$ is defined to be a dummy variable regarding the membership of unit $i$ to the first stratum. Now, if you didn’t stratify (because of you ignored that this specific treatment provides differential result by age category), then your model should look like this:

$$y_i = \beta_0 + \varepsilon_i$$

It is straightforward to note that when you fit the first model, the estimated response is $\hat{y}_1$, if unit  belongs to stratum 1, or $\hat{y}_2$, if unit belongs to stratum 2. However, when you fit the second model, the estimated response is always $\hat{y}$. 

So, once you have fitted these two models, you realise that the relationship between residuals and age category is not the same. That is, in the second model, the variable age has a strong effect over the residuals. Why? Because of the effect of this variable is absorbed by residuals; in other words your model didn’t take into account the effect of this variable; that is to say, your model is not controlling for the effect of this variable. 

Plot zoom png

Here I present some R code useful to reproduce the previous plots.
y[D1 == 1] <- rnorm(sum(D1), 10, 1)
y[D2 == 1] <- rnorm(sum(D2), 5, 1)
e1 <- resid(lm(y ~ 0 + D1 + D2))
e2 <- resid(lm(y ~ 1))
D1 <- as.factor(D1)
datos <- data.frame(y = y, D = D1, e1 = e1, e2 = e2)
p1 <- ggplot(data = datos, aes(D, e1)) + geom_boxplot() +
  ggtitle("Distribution of residuals for model 1")
p2 <- ggplot(data = datos, aes(D, e2)) + geom_boxplot() +
  ggtitle("Distribution of residuals for model 2")
grid.arrange(p1, p2, nrow = 1)
In addition, when it comes to causal inference or more sophisticated studies, isolating confounding variables is very  important, because you want a well formulated model with correct and precise estimated effects. For example, let’s assume that the actual relation between the response variable and a causal variable $D$ is given by

$$y_i = \beta_0 + \beta_1D_{i} +\beta_2 x_i + \varepsilon_i$$

Where $x$ is a covariate. Then, if you control for the effect of that covariate, the estimate of $\beta_1$ (the causal effect) will be precise as you can see in the next code:
N <- 1000
D <- rbinom(N, 1, 0.3)
x <- runif(N, 1, 100)
e <- rnorm(N, 0, 10)
y <- 10 + 5 * D + 3 * x + e
plot(x, y)
m1 <- lm(y ~ D + x)


lm(formula = y ~ D + x)
(Intercept)            D            x
     10.225        5.431        2.993  

On the other hand, if you do not control for the effect of that covariate, the estimate of $\beta_1$ (the causal effect) will be biased as you can see in the next code:

m2 <- lm(y ~ D)
lm(formula = y ~ D)
(Intercept)           D1
    160.200        8.967

Finally, note that the residuals of both models have different behaviours, the correlation of the covariate with the residuals of the right model is null, while the correlation of the covariate with the residuals of the wrong model is very large

Plot zoom png 1