Thursday, June 15, 2017

Sampling weights and multilevel modeling in R

So many things have been said about weighting, but in my personal view of statistical inference, you do have to weight. From a single statistic to a complex model, you have to weight, because the probability measure that induces the variation of the sample comes from an (almost always) complex sampling design that you should not ignore. Weighting is a complex issue that has been discussed by several authors in recent years, and social researchers have not reached consensus about the appropriateness of weighting when fitting statistical models. Angrist and Pischke (2009, p. 91) claim that few things are as confusing to applied researchers as the role of sample weights: "Even now, 20 years post-Ph.D., we read the section of the Stata manual on weighting with some dismay."

Anyway, even though researchers have not reached consensus on when to weight, the reality is that you have to be careful when doing so. For example, when it comes to estimating totals, means or proportions, you can use the inverse of the inclusion probability as the weight, and it seems that every social researcher agrees to weight when estimating these kinds of descriptive statistics. The rationale behind this practice is that every unit belonging to the sample represents itself and many others that were not selected in the sample.
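As a quick illustration of that rationale, here is a minimal sketch with made-up numbers (both the observed values and the inclusion probabilities are hypothetical):

y  <- c(12, 35, 8, 20)          # observed values for four sampled units (made up)
pk <- c(0.10, 0.25, 0.05, 0.20) # their inclusion probabilities (made up)
wk <- 1 / pk                    # each sampled unit "represents" wk population units

sum(wk * y)                     # Horvitz-Thompson estimate of the population total
sum(wk * y) / sum(wk)           # Hajek-type estimate of the population mean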

When using weights to estimate model parameters, you have to keep in mind the nature of the sampling design. For example, when it comes to estimating multilevel parameters, you have to take into account not only the weights of the final sampling units but also the weights of the primary sampling units. Let's assume that you have a sample of students selected from a national frame of schools. Then we have two sets of weights: the first one regarding schools (notice that a selected school represents itself as well as others not in the sample) and the second one regarding students.

Let's assume that the finite population contains 10,000 students in 40 schools, and that you have selected 500 students allocated in 8 schools. For the sake of simplicity, let's suppose that simple random sampling is used to select the students (I know, this kind of sampling design is rarely used). Now, if you take into account only the students' weights when fitting your multilevel model, you will be estimating parameters with an expanded sample that represents 10,000 students allocated in a sample of just eight schools, so any conclusion you draw will be wrong. For example, when performing a simple analysis of variance, the percentage of variance explained by the schools will be extremely low, because you are expanding the sample of students but not the sample of schools. If instead you take into account both sets of weights (students and schools), you will find yourself fitting a model with expanded samples that represent 10,000 students and 40 schools (which is what you want).

Unfortunately, as far as I know, the R toolbox lacks a package that performs this kind of design-based inference when fitting multilevel models. So, right now, we can estimate model parameters without bias, but when it comes to estimating standard errors (from a design-based perspective) we need to resort to other computational techniques such as bootstrapping or the jackknife. Because they rely on the assumption of independence, most applied statistical methods cannot be used to analyze this kind of data directly, due to the dependency among sampled observation units, and inaccurate standard errors may be produced if no adjustment is made when analyzing complex survey data.

When it comes to educational studies (based on large-scale assessments), we can distinguish (at least) four sets of weights: the total student weight, the student house weight, the student senate weight and the school weight. The TIMSS team claims that the total student weight is appropriate for single-level, student-level analyses. The student house weight, also called the normalized weight, is used when analyses are sensitive to sample size; it is essentially a linear transformation of the total student weight so that the weights sum to the sample size. The student senate weight is used when analyses involve more than one country, because it is the total student weight scaled in such a way that all students' senate weights sum to 500 (or 1000) in each country. The school weight should be used when analyzing school-level data, as it is the inverse of the selection probability of the selected school.
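Since the house and senate weights are just rescalings of the total student weight, they are easy to derive by hand. Here is a minimal sketch, assuming a data frame students that holds one country's records with the total weight stored in W_FSTUWT (the column names are illustrative and follow the PISA convention used below; 500 is the target constant quoted above):

n <- nrow(students)
students$W_HOUSEWHT  <- n   * students$W_FSTUWT / sum(students$W_FSTUWT)  # sums to n
students$W_SENATEWHT <- 500 * students$W_FSTUWT / sum(students$W_FSTUWT)  # sums to 500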

R workshop

We will use the student house weight to fit a multilevel model. As stated before, these weights sum to the sample size. For the R workshop, we will use PISA 2012 data (available on the OECD website). I have filtered the data for the Colombian case and saved it in a format directly compatible with R (available here). Let's load the data into R.

rm(list = ls())
library(dplyr)
library(ggplot2)
library(lme4)

setwd("/your working directory")
load("PisaCol.RData")

head(names(PisaCol))
summary(PisaCol$STRATUM)

Now, we create an object containing the student house weights and summarize some results based on that set of weights. Notice that the total student weights are stored in the column W_FSTUWT of the PISA database. I remind you that I am working with the first plausible value of the mathematics test, and that score will be our (dependent) variable of interest for the modeling.

n <- nrow(PisaCol)
PisaCol$W_HOUSEWHT <- n * PisaCol$W_FSTUWT / sum(PisaCol$W_FSTUWT)

PisaCol %>%
  group_by(STRATUM) %>%
  summarise(avg1 = weighted.mean(PV1MATH, w = W_HOUSEWHT),
            avg2 = weighted.mean(PV2MATH, w = W_HOUSEWHT))

We use the function lmer of the lme4 package to estimate the coefficients of the null model (where schools enter as random effects).

##################
### Null model ###
##################

HLM0 <- lmer(PV1MATH ~ (1 | SCHOOLID), data = PisaCol,
             weights = W_HOUSEWHT)
coef(HLM0)
summary(HLM0)

# 62.81% of the variance is due to students
# 37.19% of the variance is due to schools
100 * 3569 / (3569 + 2113)
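As mentioned earlier, lmer reports model-based standard errors, so a design-based assessment requires some resampling. A rough and computationally naive sketch of the idea, resampling schools (the primary sampling units) with replacement and refitting the weighted model each time, is shown below; it is purely illustrative, not a validated replication method.

B <- 50   # number of bootstrap replicates (kept small for illustration)
boot_means <- replicate(B, {
  ids <- sample(unique(PisaCol$SCHOOLID), replace = TRUE)
  boot_data <- do.call(rbind, lapply(ids, function(id) subset(PisaCol, SCHOOLID == id)))
  fixef(lmer(PV1MATH ~ (1 | SCHOOLID), data = boot_data, weights = W_HOUSEWHT))
})
sd(boot_means)   # bootstrap standard error of the overall weighted mean score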

As you may know, the PISA index of economic, social and cultural status has a strong relationship to student achievement, so it is a good idea to control for this variable in a more refined model. 

#################
### ESCS model ###
#################

HLM1 <- lmer(PV1MATH ~ ESCS + (1 + ESCS | SCHOOLID), data = PisaCol,
             weights = W_HOUSEWHT)
coef(HLM1)
summary(HLM1)

# After controlling for ESCS, 34.58% of the variance is due to schools
100 * (96.12 + 1697.36) / (3392.58 + 96.12 + 1697.36)

So, in summary: we have 3569 units of within-school variance (63%); after controlling for ESCS, that figure drops to 3392 units (student background explains 5% of that variation). We have 2113 units (37%) of between-school variance; after controlling for ESCS, that figure drops to 1793 (student background explains 15% of that variation). The following code makes a graph that summarizes the relationship between student achievement and ESCS.

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
  theme_minimal() + geom_point() + theme(legend.position="none")

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
  geom_point(aes(colour = SCHOOLID)) + theme(legend.position="none")

Sunday, April 16, 2017

Small Area Estimation 101

Small area estimation (SAE) has become a widely used technique in official statistics since the last decade of the past century. When the sample size is not large enough to provide reliable estimates at a very particular level, the power of models and auxiliary information must be applied without hesitation. In a nutshell, SAE tries to exploit similarity and borrow strength from the available information.

I will write some posts to present, step by step, the fundamentals of SAE and how it can be implemented in R. The first post (the one you are reading now) is about basic concepts such as sampling and databases; the second and third posts will deal with direct and indirect estimates, respectively; the fourth post will introduce model-assisted estimation; and finally, the fifth post will deal with the Fay-Herriot method.

Let's begin. First of all, the Australian Bureau of Statistics states that small area estimation refers to methods of producing sufficiently reliable estimates for geographic areas that are too fine to estimate with adequate precision using direct survey estimation methods. By direct estimation, we mean classical design-based survey estimation methods that use only the sample units contained in each small area. Small area estimation methods are used to overcome the problem of small sample sizes and to produce small area estimates that improve the quality of the direct survey estimates obtained from the sample in each small area. The more sophisticated of these methods work by taking advantage of various relationships in the data, and involve, either implicitly or explicitly, a statistical model describing these relationships.

Now, I want to reproduce a clarifying explanation from Dr. Little (paraphrasing Groves) about SAE and its use in survey sampling. They claim that regression estimates provide relatively precise predictions for small areas from a survey; these predictions account for the differences between areas in the characteristics included as predictors, but not for differences in characteristics not included in the study. Direct estimates for each area are unique to the area and hence take into account both observed and unobserved relevant characteristics; however, they have low precision in areas where the sample size is small. The SAE model combines the regression estimate and the direct estimate for each area in a sensible way, balancing bias and precision.

For this technique to succeed, Longford (2005) claims that areas that are known to be similar to one another should receive similar estimates, rather than estimates independent of one another. The degree to which similarity can or should be imposed can be chosen on statistical grounds by minimizing the overall discrepancy (mean squared error).

Rahman (2008) emphasizes that SAE uses data from similar domains to estimate the statistics in a particular small area of interest, and this 'borrowing of strength' is justified by assuming a model that relates the small area statistics. SAE is the process of using statistical models to link survey outcome or response variables to a set of predictor variables known for small areas, in order to predict small area-level estimates. Traditional area-specific estimates may not provide enough statistical precision because of the small number of sample observations in small geographical regions. In such situations, it may be worth checking whether it is possible to use indirect estimation approaches based on linking models.

R workshop

This code will not produce any small area estimates, but it will help to introduce basic concepts of sampling. We will use the BigLucy database from the TeachingSampling package to illustrate how to obtain a probabilistic sample from a finite population. This database deals with some economic variables for a population of 85296 companies spread out over 100 counties (areas) of a fake country in a particular year. The aim of the exercise is 1) to select a stratified sample according to the size of the companies and 2) to obtain accurate estimates of the total income within each of the 100 counties. Note that the parameters of interest are given by:

$$t_{y,d} = \sum_{k \in U_d} y_k$$

where $t_{y,d}$ denotes the total income of the $d$-th county, $y_k$ is the income of the $k$-th company belonging to the $d$-th county, and $U_d$ denotes the whole population of companies in that county. Thus, the following code computes the total income for each county along with the number of companies belonging to each county. Finally, these parameters are saved in a new database named Results.

#############################
##### Setting things up #####
#############################

setwd("/wherever your preferred location is")

rm(list = ls())
set.seed(2017)
library(TeachingSampling)
library(dplyr)
data("BigLucy")

summary(BigLucy$Level)
levels(BigLucy$Zone)

Total <- BigLucy %>%
  group_by(Zone) %>%
  summarise(Income. = sum(Income)) %>%
  arrange(Zone)

N <- BigLucy %>%
  group_by(Zone) %>%
  summarise(N.county = n()) %>%
  arrange(Zone)

Results <- data.frame(N, Total$Income.)

#Checking the population total
(Total <- sum(Results$Total.Income.))

Now, once the parameters of the finite population have been computed, the time has come for us to select a sample. The sampling design we will use is stratified sampling, taking advantage of the classification of each company in the database according to its level: small, medium and large. The selected sample will be stored in an object named data.sam, and the sampling weights will be saved in a column of that data frame called FEX.

#######################################
##### Drawing a stratified sample #####
#######################################

# Level is the stratifying variable
summary(BigLucy$Level)
attach(BigLucy)
# Defines the size of each stratum
N1 <- summary(Level)[[1]]
N2 <- summary(Level)[[2]]
N3 <- summary(Level)[[3]]
N1;N2;N3
Nh <- c(N1,N2,N3)
# Defines the sample size at each stratum

n1 <- round(N1 * 0.05)
n2 <- round(N2 * 0.05)
n3 <- round(N3 * 0.05)
(nh<-c(n1,n2,n3))
# Draws a stratified sample
sam <- S.STSI(Level, Nh, nh)

data.sam <- BigLucy[sam,]

data.sam$FEX <- NULL
data.sam$FEX[data.sam$Level == "Big"] <- Nh[1] / nh[1]
data.sam$FEX[data.sam$Level == "Medium"] <- Nh[2] / nh[2]
data.sam$FEX[data.sam$Level == "Small"] <- Nh[3] / nh[3]

dim(data.sam)
save(data.sam, file = "data.sam.RData")

In summary, we have drawn a sample of 4265 companies: 145 big companies, 1290 medium companies and 2830 small companies. These companies are spread out over the 100 counties of the country. The Results database will store the information about those counties (how many companies are in each county and how many of them are in the sample) along with different estimates of the total income of each county.

######################
##### In summary #####
######################

n <- data.sam %>%
  group_by(Zone) %>%
  summarise(n.county = n()) %>%
  arrange(Zone)

Results$n.county <- n$n.county
Results <- Results[c("Zone", "N.county", "n.county", "Total.Income.")]

save(Results, file = "Results.RData")

In the following post, we will use this sample to estimate the total income for each of the hundred zones in BigLucy’s population. 

 

Saturday, January 28, 2017

Regression to the mean (or, in the end, people are not as smart as you might expect)

Francis Galton very cleverly coined the term "regression to (or towards) the mean", meaning that if a variable takes an extreme value in a first measurement, the following observed values of that same variable will tend to be closer to the average of its distribution. The classical example is height: a tall child will have (on average) parents less tall than himself. Likewise, extremely short parents tend to have children who are still shorter than average but, in both cases, the children tend to be closer to the mean than were their parents (Senn, 2011).

OK, this story should be widely known by the readers of this blog. However, I want to put forward another point of view. It comes from Rolf Tarrach (former president of the University of Luxembourg), who has written a book on logical reasoning entitled <<The Pleasure of Deciding>>. He claims that regression to the mean is a phenomenon that occurs not only in body measurements but also in cognitive measurements. That is, smart parents will tend to have children who are not as smart as expected.

So if you consider yourself an intelligent person and you have decided to share your life with a smart mate, it is very likely that your children won't be smarter than the two of you. So, do not expect your children to be geniuses. That fact goes against the common idea of demanding that children of smart parents be even more intelligent. Of course, there are some exceptions, such as the Bach family or the Bernoulli family, but those are isolated deviations from the normality of real life.

I want to finish with this story about the playwright George Bernard Shaw and the dancer Isadora Duncan. She told him: "Would it not be wonderful if we could have a child who had your brains and my beauty?" He replied: "Yes, but suppose the child had your brains and my beauty!"

PS: Speaking of geniuses, Koenker (1998) claims that Galton not only managed to invent regression in a single plot but also bivariate kernel density estimation.

Sunday, January 15, 2017

Multilevel regression with poststratification (Gelman's MrP) in R - What is this all about?

Multilevel regression with poststratification (MrP) is a useful technique for predicting a parameter of interest within small domains by modeling the mean of the variable of interest conditional on poststratification counts. This method was first proposed by Gelman and Little (1997) and is widely used in political science, where voting intention is modeled conditional on the interaction of classification variables.

The aim of this methodology is to provide reliable estimates for strata based on census counts. For those who have some background in survey sampling, this method should look very similar to raking, where sampling weights are adjusted to known census cell counts. However, a significant difference from raking is that MrP is a model-based approach rather than a design-based one. This way, even in the presence of a (maybe complex) survey design, MrP does not take it into account for inference; in other words, the sampling design is considered ignorable. So, the probability measure that governs the whole inference comes from modeling the voting intention (the variable of interest) as a function of demographic categories (the auxiliary variables).

Is ignoring the complex survey design a major technical issue? Yes, because in any case a probability measure was used to draw the sample. However, when it comes to voting intention (a major area where this technique is used), we rarely find a sophisticated complex design. Moreover, these kinds of studies are rarely based on probabilistic polls. So, if the survey lacks a proper sampling plan, it is always better to model the response variable.

Therefore, the ultimate goal of this technique is to estimate a parameter of interest (totals, means, proportions, etc.) for all of the strata (domains, categories or subgroups) in a finite population. From now on, let's assume that:

  1. a population is divided into $H$ strata of interest (for example states),
  2. the parameters of interest are the means (same rules apply for proportions) in each strata $\theta_h$ ($h=1, \ldots, H$), 
  3. every stratum $h$ is cross-classified by some demographics into post-strata $j \in h$, and every population count $N_j$ is known, and
  4. all of the population means $\mu_j$ can be estimated by using some statistical technique, such as multilevel regression.

This way, the mean in stratum $h$ ($\theta_h$), is defined as a function of means in post-strata $j$ ($\mu_j$), and post-strata counts ($N_j$):

$$\theta_h = \frac{\sum_{j \in h} N_j \mu_j }{\sum_{j \in h} N_j}$$
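As a toy numerical check of this formula, take one stratum with three post-strata (the counts and means below are rounded values, roughly matching one of the zones in the workshop further down):

Nj  <- c(30, 180, 97)           # known post-strata counts
muj <- c(1265.6, 686.4, 371.7)  # estimated post-strata means
sum(Nj * muj) / sum(Nj)         # theta_h, the post-stratified mean for that stratum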

The first part of MrP is defined by a multilevel regression (MR). This kind of model is a particular case of mixed effects models. The core components of this phase are:

  • the variable of interest (for example, voting intention), 
  • the auxiliary variables (classification on census demographic cells) and, 
  • the random effects (strata of interest, that usually are states or counties).

The second part of MrP is cell poststratification (P), and the predicted response variable is aggregated into states and adjusted by corresponding poststratification weights.

R workshop

Let’s consider the Lucy database of the TeachingSampling package. This population contains some economic variables of 2396 industrial companies in a particular year. Assume that we want to estimate the mean income of industries by each of the five existing zones (strata of interest) on that database. This way, our parameters of interest are $\theta_1, \ldots, \theta_5$.

Now, we also know that the population is divided into three levels (small, medium and big industries) and we have access to the total number of industries within each cross group. That is, we know exactly how many small industries are on each of the five zones, and how many medium industries are on each of the five zones, and so on. 

The following code shows how to load the database and obtain the cell counts. 

> rm(list = ls())
> set.seed(123)
>
> library(TeachingSampling)
> library(dplyr)
> library(lme4)
>
> data("Lucy")
> # Number of industries per level
> table(Lucy$Level)

   Big Medium  Small
    83    737   1576
> # Number of industries per zone
> table(Lucy$Zone)

  A   B   C   D   E
307 727 974 223 165
> # Size of post-strata
> (Np <- table(Lucy$Level, Lucy$Zone))
        
           A   B   C   D   E
  Big     30  13   1  16  23
  Medium 180 121 111 187 138
  Small   97 593 862  20   4

Of course, this technique operates on a selected sample. That's why we are going to select a random sample of size $n = 1000$. We can also create some tables showing the counts in the sample.

> # A sample is selected
> SLucy <- sample_n(Lucy, size = 1000)
> table(SLucy$Level)

   Big Medium  Small
    33    280    687
> table(SLucy$Zone)

  A   B   C   D   E
130 295 426  86  63

The first step of MrP is the multilevel regression (MR), used to estimate the post-strata means. The following code shows how to estimate them by using the lmer function. The object Mupred contains the corresponding $\mu_j$, with one row per level and one column per stratum (zone).

> # Step 1: <<MR>> - Multilevel regression
> M1 <- lmer(Income ~ Level + (1 | Zone), data = SLucy)
> coef(M1)
$Zone
  (Intercept) LevelMedium LevelSmall
A    1265.596   -579.1851  -893.8958
B    1138.337   -579.1851  -893.8958
C    1189.285   -579.1851  -893.8958
D    1248.658   -579.1851  -893.8958
E    1284.322   -579.1851  -893.8958

attr(,"class")
[1] "coef.mer"
> SLucy$Pred <- predict(M1)
>
> # Summary
> grouped <- group_by(SLucy, Zone, Level)
> sum <- summarise(grouped, mean2 = mean(Pred))
> (Mupred <- matrix(sum$mean2, ncol = 5, nrow = 3))
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1265.5959 1138.3370 1189.2852 1248.6575 1284.3224
[2,]  686.4107  559.1518  610.1000  669.4724  705.1373
[3,]  371.7001  244.4412  295.3894  354.7618  390.4267

Now that we have estimated the post-strata means, it's time to weight each post-stratum by its corresponding population count in order to obtain an estimate of the mean income by zone. As we know each post-stratum size, we simply combine the cell counts and the cell means to obtain the MrP estimator of the parameters of interest.

> # Step 2: <<P>> - Post-stratification
> # Mean income estimation per zone
> colSums(Np * Mupred) / table(Lucy$Zone)

       A        B        C        D        E
643.5724 312.8052 332.1726 682.8031 778.2428

How accurate was the estimation? Quite good, when compared to the true parameters of the finite population.

> # True parameters
> aggregate(Lucy$Income, by = list(Lucy$Zone), FUN = mean)
  Group.1        x
1       A 652.2834
2       B 320.7469
3       C 331.0195
4       D 684.9821
5       E 767.3879

 

Sunday, January 1, 2017

3PL models viewed through the lens of total probability theorem (updated)

As I am currently the NPM for PISA in Colombia, I must attend several meetings dealing with the proper implementation of this assessment in my country. A few of them are devoted to the analysis of this kind of data (coming from IRT models). As usual, the OECD has hired organizations with high technical standards. The institute that handles all these data and takes part in the analysis is ETS (Educational Testing Service).

PISA and ETS are changing from Rasch models to 2PL models. That involves a significant technical effort to maintain comparability over time. I had the opportunity to talk with some experts from ETS last year, and I asked them the following question: why consider 2PL models instead of 3PL models? Well, the answer was not easy, and I am not pretending to explain it here in detail, in part because I am still convinced that the advantages of 3PL models exceed those of 2PL models. However, they pointed me to a recent paper entitled Is There Need for the 3PL Model? Guess What?

The article was written by Matthias von Davier from ETS. I liked the way von Davier presented 3PL models, as coming from a total-probability setup involving 2PL models. Consider the following hierarchical structure: first, the test-taker decides whether or not to answer the item by guessing; then, if not guessing, he/she uses his/her ability to find the correct response. So, the stochastic process behind this structure can easily be shown in a tree diagram:

[Figure: tree diagram of the guessing / not-guessing structure behind the 3PL model]

 

Remember that 3PL models can be written as:

$$P_{3PL}(x=1) = P(Guess) + P(NoGuess) \times P_{2PL}(x=1|NoGuess)$$

Note that, according to the model, once the student has chosen to answer by guessing, a correct answer is always found (kind of weird, isn't it?). So, a major criticism against 3PL models is related to this last point. In R, you can estimate 3PL models by using the mirt package. For example, using the LSAT7 data and looking at the second item, we can estimate this guessing parameter.

library(mirt)
data <- expand.table(LSAT7)
# Fit a unidimensional 3PL model and report the parameters of the second item
md2 <- mirt(data, 1, itemtype = '3PL')
coef(md2, IRTpars = TRUE)$Item.2

We found that the guessing parameter is estimated as 0.295. This way, the model is specified as:

$$P_{3PL}(x=1) = 0.295 + 0.705 \times P_{2PL}(x=1|NoGuess)$$
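As a toy illustration of that decomposition (the 2PL probability below is a made-up value, not an estimate from the fitted model):

g <- 0.295               # estimated guessing parameter for the second item
p_2pl <- 0.60            # hypothetical 2PL probability at some ability level
g + (1 - g) * p_2pl      # resulting 3PL probability of a correct answer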

PS: Alexander Robitzsch pointed me to a paper (Aitkin, 2006) in which an alternative 3PL model is proposed that aims to address this critique.

Saturday, December 24, 2016

Computing Sample Size for Variance Estimation

The R package samplesize4surveys contains functions that allow you to calculate sample sizes for estimating proportions, means, differences of proportions and even differences between two means. It also allows you to calculate the sampling error and the power level for a fixed sample size.

Here, four functions are introduced for estimating a population variance and for conducting statistical hypothesis tests on this parameter of interest. Their description follows:

  1. Function ss4S2 calculates the sample size for estimating $s^2_{y_U}$ subject to a particular value of the coefficient of variation or of the relative margin of error. Additionally, it offers the user the option of plotting the coefficient of variation and the margin of error as functions of the sample size, to make the decision about $n$ easier.
  2. Function ss4S2H calculates the sample size for estimating $s^2_{y_U}$ subject to a particular power level for detecting a population variance greater than the value set in the null hypothesis. It also offers the user the option of plotting the power level as a function of the sample size.
  3. Function e4S2 calculates the coefficient of variation and the margin of error for a particular sample size. It also produces a plot similar to the one of ss4S2.
  4. Function b4S2 calculates the power level for a fixed sample size. It also produces a plot similar to the one of ss4S2H.

In order to use the above functions, it is necessary to install and load the package, which is available on the Comprehensive R Archive Network (CRAN). To do so, type the following lines in the console:

install.packages("samplesize4surveys")
library(samplesize4surveys)

For example, the following line gives the sample size needed to estimate the variance of a characteristic of interest in a finite population (with a coefficient of kurtosis of one) in order to achieve an estimated coefficient of variation of at most 5% and a relative margin of error of 3%:

ss4S2(N = 10000, K = 1, CV = 0.05, me = 0.03, DEFF = 2, plot = TRUE)
 

[Figure: plot produced by ss4S2, mapping the coefficient of variation and the margin of error against the sample size]

On the other hand, as the package is constantly being updated, the authors maintain a repository where users can try the newest features and interact with the academic community to report possible errors in the code, improve the efficiency of the functions, and so on. To access this development version, type the following lines in R:

library(devtools)
install_github("psirusteam/samplesize4surveys")

In this paper you can find the mathematical background behind those R functions. 

Saturday, December 3, 2016

Highlighting R code for the web

When blogging about statistics and R, it is very useful to differentiate the body text from the R code. I used to handle this by highlighting the code, and Pretty R was a valuable instrument from Revolution Analytics to accomplish this. However, as you may know, Microsoft acquired that company, and this feature (dressing up R code for the web) is not available anymore.

After some searching, I found this online syntax highlighter and it seems to work pretty well. Besides, it allows you to select from different styles, and you can even choose among a lot of computational languages.

How important is that variable?

When modeling any phenomenon by including explanatory variables that are strongly related to the variable of interest, one question arises: which of the auxiliary variables has the greatest influence on the response? I am not writing about significance testing or anything like that. I am just thinking like a researcher who wants to know the ranking of the variables that influence the response and their relative weights.

There are a variety of methods that try to answer that question. The one motivating this post is very simple: remove the measurement units from the variables. Assume a linear model with the following structure (for the sake of simplicity, assume only two explanatory variables):

$$y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

If you assume this model to be true and $\beta_i > 0$, then the influence of variable $x_i$ on the response $y$ can be assessed once the measurement units are removed from the variables. One way is to fit the model on the standardized variables (explanatory and response) and then directly compare the regression coefficients. Another way is by means of the following expression:

$$I(i) = \frac{\beta_i}{sd(\beta_i)} = \beta_i\frac{ sd(x_i)}{sd(y)}$$

For example, let's consider the model $y = -500 x_1 + 50 x_2 + \varepsilon$; then the relative importance of the first and second variables is around 500/(500+50) = 0.9 and 50/(500+50) = 0.1, respectively. The following code shows how to perform this simple analysis in R.

n <- 10000

x1 <- runif(n)
x2 <- runif(n)
y <- -500 * x1 + 50 * x2 + rnorm(n)

model <- lm(y ~ 0 + x1 + x2)

# 1a. Coefficients scaled by their standard errors
summary(model)$coe[, 2]               # the standard errors of the estimates
sd.betas <- summary(model)$coe[, 2]
betas <- model$coefficients
imp <- abs(betas) / sd.betas          # signal-to-noise ratio of each coefficient
imp <- imp / sum(imp)                 # normalized so the importances sum to one
imp

# 1b. Standardized betas
imp1 <- abs(model$coefficients[1] * sd(x1)/sd(y))
imp2 <- abs(model$coefficients[2] * sd(x2)/sd(y))

imp1 / (imp1 + imp2)
imp2 / (imp1 + imp2)

# 2. Standardized variables
model2 <- lm(I(scale(y)) ~ 0 + I(scale(x1)) + I(scale(x2)))
summary(model2)
abs(model2$coefficients)/sum(abs(model2$coefficients))

Sunday, November 20, 2016

Lord's Paradox in R

In an article called A Paradox in the Interpretation of Group Comparisons published in Psychological Bulletin, Lord (1967) made famous the following controversial story:

A university is interested in investigating the effects of the nutritional diet its students consume in the campus restaurant. Various types of data were collected including the weight of each student in the month of January and their weight in the month of June of the same year. The objective of the University is to know if the diet has greater effects on men than on women. This information is analyzed by two statisticians.

The first statistician observes that at the end of the semester (June), the average weight of the men is identical to their average weight at the beginning of the semester (January). This situation also occurs for women. The only difference is that women started the year with a lower average weight (which is obvious from their background). On average, neither men nor women gained or lost weight during the course of the semester. The first statistician concludes that there is no evidence of any significant effect of diet (or any other factor) on student weight. In particular, there is no evidence of any differential effect on both sexes, since no group shows systematic differences.

The second statistician examines the data more carefully. He notes that there is a group of men and women who started the semester with the same weight; this group consists of thin men and overweight women. He observes that those men gained weight relative to the average and those women lost weight relative to the average. The second statistician concludes that, controlling for the initial weight, the university diet has a positive differential effect on men relative to women: for men and women with the same initial weight, on average men gained more weight and women lost more weight.

The following chart shows the reasoning of both statisticians when dealing with the problem. Note that the black line is a 45-degree line, the green points are the data from the men, and the red ones are from the women.

[Figure: scatter plot of initial vs. final weight with a 45-degree line; green points are men, red points are women]

The reasoning of the first statistician focuses on the expectations of both distributions, specifically on the coordinates (x = 60, y = 60) for the women and (x = 70, y = 70) for the men, where the black, red and green lines appear to coincide. The reasoning of the second statistician is restricted to the region induced by the overlap of the red and green dots, specifically the space given by x = (60, 70), y = (60, 70). Suppose we have access to this dataset, as shown in the following illustration, where the first column denotes the initial weight of the student, the second column the final weight, the third column the difference between the two weights, and the last one the sex of the student.

[Figure: first rows of the dataset, with columns for initial weight, final weight, weight difference and sex]

The findings of the first statistician are obtained through a simple regression analysis that, taking the difference between weights as the response variable, yields a regression coefficient equal to zero for the variable sex, which indicates that there are no significant differences in the weight change between men and women.

[Figure: output of the simple regression of the weight difference on sex]

The findings of the second statistician are obtained through an analysis of covariance, taking the final weight as the response variable and sex and the initial weight as covariates. This method yields a regression coefficient of 5.98 for sex, which implies that there is a significant difference in final weight between men and women.

[Figure: output of the covariance analysis of the final weight on sex and initial weight]

For Imbens and Rubin (2015), both are right when it comes to describing the data, although both lack sound reasoning when establishing some kind of causality between the university diet and the loss or gain of weight in the students. Regardless of this, I still find more interesting the analysis that arises from the comparison between men and women who started with the same weight (i.e., all data restricted to x = (60, 70), y = (60, 70)).

R workshop

Lord's paradox summarizes the analysis of two statisticians who study the average weight of some students at a particular university. At the end of the semester (June), the average weight of the men is identical to their average weight at the beginning of that semester (January). This situation also occurs for women. The only difference is that women started the year with a lower average weight (which is evident from their natural build). On average, neither men nor women gained or lost weight during the semester.

To perform the simulation, we assume that both the final weight of the men and that of the women follow a linear relationship with the initial weight. Thus, it is assumed that $y_{2i}^M = \beta_0^M + \beta_1 y_{1i}^M + \varepsilon_i$ for the weight of women, and $y_{2i}^H = \beta_0^H + \beta_1 y_{1i}^H + \varepsilon_i$ for the weight of men, where $y_{1i}^M$ denotes the weight of the $i$-th female at the beginning of the semester and $y_{2i}^M$ her weight at the end of the semester. The notation for men (H) follows the same logic.

Now, note that because of their natural build, men must on average weigh more than women. Suppose that on average the weight of men equals that of women plus a constant $c$. In addition, the mean weight of each group is identical at both time points. Then we have $\bar{y}^M = \beta_0^M + \beta_1 \bar{y}^M$ and $\bar{y}^H = \beta_0^H + \beta_1 \bar{y}^H = \beta_0^H + \beta_1 (\bar{y}^M + c)$. Hence, after some algebra, we have that $\beta_0^M = (1 - \beta_1) \bar{y}^M$ and $\beta_0^H = \bar{y}^H - \beta_1 (\bar{y}^M + c)$.

The following code replicates a set of data that follows the relationship proposed by Lord.

N <- 1000                      # number of students per group
 
b <- 10                        # constant gap between men's and women's weights
l <- 50                        # lower bound of women's initial weights
u <- 70                        # upper bound of women's initial weights
Mujer1 <- runif(N, l, u)       # women's initial weight
Hombre1 <- Mujer1 + b          # men's initial weight
 
beta1 <- 0.4                   # common slope of the final-on-initial relationship
Mujerb0 <- (1 - beta1) * mean(Mujer1)                    # women's intercept
Hombreb0 <- mean(Hombre1) - beta1 * (mean(Mujer1) + b)   # men's intercept
 
sds <- 1                       # residual standard deviation
Mujer2 <- Mujerb0 + beta1 * Mujer1 + rnorm(N, sd = sds)      # women's final weight
Hombre2 <- Hombreb0 + beta1 * Hombre1 + rnorm(N, sd = sds)   # men's final weight

The graph can be done with the following piece of code:

datos <- data.frame(inicio = c(Mujer1, Hombre1), final = c(Mujer2, Hombre2))
datos$dif <- datos$final - datos$inicio
datos$sexo = c(rep(0, N), rep(1, N))
 
library(ggplot2)
ggplot(data = datos, aes(inicio, final, color = factor(sexo))) +
  geom_point() + stat_smooth(method = "lm")  +
  geom_abline(intercept = 0, slope = 1) +
  ggtitle("Paradoja de Lord") + theme_bw()

Monday, November 14, 2016

Intercept or not? That's the question!

My current passion is statistical modeling. While each model requires the researcher to properly contextualize the problem he/she is addressing, which means that no two models are alike, there is a common question that the researcher should answer before estimating the model parameters.

Do I fit the model with an intercept or not?

While seeking goodness of fit, the researcher is often tempted to run automated variable selection procedures (i.e. stepwise, forward, backward). With luck, these methods will hand you "the best model" to choose (based on the highest coefficient of determination, or the lowest AIC, BIC, or DIC). Call me old-fashioned and retrograde, but I have always been a little reluctant to embrace the practice of throwing the data into the software and waiting for the best model to come out automatically.

Returning to the subject of this entry, I will highlight the importance of including or omitting the intercept in a model. For this, I will consider the following cases.

1. If the response variable Y is continuous:

When the explanatory variable X is also continuous. This is the classic case of a linear regression model, where, if the intercept is included, the mean value of Y when X = 0 corresponds to the estimate of the intercept. However, when excluding the intercept, we are forcing the mean value of Y to be 0 when X = 0. Thus the inclusion or exclusion of the intercept depends, in many cases, on the nature and interpretation of the variables.

When the explanatory variable X is categorical. Without loss of generality, let's assume it is dichotomous (two levels); in this case, when fitting a regression including the intercept, one can define a dummy variable representing the first level of X, and the model is set as

$Y_i = \beta_0 + \beta_1D_{1i} + E_i$

where D1 = 1 for units belonging to the first level of X and D1 = 0 for units belonging to the second level of X. In this case, the interpretation of this model is as follows: for units belonging to level 1, the mean value of Y is given by $\beta_0 + \beta_1$; for units belonging to level 2, the mean value of Y is given by $\beta_0$. The coefficient $\beta_1$ is therefore the difference between these two levels. If its estimate is significant, it implies that the variable X does have a considerable influence on Y; that is, the mean value of Y varies significantly across the levels of X.

On the other hand, if the regression is fitted without an intercept, two dummy variables must be created, one for each level of X. The model is formulated as

$Y_i = \beta_0D_{1i} + \beta_1D_{2i} + E_i$

For units at the first level (D1 = 1), the mean value of Y is given by $\beta_0$ and, for units at the second level (D2 = 1), the mean value of Y is given by $\beta_1$. Thus, even if the estimate of either $\beta_0$ or $\beta_1$ is significant, that does not imply that X has any influence over Y. All we can claim in this model is that the two parameters are significantly different from zero. So, if you really want to establish whether X influences Y, omitting the intercept is not a good choice.
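A minimal simulated sketch of the contrast just described (the data and level names are made up):

set.seed(1)
x <- factor(rep(c("level1", "level2"), each = 50))
y <- ifelse(x == "level1", 10, 12) + rnorm(100)

# With intercept: the coefficient of xlevel2 is the difference between the level
# means, and its significance tells us whether X influences Y
summary(lm(y ~ x))

# Without intercept: each coefficient is just a level mean, so their significance
# says nothing about whether the two levels actually differ
summary(lm(y ~ 0 + x))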

2. If the response variable Y is discrete:

When the explanatory variable X is continuous. In this case, the fitted model is a logistic regression, modeling the probability of success (Y = 1) in terms of $p_i = Pr(Y_i = 1)$:

$logit (p_i) = \beta_0 + \beta_1X_i$

If the model includes an intercept, the estimate of $\beta_0$ can be used to estimate the probability of success when X = 0, since in that case $p_i = \frac{\exp(\beta_0)}{1 + \exp(\beta_0)}$. On the other hand, if the estimate of $\beta_1$ is not significant, the values of X do not influence the odds of success or failure of Y. If the estimate of $\beta_1$ is significant with a positive (or negative) value, it indicates that an increase in the variable X implies an increase (or decrease) in the probability of success of Y. Note that this interpretation is the same when the regression is fitted without an intercept.

When the explanatory variable is categorical (again, let's assume it is dichotomous). In this case, when fitting a regression including the intercept, a dummy variable representing the first level of the variable is created and the model is defined as

$logit (p_i) = \beta_0 + \beta_1D_{1i}$

The interpretation of this model is as follows: for units in the first level of X, $logit(p_i) = \beta_0 + \beta_1$. For units in the second level of X, $logit(p_i) = \beta_0$. Thus, if $ \beta_1$ is significant, it indicates that $logit(p_i)$ is different between levels of variable X, and we can conclude that X does have an important influence on Y.

On the other hand, if the intercept is not taken into account, two dummies are produced (representing X levels) and the model is formulated as

$logit (p_i) = \beta_0D_{1i} + \beta_1D_{2i}$

In this setting, the estimates of $\beta_0$ and $\beta_1$ represent the values of $logit(p_i)$ at the two levels of X. Thus, the influence of X on Y cannot be assessed via $\beta_0$ or $\beta_1$; those coefficients give no information about the influence of X on Y.
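The same contrast can be sketched for the logistic case, reusing the factor x from the sketch above (again, the data are made up):

z <- rbinom(100, size = 1, prob = ifelse(x == "level1", 0.3, 0.6))

# With intercept: the coefficient of xlevel2 compares the two levels on the logit scale
summary(glm(z ~ x, family = binomial))

# Without intercept: each coefficient is the logit of the success probability at one
# level, so neither of them measures the influence of X on the response
summary(glm(z ~ 0 + x, family = binomial))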

In summary, we can conclude that when the explanatory variable is continuous, the interpretation of $\beta_1$ does not change whether the intercept is included or excluded. However, when the explanatory variable is discrete, we must consider whether or not the model includes the intercept, since the interpretation of $\beta_1$ changes. Also, if what you want is to know the influence of X on Y, it is necessary to include the intercept, and to put aside (just for a moment) the automated procedures.

Sunday, October 30, 2016

Data Literacy

Once again I have decided to pimp my blog. This time it is a significant change: the name. This blog began a long time ago. It was 2006; I was 22 years old and enrolled in a Master of Science in Statistics. I had plenty of doubts about statistics, data science, and uncertainty (fortunately, some of those questions remain), and I decided to work them out by blogging.

That is how this blog was born, under the name "Apuntes de Estadística." Ten years later, it has become an important space for researchers, teachers, and students who, like myself, want to answer questions about variability and statistics.

Now, after hundreds of posts, a hundred thousand visitors per year, and a migration from Spanish to English and from WordPress to Blogger, it is time for me to revisit the very name of this blog. Lately, the posts are not mere notes anymore; I am not only answering basic questions but discussing modeling, forecasting, data and even the epistemology of statistics.

As a consequence of what I currently do in my job, I am now strongly convinced (more than ever in my life) that the time predicted by H.G. Wells has not arrived yet. Statistical thinking is not part of our culture. It should be ingrained in every one of us, but it is not. I will put in my two cents to build effective citizenship through data literacy: the ability to communicate information from data.

Monday, October 17, 2016

Sublime Text 3: an alternative to RStudio

It was a Saturday morning; I was lecturing to my Item Response Theory class when I decided to run some R scripts to introduce my students to the JAGS syntax and the estimation of parameters in a Bayesian logistic regression setup.

As usual, I opened RStudio because it was my favorite R interface. But when I tried to run a Bayesian sampler with five chains in JAGS, that damn message appeared: <<R Session Aborted - R encountered a fatal error - The session was terminated>>.

[Image: the RStudio fatal error screen]

So there I was, the lecture completely stopped because of RStudio. Knowing about those annoying shortcomings of RStudio, I asked some colleagues about other interfaces for R. The answer was convincing: Sublime Text is an editor designed for people who are serious about programming, not only in R but in any other language.

So, I am just spreading the word about this editor; I am not trying to convince you to abandon RStudio. If you want to give it a try, just follow these simple steps:

  • Of course, you install R (https://www.r-project.org)
  • Install Sublime Text 3 (https://www.sublimetext.com)
  • Cmd + Shift + P, then type Package Control and install it.
  • Cmd + Shift + P, then type "Install Package", look for SublimeREPL and install it.
  • Cmd + Shift + P, then type "Install Package", look for R-Box and install it.
  • Cmd + Shift + P, then type "Install Package", look for SendTextPlus and install it.
  • Cmd + Shift + P, then type "Install Package" look for R-Snippets and install it.
  • Cmd + Shift + P, then type "Set Syntax: R Extended”.
  • Cmd + Shift + P, then type "SendTextPlus: Choose Program" and select R.
  • Enjoy! Write your code and Cmd + Enter to send your lines to R.

You can also send your code to R within Sublime Text by using REPL: Cmd + Shift + P, then type "SublimeREPL: R" and that's it. Please comment on your experience. You can be sure that trying this editor is worthwhile.

Tuesday, October 11, 2016

Multilevel Modeling of Educational Data using R (Part 1)

Linear models fail to accurately recognize the effect of clustering due to intraclass correlation. Some scenarios force you to take into account that units are clustered into subgroups that are themselves nested within larger groups. The typical example is the analysis of standardized tests, where students are grouped in schools and schools are grouped into districts, while districts are part of a whole system (a state or country).

When data come from such a hierarchical structure, the proper way to analyze them is via multilevel modeling (Goldstein, 1995). Among other advantages, multilevel modeling allows you to correctly estimate the relative variation in the test score due to the effect of clustering. In other words, you can decompose the variance into two parts, the one coming from students and the one coming from schools, by fitting a regression model per group. That allows you to make proper inferences about the sources of variation of the outcome measure.

For example, the following graph shows a scatter plot of the test score against the socio-economic status of the students. If you consider the left side of the chart, you do see a linear relationship; however, when you take a closer look, it is evident that the same model does not hold within schools (different colors identify different schools).

[Figure: scatter plot of test score vs. socio-economic status, overall and colored by school]

Let's assume that we want to fit a two-level model: the first level is related to schools, and the second level describes students within schools. The following code simulates such a hierarchical structure for a population of five schools and a hundred students per school. Notice that the test score is somehow related to the socio-economic status (SES) of the student.

rm(list = ls())
library(ggplot2)
library(gridExtra)
library(lme4)
library(sjPlot)
library(dplyr)
set.seed(123)
##### 
N <- 100 #Number of students per school
sigma <- 200
x1 <- runif(N, 10, 40)
x2 <- runif(N, 25, 55)
x3 <- runif(N, 40, 70)
x4 <- runif(N, 55, 85)
x5 <- runif(N, 70, 100)
y1 <- 20 + 0 * x1 + rnorm(N, 0, sigma)
y2 <- 40 + 10 * x2 + rnorm(N, 0, sigma)
y3 <- 60 + 20 * x3 + rnorm(N, 0, sigma)
y4 <- 80 + 30 * x4 + rnorm(N, 0, sigma)
y5 <- 100 + 40 * x5 + rnorm(N, 0, sigma)
ID <- rep(LETTERS[1:5], each = N)
test <- data.frame(SES = c(x1, x2, x3, x4, x5), 
Score = c(y1, y2, y3, y4, y5), ID = ID)
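With the simulated data in hand, a quick sketch of the kind of plot described above can be produced as follows (the pooled fit on the left, the school-specific fits on the right; gridExtra was loaded earlier):

p1 <- ggplot(test, aes(SES, Score)) + geom_point() +
  stat_smooth(method = "lm", se = FALSE)
p2 <- ggplot(test, aes(SES, Score, colour = ID)) + geom_point() +
  stat_smooth(method = "lm", se = FALSE)
grid.arrange(p1, p2, ncol = 2)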

When analyzing data with this type of hierarchical structure, the researcher should always fit a null model first in order to decompose the variation due to schools. The functional form of this model is given by the following expressions:

$y_{ij} = \alpha_{j} + \varepsilon_{ij}$
$\alpha_{j} = \alpha_0 + u_{j}$

This model allows decomposing the source of total variation into two parts: the one due to students (within schools) and the one due to schools (test score between all of the schools). The R code to fit this null multilevel model to data is given next:

HLM0 <- lmer(Score ~ (1 | ID), data = test)
coef(HLM0)
summary(HLM0)
# 96% - Between-schools variance
# 4% - Within-schools variance
100 * 87346 / (87346 + 1931757)

By examining the estimated variances of the random effects, we note that the percentage of variation accounted for by the school (the intraclass correlation) is nearly 96%, while the proportion of variance accounted for by students is barely 4%. The null model claims that high achievers belong to particular schools and that low achievers are not part of those specific schools. In other words, the school determines the test outcome.

The previous unconditional model fails to include relevant variables, such as the socio-economic status (SES) associated with every student. The following expressions give a more elaborate model involving random intercepts and random slopes for each of the schools:

$y_{ij} = \alpha_{j} + \beta_{j} * SES_{ij} + \varepsilon_{ij}$
$\alpha_{j} = \alpha_0 + u_{j}$
$\beta_{j} = \beta_0 + v_{j}$

The R code that allows fitting this model is shown below.

HLM1 <- lmer(Score ~ SES + (SES | ID), data = test)
coef(HLM1)
summary(HLM1)
# 1% - BS variance
# 99% - WS variance
100 * 40400.24 / (40400.24 + 257.09 + 1.65)
# Percentage of variation explained by SES between schools
1 - ((257.09 + 1.65) / 1931757)
# Percentage of variation explained by SES within schools
1 - (40400.24 / 87346)

On the one hand, we note that SES explains 99% of the between-school variance; on the other hand, SES explains 53% of the within-school variance. What does this mean? School segregation. That is, wealthy students attend rich schools, and poor students attend poor schools. Moreover, affluent students far exceed the performance of poor students.

Tuesday, October 4, 2016

#PredictiveCOL - Forecasting Colombia's peace plebiscite (final update)

For sure, this is the most exciting forecast I have ever done. On the one hand, I am a Colombian guy, I really want to live in a peaceful country, and I do want a better place for raising my children. On the other hand, I am very serious when it comes to forecasting.

Maybe you have read this blog before and realized that I love predictions cuz, as a data scientist, you can use your own set of statistical methodologies relating to voting intention and turnout. If you think about it, everything counts when forecasting this kind of process. For example, you may track weekly people's perception of security, how much the president is disliked, how many minutes newscasts spend on the peace process, how many articles or opinion columns are written, how many tweets are sent per week about FARC; everything counts and matters. Let us call these variables contextual variables.

Of course, polls and pollsters are also important. They are the only proxy of voting intention. However, polls get old and should be updated with new data; some pollsters are not as trustworthy as others (I really do not believe everything some pollsters claim), and sample size also matters cuz it reduces the sampling error.

Remember that this plebiscite gives Colombians two options to choose from. The first option is Yes, which stands for "Yes, I approve the agreement between FARC and the government." The second option is No, meaning "No, I don't support that agreement." With those topics in mind, let me introduce my (Bayesian) predictions for the Colombian plebiscite. First, let me exhibit the trend that the two options have shown over time:
[Figure: trend of the Yes and No options over time]
As you can see, we have deflated the data by the undecided voters. Note that we extracted the signal (bold lines) from the noise generated by the polls. The following graph shows a less robust prediction based only on what the polls have estimated.

[Figure: forecast based on the polls alone]
Now, a more reasonable forecast, based on prior information (the 2014 presidential and legislative elections), tries to explain the outcomes of the polls by modeling the extracted signal (see the trends in the previous graphic) with contextual variables. After the model is fitted, we use a Bayesian setup that combines the prior information with the estimated response from this model. The following chart shows this forecast.
[Figure: Bayesian forecast combining prior electoral information with the polls]
OK, it is clear that Colombian people will support this peace process. However, this election is legitimate if and only if the Yes voter turnout is greater than 4.4 million. Using a similar Bayesian methodology, the next graph shows the predicted turnout.
[Figure: predicted turnout]
Also, we used small area modeling to forecast the response of every Colombian department (the equivalent of a state in the US). The following map shows that the majority of departments support the agreement between FARC and the government; however, some of them will vote No. Dark areas do not support the deal, while light-gray areas do support the agreement.
[Map: predicted result by department; dark areas vote No, light-gray areas vote Yes]
Finally, the posterior probability that Yes defeats No is 98.8%.

Thursday, August 18, 2016

My talk in Bogotá - Pvalues: use and abuse

Did you know that the American Statistical Association (ASA) issued a statement about the proper use of p-values? Moreover, the ASA (the oldest scientific association in the USA) claimed that:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

With such a statement in mind, I gave a talk at ICFES, from my statistical perspective, about the use (misuse and abuse) of p-values and how to face this new reality. In the end, I consider that p-values and hypothesis testing have several disadvantages in this information era. With petabytes of data generated every day, sample sizes directly influence p-values, and decisions taken from this perspective may be misleading.
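A tiny simulated illustration of that last point (the effect size of 0.01 is made up and practically negligible): as the sample size grows, the p-value for the same trivial effect becomes arbitrarily small.

set.seed(123)
sapply(c(100, 10000, 1000000), function(n) {
  x <- rnorm(n, mean = 0.01)   # true mean barely different from zero
  t.test(x)$p.value            # p-value for H0: mean = 0
})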

Hope you enjoy these slides.

Tuesday, August 9, 2016

My talk in Sincelejo - Small area estimation and multiple imputation

My second talk in Sincelejo deals with fascinating topics: statistical methods in education, in conjunction with sample surveys and small area estimation. I have been leading this research since last year, and the results are very consistent.

Maybe the future of educational assessment will rest on this kind of predictive method. In fact, the use of plausible values through multiple imputation is (sort of) a predictive method. That technique assumes that items not answered (by design) by students can be imputed with the help of auxiliary variables. We use small area methods to predict the performance of provinces on mathematics tests.