Yongfu's Blog

Demystifying Item Response Theory (3/4)

Improving Estimation through Partial Pooling

Mar 29, 2023

Fixed, Random and Mixed

Statistics is confusing enough with its massive terminology. Psychology, which is largely experiment-oriented, further confuses people by adding its own flavor. A peek at the definitions of fixed, random, and mixed effects in the APA Dictionary of Psychology exemplifies this:

[A mixed-effect model is] any statistical procedure or experimental design that uses one or more independent variables whose levels are specifically selected by the researcher (fixed effects; e.g., gender) and one or more additional independent variables whose levels are chosen randomly from a wide range of possible values (random effects; e.g., age).

—Definition of “mixed-effect model” in the APA Dictionary of Psychology

The definitions of random and fixed effects above are not only confusing but also misleading. In principle, whether a categorical variable is “fixed” or “random” has nothing to do with the nature of the variable (e.g., gender doesn’t have to be fixed) or with how the levels within the variable are selected (randomly drawn or deliberately chosen by researchers). Whether a variable is modeled as fixed or random is a decision made by the modeler. And the modeler should model variables as random unless there are justifiable reasons not to. Let me explain.

Multilevel Instead of Mixed

A better way to understand fixed and random effects is to think hierarchically. The levels of a random-effect variable are treated as related by the model, meaning that the effect of each level is estimated by also considering information from other levels in the variable. This is known as partial pooling, and it has several desirable properties. On the other hand, the levels within a fixed-effect variable are treated as independent: during parameter estimation, the model considers only information within each level. This is the no-pooling case. So how does the model incorporate information from the other levels during estimation in the partial-pooling case? To explain this, let me start with the no-pooling case.

As an example, suppose we have a categorical variable with $n$ levels. Our goal is to obtain an estimate for each of these levels (and the variability in the estimates), labeled $\alpha_1, \alpha_2, \dots, \alpha_n$. To provide some context, we can think of the categorical variable as nationality: for each nation (a level), we want to estimate the average height (the parameter) of its citizens. When we do not pool information across the levels, the structure of the data-generating process assumed by the statistical model is shown in the figure below. The model assumes that each level has an associated parameter that generates the observations, but these parameters are assumed to be independent. Therefore, the model uses only the observations under each level to estimate that level’s parameter. For the no-pooling model, what has been learned about one nation is uninformative about any other nation.

[Figure: Levels within a categorical variable estimated independently.]

But there actually is some information, right? Take the perspective of an alien, for instance. Knowing the average height of Americans, say 5′9″, provides quite a lot of information about the heights of the Japanese—they are unlikely to average 60 feet, or 6 inches, right? This is why we prefer to partial pool. Partial pooling allows the sharing of information across different levels, which, as elucidated below, leads to several desirable properties.

When we want to incorporate—or partial pool—information from other levels during estimation, we can use models that assume a hierarchical structure on the levels within a variable. Such a hierarchical structure is exemplified in the figure below. It assumes that all levels, or more precisely all parameters underlying the levels, come from a common distribution. Here, this common distribution is assumed to be a normal distribution with mean $\mu$ and variance $\sigma^2$, both estimated from the data. When this structure is imposed, the observations under different levels become naturally linked, since the observations under every level now provide information for estimating the common distribution.

[Figure: Levels within a categorical variable estimated by incorporating information from all levels. This is achieved by assuming all levels come from a common distribution.]
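To make this hierarchical structure concrete, here is a minimal R sketch of the data-generating process just described. Everything here (the object d_toy, 8 nations, 20 citizens per nation, the means and standard deviations) is made up purely for illustration and is not part of the original analysis.

set.seed(1)
n_nation = 8                              # number of levels (nations)
mu    = 170                               # common mean (cm); estimated from data in practice
sigma = 6                                 # between-nation SD; estimated from data in practice
alpha = rnorm( n_nation, mu, sigma )      # nation-level average heights
# Observations within each nation scatter around that nation's own mean
d_toy = data.frame(
  nation = rep( 1:n_nation, each=20 ),
  height = rnorm( n_nation*20, mean=rep(alpha, each=20), sd=7 )
)

A no-pooling analysis would estimate each nation’s mean from its own 20 observations only; a partial-pooling model additionally estimates the common mean and spread from all observations and lets them inform each nation’s estimate.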

What can be gained from partial pooling information across levels? As mentioned above, when we partial pool, information from the observations across all levels is shared. This means that the model considers more information when estimating the parameter of each level; as a direct result, it uses the data (observations) more efficiently by squeezing more information out of them. Secondly, partial pooling reduces overfitting and thus provides better (out-of-sample) estimation. The model is less likely to overfit because it is more “objective”, considering information across different levels. Overfitting occurs when data are scarce, which is exactly the situation under no pooling, where only the data within each level are considered. In that case, the model bases its estimates on fewer observations and hence tends to be overly sensitive to idiosyncratic patterns in the local data. Another great thing about partial pooling is that it automatically adjusts according to the sample size under each level. For levels with few observations (e.g., North Korea, where the alien managed to collect heights from only three of its citizens), the model places more weight on the overall information provided by the other levels, resulting in larger adjustments to those levels’ estimates. For levels with abundant data, the estimates are only slightly affected by the observations from other levels.
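To see the sample-size-dependent weighting more explicitly, below is a small sketch of the standard partial-pooling estimate for a normal hierarchical model with known within-level and between-level variances: a precision-weighted average of a level’s own mean and the grand mean. The function pool_est() and all numbers are my own hypothetical illustration, not code from this post.

# Partial-pooling estimate of a level's mean (normal model, known variances)
pool_est = function( ybar_j, n_j, mu, sigma_y, sigma_a ) {
  w_local  = n_j / sigma_y^2   # precision of the level's own mean
  w_global = 1   / sigma_a^2   # precision contributed by the common distribution
  ( w_local*ybar_j + w_global*mu ) / ( w_local + w_global )
}
# Made-up numbers: grand mean 168 cm, within-level SD 7, between-level SD 4
pool_est( ybar_j=175, n_j=3,   mu=168, sigma_y=7, sigma_a=4 )  # ~171.5: pulled strongly toward 168
pool_est( ybar_j=175, n_j=300, mu=168, sigma_y=7, sigma_a=4 )  # ~174.9: barely moves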

Partial pooling essentially arises from the hierarchical structure assumed by the models. Therefore, these models are known as hierarchical or multilevel models. Mixed (effects) models are another common label for them, though, as explained above, that name is quite uninformative. To understand how these models work, it is better to start from their hierarchical structure. I will use the term multilevel models from now on to save ourselves from confusion. This name is also nice in that it shares its abbreviation with the mixed model—both are abbreviated as GLMM, for Generalized Linear Multilevel/Mixed Models.

Back to IRT

Now that we are acquainted with partial pooling and multilevel models, let’s apply them to the IRT context to improve our previous model, which was fitted without partial pooling across levels. To warm up, let me rephrase the structure of the simulated IRT dataset in multilevel terminology.

There are two variables at work here—the item variable and the person (or subject) variable. Within the item variable there are several items; in other words, each item acts as a level within the item variable. Similarly, each person corresponds to a level within the person variable. For each item, we want to estimate a parameter, the item’s difficulty. Likewise, for each person, we want to estimate a parameter, the person’s ability. To improve the estimation of these item/person parameters, we can partial pool information across the levels within the item and/or the person variable.

The chunk below copies the data-simulation code from the previous post, with two minor changes. The first is the renaming of the item and subject ID variables to Iid (originally I) and Sid (originally S). The second is that, instead of item difficulty (D in the previous post), we conceptualize the effect of items as easiness (E). Item easiness is simply the negative of item difficulty. This simple switch allows us to skip the step of reversing the signs of the item effects returned by the regression model.

logistic = function(x) 1 / (1 + exp(-x))
logit = function( p ) log( p/(1-p) )
rbern = function( p, n=length(p) ) rbinom( n=n, size=1, prob=p )

set.seed(12)
n_item = 30    # number of items
n_subj = 60    # number of subjects
n_resp = n_item * n_subj
n_param = n_item + n_subj
A = rnorm( n_subj )  # Person ability
E = seq( -1.6, 1, length=n_item )  # Item easiness

# The data
d = expand.grid( Sid=1:n_subj, Iid=1:n_item, KEEP.OUT.ATTRS = F )
d$mu = A[d$Sid] + E[d$Iid]
d$R = rbern( logistic( d$mu ) )
d$Sid = factor(d$Sid)
d$Iid = factor(d$Iid)

Unpooled Model

With the data prepared, let’s refit model m1.2 from the previous post. Later, I will fit another model that partial pools the subject variable (m2) and compare it to the unpooled model here (m1).

The code below for fitting m1 is identical to that in the previous post, except that I adopt another method (the code beneath the # Construct contrast matrix comment) to reconstruct the dropped estimate (the one forced out by the sum-to-zero constraint). This change is necessary because it also lets us reconstruct the standard error of the dropped estimate. We will need the standard errors later to quantify the uncertainty in the estimates, which is used for comparing the unpooled and partial-pooled models. In addition, the method adopted here is more principled and general, which further consolidates our understanding of contrasts and dummy coding. However, it takes up some space to explain since a little matrix algebra is involved. I thus leave the details in the box at the end of the post.

d1 = d
contrasts(d1$Sid) = contr.sum( n_subj )
m1 = glm( R ~ -1 + Iid + Sid, data=d1, family=binomial("logit") )

# Construct contrast matrix
Cmat = diag(0, nrow=n_item+n_subj)[, -1]
diag(Cmat)[1:n_item] = 1
idxS = 1:n_subj + n_item
Cmat[idxS, idxS[-length(idxS)]] = contr.sum( n_subj )

# Reconstruct estimates with the contrast matrix
m1_eff = (Cmat %*% coef(m1))[, 1]
# Reconstruct std. errors of the estimates with the contrast matrix
Vmat = Cmat %*% vcov(m1) %*% t(Cmat)
m1_se = sqrt(diag(Vmat))

Partial-pooled Model

To fit the partial-pooled model, glmer() from the lme4 package is used.

library(lme4)
m2 = glmer( R ~ -1 + Iid + (1|Sid), data=d, family=binomial('logit') )

lme4 provides a syntax for expressing multilevel models of different structures. Our model here is one of the simplest multilevel models (known as a varying-intercept model), and we express the partial pooling of persons with the syntax (1|Sid), shown as the last term of the model formula. When such a partial-pooling structure is specified, glmer() automatically imposes a zero-mean normal distribution on the partial-pooled variable. In our case, this means that each person’s ability is modeled as drawn from a normal distribution with a mean of zero and an unknown standard deviation estimated from the data. This constraint on the distribution of person ability naturally resolves the identification issue of the IRT model. Hence, there is no need to impose an additional sum-to-zero constraint as we did in m1. We are only partial-pooling the person variable here, so apart from (1|Sid), everything else passed to glmer() is identical to m1.
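As a quick side check (my addition, not from the original post), the estimated standard deviation of this zero-mean ability distribution can be inspected with lme4’s VarCorr():

# Estimated SD of the (1|Sid) intercepts, i.e., the spread of the person abilities
VarCorr(m2)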

After fitting the model, the estimates from m2 can be obtained with the code below. Unpooled and partial-pooled estimates are extracted differently in lme4. To extract the unpooled estimates, one uses the fixef() function. These unpooled estimates, along with their standard errors and other information, are also found in the model summary table returned by summary(). The partial-pooled estimates, however, are not found in the table. To extract them, we need the ranef() function as shown below.

m2_eff.item = fixef(m2)
m2_eff.subj = ranef(m2)$Sid[, 1]

In addition to the estimates, we would also like to retrieve their standard errors. Similar to the estimates, the standard errors of the estimates are extracted differently according to whether they are unpooled (fixed) or partial-pooled (random). We can utilize se.fixef() and se.ranef() from the arm package to extract these standard errors:

m2_se.item = arm::se.fixef(m2)
m2_se.subj = arm::se.ranef(m2)$Sid[, 1]

Finally, to compare the estimates of m1 and m2, I plot them together in the same figure. I also plot the uncertainty—calculated as $\pm 2 \times Standard~error$—around each estimate. Estimates and uncertainties from m1 are plotted in blue, whereas those from m2 are plotted in pink. The true effects used to generate the simulated data are plotted as solid black points.

# Concatenate item & subj effect/std to match m1_eff/m1_se
m2_eff = c( m2_eff.item, m2_eff.subj )
m2_se = c( m2_se.item, m2_se.subj )

#' Function stolen from `rethinking::col.alpha()`
col.alpha = function (acol, alpha = 0.5) {
  acol = col2rgb(acol)
  acol = rgb(acol[1]/255, acol[2]/255, acol[3]/255, alpha)
  acol
}

# Plot for comparing `m1` & `m2`
plot( 1, type="n", ylim = c(-4.8, 4.8), xlim=c( 0, n_subj+n_item + 1 ),
      ylab="Effect", xlab="Item/Subject Index" )
abline( v=n_item + .5, lty=2, col="grey" )
segments( -5, mean(m2_eff.item), n_item+.5, lty=2, col="grey" )
segments( n_item+.5, 0, 1000, lty=2, col="grey" )
points( c(E, A), pch=19 )
# Uncertainty bars
for (i in seq_along(m2_se)) {
  lines( c(i,i), m1_eff[i] + c(-2,2)*m1_se[i], col=col.alpha(4), lwd=6 )
  lines( c(i,i), m2_eff[i] + c(-2,2)*m2_se[i], col=col.alpha(2,.7), lwd=3 )
}
points( m1_eff, col=4 )
points( m2_eff, col=2 )

Shrinkage

Some of the benefits of partial pooling discussed earlier are visible in the comparison plot above. The most drastic changes from the unpooled to the partial-pooled model are seen in the ability estimates, which are exactly the levels that get partially pooled. There are two things to notice here. First, there is less uncertainty in the partial-pooled estimates than in the unpooled ones (the pink bars tend to be shorter than their blue counterparts). This follows naturally because, through partial pooling, the model has access to more information (hence less uncertainty) for each level. Second, the partial-pooled estimates tend to get “pulled” towards the center (i.e., the grand mean of the subject estimates), and the more extreme an estimate is, the further it is pulled. Essentially, this means the model is robust against observations that would otherwise result in extreme estimates. This is known as shrinkage and is another feature that arises naturally from partial pooling.
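For a quick numeric look at the shrinkage (a sketch of my own that reuses the objects created above, not code from the original post), we can count how many ability estimates end up closer to zero, the center of the assumed normal distribution:

idxS = 1:n_subj + n_item   # positions of the subject estimates in m1_eff / m2_eff
# Proportion of subjects whose partial-pooled estimate lies closer to zero
mean( abs(m2_eff[idxS]) < abs(m1_eff[idxS]) )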

From the figure, we can see that partial pooling improves the estimation of the person abilities, as most pink circles lie much closer to the solid black dots (true effects) than the blue ones. For the item easiness estimates, which are not partial-pooled, there is also a slight improvement. This results from the improvement in estimating the abilities: since ability and easiness are estimated jointly by the model, the gain in ability estimation carries over to easiness estimation. Given the large improvement in the ability estimates, one might consider also pooling the items. Indeed, there is no reason not to pool. Partial pooling should be the default.

Partial Pool Items and Subjects

To specify the partial pooling of items in the model, we again use the “bar” syntax: (1|Iid). This lets glmer() also model the items as coming from a normal distribution with zero mean and unknown variance. Now, since both the items and the subjects are centered at zero, an additional step is needed to shift the zero-centered item estimates back to their original locations¹. This is why the model formula in m2.2 uses 1 instead of -1. By specifying 1, glmer() estimates a separate global intercept, which here equals the amount subtracted from the item effects for centering. Hence, to reconstruct the original non-centered item estimates, we add the global intercept back to the item estimates, as done with fixef(m2.2)[["(Intercept)"]] in the code below.

m2.2 = glmer( R ~ 1 + (1|Iid) + (1|Sid), data=d, family=binomial('logit') )

m2.2_eff.subj = ranef(m2.2)$Sid[, 1]
m2.2_eff.item = ranef(m2.2)$Iid[, 1] + fixef(m2.2)[["(Intercept)"]]
m2.2_eff = c( m2.2_eff.item, m2.2_eff.subj )
m2.2_se = arm::se.ranef( m2.2 )
m2.2_se = c( m2.2_se$Iid, m2.2_se$Sid )

You can compare m2.2 to m2 by plotting their estimates with the plotting code shown previously, and you will see that m2.2 further improves the estimation (though not by much); a quick numeric comparison is also sketched below. In the psychometric/measurement literature, partial pooling both items and persons is uncommon. But as seen in our simulate-and-fit approach, partial pooling results in better estimation. This approach also refutes the unjustified belief that “fixed effects should be used when the levels are specifically selected by the researcher”. In our simulation, the item easiness values are specifically “picked out”: they are deliberately set to equally spaced values. And still, modeling the item effects as random is not only benign but actually improves estimation. This is true in general, and you can change the item easiness values in the simulation to see that partial pooling mostly, if not always, gives better estimates.
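One way to make this comparison numeric (again a sketch of my own, reusing the objects defined above; rmse and truth are hypothetical names) is to compute the root-mean-square error of each model’s estimates against the true simulated effects:

rmse = function( est, true ) sqrt( mean( (est - true)^2 ) )
truth = c( E, A )        # true easiness and abilities, in the same order as the estimates
rmse( m1_eff,   truth )  # unpooled
rmse( m2_eff,   truth )  # persons partial-pooled
rmse( m2.2_eff, truth )  # items and persons partial-pooled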

What’s next

So far, we have been dealing with item response models with dichotomous item responses. That is, a response can only either be correct (1) or wrong (0). In Part 4, we move on to item response models for rating responses. These models are extremely useful since rating scales are common in the social sciences. The models also allow us to model the so-called “rater effect”, which quantifies the leniency of the raters. By incorporating such rater effects, the model corrects for potential biases introduced by subjective ratings, thereby giving more accurate person and item estimates.

Don’t be intimidated by the matrix algebra. It’s simply arithmetic done in a fancy manner, and it looks scary only because it does many things at once. With some patience, you will be able to break down and understand the steps involved.

Reconstructing Dropped Estimates

Let’s first see how the contrast matrix reconstructs the dropped estimate from the sum-to-zero-constrained model. I’ll start with a toy example with only three subjects, $S_1, S_2, S_3$. The contrast matrix for imposing a sum-to-zero constraint on the subjects is shown below. Recall that the sum-to-zero constraint is coded by dropping the last subject, $S_3$ (hence the two remaining columns in the contrast matrix), and by implicitly coding $S_3$’s information into $S_1$ and $S_2$ through the -1s in the third row. Given this coding, the effect of $S_3$ can be reconstructed from the effects of $S_1$ and $S_2$ by taking the negative of their sum. This is what the code c( subj_eff.m1.2, -sum(subj_eff.m1.2) ) from the previous post does.

$$ \begin{array}{c|cc} & S_1 & S_2 \\ \hline S_1 & 1 & 0 \\ S_2 & 0 & 1 \\ S_3 & -1 & -1 \end{array} $$

The same thing can be done through matrix multiplication. Simply take the above 3-by-2 contrast matrix and multiply it by the 2-by-1 column vector of the estimated effects for $S_1$ and $S_2$, which I abbreviate as $E_1$ and $E_2$ here. The last entry of the resulting column vector then gives what we want.

$$ \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix} \begin{bmatrix} E_1 \\ E_2 \end{bmatrix} = \begin{bmatrix} E_1 \\ E_2 \\ -E_1 - E_2 \end{bmatrix} $$

Here’s the R code version of the above matrix multiplication:

Cmat = contr.sum(3)  # Contrast matrix coding sum-to-zero constraint
eff = c( E1=1.5, E2=1.7 )  # Made-up effect of S1 and S2
Cmat %*% eff

  [,1]
1  1.5
2  1.7
3 -3.2

Reconstructing Standard Error of Dropped Estimates

The contrast matrix can similarly be applied to reconstruct the variance (and hence the standard error) of the dropped subject’s estimate. The reconstruction is based on the variance sum law, $Var(X+Y) = Var(X) + Var(Y) + 2~Cov(X,Y)$, which generalizes naturally to matrix notation. Hence, given the variances of the estimates for $S_1$ and $S_2$ and their covariance, we can reconstruct the variance of $E_3$ as

$$ \begin{equation} Var(E_3) = Var(E_1) + Var(E_2) + 2~Cov(E_1,E_2) \tag{1} \end{equation} $$

The variances and covariances of the estimates are given by the (variance-)covariance matrix of the fitted model. The formula below writes out the matrix generalization of the variance sum law for the toy example; note that it also gives us the reconstructed covariances, shown in the off-diagonal entries of the reconstructed covariance matrix (the right-most matrix), while the variances of $E_1$, $E_2$, and $E_3$ sit on the diagonal. The standard errors of the estimates are then obtained by taking the square roots of these diagonal entries.
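$$ \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix} \begin{bmatrix} Var(E_1) & Cov(E_1, E_2) \\ Cov(E_1, E_2) & Var(E_2) \end{bmatrix} \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{bmatrix} = \begin{bmatrix} Var(E_1) & Cov(E_1, E_2) & Cov(E_1, E_3) \\ Cov(E_1, E_2) & Var(E_2) & Cov(E_2, E_3) \\ Cov(E_1, E_3) & Cov(E_2, E_3) & Var(E_3) \end{bmatrix} $$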

In R, the same calculation is done with the code below.

Cmat = contr.sum(3)  # Contrast matrix for coding sum-to-zero constraint
# Made-up (symmetric) variance-covariance matrix of the estimates
Vmat = matrix(c( 0.30, -0.01,
                -0.01,  0.40 ),
              byrow=T, nrow=2 )
Cmat %*% Vmat %*% t(Cmat)  # Reconstructed variance-covariance matrix

      1     2     3
1  0.30 -0.01 -0.29
2 -0.01  0.40 -0.39
3 -0.29 -0.39  0.68

Back to the Code

d1 = d
contrasts(d1$Sid) = contr.sum( n_subj )
m1 = glm( R ~ -1 + Iid + Sid, data=d1, family=binomial("logit") )

# Construct contrast matrix
Cmat = diag(0, nrow=n_item+n_subj)[, -1]
diag(Cmat)[1:n_item] = 1
idxS = 1:n_subj + n_item
Cmat[idxS, idxS[-length(idxS)]] = contr.sum( n_subj )

# Reconstruct estimates with the contrast matrix
m1_eff = (Cmat %*% coef(m1))[, 1]
# Reconstruct std. errors of the estimates with the contrast matrix
Vmat = Cmat %*% vcov(m1) %*% t(Cmat)
m1_se = sqrt(diag(Vmat))

Once you are familiar with the matrix algebra discussed above, the code for reconstructing the dropped subject’s effect should become self-explanatory. The only complication is that, instead of using the contrast matrix of the subjects alone, we use a larger matrix encompassing the coding of all levels of both the item and the subject variables, so that it matches the covariance matrix given by the model (which also covers all levels of both variables). This large contrast matrix can be thought of as the concatenation of two contrast matrices along the diagonal, with the remaining off-diagonal entries filled with zeros. To better explain this, let me go back to our previous example with three subjects.

To keep things simple, let’s additionally assume that there are only three items. Since the sum-to-zero constraint is imposed only on the subjects in the model, the contrast matrix coding the items is simply a 3-by-3 identity matrix. Concatenating the item and subject contrast matrices in the way just described results in the matrix:

$$ \begin{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}_{3 \times 3} & 0~~~~~ \\ 0 & \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix}_{3 \times 2} \end{bmatrix}_{6 \times 5} $$

In general, with $n_I$ items and $n_S$ subjects, this concatenated contrast matrix has the form:

$$ \begin{bmatrix} ~~\mathrm{I}_{n_I \times n_I}~~~ & 0 ~~~~~~~~~~~~~~ \\ \\ 0 & \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ -1 & -1 & \cdots & -1 \end{bmatrix}_{n_S \times (n_S - 1)} \end{bmatrix}_{ (n_I + n_S) \times (n_I + n_S - 1) } $$
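As an aside (not part of the original code), this kind of block-diagonal concatenation can also be built directly with bdiag() from the Matrix package; a small sketch for the 3-item, 3-subject toy example (Cmat_toy is a hypothetical name):

library(Matrix)
# 3-by-3 identity block for the items, 3-by-2 sum-to-zero block for the subjects
Cmat_toy = as.matrix( bdiag( diag(3), contr.sum(3) ) )
dim(Cmat_toy)  # 6 rows, 5 columns, matching the matrix shown above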

This is what the second part of the above code (reproduced below) is doing. It first sets up the correct shape of this large contrast matrix according to the numbers of items and subjects. The trick here is to use the diag() function to initialize a square matrix of zeros and then drop one of its columns to get the correct dimensions. Next, diag(Cmat)[1:n_item] = 1 sets the upper-left portion of the matrix (the item sub-matrix) to an identity matrix by filling the first n_item diagonal entries with ones. Finally, the lower-right portion (the subject sub-matrix) is replaced with the subject contrast matrix constructed by contr.sum().

# Construct contrast matrix
Cmat = diag(0, nrow=n_item+n_subj)[, -1]
diag(Cmat)[1:n_item] = 1
idxS = 1:n_subj + n_item
Cmat[idxS, idxS[-length(idxS)]] = contr.sum( n_subj )

With the contrast matrix Cmat prepared, we can construct what we need through matrix algebra. The estimates for all levels, including the dropped one, are reconstructed by multiplying Cmat with the estimates returned by the model. This is illustrated in the line below. The estimates are given by coef(m1), and the [, 1] at the end of the line forces the resulting one-column matrix into vector form.

m1_eff = ( Cmat %*% coef(m1) )[, 1]

The full covariance matrix is similarly reconstructed through Cmat and the covariance matrix extracted from the model (vcov(m1)):

Vmat = Cmat %*% vcov(m1) %*% t(Cmat)

Since the final products we need are the standard errors, we extract the diagonal entries of Vmat and take the square root:

m1_se = sqrt( diag(Vmat) )

1. We don’t touch the subject estimates, though, since the abilities are simulated from a zero-mean normal distribution in the first place, remember? ↩︎