Yongfu's Blog

Demystifying Item Response Theory (2/4)

IRT as Generalized Linear Models

Mar 6, 2023

In Part 1, we went through the simplest item response model, the 1PL model, from the perspective of simulations. Starting with item difficulty and testee ability, we worked forward to simulate item responses that mimic real-world data. In doing so, we laid out precisely the data generating process assumed by item response theory. In this post, we work backward. We will start with the item responses and work back toward the unobserved difficulties and abilities, with the help of statistical models. But first, let’s simulate the data we will be using!

logistic = function(x) 1 / (1 + exp(-x))
rbern = function( p, n=length(p) ) rbinom( n=n, size=1, prob=p )

set.seed(12)
n_item = 30   # number of items
n_subj = 60   # number of subjects
n_resp = n_subj * n_item  # number of responses

A = rnorm( n=n_subj, mean=0, sd=1 )  # Subjects' ability
D = seq( -1.6, 1, length=n_item )  # Items' difficulty

# The data
d = expand.grid( S=1:n_subj, I=1:n_item, KEEP.OUT.ATTRS = FALSE )
d$R = rbern( logistic(A[d$S] - D[d$I]) )
d$S = factor(d$S)
d$I = factor(d$I)
str(d)
'data.frame':   1800 obs. of  3 variables:
 $ S: Factor w/ 60 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ I: Factor w/ 30 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ R: int  0 1 0 0 1 1 1 1 1 1 ...

In the code above, the first two lines are the definitions for the logistic and the Bernoulli functions used previously. The second chunk of code sets the shape of our data. This time, the simulated data is much larger, with 30 items and 60 testees, or subjects (I will use the more general term “subject” hereafter). Since we assume here that each subject responds to every item, this gives us 1800 responses in the data.

The subject abilities come from a normal distribution with a zero mean and a standard deviation of one (the standard normal). The item difficulties are equally-spaced values that range from -1.6 to 1. A notable change from the previous post is that number indices are used here for labeling items (I) and subjects (S). For simple illustrations, letter indices are clearer. But for larger data sets, number indices are easier to manipulate with code. Now, since S and I are coded as integers, we need to explicitly convert them into factors. Otherwise, the model will treat the number indices as values in a continuous variable.
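
To see what is at stake, here is a quick throwaway check (d_int is just a temporary name, and glm() itself is introduced properly later in this post): with integer codes, the model treats the index as a single continuous predictor and fits one slope for it, whereas with factors it fits one coefficient per category.

# Throwaway check: integer-coded S/I give one slope each (plus an intercept),
# while factor-coded S/I give one coefficient per subject/item
d_int = transform( d, S = as.integer(S), I = as.integer(I) )
length( coef( glm(R ~ S + I, data=d_int, family=binomial("logit")) ) )  # 3 coefficients
length( coef( glm(R ~ S + I, data=d,     family=binomial("logit")) ) )  # 89 coefficients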

DAGs Revisited

Before we move on to the statistical model, let me lay out the DAGs again. The DAG on the right below is the one from Part 1 (reproduced on the left here), with a slight modification that emphasizes the statistical model’s perspective: the observed $S$ and $I$ take the place of the unobserved $A$ and $D$, respectively. So why the difference?

Recall that the nodes $A$ and $D$ represent the joint influences of a subject’s ability and an item’s difficulty on the probability of success on that item. However, the statistical model cannot see $A$ and $D$, since they are theoretical concepts proposed by IRT. What the model actually “sees” is closer to the DAG on the right. This DAG is theoretically neutral. All it says is that the probability of success is influenced by the particular subject and item present in an observation. It does not further comment on the factors underlying each subject/item that lead to the results.

Given the data and the right DAG, the statistical model estimates the so-called subject effects and item effects. These effects will be estimates of subject ability and item difficulty if the IRT assumptions are met: when a subject and an item influence the result only through subject ability and item difficulty. With the concepts of subject/item effects in place, we can move on to the formulas of the statistical model.

Equation, Index and Annoying Things

The equations in (1) formulate our model. This model is known as logistic regression, or in GLM terms, a Generalized Linear Model of the binomial family with the logit link (more on this later). Lots of things are going on here. Don’t panic, I’ll walk you through slowly.

$$ \begin{align} & R_i \sim \text{Bernoulli}( P_i ) \\ & P_i = \text{logistic}( \mu_i ) \\ & \mu_i = \alpha_{[S_i]} + \delta_{[I_i]} \end{align} \tag{1} $$

First, note the common subscript $_i$ on the variables above. The presence of this common $_i$ indicates that the equations are read at the observational level. The observational level is easier to think of with the help of the long data format. In this long form of data, each row records an observation and is indexed by the subscript $_i$. So you can think of the set of three equations as describing the links among the variables for each observation. Note that the long data format is also the format we have been using for the data frames.

The last equation in (1), also related to the reading of the subscript $_i$, deserves some elaboration, as some might feel confused about the square brackets after $\alpha$ and $\delta$. Actually, we have already met these brackets in Part 1. They serve a similar function to R’s subset operator [], which we used to link particular ability/difficulty levels of a subject/item to the rows (observations) of the data frame. What the square brackets after $\alpha$ and $\delta$ do, exactly, is “look up” the index of the subject and item for the $i$th observation so that the $\alpha$ corresponding to the subject and the $\delta$ corresponding to the item can be correctly retrieved. The model can thus “know” which $\alpha$ and $\delta$ to update when it encounters an observation. For instance, suppose we are on the 3rd row (observation) of the data, in which $S_3 = 5$ and $I_3 = 8$. This tells the model that the observation gives information about $\alpha_5$ and $\delta_8$. The model thus learns something about them and updates accordingly.
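
To make the look-up concrete in code, here is a minimal sketch. The vectors alpha and delta below are made-up placeholders (not estimates from any model); the point is only to show that R’s subsetting does exactly the work of the square brackets.

# Placeholder effects, one per subject and one per item (made-up values)
alpha = rnorm( n_subj )
delta = rnorm( n_item )
S_i = as.integer( d$S )          # subject index of each observation
I_i = as.integer( d$I )          # item index of each observation
mu = alpha[S_i] + delta[I_i]     # mu_i = alpha_[S_i] + delta_[I_i], for every row
mu[3]                            # the mu of the 3rd observation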

I haven’t written about $\alpha$ and $\delta$ yet, but based on the previous paragraph, you might already know what they are: $\alpha$s are the subject effects and $\delta$s the item effects to be estimated by the model.

Now, let me walk you through the equations from bottom to top:

  • $\mu_i = \alpha_{[S_i]} + \delta_{[I_i]}$
    No surprise here. This equation simply illustrates how the model computes a new variable $\mu$ from $\alpha$ and $\delta$.
  • $P_i = \text{logistic}( \mu_i )$
    The equation should look familiar. It indicates how the model maps $\mu$, which can range from $-\infty$ to $\infty$, to probability, $P$, through the logistic function.
  • $R_i \sim \text{Bernoulli}( P_i )$
    This equation describes that each observed response is generated from a Bernoulli distribution with probability $P_i$. Or even simpler, $R_i$ would be $1$ with probability $P_i$ and $0$ with probability $1 - P_i$.

These equations all look familiar because they are essentially mathematical representations of the simulation we have done. Here, the model formulation is simply simulation in reverse.

The GLM formulation in (1) is often written in the alternative form shown in (2). The only difference between (2) and (1) lies in the second equation. Instead of the logistic function, the second equation in (2) uses the logit function. What is the logit?

$$ \begin{align} & R_i \sim \text{Bernoulli}( P_i ) \\ & \text{logit}(P_i) = \mu_i \\ & \mu_i = \alpha_{[S_i]} + \delta_{[I_i]} \end{align} \tag{2} $$

The logit function is simply the mirror of the logistic. They do the same mapping but in reverse directions:

  • the logistic function maps real numbers to probabilities
  • the logit function maps probabilities to real numbers

The logistic and the logit are inverse functions of each other. So if a real number gets converted to a probability by the logistic, the logit converts it back to the original real number, and vice versa.

logit = function( p ) log( p/(1-p) )
x = seq( -1, 1, by=0.1 )
( p = logistic(x) )  # Transformed x on probability space
 [1] 0.2689414 0.2890505 0.3100255 0.3318122 0.3543437 0.3775407 0.4013123
 [8] 0.4255575 0.4501660 0.4750208 0.5000000 0.5249792 0.5498340 0.5744425
[15] 0.5986877 0.6224593 0.6456563 0.6681878 0.6899745 0.7109495 0.7310586
# Logit gives x back
logit(p)
 [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4
[16]  0.5  0.6  0.7  0.8  0.9  1.0

Some algebra would get us from the logistic to the logit:

$$ \begin{aligned} \text{logistic}(x) &= \frac{1}{1 + e^{-x}} = p \\ & \Rightarrow ~~ e^{-x} = \frac{1-p}{p} \\ & \Rightarrow ~ -x = \text{log}(\frac{1-p}{p}) \\ & \Rightarrow ~~ x = -\text{log}(\frac{1-p}{p}) \\ & \phantom{\Rightarrow ~~ x } = \text{log}(\frac{p}{1-p}) = \text{logit}(p) \end{aligned} $$

There is really nothing special about the logit function. We have learned all the important things through the logistic back in Part 1. I mention the logit here simply because the term is frequently used. When people talk about GLMs, they prefer to use the link function to characterize the model. The link function, in the case of the logistic regression here, is the logit function. It transforms the outcome probabilities into real numbers that are modeled linearly. It’s just the logistic, but works in the reverse direction.

Fitting GLM

Now we are equipped with the statistical muscle to carry out the analysis. Let’s fit the model to the data we’ve simulated. In R, this is done through the function glm(). The first argument of glm() is the formula, in which we specify our linear model with R’s model syntax. There are, in principle, two ways to express the formula when there are categorical predictors in the model: one succinct and the other tedious. I will first demonstrate the tedious one, as it exposes all the details hidden by the succinct form. Though tedious, it saves us from confusion.

Dummy Coding

The formulas we specify in glm() (and other model fitting functions in general) correspond pretty well to their mathematical counterparts. So let me first present the math before we move on to the code. Lots of things to explain here.

Equation (3.2) is rewritten from the last two equations in (2), $\text{logit}(P_i) = \mu_i$ and $\mu_i = \alpha_{[S_i]} + \delta_{[I_i]}$, which I combine and reproduce here as Equation (3.1).

Earlier I mentioned that the square brackets after $\alpha$ and $\delta$ serve as a “look up” function to locate the relevant $\alpha$ and $\delta$ of each subject and item in an observation. There is an equivalent way to express the same formula without the use of these “look up” functions, which is shown in equation (3.2). For the sake of simplicity, let’s assume here that we have only two items (A, B) and three subjects (J, K, L). For real data, equation (3.2) would be extremely long.

$$ \begin{align} \tag{3.1} \text{logit}(P_i) &= \alpha_{[S_i]} + \delta_{[I_i]} \\ \tag{3.2} \text{logit}(P_i) &= J_i \alpha_J + K_i \alpha_K + L_i \alpha_L + A_i \delta_A + B_i \delta_B \end{align} $$

The variables ($J_i, K_i, L_i, A_i, B_i$) in front of the $\alpha$s and $\delta$s have a value of either 0 or 1. Here, they serve as a “switch” that turns on the relevant $\alpha$ and $\delta$ and turns off the others in each observation. This is easier to see with the help of the tables below. Table 3.1 corresponds to Equation (3.1), and Table 3.2 corresponds to Equation (3.2). So, for instance, in row 2 of Table 3.2, $K$ and $A$ are 1 while the others are 0. This turns on, or picks out, $\alpha_K$ and $\delta_A$. As such, they would be updated by the model when it reaches this observation. In row 2 of Table 3.1, $\alpha_K$ and $\delta_A$ are picked out too, but not by the switches. They are directly picked out through the K and A present in the row.

Table 3.1

$_i$   $S$   $I$
1      J     A
2      K     A
3      L     B

Table 3.2

$_i$   $J$   $K$   $L$   $A$   $B$
1      1     0     0     1     0
2      0     1     0     1     0
3      0     0     1     0     1
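
To see the switches at work, plug row 2 of Table 3.2 into Equation (3.2):

$$ \text{logit}(P_2) = 0 \cdot \alpha_J + 1 \cdot \alpha_K + 0 \cdot \alpha_L + 1 \cdot \delta_A + 0 \cdot \delta_B = \alpha_K + \delta_A $$

which is exactly what row 2 of Table 3.1 says: this observation informs $\alpha_K$ and $\delta_A$, and nothing else.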

The re-expression of Table 3.1 as Table 3.2, coding the categories into zeros and ones, is known as dummy coding[1]. Why do we need dummy coding? In short, it is because regression programs do not “understand” the difference between categorical and continuous variables; they read only numbers. Dummy coding essentially represents categorical variables as continuous ones so that the program knows how to deal with them. Most programs (such as glm()) dummy code for the user if you give them categories. But there are various ways to dummy code the categories, and each results in a different output. The interpretation of the output coefficients depends on how the categories were coded. This confuses the novice, as too much is happening under the hood.
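
If you want to peek at what a program does when it dummy codes for you, model.matrix() is one way to look under the hood (a quick sketch; mm is just a temporary name):

# Expand the item factor into indicator columns (no global intercept, ~ 0 + ...)
mm = model.matrix( ~ 0 + I, data=d )
dim(mm)          # 1800 rows, one column per item
mm[ 1:3, 1:5 ]   # zeros and ones, just like Table 3.2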

Coding models: the long route

With the concepts of dummy coding in place, let’s code the model. I use dummy_cols() from the fastDummies package to help me with the dummy coding. In the code below, I recode the item and subject variables into zeros and ones. The result is identical to Table 3.2, except that it now expands to 30 items and 60 subjects (90 dummy columns in total, plus the response column R). I won’t print out the full dummy-coded data frame to save space. But be sure to take a look at d_dummy to see how it corresponds to Table 3.2.

library(fastDummies)
d_dummy = dummy_cols( d, c("I", "S"), remove_selected_columns=TRUE )
dim(d_dummy)
[1] 1800   91
colnames(d_dummy)
 [1] "R"    "I_1"  "I_2"  "I_3"  "I_4"  "I_5"  "I_6"  "I_7"  "I_8"  "I_9" 
[11] "I_10" "I_11" "I_12" "I_13" "I_14" "I_15" "I_16" "I_17" "I_18" "I_19"
[21] "I_20" "I_21" "I_22" "I_23" "I_24" "I_25" "I_26" "I_27" "I_28" "I_29"
[31] "I_30" "S_1"  "S_2"  "S_3"  "S_4"  "S_5"  "S_6"  "S_7"  "S_8"  "S_9" 
[41] "S_10" "S_11" "S_12" "S_13" "S_14" "S_15" "S_16" "S_17" "S_18" "S_19"
[51] "S_20" "S_21" "S_22" "S_23" "S_24" "S_25" "S_26" "S_27" "S_28" "S_29"
[61] "S_30" "S_31" "S_32" "S_33" "S_34" "S_35" "S_36" "S_37" "S_38" "S_39"
[71] "S_40" "S_41" "S_42" "S_43" "S_44" "S_45" "S_46" "S_47" "S_48" "S_49"
[81] "S_50" "S_51" "S_52" "S_53" "S_54" "S_55" "S_56" "S_57" "S_58" "S_59"
[91] "S_60"

Now, we can fit the model with the dummy-coded data d_dummy. The first argument in glm() specifies the formula in R’s model syntax. Based on Equation (3.2), we include all the dummy variables in the data. In R, this means typing out all the variables as R ~ S_1 + S_2 + ... + S_60 + I_1 + I_2 + ... + I_30. That’s a lot of work!

Luckily, R provides a handy syntax for this. Since we are including all the variables except the outcome on the right-hand side of the formula, we can simply type R ~ .. Here, the dot serves as a placeholder for all the remaining variables not specified in the formula. We also need a -1 in front of the dot: R ~ -1 + .. The -1 tells the model not to estimate a global intercept, which it does by default[2]. We don’t need a global intercept here since we want all the effects to be attributed to the items and subjects. If a global intercept is estimated, it will “suck out” what should have been part of the subject/item effects, rendering the results hard to interpret.

The last thing to note in glm() is the family argument, which characterizes the type of GLM used. Since we are fitting the data with logistic regression, we pass binomial("logit") to family. The GLM will then adopt the binomial distribution[3] with the logit link to map the right-hand linear terms to the outcome space.

m1 = glm( R ~ -1 + ., data=d_dummy, family=binomial("logit") )
summary(m1)
Call:
glm(formula = R ~ -1 + ., family = binomial("logit"), data = d_dummy)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4291  -0.8850   0.2498   0.8978   2.7509  

Coefficients: (1 not defined because of singularities)
       Estimate Std. Error z value Pr(>|z|)    
I_1   1.926e+00  5.422e-01   3.553 0.000381 ***
I_2   1.798e+00  5.339e-01   3.368 0.000758 ***
I_3   1.154e+00  5.049e-01   2.286 0.022267 *  
I_4   1.154e+00  5.049e-01   2.286 0.022267 *  
I_5   1.351e+00  5.118e-01   2.639 0.008304 ** 
I_6   9.686e-01  4.997e-01   1.939 0.052559 .  
I_7   1.351e+00  5.118e-01   2.639 0.008304 ** 
I_8   1.060e+00  5.021e-01   2.111 0.034737 *  
I_9   9.686e-01  4.997e-01   1.939 0.052559 .  
I_10  1.060e+00  5.021e-01   2.111 0.034737 *  
I_11  5.368e-01  4.920e-01   1.091 0.275173    
I_12  3.715e-01  4.905e-01   0.757 0.448848    
I_13 -3.712e-02  4.901e-01  -0.076 0.939634    
I_14 -3.712e-02  4.901e-01  -0.076 0.939634    
I_15  1.263e-01  4.897e-01   0.258 0.796423    
I_16  2.079e-01  4.898e-01   0.424 0.671244    
I_17 -2.857e-01  4.922e-01  -0.581 0.561549    
I_18  1.263e-01  4.897e-01   0.258 0.796423    
I_19 -3.712e-02  4.901e-01  -0.076 0.939634    
I_20  4.473e-02  4.898e-01   0.091 0.927249    
I_21 -7.217e-01  5.000e-01  -1.443 0.148885    
I_22 -1.194e-01  4.906e-01  -0.243 0.807787    
I_23 -6.313e-01  4.979e-01  -1.268 0.204796    
I_24 -7.217e-01  5.000e-01  -1.443 0.148885    
I_25 -7.217e-01  5.000e-01  -1.443 0.148885    
I_26 -7.217e-01  5.000e-01  -1.443 0.148885    
I_27 -1.007e+00  5.082e-01  -1.981 0.047587 *  
I_28 -1.212e+00  5.159e-01  -2.350 0.018770 *  
I_29 -1.212e+00  5.159e-01  -2.350 0.018770 *  
I_30 -1.961e+00  5.585e-01  -3.511 0.000447 ***
S_1  -1.800e+00  6.322e-01  -2.847 0.004415 ** 
S_2   1.732e+00  6.580e-01   2.632 0.008491 ** 
S_3  -1.575e+00  6.143e-01  -2.564 0.010355 *  
S_4  -8.194e-01  5.775e-01  -1.419 0.155962    
S_5  -1.176e+00  5.909e-01  -1.991 0.046527 *  
S_6   3.276e-01  5.732e-01   0.572 0.567650    
S_7   4.966e-01  5.774e-01   0.860 0.389744    
S_8  -3.227e-01  5.688e-01  -0.567 0.570509    
S_9  -4.853e-01  5.705e-01  -0.851 0.394979    
S_10  4.966e-01  5.774e-01   0.860 0.389744    
S_11 -1.176e+00  5.909e-01  -1.991 0.046527 *  
S_12 -1.369e+00  6.010e-01  -2.277 0.022758 *  
S_13 -8.194e-01  5.775e-01  -1.419 0.155962    
S_14 -1.613e-01  5.682e-01  -0.284 0.776469    
S_15 -1.613e-01  5.682e-01  -0.284 0.776469    
S_16 -1.176e+00  5.909e-01  -1.991 0.046527 *  
S_17  1.253e+00  6.146e-01   2.039 0.041452 *  
S_18  3.276e-01  5.732e-01   0.572 0.567650    
S_19  1.046e+00  6.010e-01   1.741 0.081706 .  
S_20 -8.194e-01  5.775e-01  -1.419 0.155962    
S_21  4.966e-01  5.774e-01   0.860 0.389744    
S_22  2.856e+00  8.565e-01   3.334 0.000855 ***
S_23  1.479e+00  6.328e-01   2.337 0.019424 *  
S_24 -9.940e-01  5.833e-01  -1.704 0.088343 .  
S_25 -8.194e-01  5.775e-01  -1.419 0.155962    
S_26  1.625e-01  5.704e-01   0.285 0.775661    
S_27 -4.853e-01  5.705e-01  -0.851 0.394979    
S_28 -3.227e-01  5.688e-01  -0.567 0.570509    
S_29  1.680e-15  5.687e-01   0.000 1.000000    
S_30  6.712e-01  5.831e-01   1.151 0.249706    
S_31  6.712e-01  5.831e-01   1.151 0.249706    
S_32  3.618e+00  1.111e+00   3.256 0.001131 ** 
S_33 -1.613e-01  5.682e-01  -0.284 0.776469    
S_34 -8.194e-01  5.775e-01  -1.419 0.155962    
S_35 -4.853e-01  5.705e-01  -0.851 0.394979    
S_36 -4.853e-01  5.705e-01  -0.851 0.394979    
S_37  1.046e+00  6.010e-01   1.741 0.081706 .  
S_38 -6.504e-01  5.734e-01  -1.134 0.256656    
S_39  1.046e+00  6.010e-01   1.741 0.081706 .  
S_40 -1.575e+00  6.143e-01  -2.564 0.010355 *  
S_41  1.625e-01  5.704e-01   0.285 0.775661    
S_42 -6.504e-01  5.734e-01  -1.134 0.256656    
S_43  1.253e+00  6.146e-01   2.039 0.041452 *  
S_44 -1.800e+00  6.322e-01  -2.847 0.004415 ** 
S_45 -3.227e-01  5.688e-01  -0.567 0.570509    
S_46  4.966e-01  5.774e-01   0.860 0.389744    
S_47 -9.940e-01  5.833e-01  -1.704 0.088343 .  
S_48  3.276e-01  5.732e-01   0.572 0.567650    
S_49  8.536e-01  5.908e-01   1.445 0.148546    
S_50 -3.227e-01  5.688e-01  -0.567 0.570509    
S_51 -4.853e-01  5.705e-01  -0.851 0.394979    
S_52 -1.613e-01  5.682e-01  -0.284 0.776469    
S_53  6.712e-01  5.831e-01   1.151 0.249706    
S_54  2.856e+00  8.565e-01   3.334 0.000855 ***
S_55 -6.504e-01  5.734e-01  -1.134 0.256656    
S_56  6.712e-01  5.831e-01   1.151 0.249706    
S_57  1.253e+00  6.146e-01   2.039 0.041452 *  
S_58 -9.940e-01  5.833e-01  -1.704 0.088343 .  
S_59 -4.853e-01  5.705e-01  -0.851 0.394979    
S_60         NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2495.3  on 1800  degrees of freedom
Residual deviance: 1927.2  on 1711  degrees of freedom
AIC: 2105.2

Number of Fisher Scoring iterations: 6

Take a look at the output. Something’s strange. Since there are 30 items and 60 subjects in the data, we expect the model to return 90 coefficients, one for each subject/item effect. However, the last coefficient, S_60 in this case, is NA. The model does not estimate this coefficient. Why?

Identifiability

The reason is that there are infinitely many parameter combinations that generate the same probabilities underlying our data. Thus, the model is unable to work in reverse to infer a unique set of coefficients from the data. To deal with this issue, R silently imposes a constraint on the parameters: it simply drops one of them and estimates the rest. Once this is done, the remaining parameters become identifiable, and the model is able to estimate them.

But still, where did the infinity come from? Didn’t we simulate the data? We didn’t introduce infinity, did we?

We did, actually, without noticing. Recall that the probability of success on an item is determined by the difference between ability and difficulty. Since it is the difference that matters, there are infinitely many ability and difficulty pairs that yield the same difference. By adding any common value to a pair, we get a new pair of ability and difficulty that yields the same probability. The code below demonstrates this. Here, I shift the ability and difficulty levels by a common value s. The resulting probabilities should be identical before and after the shift (except for a tiny floating point imprecision). You can play with the code below by changing the value of s; the results should always be identical.

# Shift A/D by a common factor
s = 5
A2 = A + s
D2 = D + s

p1 = logistic( A[d$S] - D[d$I] )    # Probabilities before shift
p2 = logistic( A2[d$S] - D2[d$I] )  # Probabilities after shift
sum( abs(p1 - p2) )  # Should be extremely close to zero
[1] 9.453549e-14

The way R deals with this identifiability issue is not ideal for us, since we want to recover the parameters of our simulation (i.e., the set of parameters without the shift). To get around R’s default treatment, we have to impose a constraint on the parameters ourselves.

Recall that subject abilities were generated from a standard normal distribution in the simulation. Since the standard normal has a mean of zero, the expected sum of the subject abilities is zero. We can use this expectation as a constraint on the parameters by forcing the estimated subject effects to sum to zero. This constraint does not put the model’s estimates on exactly the same scale as the true parameters, since the simulated subject abilities never sum to exactly zero in a single run of the simulation. The scale will, however, be close enough for the simulated parameters to be comparable to those recovered by the model. In the next post, when the generalized linear mixed model is introduced, you will see that there is no need to impose such a constraint: it is built into the model’s assumptions. The estimated subject effects then do not need to sum to zero; instead, they are assumed to come from a normal distribution with a mean of zero.
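
A quick check of that claim on our simulated abilities (the exact values depend on the seed, so I make no claim about the numbers themselves):

# The abilities are drawn from a standard normal: their mean and sum are
# close to zero, but not exactly zero, in this particular draw
mean(A)
sum(A)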

Through dummy coding, we can impose the sum-to-zero constraint on the subject effects. I illustrate this with the example previously presented in Table 3.2, where there are only 3 subjects and 2 items. Table 3.3 is re-coded from Table 3.2, in which the sum-to-zero constraint is imposed.

The sum-to-zero constraint is imposed by dropping one of the subjects and coding the remaining subject columns as -1 in all the observations where the dropped subject was originally coded as 1. This is shown in Table 3.3, where I drop subject $L$ (hence the - in its column) and code the 3rd row of $J$ and $K$ as -1.

Table 3.2

$_i$   $J$   $K$   $L$   $A$   $B$
1      1     0     0     1     0
2      0     1     0     1     0
3      0     0     1     0     1

Table 3.3

$_i$   $J$   $K$   $L$   $A$   $B$
1      1     0     -     1     0
2      0     1     -     1     0
3      -1    -1    -     0     1

With the coding scheme in Table 3.3, the estimated effect of subject $L$ can be reconstructed from the remaining subject effects returned by the model. Since the sum of all subject effects is zero, the effect of subject $L$ is simply the negative of the others’ sum. This might seem a bit confusing, but notice that the sum-to-zero constraint applies simultaneously to all effects of the variable: once the effect of the dropped category is reconstructed, every effect equals the negative sum of all the others.
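
In symbols, for the three-subject example:

$$ \alpha_J + \alpha_K + \alpha_L = 0 \quad \Rightarrow \quad \alpha_L = -(\alpha_J + \alpha_K) $$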

Let’s impose this constraint on the data with code. Here, I will drop the first subject, S_1. You can choose any subject you like to drop; the result will be identical. The code below first finds all the subject columns, then picks out the rows where S_1 is coded as 1 and recodes them to -1 on all the subject columns. After the recoding, the last line drops S_1.

d_dummy2 = d_dummy
toDrop = "S_1"                                       # the subject column to drop
allCategories = startsWith( names(d_dummy2), "S_" )  # all subject columns
idx_recode = which( d_dummy2[[toDrop]] == 1 )        # rows where S_1 is coded as 1
d_dummy2[idx_recode, allCategories] = -1             # recode those rows to -1
d_dummy2[[toDrop]] = NULL                            # drop S_1

Now, let’s refit the model with this constraint-coded data. I simply replace d_dummy with d_dummy2 in the data argument. Everything else is the same.

m1.1 = glm( R ~ -1 + ., data=d_dummy2, family=binomial("logit") )
# summary(m1.1)

If you print out the coefficients of the fitted model, you will see that the result is what we expected. The model returns 89 coefficients, which match the number of the predictor variables we passed in. No coefficient is dropped. We already dropped it for the model. And since we dropped the predictor in a principled way, we know how to reconstruct it. The effect of the dropped S_1 will be the negative sum of the remaining. This is shown in the code below, which reconstructs all the item/subject effects from the model’s coefficients.

eff = coef(m1.1)
item_eff = eff[ startsWith(names(eff), "I_") ]
subj_eff = eff[ startsWith(names(eff), "S_") ]
# Reconstruct S_1 from the remaining subject effects
subj_eff = c( -sum(subj_eff), subj_eff )
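
As a quick sanity check (reusing the vectors just created), we should now have 30 item effects and 60 subject effects, and the subject effects should sum to zero by construction:

length(item_eff)            # 30
length(subj_eff)            # 60
round( sum(subj_eff), 10 )  # 0 (up to floating point error)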

We can now plot the estimated effects against the true parameter values from the simulation. The figures below plot the estimated effects on the x-axis and the true parameters on the y-axis. The dashed line has a slope of 1 and an intercept of 0; it indicates a perfect match between the truth and the estimates. Notice that for the figure on the right, I reverse the signs of the item effects to match the scale of item difficulty. This is necessary since $D$ is subtracted from $A$ in the simulation. In other words, the effect of difficulty assumed by the 1PL model is negative: the larger the difficulty, the lower the probability of success on the item. However, glm() allows only additive effects: the effects in the model are summed together to yield predictions. Hence, the item effects estimated by glm() will be the negative of those assumed by the 1PL model.

plot(  subj_eff, A, pch=19, col=4 ); abline(0, 1, lty="dashed" )
plot( -item_eff, D, pch=19, col=2 ); abline(0, 1, lty="dashed" )

As seen in both figures, the dots scatter around the lines quite randomly, which indicates that the model does recover the parameters. To have a clearer view of the estimates’ accuracy, let me plot some more.

# Set figure margins
par(oma=c(0,0,0,0))  # outer margin
par(mar=c(3, 4, 3, 1.6) )  # margin

true_eff = c(D, A)
est_eff = c(-item_eff, subj_eff)
n_param = n_item + n_subj
cols = c( rep(2, n_item), rep(4, n_subj) )

y_lim = max( abs(c(true_eff, est_eff)) )
y_lim = c( -y_lim-.1, y_lim+.1 )
plot( 1, type="n", ylim=y_lim, xlim=c(1, n_param+1), ylab="Effect" )
abline( h=0, lty="dashed", col="grey" )
abline( v=n_item+0.5, col="grey")
points( true_eff, pch=19, col=cols )
points( est_eff, col=cols )
for (i in 1:n_param)
  lines( c(i, i), c(true_eff[i], est_eff[i]), col=cols[i] )
mtext( c("Items", "Subjects"), at=c(9, 61), padj = -.5, col=c(2, 4) )

In the plot above, I overlay the estimated effects onto the true parameters. The dots are the true parameters and the circles are the model’s estimates. The vertical lines connecting the dots and the circles show the distances between the truth and the estimates. It is obvious from the plot that, compared to item difficulties, subject abilities are harder to estimate, as the distances to the truth are in general larger for subject estimates. This is apparent in hindsight, as each item is taken by 60 subjects whereas each subject only takes 30 items. Hence, the estimation for the items is more accurate, compared to the subjects, as there are more data to estimate each.

The effect of manipulating the number of subjects and items is revealed in the plot below. Here, I refit the model with data that have the subject and the item numbers flipped. The subject abilities are now estimated more accurately than the item difficulties. You can experiment with this to see how the estimation accuracy changes with different combinations of subject/item numbers. The functions sim_data() and plot_estimate() in estimate-acc.R can help you with this.
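
If you prefer to do it by hand rather than through the helper script, here is a minimal sketch of the flipped simulation. It simply repeats the steps from above with 60 items and 30 subjects, reusing logistic() and rbern() plus the contr.sum() shortcut introduced in the next section; it is a quick stand-in, not the sim_data() from estimate-acc.R.

# A quick stand-in: repeat the simulation with the numbers flipped
set.seed(13)                                 # any seed will do
n_item_f = 60; n_subj_f = 30                 # flipped numbers
A_f = rnorm( n_subj_f )                      # abilities
D_f = seq( -1.6, 1, length=n_item_f )        # difficulties
d_f = expand.grid( S=1:n_subj_f, I=1:n_item_f, KEEP.OUT.ATTRS=FALSE )
d_f$R = rbern( logistic(A_f[d_f$S] - D_f[d_f$I]) )
d_f$S = factor(d_f$S)
d_f$I = factor(d_f$I)
contrasts( d_f$S ) = contr.sum( n_subj_f )   # sum-to-zero constraint (see next section)
m_flip = glm( R ~ -1 + I + S, data=d_f, family=binomial("logit") )
# The coefficients of m_flip can then be compared to A_f and D_f as above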

Coding models: the easy route

We have gone through a long route, along which we have learned a lot. Now, you are qualified to take the easy route: to use R’s handy function for dummy coding. Note that this route won’t be easy at all if you never went through the longer one. Rather, confusion is all you will get, and you will have no confidence in the model you coded. Simple code is simple only for those who are well-trained. So now, let’s fit the model again. This time, we take the highway.

The trick for controlling how model-fitting functions dummy code categorical variables is to set up the preferred coding scheme with contrasts(). In the code further below, I pass the number of categories in $S$ (i.e., n_subj) to contr.sum(), a helper function that codes the subjects in exactly the same way as we did in Table 3.3. Execute contr.sum(3) and you will see a table that corresponds to the subject part of Table 3.3, as shown below.
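
For example, with three categories:

contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

The rows correspond to the three subjects (J, K, L) and the two columns to the retained subjects J and K, with the dropped subject L coded as -1 across the last row.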

After the coding scheme is set, we can put the categorical predictors in the model formula directly. Everything else is the same as before, except the last line. Previously, I demonstrated dummy coding by dropping the first subject in $S$; contr.sum(), by default, drops the last. Thus, the code for reconstructing the dropped subject is slightly different here.

dat = d
# Drop the last S and impose sum-to-zero constraint on S
contrasts( dat$S ) = contr.sum( n_subj )
m1.2 = glm( R ~ -1 + I + S, data=dat, family=binomial("logit") )
eff = coef(m1.2)
item_eff.m1.2 = eff[ startsWith(names(eff), "I") ]
subj_eff.m1.2 = eff[ startsWith(names(eff), "S") ]
subj_eff.m1.2 = c( subj_eff.m1.2, -sum(subj_eff.m1.2) )

Now that we have the estimated effects from the model, let’s check the results. The figure below plots the current estimates (m1.2) against the previous ones (m1.1). The estimates from the two models agree, which confirms that the second model is correctly coded.

est_m1.1 = c( item_eff, subj_eff )
est_m1.2 = c( item_eff.m1.2, subj_eff.m1.2 )
plot( 1, type="n", xlim=c(-2.5,2.5), ylim=c(-2.5,2.5), xlab="m1.2", ylab="m1.1" )
abline( 0,1, lty="dashed", col="grey" )
points( est_m1.2, est_m1.1, pch=19, col=cols )
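
Besides eyeballing the plot, a quick numeric check (reusing the two vectors above) confirms the agreement:

# The two coding routes should give the same estimates up to numerical error
max( abs(est_m1.1 - est_m1.2) )  # should be very close to zero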

What’s next?

This post is lengthy, but not because it is hard. Rather, the concepts presented are fairly simple. The post is lengthy because we are used to texts that hide details from the reader. The text here does the opposite: it presents all the details needed to get the statistical model working, without confusion. People often assume that hidden details are trivial. But more often, they are hidden just because writers are too lazy to present them. Statistics is hard partly because it is loaded with details that are hidden and ignored. When details get ignored long enough, they accumulate to become entangled and uncrackable. Coding, again, is here to help. It dissolves the fuzziness that otherwise accumulates and hinders understanding.

In Part 3, we move on to Generalized Linear Mixed Models, which are extensions of GLMs that improve estimation and efficiency by harnessing the information from common group memberships in the data. We will use the same data, and the text should be much shorter, I hope. Seeya!


  1. I use dummy coding as a general umbrella term to cover all systems for coding categorical variables. In R, what is known as “treatment coding” (contr.treatment) is sometimes called “dummy coding” by others. Here, I follow R’s convention. When “dummy coding” is used, I always refer to the general sense of re-coding categories as numbers (not necessarily zeros and ones). ↩︎

  2. This default behavior of estimating a global intercept makes sense in the context of continuous predictors, such as the simple linear model shown below. In this case, we can succinctly express the formula as y ~ x in R’s model syntax. The estimate of the global intercept $\alpha$ would be given as the intercept coefficient in the model output.

    $$ \begin{aligned} y_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta x_i \end{aligned} $$ ↩︎

  3. I haven’t mentioned the binomial distribution before. The binomial distribution is the extension of the Bernoulli distribution to $n$ independent trials. So if you repeat a Bernoulli process $n$ times and sum the outcomes (say, toss a coin $n=10$ times and count the number of heads), the sum follows a binomial distribution with parameters $n$ and $p$. The Bernoulli distribution is thus simply a special case of the binomial distribution with $n=1$. ↩︎