Demystifying Item Response Theory (1/4)
Playing God through Simulations
Mysterious Item Response Theory
Item response theory is mysterious and intimidating to students. It is mysterious in the way it is presented in textbooks, at least in introductory ones. The text often starts with an ambitious conceptual introduction to IRT, which most students can follow, albeit with some confusion. Curious students might bear with the confusion and expect it to resolve later in the text, only to find themselves disappointed. At the point where the underlying statistical model should be elaborated further, the text abruptly stops and tries to convince readers to trust the results from black-box IRT software packages.
It isn’t that I have trust issues with black-box software, and I agree that certain details of IRT model estimation should be hidden from readers. The problem is that there is a huge gap between where textbooks stop explaining and where the confusing statistical details should be hidden. Students are thus tricked into believing that they have a sufficient degree of understanding, when in reality it is just blind faith.
A sufficient degree of understanding should allow the student to deploy the learned skills to new situations. Therefore, a sufficient degree of understanding of IRT models should allow students to extend and apply the models to analyses of, for instance, differential item functioning (DIF) or differential rater functioning (DRF).
I’m arguing here that there is a basic granularity of understanding, somewhat similar to the concept of a basic-level category, that, once reached, allows a student to smoothly adapt the learned skills to a wide variety of situations, modifying and extending them on demand. And I believe that item response theory is too hard for a student to learn to this basic level of understanding, given its conventional presentation and historical development1.
There is still hope, however, thanks to the development of a very general family of statistical models known as Generalized Linear Models (GLMs). Item response models can be understood in terms of the GLM and its extensions (the GLMM and non-linear forms of the GLM/GLMM). To be picky about the details, the results from IRT software packages and from GLMs/GLMMs will be very similar but not identical, since they use different estimation methods. The strengths of the GLM, however, lie in its conceptual simplicity and extensibility. Through the GLM, IRT and other models such as the t-test, ANOVA, and linear regression are all placed into the same conceptual framework. Furthermore, software packages implementing GLMs are widely available. Users can thus experiment with them: simulate a set of data based on known parameters, construct the model and feed it the data, and see if the fitted model correctly recovers the parameters. This technique of learning statistics is probably the only effective way for students to understand a mysterious statistical model.
In this series of posts, I will walk you through the path of understanding item response theory, with the help of simulations and generalized linear models. No need to worry if you don’t know GLMs yet. We have another ally: R, in which we will simulate artificial data and fit statistical models along the way. Although it might seem intimidating at first, coding simulations and models in fact provides scaffolding for learning. When feeling unsure or confused, you can always resort to these simulation-based experiments to resolve the issues at hand. In this very first post, I will start by teaching you simulations.
Just Enough Theory to Get Started
Jargon aside, the concept behind item response theory is fairly simple. Consider the case where 80 testees are taking a 20-item English proficiency test. In this situation, what are the factors that influence whether an item is correctly solved by a testee? Straightforward, right? If an item is easy and if a testee is proficient in English, he/she would probably get the item correct. Here, two factors jointly influence the result:
- how difficult (or easy) the item is
- the English ability of the testee
We can express these variables and the relations between them in the graphs below. Let’s focus on the left one first. Here, $A$ represents the ability of the testee, $D$ represents the difficulty of the item, and $R$ represents the item response, or score on the item. A response for an item is coded as 1 ($R=1$) if it is solved correctly; otherwise, it is coded as 0 ($R=0$). The arrows $A \rightarrow R$ and $D \rightarrow R$ indicate the direction of influence. The arrows enter $R$ since item difficulty and testee ability influence the score on the item (not the other way around). $A$ and $D$ are drawn as circled nodes to indicate that they are unobserved (or latent, if you prefer a fancier term), whereas uncircled nodes represent directly observable variables (i.e., stuff that gets recorded during data collection). This graphical representation of the variables and their relationships is known as a Directed Acyclic Graph (DAG).
The DAGs laid out here represent the concept behind the simplest kind of item response model, known as the 1-parameter logistic (1PL) model (or the Rasch model). In more formal terms, this model posits that the probability of correctly answering an item is determined by the difference between testee ability and item difficulty. So a more precise DAG representation for this model would be the one shown on the right above. Here, $P$ is the probability of correctly answering the item, which cannot be directly observed. However, $P$ directly influences the item score $R$, hence the arrow $P \rightarrow R$.
Believe it or not, the things we have learned so far could get us started. So let’s simulate some data, based on what we’ve learned about item response theory!
Simulating Item Responses
Consider the scenario where 3 students (Rob, Tom, and Joe) take a math test with 2 items (A and B). Since we play god during simulations, we know the math ability of the students and the difficulty of the items. These ability/difficulty levels are unbounded and can take any real value, with larger numbers indicating higher levels of difficulty/ability. In addition, the levels of difficulty and ability sit on a common scale and hence can be directly compared. Also, each student responds to every item, so we get responses from all 6 (3x2) combinations of students and items. Let’s code this in R below. The function expand.grid() pairs up the 6 combinations for us.
D = c( A=0.4, B=0.1 )           # Difficulty of item
A = c( R=0.5, T=0.1, J=-0.4 )   # Ability of student (R:Rob, T:Tom, J:Joe)

dat = expand.grid(
  I = c( "A", "B" ),            # Item index
  T = c( "R", "T", "J" )        # Testee index
)
dat
# A tibble: 6 x 2
I T
<fct> <fct>
1 A R
2 B R
3 A T
4 B T
5 A J
6 B J
With all possible combinations of the students and the items in place, we can collect the values of student ability and item difficulty into the data frame.
dat$A = A[ dat$T ]   # map ability to df by testee index T
dat$D = D[ dat$I ]   # map difficulty to df by item index I
dat
# A tibble: 6 x 4
I T A D
<fct> <fct> <dbl> <dbl>
1 A R 0.5 0.4
2 B R 0.5 0.1
3 A T 0.1 0.4
4 B T 0.1 0.1
5 A J -0.4 0.4
6 B J -0.4 0.1
Now that we’ve got all the data needed for the simulation, the only thing left is to precisely lay out the rules for generating the response data $R$: the scores (zeros and ones) on the items solved by the students. We are two steps away.
Generating Probabilities
When IRT was introduced in the previous section, I mentioned that the probability of successfully solving an item is determined by the difference between testee ability and item difficulty. It is straightforward to get this difference: simply subtract $D$ from $A$ in the data. This gives us a new variable $\mu$. I save the values of $\mu$ to the column Mu in the data frame.
dat$Mu = dat$A - dat$D
dat
# A tibble: 6 x 5
I T A D Mu
<fct> <fct> <dbl> <dbl> <dbl>
1 A R 0.5 0.4 0.1
2 B R 0.5 0.1 0.4
3 A T 0.1 0.4 -0.3
4 B T 0.1 0.1 0
5 A J -0.4 0.4 -0.8
6 B J -0.4 0.1 -0.5
From the way $\mu$ is calculated ($A - D$), we can see that, for a particular observation, if $\mu$ is large and positive, the testee’s ability is much greater than the item’s difficulty, and he would probably succeed on the item. On the other hand, if $\mu$ is large and negative, the item’s difficulty is much greater, and the testee would likely fail on the item. Hence, $\mu$ should be directly related to probability, in that large values of $\mu$ result in high probabilities of success on the items, whereas small values of $\mu$ result in low probabilities of success. But how exactly is $\mu$ linked to probability? How can we map $\mu$ to probability in a principled manner? The solution is to take advantage of the logistic function.
$$ \text{logistic}( x ) = \frac{ 1 }{ 1 + e^{-x} } $$
The logistic is a function that maps a real number $x$ to a probability $p$. In other words, the logistic function transforms the input $x$ and constrains it to a value between zero and one. Note that the transformation is monotonically increasing, meaning that a smaller $x$ is mapped onto a smaller $p$, and a larger $x$ onto a larger $p$. The ranks of the values before and after the transformation stay the same. To get a feel for what the logistic function does, let’s transform some values with it.
# Set plot margins              # (b, l, t, r)
par( oma=c(0, 0, 0, 0) )        # Outer margin
par( mar=c(4.5, 4.5, 1, 3) )    # Margin

logistic = function(x) 1 / ( 1 + exp(-x) )
x = seq( -5, 5, by=0.1 )
p = logistic( x )
plot( x, p )
As the plot shows, the logistic transformation results in an S-shaped curve. Since the transformed values (p) are bounded by 0 and 1, extreme values at either end of the x-axis get “squeezed” by the transformation: real numbers with absolute values greater than 4 are mapped to probabilities very close to the boundaries.
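To check this squeezing numerically (a quick sanity check of my own, using the logistic() function defined above), pass a few extreme values through it and note how close the results are to 0 and 1:

logistic( c( -6, -4, 0, 4, 6 ) )   # ~0.0025 0.0180 0.5000 0.9820 0.9975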
Less Math, Less Confusion
For many students, the mathematical form of the logistic function leads to confusion. Staring at the math symbols hardly enables one to arrive at any insightful interpretation of the logistic. A suggestion here is to let go of the search for such an interpretation. The logistic function is introduced not because it is loaded with some crucial mathematical or statistical meaning. Instead, it is used here solely for a practical reason: to monotonically map real numbers to probabilities. You may well use another function here to achieve the same purpose (e.g., the cumulative distribution function of the standard normal).
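As a small illustration of this point (my own sketch, not part of the original example), the code below overlays the logistic curve and pnorm(), R’s cumulative distribution function of the standard normal. Both map real numbers monotonically into (0, 1); they simply differ in slope, which is why either could serve the same purpose here.

x = seq( -5, 5, by=0.1 )
plot( x, logistic( x ), type="l", ylab="p" )   # logistic curve
lines( x, pnorm( x ), lty="dashed" )           # standard normal CDF (the probit alternative)
legend( "topleft", legend=c("logistic", "pnorm"), lty=c("solid", "dashed") )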
Generating Responses
We have gone all the way from ability/difficulty levels to the probabilities of success on the items. Since we cannot directly observe probabilities in the real world, the final step is to link these probabilities to observable outcomes. In the case here, the outcomes are simply item responses of zeros and ones. How do we map probabilities to zeros and ones? Coin flips, or Bernoulli distributions, will get us there.
Every time a coin is flipped, either a head or a tail is observed. The Bernoulli distribution is just a fancy way of describing this process. Assume that we record tails as 0s and heads as 1s, and suppose that the probability $P$ of observing a head equals 0.75 (the coin is biased in some way such that heads are more likely, and we somehow know it). Then, the distribution of the outcomes (zeros and ones) is a Bernoulli distribution with parameter $P=0.75$. In graphical terms, the distribution is just two bars.
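If you want to see those two bars, here is a minimal sketch (my own addition) that draws the Bernoulli distribution with $P = 0.75$:

P = 0.75
barplot( c( "0" = 1 - P, "1" = P ), xlab="Outcome", ylab="Probability" )   # P(0)=0.25, P(1)=0.75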
We’ve got all we need by now. Let’s construct the remaining columns to complete this simulation. In the code below, I compute the probabilities (P) from column Mu. Column P could then generate column R, the item responses, through the Bernoulli distribution.
rbern = function( p, n=length(p) )
  rbinom( n=n, size=1, prob=p )

set.seed(13)
dat$P = logistic( dat$Mu )
dat$R = rbern( dat$P )   # Generate 0/1 from Bernoulli distribution
dat
# A tibble: 6 x 7
I T A D Mu P R
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A R 0.5 0.4 0.1 0.525 0
2 B R 0.5 0.1 0.4 0.599 1
3 A T 0.1 0.4 -0.3 0.426 0
4 B T 0.1 0.1 0 0.5 0
5 A J -0.4 0.4 -0.8 0.310 1
6 B J -0.4 0.1 -0.5 0.378 0
Now, we have a complete table of simulated item responses. A few things to notice here. First, look at the fourth row of the data frame, where the response of testee T (Tom) on item B is recorded. Column Mu has a value of zero since Tom’s ability level is identical to the difficulty of item B. What does it mean to be “identical”? “Identical” implies that Tom is neither more likely to succeed nor to fail on item B. Hence, you can see that Tom has a 50% chance of getting item B correct in the $P$ column. This is how the ability/difficulty levels and $\mu$ are interpreted. They are on an abstract scale of real numbers. We need to convert them to probabilities to make sense of them.
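As a quick sanity check (my own addition), you can recover the $P$ column by passing the values of $\mu$ through logistic(): a $\mu$ of zero maps exactly to 0.5, while negative and positive values fall below and above it.

logistic( c( -0.8, 0, 0.4 ) )   # ~0.310 0.500 0.599, matching rows 5, 4, and 2 of the P column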
The second thing to notice is column R. This is the only column that has randomness introduced. Every run of the simulation will likely give different values of $R$ (unless a random seed is set, or the P column consists solely of zeros and ones). The outcomes are not guaranteed; probability is at work.
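To see this randomness in action, here is a small demonstration (my own addition): generating $R$ twice from the same P column, without fixing the seed, typically yields two different sequences of zeros and ones.

rbern( dat$P )   # one run
rbern( dat$P )   # another run; the zeros and ones will usually differ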
The presence of such randomness is the gist of simulations and statistical models. We add uncertainty to the simulation, mimicking the real world, to see whether, in the presence of such uncertainty, it is still possible to recover the targets of interest with a statistical model. Randomness, however, poses some challenges for coding. We need to equip ourselves for those challenges.
Coding Randomness
Randomness is inherent in simulations and statistical models, so it is impossible to run away from it. It is everywhere. The problem with randomness is that it introduces uncertainty in the outcome produced. Thus, it would be hard to spot any errors just by eyeballing the results.
Take rbern() for instance. Given a parameter $P=0.5$, we can repeatedly run rbern(0.5) a couple of times to produce zeros and ones. But these zeros and ones cannot tell us whether rbern(0.5) is working properly. rbern(0.5) might be broken somehow and instead generate the ones with, say, $P=0.53$.
The law of large numbers can help us here. Since rbern() generates ones with probability $P$, if we run rbern() many times, the proportion of ones in the outcomes should converge to $P$. To achieve this, take a look at the second argument n in rbern(), which is set here to repeatedly generate outcomes ten thousand times. You can increase n to see if the result comes even closer to $0.5$.
outcomes = rbern( p=0.5, n=1e4 )   # Run rbern with P=0.5 10,000 times
mean(outcomes)
[1] 0.4991
A more general way to rerun a chunk of code is through the for loop or convenient wrappers such as the replicate() function. I demonstrate some of their uses below.
outcomes = replicate( n=1e4, expr={ rbern( p=0.5 ) } )
mean(outcomes)
[1] 0.5027
# See if several runs of rbern( p=0.5, n=1e4 ) give results around 0.5
Ps = replicate( n=100, expr={
  outcomes = rbern( p=0.5, n=1e4 )
  mean(outcomes)
})

# Plot Ps to see if they scatter around 0.5
plot( 1, type="n", xlim=c(0, 100), ylim=c(0.47, 0.53), ylab="P" )
abline( h=0.5, lty="dashed" )
points( Ps, pch=19, col=2 )
Randomness in Models
The method described above works only for simple cases. What about complex statistical models? How do we test that they are working as they claim to? Guess what? Simulation is the key.
We simulate data based on the assumptions of the statistical model and see if it indeed returns what it claims to estimate. The simulation can be repeated several times, each time with different parameter values. If the parameters are always recovered by the statistical model, we can be confident that the model is properly constructed and correctly coded. So simulation is really not optional when doing statistics. It is the only safeguard against bugs in our statistical models, both programmatic and theoretical ones. Without first testing the statistical model on simulated data, any inferences about the empirical data are in doubt.
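To make this concrete without jumping ahead to the full item response model of Part 2, here is a minimal sketch of the simulate-then-recover workflow (my own addition): generate Bernoulli data with a known success probability, fit an intercept-only logistic regression with R’s built-in glm(), and check that the known probability is recovered.

set.seed(1)
true_p = 0.7
y = rbern( p=true_p, n=1e4 )          # simulate data with a known parameter

fit = glm( y ~ 1, family=binomial )   # intercept-only logistic regression
plogis( coef(fit) )                   # back-transform the intercept; should be close to 0.7

If the printed value is far from 0.7, something in the simulation or the model code is wrong, which is exactly the kind of bug this workflow is designed to catch.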
What’s next?
In a real-world counterpart of the example presented here, we would only observe the score ($R$) of an item ($I$) taken by a testee ($T$). The targets of interest are the unobserved item difficulty ($D$) and testee ability ($A$). In Part 2, we will work in reverse and fit statistical models to simulated data. We will see how the models recover the true $A$s and $D$s from the information in $R$, $I$, and $T$. See you there!
I mean, what the heck is Joint and Conditional Maximum Likelihood Estimation? These are methods developed in the psychometric and measurement literature and are hardly seen in other fields. Unless obsessed with psychometrics, I don’t think one would be able to understand these things. ↩︎