# Demystifying Item Response Theory (1/4)

Playing God through simulations

## Mysterious Item Response Theory

**Item response theory is mysterious and intimidating to students.**
It is mysterious in the way it is presented in textbooks, at least in
introductory ones. The text often starts with an ambitious conceptual
introduction to IRT, which most students would be able to follow, but
with some confusion. Curious students might bear with the confusion and
expect it to resolve in the following text, only to find themselves
disappointed. At the point where the underlying statistical model should
be further elaborated, the text abruptly stops and tries to convince
readers to trust the results from black-box IRT software packages.

It isn’t that I have trust issues with black-box software, and I also
agree that it is impractical to understand all the details of model
estimation in IRT. Doing so would be similar to coding an IRT program
from scratch. The issue is that there is a huge gap here, between where
textbooks stopped explaining and where the confusing details of
statistical models should be hidden. Hence, students would be tricked
into believing that they have a *sufficient degree of understanding*,
but in reality, it is just blind faith.

A sufficient degree of understanding should allow the student to deploy
the learned skill to new situations. Therefore, a sufficient degree of
understanding of IRT models, for instance, should allow students to
extend and apply the models to analyses of differential item
functioning
(DIF) and
**differential rater functioning (DRF)**.

I’m arguing here that there is a basic granularity of understanding,
somewhat similar to the concept of basic-level
category, that when
reached, allows a student to smoothly adapt the learned skill to a wide
variety of situations, modifying and extending the skill on demand. And
I believe that item response theory is * too hard* for a student to
learn and reach this basic level of understanding, due to its
conventional presentation and historical development

^{1}.

There is still hope however, thanks to the development of a very general
set of statistical models known as the Generalized Linear Models
(GLM). Item
response models could be understood in terms of the GLM and its
extensions
(GLMM
and non-linear form of GLM/GLMM). To be too particular about the
details, the results from IRT software packages and GLMs/GLMMs would be
very close but not identical, since they utilize different estimation
methods. The strengths of GLM, however, lie in its conceptual simplicity
and extensibility. Through GLM, IRT and other models such as the T-test,
ANOVA, and linear regression, are all placed together into the same
conceptual framework. Furthermore, software packages implementing GLMs
are widely available. Users can thus experiment with them—simulate a set
of data based on known parameters, construct the model and feed it the
data, and see if the fitted model correctly recovers the parameters.
This technique of learning statistics is probably the only effective way
for students to understand * a mysterious statistical model*.

In this series of posts, I will walk the readers through the path of
understanding item response theory, with help from simulations and
generalized linear models. No need to worry if you don’t know GLMs yet.
We have another ally—R, in which we will be
simulating artificial data and fitting statistical models along the way.
Although it might seem intimidating at first, coding simulations and
models in fact provides scaffolding for learning. When feeling unsure or
confused, you can always resort to these simulation-based experiments to
resolve the issues at hand. In this very first post, I will start by
teaching you * simulations*.

## Just Enough Theory to Get Started

Jargons aside, the concept behind item response theory is fairly simple.
Consider the case where 80 testees are taking a 20-item English
proficiency test. Under this situation, what are the * factors* that
influence whether an item is correctly answered by a testee?
Straightforward right? If an item is easy and if a testee is proficient
in English, he/she would probably get the item correct. Here,

*:*

**two factors jointly influence the result**- how difficult (or easy) an item is
- the English ability of a testee

We can express these variables and the relations between them in the
graphs below. Let’s focus on the left one first. Here, $A$ represents
the **ability** of the testee, $D$ represents the **difficulty** of the
item, and $R$ represents the **item response**, or **score** on the
item. A response for an item is coded as `1`

($R=1$) if it is correctly
answered. Otherwise, it is coded as `0`

($R=0$). The arrows $A$ → $R$
and $D$ → $R$ indicate the direction of influence. The arrows enter $R$
since item difficulty and testee ability influence the score on the item
(not the other way around). $A$ and $D$ are drawn as a circled node to
indicate that they are **unobserved** (or **latent**, if you prefer a
fancier term), whereas uncircled nodes represent directly observable
variables (i.e., stuff that gets recorded during data collection). This
graphical representation of the variable and their relationships is
known as a Directed Acyclic Graph
(DAG).

The DAGs laid out here represent the concept behind the simplest kinds
of item response models, known as the **1-parameter logistic (1PL)
model** (or Rasch Model). In more formal terms, this model posits that
the **probability** of correctly answering an item is determined by the
**difference** between testee ability and item difficulty. So a more
precise DAG representation for this model would be the one shown on the
right above. Here, $P$ is the probability of correctly answering the
item, which cannot be directly observed. However, $P$ directly
influences the item score $R$, hence the arrow $P$ → $R$.

Believe it or not, the things we have learned so far could get us started. So let’s simulate some data, based on what we know about item response theory!

## Simulating Item Responses

Consider the scenario where 3 students (Rob, Tom, and Joe) took a math
test with 2 items (A and B). Since we play gods during simulations, we
know the math ability of the students and the difficulty of the items.
These ability/difficulty levels can range from positive to negative
numbers, unbounded. Larger numbers indicate higher levels of
difficulty/ability. In addition, the levels of difficulty and ability
sit on a common scale, and hence could be directly compared. Also, each
student answers every item, so we get responses from all 6 (3x2)
combinations of students and items. Let’s code this in R. The function
`expand.grid()`

would pair up the 6 combinations for us.

```
1D = c( A=0.4, B=0.1 ) # Difficulty of item
2A = c( R=0.5, T=0.1, J=-0.4 ) # Ability of student (R:Rob, T:Tom, J:Joe)
3
4dat = expand.grid(
5 I = c( "A", "B" ), # Item index
6 T = c( "R", "T", "J" ) # Testee index
7)
8dat
```

```
# A tibble: 6 × 2
I T
<fct> <fct>
1 A R
2 B R
3 A T
4 B T
5 A J
6 B J
```

After having all possible combinations of the students and the items, we could collect the values of student ability and item difficulty into the data frame.

```
1dat$A = A[ dat$T ] # map ability to df by testee index T
2dat$D = D[ dat$I ] # map difficulty to df by item index I
3dat
```

```
# A tibble: 6 × 4
I T A D
<fct> <fct> <dbl> <dbl>
1 A R 0.5 0.4
2 B R 0.5 0.1
3 A T 0.1 0.4
4 B T 0.1 0.1
5 A J -0.4 0.4
6 B J -0.4 0.1
```

Now, we’ve got all the data needed for simulation, the only thing left
is to precisely lay out the **rules for generating the response data
$R$**—the scores (zeros and ones) on the items answered by the students.
We are two steps away.

### Generating Probabilities

When IRT is introduced in the previous section, I mention that the
probability of success on an item is determined by the **difference
between testee ability and item difficulty**. It is straightforward to
get this difference: simply subtract $D$ from $A$ in the data. This
would give us a new variable $\mu$. I save the values of $\mu$ to column
`Mu`

in the data frame.

```
1dat$Mu = dat$A - dat$D
2dat
```

```
# A tibble: 6 × 5
I T A D Mu
<fct> <fct> <dbl> <dbl> <dbl>
1 A R 0.5 0.4 0.1
2 B R 0.5 0.1 0.4
3 A T 0.1 0.4 -0.3
4 B T 0.1 0.1 0
5 A J -0.4 0.4 -0.8
6 B J -0.4 0.1 -0.5
```

From the way $\mu$ is calculated ($A$ - $D$), we can infer that, for a particular observation, if $\mu$ is positive and large, the testee’s ability will be much greater than the item’s difficulty, and he would probably succeed on this item. On the other hand, if $\mu$ is negative and small, the item difficulty would be much greater in this case, and the testee would likely fail on this item. Hence, $\mu$ is directly related to probability, in that $\mu$ of large values result in high probabilities of success on the items, whereas $\mu$ of small values result in low probabilities of success. But how exactly is $\mu$ linked to probability? How could we map $\mu$ to probability in a principled manner? The solution is to take advantage of the logistic function.

$$ logistic( x ) = \frac{ 1 }{ 1 + e^{-x} } $$

The * logistic* is a function that maps a real number $x$ to a
probability $p$. In other words, the logistic function transforms the
input $x$ and constrains it to a value between zero and one. Note that
the transformation is

**monotonic increasing**, meaning that a smaller $x$ would be mapped onto a smaller $p$, and a larger $x$ would be mapped onto a larger $p$. The ranks of the values before and after the transformation stay the same. To have a feel of what the logistic function does, let’s transform some values with the logistic.

```
1# Set plot margins # (b, l, t, r)
2par(oma=c(0,0,0,0)) # Outer margin
3par(mar=c(4.5, 4.5, 1, 3) ) # margin
4
5logistic = function(x) 1 / ( 1 + exp(-x) )
6x = seq( -5, 5, by=0.1 )
7p = logistic( x )
8plot( x, p )
```

As the plot shows, the logistic transformation results in an S-shaped curve. Since the transformed values (p) are bounded by 0 and 1, extreme values on the poles of the x-axis would be “squeezed” after the transformation. Real numbers with absolute values greater than 4, after transformations, would have probabilities very close to the boundaries.

#### Less Math, Less Confusion

For many students, the mathematical form of the logistic function leads
to confusion. Staring at the math symbols hardly enables one to arrive
at any insightful interpretation of the logistic. A suggestion here is
to let go of the search for such an interpretation. The logistic
function is introduced not because it is loaded with some crucial
mathematical or statistical meaning. Instead, it is used here solely for
a practical reason: to monotonically map real numbers to probabilities.
You may well use another function here to achieve the same purpose
(e.g., the **cumulative distribution function** for the normal
distribution).

### Generating Responses

We have gone all the way from ability/difficulty levels to the probabilities of success on the items. Since we cannot directly observe probabilities in the real world, the final step is to link these probabilities to observable outcomes. In the case here, the outcomes are simply item responses of zeros and ones. How do we map probabilities to zeros and ones? Coin flips, or Bernoulli distributions, will get us there.

Every time a coin is flipped, either a tail or a head is observed. The
Bernoulli distribution is just a fancy way of describing this process.
Assume that we record tails as `0`

s and heads as `1`

s, and suppose that
the probability $p$ of observing a head equals 0.75 (since the coin is
imbalanced in some way that the head is more likely observed and we know
it somehow). Then the distribution of the outcomes (zero and one) is a
Bernoulli distribution with parameter $P=0.75$. In graphical terms, the
distribution is just two bars.

We’ve got all we need by now. Let’s construct the remaining columns to
complete this simulation. In the code below, I compute the probabilities
(`P`

) from column `Mu`

. Column `P`

could then generate column `R`

, the
item responses, through the Bernoulli distribution.

```
1rbern = function( p, n=length(p) )
2 rbinom( n=n, size=1, prob=p )
3
4set.seed(13)
5dat$P = logistic( dat$Mu )
6dat$R = rbern( dat$P ) # Generate 0/1 from Bernoulli distribution
7dat
```

```
# A tibble: 6 × 7
I T A D Mu P R
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A R 0.5 0.4 0.1 0.525 0
2 B R 0.5 0.1 0.4 0.599 1
3 A T 0.1 0.4 -0.3 0.426 0
4 B T 0.1 0.1 0 0.5 0
5 A J -0.4 0.4 -0.8 0.310 1
6 B J -0.4 0.1 -0.5 0.378 0
```

Now, we have a complete table of simulated item responses. A few things
to notice here. First, look at the fourth row of the data frame, where
the response of testee T (Tom) on item B is recorded. Column `Mu`

has a
value of zero since Tom’s ability level is identical to the difficulty
of item B. What does it mean to be “identical”? “Identical” implies that
Tom is neither more likely to succeed nor to fail on item B. Hence, you
can see that Tom has a 50% of getting item B correct in the $P$ column.
This is how the ability/difficulty levels and $\mu$ are interpreted.
They are on an abstract scale of real numbers. We need to convert them
to probabilities to make sense of them.

The second thing to notice is column `R`

. This is the only column that
has * randomness* introduced. Every run of the simulation would
likely give different values of $R$ (unless a random seed is set, or the

`P`

column consists solely of zeros and ones). The outcomes are not
guaranteed, probability is at work.The presence of such randomness is the gist of simulations and statistical models. We add uncertainty to the simulation, mimicking the real world, to know that in the presence of such uncertainty, would it still be possible to discover targets of interest with a statistical model. Randomness, however, poses some challenges for coding. We need to equip ourselves for those challenges.

## Coding Randomness

Randomness is inherent in simulations and statistical models, so it is
impossible to run away from it. It is everywhere. The problem with
randomness is that it introduces uncertainty in the outcome produced.
Thus, it would be hard to spot any errors just by **eyeballing the
results**.

Take `rbern()`

for instance. Given a parameter $P=0.5$, we can
repeatedly run `rbern(0.5)`

a couple of times to produce zeros and ones.
But these zeros and ones cannot tell us whether `rbern(0.5)`

is working
properly. `rbern(0.5)`

might be broken somehow, and instead generates
the ones with, say, $P=0.53$.

The law of large
numbers can help us
here. Since `rbern()`

generates one with probability $P$, if we run
`rbern()`

many times, the proportion of one in the outcomes should
converge to $P$. The second argument `n`

in `rbern()`

is a handy way to
achieve this. You can increase `n`

to see if the result comes even
closer to $0.5$.

```
1outcomes = rbern( p=0.5, n=1e4 ) # Run rbern with P=0.5 10,000 times
2mean(outcomes)
```

```
[1] 0.4991
```

A more general way to rerun a chunk of code is through the for loop or
convenient wrappers such as the `replicate()`

function.

```
1outcomes = replicate( n=1e4, expr={ rbern( p=0.5 ) } )
2mean(outcomes)
```

```
[1] 0.5027
```

```
1# See if several runs of rbern( p=0.5, n=1e4 ) give results around 0.5
2Ps = replicate( n=100, expr={
3 outcomes = rbern( p=0.5, n=1e4 )
4 mean(outcomes)
5})
6
7# Plot Ps to see if they scatter around 0.5
8plot( 1, type="n", xlim=c(0, 100), ylim=c(0.47, 0.53), ylab="P" )
9abline( h=0.5, lty="dashed" )
10points( Ps, pch=19, col=2 )
```

### Randomness in Models

The method described above works only for simple cases. What about
complex statistical models? How do we test that they are working as they
claim to? Guess what? * Simulation* is the key.

We simulate data based on the assumptions of the statistical model and
see if it indeed returns what it claims to estimate. The simulation can
be repeated several times, each set to different values of parameters.
If the parameters are always recovered by the statistical model, we can
then be confident that the model is properly constructed and correctly
coded. So **simulation is really not an option when doing
statistics**. It is the only safety that helps us guard against bugs in
our statistical models, both programmatical and theoretical ones.
Without first testing the statistical model on simulated data, any
inferences about the empirical data are in doubt.

## What’s next?

In a real-world scenario of the example presented here, we would only observe the score ($R$) of an item ($I$) taken by a testee ($T$). The targets of interest are the unobserved item difficulty ($D$) and testee ability ($A$). In Part 2, we will work in reverse and fit statistical models on simulated data. We will see how the models discover the true $A$s and $D$s from the information of $R$, $I$, and $T$. See you there!

I mean, what the hack is

*Joint/Conditional Maximum Likelihood Estimation*? These are methods developed in psychometrics and are hardly seen in other fields. Unless obsessed with psychometrics, I don’t think one would be able to understand these things. ↩︎