Introduction

Assessment instruments are widely used to measure individuals’ latent traits, that is, internal characteristics that cannot be directly measured. Educational and psychological tests are examples of such instruments. Each test is composed of a series of items, and an examinee’s answers to these items allow for the measurement of one or more of his or her latent traits. When a latent trait is expressed in numerical form, it is called an ability or proficiency.

Ordinary tests, hereon called linear tests, are applied using the orthodox paper-and-pencil strategy, in which tests are printed and all examinees are presented with the same items in the same order. One of the drawbacks of this methodology is that individuals with both high and low proficiencies must answer all items in order to have their proficiency estimated. An individual with high proficiency might get bored of answering the whole test if it only contains items that he or she considers easy; on the other hand, an individual with low proficiency might get frustrated when confronted by items he or she considers hard, and might give up on the test or answer the items without paying attention.

With these concerns in mind, a new paradigm in assessment emerged in the 1970s. Initially named tailored testing in [Lord77], these were tests in which the items presented to the examinee were chosen in real time, based on the examinee’s responses to previous items. The name was changed to computerized adaptive testing (CAT) due to the advances in technology that facilitated the application of such a testing methodology using electronic devices, like computers and tablets.

In a CAT, the examinee’s proficiency is re-estimated after the response to each item. The new estimate is then used to select a new item, closer to the examinee’s real proficiency. This method of test application has several advantages over the traditional paper-and-pencil method: high-proficiency examinees are not required to answer all the easy items in a test, and answer only the items that actually give some information regarding their true knowledge of the subject matter. A similar but inverse effect happens for examinees of low proficiency.

Finally, the advent of CAT allowed for researchers to create their own variant ways of starting a test, choosing items, estimating proficiencies and stopping the test. Fortunately, the mathematical formalization provided by Item Response Theory (IRT) allows for tests to be computationally simulated and the different methodologies of applying a CAT to be compared under different constraints. Packages with these functionalities already exist in the R language ([Magis12]) but not yet in Python. catsim was created to fill this gap, using the facilities of established scientific packages such as numpy and scipy, as well as the object-oriented programming paradigm supported by Python to create a simple, comprehensive and user-extendable CAT simulation package.

Item Response Theory Models

As a CAT simulator, catsim borrows many concepts from Item Response Theory ([Lord68] and [Rasch66]), a series of models created in the second part of the 20th century with the goal of measuring latent traits. catsim makes use of Item Response Theory one-, two- and three-parameter logistic models, a series of models in which examinees and items are represented by a set of numerical values (the models’ parameters). Item Response Theory itself was created with the goal of measuring latent traits as well as assessing and comparing individuals’ proficiencies by allocating them in proficiency scales, inspiring as well as justifying its use in adaptive testing.

The logistic models of Item Response Theory are unidimensional, which means that a given assessment instrument only measures a single proficiency (or dimension of knowledge). The instrument, in turn, is composed of items in which examinees manifest their latent traits when answering them.

In unidimensional IRT models, an examinee’s proficiency is represented as \(\theta\). In principle \(-\infty < \theta < \infty\), but since the scale of \(\theta\) is up to the individuals creating the instrument, it is common for the values to follow the standard normal distribution \(N(0; 1)\), such that in practice \(-4 < \theta < 4\). Additionally, \(\hat\theta\) is the estimate of \(\theta\). Since a latent trait can’t be measured directly, estimates need to be made, which tend to get closer to the theoretically real \(\theta\) as the test progresses in length.

Under the logistic models of IRT, an item is represented by the following parameters:

  • \(a\) represents an item’s discrimination parameter, that is, how well it discriminates between individuals who answer the item correctly (or, in an alternative interpretation, individuals who agree with the idea of the item) and those who don’t. An item with a high \(a\) value tends to be answered correctly by all individuals whose \(\theta\) is above the item’s difficulty level and wrongly by all the others; as this value gets lower, this threshold gets blurry and the item becomes less informative. It is common for \(a > 0\).
  • \(b\) represents an item’s difficulty parameter. This parameter, which is measured in the same scale as \(\theta\), shows at which point of the proficiency scale an item is more informative, that is, where it discriminates the individuals who agree and those who disagree with the item. Since \(b\) and \(\theta\) are measured in the same scale, \(b\) follows the same distributions as \(\theta\). For a CAT, it is good for an item bank to have as many items as possible in all difficulty levels, so that the CAT may select the best item for each individual in all ability levels.
  • \(c\) represents an item’s pseudo-guessing parameter. This parameter denotes the probability that individuals with low proficiency values still answer the item correctly. Since \(c\) is a probability, \(0 \leq c \leq 1\), but the lower the value of this parameter, the better the item is considered.
  • \(d\) represents an item’s upper asymptote. This parameter denotes the maximum probability with which individuals with high proficiency values answer the item correctly, so \(1 - d\) is the probability that such individuals still answer the item incorrectly. Since \(d\) is a probability, \(0 < d \leq 1\), but the higher the value of this parameter, the better the item is considered.

For a set of items \(I\), when \(\forall i \in I, d_i = 1\), the four-parameter logistic model is reduced to the three-parameter logistic model. When, additionally, \(\forall i \in I, c_i = 0\), the three-parameter logistic model is reduced to the two-parameter logistic model. If all values of \(a\) are also equal, the two-parameter logistic model is reduced to the one-parameter logistic model. Finally, when \(\forall i \in I, a_i = 1\), we have the Rasch model ([Rasch66]). Thus, catsim is able to handle all of the logistic models presented above, since the underlying functions of all logistic models related to test simulations are the same, given the correct item parameters.

Under IRT, the probability that an examinee with proficiency \(\theta\) answers item \(i\) correctly, given the item parameters, is ([Ayala2009], [Magis13])

\[P(X_i = 1| \theta) = c_i + \frac{d_i-c_i}{1+ e^{-a_i(\theta-b_i)}}.\]

The information this item gives at proficiency \(\theta\), writing \(P_i(\theta)\) for \(P(X_i = 1| \theta)\), is calculated as ([Ayala2009], [Magis13])

\[I_i(\theta) = \frac{a_i^2[P_i(\theta)-c_i]^2[d_i - P_i(\theta)]^2}{(d_i-c_i)^2[1-P_i(\theta)]P_i(\theta)}.\]

Both of these functions are graphically represented in the following figure. It is possible to see that an item is most informative when its difficulty parameter is close to the examinee’s proficiency.

[Figure: item characteristic and item information functions (_images/introduction-1.png)]
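
The two functions discussed above can be sketched in plain Python. This is only an illustration using the standard library, not catsim’s own implementation; the names `prob_correct` and `item_information` are hypothetical, and the sketch assumes the usual convention in which the probability of a correct answer increases with \(\theta\).

```python
import math


def prob_correct(theta, a, b, c=0.0, d=1.0):
    """Four-parameter logistic probability of a correct response."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c=0.0, d=1.0):
    """Information an item provides at proficiency theta."""
    p = prob_correct(theta, a, b, c, d)
    numerator = a ** 2 * (p - c) ** 2 * (d - p) ** 2
    return numerator / ((d - c) ** 2 * p * (1.0 - p))


# Illustrative item: a = 1.5, b = 0.5, no guessing (c = 0) and d = 1.
thetas = [x / 10.0 for x in range(-40, 41)]
most_informative = max(thetas, key=lambda t: item_information(t, 1.5, 0.5))

print(prob_correct(0.5, 1.5, 0.5))  # probability when theta equals b
print(most_informative)             # theta where the item is most informative
```

For an item without guessing, the information peaks exactly at \(\theta = b\); when \(c > 0\), the peak shifts slightly above \(b\).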

The sum of the information of all items in a test is called test information [Ayala2009]:

\[I(\theta) = \sum_{j \in J} I_j(\theta).\]

The amount of error in the estimate of an examinee’s proficiency after a test is called the standard error of estimation [Ayala2009] and it is given by

\[SEE = \sqrt{\frac{1}{I(\theta)}}\]

Since every item contributes a non-negative amount to \(I(\theta)\), which is the denominator in the calculation of the \(SEE\), the more items an examinee answers, the smaller the \(SEE\) gets.
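
As a rough illustration of how test information accumulates and the \(SEE\) shrinks, the sketch below sums item information over progressively longer tests. It uses only the standard library; the function names and the four example items are made up for the illustration and are not part of catsim.

```python
import math


def prob_correct(theta, a, b, c=0.0, d=1.0):
    """Four-parameter logistic probability of a correct response."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c=0.0, d=1.0):
    """Information a single item provides at proficiency theta."""
    p = prob_correct(theta, a, b, c, d)
    numerator = a ** 2 * (p - c) ** 2 * (d - p) ** 2
    return numerator / ((d - c) ** 2 * p * (1.0 - p))


def total_information(theta, items):
    """Test information: the sum of the information of all items."""
    return sum(item_information(theta, *item) for item in items)


def standard_error(theta, items):
    """Standard error of estimation for a test made of the given items."""
    return math.sqrt(1.0 / total_information(theta, items))


# Hypothetical items as (a, b, c, d) tuples.
items = [(1.2, -1.0, 0.2, 1.0), (0.8, 0.0, 0.25, 1.0),
         (1.5, 0.5, 0.1, 1.0), (1.0, 1.0, 0.2, 1.0)]

# SEE at theta = 0 after 1, 2, 3 and 4 items.
see_by_length = [standard_error(0.0, items[:n]) for n in range(1, len(items) + 1)]
print(see_by_length)
```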

catsim provides these functions in the catsim.irt module.

The Item Matrix

In catsim, a collection of items is represented as a numpy.ndarray whose rows and columns represent items and their parameters, respectively. Thus, it is referred to as the item matrix. The most important features of the items are situated in the first three columns of the matrix, which represent the parameters \(a\), \(b\) and \(c\), respectively. Item matrices can be generated via the catsim.cat.generate_item_bank() function as follows:

>>> from catsim.cat import generate_item_bank
>>> generate_item_bank(5, '1PL')
>>> generate_item_bank(5, '2PL')
>>> generate_item_bank(5, '3PL')
>>> generate_item_bank(5, '3PL', corr=0.5)

These examples depict the generation of an array of five items according to the different logistic models. In the last example, parameters \(a\) and \(b\) have a correlation of \(0.5\), an adjustment that may be useful in case simulations require it [Chang2001].

After the simulation, catsim adds a fourth column to the item matrix, representing the item’s exposure rate, commonly denoted as \(r\). Its value denotes the proportion of tests in which an item has been used, and it is calculated as follows:

\[r_i = \frac{q_i}{N}\]

where \(q_i\) represents the number of tests in which item \(i\) has been used and \(N\) is the total number of tests applied.
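
The exposure rate formula can be illustrated with a short standard-library sketch. The function `exposure_rates` and the example data are hypothetical and only mirror the definition of \(r_i\); they are not catsim code.

```python
from collections import Counter


def exposure_rates(administered_items_per_test, bank_size):
    """Return r_i = q_i / N for every item index in the bank."""
    n_tests = len(administered_items_per_test)
    usage = Counter()
    for test_items in administered_items_per_test:
        # an item counts at most once per test
        usage.update(set(test_items))
    return [usage[i] / n_tests for i in range(bank_size)]


# Four simulated tests over a bank of five items (indexes 0 to 4).
administered = [[0, 2, 3], [2, 4, 0], [2, 1, 3], [0, 2, 4]]
rates = exposure_rates(administered, bank_size=5)
print(rates)  # [0.75, 0.25, 1.0, 0.5, 0.5]
```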

Computerized adaptive tests

Unlike linear tests, in which items are sequentially presented to examinees and their proficiency is estimated at the end of the test, in a computerized adaptive test (CAT) an examinee’s proficiency is calculated after the response to each item. The updated knowledge of an examinee’s proficiency at each step of the test allows for the selection of more informative items during the test itself, which in turn reduces the standard error of estimation of their proficiency at a faster rate. This behavior

The CAT Lifecycle

In general, a computerized adaptive test has a very well-defined lifecycle:

digraph cat_simple {
    bgcolor="transparent";
    rankdir=TB;
    a[label=<START>, shape=box];
    b[label=<Initial proficiency<br/>estimation>];
    c[label=<Item selection and <br/>administration>];
    d[label=<Capture answer>];
    e[label=<Proficiency estimation>];
    rank=same;
    f[label=<Stopping criterion<br/>reached?>, shape=diamond];
    g[label=<END>, shape=box];
    a -> b -> c -> d -> e -> f;
    f -> g[label=<YES>];
    f -> c[label=<NO>];
}
  1. The examinee’s initial proficiency is estimated;
  2. An item is selected based on the current proficiency estimation;
  3. The proficiency is reestimated based on the answers to all items up until now;
  4. If a stopping criterion is met, stop the test. Else go back to step 2.
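
The lifecycle above can be sketched as a single loop. The code below is a minimal standard-library illustration and not how catsim implements it: it assumes a three-parameter model, a fixed initial estimate of 0, maximum information selection, maximum likelihood estimation over a grid and a fixed test length. All function names and the random item bank are hypothetical.

```python
import math
import random


def prob_correct(theta, a, b, c):
    """3PL probability of a correct response (upper asymptote fixed at 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def information(theta, a, b, c):
    """3PL item information at theta."""
    p = prob_correct(theta, a, b, c)
    return a ** 2 * (p - c) ** 2 * (1.0 - p) / ((1.0 - c) ** 2 * p)


def log_likelihood(theta, answered):
    """Log-likelihood of theta given (item, correct) pairs."""
    total = 0.0
    for (a, b, c), correct in answered:
        p = prob_correct(theta, a, b, c)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total


GRID = [x / 20.0 for x in range(-80, 81)]  # candidate estimates in [-4, 4]


def run_cat(true_theta, bank, test_length, rng):
    est_theta = 0.0  # 1. initial proficiency estimate
    used, answered, estimates = set(), [], []
    while len(answered) < test_length:  # 4. stop after a fixed number of items
        # 2. select the unused item with maximum information at est_theta
        idx = max((i for i in range(len(bank)) if i not in used),
                  key=lambda i: information(est_theta, *bank[i]))
        used.add(idx)
        # simulate the examinee's answer as a Bernoulli draw
        correct = rng.random() < prob_correct(true_theta, *bank[idx])
        answered.append((bank[idx], correct))
        # 3. re-estimate proficiency from all answers given so far
        est_theta = max(GRID, key=lambda t: log_likelihood(t, answered))
        estimates.append(est_theta)
    return estimates, sorted(used)


rng = random.Random(42)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0), rng.uniform(0.0, 0.2))
        for _ in range(200)]
estimates, used = run_cat(true_theta=1.0, bank=bank, test_length=20, rng=rng)
print(estimates[-1])  # final proficiency estimate
```

Restricting the estimate to a bounded grid keeps it finite even while all answers so far are correct (or all incorrect), a situation in which the likelihood has no interior maximum.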

There is a considerable amount of literature, by many authors, covering these four phases. In catsim, each phase is separated into its own module, which makes it easy to create simulations combining different methods for each phase. Each module will be explained separately, along with its API.

Initialization

The initialization procedure is done only once during each examinee’s test. In it, the initial value of an examinee’s proficiency \(\hat\theta_0\) is selected. This procedure may be done in a variety of ways: a standard value can be chosen to initialize all examinees (catsim.initialization.FixedInitializer); it can be chosen randomly from a probability distribution (catsim.initialization.RandomInitializer); or \(\hat\theta_0\) can be set to the region of the proficiency scale where the item bank is most informative, among other options.
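
The first two strategies can be sketched as plain functions. This is only an illustration with hypothetical names and default values; it does not reproduce the signatures of catsim’s initializer classes.

```python
import random


def fixed_initial_estimate(value=0.0):
    """Every examinee starts from the same proficiency estimate."""
    return value


def random_initial_estimate(rng, low=-2.0, high=2.0):
    """Each examinee starts from a value drawn from a uniform distribution."""
    return rng.uniform(low, high)


rng = random.Random(7)
fixed_start = fixed_initial_estimate()
random_starts = [random_initial_estimate(rng) for _ in range(5)]
print(fixed_start, random_starts)
```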

In catsim, initialization procedures can be found in the catsim.initialization module.

Item Selection

With a set value for \(\hat\theta\), an item is chosen from the item bank and presented to the examinee. The examinee’s answer, along with the answers to all previous items, is then used to re-estimate \(\hat\theta\).

Item selection methods are diverse. The best-known method is to choose the item that maximizes the gain of information, represented by catsim.selection.MaxInfoSelector. This method, however, has been shown to have some drawbacks, like overusing a few items from the item bank while ignoring items with inferior parameters. In order to correct that, other item selection methods were proposed.
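
Maximum information selection can be sketched as follows, assuming a three-parameter model. The helper `select_max_info` and the tiny item bank are hypothetical and only illustrate the idea; they are not the implementation of catsim.selection.MaxInfoSelector.

```python
import math


def information(theta, a, b, c):
    """3PL item information at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * (p - c) ** 2 * (1.0 - p) / ((1.0 - c) ** 2 * p)


def select_max_info(bank, administered, est_theta):
    """Index of the not yet administered item with maximum information."""
    already_used = set(administered)
    candidates = [i for i in range(len(bank)) if i not in already_used]
    return max(candidates, key=lambda i: information(est_theta, *bank[i]))


# Hypothetical bank of (a, b, c) tuples.
bank = [(1.0, -2.0, 0.2), (1.8, 0.1, 0.1), (0.7, 0.0, 0.0), (1.6, 1.5, 0.15)]

first = select_max_info(bank, [], est_theta=0.0)
second = select_max_info(bank, [first], est_theta=0.0)
print(first, second)  # 1 2
```

The highly discriminating item near \(\hat\theta\) is picked first for every examinee with a similar estimate, which is how a few items end up overexposed.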

In catsim, an examinee’s response to a given item is simulated by sampling a binary value from the Bernoulli distribution, in which the value of \(p\) is given by the IRT logistic model characteristic function (catsim.irt.icc()), given by:

\[P(X_i = 1| \theta) = c_i + \frac{d_i-c_i}{1+ e^{-a_i(\theta-b_i)}}\]
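
A response simulated in this way can be sketched with the standard library as below. The function names are hypothetical and the code does not call catsim; it only shows a Bernoulli draw whose success probability comes from the logistic function.

```python
import math
import random


def prob_correct(theta, a, b, c=0.0, d=1.0):
    """Logistic probability of a correct response."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_response(true_theta, item, rng):
    """Bernoulli draw: True with probability P(X = 1 | theta)."""
    return rng.random() < prob_correct(true_theta, *item)


rng = random.Random(0)
item = (1.2, 0.0, 0.2, 1.0)  # hypothetical (a, b, c, d)

responses = [simulate_response(0.0, item, rng) for _ in range(10000)]
observed = sum(responses) / len(responses)
expected = prob_correct(0.0, *item)
print(observed, expected)  # the observed proportion should be close to 0.6
```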

In catsim, item selection procedures can be found in the catsim.selection module.

Proficiency Estimation

Proficiency estimation occurs whenever an examinee answers a new item. Given a dichotomous (binary) response vector and the parameters of the corresponding items that were answered, it is the job of an estimator to return a new value for the examinee’s \(\hat\theta\). This value reflects the examinee’s proficiency, given his or her answers up until that point of the test.

In Python, an example of a list that may be used as a valid dichotomous response vector is as follows:

>>> response_vector = [1,1,1,0,1,1,0,1,0,0,1,0,0,0,1,0]

Estimation techniques are generally separated into maximum-likelihood estimation procedures (whose job is to return the \(\hat\theta\) value that maximizes the log-likelihood function, presented in catsim.irt.log_likelihood()) and Bayesian estimation procedures, which tend to use a priori information about the distribution of examinees’ proficiencies to estimate new values for them.
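
The difference between the two families can be illustrated with a small sketch. It assumes a three-parameter model, a grid search for the maximum-likelihood estimate and, as one possible Bayesian procedure, an expected a posteriori (EAP) estimate with a standard normal prior. The function names and items are hypothetical and none of this reproduces catsim’s estimators.

```python
import math


def prob_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of theta for a dichotomous response vector."""
    total = 0.0
    for (a, b, c), correct in zip(items, responses):
        p = prob_correct(theta, a, b, c)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total


GRID = [x / 100.0 for x in range(-400, 401)]  # candidate values in [-4, 4]


def ml_estimate(items, responses):
    """Grid value that maximizes the log-likelihood."""
    return max(GRID, key=lambda t: log_likelihood(t, items, responses))


def eap_estimate(items, responses):
    """Posterior mean of theta under a standard normal prior."""
    weights = [math.exp(log_likelihood(t, items, responses) - 0.5 * t * t)
               for t in GRID]
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)


items = [(1.0, -1.0, 0.0), (1.2, 0.0, 0.0), (0.9, 1.0, 0.0), (1.1, 0.5, 0.0)]

ml_mixed = ml_estimate(items, [1, 1, 0, 0])
ml_all_correct = ml_estimate(items, [1, 1, 1, 1])
eap_all_correct = eap_estimate(items, [1, 1, 1, 1])
print(ml_mixed, ml_all_correct, eap_all_correct)
```

When every answer is correct, the likelihood keeps increasing with \(\theta\), so the maximum-likelihood estimate lands on the edge of the grid, while the prior keeps the Bayesian estimate at a moderate value.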

In catsim, proficiency estimation procedures can be found in the catsim.estimation module.

Stopping Criterion

Since items in a CAT are selected on-the-fly, a stopping criterion must be chosen such that, when achieved, no new items are presented to the examinee and the test is deemed finished. These stopping criteria might be achieved when the test reaches a fixed number of items or when the standard error of estimation (catsim.irt.see()) falls below a given threshold, among others. These two stopping criteria are implemented as catsim.stopping.MaxItemStopper and catsim.stopping.MinErrorStopper, respectively.
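
Both criteria can be sketched as simple predicates. The functions below are hypothetical illustrations under a three-parameter model and do not mirror the signatures of catsim’s stopper classes.

```python
import math


def information(theta, a, b, c):
    """3PL item information at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * (p - c) ** 2 * (1.0 - p) / ((1.0 - c) ** 2 * p)


def max_item_stop(administered, max_items):
    """Stop once a fixed number of items has been administered."""
    return len(administered) >= max_items


def min_error_stop(administered, est_theta, min_error):
    """Stop once the standard error of estimation reaches the threshold."""
    test_info = sum(information(est_theta, *item) for item in administered)
    return math.sqrt(1.0 / test_info) <= min_error


# Identical hypothetical items (a = 1, b = 0, c = 0): each adds 0.25 of
# information at theta = 0, so the SEE after n items is 2 / sqrt(n).
item = (1.0, 0.0, 0.0)
print(min_error_stop([item] * 15, 0.0, 0.5))  # False
print(min_error_stop([item] * 16, 0.0, 0.5))  # True
print(max_item_stop([item] * 16, 20))         # False
```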

In catsim, test stopping criteria can be found in the catsim.stopping module.

Package architecture

catsim was built using an object-oriented architecture, an uncommon choice for scientific packages in Python, but one which introduces many benefits for its maintenance and expansion. As explained in previous sections, each phase in the CAT lifecycle is represented by a different module in the package. Additionally, each module involved in the CAT lifecycle has a base abstract class, which must be implemented if a new methodology is to be presented to that module’s respective phase. This way, new users can implement their own methods for each phase of the CAT lifecycle, or even an entirely new CAT lifecycle, while still using catsim and its features to simulate tests, plot results etc. Modules and their corresponding abstract classes are presented in Table 1.

Table 1 Modules and their corresponding abstract classes

Module                  Abstract class
catsim.initialization   catsim.initialization.Initializer
catsim.selection        catsim.selection.Selector
catsim.estimation       catsim.estimation.Estimator
catsim.stopping         catsim.stopping.Stopper
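
The extension pattern described above can be illustrated with Python’s abc module. The classes below are hypothetical stand-ins written for this sketch: `SketchSelector` plays the role of a phase’s abstract class and is not catsim.selection.Selector, whose exact interface should be checked in the API documentation.

```python
from abc import ABC, abstractmethod


class SketchSelector(ABC):
    """Hypothetical stand-in for the abstract class of the selection phase."""

    @abstractmethod
    def select(self, items, administered_items, est_theta):
        """Return the index of the next item to administer."""


class ClosestDifficultySelector(SketchSelector):
    """Picks the unused item whose difficulty is closest to est_theta."""

    def select(self, items, administered_items, est_theta):
        candidates = [i for i in range(len(items))
                      if i not in administered_items]
        # column 1 of each item holds the difficulty parameter b
        return min(candidates, key=lambda i: abs(items[i][1] - est_theta))


# Hypothetical items as (a, b, c) tuples.
items = [(1.0, -1.5, 0.2), (1.3, 0.2, 0.1), (0.9, 0.4, 0.0), (1.1, 2.0, 0.2)]
selector = ClosestDifficultySelector()
print(selector.select(items, [], 0.25))   # 1
print(selector.select(items, [1], 0.25))  # 2
```

Because the base class declares `select` as abstract, a subclass that forgets to implement it cannot be instantiated, which surfaces the mistake before any simulation runs.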

Examples

catsim components can be used in one of two ways: as part of a simulation or autonomously. This section will present examples of both.

First, import the package:

>>> # numpy is used below to generate examinees and to sample item indexes
>>> import numpy
>>> # this function generates an item bank, in case the user cannot provide one
>>> from catsim.cat import generate_item_bank
>>> # simulation module contains the Simulator and all abstract classes
>>> from catsim.simulation import *
>>> # initialization module contains different initial proficiency estimation strategies
>>> from catsim.initialization import *
>>> # selection module contains different item selection strategies
>>> from catsim.selection import *
>>> # estimation module contains different proficiency estimation methods
>>> from catsim.estimation import *
>>> # stopping module contains different stopping criteria for the CAT
>>> from catsim.stopping import *
>>> import catsim.plot as catplot

Running simulations

>>> # generate an item bank
>>> bank_size = 5000
>>> items = generate_item_bank(bank_size)
>>>
>>> print('Starting simulation 1...')
>>> # simulate 10 examinees taking a CAT, given the generated item bank,
>>> # a random proficiency initializer, maximum information item selector,
>>> # hill climbing proficiency estimator and stopping criterion of 20 items
>>> s = Simulator(items, 10)
>>> s.simulate(RandomInitializer(), MaxInfoSelector(), HillClimbingEstimator(), MaxItemStopper(20), verbose=True)
>>> catplot.test_progress(simulator=s,index=0)
>>> print('Bias:', s.bias)
>>> print('Mean squared error:', s.mse)
>>> print('Root mean squared error:', s.rmse)
>>>
>>> examinee_index=0
>>> print('Accessing examinee', examinee_index, 'results...')
>>> print('    True proficiency:', s.examinees[examinee_index])
>>> print('    Items administered:', s.administered_items[examinee_index])
>>> print('    Responses:', s.response_vectors[examinee_index])
>>> print('    Proficiency estimation during each step of the test:', s.estimations[examinee_index])
>>>
>>> print('Starting simulation 2...')
>>> # examinees can also be passed as 1D numpy arrays or Python lists containing
>>> # their proficiency values
>>> # this example uses a stopping criterion of minimum error
>>> examinees = numpy.random.normal(size=10)
>>> s = Simulator(items, examinees)
>>> s.simulate(RandomInitializer(), MaxInfoSelector(), HillClimbingEstimator(), MinErrorStopper(.3), verbose=True)
>>>
>>> catplot.test_progress(simulator=s,index=0, info=True)
>>>
>>> print('Starting simulation 3...')
>>> # catsim can also simulate linear (non-adaptive) tests by using a linear item selector
>>> s = Simulator(items, 10)
>>> indexes = numpy.random.choice(items.shape[0], 50, replace=False)
>>> s.simulate(RandomInitializer(), LinearSelector(indexes), HillClimbingEstimator(), MaxItemStopper(50), verbose=True)
>>> catplot.test_progress(simulator=s,index=0, info=True, see=True)

Autonomous usage

>>> # generating an item bank
>>> print('Generating item bank...')
>>> bank_size = 5000
>>> items = generate_item_bank(bank_size)
>>>
>>> # creating dummy response patterns and selecting item indexes to pass as administered items
>>> print('Creating dummy examinee data...')
>>> responses = [True, True, False, False]
>>> administered_items = [1435, 3221, 17, 881]
>>>
>>> print('Creating simulation components...')
>>> # create a random proficiency initializer
>>> initializer = RandomInitializer()
>>>
>>> # create a maximum information item selector
>>> selector = MaxInfoSelector()
>>>
>>> # create a hill climbing proficiency estimator
>>> estimator = HillClimbingEstimator()
>>>
>>> # create a stopping criterion that will make tests stop after 20 items
>>> stopper = MaxItemStopper(20)
>>>
>>> # manually initialize an examinee's proficiency as a float variable
>>> est_theta = initializer.initialize()
>>> print('Examinee initial proficiency:', est_theta)
>>>
>>> # get an estimated theta, given the answers to the dummy items
>>> new_theta = estimator.estimate(items=items, administered_items=administered_items, response_vector=responses, est_theta=est_theta)
>>> print('Estimated proficiency, given answered items:', new_theta)
>>>
>>> # get the index of the next item to be administered to the current examinee, given the answers they have already given to the previous dummy items
>>> item_index = selector.select(items=items, administered_items=administered_items, est_theta=est_theta)
>>> print('Next item to be administered:', item_index)
>>>
>>> # get a boolean value pointing out whether the test should stop
>>> _stop = stopper.stop(administered_items=items[administered_items], theta=est_theta)
>>> print('Should the test be stopped:', _stop)