Professional Testing, Inc.

From the Item Bank

The Professional Testing Blog


Alternatives to Traditional Form Assembly and Delivery: An Overview of CBT, LOFT and CAT

May 5, 2017

by Joy Matthews-López, PhD, and Vince Maurelli, MSc

How forms are assembled and delivered is an important component of an exam program's design. These decisions drive operational procedures as well as overarching policies, such as immediate vs. delayed score reporting, publication schedules, repeater policies for candidates who need to retest, necessary analyses, security policies and procedures, item types used in the program, and score interpretation and use.

The form assembly process requires a list of items that are eligible for inclusion on the form, a current blueprint, and a list of statistical and/or psychometric targets, and it may or may not rely on technology. Item eligibility may depend on an item's usage history, exposure rate, and status (active vs. pilot vs. draft). Statistical/psychometric targets may include form-level difficulty, form-level reliability, adherence to equating requirements, or fit with test information or characteristic curves.
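To make the eligibility step concrete, here is a minimal sketch of screening a pool before assembly. The record fields (status, exposure_rate, content_area) and the exposure cap are hypothetical illustrations, not details from any particular program:

```python
# Hypothetical item records; a real bank would carry many more fields.
items = [
    {"id": 1, "status": "active", "exposure_rate": 0.12, "content_area": "A"},
    {"id": 2, "status": "pilot", "exposure_rate": 0.00, "content_area": "A"},
    {"id": 3, "status": "active", "exposure_rate": 0.45, "content_area": "B"},
]

MAX_EXPOSURE = 0.30  # assumed program-specific exposure cap

# Keep only active items whose exposure rate is under the cap.
eligible = [
    it for it in items
    if it["status"] == "active" and it["exposure_rate"] <= MAX_EXPOSURE
]
# Item 2 fails the status check; item 3 exceeds the exposure cap.
```

The resulting eligible list is what the blueprint and statistical targets are then applied against.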

Fixed forms, also referred to as linear forms, are called such because the same set of questions is administered to all examinees who are given that particular form. Regardless of delivery mode, all examinees see the same items (a fixed set of items). If the delivery mode is paper-based (PBT), then items appear in the same order for everyone who receives that form. If the delivery mode is computer-based (CBT), then examinees still see the same items, but those items (and/or options) may or may not be in the same order. One added benefit of CBT delivery is the potential to use alternative item types, such as hot spot or drag-and-place; for obvious reasons, most alternative item types cannot be used in PBT delivery. The take-away point is that linear forms can be delivered via paper or computer. In other words, assembly of linear forms is independent of delivery.

It makes sense to frame forms assembly in terms of available technology. Linear forms can be assembled one at a time or hundreds at a time, depending on the type of technology available to aid the assembly process, the constraints applied to the exams, such as upper and lower bounds of item use or inclusion targets, and the depth and diversity of the item pool. The more highly constrained the assembly model, the more complicated the assembly process.

It also makes sense to view the assembly process in terms of non-adaptive vs. adaptive. With non-adaptive assembly, forms can be built well in advance of their administration to candidates, or they can be built in real time, that is, during an actual test session. Either way, non-adaptive assembly does not take examinees' responses into consideration when selecting the remaining items for a form. A popular method of assembling linear forms in real time is commonly referred to as linear-on-the-fly testing, or LOFT. In its truest sense, LOFT requires computer delivery of forms.
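A toy sketch of LOFT-style assembly may help: each candidate receives a random draw from the eligible pool that still satisfies a content blueprint. The field names, pool, and blueprint below are illustrative assumptions, not a production algorithm:

```python
import random

def assemble_loft_form(pool, blueprint, rng):
    """Draw a form at random while meeting per-area blueprint counts."""
    form = []
    for area, count in blueprint.items():
        candidates = [it for it in pool if it["content_area"] == area]
        form.extend(rng.sample(candidates, count))  # random draw per area
    return form

# Hypothetical ten-item pool split across two content areas.
pool = [{"id": i, "content_area": "A" if i < 6 else "B"} for i in range(10)]
blueprint = {"A": 3, "B": 2}

form = assemble_loft_form(pool, blueprint, random.Random(1))
# Every form meets the blueprint (3 area-A items, 2 area-B items), but
# different candidates will generally see different item sets.
```

Note that the candidate's answers play no role in the draw, which is what keeps LOFT on the non-adaptive side of the divide; operational LOFT also enforces statistical targets and exposure controls omitted here.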

LOFT has numerous advantages over traditional, paper-based (linear) forms. First and foremost, LOFT produces unique (or nearly unique) forms, which can address certain types of security concerns because candidates see different sets of items. And because LOFT is used in a computer-based delivery setting, candidate response times can be captured, which is extremely useful when analyzing test sessions for irregularities. Second, LOFT is often associated with the use of alternative item types. This is because computer delivery expands the variety of item types that can be used. Paper-based tests are traditionally limited to multiple-choice item types (or matching or true/false), but CBT delivery allows for technology-enhanced item types, such as drag-and-place, hot spot, or items that include embedded sound, video, or interactive graphics. Third, LOFT forms are built to meet blueprint specifications and pre-defined psychometric targets (test characteristic curve or test information curve) while controlling for item exposure, and they are pre-equated by design. LOFT can be thought of as automated test assembly (ATA) on steroids.

Just as there are advantages to using LOFT, there are some limitations. LOFT-based forms are usually the same length as their PBT counterparts, and their form-level reliability is comparable, so LOFT does not by itself yield shorter or more reliable tests. In addition, for even moderately constrained programs, LOFT may not be well suited to high volumes. This is primarily attributed to the complex assembly techniques that are needed to manage pressure on the item pool and to control item exposure rates.

Another form of technology-enhanced assembly is computer adaptive testing (CAT). CAT differs from LOFT in that it assembles forms dynamically; that is, the item selection algorithm factors candidate behavior into the selection process. If a candidate answers incorrectly, the next question may be slightly "easier," and if the candidate answers correctly, the next question may be slightly "harder." This Socratic form of assembly is dynamic and adaptive, and in theory it yields highly reliable scores across the entire ability continuum. In a non-adaptive setting, reliability is computed at the form level, and items are usually selected to maximize information at or near the cut score. The farther a score is from the cut point, the less reliable it is. In a criterion-referenced test, this isn't an issue: scores that are substantially below the cut can safely be assumed to belong to "true non-masters," and scores that are substantially above the cut can be assumed to belong to "true masters." So even though scores are not equally reliable across the ability continuum, we can still interpret and use them to make pass/fail decisions. In a CAT, by contrast, all scores are highly reliable, regardless of their proximity to the cut. This means that scores at both the low and high ends of the score range have great precision.
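The adapt-as-you-go loop can be sketched as a toy fixed-length CAT under the Rasch (one-parameter IRT) model. The item pool, the simulated examinee, and the crude step-size ability update are all assumptions for illustration; an operational CAT would use maximum-likelihood or Bayesian scoring plus exposure and content controls:

```python
import math
import random

def p_correct(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def run_cat(true_theta, difficulties, test_length, rng):
    theta, used = 0.0, set()
    for step in range(test_length):
        # Pick the unused item whose difficulty is closest to the current
        # ability estimate (maximum information under the Rasch model).
        item = min((i for i in range(len(difficulties)) if i not in used),
                   key=lambda i: abs(difficulties[i] - theta))
        used.add(item)
        correct = rng.random() < p_correct(true_theta, difficulties[item])
        # Crude adaptive update: step up after a correct answer, down after
        # an incorrect one, with a shrinking step size as the test proceeds.
        theta += (0.8 if correct else -0.8) / (1 + 0.25 * step)
    return theta

difficulties = [-3 + 6 * i / 99 for i in range(100)]  # pool spanning -3..+3
estimate = run_cat(true_theta=1.0, difficulties=difficulties,
                   test_length=20, rng=random.Random(42))
```

The key point the sketch makes concrete is that the next item depends on the running estimate, so two candidates with different response patterns see different items.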

As with all methodologies, CAT has certain advantages as well as certain disadvantages. It is without dispute that CAT is designed to produce highly reliable scores, and to do so in a very efficient manner. This is particularly true of a variable-length CAT. But whether a CAT is variable length or fixed length, there are some practical benefits to this assembly (and delivery) method. There are no "forms" per se in CAT, so the concept of form-level reliability differs from that of PBT or LOFT; instead of talking about form-level reliability, it is more appropriate to look at score-level reliability. CAT scores are pre-equated, so traditional equating of forms is not necessary. The mathematical models behind CAT are rooted in item response theory (IRT). As such, items are calibrated and placed on a common scale that describes difficulty and discrimination. One of the nice features of using IRT is that forms can be assembled to be pre-equated (which is also necessary for LOFT and can be useful for automated test assembly as well).

As mentioned above, CAT exams are either fixed length or variable length. Fixed-length CATs contain a fixed number of items, so all candidates receive a test of the same length. A variable-length CAT is just that: variable in length. The stopping criterion is set in advance, along with a lower bound on the number of items that must be administered. Though variable-length CATs are more efficient (high reliability with fewer questions, which can translate into less seat time, less fatigue, and greater security), they can be hard to explain, particularly to examinees who want to know how scores can be comparable if different candidates receive different items and different numbers of items.

There are also disadvantages, or challenges, to using CAT. First, a large, calibrated item pool is needed to support the item selection algorithm. The size of a CAT pool usually needs to be about eight times the length of the test; so, if a fixed-length CAT is 100 items long, then the pool should have around 800 calibrated, active items. Second, building and maintaining large item banks can be expensive and requires well-thought-out pilot plans so that the pool stays fresh. Third, CAT programs require constant psychometric supervision: item exposure rates must be carefully controlled in order to protect the integrity of the pool, and the item scale must be checked for drift.
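The eight-to-one rule of thumb above can be expressed as a quick sizing check; the ratio is a heuristic, not a universal requirement:

```python
def required_pool_size(test_length, ratio=8):
    """Rule-of-thumb CAT pool size: roughly ratio x test length."""
    return test_length * ratio

required_pool_size(100)  # 800 calibrated, active items for a 100-item CAT
```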

From a user’s perspective, candidates should not be able to tell the difference between a linear CBT exam and a LOFT-based exam. There is, however, a noticeable difference with CAT exams: CAT does not allow candidates to skip a question and return to it later in the test session. Instead, CAT requires candidates to answer every question in the order in which it is given, and there is no opportunity to revisit previously answered questions and change a response. This is because of the adaptive nature of the assembly: candidate answers drive the item selection process.

Regardless of how forms are assembled (manually or automated) or how they are delivered (PBT, CBT, LOFT, or CAT), these choices are but means to an end. Does the test measure what it was designed to measure? Are the resulting scores interpretable and defensible? Was the process fair? Did the test meet the needs of the sponsoring agency? If the answers to these questions are yes, then the raison d'être of testing has been met and the results can be used meaningfully.

 
