Item response theory

In psychometrics, item response theory (IRT) (also known as latent trait theory, strong true score theory, or modern mental test theory) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics.[1] Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments".[2] By contrast, item response theory treats the difficulty of each item (the item characteristic curves, or ICCs) as information to be incorporated in scaling items.

It is based on the application of related mathematical models to testing data. Because it is often regarded as superior to classical test theory,[3] it is the preferred method for developing scales in the United States,[citation needed] especially when optimal decisions are demanded, as in so-called high-stakes tests, e.g., the Graduate Record Examination (GRE) and Graduate Management Admission Test (GMAT).

The name item response theory is due to the focus of the theory on the item, as opposed to the test-level focus of classical test theory. Thus IRT models the response of each examinee of a given ability to each item in the test. The term item is generic, covering all kinds of informative items. They might be multiple choice questions that have incorrect and correct responses, but are also commonly statements on questionnaires that allow respondents to indicate level of agreement (a rating or Likert scale), or patient symptoms scored as present/absent, or diagnostic information in complex systems.

IRT is based on the idea that the probability of a correct/keyed response to an item is a mathematical function of person and item parameters. (The expression "a mathematical function of person and item parameters" is analogous to Lewin's equation, B = f(P, E), which asserts that behavior is a function of the person in their environment.) The person parameter is construed as (usually) a single latent trait or dimension. Examples include general intelligence or the strength of an attitude. Parameters on which items are characterized include their difficulty (known as "location" for their location on the difficulty range); discrimination (slope or correlation), representing how steeply the rate of success of individuals varies with their ability; and a pseudoguessing parameter, characterising the (lower) asymptote at which even the least able persons will score due to guessing (for instance, 25% by pure chance on a multiple-choice item with four possible responses).

In the same manner, IRT can be used to measure human behavior in online social networks. The views expressed by different people can be aggregated and studied using IRT. Its use in classifying information as misinformation or true information has also been evaluated.

Overview

The concept of the item response function was around before 1950. The pioneering work of IRT as a theory occurred during the 1950s and 1960s. Three of the pioneers were the Educational Testing Service psychometrician Frederic M. Lord,[4] the Danish mathematician Georg Rasch, and the Austrian sociologist Paul Lazarsfeld, who pursued parallel research independently. Key figures who furthered the progress of IRT include Benjamin Drake Wright and David Andrich. IRT did not become widely used until the late 1970s and 1980s, when practitioners were increasingly persuaded of the usefulness and advantages of IRT on the one hand, and personal computers gave many researchers access to the necessary computing power on the other. In the 1990s, Margaret Wu developed two item response software programs that analyse PISA and TIMSS data: ACER ConQuest (1998) and the R package TAM (2010).

Among other things, the purpose of IRT is to provide a framework for evaluating how well assessments work, and how well individual items on assessments work. The most common application of IRT is in education, where psychometricians use it for developing and designing exams, maintaining banks of items for exams, and equating the difficulties of items for successive versions of exams (for example, to allow comparisons between results over time).[5]

IRT models are often referred to as latent trait models. The term latent is used to emphasize that discrete item responses are taken to be observable manifestations of hypothesized traits, constructs, or attributes, not directly observed, but which must be inferred from the manifest responses. Latent trait models were developed in the field of sociology, but are virtually identical to IRT models.

IRT is generally claimed as an improvement over classical test theory (CTT). For tasks that can be accomplished using CTT, IRT generally brings greater flexibility and provides more sophisticated information. Some applications, such as computerized adaptive testing, are enabled by IRT and cannot reasonably be performed using only classical test theory. Another advantage of IRT over CTT is that the more sophisticated information IRT provides allows a researcher to improve the reliability of an assessment.

IRT entails three assumptions:

  1. A unidimensional trait denoted by $\theta$;
  2. Local independence of items;
  3. The response of a person to an item can be modeled by a mathematical item response function (IRF).

The trait is further assumed to be measurable on a scale (the mere existence of a test assumes this), typically set to a standard scale with a mean of 0.0 and a standard deviation of 1.0. Unidimensionality should be interpreted as homogeneity, a quality that should be defined or empirically demonstrated in relation to a given purpose or use, but not a quantity that can be measured. 'Local independence' means (a) that the chance of one item being used is not related to any other item(s) being used and (b) that response to an item is each and every test-taker's independent decision, that is, there is no cheating or pair or group work. The topic of dimensionality is often investigated with factor analysis, while the IRF is the basic building block of IRT and is the center of much of the research and literature.
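
These assumptions have a direct mathematical consequence: under local independence, the probability of a whole response pattern is simply the product of the item probabilities. For a person with trait level $\theta$ answering $n$ dichotomous items with responses $u_1, \ldots, u_n$ (1 = correct, 0 = incorrect) and item response functions $P_i(\theta)$:

$$P(u_1, \ldots, u_n \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i}\,\bigl[1 - P_i(\theta)\bigr]^{1 - u_i}$$

This factorization is what makes likelihood-based estimation of person and item parameters tractable.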

The item response function

The IRF gives the probability that a person with a given ability level will answer correctly. Persons with lower ability have less of a chance, while persons with high ability are very likely to answer correctly; for example, students with higher math ability are more likely to get a math item correct. The exact value of the probability depends, in addition to ability, on a set of item parameters for the IRF.

Three parameter logistic model

Figure 1: Example of 3PL IRF, with dotted lines overlaid to demonstrate parameters.

For example, in the three parameter logistic model (3PL), the probability of a correct response to a dichotomous item i, usually a multiple-choice question, is:

$$p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}$$

where $\theta$ indicates that the person's abilities are modeled as a sample from a normal distribution for the purpose of estimating the item parameters. After the item parameters have been estimated, the abilities of individual people are estimated for reporting purposes. $a_i$, $b_i$, and $c_i$ are the item parameters. The item parameters determine the shape of the IRF. Figure 1 depicts an ideal 3PL ICC.

The item parameters can be interpreted as changing the shape of the standard logistic function:

$$P(t) = \frac{1}{1 + e^{-t}}$$

In brief, the parameters are interpreted as follows (dropping subscripts for legibility); b is most basic, hence listed first:

  • b – difficulty, item location: the half-way point between $c$ (min) and 1 (max), also where the slope is maximized.
  • a – discrimination, scale, slope: the maximum slope $P'(b) = a \cdot (1 - c)/4$.
  • c – pseudo-guessing, chance, asymptotic minimum $P(-\infty) = c$.

If $c = 0$, then these simplify to $P(b) = 1/2$ and $P'(b) = a/4$, meaning that b equals the 50% success level (difficulty), and a (divided by four) is the maximum slope (discrimination), which occurs at the 50% success level. Further, the logit (log odds) of a correct response is $a(\theta - b)$ (assuming $c = 0$): in particular, if ability θ equals difficulty b, there are even odds (1:1, so logit 0) of a correct answer; the greater the ability is above (or below) the difficulty, the more (or less) likely a correct response, with discrimination a determining how rapidly the odds increase or decrease with ability.
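
For instance, with $c = 0$, $a = 1$, and an ability one unit above the item's difficulty ($\theta - b = 1$), the logit is $a(\theta - b) = 1$, so

$$\text{odds} = e^{1} \approx 2.72, \qquad P = \frac{1}{1 + e^{-1}} \approx 0.73.$$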

In other words, the standard logistic function has an asymptotic minimum of 0 ($c = 0$), is centered around 0 ($b = 0$, $P(0) = 1/2$), and has maximum slope $P'(0) = 1/4$. The parameter $a$ stretches the horizontal scale, the parameter $b$ shifts the horizontal scale, and the parameter $c$ compresses the vertical scale from $[0, 1]$ to $[c, 1]$. This is elaborated below.

The parameter $b_i$ represents the item location which, in the case of attainment testing, is referred to as the item difficulty. It is the point on $\theta$ where the IRF has its maximum slope, and where the value is half-way between the minimum value of $c_i$ and the maximum value of 1. The example item is of medium difficulty since $b_i = 0.0$, which is near the center of the distribution. Note that this model scales the item's difficulty and the person's trait onto the same continuum. Thus, it is valid to talk about an item being about as hard as Person A's trait level or of a person's trait level being about the same as Item Y's difficulty, in the sense that successful performance of the task involved with an item reflects a specific level of ability.

The item parameter $a_i$ represents the discrimination of the item: that is, the degree to which the item discriminates between persons in different regions on the latent continuum. This parameter characterizes the slope of the IRF where the slope is at its maximum. The example item has $a_i = 1.0$, which discriminates fairly well; persons with low ability do indeed have a much smaller chance of correctly responding than persons of higher ability. This discrimination parameter corresponds to the weighting coefficient of the respective item or indicator in a standard weighted linear (ordinary least squares, OLS) regression and hence can be used to create a weighted index of indicators for unsupervised measurement of an underlying latent concept.

For items such as multiple choice items, the parameter $c_i$ is used in an attempt to account for the effects of guessing on the probability of a correct response. It indicates the probability that very low ability individuals will get this item correct by chance, mathematically represented as a lower asymptote. A four-option multiple choice item might have an IRF like the example item: there is a 1/4 chance of an extremely low ability candidate guessing the correct answer, so $c_i$ would be approximately 0.25. This approach assumes that all options are equally plausible; if one option made no sense, even the lowest ability person would be able to discard it, so IRT parameter estimation methods take this into account and estimate $c_i$ based on the observed data.[6]
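
Putting the three parameters together, a minimal Python sketch of the 3PL IRF (an illustrative implementation, not taken from any particular IRT package) evaluated at the example item's values ($a = 1.0$, $b = 0.0$, $c \approx 0.25$) might look like this:

```python
import math

def irf_3pl(theta, a, b, c):
    """3PL item response function: probability of a correct response
    given ability theta and item parameters a (discrimination),
    b (difficulty/location) and c (pseudo-guessing lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Example item from the text: medium difficulty, good discrimination,
# four-option multiple choice (guessing asymptote of about 0.25).
a, b, c = 1.0, 0.0, 0.25

for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta = {theta:+.1f}  P(correct) = {irf_3pl(theta, a, b, c):.3f}")
```

At very low ability the probability approaches the guessing floor of 0.25 rather than 0, and at $\theta = b$ it equals $(1 + c)/2 \approx 0.63$, half-way between that floor and 1.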

IRT models

Broadly speaking, IRT models can be divided into two families: unidimensional and multidimensional. Unidimensional models require a single trait (ability) dimension $\theta$. Multidimensional IRT models model response data hypothesized to arise from multiple traits. However, because of the greatly increased complexity, the majority of IRT research and applications utilize a unidimensional model.

IRT models can also be categorized based on the number of scored responses. The typical multiple choice item is dichotomous; even though there may be four or five options, it is still scored only as correct/incorrect (right/wrong). Another class of models applies to polytomous outcomes, where each response has a different score value.[7][8] A common example of this is Likert-type items, e.g., "Rate on a scale of 1 to 5." Another example is partial-credit scoring, to which models like the Polytomous Rasch model may be applied.
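
As an illustration of what a polytomous model looks like, one common formulation (the partial credit model, a polytomous Rasch model; the category thresholds $\delta_{ik}$ used here are standard notation for that model rather than parameters defined earlier in this article) gives the probability that a person with ability $\theta$ scores $x$ on an item with maximum score $m_i$ as:

$$P(X_i = x \mid \theta) = \frac{\exp\!\left(\sum_{k=1}^{x}(\theta - \delta_{ik})\right)}{\sum_{j=0}^{m_i}\exp\!\left(\sum_{k=1}^{j}(\theta - \delta_{ik})\right)}, \qquad x = 0, 1, \ldots, m_i,$$

where empty sums (for $x = 0$ or $j = 0$) are taken to be zero. With $m_i = 1$ this reduces to the dichotomous Rasch model.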


Number of IRT parameters

Dichotomous IRT models are described by the number of parameters they make use of.[9] The 3PL is named so because it employs three item parameters. The two-parameter model (2PL) assumes that the data have no guessing, but that items can vary in terms of location ($b_i$) and discrimination ($a_i$). The one-parameter model (1PL) assumes that guessing is a part of the ability and that all items that fit the model have equivalent discriminations, so that items are only described by a single parameter ($b_i$). This results in one-parameter models having the property of specific objectivity, meaning that the rank of the item difficulty is the same for all respondents independent of ability, and that the rank of the person ability is the same for items independently of difficulty. Thus, one-parameter models are sample independent, a property that does not hold for two-parameter and three-parameter models. Additionally, there is theoretically a four-parameter model (4PL), with an upper asymptote, denoted by $d_i$, where $1 - c_i$ in the 3PL is replaced by $d_i - c_i$. However, this is rarely used. Note that the alphabetical order of the item parameters does not match their practical or psychometric importance; the location/difficulty ($b_i$) parameter is clearly most important because it is included in all three models. The 1PL uses only $b_i$, the 2PL uses $b_i$ and $a_i$, and the 3PL adds $c_i$.
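
Written out in the notation of the 3PL section above, these models form a nested family (the 4PL form shown here follows the description just given, with $d_i$ as the upper asymptote):

$$\begin{aligned}
\text{1PL:}\quad & p_i(\theta) = \frac{1}{1 + e^{-(\theta - b_i)}}\\[4pt]
\text{2PL:}\quad & p_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}\\[4pt]
\text{3PL:}\quad & p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}\\[4pt]
\text{4PL:}\quad & p_i(\theta) = c_i + \frac{d_i - c_i}{1 + e^{-a_i(\theta - b_i)}}
\end{aligned}$$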







