Handbook of Item Response Theory (3-volume series)

Author: Wim van der Linden (editor)

Recommended by

"Item response theory, or IRT, is a framework that people who are statisticians (we call them ‘psychometricians’ in educational and psychological assessment) use a lot. It is, in fact, the predominant framework for taking data from assessments, summarising that data, and reporting scores out to learners, and it is a very powerful framework. It is also a very large framework that subsumes a lot of different models. What Wim van der Linden, who is the editor of this three-volume series, has done that I find so remarkable is that he has updated a single book from several years ago and really brought together a large number of these models under a single umbrella in a coherent and principled fashion. If you are someone who needs to learn about the range of models that exist out there, what they offer in terms of how they summarise data, how you can make inferences with them, what we currently know about how they should be estimated statistically, how their fit to data should be evaluated, and so on, then you can really get a wonderful sense of the entire space by looking across these three volumes. As a reference work to have on the shelf, it is really indispensable for anybody who studies these kinds of models. And if you work in educational assessment, and you are somebody who works with quantitative data, you need to learn about these models. To me, it is a must-have volume.
In addition, Wim van der Linden is one of the smartest people alive working in the psychometric field. As I said, he has been very principled and rigorous and detailed in editing these books, so that sets of chapters have similar kinds of structures and give a similar balance to the different kinds of topics. I admire that kind of editorial and contributory work as someone who has, myself, written and edited three books. I know how much hard work it is to pull together many people with different styles and different personalities and different ways of expressing their ideas. So I admire this book not only for its content, but also for what it represents as an editorial effort.
It is difficult to put a number on it. There are currently over a thousand members of the National Council on Measurement in Education, for example, which is one of the larger associations that has historically existed. Nowadays, one of the challenges is that the boundaries of the field are becoming fuzzy. When you think about educational assessment in the way I talked about it earlier, you also have to think about people who are in the learning analytics, data science, and educational data mining fields, for instance. These are often people who have an interdisciplinary training, many with a strong emphasis in computer science. The numbers are just mushrooming from year to year as these kinds of applications get larger, and now we have areas like ‘computational psychometrics’ and very computationally oriented psychometrics programs like the one at the business school in Cambridge.
You also have a large number of different companies and start-ups concerned with educational assessment nowadays. You have companies like ETS, which are historically relatively well established and therefore ‘robust’ in some important ways. For example, we have a relatively large research division compared to many smaller educational assessment companies, with many specialists dedicated to statistics, psychometrics, learning sciences, cognitive science, and so on. But if you go to conferences, you do, of course, repeatedly run into certain key people within your field from across various institutions. Moreover, when you work in a scientific field, from the outside it often seems holistic and relatively undifferentiated, but it typically breaks down relatively quickly into the lines of work that people are concerned with. For example, I work in an area called diagnostic measurement, on which I co-wrote a book. In that community I have 25 or so colleagues who do consistent, recognizable work, but quite a few more colleagues who occasionally dabble in it.
Measuring psychological traits, yes. It is all about measuring the unobservable characteristics of individuals, the ones that you cannot see directly. The logic is that you design situations (which we often call ‘items’ or ‘tasks’ or ‘activities’ or ‘environments’) in such a way that people, when they interact with them, draw on those skills and give you data, essentially behavioural traces, about the things that they do. They select options. They move around in an environment in a particular way. They write an essay. They give a spoken response. Nowadays you could even measure gestures or facial expressions. You then analyse those data, and infer back from the things that you directly observe to what they might have been relying on when they were doing these kinds of things. It is that chain of reasoning that makes assessment so challenging.
When you say someone is a little bit more ‘clever’ than another person, that is essentially a very intuitive way of thinking about what we do whenever we make comparative judgments, but that is not all. We may also say ‘clever’ meaning a certain person is very competent in English writing. They are at the top end of the scale. They are able to write essays that are informationally relevant, are well structured, contain few errors, are on topic, and so on and so forth. People who are not so skilled might make a lot of mistakes. So that intuition is correct. A lot of testing is about comparing people: rank ordering them or sorting them into different groups. But it is also about evaluating their performance in absolute terms against a particular criterion or standard. Such kinds of evaluations can be done along either one conceptual dimension (like global proficiency in reading, mathematics, or science, for example) or multiple subcompetencies in these domains.
What we find nowadays is that, as the assessment environments that people engage in become more complex and interactive, and to some degree more open ended, and we open up all this space about how individuals and teams could work on these problems, we essentially have to change the kinds of questions that we ask about people. Assessment activities or tasks are like scientific instruments. Once you change the instrument, you can ask new questions about the subject that you are studying. That subject might be the learners in a particular grade or adults in a particular professional situation. As the questions get more complex, the data analytics get more complex, which means that the studies you have to design to convince yourself that what you are seeing is trustworthy also get more complex.
But I think, nowadays, we are able to capture, in a more authentic and comprehensive way, the abilities that learners have across a lifespan, and what sort of non-cognitive factors they bring to bear when they engage in these activities. That is why research in this area is still very much ongoing and the field is continuing to grow. There are things we already know that are very well established, hard facts and truths that you don’t really have to re-question. But there are also a lot of new questions being asked, with new research efforts attached to them, that are worth pursuing."

Educational Testing · fivebooks.com