Handbook of Test Development

Author: Suzanne Lane, Mark Raymond and Thomas Haladyna (Editors)

"The first two books that we have talked about were really about different ways of making a certain technical body of knowledge accessible to different audiences. This next book is about the entirety of the test development process. When you get a degree in graduate school in the area of educational measurement, it is often very much focused on the statistical models, like the ones in the first two books. One of the advantages is that people who come out of these graduate programmes have really solid and detailed training about how to think about the models, how to estimate them, the relative advantages and disadvantages, and so on. What they often lack is a systemic understanding of what happens when you actually try to do these data analyses in a real-life context, where you have to design a test from A to Z. It is so much more than just scoring. It has to do with making complex decisions about the kind of competencies that you want to measure, the kind of tasks that you want to design, the kinds of reports you want to create, and the kinds of studies that you need to do in order to justify the defensibility of those reports. It is about the kind of computational architecture that you need to set up, the kind of data you see, the Excel spreadsheets, the Word documents, all of that. It is about the kind of skill sets that you need in order to manage that entire process. It is about all the really complex, systemic thinking that needs to go into this. Equally important are all the resource constraints that this happens under.
When you work for an assessment company and you actually have to design, evaluate, deploy, and monitor an assessment, whether that is a traditional large-scale assessment or more of an innovative assessment, you have only so much time, only so much money to spend on certain studies, and typically only so much experience to manage all of these processes, which creates constraints that you have to work under. Often what people who are trained in educational measurement find is that when they have learned about all these fancy and wonderful models and they come to a testing company, the models that are being used are relatively simple. They are much simpler than they would expect, although that is not necessarily a good thing. But it has to do with the fact that the simpler models may create graphics or summaries for you that are easily interpretable, or they do the job well enough for operational reporting purposes so that fine-tuning is not necessary. They may be easier to communicate to the clients who have to use the data. Or it might be about sample sizes: you do not have enough people for your assessment. You do not have enough items or tasks that you need for a particular kind of competency that you are interested in measuring, so you cannot really do anything reliably with a fancy model yet, although you can start to think about how to do it eventually. I think that is often a real wake-up call. In this handbook, the editors have done a really nice job of getting together authors who have written on different aspects of this process. I think if someone is in graduate school and learns about statistical models and wants to get a sense of the real world of test development through reading a particular book, this is a really great book to have in your hands. It does convey a sense of the entire enterprise, warts and all.
I think the United States is certainly a country that is well known for large-scale assessment, which has its advantages and disadvantages.
It is where a lot of really strong educational measurement programmes are. A lot of cutting-edge research comes out of the United States. The irony is that it is often done by people who have grown up in very different countries. It is people like me or people from Australia, from Italy, from the Netherlands, from Britain, from Turkey, who do take up jobs in the United States because there is more of a job market. They bring their cultural background and their scientific backgrounds to the table. In that sense, the US is a very attractive place for this kind of work.
One of the big trends that many people are, at some level, familiar with is these international comparison surveys of student achievement that are being done. For example, the PISA survey, which stands for the Programme for International Student Assessment, is one of those international surveys that 30-plus countries participate in every three years. Reading, math, and science are the focal areas and it is essentially a fancy way of summarising the performance of 15-year-olds in these areas and then doing global comparisons of where countries stand. In the early 2000s, in my country, there was this belief that the German educational system was very advanced and had produced all these wonderful strong thinkers and doers. We were almost implicitly expected to perform well on this assessment. But when we participated in PISA for the first time that was not at all the case. We were somewhere in the middle or lower middle of the scale on all these competencies. That was known as the ‘PISA shock’. As a result of that, in Germany, large-scale educational testing got kick-started in the mid-2000s. At the time, I was working at the first national institute for this kind of assessment, the Institute for Educational Progress in Berlin, which still exists today.
We had done other studies like PISA before, but this was the first time standards-based large-scale assessment was done rigorously on a national scale. For better or for worse, it kick-started an entire culture of that kind of assessment in my country. It meant that people had to wrestle with this idea of students being tested at certain intervals, that deficiencies and strengths were being made more public, that money was funnelled into those enterprises now out of state or federal funds, or that certain lines of research were suddenly being advanced to a stronger degree than they had been before. That was a really big change. Nowadays what we have is a world where, in these large-scale surveys, you also use interactive technologies much more. You have more tablet-based or PC-based delivery. With more and more countries participating, there is also more of an awareness of what those kinds of assessments can do and what learners in those countries are able to do. I think one of the challenges with all the innovations in assessment is when you get into more impoverished areas of the world, whether that is within the developing world or the developed world. You still have to struggle with access to the technology for assessment, although some large-scale surveys have recently gone fully computer-based. I think some countries are very fortunate in that a lot of investments are made and computer labs or tablets are becoming very commonplace. But even in those countries, you have pockets where that is not the case. To your earlier question about fairness, it is always a challenge how to make sure that you get a fair representation of what learners can do, given those simple delivery constraints.
First of all, I should say that I have not followed the politics and the societal implications of this closely since I left my country. However, I feel that creating that kind of an awareness about strengths and weaknesses, and challenging the public and the scientific community to wrestle with issues such as how to measure relevant, novel, and complex competencies while we are competing in a global market, is really beneficial. I informally hear from colleagues that there is a weariness setting in about the amount of assessment that is being done. There is also, sometimes, a misperception that assessments are only of a certain, simplistic kind, and therefore any assessment is necessarily a bad thing. I think societally and politically, you have to wrestle with that. I am someone who, as a person, strives for authenticity and integrity when I interact with different stakeholders. I think the worst thing that we can do as scientists or as specialists is to oversell or undersell certain kinds of products such as assessments. Our job is to educate our audience as best as we can about the relative advantages and disadvantages of doing certain kinds of things around assessments. The hard part is that the deep answers to these questions are often very complex and nuanced, and we live in a world where people want fast answers. They want simple answers, and they want to make quick judgements. Even with the best intentions, you sometimes have audiences that are just not receptive. Most of the time, when questions get asked about the value of educational testing, people think of ‘institutions’ that are doing certain things. The reality is that it is always ‘individuals’ who have to communicate ideas, even if at the end of the day, an institution releases a pamphlet or an FAQ.
When you talk to individuals, most scientists, most teachers, most parents, and most students have very good intentions, and many are really very thoughtful, want to do a good job, and want to understand certain complexities. As a result, I always think that it is good to think about the human component of all of this and really engage with the human being, ask the appropriate questions, and be willing and open to learn from the other person, but it clearly goes both ways. If people were to do that frequently, I think the understanding about what assessment can and cannot do in certain situations would be much more evolved and much more nuanced and maybe much more representative of what the current state of the art actually is."

Educational Testing · fivebooks.com