André Rupp's Reading List

André Rupp is Research Director at the Educational Testing Service (ETS) in Princeton, New Jersey.

Educational Testing (2017)

Scraped from fivebooks.com (2017-10-23).

Handbook of Item Response Theory (3-volume series)

Wim van der Linden (editor)

"Item response theory, or IRT, is a framework that people who are statisticians—we call them ‘psychometricians’ in educational and psychological assessment—use a lot. It is, in fact, the predominant framework for taking data from assessments, summarising that data, and reporting scores out to learners, and is a very powerful framework. It is also a very large framework, and subsumes a lot of different models under its hood. What Wim van der Linden—who is the editor of this three-volume series—has done that I find so remarkable is that he has updated a single book that he published several years ago and really brought together a large number of these models under a single umbrella in a coherent and principled fashion. If you are someone who needs to learn about the range of models that exist out there, what they offer in terms of how they summarise data, how you can make inferences with them, what we currently know about how they should be estimated statistically, how their fit to data should be evaluated, and so on, then you can really get a wonderful sense of the entire space by looking across these three volumes. As a reference framework to have on the shelf, it is really indispensable for anybody who studies these kinds of models. And if you work in educational assessment, and you are somebody who works with the quantitative data, you need to learn about these models. To me, it is a must-have volume. In addition, Wim van der Linden is one of the smartest people alive working in the psychometric field. As I said, he has been very principled and rigorous and detailed in editing these books, so that sets of chapters have similar kinds of structures and give a similar balance to the different kinds of topics. I admire that kind of editorial and contributory work as someone who has, myself, written and edited three books.
I know how much hard work that is—to pull together many people with different styles and different personalities and different ways of expressing their ideas. So I admire this book not only for its content, but also in terms of what it represents as an editorial effort.
It is difficult to put a number on how many people work in this field. There are currently over a thousand members of the National Council on Measurement in Education, for example, which is one of the larger associations that has historically existed. Nowadays, one of the challenges is that the boundaries of the field are becoming fuzzy. When you think about educational assessment in the way I talked about it earlier, you also have to think about people who are in learning analytics, data science, and educational data mining fields, for instance. These are often people who have an interdisciplinary training, many with a strong emphasis in computer science. The numbers are just mushrooming from year to year as these kinds of applications get larger, and now we have areas like ‘computational psychometrics’ and very computationally oriented psychometrics programs like the one at the business school in Cambridge. You also have a large number of different companies and start-ups concerned with educational assessment nowadays. You have companies like ETS, which are historically relatively well established and therefore ‘robust’ in some important ways. For example, we have a relatively large research division compared to many smaller educational assessment companies, with many specialists dedicated to statistics, psychometrics, learning sciences, cognitive science, and so on. But if you go to conferences, you do, of course, repeatedly run into certain key people within your field from across various institutions.
Moreover, when you work in a scientific field, from the outside it often seems holistic and relatively undifferentiated but it typically breaks down relatively quickly into lines of work that people are concerned with. For example, I work in an area called diagnostic measurement, which is an area on which I co-wrote a book. In that community I have 25 or so colleagues who do consistent recognizable work but quite a few more colleagues who occasionally dabble in it. Measuring psychological traits, yes. It is all about measuring the unobservable characteristics of individuals that you cannot see directly. The logic is that you design situations—which we often call ‘items’ or ‘tasks’ or ‘activities’ or ‘environments’—in such a way that people, when they interact with them, draw on those skills and give you data—behavioural traces essentially—around the things that they do. They select options. They move around in an environment in a particular way. They write an essay. They give a spoken response. Nowadays you could even measure gestures or facial expressions. You then analyse those data, and infer back from the things that you directly observe to what they might have been relying on when they were doing these kinds of things. It is that chain of reasoning that makes assessment so challenging. When you say someone is a little bit more ‘clever’ than another person, then that is essentially a very intuitive way of thinking about what we do whenever we make comparative judgments but it is not all. In addition, we may say ‘clever’ meaning a certain person is very competent in English writing. They are at the top end of the scale. They are able to write essays that are informationally relevant, are well structured, contain few errors, are on topic, and so on and so forth. People who are not so skilled might make a lot of mistakes. So that intuition is correct. A lot of testing is either about comparing people – rank ordering or sorting them into different groups. 
But it is also about evaluating their performance in absolute terms against a particular criterion or standards. Such kinds of evaluations can be done along either one conceptual dimension – like global proficiency in reading, mathematics, or science for example – or multiple subcompetencies in these domains. What we find nowadays is that, as the assessment environments that people engage in become more complex and interactive – and to some degree more open ended – and we open up all this space about how individuals and teams could work on these problems, we essentially have to change the kind of questions that we ask about people. Assessment activities or tasks are like scientific instruments. Once you change the instrument, you can ask new questions about the subject that you are studying. It might be the learners in a particular grade or adults in a particular professional situation. As the questions get more complex, the data analytics get more complex, which means that any of these studies that you have to design to convince yourself that what you are seeing is trustworthy also get more complex. But I think, nowadays, we are able to capture—in a more authentic and comprehensive way—the abilities learners across a lifespan have, and what sort of non-cognitive factors they bring to bear when they engage in these activities. That is why research in this area is still very much ongoing and the field is continuing to grow. There are things we already know that are very well established, hard facts and truths that you don’t really have to re-question. But there are also a lot of new questions that get asked that have all of these new research efforts attached to them that are worth pursuing."

Principles and Practice of Structural Equation Modeling

Rex Kline

"Structural equation modelling is another set of statistical models that are very popular in the social sciences. They are often used by people who want to investigate how different kinds of abilities, often called ‘constructs’, relate to one another. People design studies where they give survey instruments, for example, or educational assessments. They then create scores from these. They relate all of these to one another. Essentially, it is a very nice way of taking graphical representations of these relationships, and taking data and quantifying how strong these relationships are. Which variable predicts which other variable? Are there moderating or mediating effects between variables that might influence that relationship? How strong is it? Which direction does it go in? And so on and so forth. The reason why I chose this book in particular is because it is reflective of a series that I have really come to like. In this series, the publisher is really trying to break down relatively complex information around assessment methods for people who are educated but not yet experts. If you compare that with the item response theory handbook, it is much more accessible and much more at an introductory level. I think this is such an important kind of work to do in our field. It is the kind of work I identify myself with very much. It is what I call ‘handholding for smart people’. It is the same style in which I co-wrote a book with two colleagues a few years back, and with which Jackie and I have edited our latest handbook. You try to describe the key ideas, the key principles, the key practices in an area at a level where you use technical terms sometimes as well as mathematical equations and graphics but you still talk that all through, step-by-step, so that you do not lose all of the nuance and abstract it so much that you trivialise the ideas.
I think sometimes that colleagues who are scientists think that that is maybe not as valuable, and it is much more valuable to produce very technical publications in peer review journals, but I personally think this kind of book is a very important contribution. It turns out that this topic, structural equation modelling, represents a very popular, very important family of models. This particular book is already in its fourth edition, so it has clearly found a lot of people who appreciate it practically. Imagine you have an application where you are looking at the relationship between different competencies in English language. Let’s say you have three variables: writing competency, speaking competency, and interpersonal communicative competence. You are interested in how these relate to background variables that people bring to bear in assessment. Maybe the kind of educational background that they have, the kinds of households that they come from, or the educational context in which they are learning English. You might also be interested in how certain kinds of non-cognitive factors like motivation, grit, or persistence mediate how they use these competencies to solve tasks. With structural equation modelling you can set up a model where you have these different constructs represented and you can try to see, say, whether one is predictive of the other."

Handbook of Test Development

Suzanne Lane, Mark Raymond, and Thomas Haladyna (editors)

"The first two books that we have talked about were really about different ways of making a certain technical body of knowledge accessible to different audiences. This next book is about the entirety of the test development process. When you get a degree in graduate school in the area of educational measurement, it is often very much focused on the statistical models, like the ones in the first two books. One of the advantages is that people who come out of these graduate programmes have really solid and detailed training about how to think about the models, how to estimate them, the relative advantages and disadvantages, and so on. What they often lack is a systemic understanding of what happens when you actually try to do these data analyses in a real-life context, where you have to design a test from A to Z. It is so much more than just scoring. It has to do with making complex decisions about the kind of competencies that you want to measure, the kind of tasks that you want to design, the kinds of reports you want to create. And the kinds of studies that you need to do in order to justify the defensibility of those reports. It is about the kind of computational architecture that you need to set up, the kind of data you see, the Excel spreadsheets, the Word documents, all of that. The kind of skill sets that you need in order to manage that entire process. It is about all the really complex, systemic thinking that needs to go into this. Equally important are all the resource constraints that this happens under.
When you work for an assessment company and you actually have to design, evaluate, deploy, and monitor an assessment—whether that is a traditional large-scale assessment or more of an innovative assessment—you have only so much time, only so much money to spend on certain studies, and only so much experience typically to manage all of these processes, which creates constraints that you have to work under. Often what people who are trained in educational measurement find is that when they have learned about all these fancy and wonderful models and they come to a testing company, the models that are being used are relatively simple. They are much simpler than they would expect although that is not necessarily a good thing. But it has to do with the fact that the simpler models may create graphics or summaries for you that are easily interpretable, or they do the job well enough for operational reporting purposes so that fine-tuning is not necessary. They may be easier to communicate to the clients who have to use the data. Or it might be about sample sizes—you do not have enough people for your assessment. You do not have enough items or tasks that you need for a particular kind of competency that you are interested in measuring, so you cannot really do anything reliably with a fancy model yet although you can start to think about how to do it eventually. I think that is often a real wake up call. In this handbook, the editors have done a really nice job of getting together authors who have written on different aspects of this process. I think if someone is in graduate school and learns about statistical models and wants to get a sense of the real world of test development through reading a particular book, this is a really great book to have in your hands. It does convey a sense of the entire enterprise, with warts and all. I think the United States is certainly a country that is well known for large-scale assessment, which has its advantages and disadvantages. 
It is where a lot of really strong educational measurement programmes are. A lot of cutting edge research comes out of the United States. The irony is that it is often done by people that have grown up in very different countries. It is people like me or people from Australia, from Italy, from the Netherlands, from Britain, from Turkey, who do take up jobs in the United States because there is more of a job market. They bring their cultural background and their scientific backgrounds to the table. In that sense, the US is a very attractive place for this kind of work. One of the big trends that many people are, at some level, familiar with is these international comparison surveys of student achievement that are being done. For example, the PISA survey, which stands for the Programme for International Student Assessment, is one of those international surveys that 30-plus countries participate in every three years. Reading, math, and science are the focal areas and it is essentially a fancy way of summarising the performance of 15-year-olds in these areas and then doing global comparisons of where countries stand. In the mid-2000s, in my country, there was this belief that the German educational system was very advanced and had produced all these wonderful strong thinkers and doers. We were almost implicitly expected to perform well on this assessment. But when we participated in PISA for the first time that was not at all the case. We were somewhere in the middle or upper middle of the scale on all these competencies. That was known as the ‘PISA shock’. As a result of that, in Germany, large-scale educational testing got kick-started in the middle-2000s. At the time, I was working at the first national institute for this kind of an assessment, the Institute for Educational Progress in Berlin, which still exists today. 
We had done other studies like PISA before but this was the first time standards-based large-scale assessment was done rigorously on a national scale. For better or for worse, it kick-started an entire culture of that kind of assessment in my country. It meant that people had to wrestle with this idea of students being tested at certain intervals, that deficiencies and strengths were being made more public, that money was funnelled into those enterprises now out of state or federal funds, or that certain lines of research were suddenly being advanced to a stronger degree than they had been before. That was a really big change. Nowadays what we have is a world where, in these large-scale surveys, you also use interactive technologies much more. You have more tablet-based or PC-based delivery. With more and more countries participating, there is also more of an awareness of what those kinds of assessments can do and what learners in those countries are able to do. I think one of the challenges with all the innovations in assessment is when you get into more impoverished areas of the world, whether that is within the developing world or the developed world. You still have to struggle with access to the technology for assessment although some large-scale surveys have recently gone fully computer-based. I think some countries are very fortunate in that a lot of investments are made and computer labs or tablets are becoming very commonplace. But even in those countries, you have pockets where that is not the case. To your earlier question about fairness, it is always a challenge how to make sure that you get a fair representation of what learners can do, given those simple delivery constraints.
First of all I should say that I have not followed the politics and the societal implications of this closely since I left my country. However, I feel that creating that kind of an awareness about strengths and weaknesses—and challenging the public and the scientific community to wrestle with issues such as how to measure relevant, novel, and complex competencies while we are competing in a global market—is really beneficial. I informally hear from colleagues that there is a weariness setting in about the amount of assessment that is being done. There is also, sometimes, a misperception that assessments are only of a certain, simplistic kind, and therefore any assessment is necessarily a bad thing. I think societally and politically, you have to wrestle with that. I am someone who, as a person, strives for authenticity and integrity when I interact with different stakeholders. I think the worst thing that we can do as scientists or as specialists is to oversell or undersell certain kinds of products such as assessments. Our job is to educate our audience as best as we can about what the relative advantages and disadvantages are of doing certain kinds of things around assessments. The hard aspect about this is that the deep answers to these questions are often very complex and nuanced, and we live in a world where people want fast answers. They want simple answers, and they want to make quick judgements. Even with the best intentions, you sometimes have audiences that are just not receptive. Most of the time, when questions get asked about the value of educational testing, typically people think of ‘institutions’ that are doing certain things. The reality is that it is always ‘individuals’ who have to communicate ideas, even if at the end of the day, an institution releases a pamphlet or an FAQ. 
When you talk to individuals, most scientists, most teachers, most parents, most students have very good intentions, and many are really very thoughtful, want to do a good job, and want to understand certain complexities. As a result, I always think that it is good to think about the human component of all of this and really engage with the human being and ask the appropriate questions, be willing and open to learn from the other person—but it clearly goes both ways. If people were to do that frequently, I think the understanding about what assessment can and cannot do in certain situations would be much more evolved and much more nuanced and maybe much more representative of what the current state of the art actually is."

The Skilled Facilitator: A Comprehensive Resource for Consultants, Facilitators, Coaches, and Trainers

Roger Schwarz

"The Skilled Facilitator is a book that I have picked because my current job requires management. I am currently a research director at ETS, and that means I have a team of colleagues who work with me on different projects. Even if I were not in that position, a lot of work at this company is interdisciplinary, and so you work with people with all different kinds of backgrounds and training. You have to bring them together and share information, get reasonable buy-in for ideas, for processes, for practices, and that is a hard thing to do. The current director of the division that I am in at one point referenced the ‘mutual learning’ framework by Roger Schwarz. This particular book is one of the first books in which Roger talks about that framework in detail. It is essentially a very accessible and elegant way of communicating that in order to work together, you have to have a psychological mindset that is formed on ideas like transparency, curiosity, compassion, and that you should ground communication and negotiation on those kinds of values and their surrounding culture. Put simply, rather than being top-down, punitive, secretive, and unnecessarily directive, this kind of mindset really helps teams work together better. It helps you connect with individuals better. It helps you make smarter, more efficient, more effective managerial choices, which, if you think about that whole test development process that we talked about earlier in our third book, is really what you need. I have found this framework to be really, really helpful.
I actually have a poster on my door from a workshop that Roger Schwarz and colleagues did to remind people that when we have conversations about ideas, those are the principles, the assumptions, and also the behaviours that we should be guided by. I have seen it work really well. Incidentally, I always believed in those kinds of values anyway, so for me it was just fine-tuning that, reminding myself to continue to improve myself as a manager, as a director, as a colleague. I really believe that even if you just have a small unit, you need to live by a set of values that are really constructive, that also reflect you, and that you want others to live by when you are not with them. And let’s face it, as a director or manager or coach, that is 99% of the time. I personally believe that this is a critical part of how we do our job and are successful, and how we help others be successful."

Hamilton: The Revolution

Jeremy McCarter & Lin-Manuel Miranda

"Finally some light reading, right? I chose this book because the work that we do specifically on the innovative edges of educational assessment is, as I said earlier, a mixture of scientific rigour and artful practice. Essentially, it is all about designing under constraints. The design decisions have to permeate everything, from the way you design the activities to the way you design your scoring to the way you design the reporting to the way you design how in teams you work together to make all of this happen. I am, personally, a big supporter and fan of performing arts: musicals, plays, concerts, comedy. I have seen over 900 shows in my life, in many different countries. When I moved to the East Coast, I eventually got closer to New York City, and that is like paradise for anybody who loves the performing arts of any kind. Every day you could go to several different shows that are phenomenal. Hamilton is a musical by Lin-Manuel Miranda. I think that most people have heard of it in some way by now. What he did is, to me, just such an inspiration and I love what it represents on so many levels. It is a musical that he created based on an inspiration that he had when he read Alexander Hamilton’s biography on a vacation. He thought it would be a hip-hop story. Then, over many years—as you find out if you watch his film or read this book—he created, in many different steps, with many different iterations, many different colleagues, and many different decisions, this engrossing musical that is so different from any other musical that currently exists in the world. It fuses a variety of musical styles like pop, rock and hip-hop in a beautifully flowing narrative. It teaches you about a part of history. It teaches you about the personal challenges of people who were involved in the history. That makes it really accessible. It is beautifully staged. The music is phenomenal. The lighting is fantastic.
I admire the complexity of all of these decisions that had to be made and all of the people who had to come together to make a project like this successful. If you know anything about Broadway or other kinds of professional theatre, producers will say that most productions lose money and are not profitable. I forgot what the number was, but I think only 10-15% of musicals ever recoup their initial investment on Broadway. Everybody wants the holy grail like Hamilton. To really change the culture of what it means to be a musical in this way, I find that so inspiring. For me, going with my wife and seeing this show or any other good show for that matter really lifts me up. It lifts my soul up, and I bring that to my work. I try to take that same spirit into conversations that we have around assessment design or when we write articles, and to really always be an artist while also being a scientist. I think having this kind of creation out there as a landmark is just unbelievable. It is such a wonderful and admirable piece of work, as many others have said. I highly recommend seeing Hamilton and supporting the performing arts. One of the sadder things about standardised assessment over the years is that the predominant focus has often been on math, science, and reading. STEM—science, technology, engineering, and mathematics—is important but the inclusion of ‘A’ for the arts—STEAM—is really important, because I feel the arts are such a powerful contributor to how human beings are shaped, what their values are, what their beliefs are. It is how their passions are ignited. It can bring out the best in people. It is important to support that through educational assessments, which is why I really admire innovative assessments where maybe learners have to design certain kinds of artefacts or tools or environments, and we try to measure that, and model it, and give them feedback on it in such a way that it is still assessment and not just a cool exercise.
Yes, absolutely. It is even happening, to some degree, on these international surveys that I mentioned earlier. It is certainly happening in research projects that have been used in school districts. For example, one of my colleagues has designed a system called Inq-ITS for science. The research team has developed apps for teachers where they can monitor how learners in their classroom are doing and get indicators of engagement and feedback on the learners, while the learners can do interactive tasks that are smartly designed to help them do scientific experimentation. Then I have colleagues who are doing research on video games—some people call those ‘serious’ educational games. One is called Newton’s Playground, where learners have to design, graphically, innovative solutions that help a ball reach a balloon in an environment with obstacles. It uses understanding of physics to help learners do that, but it has these creative design components. All of this is happening, but it is typically at the edges, so there are quite a few research projects that are funded by the National Science Foundation, the MacArthur Foundation, the Bill and Melinda Gates Foundation, the Institute of Education Sciences, or start-ups. I think those assessments are a critical part of our future. Unfortunately, it is understandably not the first thing that people think about when they think of educational assessment, which is more like the standardised, sit-in-a-classroom, paper-and-pencil test, with relatively abstract questions that seem, for many people, disconnected from what adults do in their professions. That is of course partly true, but that association is also partly a shame because that is not the entirety of where the field is or the core of where it is going."
