Dismissive Reviews in Education Policy Research
  #  Author  Co-author(s)  Dismissive Quote  Type  Title  Source  Link  Notes
1 Jill Barshay Daniel Koretz [interviewee] "In this country, we treat education data as the private sandbox of superintendents and commissioners. This is entirely different from how we treat data in other areas of public policy, such as medicine or airline safety." Dismissive PROOF POINTS: 5 Questions for Daniel Koretz  Hechinger Report, July 13, 2020 https://hechingerreport.org/proof-points-5-five-questions-for-daniel-koretz/ There are privacy controls on student data; there should probably be more, and they should probably be enforced more vigorously. But such controls are even stronger for medical data, the opposite of what Koretz implies here. Nonetheless, access to anonymized student data is granted all the time. Externally administered high-stakes testing is widely reviled among US educationists. It strains credulity that Koretz cannot find one district out of the many thousands to cooperate with him to conduct a study to discredit testing.  
2 Jill Barshay Daniel Koretz [interviewee] "And so there aren’t that many studies, but the ones we have are quite consistent." Dismissive PROOF POINTS: 5 Questions for Daniel Koretz  Hechinger Report, July 13, 2020 https://hechingerreport.org/proof-points-5-five-questions-for-daniel-koretz/ In fact, the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post-World Wide Web) college admission test prep research literature: https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also: Ortar (1960); Marron (1965); ETS (1965); Messick & Jungeblut (1981); Ellis, Konoske, Wulfeck, & Montague (1982); DerSimonian and Laird (1983); Kulik, Bangert-Drowns & Kulik (1984); Powers (1985); Jones (1986); Fraker (1986/1987); Halpin (1987); Whitla (1988); Snedecor (1989); Bond (1989); Baydar (1990); Becker (1990); Smyth (1990); Moore (1991); Alderson & Wall (1992); Powers (1993); Oren (1993); Powers & Rock (1994); Scholes & Lane (1997); Allalouf & Ben-Shakhar (1998); Robb & Ercanbrack (1999); McClain (1999); Camara (1999, 2001, 2008); Stone & Lane (2000, 2003); Din & Soldan (2001); Briggs (2001); Palmer (2002); Briggs & Hansen (2004); Cankoy & Ali Tut (2005); Crocker (2005); Allensworth, Correa, & Ponisciak (2008); Domingue & Briggs (2009); Koljatic & Silva (2014); Early (2019). The many experimental studies of test coaching are consistent: coaching has a modest effect, not the volatile or very large effects that Koretz claims.  
3 Jill Barshay Daniel Koretz [interviewee] "Experts have been writing about test score inflation since at least 1951. It’s not news but people have willfully ignored it." Denigrating PROOF POINTS: 5 Questions for Daniel Koretz  Hechinger Report, July 13, 2020 https://hechingerreport.org/proof-points-5-five-questions-for-daniel-koretz/ Seems hypocritical. The most famous, and most honest, study of test score inflation--which primarily blamed cheating, corruption, and lax test security for it--was conducted by John J. Cannell in the mid-1980s. Koretz and his colleagues at CRESST have misrepresented Cannell's reports for three decades. More recently, Koretz has claimed that he conducted the first test score inflation study around 1990.  
4 Daniel M. Koretz   "Our current system is premised on the assumption that if we hold people accountable for just a few important things — primarily scores on a few tests — the rest of what matters in schools will follow along, but experience has confirmed that this is nonsense." Dismissive American students aren't getting smarter — and test-based 'reform' initiatives are to blame NBC News, Thought Experiment https://www.nbcnews.com/think/opinion/american-students-aren-t-getting-smarter-test-based-reform-initiatives-ncna1103366 In fact, the evidence "that testing can improve education" is voluminous. See, for example, Phelps, R. P. (2005). The rich, robust research literature on testing’s achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press. Or, see https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract  
5 Daniel M. Koretz   "One of the main reasons for the failure of test-based accountability was reformers’ refusal to evaluate their innovations before imposing them wholesale on students and teachers." Dismissive American students aren't getting smarter — and test-based 'reform' initiatives are to blame NBC News, Thought Experiment https://www.nbcnews.com/think/opinion/american-students-aren-t-getting-smarter-test-based-reform-initiatives-ncna1103366 In fact, many, if not most, large-scale testing and accountability programs in the past have been evaluated. The evaluation reports tended to end up on shelves in district and state research bureaus. Some declare there to be no research after looking only in the most easily accessible locations for the most easily retrieved evidence.  
6 Daniel M. Koretz   "However, our experience is still limited, and there is a serious dearth of research investigating the characteristics and effects of testing in the postsecondary sector." Dismissive Measuring Postsecondary Achievement: Lessons from Large-Scale Assessments in the K-12 Sector Higher Education Policy, April 24, 2019, Abstract https://link.springer.com/article/10.1057/s41307-019-00142-4 In fact, the research literature on testing in higher education is long and deep. Consider, for example, the work of Trudy Banta, Patricia Cross, and Thomas Angelo. See also the large number of higher education studies in this meta-analysis: https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
7 Matt Barnum Daniel Koretz [interviewee] Journalist: I take it it’s very hard to quantify this test prep phenomenon, though? Koretz: It is extremely hard, and there’s a big hole in the research in this area. Dismissive Why one Harvard professor calls American schools’ focus on testing a ‘charade’ Chalkbeat, January 19, 2018 https://www.chalkbeat.org/posts/us/2018/01/19/why-one-harvard-professor-calls-american-schools-focus-on-testing-a-charade/ In fact, the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post-World Wide Web) college admission test prep research literature: https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also: Ortar (1960); Marron (1965); ETS (1965); Messick & Jungeblut (1981); Ellis, Konoske, Wulfeck, & Montague (1982); DerSimonian and Laird (1983); Kulik, Bangert-Drowns & Kulik (1984); Powers (1985); Jones (1986); Fraker (1986/1987); Halpin (1987); Whitla (1988); Snedecor (1989); Bond (1989); Baydar (1990); Becker (1990); Smyth (1990); Moore (1991); Alderson & Wall (1992); Powers (1993); Oren (1993); Powers & Rock (1994); Scholes & Lane (1997); Allalouf & Ben-Shakhar (1998); Robb & Ercanbrack (1999); McClain (1999); Camara (1999, 2001, 2008); Stone & Lane (2000, 2003); Din & Soldan (2001); Briggs (2001); Palmer (2002); Briggs & Hansen (2004); Cankoy & Ali Tut (2005); Crocker (2005); Allensworth, Correa, & Ponisciak (2008); Domingue & Briggs (2009); Koljatic & Silva (2014); Early (2019)  
8 Matt Barnum Daniel Koretz [interviewee] "There aren’t that many studies, but they’re very consistent. The inflation that does show up is sometimes absolutely massive. Worse, there is growing evidence that that problem is more severe for disadvantaged kids, creating the illusion of improved equity." Dismissive Why one Harvard professor calls American schools’ focus on testing a ‘charade’ Chalkbeat, January 19, 2018 https://www.chalkbeat.org/posts/us/2018/01/19/why-one-harvard-professor-calls-american-schools-focus-on-testing-a-charade/ In fact, the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post-World Wide Web) college admission test prep research literature: https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also: Ortar (1960); Marron (1965); ETS (1965); Messick & Jungeblut (1981); Ellis, Konoske, Wulfeck, & Montague (1982); DerSimonian and Laird (1983); Kulik, Bangert-Drowns & Kulik (1984); Powers (1985); Jones (1986); Fraker (1986/1987); Halpin (1987); Whitla (1988); Snedecor (1989); Bond (1989); Baydar (1990); Becker (1990); Smyth (1990); Moore (1991); Alderson & Wall (1992); Powers (1993); Oren (1993); Powers & Rock (1994); Scholes & Lane (1997); Allalouf & Ben-Shakhar (1998); Robb & Ercanbrack (1999); McClain (1999); Camara (1999, 2001, 2008); Stone & Lane (2000, 2003); Din & Soldan (2001); Briggs (2001); Palmer (2002); Briggs & Hansen (2004); Cankoy & Ali Tut (2005); Crocker (2005); Allensworth, Correa, & Ponisciak (2008); Domingue & Briggs (2009); Koljatic & Silva (2014); Early (2019)  
9 Daniel M. Koretz   "However, this reasoning isn't just simple, it's simplistic--and the evidence is overwhelming that this approach [that testing can improve education] has failed. … these improvements are few and small. Hard evidence is limited, a consequence of our failure as a nation to evaluate these programs appropriately before imposing them on all children." Dismissive The Testing Charade: Pretending to Make Schools Better [Kindle location 142] University of Chicago Press, 2017 https://www.press.uchicago.edu/ucp/books/book/chicago/T/bo24695545.html In fact, the evidence "that testing can improve education" is voluminous. See, for example, Phelps, R. P. (2005). The rich, robust research literature on testing’s achievement benefits. In R. P. Phelps (Ed.), Defending standardized testing (pp. 55–90). Mahwah, NJ: Psychology Press.  
10 Daniel M. Koretz   "The bottom line: the information yielded by tests, while very useful, is never by itself adequate for evaluating programs, schools, or educators. Self-evident as this should be, it has been widely ignored in recent years. Indeed, ignoring this obvious warning has been the bedrock of test-based education reform." Denigrating The Testing Charade: Pretending to Make Schools Better [Kindle location 142] University of Chicago Press, 2017 https://www.press.uchicago.edu/ucp/books/book/chicago/T/bo24695545.html I know of no testing professional who claims that testing by itself is adequate for evaluating programs, schools, or educators. But, by the same notion, neither are other measures used alone, such as inspections or graduation rates.  
11 Daniel M. Koretz   "…as of the late 1980s there was not a single study evaluating whether inflation occurred or how severe it was. With three colleagues, I set out to conduct one." 1stness The Testing Charade: Pretending to Make Schools Better [Kindle location 142] University of Chicago Press, 2017 https://www.press.uchicago.edu/ucp/books/book/chicago/T/bo24695545.html * The most famous test score inflation study of all time -- John J. Cannell's "Lake Wobegon Effect" study -- preceded Koretz's by several years. See:  http://nonpartisaneducation.org/Review/Books/CannellBook1.htm  http://nonpartisaneducation.org/Review/Books/Cannell2.pdf  
12 Daniel M. Koretz   "However, value-added estimates are rarely calculated with lower-stakes tests that are less likely to be inflated." Dismissive The Testing Charade: Pretending to Make Schools Better [Kindle location 142] University of Chicago Press, 2017 https://www.press.uchicago.edu/ucp/books/book/chicago/T/bo24695545.html Almost all value-added measurements (VAM) are calculated on scores from tests with no stakes for the students. The state of Tennessee, which pioneered VAM and has continued to use it for two decades, uses nationally normed reference tests that have no stakes for anyone, including teachers. Moreover, research shows that low-stakes tests are more prone to score inflation than high-stakes tests.  
13 Daniel M. Koretz   "One reason we know less than we should … is that most of the abundant test score data available to us are too vulnerable to score inflation to be trusted. There is a second reason for the dearth of information, the blame for which lies squarely on the shoulders of many of the reformers." Dismissive The Testing Charade: Pretending to Make Schools Better [Kindle location 142] University of Chicago Press, 2017 https://www.press.uchicago.edu/ucp/books/book/chicago/T/bo24695545.html The vast amount of information already available just for the asking, worldwide, could help build better accountability systems, without wasting more research grant money on those who refuse to study what is already available.   
14 Daniel M. Koretz   "High-quality evaluations of the test-based reforms aren't common, …" Denigrating The Testing Charade: Pretending to Make Schools Better [Kindle location 142] University of Chicago Press, 2017 https://www.press.uchicago.edu/ucp/books/book/chicago/T/bo24695545.html Actually, high-quality evaluations of testing interventions have been numerous and common over the past century. Most of them do not produce the results that Koretz prefers, however, so he declares them nonexistent. See https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
15 Daniel M. Koretz   "The first solid study documenting score inflation was presented twenty-five years before I started writing this book." 1stness The Testing Charade: Pretending to Make Schools Better [Kindle location 142] University of Chicago Press, 2017 https://www.press.uchicago.edu/ucp/books/book/chicago/T/bo24695545.html * The most famous test score inflation study of all time -- John J. Cannell's "Lake Wobegon Effect" study -- preceded Koretz's by several years. See:  http://nonpartisaneducation.org/Review/Books/CannellBook1.htm  http://nonpartisaneducation.org/Review/Books/Cannell2.pdf  
16 Daniel M. Koretz   "The first study showing illusory improvement in achievement gaps--the largely bogus "Texas miracle"--was published only ten years after that." 1stness The Testing Charade: Pretending to Make Schools Better [Kindle location 142] University of Chicago Press, 2017 https://www.press.uchicago.edu/ucp/books/book/chicago/T/bo24695545.html * The most famous test score inflation study of all time -- John J. Cannell's "Lake Wobegon Effect" study -- preceded Koretz's by several years. See:  http://nonpartisaneducation.org/Review/Books/CannellBook1.htm  http://nonpartisaneducation.org/Review/Books/Cannell2.pdf  
17 Daniel M. Koretz Jennifer L. Jennings, Hui Leng Ng, Carol Yu, David Braslow, Meredith Langi "A number of studies have estimated smaller effects of coaching for the SAT, often in the range of 0.1–0.2 standard deviation on the mathematics test (e.g., Briggs, 2009; Dominigue & Briggs, 2009; Powers & Rock, 1999). However, these studies reflect a different process than test prep in K–12 schools and are methodologically weaker; while most studies of K–12 score inflation rely on comparisons of identical or randomly equivalent groups, studies of SAT coaching rely on covariate-adjustment or propensity-score matching in an attempt to remove differences between coached and uncoached students." Denigrating Auditing for score inflation using self-monitoring assessments: Findings from three pilot studies Harvard Library Office for Scholarly Communication, to be published in Educational Assessment https://dash.harvard.edu/handle/1/28269315 So now it seems that there is other, previous research on test coaching, but Koretz et al. pick out only three studies from the many available and declare them inferior to their own work. The Koretz et al. studies do not control for any aspect of test administration and, at best, make only meager efforts at content matching between the two tests they compare.  
18 Daniel M. Koretz Holcombe, Jennings “To date, few studies have attempted to understand the sources of variation in score inflation across testing programs.” p. 3 Dismissive The roots of score inflation, an examination of opportunities in two states’ tests  Prepublication draft “to appear in Sunderman (Ed.), Charting reform: achieving equity in a diverse nation http://dash.harvard.edu/bitstream/handle/1/10880587/roots%20of%20score%20inflation.pdf?sequence=1 In fact, the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post-World Wide Web) college admission test prep research literature: https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also: Ortar (1960); Marron (1965); ETS (1965); Messick & Jungeblut (1981); Ellis, Konoske, Wulfeck, & Montague (1982); DerSimonian and Laird (1983); Kulik, Bangert-Drowns & Kulik (1984); Powers (1985); Jones (1986); Fraker (1986/1987); Halpin (1987); Whitla (1988); Snedecor (1989); Bond (1989); Baydar (1990); Becker (1990); Smyth (1990); Moore (1991); Alderson & Wall (1992); Powers (1993); Oren (1993); Powers & Rock (1994); Scholes & Lane (1997); Allalouf & Ben-Shakhar (1998); Robb & Ercanbrack (1999); McClain (1999); Camara (1999, 2001, 2008); Stone & Lane (2000, 2003); Din & Soldan (2001); Briggs (2001); Palmer (2002); Briggs & Hansen (2004); Cankoy & Ali Tut (2005); Crocker (2005); Allensworth, Correa, & Ponisciak (2008); Domingue & Briggs (2009); Koljatic & Silva (2014); Early (2019)  
19 Daniel M. Koretz Waldman, Yu, Langi, Orzech “Few studies have applied a multi-level framework to the evaluation of inflation,” p. 1 Denigrating Using the introduction of a new test to investigate the distribution of score inflation  Working paper of Education Accountability Project at the Harvard Graduate School of Education, Nov. 2014 http://projects.iq.harvard.edu/files/eap/files/ky_cot_3_2_15_working_paper.pdf In fact, the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post-World Wide Web) college admission test prep research literature: https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also: Ortar (1960); Marron (1965); ETS (1965); Messick & Jungeblut (1981); Ellis, Konoske, Wulfeck, & Montague (1982); DerSimonian and Laird (1983); Kulik, Bangert-Drowns & Kulik (1984); Powers (1985); Jones (1986); Fraker (1986/1987); Halpin (1987); Whitla (1988); Snedecor (1989); Bond (1989); Baydar (1990); Becker (1990); Smyth (1990); Moore (1991); Alderson & Wall (1992); Powers (1993); Oren (1993); Powers & Rock (1994); Scholes & Lane (1997); Allalouf & Ben-Shakhar (1998); Robb & Ercanbrack (1999); McClain (1999); Camara (1999, 2001, 2008); Stone & Lane (2000, 2003); Din & Soldan (2001); Briggs (2001); Palmer (2002); Briggs & Hansen (2004); Cankoy & Ali Tut (2005); Crocker (2005); Allensworth, Correa, & Ponisciak (2008); Domingue & Briggs (2009); Koljatic & Silva (2014); Early (2019)  
20 Daniel M. Koretz   "What we don’t know, What is the net effect on student achievement?
  - Weak research designs, weaker data
  - Some evidence of inconsistent, modest effects in elementary math, none in reading
  - Effects are likely to vary across contexts...
Reason: grossly inadequate research and evaluation"
Denigrating Using tests for monitoring and accountability Presentation at:  Agencia de Calidad de la Educación Santiago, Chile, November 3, 2014   See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
21 Daniel M. Koretz Jennifer L. Jennings  “We find that research on the use of test score data is limited, and research investigating the understanding of tests and score data is meager.” p. 1 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities/ Relevant studies include: Forte Fast, E., & the Accountability Systems and Reporting State Collaborative on Assessment and Student Standards. (2002). A guide to effective accountability reporting. Washington, DC: Council of Chief State School Officers. * Goodman, D., & Hambleton, R.K. (2005). Some misconceptions about large-scale educational assessments, Chapter 4 in Richard P Phelps (Ed.) Defending Standardized Testing, Psychology Press. * Goodman, D. P., & Hambleton (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education. * Hambleton, R. K. (2002). How can we make NAEP and state test score reporting scales and reports more understandable? In R. W. Lissitz & W. D. Schafer (Eds.), Assessment in educational reform (pp. 192-205). Boston: Allyn & Bacon. * Impara, J. C., Divine, K. P., Bruce, F. A., Liverman, M. R., & Gay, A. (1991). Does interpretive test score information help teachers? Educational Measurement: Issues and Practice, 10(4), 16-18. * Wainer, H., Hambleton, R. K., & Meara, K. (1999). Alternative displays for communicating NAEP results: A redesign and validity study. Journal of Educational Measurement, 36(4), 301-335.  
22 Daniel M. Koretz Jennifer L. Jennings “Because of the sparse research literature, we rely on experience and anecdote in parts of this paper, with the premise that these conclusions should be supplanted over time by findings from systematic research.” p. 1 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities Relevant studies include: Forte Fast, E., & the Accountability Systems and Reporting State Collaborative on Assessment and Student Standards. (2002). A guide to effective accountability reporting. Washington, DC: Council of Chief State School Officers. * Goodman, D., & Hambleton, R.K. (2005). Some misconceptions about large-scale educational assessments, Chapter 4 in Richard P Phelps (Ed.) Defending Standardized Testing, Psychology Press. * Goodman, D. P., & Hambleton (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education. * Hambleton, R. K. (2002). How can we make NAEP and state test score reporting scales and reports more understandable? In R. W. Lissitz & W. D. Schafer (Eds.), Assessment in educational reform (pp. 192-205). Boston: Allyn & Bacon. * Impara, J. C., Divine, K. P., Bruce, F. A., Liverman, M. R., & Gay, A. (1991). Does interpretive test score information help teachers? Educational Measurement: Issues and Practice, 10(4), 16-18. * Wainer, H., Hambleton, R. K., & Meara, K. (1999). Alternative displays for communicating NAEP results: A redesign and validity study. Journal of Educational Measurement, 36(4), 301-335.  
23 Daniel M. Koretz Jennifer L. Jennings "...the relative performance of schools is difficult to interpret in the presence of score inflation. At this point, we know very little about the factors that may predict higher levels of inflation —for example, characteristics of tests, accountability systems, students, or schools." p.4 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities In fact, we know quite a lot about the source of higher levels of score inflation -- it is lax test security. The many experimental studies of test coaching are consistent: coaching has a modest effect, not the volatile or very large effects that Koretz claims.  
24 Daniel M. Koretz Jennifer L. Jennings "Unfortunately, it is often exceedingly difficult to obtain the permission and access needed to carry out testing-related research in the public education sector. This is particularly so if the research holds out the possibility of politically inconvenient findings, which virtually all evaluations in this area do. In our experience, very few state or district superintendents or commissioners consider it an obligation to provide the public or the field with open and impartial research." Dismissive The Misunderstanding and Use of Data from Educational Tests, pp.4-5 Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities/ Externally administered high-stakes testing is widely reviled among US educationists. It strains credulity that Koretz cannot find one district out of the many thousands to cooperate with him to discredit testing.  
25 Daniel M. Koretz Jennifer L. Jennings “We focus on three issues that are especially relevant to test-based data and about which research is currently sparse:
  How do the types of data made available for use affect policymakers’ and educators’ understanding of data?
  What are the common errors made by policymakers and educators in interpreting test score data?
  How do high-stakes testing and the availability of test-based data affect administrator and teacher practice?” (p. 5)
Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities Relevant studies of the effects of tests and/or accountability programs on motivation and instructional practice: Goslin (1967), *Southern Regional Education Board (1998); Johnson (1998); Schafer, Hultgren, Hawley, Abrams Seubert & Mazzoni (1997); Miles, Bishop, Collins, Fink, Gardner, Grant, Hussain, et al. (1997); Tuckman & Trimble (1997); Clarke & Stephens (1996); Zigarelli (1996); Stevenson, Lee, et al. (1995); Waters, Burger & Burger (1995); Egeland (1995); Prais (1995); Tuckman (1994); Ritchie & Thorkildsen (1994); Brown & Walberg, (1993); Wall & Alderson (1993); Wolf & Rapiau (1993); Eckstein & Noah (1993); Chao-Qun & Hui (1993); Plazak & Mazur (1992); Steedman (1992); Singh, Marimuthu & Mukherjee (1990); *Levine & Lezotte (1990); O’Sullivan (1989); Somerset (1988); Pennycuick & Murphy (1988); Stevens (1984); Marsh (1984); Brunton (1982); Solberg (1977); Foss (1977); *Kirkland (1971); Somerset (1968); Stuit (1947); and Keys (1934).  *Covers many studies; study is a research review, research synthesis, or meta-analysis. Moreover, the mastery learning/mastery testing experiments conducted from the 1960s through today varied incentives, frequency of tests, types of tests, and many other factors to determine the optimal structure of testing programs. Researchers included such notables as Bloom, Carroll, Keller, Block, Burns, Wentling, Anderson, Hymel, Kulik, Tierney, Cross, Okey, Guskey, Gates, and Jones.
26 Daniel M. Koretz Jennifer L. Jennings “Systematic research exploring educators’ understanding of both the principles of testing and appropriate interpretation of test-based data is meager.”, p.5 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities Relevant studies include: Forte Fast, E., & the Accountability Systems and Reporting State Collaborative on Assessment and Student Standards. (2002). A guide to effective accountability reporting. Washington, DC: Council of Chief State School Officers. * Goodman, D., & Hambleton, R.K. (2005). Some misconceptions about large-scale educational assessments, Chapter 4 in Richard P Phelps (Ed.) Defending Standardized Testing, Psychology Press. * Goodman, D. P., & Hambleton (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education. * Hambleton, R. K. (2002). How can we make NAEP and state test score reporting scales and reports more understandable? In R. W. Lissitz & W. D. Schafer (Eds.), Assessment in educational reform (pp. 192-205). Boston: Allyn & Bacon. * Impara, J. C., Divine, K. P., Bruce, F. A., Liverman, M. R., & Gay, A. (1991). Does interpretive test score information help teachers? Educational Measurement: Issues and Practice, 10(4), 16-18. * Wainer, H., Hambleton, R. K., & Meara, K. (1999). Alternative displays for communicating NAEP results: A redesign and validity study. Journal of Educational Measurement, 36(4), 301-335.  
27 Daniel M. Koretz Jennifer L. Jennings "Although current, systematic information is lacking, our experience is that that the level of understanding of test data among both educators and education policymakers is in many cases abysmally low.", p.6 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities Relevant studies include: Forte Fast, E., & the Accountability Systems and Reporting State Collaborative on Assessment and Student Standards. (2002). A guide to effective accountability reporting. Washington, DC: Council of Chief State School Officers. * Goodman, D., & Hambleton, R.K. (2005). Some misconceptions about large-scale educational assessments, Chapter 4 in Richard P Phelps (Ed.) Defending Standardized Testing, Psychology Press. * Goodman, D. P., & Hambleton (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education. * Hambleton, R. K. (2002). How can we make NAEP and state test score reporting scales and reports more understandable? In R. W. Lissitz & W. D. Schafer (Eds.), Assessment in educational reform (pp. 192-205). Boston: Allyn & Bacon. * Impara, J. C., Divine, K. P., Bruce, F. A., Liverman, M. R., & Gay, A. (1991). Does interpretive test score information help teachers? Educational Measurement: Issues and Practice, 10(4), 16-18. * Wainer, H., Hambleton, R. K., & Meara, K. (1999). Alternative displays for communicating NAEP results: A redesign and validity study. Journal of Educational Measurement, 36(4), 301-335.  
28 Daniel M. Koretz Jennifer L. Jennings "There has been a considerably (sic) amount of research exploring problems with standards-based reporting, but less on the use and interpretation of standards-based data by important stakeholders." p.12 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities Relevant studies include: Forte Fast, E., & the Accountability Systems and Reporting State Collaborative on Assessment and Student Standards. (2002). A guide to effective accountability reporting. Washington, DC: Council of Chief State School Officers. * Goodman, D., & Hambleton, R.K. (2005). Some misconceptions about large-scale educational assessments, Chapter 4 in Richard P Phelps (Ed.) Defending Standardized Testing, Psychology Press. * Goodman, D. P., & Hambleton (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education. * Hambleton, R. K. (2002). How can we make NAEP and state test score reporting scales and reports more understandable? In R. W. Lissitz & W. D. Schafer (Eds.), Assessment in educational reform (pp. 192-205). Boston: Allyn & Bacon. * Impara, J. C., Divine, K. P., Bruce, F. A., Liverman, M. R., & Gay, A. (1991). Does interpretive test score information help teachers? Educational Measurement: Issues and Practice, 10(4), 16-18. * Wainer, H., Hambleton, R. K., & Meara, K. (1999). Alternative displays for communicating NAEP results: A redesign and validity study. Journal of Educational Measurement, 36(4), 301-335.  
29 Daniel M. Koretz Jennifer L. Jennings "We have heard former teachers discuss this frequently, saying that new teachers in many schools are inculcated with the notion that raising scores in tested subjects is in itself the appropriate goal of instruction. However, we lack systematic data about this..." p.14 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities See, for example, https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
30 Daniel M. Koretz Jennifer L. Jennings "Research on score inflation is not abundant, largely for the reason discussed above: policymakers for the most part feel no obligation to allow the relevant research, which is not in their self-interest even when it is in the interests of students in schools. However, at this time, the evidence is both abundant enough and sufficiently often discussed that the existence of the general issue of score inflation appears to be increasingly widely recognized by the media, policymakers, and educators." p.17 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities Externally administered high-stakes testing is widely reviled among US educationists. It strains credulity that Koretz cannot find one district out of the many thousands to cooperate with him in a study to discredit testing.  
31 Daniel M. Koretz Jennifer L. Jennings "The issue of score inflation is both poorly understood and widely ignored in the research community as well." p.18 Denigrating The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
32 Daniel M. Koretz Jennifer L. Jennings "Research on coaching is very limited." p.21 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities  
33 Daniel M. Koretz Jennifer L. Jennings "How is test-based information used by educators? … The types of research done to date on this topic, while useful, are insufficient." p.26 Denigrating The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities Relevant studies include: Forte Fast, E., & the Accountability Systems and Reporting State Collaborative on Assessment and Student Standards. (2002). A guide to effective accountability reporting. Washington, DC: Council of Chief State School Officers. * Goodman, D., & Hambleton, R.K. (2005). Some misconceptions about large-scale educational assessments, Chapter 4 in Richard P Phelps (Ed.) Defending Standardized Testing, Psychology Press. * Goodman, D. P., & Hambleton (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education. * Hambleton, R. K. (2002). How can we make NAEP and state test score reporting scales and reports more understandable? In R. W. Lissitz & W. D. Schafer (Eds.), Assessment in educational reform (pp. 192-205). Boston: Allyn & Bacon. * Impara, J. C., Divine, K. P., Bruce, F. A., Liverman, M. R., & Gay, A. (1991). Does interpretive test score information help teachers? Educational Measurement: Issues and Practice, 10(4), 16-18. * Wainer, H., Hambleton, R. K., & Meara, K. (1999). Alternative displays for communicating NAEP results: A redesign and validity study. Journal of Educational Measurement, 36(4), 301-335.  
34 Daniel M. Koretz Jennifer L. Jennings "… We need to design ways of measuring coaching, which has been almost entirely unstudied." p.26 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
35 Daniel M. Koretz Jennifer L. Jennings  “We have few systematic studies of variations in educators’ responses. …” p. 26 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities Relevant studies of the effects of tests and/or accountability program on motivation and instructional practice: Goslin (1967), *Southern Regional Education Board (1998); Johnson (1998); Schafer, Hultgren, Hawley, Abrams Seubert & Mazzoni (1997); Miles, Bishop, Collins, Fink, Gardner, Grant, Hussain, et al. (1997); Tuckman & Trimble (1997); Clarke & Stephens (1996); Zigarelli (1996); Stevenson, Lee, et al. (1995); Waters, Burger & Burger (1995); Egeland (1995); Prais (1995); Tuckman (1994); Ritchie & Thorkildsen (1994); Brown & Walberg, (1993); Wall & Alderson (1993); Wolf & Rapiau (1993); Eckstein & Noah (1993); Chao-Qun & Hui (1993); Plazak & Mazur (1992); Steedman (1992); Singh, Marimutha & Mukjerjee (1990); *Levine & Lezotte (1990); O’Sullivan (1989); Somerset (1988); Pennycuick & Murphy (1988); Stevens (1984); Marsh (1984); Brunton (1982); Solberg (1977); Foss (1977); *Kirkland (1971); Somerset (1968); Stuit (1947); and Keys (1934).  *Covers many studies; study is a research review, research synthesis, or meta-analysis. Moreover, the mastery learning/mastery testing experiments conducted from the 1960s through today varied incentives, frequency of tests, types of tests, and many other factors to determine the optimal structure of testing programs. Researchers included such notables as Bloom, Carroll, Keller, Block, Burns, Wentling, Anderson, Hymel, Kulik, Tierney, Cross, Okey, Guskey, Gates, and Jones.
36 Daniel M. Koretz Jennifer L. Jennings "Ultimately, our concern is the impact of educators’ understanding and use of test data on student learning. However, at this point, we have very little comparative information about the validity of gains, ....  The comparative information that is beginning to emerge suggests..." p.26 Dismissive The Misunderstanding and Use of Data from Educational Tests  Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
37 Daniel Koretz Anton Béguin "In the past, score inflation has usually been evaluated by comparing trends in scores on a high-stakes test to trends on a lower-stakes audit test.",  abstract Dismissive Self-Monitoring Assessment for Educational Accountability Systems Measurement: Interdisciplinary Research and Perspectives, 8(2–3), 92–109.   No, most of the research on test prep, test coaching, and score inflation has been conducted in experiments. In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
38 Daniel Koretz Anton Béguin "In most of the research to date, score inflation has been evaluated by comparing trends on a high-stakes test to trends on an audit test—a low- or lower-stakes test intended to measure a reasonably similar domain of achievement." p.93 Dismissive Self-Monitoring Assessment for Educational Accountability Systems Measurement: Interdisciplinary Research and Perspectives, 8(2–3), 92–109.    
39 Daniel M. Koretz   "There is a lack of persuasive evidence of positive effects from test-based accountability." p.1 Dismissive Implications of Current Policy for Educational Measurement. Policy Brief Center for K–12 Assessment & Performance Management, Educational Testing Service http://www.k12center.org/publications.html See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
40 Daniel M. Koretz   "Confronting these problems requires improvements in the design of both accountability systems and the tests used in them." p.1 Denigrating Implications of Current Policy for Educational Measurement. Policy Brief Center for K–12 Assessment & Performance Management, Educational Testing Service http://www.k12center.org/publications.html In other words, there isn't enough research ... and there never will be.  
41 Daniel M. Koretz   "The measurement field has not drawn from research in other fields on accountability systems. Rather, it has proceeded as if it were working in isolation. It also has not conducted sufficient research on the problems being encountered in test-based accountability." p.2 Dismissive Implications of Current Policy for Educational Measurement. Policy Brief Center for K–12 Assessment & Performance Management, Educational Testing Service http://www.k12center.org/publications.html In other words, there isn't enough research ... and there never will be.  
42 Daniel M. Koretz   "It has not addressed adequately the implications of test-based accountability for the field's own activities." p.2 Denigrating Implications of Current Policy for Educational Measurement. Policy Brief Center for K–12 Assessment & Performance Management, Educational Testing Service http://www.k12center.org/publications.html In other words, there isn't enough research ... and there never will be.  
43 Daniel M. Koretz   “The field of measurement has not kept pace with this transformation of testing.” p. 3 Denigrating Some Implications of Current Policy for Educational Measurement paper presented at the Exploratory Seminar: Measurement Challenges Within the Race to the Top Agenda, December 2009 http://www.k12center.org/rsc/pdf/KoretzPresenterSession3.pdf In other words, there isn't enough research ... and there never will be.  
44 Daniel M. Koretz   “For the most part, notwithstanding Lindquist’s warning, the field of measurement has largely ignored the top levels of sampling.” p. 6 Dismissive Some Implications of Current Policy for Educational Measurement paper presented at the Exploratory Seminar: Measurement Challenges Within the Race to the Top Agenda, December 2009 http://www.k12center.org/rsc/pdf/KoretzPresenterSession3.pdf Many psychometricians work in the field of gifted testing. Indeed, some specialize in it, and have created a large, robust research literature. One can find much of it at web sites such as "Hoagie's Gifted" and those for the gifted education research centers such as: Belin-Blank (in Iowa); Josephson (in Nevada); Johns Hopkins Center for Talented Youth (in Maryland); and Duke University's Talent Identification Program.  
45 Daniel M. Koretz   "The field of measurement has devoted a great deal of effort to respond to the demands of TBA.  …but, however valuable they may be for other reasons, they are not helpful for confronting the core problem of Campbell's Law." p.14 Dismissive Some Implications of Current Policy for Educational Measurement paper presented at the Exploratory Seminar: Measurement Challenges Within the Race to the Top Agenda, December 2009 http://www.k12center.org/rsc/pdf/KoretzPresenterSession3.pdf No. See other blurbs above and below.  
46 Daniel M. Koretz   “Currently, research on accountability-related topics, such as score inflation and effects on educational practice, is slowly growing but remains largely divorced from the core activities of the measurement field.” p. 15 Dismissive Some Implications of Current Policy for Educational Measurement paper presented at the Exploratory Seminar: Measurement Challenges Within the Race to the Top Agenda, December 2009 http://www.k12center.org/rsc/pdf/KoretzPresenterSession3.pdf No. See other blurbs above and below.  
47 Daniel M. Koretz   “The data, however, are more limited and more complex than is often realized, and the story they properly tell is not quite so straightforward. . . . Data about student performance at the end of high school are scarce and especially hard to collect and interpret.” p. 38 Dismissive How do American students measure up? Making Sense of International Comparisons The Future of Children 19:1 Spring 2009 http://www.princeton.edu/futureofchildren/publications/docs/19_01_FullJournal.pdf Relevant studies of the effects of testing on at-risk students, completion, dropping out, curricular offerings, attitudes, etc. include those of Schleisman (1999); the *Southern Regional Education Board (1998); Webster, Mendro, Orsak, Weerasinghe & Bembry (1997); Jones (1996); Boylan (1996); Jones, 1993; Jacobson (1992); Grisay (1991); Johnstone (1990); Task Force on Educational Assessment Programs [Florida] (1979); Wellisch, MacQueen, Carriere & Duck (1978); Enochs (1978); Pronaratna (1976); and McWilliams & Thomas (1976).  *Covers many studies; study is a research review, research synthesis, or meta-analysis.  
48 Daniel M. Koretz   “International comparisons clearly do not provide what many observers of education would like. . . . The findings are in some cases inconsistent from one study to another. Moreover, the data from all of these studies are poorly suited to separating the effects of schooling from the myriad other influences on student achievement.” p. 48 Dismissive How do American students measure up? Making Sense of International Comparisons The Future of Children 19:1 Spring 2009 http://www.princeton.edu/futureofchildren/publications/docs/19_01_FullJournal.pdf If they do not provide what "many observers" want, why are they so popular? The first international comparison study included fewer than ten countries several decades ago. Now, several dozen participate each time, at great expense. As for the differences in results, they are to be expected. The Trends in Mathematics and Science Study (TIMSS) is an achievement test administered in primary and middle school. PISA is quite different—more or less an aptitude test administered to fifteen-year-olds.  
49 Daniel M. Koretz   “If truly comparable data from the end of schooling were available, they would presumably look somewhat different, though it is unlikely that they would be greatly more optimistic.” p. 49 Dismissive How do American students measure up? Making Sense of International Comparisons The Future of Children 19:1 Spring 2009 http://www.princeton.edu/futureofchildren/publications/docs/19_01_FullJournal.pdf If they do not provide what "many observers" want, why are they so popular? The first international comparison study included fewer than ten countries several decades ago. Now, several dozen participate each time, at great expense. As for the differences in results, they are to be expected. The Trends in Mathematics and Science Study (TIMSS) is an achievement test administered in primary and middle school. PISA is quite different—more or less an aptitude test administered to fifteen-year-olds.  
50 Daniel M. Koretz   “Few detailed studies of score inflation have been carried out. ...” p. 778 Dismissive Test-based educational accountability. Research evidence and implications Zeitschrift für Pädagogik 54 (2008) 6, S. 777–790 http://www.pedocs.de/volltexte/2011/4376/pdf/ZfPaed_2008_6_Koretz_Testbased_educational_accountability_D_A.pdf * The most famous test score inflation study of all time -- John J. Cannell's "Lake Wobegon Effect" study -- preceded Koretz's by several years. See:  http://nonpartisaneducation.org/Review/Books/CannellBook1.htm  http://nonpartisaneducation.org/Review/Books/Cannell2.pdf  
51 Scott J. Cech Daniel Koretz [interviewee] “'If you tell people that performance on that tested sample is what matters, that’s what they worry about, so you can get inappropriate responses in the classroom and inflated test scores,' he said.” “Mr. Koretz pointed to research in the 1990s on the state standardized test then used in Kentucky, ...” Dismissive Testing Expert Sees ‘Illusions of Progress’ Under NCLB Education Week, October 1, 2008   Koretz's score inflation studies typically employ no controls for test administration or test content factors. One of his tests might be administered with tight security and the other with none at all. One of his tests might focus on one subject area and the other test another topic entirely. He writes as if all of his "left out" variables could not possibly matter. Moreover, he ignores completely the huge experimental literature on test prep in favor of his apples-to-oranges comparison studies.  
52 Scott J. Cech Daniel Koretz [interviewee] "Mr. Koretz said the relative dearth to date of comparative studies on large-scale state assessments isn’t for lack of trying. He said he and other scholars have often been rebuffed after approaching officials about the possibility of studying their assessment systems." Dismissive Testing Expert Sees ‘Illusions of Progress’ Under NCLB Education Week, October 1, 2008   Externally administered high-stakes testing is widely reviled among US educationists. It strains credulity that Koretz cannot find one district out of the many thousands to cooperate with him in a study to discredit testing.  
53 Scott J. Cech Daniel Koretz [interviewee] “There have not been a lot of studies of this,” Mr. Koretz said, “for the simple reason that it’s politically rather hard to do, to come to a state chief and say, ‘I’d like the chance to see whether your test scores are inflated.’” Dismissive Testing Expert Sees ‘Illusions of Progress’ Under NCLB Education Week, October 1, 2008   Externally administered high-stakes testing is widely reviled among US educationists. It strains credulity that Koretz cannot find one district out of the many thousands to cooperate with him in a study to discredit testing.  
54 Daniel M. Koretz   “Unfortunately, while we have a lot of anecdotal evidence suggesting that this [equity as the rationale for NCLB] is the case, we have very few serious empirical studies of this.” answer to 3rd question, 1st para Denigrating What does educational testing really tell us?  Education Week [interview], 9.23.2008 http://blogs.edweek.org/edweek/eduwonkette/2008/09/what_does_educational_testing_1.html A "rationale" is an argument, a belief, an explanation, not an empirical result. The civil rights groups that supported NCLB did so because they saw it as an equity vehicle.  
55 Daniel M. Koretz   "…we rarely know when [test] scores are inflated because we so rarely check." Dismissive Interpreting test scores: More complicated than you think [interview] Chronicle of Higher Education, August 15, 2008, p. A23   In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
56 Daniel M. Koretz Katherine E. Ryan, Lorrie A. Shepard, Eds. "...traditional psychometrics was in two critical respects tacitly premised on low stakes. The first is that it gave relatively little attention to the consequences of testing. The second is a special case of the first: traditional psychometrics focused little on behavioral responses to testing, other than the behavior of the student while taking the test and of proctors administering it." pp.71-72 Dismissive Further steps toward the development of an accountability-oriented science of measurement Chapter 4 in The Future of Test-Based Educational Accountability Routledge Actually, high-quality evaluations of testing interventions have been numerous and common over the past century. Most of them do not produce the results that Koretz prefers, however, so he declares them nonexistent. See https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
57 Daniel M. Koretz Katherine E. Ryan, Lorrie A. Shepard, Eds. "Nonetheless it is fair to say that most of the psychometric enterprise--what people in the field did when developing methods or operating testing programs--proceeded without much attention to these concerns." p.72 Denigrating Further steps toward the development of an accountability-oriented science of measurement Chapter 4 in The Future of Test-Based Educational Accountability Routledge Relevant studies of the effects of tests and/or accountability program on motivation and instructional practice: Goslin (1967), *Southern Regional Education Board (1998); Johnson (1998); Schafer, Hultgren, Hawley, Abrams Seubert & Mazzoni (1997); Miles, Bishop, Collins, Fink, Gardner, Grant, Hussain, et al. (1997); Tuckman & Trimble (1997); Clarke & Stephens (1996); Zigarelli (1996); Stevenson, Lee, et al. (1995); Waters, Burger & Burger (1995); Egeland (1995); Prais (1995); Tuckman (1994); Ritchie & Thorkildsen (1994); Brown & Walberg, (1993); Wall & Alderson (1993); Wolf & Rapiau (1993); Eckstein & Noah (1993); Chao-Qun & Hui (1993); Plazak & Mazur (1992); Steedman (1992); Singh, Marimutha & Mukjerjee (1990); *Levine & Lezotte (1990); O’Sullivan (1989); Somerset (1988); Pennycuick & Murphy (1988); Stevens (1984); Marsh (1984); Brunton (1982); Solberg (1977); Foss (1977); *Kirkland (1971); Somerset (1968); Stuit (1947); and Keys (1934).  *Covers many studies; study is a research review, research synthesis, or meta-analysis. Moreover, the mastery learning/mastery testing experiments conducted from the 1960s through today varied incentives, frequency of tests, types of tests, and many other factors to determine the optimal structure of testing programs. Researchers included such notables as Bloom, Carroll, Keller, Block, Burns, Wentling, Anderson, Hymel, Kulik, Tierney, Cross, Okey, Guskey, Gates, and Jones.
58 Daniel M. Koretz Katherine E. Ryan, Lorrie A. Shepard, Eds. "The past several decades have also witnessed a growth in empirical research exploring the effects of accountability-oriented testing programs. A limited amount of work has investigated the validity of gains obtained under high-stakes conditions (e.g., Jacob, 2005, 2007; Koretz, Linn, Dunbar, & Shepard, 1991; Koretz & Barron, 1998)." p.74 Denigrating Further steps toward the development of an accountability-oriented science of measurement Chapter 4 in The Future of Test-Based Educational Accountability Routledge Koretz's score inflation studies typically employ no controls for test administration or test content factors. One of his tests might be administered with tight security and the other with none at all. One of his tests might focus on one subject area and the other test another topic entirely. He writes as if all of his "left out" variables could not possibly matter. Moreover, he ignores completely the huge experimental literature on test prep in favor of his apples-to-oranges comparison studies.  
59 Daniel M. Koretz Katherine E. Ryan, Lorrie A. Shepard, Eds. "Although it is clear that behavioral responses to high-stakes testing pose serious challenges to conventional practices in measurment, the field's responses to them have been meager. Little has been done to explore alternative practices--either in the design of tests or in the operation of testing programs." p.86 Denigrating Further steps toward the development of an accountability-oriented science of measurement Chapter 4 in The Future of Test-Based Educational Accountability Routledge Koretz's score inflation studies typically employ no controls for test administration or test content factors. One of his tests might be administered with tight security and the other with none at all. One of his tests might focus on one subject area and the other test another topic entirely. He writes as if all of his "left out" variables could not possibly matter. Moreover, he ignores completely the huge experimental literature on test prep in favor of his apples-to-oranges comparison studies.  
60 Daniel M. Koretz Katherine E. Ryan, Lorrie A. Shepard, Eds. "Perhaps most striking, the problem of score inflation gets fleeting mention, if any at all, in most evaluations or discussions of validity, whether in technical reports of testing programs, the scholarly literature, or textbooks--even though the bias introduced by score inflatio can dwarf that caused by some factors that receive more attention." p.86 Denigrating Further steps toward the development of an accountability-oriented science of measurement Chapter 4 in The Future of Test-Based Educational Accountability Routledge His theory of score inflation gets little attention within the profession because it is a red herring. Outside the domain of psychometricians, however, it gets quite a lot of attention, and is widely believed as valid.  
61 Daniel M. Koretz Katherine E. Ryan, Lorrie A. Shepard, Eds. "There are far too few studies of the validity of scores under high-stakes conditions, and we know very little about the distribution and correlates of score inflation (e.g., its variation across types of testing programs, types of schools, or types of schools." p.87 Dismissive Further steps toward the development of an accountability-oriented science of measurement Chapter 4 in The Future of Test-Based Educational Accountability Routledge In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
62 Daniel M. Koretz Katherine E. Ryan, Lorrie A. Shepard, Eds. "Extant research on teachers' and principals' responses to testing, although somewhat more copious, is still insufficient, providing little systematic data on the use of test-preparation materials and other forms of coaching or on the relationships between test design and instructional responses." p.87   Further steps toward the development of an accountability-oriented science of measurement Chapter 4 in The Future of Test-Based Educational Accountability Routledge Relevant studies of the effects of tests and/or accountability programs on motivation and instructional practice: Goslin (1967), *Southern Regional Education Board (1998); Johnson (1998); Schafer, Hultgren, Hawley, Abrams, Seubert & Mazzoni (1997); Miles, Bishop, Collins, Fink, Gardner, Grant, Hussain, et al. (1997); Tuckman & Trimble (1997); Clarke & Stephens (1996); Zigarelli (1996); Stevenson, Lee, et al. (1995); Waters, Burger & Burger (1995); Egeland (1995); Prais (1995); Tuckman (1994); Ritchie & Thorkildsen (1994); Brown & Walberg, (1993); Wall & Alderson (1993); Wolf & Rapiau (1993); Eckstein & Noah (1993); Chao-Qun & Hui (1993); Plazak & Mazur (1992); Steedman (1992); Singh, Marimuthu & Mukherjee (1990); *Levine & Lezotte (1990); O’Sullivan (1989); Somerset (1988); Pennycuick & Murphy (1988); Stevens (1984); Marsh (1984); Brunton (1982); Solberg (1977); Foss (1977); *Kirkland (1971); Somerset (1968); Stuit (1947); and Keys (1934).  *Covers many studies; study is a research review, research synthesis, or meta-analysis. Moreover, the mastery learning/mastery testing experiments conducted from the 1960s through today varied incentives, frequency of tests, types of tests, and many other factors to determine the optimal structure of testing programs. Researchers included such notables as Bloom, Carroll, Keller, Block, Burns, Wentling, Anderson, Hymel, Kulik, Tierney, Cross, Okey, Guskey, Gates, and Jones.
63 Daniel M. Koretz Gail Sunderland, Ed. "... We know far too little about how to hold schools accountable for improving student performance.", p.9 Dismissive The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press The vast amount of information already available just for the asking, worldwide, could help build better accountability systems, without wasting more research grant money on those who refuse to study what is already available.   
64 Daniel M. Koretz Gail Sunderland, Ed. "A modest number of studies argue that high-stakes testing does or doesn't improve student performance in tested subjects.", p.10 Dismissive The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press In fact, a very large number of studies do so. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
65 Daniel M. Koretz Gail Sunderland, Ed. "This research tells us little. Much of it is of very low quality, and even the careful studies are hobbled by data that are inadequate for the task.", p.10 Denigrating The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press In fact, a very large number of studies do so. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
66 Daniel M. Koretz Gail Sunderland, Ed. "Moreover, this research asks too simple a question. Asking whether test-based accountability works is a bit like asking whether medicine works. What medicines? For what medical conditions?", p.10 Denigrating The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press In fact, a very large number of studies do so. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
67 Daniel M. Koretz Gail Sunderland, Ed. "We need research and evaluation to address this question, because we lack a grounded answer.", p.11 Dismissive The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press In fact, a very large number of studies do so. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
68 Daniel M. Koretz Gail Sunderland, Ed. " ... research does not tell us whether high-stakes testing works.", p.11 Dismissive The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press In fact, a very large number of studies do so. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
69 Daniel M. Koretz Gail Sunderland, Ed. "The few relevant studies [of test score inflation] are of two types: detailed evaluations of scores in specific jurisdictions, .... We have far fewer ... than we should.", pp.11-12 Denigrating The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
70 Daniel M. Koretz Gail Sunderland, Ed. "The results of the relatively few relevant studies are both striking and consistent: gains on high-stakes tests often do not generalize well to other measures, and the gap is frequently huge." p.12 Dismissive The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press  
71 Daniel M. Koretz Gail Sunderland, Ed. "But this remains only a hypothesis, not yet tested by much empirical evidence." p.14 Dismissive The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press The most famous test score inflation study of all time -- John J. Cannell's "Lake Wobegon Effect" study -- is largely about cheating. See:  http://nonpartisaneducation.org/Review/Books/CannellBook1.htm  http://nonpartisaneducation.org/Review/Books/Cannell2.pdf;  See also Gregory J. Cizek's Cheating on Tests: https://www.goodreads.com/book/show/5084641-cheating-on-tests ; and Caveon Test Security's resource pages: https://www.caveon.com/resources/  
72 Daniel M. Koretz Gail Sunderland, Ed. "We urgently need finer grained studies of this issue.", p.14 Denigrating The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press The most famous test score inflation study of all time -- John J. Cannell's "Lake Wobegon Effect" study -- is largely about cheating. See:  http://nonpartisaneducation.org/Review/Books/CannellBook1.htm  http://nonpartisaneducation.org/Review/Books/Cannell2.pdf;  See also Gregory J. Cizek's Cheating on Tests: https://www.goodreads.com/book/show/5084641-cheating-on-tests ; and Caveon Test Security's resource pages: https://www.caveon.com/resources/  
73 Daniel M. Koretz Gail Sunderland, Ed. "There are limited systematic data about cheating.", p.16 Denigrating The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press The most famous test score inflation study of all time -- John J. Cannell's "Lake Wobegon Effect" study -- is largely about cheating. See:  http://nonpartisaneducation.org/Review/Books/CannellBook1.htm  http://nonpartisaneducation.org/Review/Books/Cannell2.pdf;  See also Gregory J. Cizek's Cheating on Tests: https://www.goodreads.com/book/show/5084641-cheating-on-tests ; and Caveon Test Security's resource pages: https://www.caveon.com/resources/  
74 Daniel M. Koretz Gail Sunderland, Ed. "Building those better [accountability] systems requires more systematic, empirical data, and that, in turn, requires a serious agenda of R&D.", p.26 Denigrating The pending reauthorization of NCLB: An opportunity to rethink the basic strategy Chapter 1 in Holding NCLB accountable: Achieving accountability, equity, and school reform, 2008 Corwin Press The vast amount of information already available just for the asking, worldwide, could help build better accountability systems, without wasting more research grant money on those who refuse to study what is already available.   
75 Daniel M. Koretz   “… [T]he problem of score inflation is at best inconvenient and at worse [sic] threatening. (The latter is one reason that there are so few studies of this problem. …)” p. 11 Dismissive Measuring up: What educational testing really tells us Harvard University Press, 2008  Google Books Externally administered high-stakes testing is widely reviled among US educationists. It strains credulity that Koretz cannot find one district out of the many thousands to cooperate with him on a study to discredit testing.  
76 Daniel M. Koretz    “The relatively few studies that have addressed this question support the skeptical interpretation: in many cases, mastery of material on the new test simply substitutes for mastery of the old.” p. 242 Dismissive Measuring up: What educational testing really tells us Harvard University Press, 2008  Google Books Koretz's preferred method for "auditing" a high-stakes test is to compare its score trends to those of a parallel no-stakes test, which, presumably, will have totally reliable score trends. Yet, a cornucopia of research has shown "no stakes" tests to be relatively unreliable, less reliable than high stakes tests, and to dampen student effort (see, e.g., Ackerman & Kanfer, 2009; S. M. Brown & Walberg, 1993; Cole, Bergin, & Whittaker, 2008; Eklof, 2007; Finn, 2015; Hawthorne, Bol, Pribesh, & Suh, 2015; Wise & DeMars, 2005, 2015).  
77 Daniel M. Koretz   “Because so many people consider test-based accountability to be self-evaluating … there is a disturbing lack of good evaluations of these systems. …” p. 331 Denigrating Measuring up: What educational testing really tells us Harvard University Press, 2008  Google Books Koretz's preferred method for "auditing" a high-stakes test is to compare its score trends to those of a parallel no-stakes test, which, presumably, will have totally reliable score trends. Yet, a cornucopia of research has shown "no stakes" tests to be relatively unreliable, less reliable than high stakes tests, and to dampen student effort (see, e.g., Ackerman & Kanfer, 2009; S. M. Brown & Walberg, 1993; Cole, Bergin, & Whittaker, 2008; Eklof, 2007; Finn, 2015; Hawthorne, Bol, Pribesh, & Suh, 2015; Wise & DeMars, 2005, 2015).  
78 Daniel M. Koretz   “Most of these few studies showed a rapid divergence of means on the two tests. …” p. 348 Dismissive Using aggregate-level linkages for estimation and validation, etc. in Linking and Aligning Scores and Scales, Springer, 2007 Google Books Koretz's preferred method for "auditing" a high-stakes test is to compare its score trends to those of a parallel no-stakes test, which, presumably, will have totally reliable score trends. Yet, a cornucopia of research has shown "no stakes" tests to be relatively unreliable, less reliable than high stakes tests, and to dampen student effort (see, e.g., Ackerman & Kanfer, 2009; S. M. Brown & Walberg, 1993; Cole, Bergin, & Whittaker, 2008; Eklof, 2007; Finn, 2015; Hawthorne, Bol, Pribesh, & Suh, 2015; Wise & DeMars, 2005, 2015).  
79 Daniel M. Koretz Valerie Strauss, journalist interviewer “The testing culture ‘has a lot more momentum than it should,’ agreed [CRESST researcher Koretz].  He said a lack of solid research on the results of the new testing regimen—or those that predated No Child Left Behind—essentially means that the country is experimenting with its young people.” Dismissive The rise of the testing culture, p.A09 Strauss, V. (2006, October 10). Washington Post   In fact, a very large number of such studies exist. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
80 Daniel M. Koretz & Laura S. Hamilton Robert L. Brennan, Ed. "Most of the studies of [testing's] effects on practice report average responses that mask some of these important variations and interactions." p.552 Denigrating Testing for Accountability in K-12 Chapter 15 in Educational Measurement, published by NCME and ACE, 2006   Relevant studies of the effects of varying types of incentive or the optimal structure of incentives include those of Kelley (1999); the *Southern Regional Education Board (1998); Trelfa (1998); Heneman (1998); Banta, Lund, Black & Oblander (1996); Brooks-Cooper (1993); Eckstein & Noah (1993); Richards & Shen (1992); Jacobson (1992); Heyneman & Ransom (1992); *Levine & Lezotte (1990); Duran (1989); *Crooks (1988); *Kulik & Kulik (1987); Corcoran & Wilson (1986); *Guskey & Gates (1986); Brook & Oxenham (1985); Oxenham (1984); Venezky & Winfield (1979); Brookover & Lezotte (1979); McMillan (1977); Abbott (1977); *Staats (1973); *Kazdin & Bootzin (1972); *O’Leary & Drabman (1971); Cronbach (1960); Hurlock (1925); and Zeng (2001). *Covers many studies; study is a research review, research synthesis, or meta-analysis.  Other researchers who, even prior to 2000, studied test-based incentive programs include Homme, Csanyi, Gonzales, Rechs, O’Leary, Drabman, Kazdin, Bootzin, Staats, Cameron, Pierce, McMillan, Corcoran, Roueche, Kirk, Wheeler, Boylan, and Wilson. International organizations, such as the World Bank or the Asian Development Bank, have studied the effects of testing on education programs they sponsor.  Researchers have included Somerset, Heyneman, Ransom, Psacharopoulos, Velez, Brooke, Oxenham, Bude, Chapman, Snyder, and Pronaratna.
Moreover, the mastery learning/mastery testing experiments conducted from the 1960s through today varied incentives, frequency of tests, types of tests, and many other factors to determine the optimal structure of testing programs. Researchers included such notables as Bloom, Carroll, Keller, Block, Burns, Wentling, Anderson, Hymel, Kulik, Tierney, Cross, Okey, Guskey, Gates, and Jones.
81 Daniel M. Koretz & Laura S. Hamilton Robert L. Brennan, Ed. "There is no comprehensive source of information on how much time schools devote to coaching activities such as practicing on released test forms, but some studies suggest these activities are widespread." p.552 Dismissive Testing for Accountability in K-12 Chapter 15 in Educational Measurement, published by NCME and ACE, 2006   In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
82 Daniel M. Koretz & Laura S. Hamilton Robert L. Brennan, Ed. "As with coaching, there are no comprehensive studies of the frequency of cheating across schools in the United States." p.553 Dismissive Testing for Accountability in K-12 Chapter 15 in Educational Measurement, published by NCME and ACE, 2006   Actually, there have been surveys in which respondents freely admit that they cheat, and how. Moreover, news reports of cheating, by students or educators, have been voluminous. See, for example, Caveon Test Security's "Cheating in the News" section on its web site.  
83 Daniel M. Koretz & Laura S. Hamilton Robert L. Brennan, Ed. "However, in the absence of audit testing, this hypothesis [of score inflation] cannot be tested." p.553 Denigrating Testing for Accountability in K-12 Chapter 15 in Educational Measurement, published by NCME and ACE, 2006   Koretz's preferred method for "auditing" a high-stakes test is to compare its score trends to those of a parallel no-stakes test, which, presumably, will have totally reliable score trends. Yet, a cornucopia of research has shown "no stakes" tests to be relatively unreliable, less reliable than high stakes tests, and to dampen student effort (see, e.g., Ackerman & Kanfer, 2009; S. M. Brown & Walberg, 1993; Cole, Bergin, & Whittaker, 2008; Eklof, 2007; Finn, 2015; Hawthorne, Bol, Pribesh, & Suh, 2015; Wise & DeMars, 2005, 2015).  
84 Daniel M. Koretz   "Research to date makes clear that score gains achieved under high-stakes conditions should not be accepted at face value. ...policymakers embarking on an effort to create a more effective system of ...accountability must face uncertainty about how well alternatives will function in practice, and should be prepared for a period of evaluation and mid-course correction." Dismissive Alignment, High Stakes, and the Inflation of Test Scores CRESST Report 655, June 2005 https://cresst.org/wp-content/uploads/R655.pdf Koretz's preferred method for "auditing" a high-stakes test is to compare its score trends to those of a parallel no-stakes test, which, presumably, will have totally reliable score trends. Yet, a cornucopia of research has shown "no stakes" tests to be relatively unreliable, less reliable than high stakes tests, and to dampen student effort (see, e.g., Ackerman & Kanfer, 2009; S. M. Brown & Walberg, 1993; Cole, Bergin, & Whittaker, 2008; Eklof, 2007; Finn, 2015; Hawthorne, Bol, Pribesh, & Suh, 2015; Wise & DeMars, 2005, 2015).  
85 Daniel M. Koretz   "Thus, even in a well-aligned system, policymakers still face the challenge of designing educational accountability systems that create the right mix of incentives: incentives that will maximize real gains in student performance, minimize score inflation, and generate other desirable changes in educational practice. This is a challenge in part because of a shortage of relevant experience and research..." Dismissive Alignment, High Stakes, and the Inflation of Test Scores CRESST Report 655, June 2005 https://cresst.org/wp-content/uploads/R655.pdf Relevant studies of the effects of varying types of incentive or the optimal structure of incentives include those of Kelley (1999); the *Southern Regional Education Board (1998); Trelfa (1998); Heneman (1998); Banta, Lund, Black & Oblander (1996); Brooks-Cooper (1993); Eckstein & Noah (1993); Richards & Shen (1992); Jacobson (1992); Heyneman & Ransom (1992); *Levine & Lezotte (1990); Duran (1989); *Crooks (1988); *Kulik & Kulik (1987); Corcoran & Wilson (1986); *Guskey & Gates (1986); Brook & Oxenham (1985); Oxenham (1984); Venezky & Winfield (1979); Brookover & Lezotte (1979); McMillan (1977); Abbott (1977); *Staats (1973); *Kazdin & Bootzin (1972); *O’Leary & Drabman (1971); Cronbach (1960); Hurlock (1925); and Zeng (2001). *Covers many studies; study is a research review, research synthesis, or meta-analysis.  Other researchers who, even prior to 2000, studied test-based incentive programs include Homme, Csanyi, Gonzales, Rechs, O’Leary, Drabman, Kazdin, Bootzin, Staats, Cameron, Pierce, McMillan, Corcoran, Roueche, Kirk, Wheeler, Boylan, and Wilson.
International organizations, such as the World Bank or the Asian Development Bank, have studied the effects of testing on education programs they sponsor.  Researchers have included Somerset, Heyneman, Ransom, Psacharopoulos, Velez, Brooke, Oxenham, Bude, Chapman, Snyder, and Pronaratna.
Moreover, the mastery learning/mastery testing experiments conducted from the 1960s through today varied incentives, frequency of tests, types of tests, and many other factors to determine the optimal structure of testing programs. Researchers included such notables as Bloom, Carroll, Keller, Block, Burns, Wentling, Anderson, Hymel, Kulik, Tierney, Cross, Okey, Guskey, Gates, and Jones.
86 Daniel M. Koretz   "Research has yet to clarify how variations in the performance targets set for schools affect the incentives faced by teachers and the resulting validity of score gains." Dismissive Alignment, High Stakes, and the Inflation of Test Scores CRESST Report 655, June 2005 https://cresst.org/wp-content/uploads/R655.pdf
87 Daniel M. Koretz   "In terms of research, the jury is still out." Dismissive Alignment, High Stakes, and the Inflation of Test Scores CRESST Report 655, June 2005 https://cresst.org/wp-content/uploads/R655.pdf
88 Daniel M. Koretz   "The first study to evaluate score inflation empirically (Koretz, Linn, Dunbar, and Shepard, 1991) looked at a district-testing program in the 1980s that used commercial, off-the-shelf, multiple-choice achievement tests."  1stness Alignment, High Stakes, and the Inflation of Test Scores, p.7 CRESST Report 655, June 2005 https://cresst.org/wp-content/uploads/R655.pdf The most famous test score inflation study of all time -- John J. Cannell's "Lake Wobegon Effect" study -- preceded Koretz's by several years. See:  http://nonpartisaneducation.org/Review/Books/CannellBook1.htm  http://nonpartisaneducation.org/Review/Books/Cannell2.pdf  
89 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “The shortcomings of the studies make it difficult to determine the size of teacher effects, but we suspect that the magnitude of some of the effects reported in this literature are overstated.” p. xiii Denigrating Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
90 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “Using VAM to estimate individual teacher effects is a recent endeavor, and many of the possible sources of error have not been thoroughly evaluated in the literature.” p. xix Dismissive Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
91 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “Empirical evaluations do not exist for many of the potential sources of error we have identified. Studies need to be conducted to determine how these factors contribute to estimated teacher effects and to determine the conditions that exacerbate or mitigate the impact these factors have on teacher effects.” p. xix Dismissive Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
92 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “This lack of attention to teachers in policy discussions may be attributed in part to another body of literature that attempted to determine the effects of specific teacher background characteristics, including credentialing status (e.g., Miller, McKenna, and McKenna, 1998; Goldhaber and Brewer, 2000) and subject matter coursework (e.g., Monk, 1994).” p. 8 Dismissive Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
93 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “To date, there has been little empirical exploration of the size of school effects and the sensitivity of teacher effects to modeling of school effects.” p. 78 Dismissive Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
94 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “There are no empirical explorations of the robustness of estimates to assumptions about prior-year schooling effects.“ p. 81 Dismissive Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
95 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “There is currently no empirical evidence about the sensitivity of gain scores or teacher effects to such alternatives.” p. 89 Dismissive Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
96 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “Empirical evaluations do not exist for many of the potential sources of error we have identified. Studies need to be conducted to determine how these factors contribute to estimated teacher effects and to determine the conditions that exacerbate or mitigate the impact these factors have on teacher effects.” p. 116 Dismissive Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
97 Laura S. Hamilton Daniel F. McCaffrey, J.R. Lockwood, Daniel M. Koretz “Although we expect missing data are likely to be pervasive, there is little systematic discussion of the extent or nature of missing data in test score databases.” p. 117 Dismissive Evaluating Value-Added Models for Teacher Accountability  Rand Corporation, 2003 https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG158.pdf Tennessee's TVAAS value-added measurement system had been running a decade when they wrote this and did much of what these authors claim had never been done.  
98 Daniel M. Koretz   "Empirical research on the validity of score gains on high-stakes tests is limited, but the studies conducted to date show…" Dismissive Using multiple measures to address perverse incentives and score inflation, p.21 Educational Measurement: Issues and Practice, Summer 2003 https://onlinelibrary.wiley.com/doi/10.1111/j.1745-3992.2003.tb00124.x "Validity" studies are common, even routine, parts of large-scale testing programs' technical reports.   
99 Daniel M. Koretz   "Research on educators' responses to high-stakes testing is also limited, …" Dismissive Using multiple measures to address perverse incentives and score inflation, p.21 Educational Measurement: Issues and Practice, Summer 2003 https://onlinelibrary.wiley.com/doi/10.1111/j.1745-3992.2003.tb00124.x See, for example, https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
100 Daniel M. Koretz   "Although extant research is sufficient to document problems of score inflation and unintended incentives from test-based accountability, it provides very little guidance about how one might design an accountability system to lessen these problems."  Denigrating Using multiple measures to address perverse incentives and score inflation, p.22 Educational Measurement: Issues and Practice, Summer 2003 https://onlinelibrary.wiley.com/doi/10.1111/j.1745-3992.2003.tb00124.x The vast amount of information already available just for the asking, worldwide, could help build better accountability systems, without wasting more research grant money on those who refuse to study what is already available.   
101 Daniel M. Koretz   “Relatively few studies, however, provide strong empirical evidence pertaining to inflation of entire scores on tests used for accountability.” p. 759 Denigrating Limitations in the use of achievement tests as measures of educators’ productivity  The Journal of Human Resources, 37:4 (Fall 2002) http://standardizedtests.procon.org/sourcefiles/limitations-in-the-use-of-achievement-tests-as-measures-of-educators-productivity.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
102 Daniel M. Koretz   “Only a few studies have directly tested the generalizability of gains in scores on accountability-oriented tests.” p. 759 Dismissive Limitations in the use of achievement tests as measures of educators’ productivity  The Journal of Human Resources, 37:4 (Fall 2002) http://standardizedtests.procon.org/sourcefiles/limitations-in-the-use-of-achievement-tests-as-measures-of-educators-productivity.pdf "Validity" studies are common, even routine, parts of large-scale testing programs' technical reports.   
103 Daniel M. Koretz   “Moreover, while there are numerous anecdotal reports of various types of coaching, little systematic research describes the range of coaching strategies and their effects.” p. 769 Dismissive Limitations in the use of achievement tests as measures of educators’ productivity  The Journal of Human Resources, 37:4 (Fall 2002) http://standardizedtests.procon.org/sourcefiles/limitations-in-the-use-of-achievement-tests-as-measures-of-educators-productivity.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
104 Daniel M. Koretz   "Yet we have accumulating evidence that test-based accountability policies are not working as intended, and we have no adequate research-based alternative to offer to the policy community." p.774 Dismissive Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity  The Journal of Human Resources, 37:4 (Fall 2002) http://standardizedtests.procon.org/sourcefiles/limitations-in-the-use-of-achievement-tests-as-measures-of-educators-productivity.pdf Test-based accountability worked just fine before 2001, when the two now-dominant citation cartels took over all policy advising on the topic. As for alternatives to Koretz's conception of test-based accountability, two come to mind. First, there is the normal type that most of the world uses: stakes for students; no stakes for teachers; only administered every few years; administered externally and securely; full battery of subjects. Second, inspectorates (a poor substitute in my opinion) are used in other countries, and, yes, quite a lot of research has accumulated about them in the countries where they are used.  
105 Laura S. Hamilton Daniel M. Koretz "There is currently no substantial evidence on the effects of published report cards on parents’ decisionmaking or on the schools themselves." Dismissive Making Sense of Test-Based Accountability in Education, Rand Corporation, 2002 Chapter 2: Tests and their use in test-based accountability systems, p.44 https://www.rand.org/content/dam/rand/pubs/monograph_reports/2002/MR1554.pdf For decades, consulting services have existed that help parents new to a city select the right school or school district for them.  
106 Daniel M. Koretz Michael Russell, Chingwei David Shin, Cathy Horn, Kelly Shasby "Although hard data on affirmative action are scanty, most observers believe that selective institutions have widely employed it for several decades." p.2 Dismissive Testing and Diversity in Postsecondary Education: The Case of California Education Policy Analysis Archives, 10(1), January 7, 2002 https://epaa.asu.edu/ojs/article/view/280    
107 Daniel M. Koretz Michael Russell, Chingwei David Shin, Cathy Horn, Kelly Shasby "As Kane noted, 'Nearly two decades after the U.S. Supreme Court's 1978 Bakke decision, we know little about the true extent of affirmative action admissions by race or ethnicity ... Hard evidence has been difficult to obtain, primarily because many colleges guard their admissions practices closely.'" p.4 Dismissive Testing and Diversity in Postsecondary Education: The Case of California Education Policy Analysis Archives, 10(1), January 7, 2002 https://epaa.asu.edu/ojs/article/view/280    
108 Daniel M. Koretz Michael Russell, Chingwei David Shin, Cathy Horn, Kelly Shasby "Thus research leaves unclear how substantial preferences were in the states that have been at the center of the debate about the elimination of affirmative action, such as California and Texas." p.4 Dismissive Testing and Diversity in Postsecondary Education: The Case of California Education Policy Analysis Archives, 10(1), January 7, 2002 https://epaa.asu.edu/ojs/article/view/280    
109 Daniel M. Koretz Daniel F. McCaffrey, Laura S. Hamilton "Although high-stakes testing is now widespread, methods for evaluating the validity of gains obtained under high-stakes conditions are poorly developed. This report presents an approach for evaluating the validity of inferences based on score gains on high-stakes tests. It describes the inadequacy of traditional validation approaches for validating gains under high-stakes conditions and outlines an alternative validation framework for conceptualizing meaningful and inflated score gains.", p.1 Denigrating Toward a framework for validating gains under high-stakes conditions CSE Technical Report 551, CRESST/Harvard Graduate School of Education, CRESST/RAND Education, December 2001 https://files.eric.ed.gov/fulltext/ED462410.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)
110 Daniel M. Koretz Daniel F. McCaffrey, Laura S. Hamilton "Few efforts are made to evaluate directly score gains obtained under high-stakes conditions, and conventional validation tools are not fully adequate for the task.", p. 1 Dismissive Toward a framework for validating gains under high-stakes conditions CSE Technical Report 551, CRESST/Harvard Graduate School of Education, CRESST/RAND Education, December 2001 https://files.eric.ed.gov/fulltext/ED462410.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
111 Daniel M. Koretz Mark Berends  “[T]here has been little systematic research exploring changes in grading standards. …” p. iii Dismissive Changes in high school grading standards in mathematics, 1982–1992  Rand Education, 2001 http://www.rand.org/content/dam/rand/pubs/monograph_reports/2007/MR1445.pdf See a review of hundreds of studies: Brookhart, S. M., Guskey, T. R., Bowers, A. J., McMillan, J. H., Smith, J. K., Smith, L. F., Stevens, M. T., & Welsh, M. E. (2016). A Century of Grading Research: Meaning and Value in the Most Common Educational Measure. Review of Educational Research, 86(4), 803-848. doi: 10.3102/0034654316672069   http://doi.org/10.3102/0034654316672069
112 Daniel M. Koretz Mark Berends “[F]ew studies have attempted to evaluate systematically changes in grading standards over time.” p. xi Dismissive Changes in high school grading standards in mathematics, 1982–1992  Rand Education, 2001 http://www.rand.org/content/dam/rand/pubs/monograph_reports/2007/MR1445.pdf See a review of hundreds of studies: Brookhart, S. M., Guskey, T. R., Bowers, A. J., McMillan, J. H., Smith, J. K., Smith, L. F., Stevens, M. T., & Welsh, M. E. (2016). A Century of Grading Research: Meaning and Value in the Most Common Educational Measure. Review of Educational Research, 86(4), 803-848. doi: 10.3102/0034654316672069   http://doi.org/10.3102/0034654316672069
113 Lynn Olson (journalist) Daniel M. Koretz, respondent "For years, the research community has been walking behind an elephant with a broom. Policymakers start accountability systems and, on rare occasions, we have an opportunity to go in and look at what's going on."    "Reporter's Notebook" Education Week, Sept. 27, 2000   "Validity" studies are common, even routine, parts of large-scale testing programs' technical reports.   
114 Daniel M. Koretz E. A. Hanushek, J. J. Heckman, and D. Neal (organizers) "Research provides sparse guidance about how to broaden the range of measured outcomes to provide a better mix of incentives and lessen score inflation.", p.27 Dismissive Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity  Devising Incentives to Promote Human Capital, National Academy of Sciences Conference, May 2000 http://www.irp.wisc.edu/newsevents/other/symposia/koretz.pdf Relevant studies of the effects of varying types of incentive or the optimal structure of incentives include those of Kelley (1999); the *Southern Regional Education Board (1998); Trelfa (1998); Heneman (1998); Banta, Lund, Black & Oblander (1996); Brooks-Cooper (1993); Eckstein & Noah (1993); Richards & Shen (1992); Jacobson (1992); Heyneman & Ransom (1992); *Levine & Lezotte (1990); Duran (1989); *Crooks (1988); *Kulik & Kulik (1987); Corcoran & Wilson (1986); *Guskey & Gates (1986); Brook & Oxenham (1985); Oxenham (1984); Venezky & Winfield (1979); Brookover & Lezotte (1979); McMillan (1977); Abbott (1977); *Staats (1973); *Kazdin & Bootzin (1972); *O’Leary & Drabman (1971); Cronbach (1960); and Hurlock (1925).   *Covers many studies; study is a research review, research synthesis, or meta-analysis. "Others have considered the role of tests in incentive programs.  These researchers have included Homme, Csanyi, Gonzales, Rechs, O’Leary, Drabman, Kazdin, Bootzin, Staats, Cameron, Pierce, McMillan, Corcoran, and Wilson. International organizations, such as the World Bank or the Asian Development Bank, have studied the effects of testing on education programs they sponsor.  Researchers have included Somerset, Heyneman, Ransom, Psacharopoulos, Velez, Brooke, Oxenham, Bude, Chapman, Snyder, and Pronaratna.
Moreover, the mastery learning/mastery testing experiments conducted from the 1960s through today varied incentives, frequency of tests, types of tests, and many other factors to determine the optimal structure of testing programs. Researchers included such notables as Bloom, Carroll, Keller, Block, Burns, Wentling, Anderson, Hymel, Kulik, Tierney, Cross, Okey, Guskey, Gates, and Jones."
115 Daniel M. Koretz E. A. Hanushek, J. J. Heckman, and D. Neal (organizers) "...what types of accountability systems might be more effective, and what role might achievement tests play in them? Unfortunately, there is little basis in research for answering this question. The simple test-based accountability systems that have been in vogue for the past two decades have appeared so commonsensical to some policymakers that they have had little incentive to permit the evaluation of alternatives.", p.25 Dismissive Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity  Devising Incentives to Promote Human Capital, National Academy of Sciences Conference, May 2000 http://www.irp.wisc.edu/newsevents/other/symposia/koretz.pdf Externally administered high-stakes testing is widely reviled among US educationists. It strains credulity that Koretz cannot find one district out of the many thousands to cooperate with him in a study to discredit testing.  
116 Daniel M. Koretz E. A. Hanushek, J. J. Heckman, and D. Neal (organizers) "...while there are numerous anecdotal reports of various types of coaching, little systematic research describes the range of coaching strategies and their effects.", p.24 Denigrating Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity  Devising Incentives to Promote Human Capital, National Academy of Sciences Conference, May 2000 http://www.irp.wisc.edu/newsevents/other/symposia/koretz.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
117 Daniel M. Koretz E. A. Hanushek, J. J. Heckman, and D. Neal (organizers) "Only a few studies have directly tested the generalizability of gains in scores on accountability-oriented tests.", p.11 Denigrating Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity  Devising Incentives to Promote Human Capital, National Academy of Sciences Conference, May 2000 http://www.irp.wisc.edu/newsevents/other/symposia/koretz.pdf "Validity" studies are common, even routine, parts of large-scale testing programs' technical reports.   
118 Daniel M. Koretz E. A. Hanushek, J. J. Heckman, and D. Neal (organizers) "Relatively few studies, however, provide strong empirical evidence pertaining to inflation of entire scores on tests used for accountability.  Policy makers have little incentive to facilitate such studies, and they can be difficult to carry out.", p.11 Denigrating Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity  Devising Incentives to Promote Human Capital, National Academy of Sciences Conference, May 2000 http://www.irp.wisc.edu/newsevents/other/symposia/koretz.pdf Externally administered high-stakes testing is widely reviled among US educationists. It strains credulity that Koretz cannot find one district out of the many thousands to cooperate with him in a study to discredit testing.  
119 Daniel M. Koretz Laura Hamilton "Efforts to increase the participation of students with disabilities in large-scale assessments, however, are hindered by a lack of experience and systematic information (National Research Council, 1997). For example, there is little systematic information on the use or effects of special testing accommodations for elementary and secondary students with disabilities." Dismissive Assessing Students With Disabilities in Kentucky: The Effects of Accommodations, Format, and Subject, p.2 CSE Technical Report 498, CRESST/Rand Education, January 1999 https://files.eric.ed.gov/fulltext/ED440148.pdf Difficult to believe given that the federal government has for decades generously funded research into testing students with disabilities. See, for example, https://nceo.info/ and Kurt Geisinger's and Janet Carlson's chapters in Defending Standardized Testing and Correcting Fallacies in Educational and Psychological Testing.   
120 Daniel M. Koretz Laura Hamilton "In addition, there is little evidence about the effects of format differences on the assessment of students with disabilities." Dismissive Assessing Students With Disabilities in Kentucky: The Effects of Accommodations, Format, and Subject, p.2 CSE Technical Report 498, CRESST/Rand Education, January 1999 https://files.eric.ed.gov/fulltext/ED440148.pdf Difficult to believe given that the federal government has for decades generously funded research into testing students with disabilities. See, for example, https://nceo.info/ and Kurt Geisinger's and Janet Carlson's chapters in Defending Standardized Testing and Correcting Fallacies in Educational and Psychological Testing.   
121 Daniel M. Koretz Laura Hamilton "Others have argued the opposite, pointing out that open-response questions, for example, mix verbal skills with other skills to be measured and may make it more difficult to isolate and compensate for the effects of disabilities. Relevant research, however, is scarce." Dismissive Assessing Students With Disabilities in Kentucky: The Effects of Accommodations, Format, and Subject, p.2 CSE Technical Report 498, CRESST/Rand Education, January 1999 https://files.eric.ed.gov/fulltext/ED440148.pdf Difficult to believe given that the federal government has for decades generously funded research into testing students with disabilities. See, for example, https://nceo.info/ and Kurt Geisinger's and Janet Carlson's chapters in Defending Standardized Testing and Correcting Fallacies in Educational and Psychological Testing.   
122 Daniel M. Koretz Laura Hamilton "There is a clear need for additional descriptive studies of the performance of students with disabilities in large-scale assessments. In our earlier study, we noted that research evidence was sparse …" Dismissive Assessing Students With Disabilities in Kentucky: The Effects of Accommodations, Format, and Subject, p.56 CSE Technical Report 498, CRESST/Rand Education, January 1999 https://files.eric.ed.gov/fulltext/ED440148.pdf Difficult to believe given that the federal government has for decades generously funded research into testing students with disabilities. See, for example, https://nceo.info/ and Kurt Geisinger's and Janet Carlson's chapters in Defending Standardized Testing and Correcting Fallacies in Educational and Psychological Testing.   
123 Daniel M. Koretz Sheila I. Barron "In the absence of systematic research documenting test-based accountability systems that have avoided the problem of inflated gains …” p. xvii Dismissive The validity of gains in scores on the Kentucky Instructional Results Information System (KIRIS)  Rand Education, 1998 http://www.rand.org/content/dam/rand/pubs/monograph_reports/2009/MR1014.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
124 Daniel M. Koretz Sheila I. Barron "This study also illustrated in numerous ways the limitations of current research on the validity of gains.” p. xviii Dismissive The validity of gains in scores on the Kentucky Instructional Results Information System (KIRIS)  Rand Education, 1998 http://www.rand.org/content/dam/rand/pubs/monograph_reports/2009/MR1014.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
125 Daniel M. Koretz Sheila I. Barron  “The field of measurement has seen many decades of intensive development of methods for evaluating scores cross-sectionally, but much less attention has been devoted to the problem of evaluating gains. . . . [T]his methodological gap is likely to become ever more important.” p. 122 Dismissive The validity of gains in scores on the Kentucky Instructional Results Information System (KIRIS)  Rand Education, 1998 http://www.rand.org/content/dam/rand/pubs/monograph_reports/2009/MR1014.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
126 Daniel M. Koretz Sheila I. Barron  “The contrast between mathematics … and reading … underlines the limits of our current knowledge of the mechanisms that underlie score inflation.” p. 122 Dismissive The validity of gains in scores on the Kentucky Instructional Results Information System (KIRIS)  Rand Education, 1998 http://www.rand.org/content/dam/rand/pubs/monograph_reports/2009/MR1014.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
127 Daniel M. Koretz reported by Debra Viadero “...all of the researchers interviewed agreed with FairTest’s contention that research evidence supporting the use of high-stakes tests as a means of improving schools is thin.”   Dismissive FairTest report questions reliance on high-stakes testing by states Debra Viadero, Education Week, January 28, 1998.   In fact, a very large number of studies do so. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm  
128 Robert L. Linn Daniel M. Koretz, Eva Baker “’Yet we do not have the necessary comprehensive dependable data. . . .’ (Tyler 1996a, p. 95)” p. 8 Dismissive Assessing the Validity of the National Assessment of Educational Progress CSE Technical Report 416 (June 1996) http://www.cse.ucla.edu/products/reports/TECH416.pdf There was extended discussion and consideration. Simply put, they did not get their way because others disagreed with them.  
129 Robert L. Linn Daniel M. Koretz, Eva Baker “There is a need for more extended discussion and reconsideration of the approach being used to measure long-term trends.” p. 21  Dismissive Assessing the Validity of the National Assessment of Educational Progress CSE Technical Report 416 (June 1996) http://www.cse.ucla.edu/products/reports/TECH416.pdf There was extended discussion and consideration. Simply put, they did not get their way because others disagreed with them.  
130 Robert L. Linn Daniel M. Koretz, Eva Baker “Only a small minority of the articles that discussed achievement levels made any mention of the judgmental nature of the levels, and most of those did so only briefly.” p. 27 Denigrating Assessing the Validity of the National Assessment of Educational Progress CSE Technical Report 416 (June 1996) http://www.cse.ucla.edu/products/reports/TECH416.pdf All achievement levels, just like all course grades, are set subjectively. This information was never hidden.  
131 Daniel M. Koretz Erik A. Hanushek, D.W. Jorgenson (Eds.) "Despite the long history of assessment-based accountability, hard evidence about its effects is surprisingly sparse, and the little evidence that is available is not encouraging. ...The large positive effects assumed by advocates...are often not substantiated by hard evidence...." p.172 Dismissive Using student assessments for educational accountability Improving America’s schools: The role of incentives. Washington, D.C.: National Academy Press, 1996 https://www.nap.edu/catalog/5143/improving-americas-schools-the-role-of-incentives In fact, a very large number of studies provide such evidence. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm
132 Daniel M. Koretz Erik A. Hanushek, D.W. Jorgenson (Eds.) "The testing of the 1980s reform movement fell into disfavor surprisingly soon. Confidence in the reforms was so high at the outset that few programs were evaluated realistically." p.173 Dismissive Using student assessments for educational accountability Improving America’s schools: The role of incentives. Washington, D.C.: National Academy Press, 1996 https://www.nap.edu/catalog/5143/improving-americas-schools-the-role-of-incentives In fact, a very large number of studies evaluated those programs. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm
133 Daniel M. Koretz Erik A. Hanushek, D.W. Jorgenson (Eds.) "Although overconfidence in the test-based reforms of the 1980s resulted in a scarcity of research on their impact, there is enough evidence to paint a discouraging picture." p.181 Dismissive Using student assessments for educational accountability Improving America’s schools: The role of incentives. Washington, D.C.: National Academy Press, 1996 https://www.nap.edu/catalog/5143/improving-americas-schools-the-role-of-incentives In fact, a very large number of studies provide such evidence. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm
134 Daniel M. Koretz Erik A. Hanushek, D.W. Jorgenson (Eds.) "Although Cannell's report was wrong in some of the specifics, his basic conclusion that an implausible proportion of jurisdictions were above the national average was confirmed." Denigrating Using student assessments for educational accountability Improving America’s schools: The role of incentives. Washington, D.C.: National Academy Press, 1996 https://www.nap.edu/catalog/5143/improving-americas-schools-the-role-of-incentives No. Cannell was exactly right. The cause was corruption, lax security, and cheating. See, for example, https://nonpartisaneducation.org/Review/Articles/v6n3.htm
135 Daniel M. Koretz Erik A. Hanushek, D.W. Jorgenson (Eds.) "Nevertheless, evidence about the instructional effects of performance assessment programs remains scarce. It is not clear under what circumstances these programs are conducive to improved teaching or what the effects are on student achievement." p.188 Dismissive Using student assessments for educational accountability Improving America’s schools: The role of incentives. Washington, D.C.: National Academy Press, 1996 https://www.nap.edu/catalog/5143/improving-americas-schools-the-role-of-incentives In fact, a very large number of studies provide such evidence. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm
136 Daniel M. Koretz Erik A. Hanushek, D.W. Jorgenson (Eds.) "The discussion above represents a fairly discouraging assessment of test-based accountability. Traditional approaches have not worked well, and the scanty available evidence does not suggest that shifting to innovative testing formats will overcome their deficiencies." p.189 Dismissive Using student assessments for educational accountability Improving America’s schools: The role of incentives. Washington, D.C.: National Academy Press, 1996 https://www.nap.edu/catalog/5143/improving-americas-schools-the-role-of-incentives In fact, a very large number of studies provide such evidence. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm
137 Daniel M. Koretz   "Some observers have maintained that performance assessments used for accountability are vulnerable to the same problem [of score inflation]. Evidence at this point is scarce...." p.52 Dismissive Final Report: Perceived Effects of the Maryland School Performance Assessment Program CSE Technical Report 409, CRESST/Rand Education, March 1996 http://cresst.org/wp-content/uploads/TECH409.pdf In fact the test prep, or test coaching, literature is vast and dates back decades, with meta-analyses of the literature dating back at least to the 1970s. There's even a What Works Clearinghouse summary of the (post World Wide Web) college admission test prep research literature:  https://ies.ed.gov/ncee/wwc/Docs/InterventionReports/wwc_act_sat_100416.pdf . See also:  Ortar (1960)  Marron (1965)  ETS (1965). Messick & Jungeblut (1981)  Ellis, Konoske, Wulfeck, & Montague (1982)  DerSimonian and Laird (1983)  Kulik, Bangert-Drowns & Kulik (1984)  Powers (1985)  Jones (1986). Fraker (1986/1987)  Halpin (1987)  Whitla (1988)  Snedecor (1989)  Bond (1989). Baydar (1990)  Becker (1990)  Smyth (1990)  Moore (1991)  Alderson & Wall (1992)  Powers (1993)  Oren (1993). Powers & Rock (1994)  Scholes, Lane (1997)   Allalouf & Ben Shakhar (1998)  Robb & Ercanbrack (1999)  McClain (1999)  Camara (1999, 2001, 2008) Stone & Lane (2000, 2003)  Din & Soldan (2001)  Briggs (2001)  Palmer (2002)  Briggs & Hansen (2004)  Cankoy & Ali Tut (2005)  Crocker (2005)  Allensworth, Correa, & Ponisciak (2008)  Domingue & Briggs (2009)  Koljatic & Silva (2014)  Early (2019)  
138 Daniel M. Koretz   "Despite the intense controversy engendered by proposals for national testing, questions in the second and third sets-the essential questions about the practicality and likely effects of national testing-have been aired insufficiently in many quarters." p.31 Dismissive A Call for Caution: NAEP and National Testing: Issues and Implications for Educators NASSP Bulletin, September 1992   There was extended discussion and consideration. Simply put, they did not get their way because others disagreed with them.  
139 Daniel M. Koretz   "Although these proposed new uses for NAEP may seem relatively straightforward, they actually raise a number of difficult technical issues. I will note four, none of which has received sufficient attention in the policy debate about national testing." p.34 Dismissive A Call for Caution: NAEP and National Testing: Issues and Implications for Educators NASSP Bulletin, September 1992   There was extended discussion and consideration. Simply put, they did not get their way because others disagreed with them.
140 Daniel M. Koretz   "Data [from the NAEP] about educational factors that influence achievement are sparse,… " p.37 Dismissive A Call for Caution: NAEP and National Testing: Issues and Implications for Educators NASSP Bulletin, September 1992   However, there exist an abundance of other sources of that information which could be combined with NAEP data to paint the bigger picture.  
141 Daniel M. Koretz   "Moreover, even if NAEP could be strengthened to the point where it could reliably identify states with better educational programs, it would be unable, as is currently structured, to provide trustworthy information about which aspects of those programs matter, because its information on educational policies and practices is limited." pp.37–38 Dismissive A Call for Caution: NAEP and National Testing: Issues and Implications for Educators NASSP Bulletin, September 1992   However, there exist an abundance of other sources of that information which could be combined with NAEP data to paint the bigger picture.  
142 Daniel M. Koretz Robert L. Linn, Stephen Dunbar, Lorrie A. Shepard "Evidence relevant to this debate has been limited." p. 2 Dismissive The Effects of High-Stakes Testing On Achievement: Preliminary Findings About Generalization Across Tests  Originally presented at the annual meeting of the AERA and the NCME, Chicago, April 5, 1991 http://nepc.colorado.edu/files/HighStakesTesting.pdf In fact, a very large number of studies provide such evidence. See, for example, https://journals.sagepub.com/doi/abs/10.1177/0193841X19865628#abstract & https://www.tandfonline.com/doi/full/10.1080/15305058.2011.602920 ; https://nonpartisaneducation.org/Review/Resources/QuantitativeList.htm ; https://nonpartisaneducation.org/Review/Resources/SurveyList.htm ; https://nonpartisaneducation.org/Review/Resources/QualitativeList.htm
                   
  IRONIES:                
  Daniel M. Koretz   "I discuss a number of important issues that have arisen in K-12 testing and explore their implications for testing in the postsecondary sector. These include ... overstating comparability ... and unwarranted causal inference."   Measuring Postsecondary Achievement: Lessons from Large-Scale Assessments in the K-12 Sector Higher Education Policy, April 24, 2019, Abstract https://link.springer.com/article/10.1057/s41307-019-00142-4    
  Daniel M. Koretz   "Although this problem has been documented for more than a quarter of a century, it is still widely ignored, and the public is fed a steady diet of seriously misleading information about improvements in schools."   The Testing Charade: Pretending to Make Schools Better [Kindle location 723] University of Chicago Press, 2017      
  Daniel M. Koretz   "It is worth considering why we are so unlikely to ever find out how common cheating has become. … the press remains gullible…"   The Testing Charade: Pretending to Make Schools Better [Kindle location 1424] University of Chicago Press, 2017      
  Daniel M. Koretz     "…putting a stop to this disdain for evidence--this arrogant assumption that we know so much that we don't have to bother evaluating our ideas before imposing them on teachers and students--is one of the most important changes we have to make."   The Testing Charade: Pretending to Make Schools Better [Kindle location 2573] University of Chicago Press, 2017      
  Daniel M. Koretz   "But the failure to evaluate the reforms also reflects a particular arrogance."   The Testing Charade: Pretending to Make Schools Better [Kindle location 3184] University of Chicago Press, 2017      
  Daniel M. Koretz   "I've several times excoriated some of the reformers for assuming that whatever they dreamed up would work well without turning to actual evidence."   The Testing Charade: Pretending to Make Schools Better [Kindle location 3229] University of Chicago Press, 2017      
  Daniel M. Koretz Jennifer L. Jennings "Data are considered proprietary—a position that the restrictions imposed by the federal Family Educational Rights and Privacy Act (FERPA) have made easier to maintain publicly. Access is usually provided only for research which is not seen as unduly threatening to the leaders’ immediate political agendas. The fact that this last consideration is often openly discussed underscores the lack of a culture of public accountability."   The Misunderstanding and Use of Data from Educational Tests, pp.4-5 Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities/    
  Daniel M. Koretz Jennifer L. Jennings "This unwillingness to countenance honest but potentially threatening research garners very little discussion, but in this respect, education is an anomaly. In many areas of public policy, such as drug safety or vehicle safety, there is an expectation that the public is owed honest and impartial evaluation and research. For example, imagine what would have happened if the CEO of Merck had responded to reports of side-effects from Vioxx by saying that allowing access to data was “not our priority at present,” which is a not infrequent response to data requests made to districts or states. In public education, there is no expectation that the public has a right to honest evaluation, and data are seen as the policymakers’ proprietary sandbox, to which they can grant access when it happens to serve their political needs."   The Misunderstanding and Use of Data from Educational Tests, p.5 Prepared for Spencer Foundation meetings, Chicago, IL, February 11, 2010. Revised November 21, 2010 http://www.spencer.org/data-use-and-educational-improvement-initiative-activities/
  Daniel M. Koretz   "One sometimes disquieting consequence of the incompleteness of tests is that different tests often provide somewhat inconsistent results." (p. 10)   Measuring up: What educational testing really tells us. Harvard University Press, 2008  Google Books
  Daniel M. Koretz   "Even a single test can provide varying results. Just as polls have a margin of error, so do achievement tests. Students who take more than one form of a test typically obtain different scores." (p. 11)   Measuring up: What educational testing really tells us. Harvard University Press, 2008  Google Books    
  Daniel M. Koretz   "Even well-designed tests will often provide substantially different views of trends because of differences in content and other aspects of the tests' design. . . . [W]e have to be careful not to place too much confidence in detailed findings, such as the precise size of changes over time or of differences between groups." (p. 92)   Measuring up: What educational testing really tells us. Harvard University Press, 2008  Google Books    
  Daniel M. Koretz   "[O]ne cannot give all the credit or blame to one factor . . . without investigating the impact of others. Many of the complex statistical models used in economics, sociology, epidemiology, and other sciences are efforts to take into account (or 'control for') other factors that offer plausible alternative explanations of the observed data, and many apportion variation in the outcome-say, test scores-among various possible causes. …A hypothesis is only scientifically credible when the evidence gathered has ruled out plausible alternative explanations." (pp. 122-123)   Measuring up: What educational testing really tells us. Harvard University Press, 2008  Google Books Yet, in his studies test administration and security characteristics are totally left out, as if they could not matter.
  Daniel M. Koretz   "[A] simple correlation need not indicate that one of the factors causes the other." (p. 123)   Measuring up: What educational testing really tells us. Harvard University Press, 2008  Google Books Yet, Koretz rejects decades of experimental evidence on test coaching and, instead, relies on purely correlational, apples-and-oranges comparisons of unrelated tests.
  Daniel M. Koretz   "Any number of studies have shown the complexity of the non-educational factors that can affect achievement and test scores." (p. 129)   Measuring up: What educational testing really tells us. Harvard University Press, 2008  Google Books    
  Daniel M. Koretz   "For a test to be even approximately parallel, it has to be so close in content that the effects of inappropriate coaching are likely to generalize to some degree to the new form."   Measuring up: What educational testing really tells us. Harvard University Press, 2008  Google Books Yet, he argues that high-stakes tests should be "audited" by comparing their score trends to those of unrelated no-stakes tests.  
                   
    Cite themselves or colleagues in the group, but dismiss or denigrate all other work
    Falsely claim that research has only recently been done on the topic.
      Author cites (and accepts as fact without checking) someone else's dismissive review            
                   
  * Cannell, J.J. (1987). Nationally Normed Elementary Achievement Testing in America's Public Schools: How All Fifty States are Above the National Average, Daniels, WV: Friends for Education;  Cannell, J.J. (1989). How Public Educators Cheat on Standardized Achievement Tests: The “Lake Wobegon” Report. Albuquerque, NM: Friends for Education.