Ahamadian, M.J. & Tavakoli, M. (2011). The effects of simultaneous use of careful online planning and task repetition on accuracy, complexity, and fluency in EFL learners’ oral production. Language Teaching Research, 15(1), 35-59.
Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99-115.
Bijani, H. (2010). Raters’ perception and expertise in evaluating second language compositions. The Journal of Applied Linguistics, 3(2), 69-89.
Cohen, L., Manion, L. & Morrison, K. (2007). Research methods in education. London: Routledge.
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117-135.
Eckes, T. (2015). Introduction to many-facet Rasch measurement. Frankfurt: Peter Lang Edition.
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer? Language Testing, 19(4), 347-368.
Fulcher, G., Davidson, F., & Kamp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.
Gardner, R. (1992). Task classification, methodology and task selection. Unpublished manuscript, Department of Linguistics and Applied Linguistics,University of Melbourne.
In’nami, Y., & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies. Language Testing, 33(3), 341-366.
Jeong, H., & Hashizume, H. (2011). Testing second language oral proficiency in direct and semidirect settings: A social-cognitive neuroscience perspective. Language Learning, 61(3), 675-699.
Khabbazbashi, N. (2017). Topic and background knowledge effects on performance in speaking assessment. Language Testing, 34(1), 23-48.
Kuiken, F., & Vedder, I. (2014). Raters’ decisions, rating procedures and rating scales. Language Testing, 31(3), 279-284.
Kyle, K., Crossley, S. A., & McNamara, D. S. (2016). Construct validity in TOEFL iBT speaking tasks: Insights from natural language processing. Language Testing, 33(3), 319-340.
Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing, 31(2), 177-204.
Ling, G., Mollaun, P., & Xi, X. (2014). A study on the impact of fatigue on human raters when scoring speaking responses. Language Testing, 31(4), 479-499.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71.
May, L. (2009). Co-constructed interaction in a paired speaking test: The rater’s perspective. Language Testing, 26(3), 397-421.
Nakatsuhara, F. (2011). Effect of test-taker characteristics and the number of participants in group oral tests. Language Testing, 28(4), 483-508.
O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair task performance. Language Testing, 19(3), 277-295.
Robinson, P. (2001). Task complexity, task difficulty and task production: Exploring interactions in a componential framework. Applied Linguistics, 21(1), 27-57.
Skehan, P (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Skehan, P., & Foster, P. (1999). The influence of task structure and processing conditions on narrative retellings. Language Learning, 49(1), 93-120.
Steiger, J. H., (1980). Test for comparing elements of a correlation matrix. Psychological Bulletin, 87(2), 245-251.
Trace, J., Janssen, G., & Meier, V. (2017). Measuring the impact of rater negotiation in writing performance assessment. Language Testing, 34(1), 3-22.
Winke, P., Gass, S., & Myford, C. (2012). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231-252.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 369-386.