Why the reliability of UK Examination Boards’ assessment of A Level writing papers is questionable


Often, our Year 12 or Year 13 students who have consistently scored high in mock exams or other assessments of the writing component of the A Level exam paper do significantly less well in the actual exam. And when the teachers and/or students, in disbelief, apply for a remark, they often see the controversial original grade reconfirmed or, as has actually happened to two of my students in the past, even lowered. In the last two years, colleagues of mine around the world have seen this phenomenon worsen: are UK examination boards becoming harsher or stricter in their grading? Or is it that the essay papers are becoming more complicated? Or could it be that the current students are generally less able than the previous cohorts?

Although I do not discount any of the above hypotheses, I personally believe that the phenomenon is also partly due to serious issues undermining the reliability of the assessment procedures involved in the grading of A Level papers by external assessors. The main issues concern:

(1) the assessment rubrics used to grade the essays, which lend themselves, as I intend to show, to subjective interpretation by the assessors;

(2) the way raters form their judgement about the quality of an essay;

(3) the absence of an inter-rater reliability process – which compounds the previous issues.

1. Subjective interpretation of rubrics

1.1 The fuzziness of the terms ‘Fluency’ and ‘Communication’

The interpretation of what ‘fluency’ means is one of the most controversial issues in Applied Linguistics research. If one asks ten teachers what the word refers to, every single one of them will say they know. However, when it comes to articulating what ‘good’ or ‘very good’ fluency involves and to giving criteria to measure it, each of them will provide different criteria and measures. How do we know this? Several studies have shown that this is the case (Chambers et al., 1995). Moreover, it suffices to look at how greatly the measures of fluency vary across the plethora of analytical/holistic scales used in applied linguistics research to realize how ambiguous a concept ‘fluency’ is. As Bruton and Kirby (1987) put it:

“The word fluency crops up often in discussions of written composition and holds an ambiguous position in theory and in practice…Written fluency is not easily explained, apparently, even when researchers rely on simple, traditional measures such as composing rate. Yet, when any of these researchers referred to the term fluency, they did so as though the term were already widely understood and not in need of any further explication.” (p. 89)

Yet, rubrics used in the assessment of the writing component of the Edexcel AS level exams use ‘fluency’ as one of the criteria to attain the highest level of Quality of Language. This is taken from the writing assessment rubric on page 32 of the Edexcel specification:

“Excellent communication; high level of accuracy; language almost always fluent, varied and appropriate.”

If scholars and researchers around the world cannot agree on what this concept means, how can teachers be expected to do so? The solution? Edexcel could either eliminate the reference to fluency or provide a clear explanation of what it refers to, so that teachers understand what is expected of their students at the highest level of the Quality of Language scale.

Another issue is the use of the concept of ‘Communication’ in the above example from Edexcel. Communication is a very fuzzy concept, too, as it subsumes quite a wide range of skills as well as linguistic and sociolinguistic features. What does ‘Excellent communication’ actually mean? Here, too, it would be useful for teachers to know what the examiners mean and what features would constitute a step up from ‘good communication’ (the criterion which characterizes the previous level).

1.2 What is a ‘simple’ and what is a ‘complex’ structure?  

Researchers have also found that teachers often do not agree on what constitutes a complex structure (Chambers et al., 1995). What appears very complex to one teacher may appear moderately complex or relatively easy to another, depending on their own personal bias and experience. In the light of this ambiguity, one can see how the two top levels of the AQA ‘Complexity of language’ scale lend themselves to subjective interpretation.

  1. Very wide range of complex structures;
  2. A wide range of structures, including complex constructions;

How does one decide with absolute certainty when a structure is ‘simple’ or ‘complex’ or somewhere in the middle? Interestingly, the Edexcel specification lists in the Appendix (pages 72-74) all the structures an A Level candidate is expected to have learnt by the end of the course. Each time I go through them – despite an MA and a PhD in Applied Linguistics and 25 years of foreign language teaching – I find it difficult to draw the line between less and more complex structures. For instance, does agreement count as a complex grammar structure? Some of my colleagues think it does not, whilst they think the subjunctive does; but I can think of contexts in which the application of agreement rules is more challenging for students than the deployment of certain subjunctive rules. Moreover, say a student produces a set phrase (formulaic language) such as ‘Quoi qu’il en soit’ (= whatever the case), which I teach as a ready-made ‘chunk’ to all of my students before I even teach the subjunctive. Does it count as a complex structure, even though the students use it without knowing the grammar ‘behind’ it? And saying “One must use one’s professional judgment” is not good enough, because that is precisely how we legitimize subjectivity!

1.3 When is language ‘rich’?

The Edexcel ‘Range and Application of language’ trait, a subset of the holistic scale used to assess the Research essay, contains the following criteria at the top two levels:

7–8 A wide range of appropriate lexis and structures; successful manipulation of language.

9-10 Rich and complex language; very successful manipulation of language.

We have already dealt with the issue of ‘complex language’. Now let us focus on the adjective ‘rich’. What constitutes ‘rich’ language? How does it differ from ‘A wide range of appropriate lexis and structures’? Although one can sense the difference between the two levels, it is not clear what A2 candidates need to write in their essays for the language to be considered ‘rich’. Again, it would be helpful to obtain from Edexcel clear guidelines as to what constitutes ‘rich language’ (e.g. how many and what kind of idioms one should use) so that raters do not assign a grade impressionistically.

1.4 Use of intensifiers in the rubrics to indicate progression

Let us go back to the example above from AQA:

5 (marks) – Very wide range of complex structures;

4 (marks) –  A wide range of structures, including complex constructions;

Considering that the word limit set by AQA is 250 words, how many ‘complex’ structures can an A Level student pack into such a limited space so as to be considered as using a ‘very wide’ range of complex structures? Does the use of 10 different complex structures qualify as a ‘wide’ or a ‘very wide’ range? Does this encourage a student, in order to score a ‘5’, to artificially use as many ‘complex’ structures as possible, potentially to the detriment of the message he/she is trying to convey?

This issue stems from the overuse in rubrics, very fashionable in this day and age, of incrementally stronger intensifiers to define progression from one level to another. This looks intuitively right, and maybe it is in theory, but in practice it creates a lot of ambiguity. Here is another example, this time from the Edexcel essay-organization rubric (A Level specification, page 46); the two statements below define the top levels of the taxonomy:

10–12 Organisation and development logical and clear.

13–15 Extremely clear and effective organization and development of ideas

I personally find it difficult to differentiate between ‘clear’ and ‘extremely clear’; when does ‘clear’ become ‘extremely clear’? Moreover, doesn’t ‘effective organization’ mean ‘clear’ organization to most people, ‘clear’ meaning that the text is both cohesive and coherent? I believe that in order to be fair to teachers and students, Edexcel should provide several ‘extremely clear’ examples of what constitutes ‘clear’ and ‘extremely clear’ organization.

2. The exam format

This issue concerns the Edexcel Research essay component. Let us look at the top descriptors for the scale “Understanding, reading and research”:

19–24 Good to very good understanding; clear evidence of in-depth reading and research.

25–30 Very good to excellent understanding; clear evidence of extensive and in-depth reading and research.

Considering that the set word limit is only 240 to 270 words, how can a student provide clear evidence of extensive reading and research? And what is meant by ‘extensive’, anyway?

3. General issues undermining reliability of essay-rater assessment

Research shows that there are issues which compound the problems already highlighted above. These issues relate to findings about the way essay raters go about grading essays, which have the potential to affect the objectivity of the assessment process.

3.1 The pre-scoring stage

It appears that while reading an essay, which raters usually do in the pre-scoring stage, they are already forming a judgment which does not necessarily refer to the categories in the assessment rubrics. Although they do usually attempt to make their judgment fit the categories when awarding the marks, it is difficult to dispel the positive or negative effect that this initial bias has on the objectivity of the grading.

3.2 Idiosyncratic focus on specific categories

Some raters seem to focus on specific categories more than others. The type of category they focus on can vary greatly from rater to rater. One rater, for instance, will focus on grammar, whilst another will concentrate on lexical choice or spelling. This is important, as the extent to which a rater focuses his/her attention on grammar, for instance, may bias him/her against an essay with quite a few grammatical mistakes but great content, and lead him/her to be ‘harsher’ even when s/he tries to apply the assessment rubrics objectively.

3.3 Level of engagement with the rubrics during the assessment

Research also shows that some raters engage meticulously in the reading of the rubrics so as to apply them as accurately as possible, whereas others give them a much more superficial read and apply their own impressions. In the light of the first section of this article, an assessor ought to be as meticulous as possible in ensuring s/he applies the descriptors correctly, even when s/he is experienced in the use of the rubrics. How can one be sure that the rater assessing our students’ essays belongs to the conscientious and meticulous sort rather than the more superficial kind?

Implications for A Level examination boards

The above issues highlight the importance of implementing measures to control for the threats to the reliability of the essay assessment process. Three measures can be undertaken:

(1) As already suggested above, the wording of the rubrics may be changed, in order to disambiguate the meaning of certain statements/criteria in the rubrics. As McNamara (1996, p.118) points out, “the refinement of descriptors and training workshops are vital to rating consistency”.

(2) Examination boards should train their examiners in the use of assessment rubrics more frequently than they currently do.

(3) Most importantly, examination boards should implement multiple-marking procedures whereby each essay is graded by two or more raters who, in the event of serious grading discrepancies (that is, of low inter-rater reliability), engage in a discussion to address the issues which cause them to disagree. As Wu Siew Mei states in a brilliant study (whose date I could not locate, but which I found at http://www.nus.edu.sg/celc/research/books/relt/vol9/no2/069to104_wu.pdf):

“it is […] a good strategy to do multiple ratings where each script is rated by more than one rater and where there is a clear procedure for reconciliation of varied scores. However, such strategies are again limited by manpower availability and time constraints.”
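To make the idea of multiple rating with a reconciliation step more concrete, here is a minimal sketch in Python. It is an illustration only, not a procedure used by any examination board: the marks are invented, the function names (`cohen_kappa`, `scripts_to_reconcile`) and the one-mark tolerance are assumptions made for the example, and Cohen's kappa is just one of several agreement statistics that could be chosen.

```python
from collections import Counter


def cohen_kappa(marks_a, marks_b):
    """Cohen's kappa for two raters who each award one mark per script.

    Kappa compares the observed proportion of identical marks with the
    proportion expected if the two raters marked independently.
    """
    if len(marks_a) != len(marks_b) or not marks_a:
        raise ValueError("both raters must mark the same non-empty set of scripts")
    n = len(marks_a)
    observed = sum(a == b for a, b in zip(marks_a, marks_b)) / n
    freq_a, freq_b = Counter(marks_a), Counter(marks_b)
    expected = sum(freq_a[m] * freq_b[m] for m in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters gave every script the same single mark
    return (observed - expected) / (1 - expected)


def scripts_to_reconcile(marks_a, marks_b, tolerance=1):
    """Indices of scripts whose two marks differ by more than `tolerance`.

    The tolerance of one mark is an assumption chosen for illustration.
    """
    return [
        i for i, (a, b) in enumerate(zip(marks_a, marks_b))
        if abs(a - b) > tolerance
    ]


# Hypothetical marks (out of 5) awarded by two raters to the same six essays.
rater_1 = [5, 4, 3, 5, 2, 4]
rater_2 = [5, 3, 3, 2, 2, 4]

kappa = cohen_kappa(rater_1, rater_2)
flagged = scripts_to_reconcile(rater_1, rater_2)
print(f"kappa = {kappa:.2f}; scripts needing discussion: {flagged}")
```

With these invented marks, kappa comes out at about 0.57 and only the fourth essay (index 3) is flagged for discussion between the two raters. Note that plain kappa treats marks as unordered labels, so a one-mark and a three-mark disagreement count the same; a weighted variant may suit ordered mark scales better.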

Currently, (3) does not happen. This is a serious flaw in the current assessment procedures of UK examination boards, in view of the subjectivity of the assessment scale descriptors they use and of the other issues pointed out in the course of this article. Placing the onus of essay rating on a single marker is unfair and unreliable. UK examination boards should act on this shortcoming as soon as possible, regardless of the extra costs, training and other issues which changing the current system would entail. After all, the grades our students obtain at A Level can have huge repercussions on their university applications and their future in general. This consideration should come first.

