Every so often, whenever a government, lobby, or educational or business establishment wants to persuade us of the value of changing our current instructional approach or taking on a new initiative, we are presented with ground-breaking research findings from some study which seems to support its case for the envisaged innovation. In what follows, I will concisely list and discuss the main shortcomings common to many educational studies, which seriously undermine their validity and should give us reason to be skeptical about their claims.
One small caveat, before we proceed: I am a strong believer in the importance of staying open to learning and innovation, but I also believe that many teachers are insufficiently conversant with research methodology and procedures, which makes them more ‘vulnerable’ to sensationalist research claims. This article – and my blog in general – aims at addressing such knowledge gaps.
- Use of verbal reports (e.g. questionnaires, interviews, concurrent/retrospective think-aloud reports)
Scores of books and academic-journal articles have been published which argue against the use of verbal report data to draw any objective conclusions about a phenomenon, hypothesis or educational methodology being investigated. Why? Because they do not yield direct, objective data but merely subjective interpretations/reconstructions of events by teachers or learners and their perceptions and opinions about self- and other-phenomena inaccessible through objective means. Imagine, for instance, what a group of disgruntled teachers – the negative ‘clique’ in the staffroom – would write in an anonymous survey about their senior leadership team. Objective statements?
Retrospective verbal reports, whereby the learners reconstruct a posteriori their thought processes during the execution of a task they have just carried out, are also unreliable: they will not capture the automatic mental operations (which by-pass consciousness) that occurred during performance, and they will be incomplete (due to working memory loss).
The practice of using three or more different forms of verbal reports (triangulation) does strengthen the validity of the data; however, because of the huge logistic effort that using so many data elicitation procedures involves, very few studies, if any, implement it.
What is surprising is that, despite their widely-acknowledged subjectivity and unreliability, verbal reports, especially questionnaires, are still widely used in educational research and their findings make it to the headlines of reputable newspapers and social media and end up seriously affecting our professional practice!
- Use of observational data
Observational data are also ‘tricky’, even when the phenomena observed are recorded on video. This is because when the researchers analyze a video of a lesson, for instance, they need to code each observed student/teacher behaviour under a category, a label (e.g. ‘recast’, ‘request for help’, ‘explicit correction’, ‘critical attitude’, etc.). These categories can be quite subjective and are vulnerable to bias or manipulation on the part of the researcher. Once a coding scheme has been created, the subjectivity issues are compounded by the fact that – if more lessons were recorded – that scheme must be applied in the analysis of every single video. How can one be sure that the coding scheme is used objectively and correctly?
The issues arising from the subjectivity of ‘coding’ can be more or less effectively addressed by having several independent coders work on the same video (inter-coder reliability procedures). However, because this is time-consuming, few studies do it, and when they do, they use only two coders because it is easier to resolve any disagreement.
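To make the idea of inter-coder reliability concrete, here is a minimal sketch of one common agreement statistic for two coders, Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The function name and the coded labels are invented for illustration; they do not come from any particular study.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same list of events."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same, non-empty list of events.")
    n = len(coder_a)
    # Proportion of events on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from how often each coder used each label.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to the same six classroom events.
coder_a = ["recast", "recast", "explicit correction",
           "request for help", "recast", "explicit correction"]
coder_b = ["recast", "explicit correction", "explicit correction",
           "request for help", "recast", "recast"]

kappa = cohen_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```

In this invented example the coders agree on four of the six events (about 67%), yet kappa comes out at roughly 0.45 once chance agreement is discounted, which shows why raw agreement percentages alone can flatter a coding scheme.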
- Opportunistic sampling
To test the superiority of a new methodology over another, you need to compare two groups which are homogeneous. The group receiving the treatment will be your experimental group and the other one your control group. A sample should be randomized and like should be compared with like. However, in educational research it is difficult to randomize and to find two schools or groups of individuals that are 100% equivalent.
Comparing the effect of an independent variable (e.g. a new instructional approach) in ten schools in the same Local Education Authority is not a valid procedure because it presumes that they are identical at some level. No two schools are the same – even if they are located in the same neighbourhood – and the initiative being tested (the independent variable) will be affected by many contextual and individual differences.
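For readers unfamiliar with what randomizing a sample means in practice, the following is a minimal sketch of random assignment to an experimental and a control group. The function name, the student identifiers and the seed are assumptions made for the example. Random assignment does not make the two groups identical; it only ensures that who ends up in which group is not decided by the researcher or by convenience.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups of (near) equal size."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(participants)  # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical list of twenty students.
students = [f"student_{i:02d}" for i in range(1, 21)]
experimental_group, control_group = randomly_assign(students, seed=42)
print(len(experimental_group), len(control_group))
```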
- The human factor
This is possibly the most important threat to the validity of educational research. Every teaching strategy, tool or methodology is going to be affected by the teacher who deploys it, the learners on the receiving end and the interaction between the two (the ‘chemistry’). Not to mention the fact that, as I have experienced first-hand, not all the administrators/teachers involved can be relied on to do exactly as instructed by the researchers. Consequently, it is difficult to dissociate the effect of the specific ‘treatment’ the experiment involved from the human factor. This problem is related to and compounded by another issue: the ‘researcher effect’.
- The researcher effect
When you implement a new initiative or instructional approach you are strongly sensitizing everyone involved to it. You may generate enthusiasm, indifference, anxiety or even resentment in the teacher/student population. The negative or positive emotional arousal in the informants will create an important source of bias. Knowing that you are part of an experiment (and it is very difficult to hide this in educational research) will inevitably affect your behavior.
- The use of multi-trait evaluative scales and other proficiency measures
Studies investigating the impact of a methodology on L2 learner proficiency use multi-trait assessment scales to evaluate students’ performance in speaking and/or writing. Forty years of use of such scales in L2 research have shown that the vast majority of these, when used with students of fairly similar levels of proficiency and applied by more than two independent assessors (and often even by just two), do not yield statistically significant inter-rater reliability scores (i.e. the raters do not come up with assessment scores which are close enough to be statistically valid). This has been documented by several studies. Hence, most studies use only two raters or, often, no rater at all, thereby undermining the validity of their findings.
An assessment scale, in order to be valid (and fair), must yield relatively high levels of inter-rater reliability (as obtained by using at least three independent assessors). We have all experienced disappointment and disbelief when we find out that our predicted-A* A-level students are awarded ‘B’ or even ‘C’ grades by an Examination board. But when one looks at the assessment scales they use, vague and ‘sketchy’ as they are and in view of the lack of serious inter-rater reliability procedures, it is not surprising at all.
There are even more problems with other measures used to assess written performance (e.g. T-units, error counts, etc.), which I will not even go into. Suffice it to say that they are very commonly used and their reliability is highly questionable.
- The research design
A typical research design adopted in educational research is a pre-test / post-test design. For instance, imagine a study where a school tries a new instructional approach with 70 of its 150 year seven students. Both groups (the 70 who receive the ‘treatment’ and the 80 who do not) are given a test before the ‘treatment’ and another (similar) test at the end of it to see if there are any improvements. The researchers find that there are indeed significant improvements. Problem: the second test is only a snapshot of the students’ performance. How do we know that the result wasn’t due to that particular test (its type or content) or to other surrounding variables? The truth is that this kind of design is used because it is cheaper, logistically more manageable and less time-consuming. A better design would be a repeated-measures design with several tests throughout the year (which would also control for learner maturation, that is, the extent to which the observed improvements are due to developmental factors rather than to the ‘treatment’).
- Significance tests
Example: When comparing the essay or oral performance scores obtained by the two groups under study (the experimental and the control group), one usually performs a test (called a t-test) which compares the means of the scores obtained by the two groups. The test yields a statistic on which the researcher then performs a significance test to verify that the difference between the two means is too large to be plausibly put down to chance, i.e. that it is statistically significant. That will finally tell you whether your treatment has been successful or not, your hypothesis supported or rejected.
However, what is interesting is that not all the significance tests normally used in research will give you the same result; one test may give you a positive verdict whereas four or five of the others will not. Guess which significance test results researchers normally publish?
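As a rough illustration of what a significance test on two group means actually does, here is a minimal sketch. Note that it is not a t-test: to stay within Python's standard library it uses an exact permutation test, a different significance test that asks the same question, namely how often a difference in means at least as large as the observed one would arise if the group labels were assigned purely by chance. The function name and the scores are invented for the example.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(experimental, control):
    """Exact two-sided permutation test on the difference between two group means."""
    pooled = list(experimental) + list(control)
    n_exp = len(experimental)
    n_ctl = len(control)
    total = sum(pooled)
    observed = abs(mean(experimental) - mean(control))
    extreme = 0
    splits = 0
    # Try every possible way of relabelling n_exp of the pooled scores as 'experimental'.
    for indices in combinations(range(len(pooled)), n_exp):
        group_sum = sum(pooled[i] for i in indices)
        diff = abs(group_sum / n_exp - (total - group_sum) / n_ctl)
        # Small tolerance so floating-point rounding does not hide ties.
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical essay scores (out of 20) for two small groups of six students.
experimental_scores = [14, 15, 13, 16, 12, 15]
control_scores = [12, 13, 11, 14, 13, 12]

p_value = permutation_p_value(experimental_scores, control_scores)
print(f"p = {p_value:.3f}")
```

The returned value is the proportion of relabellings that produce a gap at least as large as the one observed; by convention a result below 0.05 is called statistically significant. Enumerating every split is only feasible for small groups like these.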
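A little arithmetic shows why reporting only the test that ‘passed’ is a problem. The sketch below assumes, purely for simplicity, that the tests are independent and that the treatment has no real effect; tests run on the same data are in practice correlated, so the exact figures would differ, but the direction of the effect is the same. The function name is my own.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests comes out
    'significant' at level `alpha` when the treatment has no real effect."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 3, 5):
    print(num_tests, round(chance_of_false_positive(num_tests), 3))
```

Under these assumptions, a single test has a 5% chance of a spurious ‘significant’ result, whereas trying five tests and publishing whichever one passes raises that chance to about 23%.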
- Lack of transparency
Many studies are not 100% transparent about all of the above, especially when it comes to inter-rater reliability procedures and scores. More often than not, they will not tell you which significance tests they failed; they will only tell you the one they ‘passed’.
- The generalizability of the findings and replicability of the study
This is the most crucial issue of all because, after all, what governments or international agencies do when they quote a piece of research is state that an initiative/intervention has worked in twenty or thirty schools around the country and thus should be implemented in all schools. And if the government happens to be the American or British one, it may spread to other countries, too. The question is: are the schools where the experiment took place – their teachers, administrators, students and other stakeholders – truly representative of the whole country, the continent, the whole world? In my experience, more often than not, the research findings have low generalizability.
These are only 10 of the 25 reasons I brainstormed prior to writing this article as to why one should be skeptical about much educational research and about any imposed theory of, or instructional approach to, foreign language teaching based on it. The above does not rule out the existence of sound and credible educational research within and outside the realm of foreign language learning and acquisition. There are in fact several examples of it. My main point, ultimately, is that educational research may yield rich and highly informative data; however, such data are often not as reliable and generalizable as they are made out to be by the governments or establishments who use them to support their political or economic agendas.
This article does not intend to incite the reader against change or innovation. Not at all. Its aim is to raise teachers’ awareness of some of the flaws in research design and procedures common to many studies carried out to date. Such awareness may prompt them to look at research in the future in a more ‘savvy’ and discerning way and to be more selective as to what they take on board and incorporate into their professional practice. Openness to change is a marker of a growth mindset, but the ‘blind’ embracing of any initiative claimed to be supported by unverified ‘research’ is unethical in a profession like teaching, where the cognitive development and welfare of our children are at stake.
In the last thirty years we have witnessed the implementation of great educational initiatives and innovations which have benefitted teachers and students (the Key Stage 3 strategy, for instance, in the 90s, and Assessment for Learning). I have seen others, however (e.g. Learning Styles and Multiple Intelligences), which were not only rooted in ‘phony’ theory and research, but also, in my view, wasted a lot of teacher and student time.