Sarah Garner and Rachel Thomas consider why well-designed and properly analysed experiments are so important when testing how effective a medical treatment is.
Suppose a new medical treatment has been developed that may reduce high blood pressure. The treatment has been extensively tested in the laboratory and on a few volunteers, and the researchers believe that it will work on the general population. Now it is time to find out if they are right.
Historically, doctors found out whether a treatment worked in practice by using it on their patients. They could then compare the patients’ responses to the new treatment and to previous treatments for the same illness, and also compare how responses to the new treatment varied between patients. However, if patients did indeed recover from their condition, there was no way of telling whether it was due to the treatment or to something else.
There are many other factors that could have caused the patients to recover: for example, they may have felt better simply because they were being treated by a doctor (a reaction known as the placebo effect); they may have recovered anyway, regardless of the treatment; or perhaps their recovery was due to changes in their personal circumstances or lifestyle. Without taking these and other factors into account, it could be easy to conclude incorrectly that the treatment worked. Doctors would then incorporate it into their everyday practice, mistakenly believing it to be effective.
In the 19th century, scientists proposed a method of controlling exactly what was happening and recording any changes in the patients’ condition. In these controlled experiments, there were two groups of patients – the study group, which received the new treatment, and the control group, which received a placebo (an inert medication) or an established treatment. The patients were then observed, and the outcomes of the two groups (such as whether each patient lived or died) were recorded and compared.
Some time later, in 1917, the process of ‘blinding’ improved the scientific method even further. If neither the patient nor the researcher knows which treatment the patient is receiving, then the results cannot be interfered with either intentionally or unintentionally. This is known as a double-blind trial (in a single-blind trial, either the patient or researcher knows which treatment is being received).
However, the results could still be deliberately biased to prove that a treatment worked, by including sicker patients in the study group than in the control group. The solution to this, first used by the UK’s Medical Research Council in the 1940s for its study of whooping cough vaccines, is to randomly choose which patients will get the new treatment, and which will get the control treatment.
Controlled trials with random allocation to the two groups became known as randomised controlled trials or RCTs. By randomising, you not only end up with a random distribution of sicker and healthier patients between the two groups, but also achieve a random distribution of things you do not know about (but which may also affect the patient’s health and therefore the outcome of the treatment). Then – because, in theory, the only difference between the two groups is whether they received the treatment being tested – you can assume that any differences in outcome are most likely due to the treatment and nothing else.
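As a rough illustration of random allocation, the sketch below shuffles a list of patient identifiers and splits it into two groups of equal size. The function name `randomise`, the patient labels and the fixed seed are assumptions made for this example only; real trials generally use more elaborate allocation schemes.

```python
import random

def randomise(patients, seed=None):
    """Randomly split patients into a study group and a control group."""
    rng = random.Random(seed)   # a seed is used only to make the example repeatable
    shuffled = list(patients)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# 128 hypothetical patients, identified by label only
patients = [f"patient-{i:03d}" for i in range(1, 129)]
study, control = randomise(patients, seed=1)
print(len(study), len(control))
```

Because the split depends only on the shuffle, neither the researcher nor the patient's state of health influences who ends up in which group.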
RCTs are now universally used in clinical research to evaluate new treatments.
Before 1994, doctors recommended that patients with lower back pain rest in bed. However, after reviewing all the available evidence, the Clinical Standards Advisory Group realised that bed rest was not beneficial and was perhaps even harmful. This led to a radical change in treatment, with patients being advised to remain active [w1].
If you are planning to test your blood-pressure treatment with an RCT, you need to design it carefully. One important question is: How many patients should you include in the trial? This depends on how big an effect the new treatment has: the bigger the effect, the smaller the number of patients you need to distinguish it from the random fluctuations that happen by chance.
Of course, the effect of the treatment is exactly what you want to find out with your RCT. Before you start an RCT, however, you will already have some evidence that the treatment works, perhaps from laboratory or small-scale testing. This allows you to estimate the effect size.
In a healthy patient, maximum (systolic) blood pressure should be between 90 and 120 mmHg. But patients with high blood pressure will consistently have measurements of more than 140 mmHg, putting them at increased risk of heart attack and stroke. You might estimate that the new treatment will reduce a patient’s maximum blood pressure by 5 mmHg: after treatment, you would expect that the average blood pressure of the study group would be at least 5 mmHg lower than the average blood pressure of the control group.
There are statistical formulae to determine the sample size you need to have a good chance of detecting that estimated effect [w2]. For your blood-pressure treatment, these formulae tell you that you would need around 64 patients in each group to detect a treatment difference of 5 mmHg [w3].
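One commonly used formula of this kind is the normal approximation for comparing two group means. The sketch below applies it under assumptions that are not stated in this article: a standard deviation of 10 mmHg between patients, a two-sided significance level of 5% and an 80% chance (power) of detecting the effect. The function name `patients_per_group` is also made up for this example.

```python
import math
from statistics import NormalDist

def patients_per_group(effect, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for comparing two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired chance of detection
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)                            # round up to whole patients

# sd = 10 mmHg is an assumed value, not a figure from the article
print(patients_per_group(effect=5, sd=10))   # 63
print(patients_per_group(effect=10, sd=10))  # 16
```

Under these assumptions the approximation gives 63 patients per group, close to the figure of around 64 quoted above; more exact calculations typically add a patient or so. Doubling the expected effect to 10 mmHg cuts the requirement to 16 per group, which shows how strongly the sample size depends on the effect size.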
The trial has run its course, the participants have been monitored, and you have recorded a difference in blood pressure between the patients in the study and control groups. Thanks to randomising, you know that the two groups were comparable before the trial. So either your new treatment has had an effect, or a very surprising event has occurred: the treatment really has no effect at all and the difference you recorded in your RCT was due to chance alone.
Imagine that the average blood pressure of the study group was 5.2 mmHg lower than the average blood pressure of the control group. How do you decide if that difference is due to chance or to a real effect of the treatment? After all, blood pressure can vary for many reasons, not all of which can be controlled in your RCT.
Statisticians allow for some variation: rather than rely on one average for each group, they calculate a range of values for each group that they are pretty confident will include the true value. This range of values is called a confidence interval. If the confidence intervals in your blood-pressure study are 141.2–148.9 mmHg in the control group and 133.7–139.3 mmHg in the study group, you can see that the two confidence intervals do not overlap. Statisticians, therefore, say that the observed difference between the two groups is statistically significant – and you can be reasonably confident that it was caused by the treatment rather than by chance.
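The sketch below shows one simple way to compute such an interval and to check whether two intervals overlap. It uses a normal approximation (for small samples a t-distribution would give a slightly wider interval), and the function names and the eight blood-pressure readings are hypothetical; only the two intervals at the end come from the text.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the mean of a sample."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # about 1.96 for 95%
    margin = z * stdev(values) / math.sqrt(len(values))
    centre = mean(values)
    return centre - margin, centre + margin

def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical readings in mmHg, for illustration only
readings = [138, 142, 135, 140, 137, 133, 139, 136]
print(confidence_interval(readings))

# The two intervals quoted in the text
control_ci = (141.2, 148.9)
study_ci = (133.7, 139.3)
print(intervals_overlap(control_ci, study_ci))  # False
```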
But how confident is confident? Statisticians usually say that 95% confident is good enough; this means that they are prepared to live with the fact that 5% of the time (or 1 in 20 times) they will be wrong due to chance alone. To be even surer that you have the right value, you have to measure more patients, and even then the only way to be 100% sure is to measure the whole population!
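The 1-in-20 figure can be checked with a small simulation. The sketch below repeatedly draws 64 made-up blood-pressure readings from a population whose true mean is known, builds a 95% interval from each sample and counts how often the interval misses the true mean. The population mean of 145 mmHg, the standard deviation of 10 mmHg and the function name `miss_rate` are all assumptions for this example.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def miss_rate(true_mean=145.0, sd=10.0, n=64, trials=2000, seed=0):
    """Fraction of simulated 95% intervals that fail to contain the true mean."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.975)      # about 1.96
    misses = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        margin = z * stdev(sample) / math.sqrt(n)
        if abs(mean(sample) - true_mean) > margin:
            misses += 1
    return misses / trials

print(miss_rate())  # expected to be close to 0.05
```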
If the result turns out not to be statistically significant, one of the key questions to ask is whether you included enough patients in the trial. Perhaps the effect of the treatment is smaller than you estimated – with a larger sample size, you might have detected a difference between the two groups of patients.
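To see how much a smaller effect hurts, the sketch below estimates the chance that a trial with 64 patients per group produces a statistically significant result. As before, the standard deviation of 10 mmHg and the 5% two-sided significance level are assumptions rather than figures from this article, and `power` is a name chosen for the example.

```python
import math
from statistics import NormalDist

def power(effect, sd, n_per_group, alpha=0.05):
    """Approximate chance of a significant result for a given true effect."""
    se = sd * math.sqrt(2 / n_per_group)           # standard error of the difference
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    return NormalDist().cdf(abs(effect) / se - z_alpha)

for effect in (5, 3):  # true effect in mmHg
    print(effect, round(power(effect, sd=10, n_per_group=64), 2))
```

Under these assumptions, a true effect of 5 mmHg would be detected in roughly 8 out of 10 trials of this size, but a true effect of 3 mmHg in only about 4 out of 10, so a non-significant result would not be surprising.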
A well-designed and properly analysed RCT is a very powerful tool for medical researchers – providing doctors with the information they need to make the right decisions when treating their patients. Nonetheless, RCTs do have limitations.
Firstly, it is not enough to know that the new treatment makes a statistically significant difference. Is the difference also clinically significant – for example, does a decrease in maximum blood pressure of 5 mmHg make a real difference to a patient’s health and well-being? After all, in our example, the treatment still did not reduce the blood pressure to the normal range of 90-120 mmHg. To judge if this is clinically relevant, doctors may have to turn to other types of research.
A further limitation of RCTs is that patients in the trial may not represent the real-world population of people to be treated. Because trials aim to control as many factors as possible, they usually have strict inclusion and exclusion criteria. For example, pregnant women are not included due to potential risks to the unborn child; this meant that no one realised that thalidomide caused birth defects until it was introduced into general practice in the late 1950s [w4].
Then there is the question of how RCTs are reported. No one wants to publish bad news, particularly people who have spent time and effort to develop a new treatment. Historically, therefore, researchers did not publish trials that showed no difference or even that an older treatment was better. Some unscrupulous researchers have also reported selective or incomplete results, which made a new treatment look better than it really was. The research community has taken steps to stop both these biases by making companies and researchers register the start of a trial, so that it is more difficult to hide unfavourable outcomes, although there is still no requirement to report all outcomes. Journals are also standardising the information they require researchers to submit with their manuscripts, which makes it more difficult for bad results to be hidden.
Above all, RCTs are expensive and time-consuming. As a result, many trials are not conducted at all, or their sample size or duration is limited. This may mean that the trial is not powerful enough to detect that a treatment is effective when in fact it is. Smaller trials may also miss important adverse effects (which may be rare), and shorter trials are unable to capture long-term outcomes.
Clinical researchers, therefore, often review the outcomes of a number of trials together in a meticulous analysis known as a systematic review – this effectively increases the sample size. Organisations such as the Cochrane Collaboration [w5] and the UK’s National Institute for Health and Clinical Excellence [w6] base their recommendations to the medical community on systematic reviews.
Since the 1940s, the use of RCTs has significantly changed medical practice. Doctors are no longer reliant on their own observations but can rely on rigorous evaluation to ensure that the benefit of a new treatment outweighs the risks.
A systematic review of the evidence for minocycline, an antibiotic that was heavily promoted as the best cure for acne, was recently conducted to investigate its efficacy and its safety.
One side effect of minocycline is potentially fatal autoimmune liver problems. The problems are rare and can have a number of causes. Most doctors do not come across them, and even if they do, the connection might not necessarily be made with the drug.
It was only when all the information was reviewed together that the link was made. A systematic review showed that there was no evidence that minocycline was any better at curing acne than any other known treatment. Given the risks, the authors of the review concluded it should not be used in preference to other treatments (Garner et al., 2003).
If you enjoyed this article but would like to learn more about the mathematics involved, read the original, longer version of this article [w3], which appeared in Plus magazine [w7], a free online magazine that opens a door to the world of mathematics with all its beauty and applications.