Applying hypothesis testing and confidence intervals to assess the 2016 EU referendum polling results

In the last blog post, I discussed how polls do not account for variations that arise when sample results are extended to the population. I proposed that hypothesis testing and confidence intervals should be used in polls to enable stakeholders and the general public to assess the decisiveness of the results. While hypothesis testing gives a clear-cut answer on whether a poll can conclusively favour one side over the other, confidence intervals reveal the possibilities that might arise from a poll.

In this blog post, I used hypothesis testing and confidence intervals to analyse results from EU referendum polls in 2016. Using these techniques, I drew some interesting insights, identified polls that show a significant result and evaluated the usefulness of telephone polls compared with online ones.

Methodology

I collected a list of online and telephone British polls in 2016 from Wikipedia that surveyed people on how they would respond to the EU referendum question “Should the United Kingdom remain a member of the European Union or leave the European Union?”. I only included polls that reported the raw numbers (not proportions) of people who would vote Remain or Leave in the referendum. These were required to calculate the proportion of the sample voting Remain or Leave to a high level of precision (two decimal places) and the sample size (the total number of voters participating in the poll), both of which are used to calculate confidence intervals and p-values (see my previous blog post for more details). I collected weighted values from these polls as they are adjusted to create a nationally representative sample of the UK.

From these polls, I excluded voters who were undecided or would refuse to vote, for two reasons. Firstly, this reduces the number of possibilities to two (Remain or Leave), enabling me to apply hypothesis testing and confidence intervals to two mutually exclusive sides. Secondly, on the day of the referendum, anyone who is undecided or refuses to vote is excluded from the count that decides whether the UK Remains in or Leaves the EU, so excluding these voters simulates the referendum count. I then calculated p-values and confidence intervals for each poll according to the formulae from the last blog post.
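
The per-poll calculations can be sketched in a few lines of Python (the original analysis appears to have been done in R, judging by the console output; the function and variable names here are my own, not from the original analysis):

```python
from math import erf, sqrt

def analyse_poll(num_remain, num_leave):
    """Sketch of the per-poll calculations described above.
    Illustrative names only, not from the original analysis."""
    total = num_remain + num_leave      # undecided/refused already excluded
    prop_leave = num_leave / total      # sample proportion voting Leave

    # z-value: deviation from the null value of 0.5 (an even split)
    z = (prop_leave - 0.5) / sqrt(0.5 * 0.5 / total)

    # two-sided p-value via the standard normal CDF,
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

    # 95% confidence interval for the Leave proportion
    moe = 1.96 * sqrt(prop_leave * (1 - prop_leave) / total)
    return prop_leave, z, p_value, (prop_leave - moe, prop_leave + moe)
```

For example, a poll with 901 Remain and 778 Leave voters gives a Leave proportion of about 0.463, z ≈ −3.00 and p ≈ 0.0027, a statistically significant result in favour of Remain.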

Comparing statistical significance via z-values and confidence intervals

## Observations: 93
## Variables: 15
## $ poll_start <date> 2016-01-08, 2016-01-08, 2016-01-15, 2016-01-15, 20...
## $ poll_end   <date> 2016-01-10, 2016-01-14, 2016-01-16, 2016-01-17, 20...
## $ pollster   <chr> "ICM", "Panelbase", "Survation", "ICM", "ORB", "Com...
## $ poll_type  <chr> "Online", "Online", "Online", "Online", "Online", "...
## $ num_remain <dbl> 901, 704, 368, 857, 1050, 544, 821, 289, 589, 664, ...
## $ num_leave  <dbl> 778, 757, 392, 815, 965, 362, 826, 186, 569, 724, 7...
## $ total      <dbl> 1679, 1461, 760, 1672, 2015, 906, 1647, 475, 1158, ...
## $ prop_leave <dbl> 0.4633711, 0.5181383, 0.5157895, 0.4874402, 0.47890...
## $ z_value    <dbl> -3.00178625, 1.38659861, 0.87057150, -1.02714357, -...
## $ p_value    <dbl> 2.684006e-03, 1.655642e-01, 3.839882e-01, 3.043529e...
## $ z_sig      <chr> "Yes", "No", "No", "No", "Maybe", "Yes", "No", "Yes...
## $ error      <dbl> 0.02385241, 0.02562212, 0.03553061, 0.02395912, 0.0...
## $ low_ci     <dbl> 0.4395186, 0.4925161, 0.4802589, 0.4634811, 0.45709...
## $ high_ci    <dbl> 0.4872235, 0.5437604, 0.5513201, 0.5113993, 0.50072...
## $ ci_sig     <chr> "Yes", "No", "No", "No", "No", "Yes", "No", "Yes", ...

In total, 93 telephone and online polls were included in the analysis. Statistical significance can be determined either by the z-value, which measures the deviation of the proportion of Leave voters from the null value of 50% (representing equal numbers of Remain and Leave voters), or by confidence intervals, where a result is significant if the interval does not cross the 50% threshold. I initially compared z-values and confidence intervals to see whether both techniques detect statistical significance similarly at the p < 0.05 level.

##        
##         No Yes
##   Maybe  4   1
##   No    54   0
##   Yes    0  34

The rows and columns in the table represent statistical significance detected by z-values and confidence intervals respectively. Overall, both techniques identically delineate statistically significant and non-significant polls. I also included a “maybe” category for z-value hypothesis testing, representing polls of borderline statistical significance (i.e., p-values between 0.05 and 0.10). Of these five polls, four had a non-significant result as derived from confidence intervals.

We can visualise the comparison of statistical significance from z-values and confidence intervals:

In general, polls that are statistically significant via z-values have confidence intervals that do not cross over the 50% threshold. In contrast, polls that are not statistically significant via z-values have confidence intervals that cross over the 50% threshold. Polls that are borderline statistically significant (represented by orange) had one end of the confidence interval “touching” or slightly crossing over the 50% threshold. This visualisation shows the ability of statistical techniques to distinguish polls that decisively favour one side from those that show a balance of votes between the two sides.

##        
##           No  Yes
##   Maybe 0.04 0.01
##   No    0.58 0.00
##   Yes   0.00 0.37

Most polls (58%) are not statistically significant with the proportion of Leave voters not deviating significantly from the 50% null value. These results suggest that these polls cannot decisively favour Leave over Remain and vice-versa. The remaining polls have a Leave result that is significantly (37%) or borderline (5%) different from the 50% null value.

Overall, these results underline that statistical significance derived from z-values and from confidence intervals is equivalent. Therefore, in subsequent analyses, I used statistical significance derived from z-values to investigate polling results further.

How can statistical significance assist in the interpretation of polls?

I first counted the number of statistically significant and non-significant results from online and telephone polls.

From the graph, most non-statistically significant results come from online polls while most statistically significant results are derived from telephone polls. Having identified an interesting result, I investigated the characteristics of online and telephone EU referendum polls further.

Surprisingly, all telephone polls that had a statistically significant result favoured Remain, as indicated by a Leave proportion below 50%. They also had relatively small sample sizes, surveying fewer than 1,000 people. In contrast, most online polls, which survey more than 1,000 people, do not show a statistically significant result, describing a 50:50 split between Remain and Leave. Of the handful of online polls that are statistically significant, the number favouring Leave (indicated by >50% Leave voters) is double the number favouring Remain.

I also investigated the margins of error of online and telephone polls. Telephone polls had higher margins of error (median = 3.5%) than online polls (median = 2.4%) due to their smaller sample sizes. Given that the 2016 EU referendum favoured Leave over Remain, these results suggest that while online polls are more likely to be robust as they survey a larger sample, telephone polls are more likely to declare a decisive result but are also more prone to favouring the wrong side. This might be one of the myriad contributing factors explaining why phone polls were more likely to get the EU referendum result wrong compared with online polls.

How do EU referendum polls track over time?

I next looked at the polling results over time and how they are affected by statistical significance, taking into consideration the poll type and sample size.

Most polls are not statistically significant, straddling the 50% threshold. These polls suggest a 50:50 split between Remain and Leave voters, not favouring one side over the other. Of the polls that showed statistical significance, up until June nearly all were telephone polls that favoured Remain (indicated by lower than 50% Leave voters). Only four online polls showed statistically significant results, with two favouring Remain and two favouring Leave. From June onwards, however, nearly all statistically significant results came from online polls. Most of them favoured Leave up until just before the referendum, where the last two statistically significant online polls favoured Remain.

These last online polls differed from the referendum result, which favoured Leave over Remain. This is because there are other sources of error, such as voter turnout, that are not accounted for by the statistical techniques used. Nevertheless, hypothesis testing is a very powerful tool for separating polls that show a decisive result from those that do not. This gives us a subset of the data from which insightful conclusions can be made.

Conclusion

Hypothesis testing is very useful for selecting polls that show a significant deviation from a 50:50 split between Remain and Leave voters. Combined with confidence intervals, these statistical techniques have unveiled some interesting results. For instance, while telephone polls are more likely to declare a decisive result, they are also more prone to favouring the wrong side, as seen in the 2016 EU referendum. In contrast, online polls are less likely to declare a decisive result but are more likely to favour the correct side because they survey more people. This is why nearly all Brexit polls after the 2016 EU referendum have been conducted online rather than by phone.

In summary, the use of hypothesis testing and confidence intervals will better clarify whether polls show a decisive result on a particular issue and what range of possibilities is available. Given the explosion of data and information in the modern age, it is more important than ever that people are equipped with the tools to interpret and assess the legitimacy and accuracy of facts. Teaching people how to use and interpret hypothesis tests and confidence intervals will serve them well, not only for deciding whether they should care about a poll but also for assessing and debating findings from different sources of information.

Appreciating statistical variation to improve the interpretability of polls

The 2016 European Union (EU) referendum produced one of the biggest surprises of the 21st century. Most polls leading up to the referendum suggested that the majority of the UK would vote to Remain in the EU. However, the referendum produced a different result, with 51.9% of voters wanting to Leave the EU. Since then, there have been chaotic scenes over whether and how Brexit would be enforced. The polling industry has also come under attack, with debate surrounding whether polls are still useful for predicting how the population would vote on important issues such as Brexit.

The conflict between Remaining and Leaving the EU still rages on in the UK.

What the mass media and the general public do not appreciate is that polls, whose results are taken as the overall view of the population, survey only a small part of that population, and the rest of the population might vote differently from the sample. This introduces variation into the polling result, which can mean the result does not decisively favour one side over the other. Hypothesis testing and confidence intervals can be used to decide whether the public should care about a polling result and to describe the range of referendum results that are possible from a poll. If communicated simply to politicians and the public, more informed decisions can be made on what, if anything, one should do to influence people’s views towards voting Remain or Leave in the EU referendum.

How do opinion polls report variation in results?

Opinion polls are conducted on people drawn randomly from a population to gauge the population’s views of an issue. It is like tasting a small sample of a meal such as soup and using our thoughts on the sample to make a general judgement of the meal. In our case, we use statistics to extend the results of the sample to make conclusions about a population. The problem with this method is that people outside the sample may vote differently from those in it, causing population results to differ from a poll.

Hence, in statistics, it is important to account for the variation in polling results to capture the true value of the population. This is encapsulated by the margin of error which is added or subtracted from the sample value obtained in a poll. Mathematically, this can be defined as:

Population value = sample value ± margin of error (± means add or subtract)

In the case of an EU referendum poll, the sample value would be the proportion of the sample that votes Remain or Leave, while the margin of error provides the space to capture the proportion of the population that would vote Remain or Leave. This margin of error is set at around 3% in most polls but is rarely reported by the mass media. Hence, readers might erroneously assume that the polling results represent the true proportions of the population voting in a particular way. Although the margin of error is an essential tool for accounting for variation of the population value from a poll, on its own it does not convey the different outcomes that might be generated. This can be resolved by using confidence intervals.

A better tool for reporting variation in results: confidence intervals!

We can subtract or add the margin of error from the sample value to produce the lower and upper bounds of a population value respectively. Combining these bounds produces a confidence interval: the range of values that we are almost certain captures the true population value. Although confidence intervals can be produced at varying levels of confidence, they are usually set at 95%, so that we can be almost certain that what we conclude from a poll generalises to the whole population. This makes them very useful for thinking about the different outcomes of the EU referendum that might be generated from conducting a poll.

Calculating the confidence interval of a polling result

1. Find the sample value of a poll. In our case, we want to calculate the proportion of the sample that vote Leave. This can be calculated as:

prop_{Leave} = \frac{Total(Leave)}{Total(voters)}

2. Calculate the margin of error (MOE) to measure variation in the poll results. In our case, to calculate the MOE for our 95% confidence interval, we use the formula:

MOE_{Leave} = 1.96 \times \sqrt{\frac{prop_{Leave} \times (1 - prop_{Leave})}{Total(voters)}}

The MOE is influenced by the sample size which describes the total number of people that have voted (Total(voters)) in the poll.

3. Subtract or add the MOE from the sample value to get the lower and upper bounds of the confidence interval respectively.

Lower bound = sample value – margin of error

Upper bound = sample value + margin of error

4. Combine the lower and upper bound values to generate the 95% confidence interval. This describes the range of values that we are 95% sure captures the true population value (in our case, the proportion of the population that vote Leave).  

Confidence interval = (lower bound, upper bound)
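
As a minimal illustration, steps 1–4 can be carried out in a few lines of Python (a sketch following the formulae above; the names are mine, not from the original analysis):

```python
from math import sqrt

def leave_confidence_interval(num_leave, total_voters, z_star=1.96):
    """95% confidence interval for the proportion voting Leave,
    following steps 1-4 above (illustrative names only)."""
    prop_leave = num_leave / total_voters                              # step 1
    moe = z_star * sqrt(prop_leave * (1 - prop_leave) / total_voters)  # step 2
    return (prop_leave - moe, prop_leave + moe)                        # steps 3-4
```

For instance, a poll in which 905 of 1753 voters chose Leave gives a 95% confidence interval of roughly (49.3%, 54.0%).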

The basics of hypothesis testing

Although confidence intervals can describe the different outcomes of the EU referendum, they do not give a clear-cut answer on whether we should care about a poll. This is where hypothesis testing comes in.

In hypothesis testing, we assess assumptions about a population value against data from a random sample. It is analogous to a criminal trial, where the accused is presumed innocent until enough evidence is collected to prove guilt. In the same sense, we assume that the null hypothesis (H0) is true until proven otherwise. The null hypothesis proposes that there is no deviation from a set population value in response to some event; it holds when we produce a polling result that would most likely have appeared by chance. Conversely, if we have a polling result that is so rare and unusual that it is very unlikely to have arisen by chance, then we have enough evidence to reject the null hypothesis and accept the alternative hypothesis (Ha). The alternative hypothesis describes a deviation of the population value from some set value.

In our case, we want to assess whether a poll can decisively conclude that most of the population would vote Remain or Leave. We write our two hypotheses as follows (prop0,Leave is the proportion of the population that would vote Leave from the null hypothesis):

H0: there is an even split of Remain and Leave voters in the population.  prop0,Leave = 0.5

Ha: the poll decisively favours Remain or Leave. prop0,Leave ≠ 0.5

Should we care about a polling result? Let’s use a hypothesis test to find out!

There are many statistical tests that can be used depending on the kind of data that we are analysing. As we are analysing the proportion of people that vote Remain or Leave in a poll, we convert the proportion to a standardised z-value that can be used to calculate probabilities on a normal z-distribution (better known as a “bell curve”).

What a normal z-distribution looks like.

The z-value can be calculated by the formula:

z-value = \frac{prop_{Leave} - prop_{0,Leave}}{\sqrt{\frac{prop_{0,Leave} \times (1-prop_{0,Leave})}{Total(voters)}}}

If we set prop0,Leave = 0.5 (meaning an even split of Remain and Leave voters in the population), we can simplify the z-value to:

z-value = \frac{prop_{Leave} - 0.5}{\sqrt{\frac{0.5 \times (1-0.5)}{Total(voters)}}} = 2 \times \sqrt{Total(voters)} \times (prop_{Leave}-0.5)

This z-distribution (Z) can be used to calculate the probability (the p-value) that we generate a random result that is just as or more extreme than the polling result, given some set value from a null hypothesis. This is represented mathematically as:

p-value = Pr(Z \leq -z-value \hspace{2mm} OR \hspace{2mm} Z \geq +z-value)

And can be calculated using normal tables, a calculator or a computer. We compare the p-value to an alpha value which is the threshold that the p-value has to go below to reject the null hypothesis. Although we can set different alpha-values between 0 and 1, it is usually set to 0.05 (which describes a 5% chance that we get a random result that is just as or more extreme than the polling result given some null value).

  • If the p-value is more than the alpha value (i.e., p > 0.05), then we have failed to reject the null hypothesis. We conclude that the poll cannot decide whether most of the population would vote Remain or Leave in the EU referendum.
  • If the p-value falls below the alpha value (i.e., p < 0.05), then we reject the null hypothesis and accept the alternative hypothesis. We conclude that the poll can decisively favour Remain or Leave in the EU referendum among the population.
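
The whole test can be sketched in a few lines of Python, using the simplified z-value formula above and the standard normal CDF written in terms of the error function (the function and variable names are my own, not from the original post):

```python
from math import erf, sqrt

def hypothesis_test(prop_leave, total_voters, alpha=0.05):
    """Two-sided test of H0: prop_0,Leave = 0.5. Returns the z-value,
    the p-value and whether H0 is rejected at the given alpha."""
    z = 2 * sqrt(total_voters) * (prop_leave - 0.5)  # simplified z-value
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))          # Phi(|z|), standard normal CDF
    p_value = 2 * (1 - phi)                          # probability of both tails
    return z, p_value, p_value < alpha
```

A value of `True` in the last position means we reject the null hypothesis, i.e. the poll decisively favours one side.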

Hypothesis testing is useful for deciding whether the public and stakeholders should care about a polling result, facilitating informed decisions on how campaigning needs to be done.

Applying hypothesis testing and confidence intervals to a real-life EU referendum poll

Let’s look at an online poll run from 27th to 29th May 2016 by the polling company ICM. Out of 1753 people, 848 (48.37%) voted Remain and 905 (51.63%) voted Leave. Should we care about the ICM poll?

First, let’s use hypothesis testing to decide whether the ICM poll is decisive. We declare two hypotheses:

H0: There is an even split of Remain and Leave voters in the population. prop0,Leave = 0.5

Ha: The poll decisively favours Remain or Leave among the population. prop0,Leave ≠ 0.5

Since we have propLeave = 0.5163 (converted from percentage to decimal), we calculate the z-value as follows:

z-value = \frac{0.5163 - 0.5}{\sqrt{\frac{0.5 \times (1-0.5)}{1753}}} = 1.3614

And calculate its p-value:

p-value = Pr(Z \leq -1.3614 \hspace{2mm} OR \hspace{2mm} Z \geq +1.3614) = 0.1734

The p-value of 0.1734 exceeds the alpha-value of 0.05, so we fail to reject the null hypothesis. The ICM poll cannot decisively favour Remain or Leave, implying an even split between the two sides among voters in the population.

How can we visualise the indecisiveness of this poll? We can use confidence intervals to do this.

First, calculate the margin of error (MOE). The MOE will be the same regardless of whether the proportions of Remain or Leave voters are used.

MOE_{Leave} = 1.96 \times \sqrt{\frac{0.5163 \times (1-0.5163)}{1753}} = 2.34\%

This is within the 3% MOE mentioned in most polls.

We use the MOE to calculate the confidence intervals of Leave and Remain voters.

Leave confidence interval = 51.63 ± 2.34% = (49.29%, 53.97%). This confidence interval states that we are 95% sure that the true proportion of the population that would vote Leave is between 49.29% and 53.97%.

Remain confidence interval = 48.37 ± 2.34% = (46.03%, 50.71%). This confidence interval states that we are 95% sure that the true proportion of the population that would vote Remain is between 46.03% and 50.71%.
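
As a check, the ICM numbers above can be reproduced with a short Python snippet (a sketch; the variable names are mine):

```python
from math import sqrt

# ICM poll: 905 of 1753 weighted respondents voted Leave
num_leave, total = 905, 1753
prop_leave = num_leave / total                             # ~0.5163

# margin of error for a 95% confidence interval
moe = 1.96 * sqrt(prop_leave * (1 - prop_leave) / total)   # ~0.0234, i.e. 2.34%

leave_ci = (prop_leave - moe, prop_leave + moe)            # ~ (49.29%, 53.97%)
remain_ci = (1 - prop_leave - moe, 1 - prop_leave + moe)   # ~ (46.03%, 50.71%)
```

Both intervals contain the 50% threshold, which is exactly the overlap the number-line visualisation shows.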

These confidence intervals acknowledge that the proportion of the population voting Remain or Leave might differ from the polling results. The real power of confidence intervals, though, comes when we visualise them on a number line.

The number lines above show the ICM poll results (indicated by the middle point of the line) along with the Leave and Remain confidence intervals. Two things can be observed from the number line:

  1. A 50:50 split between Leave and Remain voters is possible in an EU referendum (indicated by a dashed line) because the confidence intervals of both the Leave and Remain sides contain the 50% proportion. This result would not provide a clear indication of which side would win, something the mass media does not appreciate when hyping up a particular result.
  2. A referendum involving the population might produce a different result from a poll. Although the poll had a higher proportion of Leave than Remain voters in the sample, it is possible that in a referendum over the population, there might be more Remain than Leave voters. Hence, the poll cannot conclusively favour one side over the other.

These two points open up the possibility that the poll might not capture the views of the population. Readers overlook this not only because the mass media excludes the margin of error but also because they do not realise that polling results may not reflect the views of the whole population. If the confidence intervals of two groups in a sample overlap, it is possible that the referendum result over the population might be very different from the polling result of the sample.

Conclusion

The way polling results are reported by the mass media today covers up the dangers of extending results from a sample to infer conclusions about a population. Even citing the margin of error does not paint a true picture of the range of possibilities that might arise from a poll. In contrast, hypothesis testing and confidence intervals can produce many insights into how we interpret polls. While hypothesis testing can tell us whether we should care about a polling result, confidence intervals can reveal the variability produced when polling results are extended to the overall population.

Ideally, the mass media would adopt hypothesis testing and confidence intervals as tools to correctly interpret polls and to responsibly extend results to the population. Given the mass media’s interest in hyping up polling results regardless of whether such hype is warranted, this is unlikely to happen. Hence, independent organisations should be set up to analyse polling results and provide a truthful interpretation of the polls to the public so that they can decide whether to act on a poll. Keeping the polling industry accountable to these statistical measures will ensure the viability of polls in painting a truthful picture of how the population thinks on various issues facing the country.