Lies, Damned Lies And Covid Statistics: Why Data-Illiterate Media Gets It Wrong 

Anonymous Contributor

Jun 02, 2021, 01:53 PM | Updated 01:53 PM IST

  • Here is a list of pitfalls that readers should watch out for while consuming statistical data in the media.
  • By A Special Contributor

    In conducting research for my article on estimating Covid deaths, I noticed many pitfalls in typical media analysis. I’m not referring to anecdotal reasoning, which is an oxymoron. Even when the data is sound (i.e. death records), the statistical analysis is often bad and misleading, despite the involvement of external experts in some articles.

    I list here a set of pitfalls that readers should watch out for so as not to get suckered.

    1. Just a number

    Headlines like “40,000 excess-deaths in Gujarat, 11,000 in Delhi” are just that: headlines. Even if the numbers are correct, they are meaningless without proper framing. The figure 40,000 is better than 11,000 if the former refers to a two-month period and 70 million people, while the latter is over one month and 19 million. Sound statistics frames numbers against context and allows like-to-like comparisons.
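
    The like-to-like comparison above can be worked through in a few lines. This is a minimal sketch that uses only the figures quoted in this section (40,000 over two months and 70 million people, 11,000 over one month and 19 million); the function name is illustrative.

```python
def excess_rate(excess_deaths, population, months):
    """Excess deaths per million people per month."""
    return excess_deaths / (population / 1_000_000) / months

# Figures as framed in the text: 40,000 over two months and 70 million people,
# versus 11,000 over one month and 19 million people.
gujarat = excess_rate(40_000, 70_000_000, 2)
delhi = excess_rate(11_000, 19_000_000, 1)

print(f"Gujarat: {gujarat:.0f} per million per month")
print(f"Delhi:   {delhi:.0f} per million per month")
```

    Under that framing, the larger headline number works out to roughly 286 per million per month against roughly 579 for the smaller one, which is why the raw counts alone say little.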

    2. Residuals are very sensitive

    Excess death is a residual, i.e. (A) minus (B), where A is the observed death count and B is the baseline. Even when A is known, B is only an estimate, since the choice of period, normalisation for trend and population adjustment are all non-trivial. Small changes in the baseline can yield big changes in excess deaths. The baseline has to be carefully constructed with a transparent, detailed method. In most studies, baseline choice is driven by agenda, not reasoning.
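
    A small sketch of that sensitivity follows. All numbers are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from any dataset.

```python
def excess_deaths(observed, baseline):
    """Excess deaths as a residual: observed minus baseline."""
    return observed - baseline

observed = 120_000        # hypothetical observed deaths (A)
baseline_low = 100_000    # hypothetical baseline (B)
baseline_high = 105_000   # the same baseline, revised upward by 5 per cent

excess_low = excess_deaths(observed, baseline_low)
excess_high = excess_deaths(observed, baseline_high)

# Relative change in the residual caused by a 5 per cent change in the baseline.
change = (excess_high - excess_low) / excess_low

print(f"Excess with lower baseline:  {excess_low:,}")
print(f"Excess with higher baseline: {excess_high:,}")
print(f"Change in excess: {change:.0%}")
```

    In this made-up case, a 5 per cent revision to the baseline moves the excess by 25 per cent. The smaller the residual is relative to the baseline, the stronger this amplification becomes.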

    3. Baseline too low

    The shortest path to a sensational headline on exaggerated deaths is deflating the baseline. Multiple reports use the 2015-19 average without population-growth adjustment. They downplay the fact that excess deaths come out lower if a population adjustment is made or if the most recent year is used as the baseline. The aim is to obtain the most convenient baseline, not the most reliable one.
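
    One way to see the effect of population adjustment is to compare a raw five-year average with a baseline built from death rates and scaled to the current population. The sketch below assumes hypothetical yearly deaths and populations (not real data) and a simple rate-based adjustment; other adjustment methods are possible.

```python
# Hypothetical yearly deaths and populations (illustrative, not real data).
deaths = {2015: 420_000, 2016: 427_000, 2017: 434_000, 2018: 441_000, 2019: 448_000}
population = {2015: 60_000_000, 2016: 61_000_000, 2017: 62_000_000,
              2018: 63_000_000, 2019: 64_000_000}

target_population = 66_000_000  # hypothetical population in the period studied
observed = 480_000              # hypothetical observed deaths in that period

# Unadjusted baseline: plain average of past death counts.
raw_baseline = sum(deaths.values()) / len(deaths)

# Adjusted baseline: average past death rate applied to the current population.
mean_rate = sum(deaths[y] / population[y] for y in deaths) / len(deaths)
adjusted_baseline = mean_rate * target_population

print(f"Raw baseline:      {raw_baseline:,.0f} -> excess {observed - raw_baseline:,.0f}")
print(f"Adjusted baseline: {adjusted_baseline:,.0f} -> excess {observed - adjusted_baseline:,.0f}")
```

    With these made-up inputs the death rate never changes, yet the unadjusted baseline produces an excess more than twice as large as the adjusted one, purely because the population grew.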

    4. Tyranny of averages

    The media hides individual-year data and shows only averages. Pre-Covid spike years are hidden lest we figure out that sporadic spikes are normal in death data. If all we’re dealing with is five years, it’s good practice to show all the data (i.e. the figure for every year and the trend within the period, alongside the average). The less that is hidden, the more robust the analysis.

    5. Inconsistent internals

    The Gujarat data had massive variations across cities. If Rajkot was as doomed as the data suggested, we wouldn’t need journalists or analysis to find out. Younger age groups showed a death spike, which is not Covid-consistent. Cherry-picking convenient bits while downplaying inconvenient or inconsistent bits is a red flag.

    6. Adjacent periods

    With the vast majority of deaths due to ageing/ailments, the data is inherently mean-reverting. Tragic as it sounds, if fewer people die in a certain period, the chances of higher deaths in subsequent periods rise. The average is steady since underlying vulnerabilities are mostly a function of age and lifestyle. If a prior period was below trend, mean-reversion, not Covid, can explain the excess. This also makes it crucial to wait for subsequent-period data. Sound analysis should show prior and subsequent periods.
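
    The point about adjacent periods can be illustrated by comparing the excess in a single spike period with the cumulative excess across the periods around it. The figures below are hypothetical and the flat baseline is a simplifying assumption.

```python
baseline = 10_000  # hypothetical expected deaths per period (flat, for simplicity)

# Hypothetical observed deaths: a below-trend period, a spike, then a mild dip.
observed = {"prior": 9_000, "spike": 11_500, "subsequent": 9_800}

# Excess (or deficit, if negative) in each period.
excess = {period: count - baseline for period, count in observed.items()}

spike_only = excess["spike"]
cumulative = sum(excess.values())

for period, value in excess.items():
    print(f"{period:>10}: {value:+,}")
print(f"Spike period alone: {spike_only:+,}")
print(f"All three periods:  {cumulative:+,}")
```

    Reported in isolation, the spike period shows an excess of 1,500; across all three periods the net excess in this made-up series is 300.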

    7. Attribution

    Articles blindly take the ratio of excess to reported Covid deaths and claim 5x or 10x under-counting. But there’s no basis for 100 per cent of the excess (even correctly estimated) to be attributed to Covid. Since other ailments suffer care deprivation during a pandemic, they could also be elevated during Covid spike periods.
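
    The under-counting ratio depends directly on how much of the excess is attributed to Covid. The sketch below makes that explicit; the excess, the reported count and the 60 per cent share are all hypothetical placeholders, not estimates.

```python
def undercount_factor(excess, reported_covid, covid_share):
    """Ratio of Covid-attributed excess deaths to reported Covid deaths."""
    return excess * covid_share / reported_covid

excess = 10_000    # hypothetical excess deaths
reported = 2_000   # hypothetical reported Covid deaths

# Attributing all of the excess to Covid, as the criticised articles do.
naive = undercount_factor(excess, reported, covid_share=1.0)

# Attributing only part of it (0.6 is an arbitrary illustrative share).
partial = undercount_factor(excess, reported, covid_share=0.6)

print(f"100% attribution: {naive:.1f}x")
print(f"60% attribution:  {partial:.1f}x")
```

    The headline multiple is therefore only as good as the attribution share behind it, and that share is rarely stated.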

    8. Selection bias

    Real scientists also report data where conclusions were opposite or inconclusive. If you scour dozens of datasets, you will find a few with convenient conclusions to showcase. Even if data within a region is representative of the region, the choice of the region is not representative of a country, especially given the media’s incentives to only carry sensationalist stories.

    9. Extrapolation

    Analyses are sometimes based on cherry-picked, spotlighted corners of time and space. Extrapolation to a wider population or duration has to be done with great caution and humility. Taking the worst headline (10x here) and casting innuendos about everywhere (5x-10x pan-India) are as baseless as they are mala fide.

    Sound statistical analysis should pass many tests: reliably extracting deviation from the trend with noisy data, correct causal attribution, framing numbers as a ratio of population, like-to-like comparisons, accounting for mean-reversions, acknowledging biased sampling and lack of representativeness, and careful extrapolation. Even with sound data, a typical article on excess deaths fails many of the above tests, primarily due to motivation issues.

    The media seeks stories, not the soundness of methodology. Exaggerated, preferably negative, headlines sell. “Nothing to see here” or “it’s complicated” is not in its pecuniary interest. While I mention some of these pitfalls in my article on excess deaths, it’s a topic that needs to be separately highlighted. Readers should beware of such pitfalls in any "analysis" presented to them, and not just in Covid-related articles. A careless pen is 10x more dangerous than a sword.
