All that we don’t know

What a month – what a week – and much more to come.

Meanwhile, I enjoyed reading David Hand’s new book Dark Data: Why What You Don’t Know Matters. A former president of the Royal Statistical Society, Hand has written an excellent guide to the many reasons for caution in interpreting data. He uses the overall metaphor of ‘dark’ data (like dark matter in the universe) to categorise these pitfalls, which is a nice way of organising a book that could otherwise have ended up as a list of the rather varied things statisticians need to worry about. The book ends with a very useful, rather sobering, taxonomy of the 15 data issues a careful empiricist should be aware of.

These include, for example, missing data we know about; missing data or omitted variables we are unaware of (the unfairly mocked Rumsfeldian ‘unknown unknowns’); sample selection bias; gaming and feedback; fraudulent data; measurement error; extrapolation; counterfactuals; and even the process of scientific discovery. This is obviously still quite a varied collection of issues, which does lead to some tangents (such as why Sigmund Freud was not a scientist, how Facebook breached the Nuremberg Code, and that Randolph Churchill found decimal places mysterious).


But it is all very clearly and amusingly written, so it is an excellent overview of all the potential mis-steps a student (or practitioner) might make, from p-hacking to omitted variable bias, Simpson’s Paradox to the Hawthorne Effect. The final section has some positive suggestions too: linking datasets, replication, RCTs, and anonymisation. A very useful book and one I will recommend to students.

There are limits, of course: sometimes the uncertainty is irreducible, as we are learning now.