All data is social

One of my strands of work at the moment is thinking about data as a social asset, including how to value it. Anybody playing a word association game would probably link ‘personal’ with ‘data’, but the usefulness of any individual data record will generally depend on context – on other pieces of data. Information requires the combinations. What’s more, the way data is categorised and collected is highly socially structured too. If I buy a book online (it has been known), that datum will be classified in different ways to be useful, for the recommender algorithm or for marketing analytics: what genre? how much was the book? what else has she bought?

This is a preamble to this handy Primer on Powerful Numbers, a brief overview of the sociology of data with fantastic lists of references. It’s well worth a read. It sent me back after a long time to Benedict Anderson’s classic Imagined Communities, specifically the chapter ‘Census, Map, Museum’: “The real innovation of the census-takers of the 1870s was not in the construction of the ethnic-racial classifications but rather in their systematic quantification. … The flow of subject populations through the mesh of differential schools, courts, clinics, police stations, immigration offices created ‘traffic habits’ which in time gave real social life to the state’s earlier fantasies.” On censuses, see also Andrew Whitby’s excellent history, The Sum of The People.

Census and map-making as social instruments also feature in Jürgen Osterhammel’s monumental The Transformation of the World: A Global History of the 19th Century, which I’ve recently been reading. I’m enjoying its woven approach – big themes as warp, teased out across the whole of the globe as weft, to paint a rich tapestry. One example relevant to data being shaped by society and in turn reshaping it is map-making, and the way physical maps, in reflecting mental maps, led to new actions (say in colonial administration) that in turn altered the physical maps – for example by tighter specification of borders as definitive lines. (However, the book’s simply too big for me to hold, even in paperback, so I doubt I’ll get through it all. Given the big book trend, can I plead with publishers to return to the tradition of multiple volumes?)

As we recognise ourselves to be in a data-driven economy and society, thinking about the social construction of data, the unavoidably social use of data, and the way data will alter the society it catalogues is vital. I’m lucky to have some amazing Bennett Institute colleagues thinking about these questions – Sam Gilbert’s Good Data, Jeni Tennison’s new project Connected by Data, Claire Melamed’s work on data for the SDGs, and Stephanie Diepeveen, Annabel Manley and Sumedha Deshmukh working with me on the value of data including specific applications (eg transport, finance).

All that we don’t know

What a month – what a week – and much more to come.

Meanwhile, I enjoyed reading David Hand’s new book Dark Data: Why What You Don’t Know Matters. A former president of the Royal Statistical Society, Hand has written an excellent guide to the many reasons for caution in interpreting data. He uses the overall metaphor of ‘dark’ data (like dark matter in the universe) to categorise these pitfalls, which is a nice way of organising a book that could have ended up being a list of the rather varied things statisticians need to worry about. The book ends with a very useful, rather sobering, taxonomy of the 15 data issues a careful empiricist should be aware of.

For example, these include missing data we know about; missing data or omitted variables we are unaware of (the unfairly mocked Rumsfeldian ‘unknown unknowns’); sample selection bias; gaming and feedback; fraudulent data; measurement error; extrapolation; counterfactuals; and even the process of scientific discovery. This is still obviously quite a varied collection of issues, which does lead to some tangents (such as why Sigmund Freud was not a scientist, how Facebook breached the Nuremberg Code, and that Randolph Churchill found decimal places mysterious).

But it is all very clearly and amusingly written, so it is an excellent overview of all the potential mis-steps a student (or practitioner) might make, from p-hacking to omitted variable bias, Simpson’s Paradox to the Hawthorne Effect. The final section has some positive suggestions too: linking datasets, replication, RCTs, and anonymisation. A very useful book and one I will recommend to students.

There are limits of course: sometimes the uncertainty is irreducible, as we are learning now.