“Conducting a survey [to answer a research question] should be the last resort,” said Tom Smith in his keynote presentation at the BigSurv18 Conference in Barcelona, Spain.
Tom is the Managing Director of the Data Science Campus for the Office for National Statistics in the UK, and he knows a thing or two about data. His point is that given the near-infinite supply of data available from such disparate sources as administrative records, internet transactions (Amazon, Google, etc.), social media (Facebook, Twitter), commercially available databases, and so on, the answer to many research questions can be extracted from existing “Big Data” sources. Why conduct an expensive survey when extant data will serve the purpose?
Tom noted that survey costs are ever-increasing, as are the risks of error. Survey nonresponse rates have been on the rise for decades. Measurement errors, especially for sensitive questions or questions requiring recall, add to the inaccuracy of survey estimates. However, Big Data often come with “big error” as well.
For example, in my presentation at BigSurv18, I showed that a sample survey of about 6,000 housing units can provide a more accurate estimate of housing-unit square footage than the entire Zillow database, which contains square footage estimates for more than 200 million households. Survey quality often trumps Big Data quantity, as was the case for the Zillow data.
At BigSurv18, many presenters provided a compromise. Rather than choose one data source over another, why not combine data sources in such a way as to maximize the strengths and minimize the weaknesses of each source – a process called data integration?
For example, survey data on dwelling square footage could be used to correct the measurement bias in the Zillow data. Such “hybrid” estimates combine the coverage of the massive data set with the measurement accuracy of the survey to produce estimates that are better than those from either source alone.
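To make the hybrid idea concrete, here is a minimal sketch in Python using entirely simulated data. It is not the method from my BigSurv18 presentation, and none of the numbers come from Zillow. It assumes a found data source that overstates square footage by a constant amount, a small survey that measures it accurately, and survey units that can be matched to their found-data records. The function names (hybrid_mean, simulate) and all parameter values are hypothetical, chosen only for illustration.

```python
import random
from statistics import fmean


def hybrid_mean(big_values, matched_pairs):
    """Correct the found-data mean with a bias estimated from matched survey units.

    big_values: found-data measurements for every unit.
    matched_pairs: (survey_value, big_value) tuples for units in both sources.
    """
    # Average gap between the found-data value and the survey value.
    bias = fmean([big - survey for survey, big in matched_pairs])
    return fmean(big_values) - bias


def simulate(seed=0, n_population=100_000, n_survey=6_000):
    """Simulate one population and compare three estimates of its mean."""
    rng = random.Random(seed)
    # Hypothetical "true" square footage for each housing unit.
    truth = [rng.gauss(1800, 600) for _ in range(n_population)]
    # Assumption: found data overstate size by 150 sq ft, plus random noise.
    big = [t + 150 + rng.gauss(0, 200) for t in truth]
    # Assumption: the survey is a simple random sample with small noise.
    sample_idx = rng.sample(range(n_population), n_survey)
    survey = [truth[i] + rng.gauss(0, 30) for i in sample_idx]
    pairs = [(s, big[i]) for s, i in zip(survey, sample_idx)]
    return {
        "true_mean": fmean(truth),
        "big_only": fmean(big),
        "survey_only": fmean(survey),
        "hybrid": hybrid_mean(big, pairs),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name:12s} {value:10.1f}")
```

Under these assumptions, the found-data estimate stays biased no matter how many records it has, while the hybrid estimate uses the survey only to learn the size of that bias. Real data integration is harder: the bias is rarely constant, and matching survey units to found-data records is a problem in its own right.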
However, Big Data often cannot answer the specific questions posed by researchers and data users. This is because, unlike survey data, Big Data are “found,” not “designed.” In those cases, researchers will continue to resort to surveys for the wide range of questions that cannot be answered by any found data set. Still, as many BigSurv18 presenters showed, Big Data are well suited to answering many “unposed” questions, i.e., questions that are discovered in the process of mining Big Data. This requires only imagination, as well as perhaps a research team with expertise in domain science, data science, computer science, and statistics.
Nothing is free.