15 February 2023

Big data: it’s not the size of the sample, it’s what you do with it

By Frances An

When it comes to analysing behaviour, is big data beautiful?

The corporate behemoths certainly seem to think so. The size and apparent analytical power of the data sets many of them have accumulated have led to much head-scratching, not least among human-run market research firms who fear their stock-in-trade could soon be redundant. Likewise, consumer privacy advocates fear the rise of so-called ‘surveillance capitalism’, in which transnational corporations merrily harvest personal data from their customers.

A case in point is the now renowned story about the US retail chain Target, whose analysts reportedly used big data to identify a teenage girl’s pregnancy before her own father knew.

This trend is far from confined to the world of business though. Since the SPSS statistical software package (now owned by IBM) was first released in the late 1960s, researchers of various stripes have been chasing ever bigger data sets.

But do these vast data sets always help to explain the world more clearly? As the marketing expert David W Stewart observed several decades ago, market research has often been dismissed by corporations as a waste of time, producing noisy data sets with few actionable insights. That creates a vicious cycle: firms don’t want to put money into market research, which further drives down the quality of what gets produced. The result is diminished market research departments drowning in data that companies nonetheless treat as a crystal ball for their business decisions.

The problem is that using big data to target customers might appear frightening, but it is really just an enhanced version of educated guessing. To take a prominent example, Twitter has about 450 million users (for now, at least), each of whom might have many different data points collected about them. Trying to break all that down and infer anything about an individual user, or about Twitter users’ behaviour in general, is a Sisyphean task, not least because the data itself changes constantly and users’ behaviour isn’t necessarily all that consistent.

In the case of Target, the teenage girl’s pregnancy was identified – accurately on this occasion – from some fairly basic statistics about pregnant women’s buying habits. But on other occasions, making assumptions from past statistics can lead you down the garden path. To use myself as an example, for the last three years YouTube’s advertising algorithm has steadfastly insisted that I am a pregnant woman who is desperate to send money to relatives in Vietnam. Only the ‘woman’ part is correct. An old-school market research technique – such as asking someone about their answers to a survey – could have uncovered that my interest in Nhạc Vàng (a pre-1975 genre of Vietnamese music once banned by the Communist government) is connected to my pro-democracy sentiments, not evidence that I am a recently migrated young mother from Vietnam. Unravelling the connections between observable individual behaviours is straightforward with basic interview techniques, but convoluted or impossible through big data mining.

Of course, that doesn’t mean big data doesn’t have its uses, particularly for gauging broad details such as a product or service’s dominant demographic and regional distribution. However, no amount of data mining will uncover the specific contexts, motivations, relationships or decision-making processes behind someone’s actions. Google and YouTube cannot seem to connect my searches for pre-1975 Vietnamese songs to my interest in Asian pro-democracy movements, even though I look up both regularly.

One oft-cited advantage of big data is its capacity to deliver a large and representative sample. However, the lust for a representative sample may not be satisfiable, or even desirable, for understanding people’s perceptions of a product or service. To skip the academic dross about sampling techniques: Liana Epstein’s article ‘Random(ish) sampling: balancing the ideal and the real’ highlights how diversified and innovative one’s sampling techniques need to be to have any confidence in a sample’s randomness. And Adam Ferrier and Jen Flemming’s book Stop Listening to the Customer: Try Hearing Your Brand Instead (2020) argues that kowtowing to mass consumers at the expense of building brand identity diminishes a brand’s perceived value and distinctiveness. The average consumer does not consciously reflect on why they chose Colgate over Oral B, and would rather market researchers left them alone about their toothpaste preferences.

The psychoanalytic market researcher Ernest Dichter (1907-1991) likened asking an average consumer about a buying decision to asking a patient why they are sick. While I have previously argued against surrendering people’s choices entirely to unconscious nudges, Dichter was correct that people are not always aware of, or able and willing to verbalise, the reasons behind their decisions. To elicit those reasons, Dichter used qualitative techniques to gauge the factors driving a consumer’s decision. While some of his methods take an overly psychosexual perspective, others are closer to what qualitative researchers would now call thematic analysis: a foundational, atheoretical method that involves categorising semantic data into themes which capture the key aspects of a psychological or sociological phenomenon. Rather than asking directly about soap, Dichter gleaned feedback for improving Ivory Soap by inviting consumers to describe their bathing habits in narrative-style interviews, letting soap’s cultural significance and its symbolic meaning in a person’s life reveal themselves.

In contrast to our current research culture’s obsession with sample size, quantity has not always been seen as the mark of a better-represented population or market. General Motors, for example, used to conduct its market research with car enthusiasts rather than the ‘average driver’. True ‘petrolheads’ are relatively rare and unrepresentative of the average driver, but their enthusiasm for vehicle mechanics and deep relationships with their own cars may reveal more about how best to optimise the driving experience for the average car user. General Motors could gather accurate, detailed information (e.g., the models of participants’ previous cars) and pose more interactive survey tasks (e.g., asking participants to draw or describe the features of an ideal vehicle) to a participant pool that was knowledgeable about, and invested in, the vehicle market’s future.

The capacity to instantaneously collect large volumes of data globally could certainly help us understand and identify world issues. But an obsession with gigantic data sets should not let basic market research skills such as sampling, interviews and focus groups go rusty. These remain foundational for understanding the relationships between the factors that underlie people’s perceptions of, and participation in, various markets.


Frances An is a Mannkal Scholar and an intern at the Centre for Policy Studies.