BIG DATA IN PSYCHOLOGY
ELSA LOISSEL, SANDRA MATZ, SANDRINE MÜLLER, MATTHEW SAMSON, JESSICA SCHALLOCK
UNIVERSITY OF CAMBRIDGE
GNOTHI SEAUTON – know thyself.
Whether it is for us as a species, or for ourselves as individuals, the imperative echoes through the centuries and across many academic fields. Know thyself, perhaps, but how? Personality, passions, feelings, opinions, patterns of behaviour… all vague and complex concepts that often evade our understanding. Yet, the answers to this question may lie at the tips of our fingers, on the keyboards of our computers, or the touchscreens of our smartphones. In fact, a recent study by Youyou et al. showed that our Facebook ‘Likes’ predict our personalities better than our colleagues, our friends and even our family can.
Behavioural scientists seek to understand and predict human behaviour using the methods of the so-called ‘hard’ sciences, laying a quantitative grid over the fuzziness of real life. Most of the concepts, such as personality, are abstract, difficult to measure directly, and even to define. Instead, researchers focus on concrete, measurable traits that are accurate representations of the original abstract concept, weaving these variables into statistical models that aim to represent and predict our intimate lives with varying degrees of confidence. Michal Kosinski, Assistant Professor in Organisational Behaviour at Stanford Graduate School of Business, explains: “Usually, the psychological traits we are measuring are really behavioural constructs. Attitudes, thoughts and feelings are also behaviours, so we can simply talk about predicting behaviours.”
Every action that involves a computer generates some kind of data, from a text message to an Amazon order to a Facebook Like. In 2020, we are expected to produce over 44 zettabytes, equivalent to the storage on eleven trillion standard DVDs. These are our digital footprints, crumbs of behaviours we leave while making our way through cyberspace. And they are key to answering many of the questions that behavioural scientists have had for centuries.
Traditional behavioural science experiments often entail finding individuals willing to participate, bringing them to a lab to fill out questionnaires about themselves or perform cognitive tasks, and hoping they will be willing to return for regular follow-up sessions. Most studies are limited to a small sample of undergraduate students – a group consistently shown not to be representative of the general human population. Even with a large and diverse participant base, it can be difficult to assess how well questionnaires reflect how people will really behave. It is, unsurprisingly, difficult to reproduce natural behaviour in a laboratory environment. Humans are also extremely unpredictable. Even with the best of models and the most detailed personality profiles, there are numerous variables affecting how someone will act in any given situation. Only with a large enough sample do these apparently chaotic effects begin to resolve into visible patterns – yet how to obtain these gargantuan volumes of data?
INSIDE AN ALGORITHM
Between 2007 and 2012, Cambridge’s David Stillwell (then at the University of Nottingham) was amassing a huge database of survey and Facebook profile information from his pioneering myPersonality Facebook App. It allowed users to take validated psychological surveys and receive immediate, customised feedback. The app went viral and gained around 7.5 million participants, two million of whom also volunteered their Facebook accounts. After the application closed, Stillwell and colleagues sought a universal and quantifiable subset of their Facebook data that captured psychological information. Page Likes proved the silver bullet: the researchers could put users into a matrix that assigned them a 1 if they liked a Page and a 0 if they did not. This matrix accurately predicted a range of personal attributes, from intelligence, personality and well-being, to sensitive information like sexual orientation and religiosity.
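As a rough illustration of the matrix described above, here is a minimal sketch in Python. The user names, Page names and the `build_like_matrix` helper are all hypothetical and invented for this example; this is not the myPersonality code, only one plausible way to lay out the 1s and 0s.

```python
import numpy as np

# Hypothetical toy data: each user is mapped to the set of Pages they Liked.
likes = {
    "user_a": {"Page 1", "Page 2"},
    "user_b": {"Page 2"},
    "user_c": {"Page 1", "Page 3"},
}


def build_like_matrix(likes):
    """Return (users, pages, matrix); matrix[i, j] is 1 if user i Liked page j, else 0."""
    users = sorted(likes)
    pages = sorted({page for liked in likes.values() for page in liked})
    page_index = {page: j for j, page in enumerate(pages)}
    matrix = np.zeros((len(users), len(pages)), dtype=np.int8)
    for i, user in enumerate(users):
        for page in likes[user]:
            matrix[i, page_index[page]] = 1
    return users, pages, matrix


users, pages, matrix = build_like_matrix(likes)
print(pages)
print(matrix)
```

Each row is one user and each column one Page, so the whole dataset becomes a single table of numbers that statistical models can work with. With millions of users and Pages, almost every cell would be 0, so a real implementation would probably need a sparse storage format rather than the dense array used here.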
BIG DATA, BIG RESULTS
In a few short years, computers and smartphones have made their way into billions of homes around the world. In 1995, only 1% of the world population had access to the Internet. Today that number is up to 90% in North America, 85% in Europe, and 50% worldwide. These devices have grown beyond a simple tool for calculations and are now an integral part of the way we live.
More importantly, though, they have also become a window into who we are. Kosinski clarifies: “The dominant thinking even a few years ago was that people put masks on when they go online. Our main question was, is it even worth looking at online behaviour? Maybe it’s so different from offline behaviour that they should be separate fields of study.” In a major breakthrough study published in 2013, Kosinski and colleagues analysed the Facebook Likes of over 58,000 volunteers, checking the results against the participants’ answers to very detailed questionnaires. The algorithms accurately predicted the subjects’ gender and ethnic origin in over 93% of cases. More strikingly, they also could identify sexual orientation, political affiliation, and religious beliefs with over 82% accuracy. Even more intimate behaviours such as drug use and relationship status could be inferred to a remarkably accurate extent, just based on which Facebook pages the participants favoured.
It is not just about our demographic data. Personality is a fascinating human component as it relates to our job performance, mental health, life satisfaction, success in romantic relationships, even the music we prefer. The Big Five is one of the most commonly used and well supported models that defines this complex feature. It relies on five dimensions: openness (open to new ideas vs. conservative), conscientiousness (self-controlled vs. easy-going), extraversion (outgoing vs. reserved), agreeableness (compassionate vs. egotistic) and neuroticism (emotionally unstable vs. stable). Predictions can be made based on some of these traits. For example, individuals who score high on neuroticism are more likely to have mental health issues such as depression.
In 2010, Tal Yarkoni made an interesting discovery: using writing samples from 694 blogs he managed to replicate and extend previous associations between personality and language use. In his 2013 study, Kosinski went further by showing that participants’ Facebook Likes could reveal some of their personality profiles with as much accuracy as traditional questionnaires. “From the psychological point of view, what’s important is that we can actually use digital footprints to measure psychological constructs. Now we have accepted that online is an extension of offline,” says Kosinski.
Data collected through social media give us access to thousands, if not millions of users, allowing research to operate at a scale barely imaginable only a few years ago. Yet, this is not just about sample size. Another type of Big Data provides an insight into people’s behaviour with better accuracy and detail. This stems from querying our closest digital companions, our smartphones.
Smartphones are ubiquitous, computationally powerful mobile sensors. They can record an array of psychologically relevant variables through specifically designed research apps. Accelerometers and GPS/Wi-Fi record how physically active we are. Bluetooth, microphone, call and text logs, social media APIs and our contacts show the extent of our social lives. Light, temperature, pressure and photographs give a glimpse into the environments we visit. Browser history, media files, and running apps demonstrate what interests us. Questionnaires that participants are prompted to answer several times a day to record their moods, stress levels or current activities lift the veil on day-to-day feelings with incredible detail and accuracy.
Overall, this unobtrusive and unbiased methodology offers the chance to detect subtle changes, to make much more accurate estimations of the frequency of events, and to obtain data points more often than with traditional one-off assessments. It gives insight into users’ feelings and spontaneous actions in the natural context where they occur, allowing the detection of low intensity behaviours and characterising processes in detail as they unfold over time.
For example, a group at Dartmouth College tracked a single class of 48 students across ten weeks. They recorded stress, sleep, activity, mood, sociability, and mental well-being, all through sensor data or questionnaires logged on smartphones. Researchers identified a term cycle. Students start the term with positive emotions, high levels of social interaction and low stress levels as well as balanced sleep and physical activity patterns. With increased workload over the weeks, the undergraduates sleep, socialise, and work out less. They become more stressed and exhibit more negative emotions. Such detailed data give an insight into the strain of student life and how it correlates with academic performance and general well-being.
From Facebook Likes and your GPS data, scientists can reconstruct some aspects of your personality and behaviours, but what if they could actually see in great detail how we react, interact with others, and make our decisions in specific life situations? It may sound far-fetched, but this is already happening. In its quest to make games as life-like as possible, the gaming industry has incidentally built a scientist’s dream. Complex behaviours and environments have been converted into digital representations, pre-populated by human gamers having virtual but still emotional and immersive experiences. Naturalistic observation, virtual laboratories, experiments and even psychometric assessments can be carried out in a virtual environment that is ready-made, pre-distributed to millions of people who comfortably and naturally interact in a quantified reality. For example, studying how online players interact with the Church of the Holy Light, the World of Warcraft in-game religion, gave researchers an insight into how young adults think about and relate to off-line religious concepts. Players re-enacted versions of their own selves at different times in their spiritual journeys of becoming an atheist, toyed with the guilty pleasure of playing a pagan character while being religious themselves, or on the contrary experimented with religious worldviews through priest avatars while not believing in a deity.
Minecraft is a now ubiquitous ‘virtual LEGO’ game, which has sold more than 100 million copies since its release in 2011. It is now used by pioneering researchers to study how personality and stress are related to gaming behaviour. Minecraft is unique in that it has no scripted storyline and no clear goals. Instead, players are dropped into a procedurally-generated natural environment where they are free to play however they choose, exploring, fighting, or building, alone, with friends, or on large servers where hundreds of players work together to build huge structures and even entire cities.
Every step that players take, every block they place or tree they cut down generates data which can be logged and analysed by researchers. Instead of a broad picture of the general characteristics of a group, we can observe the minutiae of behaviour such as the way in which two people with autism work together to build a house, map the wandering movements of a gamer with depression, or observe players of differing personalities and cognitive styles develop their own preferred mining systems.
What Big Data represents for scientists goes far beyond just an improvement of their methodologies and access to larger samples. Instead, it holds the key to a qualitative – not just quantitative – revolution in behavioural sciences. Indeed, most of the models at the core of the psychological studies powered by Big Data are based on theories designed when only a few hundred participants could be called to a lab. From the terabytes of data recording the minute behaviours of millions of people could arise the opportunity to validate or refine previous models. Even more excitingly, we could have the chance to detect completely new psychological and social phenomena, some until now invisible to the human eye, some that only emerge in an online environment. Unlike scientists, algorithms do not need to have a preconception of how the human psyche could work, possibly revealing patterns no one would have thought to explore before.
BIG DATA, BIG APPLICATIONS
Is Big Data only a scientist’s tool, aiming to refine our understanding of the human psyche? Far from it. Real life applications are already appearing, bringing together academic, applied and commercial ventures.
Most of us are already aware of companies’ efforts to personalise their services and marketing efforts to the individual needs and preferences of consumers. Amazon diligently offers recommendations based on their customers’ past purchases. In fact, a sophisticated personalisation mechanism can help to turn advertising into useful information rather than irrelevant spam (e.g. Amazon’s “People who like X also like Y”).
Grasping people’s personality traits through their online behaviour can take this effort further and unlock a plethora of new opportunities. Individuals respond more positively to products and marketing messages aligned with their own personality characteristics – something well known to both researchers and your local shopkeeper. A highly open-minded person, as identified on a Big Five personality test, is expected to show a stronger interest in artistic and creative products. An individual low in openness may favour instead items that convey a sense of tradition and continuity. Combined with Big Data, this knowledge can transform how advertising is designed. For example, using personality predictions from Facebook Likes, researchers have been able to assess whether participants were extroverted or introverted. The same product was then advertised differently to appeal to introverts’ or extroverts’ core values: each audience significantly preferred the ads tailored to their personalities, leading to more purchases.
INSIDE AN ALGORITHM
By 2015, there was incentive to increase the number of participants further than what Stillwell and colleagues had achieved. Inferential psychological research was widely distrusted because of its (relatively!) small sample sizes. Thus, far more profiles were necessary to yield the kinds of robust, segmented, and granular insights social science was lacking. The challenge was now to develop algorithms from smaller samples that could be exported to large publicly-accessible databases, which were characterised by the near-infinite range of ways users could express the same preference. For example, where users might Like Jay Z’s official Page or a fan site on Facebook, they might Tweet any of the thousands of variants on phrases including “Jay Z”, “Brooklyn”, “Roc Nation”, etc. Raw preferences were unlikely to hold up
It is not just about advertising. Kosinski argues that inferring personality traits and aptitudes through online behaviours could one day underpin job recruitment. The way new employees are hired is fraught with conscious or unconscious biases from recruiters that often prevent minorities from breaking glass ceilings. Instead, candidates’ aptitudes and relevant personality traits could be evaluated from their digital footprints, bypassing human preconceptions. Crucially, this also could unveil someone’s potential, even if life circumstances have not yet allowed them to fully develop it. However, the arrival of this new approach in our day-to-day lives would put the algorithms governing and actually computing the predictions under acute scrutiny. How these algorithms are created and what they learn from the real world must be carefully monitored so as to prevent the replication of the very biases we already observe. “Sometimes if you train algorithms in the wrong fashion they start replicating the biases in us humans - yet another proof that we are biased and unfair. The response is not to abandon algorithms but to train them with as little human input as possible and use objective criteria instead,” Kosinski explains.
In this blurring of the line between research and real-world applications, smartphones and apps hold a special place, particularly in relation to well-being and health monitoring. “It is all very well having something that works in the lab, but for it to be really useful we need to get it working with individuals in real life. Apps are a way of getting an intervention out in the community and into people’s everyday lives,” explains Dr Lucy Cheke, lecturer at the University of Cambridge. In the context of her research on how obesity interacts with cognition, she studies the connection between eating behaviours and episodic memory. The app she designed investigates how making the experience of eating a meal more vivid to a participant, both at the time and before they next consume food, helps to reduce their food intake. The app collects the data needed to test the researcher’s theoretical framework while being beneficial for overweight subjects.
The intersection between research and Big Data can also empower participants themselves. Students taking part in smartphone studies reported in exit interviews how useful and motivational the feedback they got from the research app had been. Such feedback might, for example, display the time they spent on different activities and how it related to their emotional states. This provides information that might not have been obvious to users at the time, for instance around which people they feel happiest or how their sleep patterns correlate with their well-being. As such, this resembles the ‘mood diaries’ commonly used as part of cognitive behavioural therapy. In fact, introspection has been linked to an increase in well-being and a decrease in symptoms of depression. This ability to reflect on one’s behaviour could be facilitated by smartphone feedback. Research on the quantified-self or life-logging movement (tracking one’s own fitness, sleep and diet) is still young, yet already hints at how this could support healthy lifestyle changes.
This new way to understand people’s behaviour can also redraw the way we conceive health monitoring and intervention at a larger scale. A recent study by Canzian and Musolesi demonstrated that participants’ mobility, recorded passively through their phone, was significantly correlated with their depression scores – people move less as they get more depressed. More importantly, this information was objective, passively sensed, and therefore obtained without requiring the participants’ input. This could open the door to better support systems. A healthcare officer could for example get in touch with a patient whose activity patterns reveal a worsening of their symptoms. At a time when the leading cause of death for English and Welsh men between 20 and 34 is suicide, Big Data could represent a powerful agent of change.
BIG DATA, BIG CONCERNS
In 2014, Facebook conducted a study aiming to understand how people’s emotions were affected by the contents of their newsfeeds - the flux of links, statuses and videos posted by their friends. To do so, the company hid certain elements from 689,003 people’s feeds over the course of a week, limiting their access to either positive or negative emotional content. Individuals who saw less positive content in turn shared fewer positive posts themselves. The opposite happened for those exposed to less negative information. In effect, Facebook researchers demonstrated for the first time that emotional contagion takes place online. Unlike studies conducted by universities, Facebook did not inform users that the experiment was taking place and that they were enrolled in it.
Cambridge Analytica (which is not affiliated with the University of Cambridge) is a company that combines data mining and analysis with strategic communication. Using online information such as demographics, consumer behaviour and social media presence, its data scientists infer the personality profiles of millions of people, with a particular focus on predicting voting behaviour. Political campaigning materials are then tailored to match people’s different personality styles, making sure the tone and the core messages resonate deeply with the recipients. Cambridge Analytica has worked for Ted Cruz’s, Donald Trump’s and Brexit’s ‘Leave’ campaigns. Its CEO claims the company has key psychological data on over 230 million Americans.
Life insurance company John Hancock has started a program whereby customers who have adopted healthy lifestyles can reduce their premiums on a sliding scale. How can users show that they are leaving behind their unhealthy habits? By wearing a FitBit provided by the company and giving the insurers access to some of the data.
Did these examples make you think? If so, you are not alone. There is no denying that Big Data has brought to the table crucial ethical concerns, especially privacy issues.
Theoretically, we do sign up for services like Twitter or Facebook knowing our information will be in the public domain. By agreeing to the terms and conditions of the host platform, we also surrender most of our data rights. Yet very few (if any) of us have the time or impetus to actually read these terms and conditions. We may notice that the ads in our Facebook feed are uncannily targeting our demographics but few realise our data can also be subjected to the kinds of algorithms that infer the most intimate aspects of our character.
INSIDE AN ALGORITHM
A new fundamental assumption can then be added to the model. When someone does not explicitly show a preference for something, they still might if given the opportunity. This transformed many 0’s in the Facebook Likes matrix into probabilities using matrix algebra. Suddenly, data from a few thousand idiosyncratic survey takers sharing their online profiles could produce algorithms accurate enough for tens of millions of unseen user accounts. The method is not constrained to Facebook Likes and can theoretically be applied to more complex online ecosystems like Twitter, Amazon and Google. The method is currently being evaluated on Twitter.
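The text says only that "matrix algebra" turns the 0s into probabilities, without naming the exact method. One common way to get a similar effect is a low-rank approximation of the Likes matrix. The sketch below uses a truncated singular value decomposition as an assumed stand-in, not necessarily the researchers' technique. The toy matrix and the `smooth_likes` helper are hypothetical, and the clipped scores are only loosely interpretable as probabilities.

```python
import numpy as np


def smooth_likes(matrix, rank):
    """Low-rank approximation of a 0/1 Like matrix, clipped to the range [0, 1].

    Assumption: truncated SVD stands in for the unspecified "matrix algebra".
    """
    u, s, vt = np.linalg.svd(matrix.astype(float), full_matrices=False)
    # Keep only the `rank` strongest patterns shared across users and Pages.
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return np.clip(approx, 0.0, 1.0)


# Hypothetical toy matrix: rows are users, columns are Pages.
likes = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

scores = smooth_likes(likes, rank=2)
print(np.round(scores, 2))
```

The idea is that users who share many Likes are described by the same few underlying patterns. Rebuilding the matrix from those patterns alone replaces hard 0s with graded scores, which can be read as a guess at how likely a user would be to Like a Page they have not yet come across.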
One promising way forward could be greater transparency. Including users in the process of personalisation (whether it is for research or actual marketing) gives them the opportunity to opt-in to better service rather than opt-out of a potential privacy breach. This is crucial to maintain and build trust in innovative predictive technologies. Maybe there is also much to be drawn from the example of academic consent forms, which place a strong emphasis on the concept of informed consent – it is not enough for participants to sign the paperwork, they must have understood what the study entails.
Big Data also poses the challenge of redefining who should benefit from privacy rights. When it comes down to certain psychological traits, large scale, Big Data studies on online platforms allow very little prediction in terms of individuals. They are, however, extremely powerful at the scale of a group. For example, scientists might reasonably conclude that group X is happier than Y on average. Are groups entitled to the same or similar privacy rights as individuals? And, more importantly, should the privacy of minority groups have special protection to limit profiling? Put more concretely, are data analysts entitled to know whether a neighbourhood is relatively outgoing? Or, more controversially, are they allowed to study and report the propensity for homosexuality among British Muslim women?
A few months before the sequencing of the first human genome was even finished, on the 8th of February 2000, President Bill Clinton signed Executive Order 13145, prohibiting the use of genetic information for hiring or promotional action in federal departments and agencies. Big Data raises many of the same privacy concerns as genomics; as the influence of both grows in our societies, so too does the need to regulate and control them. Citizens have a strong role to play in this process, and it must be an informed one.
The use of Big Data in psychological research is only just emerging within academia. And yet it is already fully present in companies such as Google, Facebook and even government agencies such as the NSA. “Academia cannot compete with industry in terms of access to datasets. We also have more stringent ethical norms than they have to follow. From my perspective science should try to answer more fundamental questions, trying to see what else we could learn about human psychology from these data,” reveals Kosinski. By empowering responsible research, scientists can have a positive impact on society, model good behaviour and advance conversations about how digital footprints ought to be used across the social sciences and beyond.
The public is very much still unaware of the recent advances in Big Data and how our information could be used and analysed. As companies have little incentive to reveal the work they are conducting, academics have an essential role to play in raising awareness within the community. Kosinski explains: “Big Data predictions are being done, but no one is talking about it. Big companies will run predictions but they never tell anyone they do it. For me, the role of scientists is to poke in different directions, see what could potentially be achieved, and this may also reveal what the industry has been doing for years now.”
Know thyself. Thanks to Big Data, it has never been easier to dig into our behaviours and shine a light on our thoughts and motivations. As the revolution silently unfolds, the question is now:
Who will know us, and for what purpose?
- Elsa Loissel--Research Assistant, Comparative Cognition Lab
- Sandra Matz--PhD Student, Department of Psychology
- Sandrine Müller--PhD Student, Department of Psychology
- Matthew Samson--PhD Student, Department of Psychology
- Jessica Schallock--PhD Student, Department of Psychology
- Oran Maguire
- Yasemin Gyford
- Timothy Winn